2022 Q1 Report

4. Models

Model Cards
IndicBERT v1
IndicTrans v0.3
IndicASR v1
IndicXLit v1
IndicBART v1
IndicBERT v1
Type
Language Model
Languages Supported
Assamese
Bengali
English
Gujarati
Hindi
Kannada
Malayalam
Marathi
Oriya
Punjabi
Tamil
Telugu
Description
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages plus English).
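As a quick orientation, here is a minimal usage sketch showing how the released checkpoint could be loaded for feature extraction with the Hugging Face Transformers library; the Hub id `ai4bharat/indic-bert` and the example sentence are assumptions for illustration, not part of this report.

```python
from transformers import AutoModel, AutoTokenizer
import torch

MODEL_ID = "ai4bharat/indic-bert"  # assumed Hugging Face Hub id for the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Hindi sentence and pull contextual embeddings from the encoder.
inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```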
Benchmark Results
We evaluate the IndicBERT model on the set of tasks in the IndicGLUE benchmark. Here are the results we obtain:
IndicGLUE results
| Task | mBERT | XLM-R | IndicBERT |
| --- | --- | --- | --- |
| News Article Headline Prediction | 89.58 | 95.52 | 95.87 |
| Wikipedia Section Title Prediction | 73.66 | 66.33 | 73.31 |
| Cloze-style multiple-choice QA | 39.16 | 27.98 | 41.87 |
| Article Genre Classification | 90.63 | 97.03 | 97.34 |
| Named Entity Recognition (F1-score) | 73.24 | 65.93 | 64.47 |
| Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | 27.12 |
| Average | 64.62 | 61.09 | 66.66 |
Additional Tasks
| Task | Task Type | mBERT | XLM-R | IndicBERT |
| --- | --- | --- | --- | --- |
| BBC News Classification | Genre Classification | 60.55 | 75.52 | 74.6 |
| IITP Product Reviews | Sentiment Analysis | 74.57 | 78.97 | 71.32 |
| IITP Movie Reviews | Sentiment Analysis | 56.77 | 61.61 | 59.03 |
| Soham News Article | Genre Classification | 80.23 | 87.6 | 78.45 |
| Midas Discourse | Discourse Analysis | 71.2 | 79.94 | 78.44 |
| iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | 94.52 |
| ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | 61.18 |
| Winograd NLI | Natural Language Inference | 56.34 | 55.87 | 56.34 |
| Choice of Plausible Alternatives (COPA) | Natural Language Inference | 54.92 | 51.13 | 58.33 |
| Amrita Exact Paraphrase | Paraphrase Detection | 93.81 | 93.02 | 93.75 |
| Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.2 | 84.33 |
| Average | | 69.84 | 74.42 | 73.66 |
Note: all models have been restricted to a max_seq_length of 128.
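To make that setting concrete, here is a minimal fine-tuning sketch (not the evaluation code behind the numbers above) for one of the classification tasks, with inputs capped at 128 tokens; the Hub id, label count, and example text are placeholders/assumptions.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ai4bharat/indic-bert"  # assumed Hub id; adjust if the checkpoint is hosted elsewhere
NUM_LABELS = 4                     # placeholder: depends on the task's label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

# Mirror the evaluation setting: every sequence padded/truncated to 128 tokens.
batch = tokenizer(
    ["यह एक उदाहरण शीर्षक है"],  # placeholder headline text
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
logits = model(**batch).logits  # shape: (1, NUM_LABELS)
```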
Training Setup
IndicBERT v1 is trained on IndicCorp v1, which has a total pretraining corpus size of 120GB containing 8.9 billion tokens across 12 languages.
License
MIT License
