2022 Q1 Report

4. Models

Model Cards
IndicBERT v1
IndicTrans v0.3
IndicASR v1
IndicXLit v1
IndicBART v1
IndicBERT v1
Type
Language Model
Languages Supported
Assamese
Bengali
English
Gujarati
Hindi
Kannada
Malayalam
Marathi
Oriya
Punjabi
Tamil
Telugu
Description
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages plus English).
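As a quick orientation, here is a minimal usage sketch showing how the released checkpoint could be loaded for feature extraction with the Hugging Face Transformers library; the Hub id `ai4bharat/indic-bert` and the example sentence are assumptions for illustration, not part of this report.

```python
from transformers import AutoModel, AutoTokenizer
import torch

MODEL_ID = "ai4bharat/indic-bert"  # assumed Hugging Face Hub id for the released checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)

# Encode a Hindi sentence and pull contextual embeddings from the encoder.
inputs = tokenizer("भारत एक विशाल देश है।", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch_size, sequence_length, hidden_size)
```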
Benchmark Results
We evaluate the IndicBERT model on the set of tasks in the IndicGLUE benchmark. Here are the results we obtain:
IndicGLUE results
| Task | mBERT | XLM-R | IndicBERT |
| --- | --- | --- | --- |
| News Article Headline Prediction | 89.58 | 95.52 | 95.87 |
| Wikipedia Section Title Prediction | 73.66 | 66.33 | 73.31 |
| Cloze-style multiple-choice QA | 39.16 | 27.98 | 41.87 |
| Article Genre Classification | 90.63 | 97.03 | 97.34 |
| Named Entity Recognition (F1-score) | 73.24 | 65.93 | 64.47 |
| Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | 27.12 |
| Average | 64.62 | 61.09 | 66.66 |
Additional Tasks
| Task | Task Type | mBERT | XLM-R | IndicBERT |
| --- | --- | --- | --- | --- |
| BBC News Classification | Genre Classification | 60.55 | 75.52 | 74.6 |
| IITP Product Reviews | Sentiment Analysis | 74.57 | 78.97 | 71.32 |
| IITP Movie Reviews | Sentiment Analysis | 56.77 | 61.61 | 59.03 |
| Soham News Article | Genre Classification | 80.23 | 87.6 | 78.45 |
| Midas Discourse | Discourse Analysis | 71.2 | 79.94 | 78.44 |
| iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | 94.52 |
| ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | 61.18 |
| Winograd NLI | Natural Language Inference | 56.34 | 55.87 | 56.34 |
| Choice of Plausible Alternatives (COPA) | Natural Language Inference | 54.92 | 51.13 | 58.33 |
| Amrita Exact Paraphrase | Paraphrase Detection | 93.81 | 93.02 | 93.75 |
| Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.2 | 84.33 |
| Average | | 69.84 | 74.42 | 73.66 |
Note: all models have been restricted to a max_seq_length of 128.
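To make that setting concrete, here is a minimal fine-tuning sketch (not the evaluation code behind the numbers above) for one of the classification tasks, with inputs capped at 128 tokens; the Hub id, label count, and example text are placeholders/assumptions.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "ai4bharat/indic-bert"  # assumed Hub id; adjust if the checkpoint is hosted elsewhere
NUM_LABELS = 4                     # placeholder: depends on the task's label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=NUM_LABELS)

# Mirror the evaluation setting: every sequence padded/truncated to 128 tokens.
batch = tokenizer(
    ["यह एक उदाहरण शीर्षक है"],  # placeholder headline text
    max_length=128,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)
logits = model(**batch).logits  # shape: (1, NUM_LABELS)
```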
Training Setup
IndicBERT v1 is trained on IndicCorp v1, which has a total pretraining corpus size of 120GB containing 8.9 billion tokens across 12 languages.
License
MIT License
