AI4Bharat
Model Cards
IndicBERT v1
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages and English).
IndicTrans v0.3
IndicTrans is a Transformer-4x (~434M parameters) multilingual NMT model trained on the Samanantar dataset, the largest publicly available parallel corpus collection for Indic languages at the time of writing (14 April 2021). It is a single-script model, i.e., we convert all the Indic data to the Devanagari script, which allows better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary across Indic languages, and allows a smaller subword vocabulary. We currently release two models, Indic-to-English and English-to-Indic, each supporting 11 Indic languages; a script-conversion sketch follows.
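The single-script step can be illustrated with the Indic NLP Library's transliteration utility. This is a minimal sketch of the idea only, assuming indic-nlp-library is installed; IndicTrans uses its own preprocessing pipeline.

```python
# Minimal sketch of single-script conversion using the Indic NLP Library
# (pip install indic-nlp-library). This illustrates the idea of mapping
# all Indic text to Devanagari; it is not IndicTrans's own pipeline.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

tamil_text = "இந்தியா ஒரு பெரிய நாடு"
bengali_text = "ভারত একটি বিশাল দেশ"

# Map both sentences into Devanagari so they share a single script,
# and therefore a single subword vocabulary.
print(UnicodeIndicTransliterator.transliterate(tamil_text, "ta", "hi"))
print(UnicodeIndicTransliterator.transliterate(bengali_text, "bn", "hi"))
```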
IndicASR v1
IndicWav2Vec is a multilingual speech model pretrained on 40 Indian languages. This model represents the largest diversity of Indian languages in the pool of multilingual speech models. We fine-tune this model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public benchmarks, namely MUCS, MSR, and OpenSLR.
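As a rough guide, a fine-tuned ASR checkpoint can be run through the standard transformers CTC interface. The checkpoint name below is an assumption for illustration; substitute the released checkpoint for your language.

```python
# Minimal CTC inference sketch with transformers; the checkpoint name is
# an assumed placeholder for a fine-tuned IndicWav2Vec ASR model.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "ai4bharat/indicwav2vec-hindi"  # assumption: substitute the real checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Load 16 kHz mono audio and run greedy CTC decoding.
speech, sample_rate = sf.read("audio.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```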
IndicXLit v1
IndicBART v1
IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
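A minimal loading sketch, assuming the weights are published as ai4bharat/IndicBART on the Hugging Face Hub (mBART architecture with a sentencepiece tokenizer); consult the released card for the exact language-tag format expected at generation time.

```python
# Minimal loading sketch for IndicBART, assuming the "ai4bharat/IndicBART"
# checkpoint on the Hugging Face Hub.
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
model = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART")
print(model.config.vocab_size)
```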
IndicNLGSuite
IndicNLGSuite has a collection of models trained for five different tasks: Biography Generation, Headline Generation, Paraphrase Generation, Sentence Summarization, and Question Generation.
IndicBERT v1
Type
Language Model
Languages Supported
Assamese
Bengali
English
Gujarati
Hindi
Kannada
Malayalam
Marathi
Oriya
Punjabi
Tamil
Telugu
Description
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages and English).
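The pre-trained weights can be loaded with the Hugging Face transformers library. A minimal feature-extraction sketch, assuming the checkpoint is published as ai4bharat/indic-bert on the Hub (the model is ALBERT-based, so it loads through the standard Auto classes):

```python
# Minimal sketch of loading IndicBERT for feature extraction, assuming
# the "ai4bharat/indic-bert" checkpoint on the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# Encode a Hindi sentence; max_length mirrors the 128-token limit
# noted for the benchmark runs below.
inputs = tokenizer(
    "भारत एक विशाल देश है",
    return_tensors="pt",
    truncation=True,
    max_length=128,
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```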
Benchmark Results
We evaluate the IndicBERT model on the IndicGLUE benchmark and on a set of additional tasks. Here are the results we obtain:
IndicGLUE results

| Task | mBERT | XLM-R | IndicBERT |
|---|---|---|---|
| News Article Headline Prediction | 89.58 | 95.52 | 95.87 |
| Wikipedia Section Title Prediction | 73.66 | 66.33 | 73.31 |
| Cloze-style multiple-choice QA | 39.16 | 27.98 | 41.87 |
| Article Genre Classification | 90.63 | 97.03 | 97.34 |
| Named Entity Recognition (F1-score) | 73.24 | 65.93 | 64.47 |
| Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | 27.12 |
| Average | 64.62 | 61.09 | 66.66 |
Additional Tasks

| Task | Task Type | mBERT | XLM-R | IndicBERT |
|---|---|---|---|---|
| BBC News Classification | Genre Classification | 60.55 | 75.52 | 74.60 |
| IIT Product Reviews | Sentiment Analysis | 74.57 | 78.97 | 71.32 |
| IITP Movie Reviews | Sentiment Analysis | 56.77 | 61.61 | 59.03 |
| Soham News Article | Genre Classification | 80.23 | 87.60 | 78.45 |
| Midas Discourse | Discourse Analysis | 71.20 | 79.94 | 78.44 |
| iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | 94.52 |
| ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | 61.18 |
| Winograd NLI | Natural Language Inference | 56.34 | 55.87 | 56.34 |
| Choice of Plausible Alternatives (COPA) | Natural Language Inference | 54.92 | 51.13 | 58.33 |
| Amrita Exact Paraphrase | Paraphrase Detection | 93.81 | 93.02 | 93.75 |
| Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | 84.33 |
| Average | 69.84 | 74.42 | 73.66 |
* Note: all models have been restricted to a max_seq_length of 128.
Training Setup
IndicBERT v1 is trained on IndicCorp v1, which has a total pretraining corpus size of 120GB containing 8.9 billion tokens across 12 languages.
License
MIT License
