AI4Bharat
Model Cards
IndicBERT v1
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages and English).
IndicTrans v0.3
IndicTrans is a Transformer-4x (~434M parameters) multilingual NMT model trained on the Samanantar dataset, the largest publicly available parallel corpus collection for Indic languages at the time of writing (14 April 2021). It is a single-script model, i.e., we convert all the Indic data to the Devanagari script, which allows better lexical sharing between languages for transfer learning, prevents fragmentation of the subword vocabulary across Indic languages, and allows a smaller subword vocabulary. We currently release two models, Indic-to-English and English-to-Indic, each supporting 11 Indic languages; a script-conversion sketch follows.
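The single-script step can be illustrated with the Indic NLP Library's transliteration utility. This is a minimal sketch of the idea only, assuming indic-nlp-library is installed; IndicTrans uses its own preprocessing pipeline.

```python
# Minimal sketch of single-script conversion using the Indic NLP Library
# (pip install indic-nlp-library). This illustrates the idea of mapping
# all Indic text to Devanagari; it is not IndicTrans's own pipeline.
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

tamil_text = "இந்தியா ஒரு பெரிய நாடு"
bengali_text = "ভারত একটি বিশাল দেশ"

# Map both sentences into Devanagari so they share a single script,
# and therefore a single subword vocabulary.
print(UnicodeIndicTransliterator.transliterate(tamil_text, "ta", "hi"))
print(UnicodeIndicTransliterator.transliterate(bengali_text, "bn", "hi"))
```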
IndicASR v1
IndicWav2Vec is a multilingual speech model pretrained on 40 Indian languages. This model represents the largest diversity of Indian languages in the pool of multilingual speech models. We fine-tune this model for downstream ASR in 9 languages and obtain state-of-the-art results on 3 public benchmarks, namely MUCS, MSR, and OpenSLR.
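As a rough guide, a fine-tuned ASR checkpoint can be run through the standard transformers CTC interface. The checkpoint name below is an assumption for illustration; substitute the released checkpoint for your language.

```python
# Minimal CTC inference sketch with transformers; the checkpoint name is
# an assumed placeholder for a fine-tuned IndicWav2Vec ASR model.
import torch
import soundfile as sf
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

checkpoint = "ai4bharat/indicwav2vec-hindi"  # assumption: substitute the real checkpoint
processor = Wav2Vec2Processor.from_pretrained(checkpoint)
model = Wav2Vec2ForCTC.from_pretrained(checkpoint)

# Load 16 kHz mono audio and run greedy CTC decoding.
speech, sample_rate = sf.read("audio.wav")
inputs = processor(speech, sampling_rate=sample_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```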
IndicXLit v1
IndicBART v1
IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on 11 Indic languages and English.
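A minimal loading sketch, assuming the weights are published as ai4bharat/IndicBART on the Hugging Face Hub (mBART architecture with a sentencepiece tokenizer); consult the released card for the exact language-tag format expected at generation time.

```python
# Minimal loading sketch for IndicBART, assuming the "ai4bharat/IndicBART"
# checkpoint on the Hugging Face Hub.
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, use_fast=False, keep_accents=True
)
model = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART")
print(model.config.vocab_size)
```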
IndicNLGSuite
IndicNLGSuite has a collection of models trained for five different tasks: Biography Generation, Headline Generation, Paraphrase Generation, Sentence Summarization, and Question Generation.
IndicBERT v1
Type
Language Model
Languages Supported
Assamese
Bengali
English
Gujarati
Hindi
Kannada
Malayalam
Marathi
Oriya
Punjabi
Tamil
Telugu
Description
IndicBERT is pre-trained on the IndicNLP corpus, which covers 12 languages (11 Indian languages and English).
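The pre-trained weights can be loaded with the Hugging Face transformers library. A minimal feature-extraction sketch, assuming the checkpoint is published as ai4bharat/indic-bert on the Hub (the model is ALBERT-based, so it loads through the standard Auto classes):

```python
# Minimal sketch of loading IndicBERT for feature extraction, assuming
# the "ai4bharat/indic-bert" checkpoint on the Hugging Face Hub.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ai4bharat/indic-bert")
model = AutoModel.from_pretrained("ai4bharat/indic-bert")

# Encode a Hindi sentence; max_length mirrors the 128-token limit
# noted for the benchmark runs below.
inputs = tokenizer(
    "भारत एक विशाल देश है",
    return_tensors="pt",
    truncation=True,
    max_length=128,
)
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (1, seq_len, hidden_size)
```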
Benchmark Results
We evaluate the IndicBERT model on the IndicGLUE benchmark and on a set of additional tasks. Here are the results we obtain:
IndicGLUE results

| Task | mBERT | XLM-R | IndicBERT |
|---|---|---|---|
| News Article Headline Prediction | 89.58 | 95.52 | 95.87 |
| Wikipedia Section Title Prediction | 73.66 | 66.33 | 73.31 |
| Cloze-style multiple-choice QA | 39.16 | 27.98 | 41.87 |
| Article Genre Classification | 90.63 | 97.03 | 97.34 |
| Named Entity Recognition (F1-score) | 73.24 | 65.93 | 64.47 |
| Cross-Lingual Sentence Retrieval Task | 21.46 | 13.74 | 27.12 |
| Average | 64.62 | 61.09 | 66.66 |
Additional Tasks

| Task | Task Type | mBERT | XLM-R | IndicBERT |
|---|---|---|---|---|
| BBC News Classification | Genre Classification | 60.55 | 75.52 | 74.60 |
| IIT Product Reviews | Sentiment Analysis | 74.57 | 78.97 | 71.32 |
| IITP Movie Reviews | Sentiment Analysis | 56.77 | 61.61 | 59.03 |
| Soham News Article | Genre Classification | 80.23 | 87.60 | 78.45 |
| Midas Discourse | Discourse Analysis | 71.20 | 79.94 | 78.44 |
| iNLTK Headlines Classification | Genre Classification | 87.95 | 93.38 | 94.52 |
| ACTSA Sentiment Analysis | Sentiment Analysis | 48.53 | 59.33 | 61.18 |
| Winograd NLI | Natural Language Inference | 56.34 | 55.87 | 56.34 |
| Choice of Plausible Alternatives (COPA) | Natural Language Inference | 54.92 | 51.13 | 58.33 |
| Amrita Exact Paraphrase | Paraphrase Detection | 93.81 | 93.02 | 93.75 |
| Amrita Rough Paraphrase | Paraphrase Detection | 83.38 | 82.20 | 84.33 |
| Average | 69.84 | 74.42 | 73.66 |
* Note: all models have been restricted to a max_seq_length of 128.
Training Setup
IndicBERT v1 is trained on IndicCorp v1, which has a total pretraining corpus size of 120GB containing 8.9 billion tokens across 12 languages.
License
MIT License
