AI4Bharat
IndicASR

Text Normalization for speech

IndicCorp Statistics for LM Training

Hindi
Vocab size 50k -
Your text file has 4202013424 words in total
It has 6792348 unique words
Your top-50000 words are 98.1308 percent of all words
Your most common word "के" occurred 174383875 times
The least common word in your top-k is "जाउंगी" with 1268 times
The first word with 1269 occurrences is "सुहैब" at place 4998
Bengali
Vocab size 50k -
Your text file has 1421370708 words in total
It has 7255906 unique words
Your top-50000 words are 94.4976 percent of all words
Your most common word "হাজার" occurred 13420488 times
The least common word in your top-k is "জমলে" with 1108 times
The first word with 1109 occurrences is "ব্রিকসের" at place 49996
Telugu
Vocab size 50k -
Your text file has 504285066 words in total
It has 6744453 unique words
Your top-50000 words are 88.2641 percent of all words
Your most common word "ఈ" occurred 6493097 times
The least common word in your top-k is "కనిపించినప్పుడు" with 687 times
The first word with 688 occurrences is "సేకరించాడు" at place 49988
Gujarati
Vocab size 50k -
Your text file has 654314870 words in total
It has 5115246 unique words
Your top-50000 words are 92.8772 percent of all words
Your most common word "છે" occurred 28282403 times
The least common word in your top-k is "લગાવીએ" with 618 times
The first word with 619 occurrences is "પાલનપોષણ" at place 49946
Tamil
Vocab size 50k -
Your text file has 660601684 words in total
It has 10664066 unique words
Your top-50000 words are 84.9398 percent of all words
Your most common word "இந்த" occurred 4422034 times
The least common word in your top-k is "ஸ்கரப்" with 1004 times
The first word with 1005 occurrences is "ஃபிடல்" at place 49978
Odia
Vocab size 50k -
Your text file has 71018901 words in total
It has 1168499 unique words
Your top-50000 words are 94.4467 percent of all words
Your most common word "ଓ" occurred 767261 times
The least common word in your top-k is "ଆଲିସା" with 53 times
The first word with 54 occurrences is "ଗ୍ରିନ" at place 49747
Marathi
Vocab size 50k -
Your text file has 571536994 words in total
It has 6105964 unique words
Your top-50000 words are 90.5154 percent of all words
Your most common word "आहे" occurred 14187400 times
The least common word in your top-k is "लिऑन" with 639 times
The first word with 640 occurrences is "वेलमध्ये" at place 49982
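
For reference, figures of the kind listed above (total words, unique words, top-k coverage, and the counts at the vocabulary cut-off) can be reproduced with a simple word counter. The following is a minimal sketch, not necessarily the script that produced these numbers: it assumes whitespace tokenization, and the function name vocab_stats and its return keys are hypothetical.

```python
from collections import Counter


def vocab_stats(lines, top_k):
    """Compute vocabulary statistics for an iterable of text lines.

    Assumption: words are separated by whitespace; no other
    normalization is applied here.
    """
    counter = Counter()
    for line in lines:
        counter.update(line.split())
    if not counter:
        raise ValueError("corpus contains no words")

    total = sum(counter.values())
    top = counter.most_common(top_k)          # (word, count), most frequent first
    covered = sum(count for _, count in top)  # occurrences covered by the top-k words
    last_word, last_count = top[-1]

    # Walk up from the cut-off to find the nearest word that is
    # strictly more frequent than the least common top-k word.
    first_above = None
    for i, (word, count) in enumerate(reversed(top)):
        if count > last_count:
            first_above = (word, count, len(top) - 1 - i)  # 0-based place
            break

    return {
        "total_words": total,
        "unique_words": len(counter),
        "coverage_percent": 100.0 * covered / total,
        "most_common": top[0],
        "least_common_in_top_k": (last_word, last_count),
        "first_above_cutoff": first_above,
    }


if __name__ == "__main__":
    sample = ["a b a c", "a b d"]
    print(vocab_stats(sample, top_k=2))
```

For a real corpus, passing an open file handle as lines keeps memory use proportional to the number of unique words rather than to the corpus size.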