Text Normalization for speech

IndicCorp Statistics for LM Training

Hindi
Vocab size: 50k
Your text file has 4202013424 words in total
It has 6792348 unique words
Your top-50000 words are 98.1308 percent of all words
Your most common word "के" occurred 174383875 times
The least common word in your top-k is "जाउंगी" with 1268 times
The first word with 1269 occurrences is "सुहैब" at place 4998
Bengali
Vocab size: 50k
Your text file has 1421370708 words in total
It has 7255906 unique words
Your top-50000 words are 94.4976 percent of all words
Your most common word "হাজার" occurred 13420488 times
The least common word in your top-k is "জমলে" with 1108 times
The first word with 1109 occurrences is "ব্রিকসের" at place 49996
Telugu
Vocab size: 50k
Your text file has 504285066 words in total
It has 6744453 unique words
Your top-50000 words are 88.2641 percent of all words
Your most common word "ఈ" occurred 6493097 times
The least common word in your top-k is "కనిపించినప్పుడు" with 687 times
The first word with 688 occurrences is "సేకరించాడు" at place 49988
Gujarati
Vocab size: 50k
Your text file has 654314870 words in total
It has 5115246 unique words
Your top-50000 words are 92.8772 percent of all words
Your most common word "છે" occurred 28282403 times
The least common word in your top-k is "લગાવીએ" with 618 times
The first word with 619 occurrences is "પાલનપોષણ" at place 49946
Tamil
Vocab size: 50k
Your text file has 660601684 words in total
It has 10664066 unique words
Your top-50000 words are 84.9398 percent of all words
Your most common word "இந்த" occurred 4422034 times
The least common word in your top-k is "ஸ்கரப்" with 1004 times
The first word with 1005 occurrences is "ஃபிடல்" at place 49978
Odia
Vocab size: 50k
Your text file has 71018901 words in total
It has 1168499 unique words
Your top-50000 words are 94.4467 percent of all words
Your most common word "ଓ" occurred 767261 times
The least common word in your top-k is "ଆଲିସା" with 53 times
The first word with 54 occurrences is "ଗ୍ରିନ" at place 49747
Marathi
Vocab size: 50k
Your text file has 571536994 words in total
It has 6105964 unique words
Your top-50000 words are 90.5154 percent of all words
Your most common word "आहे" occurred 14187400 times
The least common word in your top-k is "लिऑन" with 639 times
The first word with 640 occurrences is "वेलमध्ये" at place 49982
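
The figures above can be approximated with a short word-frequency script. The page does not say which tool produced them, so the following is only a minimal sketch: the function name `vocab_stats`, the whitespace tokenization, and the output keys are assumptions for illustration, and the "first word with N occurrences" line is not reproduced.

```python
from collections import Counter
from typing import Iterable


def vocab_stats(lines: Iterable[str], top_k: int = 50_000) -> dict:
    """Summarize top-k vocabulary coverage for a whitespace-tokenized corpus.

    Assumption: words are separated by whitespace; the original tool may
    tokenize or normalize differently.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())

    total = sum(counts.values())
    top = counts.most_common(top_k)  # (word, count) pairs, highest count first
    covered = sum(count for _, count in top)
    return {
        "total_words": total,
        "unique_words": len(counts),
        "coverage_percent": 100.0 * covered / total if total else 0.0,
        "most_common": top[0] if top else None,
        "least_common_in_top_k": top[-1] if top else None,
    }


if __name__ == "__main__":
    # Tiny in-memory corpus; for a real corpus, pass an open file handle
    # so the text is streamed line by line instead of loaded into memory.
    sample = ["a b a c", "a b d"]
    print(vocab_stats(sample, top_k=2))
```

The out-of-vocabulary rate for a 50k vocabulary is simply 100 minus the coverage figure, so by the numbers above Hindi leaves about 1.87 percent of running words out of vocabulary, while Tamil leaves about 15.06 percent.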
 