AI4Bharat

NeurIPS dataset paper plan

Goal: Release a dataset for approximate nearest neighbor search (ANNS) consisting of a 1 billion document set and a 1 million query set.
Plan:
Aggregate and deduplicate all the Indic sentence data we have across IndicCorp v1, v2, ..., into one large collection of sentences, keeping the language ID of each sentence.
Find the 16 exact nearest neighbors for as many queries as possible (restricted to queries we know have high-quality semantic matches), subject to time and compute constraints. This could use one of the efficient exact nearest neighbor index types that the FAISS Python package provides.
Benchmark approximate nearest neighbor search with FAISS IVFPQ and DiskANN in terms of training time, search performance, recall@16, etc.
Ask one research question.
Publish?
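
The first three steps of the plan can be sketched on toy data as follows. This is only an illustration under stated assumptions, not the project's pipeline: the function names, the NFC-plus-whitespace normalization used for deduplication, the choice to deduplicate on sentence text alone, and the squared L2 distance are all assumptions made here. At real scale the brute-force search would be replaced by an exact FAISS index (for example `IndexFlatL2`), and the approximate IDs would come from IVFPQ or DiskANN rather than being written by hand.

```python
import heapq
import unicodedata


def dedup_sentences(records):
    """Keep the first occurrence of each sentence along with its language ID.

    `records` is an iterable of (lang_id, sentence) pairs. Sentences are
    compared after NFC normalization and whitespace collapsing, which is an
    assumed normalization chosen for this sketch.
    """
    seen = set()
    unique = []
    for lang_id, sentence in records:
        key = " ".join(unicodedata.normalize("NFC", sentence).split())
        if key and key not in seen:
            seen.add(key)
            unique.append((lang_id, key))
    return unique


def exact_knn(query, docs, k):
    """Brute-force exact k nearest neighbors by squared L2 distance."""
    dists = (
        (sum((q - d) ** 2 for q, d in zip(query, doc)), idx)
        for idx, doc in enumerate(docs)
    )
    # nsmallest returns the k smallest (distance, index) pairs in order.
    return [idx for _, idx in heapq.nsmallest(k, dists)]


def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbors found by the approximate search."""
    return len(set(approx_ids[:k]) & set(exact_ids[:k])) / k


# Toy usage: two of the three records differ only in whitespace.
records = [("hi", "नमस्ते  दुनिया"), ("hi", "नमस्ते दुनिया"), ("ta", "வணக்கம்")]
corpus = dedup_sentences(records)

# Hypothetical 2-d embeddings standing in for sentence vectors.
docs = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
truth = exact_knn([0.1, 0.0], docs, k=2)

# Pretend an approximate index returned documents 0 and 2.
recall = recall_at_k([0, 2], truth, k=2)
```

In the plan above k would be 16, so recall@16 is the share of the 16 exact neighbors that each approximate index recovers, averaged over the query set.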


 