AI4Bharat

NeurIPS dataset paper plan

https://neurips.cc/Conferences/2021/CallForDatasetsBenchmarks

Goal: Release a dataset for approximate nearest neighbor search (ANNS) consisting of a 1-billion-document set and a 1-million-query set.

Plan:

Aggregate and deduplicate all the Indic sentence data we have across IndicCorp v1, v2, ..., into one large collection of sentences, keeping the language ID of every sentence.
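The aggregation step above could be sketched roughly as follows. This is only an illustration under stated assumptions: the input is assumed to be a stream of `(lang_id, sentence)` pairs, duplicates are assumed to be defined per language after Unicode NFC and whitespace normalization, and the names `normalize` and `deduplicate` are hypothetical. At the target scale the in-memory `seen` set would likely need to be sharded or moved to disk.

```python
import hashlib
import unicodedata


def normalize(sentence: str) -> str:
    """Apply Unicode NFC and collapse runs of whitespace (assumed normalization)."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def deduplicate(records):
    """Yield the first occurrence of each (lang_id, normalized sentence) pair.

    `records` is any iterable of (lang_id, sentence) tuples, for example
    the concatenation of several corpus versions. The language ID is kept
    alongside every surviving sentence.
    """
    seen = set()
    for lang_id, sentence in records:
        text = normalize(sentence)
        if not text:
            continue  # skip empty or whitespace-only lines
        # Hash the pair so the set stores fixed-size digests, not full strings.
        key = hashlib.sha1(f"{lang_id}\t{text}".encode("utf-8")).digest()
        if key in seen:
            continue
        seen.add(key)
        yield lang_id, text
```

Keying on the language ID as well as the text means an identical string appearing in two languages is kept twice; dropping `lang_id` from the key would deduplicate across languages instead.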

Find the 16 exact nearest neighbors for as many queries as possible (queries which we know have high-quality semantic matches), within time and compute constraints. This could use one of the efficient exact nearest neighbor search index types that the FAISS Python package provides.
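To make "16 exact nearest neighbors" concrete, here is a minimal brute-force sketch. It uses plain NumPy rather than FAISS, and it assumes sentences are already embedded as dense vectors compared by squared L2 distance; the plan fixes neither the embedding nor the metric. The function name `exact_knn` is hypothetical. It builds the full queries-by-documents distance matrix, so a real run would have to process queries and documents in batches.

```python
import numpy as np


def exact_knn(docs: np.ndarray, queries: np.ndarray, k: int = 16) -> np.ndarray:
    """Return the indices of the k exact nearest documents for each query.

    docs:    (n_docs, dim) array of document vectors
    queries: (n_queries, dim) array of query vectors
    Result:  (n_queries, k) integer array, nearest first. Requires k <= n_docs.
    """
    # Squared L2 distance via ||q - d||^2 = ||q||^2 - 2 q.d + ||d||^2
    dist = (
        (queries ** 2).sum(axis=1, keepdims=True)
        - 2.0 * (queries @ docs.T)
        + (docs ** 2).sum(axis=1)
    )
    # Select the k smallest distances per row (unordered), then sort only those k.
    idx = np.argpartition(dist, k - 1, axis=1)[:, :k]
    order = np.argsort(np.take_along_axis(dist, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)
```

The resulting index array is the ground truth against which approximate indexes can later be scored.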

Benchmark FAISS IVFPQ and DiskANN on training time, performance, recall@16, etc.
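The plan does not define recall@16. One common reading in ANN benchmarking, assumed here, is the fraction of the 16 exact neighbors that appear among the 16 results returned by the approximate index, averaged over queries. The helper name `recall_at_k` is hypothetical.

```python
def recall_at_k(approx_ids, exact_ids, k=16):
    """Mean fraction of the exact top-k neighbors found in the approximate top-k.

    approx_ids, exact_ids: one sequence of neighbor IDs per query, nearest
    first, in the same query order. Only the first k entries are compared.
    """
    total = 0.0
    for approx, exact in zip(approx_ids, exact_ids):
        hits = len(set(approx[:k]) & set(exact[:k]))
        total += hits / k
    return total / len(exact_ids)
```

For example, if an index returns two of the four true neighbors for a single query, `recall_at_k([[1, 2, 3, 4]], [[1, 2, 5, 6]], k=4)` evaluates to 0.5.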

Ask one research question.

Publish?
