AI4Bharat
IndicMining

NeurIPS dataset paper plan

Goal - Release a dataset of 1 billion documents and 1 million queries for approximate nearest neighbour search (ANNS).
Plan:
Aggregate and deduplicate all the Indic sentence data we have across IndicCorp v1, v2, ..., creating one large collection of sentences while retaining language ID information.
Find the 16 exact nearest neighbours for as many queries as possible under time and compute constraints, choosing queries that we know have high-quality semantic matches. This could use one of the efficient exact nearest neighbour search index types that the FAISS Python package provides.
Benchmark FAISS IVFPQ and DiskANN as ANNS methods in terms of training time, search performance, recall@16, etc.
Ask one research question.
Publish?
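
A minimal sketch of three pieces of the plan above: deduplication that keeps language IDs, exact top-k search as ground truth, and recall@k for scoring an approximate index against that ground truth. Everything named here is an assumption for illustration and not part of the plan itself: the function names, the normalization rule (Unicode NFC plus whitespace collapsing), the choice to treat the same string under two language IDs as two entries, and inner product as the similarity. The brute-force search is a pure-Python stand-in for small sanity checks; at the planned scale an exact FAISS index such as IndexFlatIP or IndexFlatL2 would presumably take its place.

```python
import hashlib
import heapq
import unicodedata


def normalize(sentence):
    """Assumed normalization: Unicode NFC plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", sentence).split())


def deduplicate(records):
    """Deduplicate (sentence, lang_id) pairs, keeping first-seen order.

    The hash key includes the language ID, so an identical string tagged
    with two different languages is kept twice (an assumption).
    """
    seen = set()
    unique = []
    for sentence, lang_id in records:
        text = normalize(sentence)
        if not text:
            continue  # skip sentences that are empty after normalization
        key = hashlib.sha1(f"{lang_id}\t{text}".encode("utf-8")).digest()
        if key not in seen:
            seen.add(key)
            unique.append((text, lang_id))
    return unique


def exact_top_k(query, vectors, k):
    """Brute-force exact search: indices of the k largest inner products.

    Ties are broken in favour of the smaller index.
    """
    scores = (
        (sum(q * v for q, v in zip(query, vec)), -i)
        for i, vec in enumerate(vectors)
    )
    return [-neg_i for _, neg_i in heapq.nlargest(k, scores)]


def recall_at_k(approx_ids, exact_ids, k):
    """Mean fraction of the exact top-k found in the approximate top-k."""
    hits = sum(
        len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids)
    )
    return hits / (k * len(exact_ids))
```

With k = 16, recall_at_k compares the 16 IDs returned per query by an approximate index (for example IVFPQ or DiskANN) against the 16 exact neighbours, so a value of 1.0 means every exact neighbour was recovered for every query.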

