Meeting Minutes

⁠
Mar 25

Main points:

1. Results of deduplication between SamV2 and SamV0.3

Current numbers vs Samanantar V0.3

lang_pair   Samanantar V0.3   Current results   New pairs
as_en                 49391             42771       42005
or_en                819974            267134      261904
pa_en               1730346            773817      757640
gu_en               2327874           1393423     1357353
mr_en               2467898           1557312     1524671
te_en               3896471           2010167     1964440
ml_en               4381343           2377905     2312760
ta_en               3184025           2644994     2598131
kn_en               3305039           2273129     2223731
bn_en               4733044           4197306     4116348
hi_en               4712970           5468996     5391802

For deduplication we used the definition "(s, t) and (u, v) are duplicates iff (((s == u) && (t == v)) || ((s == v) && (t == u)))". Under this definition, we found very little overlap between the current results and the Samanantar V0.3 IndicCorp data.
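The two-sided duplicate definition above can be sketched in a few lines of Python. This is only an illustration, not the script used for the table: the function names `pair_key` and `count_new_pairs` and the toy pairs are assumptions made for this example.

```python
def pair_key(src, tgt):
    # frozenset ignores order, so (s, t) and (t, s) share one key.
    return frozenset((src, tgt))


def count_new_pairs(current_pairs, previous_pairs):
    # Number of current pairs that have no duplicate in the previous release.
    seen = {pair_key(s, t) for s, t in previous_pairs}
    return sum(1 for s, t in current_pairs if pair_key(s, t) not in seen)


# Toy data (hypothetical): one exact repeat, one swapped repeat, one new pair.
previous = [("a", "x"), ("b", "y")]
current = [("a", "x"), ("y", "b"), ("a", "z")]
print(count_new_pairs(current, previous))  # 1: only ("a", "z") is new
```

Note that a pair only counts as a duplicate when both sides match, so the same query with a different retained target is counted as new.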

2. Hypothesis for why there is very little overlap: As discussed in previous meetings, the Samanantar paper states that the true cosine distances of the 16 approximate nearest-neighbour candidates were recomputed and that an argmax was used to keep the notional nearest neighbour while discarding the other 15. However, when we examined the code used by Mayank and others, we found that the argmax step, although performed, was never used for the downstream selection. This must have been a bug. It probably did not affect the quality of the mined bitext much, because we had noticed visually during earlier margin experiments that the 16 candidate nearest neighbours were usually all paraphrases of each other. So the pairs obtained in Samanantar v1 (0.2, 0.3, etc.) were still good bitext pairs. Our hypothesis is that the same query s found (more or less) the same 16 nearest-neighbour candidates both this time and last time, but the target retained out of those 16 candidates differs this time.

The new pairs are mainly due to a difference in how the target sentence is chosen:

Samanantar v0.3: Pick the top-ranked target sentence from the result of the faiss index.

Our method: Pick the target sentence that has the maximum LAS among the top 16 pairs obtained from the faiss index.

If SamV0.3 found (s, t1) in a particular language pair, SamV2 would have found (s, t2) with t1 == t2 only about 1/16 of the time on average, assuming each of the 16 candidates is equally likely to be the one retained.
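A minimal sketch of the two selection rules, to show how the same candidate list can yield different retained targets. The candidate format (a list of `(target, las)` tuples in index rank order), the function names, and the scores are all made up for illustration; this is not the actual mining code.

```python
def pick_top_ranked(candidates):
    # Keep whatever the approximate index ranked first.
    return candidates[0][0]


def pick_max_las(candidates):
    # Keep the candidate with the highest recomputed score.
    return max(candidates, key=lambda c: c[1])[0]


# Made-up candidates for one query, in index rank order (only 3 of 16 shown).
candidates = [("t1", 0.81), ("t2", 0.86), ("t3", 0.79)]
print(pick_top_ranked(candidates))  # t1
print(pick_max_las(candidates))     # t2
```

Both rules see the same candidates for the query, but they retain different targets whenever the index rank and the recomputed score disagree.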

3. Discussion with Gowtham and Sumanth earlier today: The latest version of IndicCorp is not a strict superset of the previous version of IndicCorp because of spammy news sites being removed as sources, certain sites being inaccessible, archived, etc. this time around.

An OSCAR dataset was included under the label IndicCorp (along with true IndicCorp) during the previous mining cycles and was used to arrive at the numbers in this spreadsheet.

The numbers we have listed as v1 Counts in the tables above are line counts from the sourcewise-splits from the AI4B webpage. Notice that these numbers are smaller than the ones in the spreadsheet. This is because deduplication had removed some pairs.

4. Because of the (s, t1) vs (s, t2) issue, we need to do a one-sided comparison/equality check instead of the two-sided definition of equality we used to get the New pairs numbers in the above table. We are planning to compute the following sets and examine their counts to validate the hypothesis that this is the issue. te_en is one pair in which there is a marked drop, so we examine:

A: Sam V0.3 Telugu queries among all of the obtained te_en bitext

B: Sam V2 Telugu queries among all of the obtained te_en bitext

C: A \ B

D: new IndicCorp’s Telugu sentences

E: English targets of [C ∩ D] from Sam V0.3

F: new IndicCorp’s English sentences

G: E ∩ F
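The sets A to G above could be computed roughly as follows. This is a sketch under assumptions: bitext is held in memory as lists of `(query, target)` tuples, the monolingual corpora as lists of sentences, and the function name `one_sided_overlap` and the toy data are hypothetical.

```python
def one_sided_overlap(old_pairs, new_pairs, new_corpus_src, new_corpus_tgt):
    a = {s for s, _ in old_pairs}   # A: queries in the old bitext
    b = {s for s, _ in new_pairs}   # B: queries in the new bitext
    c = a - b                       # C: old queries missing from the new bitext
    d = set(new_corpus_src)         # D: source-side sentences of the new corpus
    c_and_d = c & d
    # E: old targets whose query is still present in the new corpus
    e = {t for s, t in old_pairs if s in c_and_d}
    f = set(new_corpus_tgt)         # F: target-side sentences of the new corpus
    g = e & f                       # G: lost pairs whose both sides are still available
    sets = {"A": a, "B": b, "C": c, "D": d, "E": e, "F": f, "G": g}
    return {name: len(members) for name, members in sets.items()}


# Toy data (hypothetical), standing in for the te_en bitext and corpora.
old_pairs = [("s1", "t1"), ("s2", "t2"), ("s3", "t3")]
new_pairs = [("s1", "t1b")]
counts = one_sided_overlap(old_pairs, new_pairs, ["s1", "s2", "s4"], ["t2", "t9"])
print(counts)
```

A large G relative to C would mean that many old pairs were not re-mined even though both of their sentences are still in the new corpus.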

5. While eyeballing the downloaded SamanantarV0.3 dataset (which was accompanied by LAS metadata), we noticed that there were pairs with LAS even as low as 0.67 and 0.72. So it cannot have been the case that they used a strict 0.8 LAS filter.

6. Pratyush sir suggested that we merge all the versions of IndicCorp we have access to, together with OSCAR, deduplicate the sentences to obtain large canonical sets of monolingual data, and then redo the same mining operation on this larger set to get as much throughput as possible. We can also retain every one of the 16 candidates that satisfies the LAS threshold, since they are all good paraphrases of the query; there is no point in discarding all but one. This could be useful paraphrase data.
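A rough sketch of the two ideas in this point, assuming the corpora fit in memory as lists of lines. The function names, the whitespace-stripping step, and the threshold value passed in the example are assumptions for illustration, not decisions taken in the meeting.

```python
def merge_and_dedup(*corpora):
    # Merge several monolingual corpora, keeping each sentence once.
    # A dict preserves first-seen order; stripping whitespace is an assumption.
    seen = {}
    for corpus in corpora:
        for line in corpus:
            sentence = line.strip()
            if sentence:
                seen.setdefault(sentence, None)
    return list(seen)


def keep_above_threshold(candidates, threshold):
    # Keep every candidate target whose score meets the threshold,
    # instead of keeping only the single best one.
    return [target for target, las in candidates if las >= threshold]


merged = merge_and_dedup(["a", "b"], ["b ", "c", ""])
print(merged)  # ['a', 'b', 'c']

# Made-up scores; the threshold here is only an example value.
kept = keep_above_threshold([("t1", 0.81), ("t2", 0.86), ("t3", 0.79)], 0.8)
print(kept)  # ['t1', 't2']
```

For corpora of the size discussed here, an exact in-memory dictionary may not be practical, so the real pipeline would likely need hashing or sorting on disk.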

7. The discussion about the NeurIPS Datasets and Benchmarks track is detailed on the other page.

8. We are pausing data collection for Indic LaBSE, as we are planning to mine bitext with a larger dataset.
