Results of deduplication between SamV2 and SamV0.3
Current numbers vs Samanantar V0.3
| # | lang_pair | Samanantar V0.3 | Current results | New pairs |
|---|-----------|-----------------|-----------------|-----------|
| 1 | as_en | 49391 | 42771 | 42005 |
| 2 | or_en | 819974 | 267134 | 261904 |
| 3 | pa_en | 1730346 | 773817 | 757640 |
| 4 | gu_en | 2327874 | 1393423 | 1357353 |
| 5 | mr_en | 2467898 | 1557312 | 1524671 |
| 6 | te_en | 3896471 | 2010167 | 1964440 |
| 7 | ml_en | 4381343 | 2377905 | 2312760 |
| 8 | ta_en | 3184025 | 2644994 | 2598131 |
| 9 | kn_en | 3305039 | 2273129 | 2223731 |
| 10 | bn_en | 4733044 | 4197306 | 4116348 |
| 11 | hi_en | 4712970 | 5468996 | 5391802 |
Using the "(s, t) and (u,v) are duplicates iff (((s == u) && (t == v)) || ((s == v) && (t == u)))" definition for deduplication, we found that there was very little overlap between current results and Samanantar V0.3 IndicCorp data.
2. Hypothesis for why there is so little overlap: As discussed in previous meetings, the Samanantar paper states that the true cosine distances of the 16 approximate-nearest-neighbour candidates were recomputed and an argmax over them was used to keep the single best neighbour (discarding the other 15). However, while examining the code that Mayank et al. actually used, we found that this argmax step, although computed, was never used in the downstream selection. This was presumably a bug. It probably did not hurt the quality of the mined bitext much, because during the earlier margin experiments we had observed that the 16 candidate nearest neighbours were usually all paraphrases of each other, so the pairs obtained in Samanantar v1 (0.2, 0.3, etc.) were still good bitext pairs. Our hypothesis is therefore that the same query s would have found (more or less) the same 16 nearest-neighbour candidates both this time and last time, but the candidate retained out of the 16 will often differ between the two runs.
The new pairs arise mainly from the difference in how the target sentence is chosen:
Samanantar v0.3: pick the top-ranked target sentence from the FAISS index results.
Our method: pick the target sentence with the maximum LAS among the top-16 candidates returned by the FAISS index.
If SamV0.3 found (s, t1) for a given query in a particular language pair, SamV2 would have found (s, t2) with t1 == t2 only about 1/16 of the time on average (if the retained candidate is effectively arbitrary among the 16 near-paraphrases).
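A minimal sketch of the two selection strategies, assuming a FAISS index over target-sentence embeddings; `las_score` is a stand-in for the actual LAS computation and all names here are hypothetical:

```python
import numpy as np

def las_score(q_emb, t_emb):
    # Placeholder for the real (margin-based) LAS; plain cosine similarity is
    # used only to keep the sketch self-contained.
    return float(np.dot(q_emb, t_emb) /
                 (np.linalg.norm(q_emb) * np.linalg.norm(t_emb)))

def pick_target_v03(query_emb, index, target_sentences, k=16):
    # Samanantar v0.3 behaviour (per the code we inspected): keep the
    # top-ranked FAISS candidate; the recomputed argmax was never applied.
    _, idx = index.search(query_emb.reshape(1, -1).astype("float32"), k)
    return target_sentences[idx[0][0]]

def pick_target_v2(query_emb, index, target_sentences, target_embs, k=16):
    # Current behaviour: rescore the k candidates and keep the one with
    # the maximum LAS.
    _, idx = index.search(query_emb.reshape(1, -1).astype("float32"), k)
    cands = idx[0]
    scores = [las_score(query_emb, target_embs[i]) for i in cands]
    return target_sentences[cands[int(np.argmax(scores))]]
```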
3. Discussion with Gowtham and Sumanth earlier today: The latest version of IndicCorp is not a strict superset of the previous version of IndicCorp, because spammy news sites were removed as sources and certain sites were inaccessible, archived, etc. this time around.
An OSCAR dataset was included under the label IndicCorp (along with the true IndicCorp) during the previous mining cycles and was used to arrive at the numbers in
. Notice that these numbers are smaller than the ones in the spreadsheet; this is because deduplication had removed some pairs.
4. Because of the (s, t1) vs (s, t2) issue, we need to do a one-sided comparison (matching on the query side only) instead of the two-sided definition of equality we used to get the New pairs numbers in the table above. We plan to compute the following sets and examine their counts to validate the hypothesis that this is the issue. te_en is one pair with a marked drop, so we examine it first (see the sketch after this list):
A: Sam V0.3 Telugu queries among all of the obtained te_en bitext
B: Sam V2 Telugu queries among all of the obtained te_en bitext
C: A \ B
D: new IndicCorp's Telugu sentences
E: English targets of [C ∩ D] from Sam V0.3
F: new IndicCorp's English sentences
G: E ∩ F
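A sketch of these set computations for te_en using plain Python set operations; the loader variables (sam_v03_te_en, sam_v2_te_en, new_indiccorp_te, new_indiccorp_en) are placeholders for however the data is actually loaded:

```python
# sam_v03_te_en / sam_v2_te_en: lists of (telugu, english) pairs;
# new_indiccorp_te / new_indiccorp_en: lists of monolingual sentences.

A = {te for te, en in sam_v03_te_en}            # Sam V0.3 Telugu queries
B = {te for te, en in sam_v2_te_en}             # Sam V2 Telugu queries
C = A - B                                       # queries mined in V0.3 but not in V2
D = set(new_indiccorp_te)                       # new IndicCorp Telugu sentences
v03_target = {te: en for te, en in sam_v03_te_en}
E = {v03_target[te] for te in (C & D)}          # V0.3 English targets of C ∩ D
F = set(new_indiccorp_en)                       # new IndicCorp English sentences
G = E & F

for name, s in zip("ABCDEFG", (A, B, C, D, E, F, G)):
    print(name, len(s))
```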
5. While eyeballing the downloaded Samanantar V0.3 dataset (which was accompanied by LAS metadata), we noticed pairs with LAS as low as 0.67 and 0.72, so it cannot have been the case that a strict 0.8 LAS filter was applied.
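A quick way to verify this from the downloaded metadata (a sketch; the file path and column name are assumptions about how the LAS metadata is laid out):

```python
import pandas as pd

df = pd.read_csv("samanantar_v0.3_te_en_with_las.tsv", sep="\t")
print("min LAS:", df["las"].min())
print("fraction of pairs below 0.8:", (df["las"] < 0.8).mean())
```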
6. Pratyush sir suggested that we merge all the versions of IndicCorp we have access to together with OSCAR, deduplicate the sentences to obtain large canonical sets of monolingual data, and then redo the same mining operation on this larger set to get as much throughput as possible. We can also retain the entire subset of the 16 candidates that satisfy the LAS threshold, since they are all good paraphrases of the query; there is no point in discarding all but one, and this could be useful paraphrase data (see the sketch below).
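A sketch of this proposed change, reusing the hypothetical `las_score` and FAISS setup from the earlier sketch; the 0.8 threshold is illustrative:

```python
def pick_targets_above_threshold(query_emb, index, target_sentences, target_embs,
                                 k=16, las_threshold=0.8):
    # Keep every candidate that clears the LAS threshold instead of only the
    # argmax, so near-paraphrases of the query are retained as extra
    # bitext / paraphrase data.
    _, idx = index.search(query_emb.reshape(1, -1).astype("float32"), k)
    kept = []
    for i in idx[0]:
        score = las_score(query_emb, target_embs[i])
        if score >= las_threshold:
            kept.append((target_sentences[i], score))
    return kept
```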
7. The discussion about the NeurIPS Datasets and Benchmarks track is detailed on the other