
Reviews for submission P04rZjcgF7



Paper Decision

Decision · Program Chairs · 29 May 2024, 11:55 · Everyone
Decision: Accept
Comment:


The authors propose a new loss function and training regime, in the lineage of DSI models, that is document-context aware. The chairs appreciate that the authors also provide an ablation study of the moving parts of their proposed architecture.
The chairs decided to accept this paper on the grounds that the proposed methodology is novel and performant within its proposed evaluation framework. The reviewers converge on a similar opinion. The chairs do, however, share the concerns some reviewers expressed about that evaluation framework: the dataset (see osKE's review), the data augmentation regime, the model backbone, etc.
More precisely, the reviewers point out that the most recent model setting used dates back to 2022 (NCI). They indicate that this loss function and its training regime would benefit from being connected to more recent work (as cited by the reviewers).
For the camera-ready version, we ask the authors to consider the following changes:
Use a word other than "context" (reviewer Z55a).
Make another pass over the text (e.g., with a syntax checker) to catch typos (reviewer Z55a).
Clarify the methodology (reviewers Z55a and osKE). We can provide the authors with an additional page to describe the implementation in detail. Providing code, if possible, is also greatly encouraged.

Review of Context Aware Contrastive Learning Approach for Generative Retrieval

Official Review · Reviewer osKE · 27 May 2024, 15:50 (modified: 31 May 2024, 11:55) · Everyone
Review:

Summary

The paper introduces a method to enhance the learning of context-aware representations in GR models. Inspired by dense retrieval methods, the approach incorporates a margin-based contrastive loss over the encoder's output representations and a curriculum-based learning strategy to optimize the contrastive losses effectively and better model query-document representations. The proposed method is evaluated on the Natural Questions dataset, demonstrating improvements in effectiveness metrics over some models.
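To make the objective described above concrete, here is a minimal sketch of what a margin-based contrastive loss over pooled encoder representations could look like. The paper's exact pooling, similarity function, and margin value are not given in this review, so every choice below is an assumption:

```python
import torch
import torch.nn.functional as F

def margin_contrastive_loss(q_repr, pos_doc_repr, neg_doc_repr, margin=0.5):
    """Triplet-style margin loss (illustrative): pull the query's pooled
    encoder representation toward its gold document's representation and
    push it away from a negative document's representation."""
    pos_sim = F.cosine_similarity(q_repr, pos_doc_repr, dim=-1)
    neg_sim = F.cosine_similarity(q_repr, neg_doc_repr, dim=-1)
    # Penalize negatives whose similarity comes within `margin` of the positive.
    return F.relu(margin - pos_sim + neg_sim).mean()
```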

Strengths

Novelty - The integration of context-aware contrastive learning with generative retrieval models is an interesting step that addresses key limitations in GR systems. The paper presents a method that aims to enhance the semantic understanding of queries and documents.
Effectiveness - The proposed CALM model demonstrates consistent improvements across various metrics, such as R@1, R@10, and MRR, compared to their baseline implementation.
Curriculum Learning - The use of curriculum learning to progressively increase the difficulty of negative samples is a noteworthy introduction to this paradigm (a schedule of this kind is sketched below).
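As a concrete illustration of the curriculum idea praised above, a schedule might linearly ramp the share of hard negatives over training. This is only a sketch; the pools, the linear ramp, and the negative count are assumptions, not the paper's procedure:

```python
import random

def curriculum_negatives(easy_pool, hard_pool, epoch, total_epochs, k=9):
    """Illustrative curriculum schedule: begin with easy (random) negatives
    and linearly increase the share of hard negatives (e.g., mined from a
    trained GR model) as training progresses."""
    n_hard = min(k, round(k * epoch / max(total_epochs - 1, 1)))
    return random.sample(hard_pool, n_hard) + random.sample(easy_pool, k - n_hard)
```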

Weaknesses

Clarity - Some sections, particularly 3.7, contain incomplete sentences and unclear descriptions, making it difficult for me to fully understand the proposed methods. For instance, phrases like "For contrastive learning experimented with two types of negatives, first one where we have 9 randomly selected negatives. And for curriculum learning in initial training data negatives are ones ..." and "The decoder query representation at 𝑖th ... Where $D^i_q$ is the encoder representation at the 𝑖th step" were confusing to me. The citation "[15] Rodrigo Nogueira, Jimmy Lin, and AI Epistemic. 2019. From doc2query to" is botched. Improving the clarity and coherence of these sections would enhance readability and ensure that readers can fully grasp the proposed techniques.
Baseline Comparisons - The choice of baselines should be far more comprehensive. Including recent, popular models and a broader range of retrieval techniques would provide stronger validation of the proposed method's effectiveness. Firstly, the exact NQ setting is never specified. Assuming it is NQ100K, Pradeep et al. [1] report similar NCI/BM25 scores to those in Table 2, and their model from roughly a year earlier scores 70.7 on NQ100K, significantly higher than the reported CALM method without any of these additional objectives. Additionally, many other settings explored in their paper largely improve over CALM. Given that their model achieves this with only data-choice modifications (FirstP passages + DaQ + in-domain D2Q, keeping the T5 model consistent), I don't see why all of this was ignored in this paper while the reported scores are significantly worse.
Loss Function Combinations - The paper introduces several loss functions, such as margin loss, listwise loss, and context-aware label mapping loss, but never specifies the final combination used during training. Explicitly detailing the final combination and the rationale for choosing specific loss functions would clarify the training methodology (a hypothetical combination is sketched after this list).
Other datasets? - More emphasis could be placed on how the model generalizes to completely unseen datasets or domains beyond NQ. Natural Questions is a small-scale dataset, and, as the generative-search literature shows, findings on it often fail to translate. Assessing the model's effectiveness on a wider range of datasets would give a better picture of its generalization capabilities.
Given that this paper misses out on proper benchmarking, ignores many recent findings in generative retrieval, has a fairly unclear methodology, and uses insufficient datasets, I am leaning heavily toward rejection.
[1] Ronak Pradeep, Kai Hui, Jai Gupta, Adam D. Lelkes, Honglei Zhuang, Jimmy Lin, Donald Metzler, Vinh Q. Tran. “How Does Generative Retrieval Scale to Millions of Passages?”
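To make the loss-combination point concrete: a paper introducing this many objectives should state its final weighted sum explicitly, e.g. something of the following form, where every term and weight is hypothetical since the paper never specifies the combination:

```latex
% Hypothetical overall objective; terms and weights are assumptions:
\mathcal{L} = \mathcal{L}_{\mathrm{tok}}
            + \lambda_{1}\,\mathcal{L}_{\mathrm{margin}}
            + \lambda_{2}\,\mathcal{L}_{\mathrm{listwise}}
            + \lambda_{3}\,\mathcal{L}_{\mathrm{CALM}}
```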
Rating: -2: Ok but not good enough - rejection
Confidence: 5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature

Review: Context Aware Contrastive Learning Approach for Generative Retrieval

Official Review · Reviewer VDM6 · 25 May 2024, 12:19 (modified: 31 May 2024, 11:55) · Everyone
Review:

Strengths

Novel Approach: The integration of context-aware contrastive learning with generative retrieval models is a novel contribution that addresses significant limitations in current methods. The introduction of Context Aware Label Mapping (CALM) is particularly innovative.
Thorough Evaluation: The authors conduct extensive experiments using the Natural Questions dataset, providing robust evidence of the effectiveness of their approach. The evaluation covers various aspects, including the impact of curriculum learning and the comparison of different loss functions.
Clear Methodology: The paper provides a detailed and clear description of the proposed techniques, making it easier for other researchers to understand and potentially replicate the study. The use of margin-based contrastive loss and the curriculum-based learning strategy are well-explained.

Weaknesses

Limited Dataset: While the experiments on the Natural Questions dataset are comprehensive, the evaluation would be stronger if additional datasets were included. This would help demonstrate the generalizability of the proposed method across different types of retrieval tasks.
Scalability Concerns: The approach may face scalability issues due to the computational complexity of contrastive learning, especially when dealing with large-scale datasets. The paper does not thoroughly address potential scalability challenges and their solutions.
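To make the scalability concern concrete: the cost of margin-style contrastive training grows with the number of explicitly encoded negatives per query. A standard mitigation from dense retrieval, not discussed in the paper and shown here only as a sketch, is to reuse in-batch documents as free negatives:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_reprs, doc_reprs, temperature=0.05):
    """In-batch negatives (illustrative): each query's own document is its
    positive and every other document in the batch is a negative, so a batch
    of B pairs yields B-1 negatives per query at no extra encoding cost."""
    q = F.normalize(q_reprs, dim=-1)
    d = F.normalize(doc_reprs, dim=-1)
    sims = q @ d.T / temperature      # (B, B) cosine-similarity logits
    labels = torch.arange(q.size(0))  # diagonal entries are the positives
    return F.cross_entropy(sims, labels)
```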
Rating: 2: Good paper, accept
Confidence: 4: The reviewer is confident but not absolutely certain that the evaluation is correct

Review

Official Review · Reviewer Z55a · 24 May 2024, 07:05 (modified: 31 May 2024, 11:55) · Everyone
Review:
This paper is motivated by the fact that while generative retrieval seeks to capture richer query-document relationships, these are not fully modeled in the generative retrieval process, which only learns mappings from queries to docids, omitting document representations altogether. To fix this, the paper seeks to introduce document representations into various parts of the generative retrieval (GR) encoder-decoder model via multiple auxiliary contrastive losses that push various hidden states of the model closer to real document representations (which are typically not modeled at all in GR).
Strengths:
Originality: Altering the loss to promote the encoder's representation of the query to be more similar to its representation of a document, and the decoder's representation of a docid to be closer to a document, is an interesting idea. Intuitively, this should make the mapping from query to docid more natural if both sides tend toward some shared document representation.
The idea of training a normal GR model first and then sampling it to construct negatives is novel in the GR setting and mirrors offline preference-optimization techniques like Direct Preference Optimization (a possible mining procedure is sketched after these strengths).
The authors present results that improve over strong baselines.
The paper presents solid experiments and ablations that help the reader understand the contribution of the different components.
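The mining procedure referenced above could, for instance, look like the following sketch: beam-decode from the already-trained GR model and keep near-miss docids as hard negatives for a second training stage. The HuggingFace-style generate API and all parameters here are assumptions, not the paper's implementation:

```python
import torch

@torch.no_grad()
def mine_hard_negatives(model, tokenizer, query, gold_docid, k=10):
    """Illustrative hard-negative mining: the top-k decoded docids that are
    not the gold docid are near-misses, i.e., hard negatives."""
    inputs = tokenizer(query, return_tensors="pt")
    beams = model.generate(**inputs, num_beams=k + 1, num_return_sequences=k + 1)
    decoded = [tokenizer.decode(b, skip_special_tokens=True) for b in beams]
    return [d for d in decoded if d != gold_docid][:k]
```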
Weaknesses
Quality & Clarity: Perhaps the biggest weakness of this paper is its presentation, coverage of relevant literature, and writing.
The paper is missing some important references, such as Pradeep et al., "How Does Generative Retrieval Scale to Millions of Passages?", Sun et al., "Learning to Tokenize for Generative Retrieval", and Zeng et al., "Scalable and Effective Generative Information Retrieval", among others, which have implications for this work. In many ways, learned document ids (e.g., via quantization approaches) may effectively bake document representations into the learning process. This would make L_tok itself effectively model query-document closeness.
The use of "Context Aware" is quite confusing in this paper, as "context" typically refers to what an LLM is attending to (its context window), or to the context surrounding something else. Here, the "context" refers to the document that corresponds to the query, and "aware" is implemented only indirectly, as augmentations of the loss.
The use of LMHead is nonstandard and the acronym is never defined. IIUC, this is the output projection before the softmax. Moreover, it is typically considered part of the decoder, especially if the embedding table is considered part of the encoder (which you implicitly do here). Equation (2) is missing the softmax (a plausible corrected form is sketched after these weaknesses). The use of superscripts is generally not well defined.
The authors seek to apply contrastive learning to an encoder-decoder during GR training, and do so at several different hidden representations of the encoder-decoder. More motivation could be provided. Some particular points:
In Section 3.4, it is somewhat intuitive why you might want to promote the encoding of the query to be similar to the encoding of a document. How this is applied to the decoder is confusing. Intuitively, the goal might be to push the decoder's hidden states for a docid closer to a document embedding, but the passage seems to discuss making the decoder's representation of a query closer to a document. However, the decoder is never trained to decode queries, only docids, so I am unsure why this is the case.
Eq. (3) is defined as "the encoder representation at the 𝑖th step", yet the encoder representation of the query is the same across all steps of the decoder. The section then refers to a "query decoder", but this GR model only ever sees queries on the input/encoder side. Equation (4) is described as summing over all positions, but the summation is over |d|, which was earlier defined as the docid.
In Section 3.6, the motivation and the connection between decoding docids and product quantization are fair; however, it seems that every token in the docid is encouraged to be similar to a single document representation, rather than to any sequential decomposition of the document representation (as semantic ids are). A clearer explanation is needed here.
There are multiple typos throughout.
One of the auxiliary losses, Equation 12/13, does not seem to appear in the overall loss in Equation 17. If it was tried and then dropped from the final technique, this should be stated clearly.
Significance: The significance of this work leans weak. The approach is very complicated but yields a relatively small improvement (+2 pts) over the fair baseline, and an even smaller one (<0.5 pts) over NCI. Pradeep et al. 2023 (above) showed that data augmentation alone can create more lift than what is reported here.
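Regarding the Equation (2) point above, a plausible corrected form with the missing softmax restored is sketched below; the symbols are reconstructed from this review's description (LMHead as the output projection, a decoder hidden state at step i), not taken from the paper:

```latex
% Plausible corrected Equation (2): distribution over docid tokens at
% decoding step i, with the missing softmax restored (symbols assumed):
P(d_i \mid q, d_{<i}) = \mathrm{softmax}\left( W_{\mathrm{lm}} \, h_i \right)
```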
Rating: 1: Marginally above acceptance threshold
Confidence: 5: The reviewer is absolutely certain that the evaluation is correct and very familiar with the relevant literature
