Evaluation methodologies: factuality, attribution, diversity, etc.
Open
Benchmarks and datasets: what's missing?
Open
Feedback loops: what are the best implicit and explicit means for gathering feedback with generative IR systems (especially those that summarize multiple pieces of content into a unified summary)?
Open
Completeness, exclusivity, relevance ordering.
Open
Can we quantify how much the model hallucinates based on the prompt ("faithfulness"?) vs. based on the implicit knowledge absorbed from the corpus at training time ("factuality"?)?
Open
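One way to make this distinction measurable: compare open-book and closed-book answers against a gold reference. The sketch below is a minimal illustration of that idea, not an established method; `ask_model` is a hypothetical stand-in for any LLM call, stubbed here so the snippet runs.

```python
# Sketch: separating "faithfulness" errors (the prompt derails the model)
# from "factuality" errors (the parametric knowledge itself is wrong).
# `ask_model` is a hypothetical stand-in for an LLM call; stubbed here.

def ask_model(question: str, context: str | None = None) -> str:
    """Hypothetical LLM call; replace with a real API."""
    return "stub answer"

def classify_error(question: str, context: str, gold: str) -> str:
    open_book = ask_model(question, context=context)
    closed_book = ask_model(question)  # no context: parametric knowledge only
    if open_book.strip().lower() == gold.strip().lower():
        return "correct"
    if closed_book.strip().lower() != gold.strip().lower():
        # wrong with *and* without the prompt: likely a parametric
        # ("factuality") failure learned at training time
        return "factuality error"
    # right without the prompt but wrong with it: the prompt itself
    # derailed the model, i.e. a grounding ("faithfulness") failure
    return "faithfulness error"
```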
Multimodality as a solution to hallucination?
Open
What is the right term: hallucination / truthfulness / honesty / factuality? Does it matter to agree on one?
Open
How much can we fit in the prompt?
Open
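A practical sub-question here is how to budget context across retrieved passages. A minimal sketch of greedy packing under a token budget, assuming the `tiktoken` tokenizer (any tokenizer with an `encode` method would do):

```python
# Sketch: greedily pack retrieved passages into a fixed token budget.
# Assumes the `tiktoken` library; the budget value is illustrative.
import tiktoken

def pack_prompt(passages: list[str], budget: int = 4096) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    packed, used = [], 0
    for p in passages:  # passages assumed pre-sorted by relevance
        n = len(enc.encode(p))
        if used + n > budget:
            break
        packed.append(p)
        used += n
    return packed
```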
What is the way forward for answer generation? Longer prompts / LLM memory / more of the same / scaling data and parameters?
Open
What's next for Generative Document Retrieval? Do we need to change docID representations / the loss function / make it multimodal? And can we implement it in practice?
Open
How do we handle several relevant documents per query?
Open
How do we handle truths that require multi-hop reasoning across documents?
Open
How do we handle differing or conflicting truths?
Open
Can we remove all the trivia facts that the LLM learned, so that it can't hallucinate and answers only from the prompt?
Open
Is there anything we can do to protect against prompt injections?
Open
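No robust defense is known; as a purely illustrative baseline, one could pre-filter retrieved text for common override phrases before it enters the prompt. Pattern matching like this is trivial to evade, so the sketch below is a starting point for discussion, not a mitigation:

```python
# Sketch: a naive pre-filter that flags likely injection phrases in
# retrieved documents before they are placed into the prompt.
# Illustrative only; the pattern list is a made-up example.
import re

INJECTION_PATTERNS = [
    r"ignore (all )?(previous|prior) instructions",
    r"you are now",
    r"system prompt",
    r"disregard the above",
]

def looks_injected(passage: str) -> bool:
    text = passage.lower()
    return any(re.search(p, text) for p in INJECTION_PATTERNS)
```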
Dynamic corpora: how do we handle new documents, document updates, and document deletions?
Open
How do we design appropriate docIDs?
Open
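One existing line of work (e.g., DSI-style "semantic" identifiers) derives docIDs from hierarchical clustering of document embeddings, so that similar documents share docID prefixes. A minimal sketch with scikit-learn k-means, assuming precomputed embeddings; parameter values are illustrative:

```python
# Sketch: semantically structured docIDs via recursive k-means over
# document embeddings (in the spirit of DSI's "semantic" identifiers).
# Assumes embeddings are precomputed, e.g. by a dual encoder.
import numpy as np
from sklearn.cluster import KMeans

def semantic_docids(emb: np.ndarray, k: int = 10, max_leaf: int = 10,
                    prefix: tuple[int, ...] = ()) -> dict[int, tuple[int, ...]]:
    """Map row index -> docID digit tuple, e.g. (3, 0, 7)."""
    ids: dict[int, tuple[int, ...]] = {}
    if len(emb) <= max_leaf:
        for i in range(len(emb)):  # small cluster: enumerate as leaves
            ids[i] = prefix + (i,)
        return ids
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(emb)
    for c in range(k):
        idx = np.where(labels == c)[0]
        sub = semantic_docids(emb[idx], k, max_leaf, prefix + (c,))
        for local, digits in sub.items():  # map local rows back to global
            ids[int(idx[local])] = digits
    return ids
```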
How can we express uncertainty estimates? Can we use the logits as proxies for factuality?
Open
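As a concrete starting point, the per-token log-probabilities already give a cheap sequence-level confidence score (the length-normalized likelihood); whether that tracks factuality is exactly the open question. A toy sketch, with made-up logprob values for illustration:

```python
# Sketch: length-normalized sequence confidence from per-token
# log-probabilities, one candidate proxy for factuality. The numbers
# below are toy values; real logprobs come from the decoder.
import math

def sequence_confidence(token_logprobs: list[float]) -> float:
    """Geometric-mean token probability, in [0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

print(sequence_confidence([-0.1, -0.2, -0.05]))  # confident: ~0.89
print(sequence_confidence([-1.5, -2.0, -0.9]))   # uncertain: ~0.23
```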
Is factuality related to robustness to changes in the LLM's temperature (i.e., randomness)?
Open
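One way to operationalize this: sample the same question at several temperatures and measure how often the answers agree. A minimal sketch, where `sample_answer` is a hypothetical LLM call, stubbed so the snippet runs:

```python
# Sketch: answer stability across temperatures as a candidate proxy for
# factual confidence. `sample_answer` is a hypothetical LLM call (stub).
from itertools import combinations

def sample_answer(prompt: str, temperature: float) -> str:
    """Hypothetical sampling call; replace with a real API."""
    return "stub"

def consistency(prompt: str, temps=(0.0, 0.5, 1.0), n: int = 5) -> float:
    answers = [sample_answer(prompt, t) for t in temps for _ in range(n)]
    pairs = list(combinations(answers, 2))
    agree = sum(a == b for a, b in pairs)
    return agree / len(pairs)  # 1.0 = fully stable across temperatures
```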