
LegalBench-RAG as a Baseline

From UK Statutory Retrieval to Contract-Based RAG Evaluation
My initial research focused on Retrieval-Augmented Generation (RAG) for UK statutory legal texts. The goal was to explore how retrieval systems could support accurate legal question-answering over UK legislation. This domain has high practical value, covering legal rights, obligations, and procedures, and I successfully implemented a prototype pipeline using ColBERT as a reranker.
Key finding: ColBERT significantly improved retrieval quality while reducing the number of retrieved documents needed by over 80%, demonstrating both efficiency and accuracy improvements in legal RAG.
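To make the late-interaction idea concrete, below is a minimal reranking sketch in Python. It uses a generic Hugging Face encoder (bert-base-uncased) as a stand-in for a trained ColBERT checkpoint, and the helper names (token_embeddings, maxsim_score, rerank) are illustrative rather than taken from my actual pipeline; treat it as a sketch of the scoring scheme, not the production setup.

```python
# Minimal sketch of ColBERT-style late-interaction (MaxSim) reranking.
# Assumption: bert-base-uncased is a placeholder encoder; a real run would load
# a trained ColBERT checkpoint and follow its query/document encoding conventions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # placeholder, not the actual model used above
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

def token_embeddings(text: str) -> torch.Tensor:
    """L2-normalised per-token embeddings for one text, shape (tokens, dim)."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state.squeeze(0)
    return torch.nn.functional.normalize(hidden, dim=-1)

def maxsim_score(query: str, passage: str) -> float:
    """Late interaction: each query token keeps its best match over passage tokens."""
    q, p = token_embeddings(query), token_embeddings(passage)
    return (q @ p.T).max(dim=1).values.sum().item()

def rerank(query: str, passages: list[str], k: int = 5) -> list[tuple[float, str]]:
    """Score candidate passages against the query and return the top k."""
    scored = sorted(((maxsim_score(query, p), p) for p in passages), reverse=True)
    return scored[:k]
```

The appeal of late interaction for statutes (and, I expect, contracts) is that every query token is matched against individual passage tokens, so precise statutory or contractual terms are not washed out into a single pooled embedding.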
⚠️ Challenges with Statutory RAG Research
However, further development in this direction was limited due to two key factors:
Lack of evaluation data: There is currently no large-scale, annotated QA dataset for UK statutory law that allows for consistent benchmarking. Creating such a dataset would require legal expertise, time, and funding.
Lack of community baselines: There is no established benchmark for statutory RAG to compare against. This limits the ability to validate and publish findings with credible comparisons.
Given the time constraints of a PhD and limited legal annotation support, I made a strategic and methodologically sound shift to a better-supported legal domain.

🧭 Shift to Contracts:

The field of contract-based legal NLP has seen major advancements in recent years, including the release of several high-quality, expert-annotated datasets used by the community. These are already integrated into a standardized benchmark called LegalBench-RAG.
LegalBench-RAG is a benchmark for evaluating the retrieval step of RAG systems. It contains 6,889 legal questions, each linked to exact text spans in real-world legal contracts and policies. It is constructed from the following datasets:
[Image: overview table of the source datasets: PrivacyQA, CUAD, MAUD, and ContractNLI]
These datasets are:
Publicly available
Professionally annotated
Published at top venues (e.g. ACL, EMNLP)
🧠 My Contribution Going Forward:
Now that I’ve shifted to contract-based RAG, I will replicate LegalBench-RAG’s evaluation, but using an improved retrieval framework.
The LegalBench-RAG authors found that their best setup (RCTS + no reranker) achieved only ~14.38% Precision@1 and ~84% Recall@64 on the easiest dataset, PrivacyQA, and lower on CUAD and MAUD.
However, in my prior work on UK legislation, I found that ColBERT, a dense late interaction model, significantly enhanced retrieval performance. I plan to test this reranker on the LegalBench-RAG datasets and then:

📐 Build a new framework:

Keep the same datasets (CUAD, MAUD, ContractNLI, PrivacyQA)
Improve the chunking (semantic-aware)
Improve the retriever (dense + legal embeddings)
Improve the reranker (ColBERT or fine-tuned legal cross-encoder)
Keep evaluation consistent (Precision@k, Recall@k), as sketched below
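As a concrete reference point, here is a rough sketch of how span-level Precision@k and Recall@k could be computed. It assumes both retrieved chunks and gold answers are represented as character-offset intervals (start, end) in the same source document, and that gold spans do not overlap; the official LegalBench-RAG scoring code may differ in its exact conventions, so this is illustrative rather than a drop-in replacement.

```python
# Rough sketch of span-level Precision@k / Recall@k for retrieval evaluation.
# Assumption: retrieved chunks and gold answer spans are character intervals
# [start, end) in the same document, and gold spans are non-overlapping.

def overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Number of characters shared by two [start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))

def precision_recall_at_k(retrieved: list[tuple[int, int]],
                          gold: list[tuple[int, int]],
                          k: int) -> tuple[float, float]:
    """Character-level precision and recall of the top-k retrieved chunks."""
    top_k = retrieved[:k]
    retrieved_chars = sum(end - start for start, end in top_k)
    gold_chars = sum(end - start for start, end in gold)
    hit_chars = sum(overlap(chunk, span) for chunk in top_k for span in gold)
    precision = hit_chars / retrieved_chars if retrieved_chars else 0.0
    recall = hit_chars / gold_chars if gold_chars else 0.0
    return precision, recall

# Toy example: one 100-character gold span, two retrieved 80-character chunks.
p_at_2, r_at_2 = precision_recall_at_k([(0, 80), (200, 280)], [(50, 150)], k=2)
```

Keeping this metric fixed across the baseline and the improved framework is what makes the chunking, retriever, and reranker changes directly comparable.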

🔗 Links to Datasets

LegalBench-RAG benchmark:
