NanoRTEB / NanoLegalSummarization
Overview
NanoLegalSummarization is an English legal paraphrase retrieval task from NanoRTEB. The query is a short plain-English summary of a contract or terms-of-service clause, and the relevant documents are the formal legal text passages that express the same obligation, permission, or policy. Some queries have multiple positives. BM25, dense retrieval, and reranking_hybrid are all competitive, with reranking_hybrid the strongest overall because the task needs both shared legal terminology and semantic paraphrase matching.
Details
What the Original Data Measures
The 2019 paper Plain English Summarization of Contracts introduced a corpus of contract clauses paired with plain-English summaries. The original task is summarization: making legal text easier to understand.
RTEB repurposes the alignment as retrieval. The simplified summary becomes the query, and the formal clause text becomes the document to retrieve. This tests whether a model can connect informal explanations to legal language.
Observed Data Profile
The Nano split contains 200 queries, 438 documents, and 345 positive qrel rows. Queries average 103.06 characters, while documents average 606.16 characters. Positives per query average 1.73, with a minimum of 1, a median of 1, and a maximum of 11. Fifty-six queries, or 28.0%, have multiple positives.
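The per-query statistics above can be recomputed from the positive qrel rows. The sketch below is illustrative only: it assumes qrels are available as `(query_id, doc_id)` pairs, and the function name and toy data are hypothetical rather than part of the dataset tooling.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query_stats(qrels):
    """Summarize positives per query from (query_id, doc_id) pairs."""
    # Deduplicate pairs so a repeated row is not counted twice.
    counts = Counter(query_id for query_id, _ in set(qrels))
    values = list(counts.values())
    multi = sum(1 for v in values if v > 1)
    return {
        "queries": len(values),
        "positive_rows": sum(values),
        "avg": mean(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "multi_positive_share": multi / len(values),
    }


# Toy qrels (hypothetical ids), not rows from the real split.
toy_qrels = [("q1", "d1"), ("q2", "d2"), ("q2", "d3"), ("q3", "d4")]
stats = positives_per_query_stats(toy_qrels)
```

Queries with no positives would not appear in such a list, which matches the reported minimum of 1 for this split.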
Example summaries describe location data collection, permission to modify a game but not distribute hacked clients, deletion of inactive virtual goods, unilateral terms changes, and access to files stored in a cloud service.
BM25 Evaluation Profile
The BM25 candidate subset covers the full 438-document pool (its top-500 cutoff exceeds the corpus size) and reaches nDCG@10 of 0.5678, hit@10 of 0.7800, and recall@100 of 0.8667. BM25 is strong because summaries and clauses often share names, legal concepts, service terms, or domain words.
Its limitation is paraphrase. Informal summaries often restate a clause in lay language, for example describing liability, termination, or content access without the exact legal phrasing found in the document, which leaves BM25 few terms to match.
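A small Okapi BM25 scorer makes both the strength and the limitation concrete. This is a sketch under stated assumptions: the source does not say which BM25 implementation, tokenizer, or parameters produced the reported numbers, so whitespace tokenization and the common defaults `k1=1.5`, `b=0.75` are assumed here, and the two documents are toy clauses.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Toy clauses (assumed), not documents from the split.
docs = [
    "we may collect use and share precise location data",
    "we may terminate your account without prior notice",
]
# Shared vocabulary: the first clause wins clearly.
lexical = bm25_scores("this service may collect use and share location data", docs)
# Lay paraphrase of the second clause with no shared tokens: every score is zero.
paraphrase = bm25_scores("they can kick you off anytime", docs)
```

The second query describes account termination, but because it shares no token with either clause, BM25 cannot separate the relevant clause from the irrelevant one.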
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses the full 438-document pool and reaches nDCG@10 of 0.5861, hit@10 of 0.7850, and recall@100 of 0.9159. Dense retrieval slightly improves over BM25 on early rank quality and clearly improves recall@100.
This indicates that semantic similarity helps bridge plain-English explanations and formal legal text. It is useful when the query summarizes the effect of a clause rather than quoting it.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 13 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.6085, hit@10 of 0.8100, and recall@100 of 0.9246. Hybrid retrieval is the strongest profile on all reported metrics.
The result reflects the task structure. Sparse matching preserves service names and legal terms, while dense retrieval captures paraphrase between plain-English summaries and formal clauses.
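One common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The source does not describe how the reranking_hybrid candidates are actually built, so RRF is swapped in here purely as an illustration; the constant `k=60`, the function name, and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into one list by summed 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Hypothetical rankings for one query.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents ranked well by both retrievers (`d1`, `d3`) rise above documents found by only one of them, which is the behavior a hybrid candidate set relies on.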
Metric Interpretation for Model Researchers
Because some queries have multiple positives, nDCG@10 rewards placing several matching clauses early, hit@10 measures whether at least one relevant clause appears in the first ten, and recall@100 measures broader candidate availability.
For NanoLegalSummarization, nDCG@10 is the most useful top-rank signal. A good model should retrieve the correct clause family, not just one document with a shared service name.
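The three metrics can be sketched for a single query as follows. This assumes binary relevance (a document is either a positive or not) and the standard log2 discount; the source does not state the exact evaluation code, so the function names and the toy ranking are illustrative.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: discounted gain of the top k over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(positives), k)))
    return dcg / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def recall_at_k(ranked, positives, k=100):
    """Fraction of the positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


# Toy query with two positives, ranked second and fourth.
ranked = ["d5", "d2", "d8", "d1"]
positives = {"d2", "d1"}
ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
```

In this toy case hit@10 and recall@100 are both perfect, while nDCG@10 is penalized because neither positive sits at rank 1. That gap is why nDCG@10 is the more sensitive top-rank signal for multi-positive queries.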
Query and Relevance Type Tendencies
Queries are plain-English legal summaries, often short and informal. Relevant documents are formal contract or terms-of-service passages. They may include permissions, restrictions, data-use policies, account termination rules, or liability language.
Relevance is semantic equivalence between the summary and clause. A nearby clause from the same contract can be wrong if it covers a different right or obligation.
Representative Failure Modes
Common failures include retrieving the right contract but wrong clause, overmatching service names, missing informal paraphrases, and confusing similar policy areas such as data collection, content license, and account termination. BM25 can miss lay paraphrases; dense retrieval can blur neighboring legal provisions.
Training Data That May Help
Useful training data includes legal simplification, clause-summary pairs, contract passage retrieval, terms-of-service QA, and hard negatives from nearby clauses in the same document. Evaluation summaries, clauses, and qrels should be excluded.
Model Improvement Notes
Models should represent obligations, permissions, rights, and restrictions rather than only legal keywords. Hard negatives should come from the same contract and share names or legal terms while expressing a different policy. Among the reported profiles, hybrid retrieval is the strongest for this split.
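Same-contract hard negatives could be mined along these lines. The sketch assumes training clauses carry a contract identifier; the `contract_id` and `text` fields, the function name, and the toy clauses are all hypothetical, and token overlap is used only as a cheap proxy for "shares names or legal terms".

```python
def same_contract_hard_negatives(query, positive_ids, clauses, max_negatives=2):
    """Pick non-positive clauses from the positives' contracts, most query overlap first."""
    query_tokens = set(query.lower().split())
    contracts = {clauses[clause_id]["contract_id"] for clause_id in positive_ids}
    candidates = []
    for clause_id, clause in clauses.items():
        if clause_id in positive_ids or clause["contract_id"] not in contracts:
            continue
        overlap = len(query_tokens & set(clause["text"].lower().split()))
        # Negative overlap so that an ascending sort puts high-overlap clauses first.
        candidates.append((-overlap, clause_id))
    return [clause_id for _, clause_id in sorted(candidates)[:max_negatives]]


# Toy training clauses (assumed); evaluation data should not be used here.
clauses = {
    "c1": {"contract_id": "game_tos", "text": "you may modify the game but not distribute hacked clients"},
    "c2": {"contract_id": "game_tos", "text": "we may terminate your account if you distribute hacked versions of the game"},
    "c3": {"contract_id": "game_tos", "text": "these terms are governed by local law"},
    "c4": {"contract_id": "cloud_tos", "text": "you may access files stored in the service"},
}
negatives = same_contract_hard_negatives(
    "you may mod the game but don t distribute hacked clients", {"c1"}, clauses
)
```

Clause `c2` comes first because it shares the most vocabulary with the query while stating a different policy (account termination rather than modding permission), which is the kind of confusion the failure modes above describe.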
Example Data
| Query | Positive document |
| this service may collect use and share location data. [53 chars] | apple and our partners and licensees may collect use and share precise location data including the real time geographic location of your apple computer or device. where available location based services may use gps bluetooth and your ip address along with crowd sourced wi fi hotspot and cell tower locations and other technologies to determine your devices approximate location. unless you provide consent this location data is collected anonymously in a form that does not personally identify you and is used by apple and our partners and licensees to provide and improve location based products and services. for example your device may share its geographic location with application providers when you opt in to their location services. [740 chars] |
| you may mod the game but don t distribute hacked clients. [57 chars] | if you ve bought the game you may play around with it and modify it. we d appreciate it if you didn t use this for griefing though and remember not to distribute the changed versions of our software. basically mods or plugins or tools are cool you can distribute those hacked versions of the game client or server are not you can t distribute those. [349 chars] |
| if you haven t played for a year you mess up or we mess up we can delete all of your virtual goods. we don t have to give them back. we might even discontinue some virtual goods entirely but we ll give you 60 days advance notice if that happens. [245 chars] | we may cancel suspend or terminate your account and your access to your trading items virtual money virtual goods the content or the services in our sole discretion and without prior notice including if a your account is inactive i e not used or logged into for one year b you fail to comply with these terms c we suspect fraud or misuse by you of trading items virtual money virtual goods or other content d we suspect any other unlawful activity associated with your account or e we are acting to protect the services our systems the app any of our users or the reputation of niantic tpc or tpci. we have no obligation or responsibility to and will not reimburse or refund you for any trading items virtual money or virtual goods lost due to such cancellation suspension or termination. you acknowledge that niantic is not required to provide a refund for any reason and that you will not receive money or other compensation for unused virtual money and virtual goods when your account is closed wh... [1,000 / 1,441 chars] |
Source Reference Table
| Title | Year | Type | URL |
| Plain English Summarization of Contracts | 2019 | task paper | https://aclanthology.org/W19-2201/ |
| mteb/legal_summarization | | dataset card | https://huggingface.co/datasets/mteb/legal_summarization |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark article | https://huggingface.co/blog/rteb |
Dataset Information
| Field | Value |
| Nano set | NanoRTEB |
| Backing dataset | NanoRTEB |
| Task / split | NanoLegalSummarization |
| Hugging Face dataset | hakari-bench/NanoRTEB |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 438 |
| Positive qrels | 345 |
| Positives / query avg | 1.73 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 11 |
| Multi-positive queries | 56 (28.00%) |
| Query length avg chars | 103.06 |
| Document length avg chars | 606.16 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.5678 | 0.7800 | 0.8667 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5861 | 0.7850 | 0.9159 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.6085 | 0.8100 | 0.9246 | top-100 |