NanoMMTEB-v2 / ailastatutes

Overview

NanoMMTEB-v2 / ailastatutes is an English legal statute-retrieval task from AILA 2019. Each query is a long Indian legal fact pattern, and the retriever must return the statutory provisions that apply to the situation. The Nano split has 50 queries, 82 statute documents, and 217 positive qrel rows. Every query has multiple positives, averaging 4.34 applicable statutes. On nDCG@10, dense retrieval is the strongest profile (0.2725), reranking_hybrid is close behind (0.2557), and BM25 is the weakest (0.2070), because legal relevance is often implied by facts, procedure, and legal effect rather than by repeated statute wording.

Details

What the Original Data Measures

The FIRE 2019 AILA track introduced Artificial Intelligence for Legal Assistance tasks for Indian legal materials. The statute retrieval task gives systems factual legal scenarios and asks them to identify relevant statutory provisions from a pool of frequently cited provisions. The Zenodo release and the MTEB dataset packaging expose this as a retrieval benchmark for embedding evaluation.

This task measures legal issue spotting and fact-to-statute retrieval. A model must connect facts such as conviction, appeal, dowry death, conspiracy, registration, public servant status, or evidence rules to the statute text that governs the legal issue.

Observed Data Profile

The Nano split contains 50 long scenario queries, 82 statute documents, and 217 positive qrel rows. Every query has multiple positives: the average is 4.34 positives per query, with a minimum of 2, median of 4.5, and maximum of 5. Queries average 3,038.42 characters, while statute documents average 1,972.63 characters.
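The per-query statistics above can be recomputed from the qrel rows. The following is a minimal sketch, assuming qrels are available as (query_id, doc_id) pairs with one pair per positive row; the IDs and the function name `positives_per_query` are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median

def positives_per_query(qrels):
    """Summarize how many positive documents each query has.

    `qrels` is an iterable of (query_id, doc_id) pairs, one per positive row.
    """
    counts = Counter(query_id for query_id, _ in qrels)
    values = sorted(counts.values())
    return {
        "queries": len(values),
        "rows": sum(values),
        "avg": mean(values),
        "min": values[0],
        "median": median(values),
        "max": values[-1],
    }

# Toy qrels with hypothetical IDs, not rows from the dataset.
toy_qrels = [
    ("q1", "s1"), ("q1", "s2"),
    ("q2", "s1"), ("q2", "s3"), ("q2", "s4"), ("q2", "s5"),
    ("q3", "s2"), ("q3", "s3"), ("q3", "s4"), ("q3", "s5"), ("q3", "s6"),
]
summary = positives_per_query(toy_qrels)
print(summary)

# Consistency check on the reported figures: 217 rows over 50 queries.
reported_avg = 217 / 50
print(round(reported_avg, 2))
```

The last two lines only confirm that the reported average of 4.34 follows from 217 positive rows spread over 50 queries.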

The queries are legal narratives or case summaries and are, on average, about 1.5 times as long as the documents (3,038 versus 1,973 characters). The documents usually contain a statute title and provision text. Example provisions cover attempt to murder, dowry death, criminal conspiracy, compulsory registration of documents, and legal exceptions or procedural provisions.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset is nominally top-500, which covers all 82 documents for every query. It achieves nDCG@10 = 0.2070, hit@10 = 0.6600, and recall@100 = 1.0000. Because the corpus has only 82 documents, every candidate subset saturates recall@100, so that figure should not be read as evidence of strong ranking quality.
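The saturation of recall@100 can be illustrated with a small sketch. It assumes recall@k is defined as the fraction of a query's positives found in the top k results; the document IDs are hypothetical.

```python
def recall_at_k(ranked_doc_ids, positives, k):
    """Fraction of the positive documents found in the top-k of a ranking."""
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & set(positives)) / len(positives)

# Hypothetical corpus of 82 statute IDs and one query's positives.
corpus = [f"s{i}" for i in range(82)]
positives = {"s3", "s40", "s77", "s81"}

# Two extreme rankings of the full corpus.
best_case = sorted(corpus, key=lambda d: d not in positives)   # positives first
worst_case = sorted(corpus, key=lambda d: d in positives)      # positives last

for name, ranking in [("best", best_case), ("worst", worst_case)]:
    print(name, recall_at_k(ranking, positives, 10), recall_at_k(ranking, positives, 100))
```

Both rankings reach recall@100 = 1.0 because a cutoff of 100 exceeds the 82-document corpus, while recall@10 separates them completely.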

BM25 is the weakest top-rank profile. Long fact patterns contain many narrative terms, party names, dates, procedural history, and case-specific facts that do not necessarily appear in the statute text. Relevant provisions may be implied by legal effect rather than direct word overlap.

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains all 82 documents per query and achieves nDCG@10 = 0.2725, hit@10 = 0.7600, and recall@100 = 1.0000. Dense retrieval is the strongest observed top-rank profile for this task.

This pattern fits the benchmark: the model benefits from semantic similarity between legal facts and statute concepts. A scenario about death soon after marriage, a criminal appeal, public-servant sanction, or property registration may not repeat exact section language, but it is semantically tied to the applicable provision.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset also contains all 82 documents per query (its top-100 cutoff exceeds the corpus size) and achieves nDCG@10 = 0.2557, hit@10 = 0.7400, and recall@100 = 1.0000. It beats BM25 by 0.0487 nDCG@10 and trails dense retrieval by 0.0168.

Because every profile sees the entire statute pool, reranking_hybrid should be interpreted as a ranking diagnostic rather than a candidate-coverage test. The hybrid signal helps over pure lexical matching, but dense semantic matching appears to better capture the scenario-to-statute relationship in this small legal corpus.

Metric Interpretation for Model Researchers

This is a multi-positive task. Each fact pattern can require several statutory provisions, and nDCG@10 rewards placing as many of those positives as possible near the top of the result list. Hit@10 only indicates whether at least one relevant statute appears in the top 10, so it is less informative than nDCG@10 about complete legal coverage.
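The difference between the two metrics can be sketched on a toy ranking. The sketch assumes binary relevance and the common log2 rank discount; the exact evaluation code behind the reported numbers is not given here and may differ in detail. All document IDs are hypothetical.

```python
import math

def dcg_at_k(gains, k):
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_doc_ids, positives, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking over DCG of an ideal ranking."""
    gains = [1.0 if doc_id in positives else 0.0 for doc_id in ranked_doc_ids]
    ideal_gains = [1.0] * min(len(positives), k)
    ideal = dcg_at_k(ideal_gains, k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

def hit_at_k(ranked_doc_ids, positives, k=10):
    """1.0 if at least one positive appears in the top-k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked_doc_ids[:k]) else 0.0

positives = {"a", "b", "c", "d"}
fillers = [f"n{i}" for i in range(1, 10)]           # nine non-relevant IDs
full_coverage = ["a", "b", "c", "d"] + fillers[:6]  # all four positives on top
single_hit = ["a"] + fillers                        # only one positive retrieved

for name, ranking in [("full_coverage", full_coverage), ("single_hit", single_hit)]:
    print(name, hit_at_k(ranking, positives), round(ndcg_at_k(ranking, positives), 4))
```

Both rankings score hit@10 = 1.0, but only the first one earns full nDCG@10, which is why nDCG@10 is the more useful signal when several provisions apply to one scenario.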

Recall@100 is saturated because the corpus has only 82 documents. For this task, ranking quality within the full candidate set is the meaningful signal. Researchers should look for models that rank all applicable provisions above adjacent but legally inapplicable sections.

Query and Relevance Type Tendencies

Queries are long English legal fact patterns from Indian case contexts. They often describe appeals, convictions, evidentiary questions, marriage-related death, public office, conspiracy, registration, partnership dissolution, or procedural requirements. The wording is narrative and case-specific.

Relevant documents are statute provisions. They are shorter than the queries and use formal legal language. Applicability depends on matching legal conditions, roles, procedures, remedies, or elements of an offense rather than matching only surface terms.

Representative Failure Modes

BM25 can over-rank statutes sharing broad legal words such as offence, punishment, evidence, appeal, property, or registration while missing the specific legal element. Dense retrieval can retrieve a semantically related statute that fails on a condition such as timing, jurisdiction, public-servant status, intent, or remedy.

Hybrid retrieval can inherit both problems: lexical overlap may pull in adjacent sections, while semantic similarity may blur closely related legal doctrines. Rerankers should compare the fact pattern against clause-level statutory conditions.
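How the reranking_hybrid profile combines its signals is not described here. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique swapped in for the purpose of the example; the constant k=60, the function name, and the document IDs are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings: each document scores sum(1 / (k + rank))."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical rankings: the lexical list puts an adjacent section first,
# the dense list puts a semantically related but inapplicable section second.
bm25_ranking = ["s_adjacent", "s_target", "s_other"]
dense_ranking = ["s_target", "s_related", "s_adjacent"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

In this toy case the applicable provision ends up first, but the adjacent section favored by lexical overlap still sits directly behind it, which is the kind of near miss a clause-level reranker would need to resolve.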

Training Data That May Help

Useful training data includes fact-to-statute retrieval pairs, statutory legal entailment data, legal issue spotting examples, and adjacent statute hard negatives. Training should preserve the multi-positive structure because several provisions can apply to one scenario.
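One way to preserve the multi-positive structure is a contrastive loss that treats every applicable provision as a positive. The sketch below shows one common formulation, a multi-positive InfoNCE that sums the positive probability mass. It is an assumption for illustration, not a description of any specific training recipe, and the scores and temperature are made up.

```python
import math

def multi_positive_info_nce(scores, positive_indices, temperature=0.05):
    """Loss = -log(sum of softmax mass on positives) for one query.

    `scores` are query-document similarities for all candidates;
    `positive_indices` lists every candidate that is a positive.
    """
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    positive_mass = sum(exps[i] for i in positive_indices)
    return -math.log(positive_mass / sum(exps))

# Hypothetical similarities: candidates 0 and 1 are both applicable statutes.
scores = [0.8, 0.7, 0.2, 0.1]

loss_multi = multi_positive_info_nce(scores, positive_indices=[0, 1])
loss_single = multi_positive_info_nce(scores, positive_indices=[0])
print(loss_multi, loss_single)
```

Labeling only the first statute as positive makes the second applicable statute act as a negative, so the single-positive loss is higher and would push a legitimately relevant provision away from the query.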

Synthetic data can generate realistic legal fact patterns and match them to several statute-like provisions. Negatives should share legal vocabulary but fail on a material element such as jurisdiction, procedure, remedy, detention status, mental state, or evidentiary rule. Evaluation scenarios and positive statutes from this Nano split should be excluded from training.
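The exclusion rule can be sketched as a simple filter over training pairs. This is a minimal sketch that only catches exact matches after whitespace and case normalization; near-duplicates would need fuzzy matching. The function names and example texts are hypothetical.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_leaked_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Remove (query, document) pairs that reuse an evaluation query or positive."""
    blocked_queries = {fingerprint(q) for q in eval_queries}
    blocked_docs = {fingerprint(d) for d in eval_positive_docs}
    return [
        (q, d)
        for q, d in train_pairs
        if fingerprint(q) not in blocked_queries
        and fingerprint(d) not in blocked_docs
    ]

# Hypothetical texts standing in for evaluation scenarios and statutes.
eval_queries = ["A man was convicted of conspiracy."]
eval_positive_docs = ["Title: Dowry death"]
train_pairs = [
    ("A man was convicted of conspiracy.", "Title: Criminal conspiracy"),
    ("  a MAN was convicted   of conspiracy. ", "Title: Other"),
    ("New scenario.", "Title: Dowry death"),
    ("New scenario.", "Title: Fresh provision"),
]

clean_pairs = drop_leaked_pairs(train_pairs, eval_queries, eval_positive_docs)
print(clean_pairs)
```

Only the last pair survives: the first two reuse an evaluation scenario (the second after normalization) and the third reuses an evaluation positive statute.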

Model Improvement Notes

Dense retrievers should be trained for legal entailment and issue spotting, not only generic semantic similarity. Sparse systems may improve by emphasizing statute titles, legal terms of art, and clause-level concepts while reducing noise from long narrative facts. Rerankers should parse the scenario into legal issues and compare those issues against statute conditions.

For hybrid systems, NanoMMTEB-v2 / ailastatutes is a small-corpus ranking test. Candidate coverage is trivial, so improvements must come from ranking multiple applicable provisions above near-miss legal sections.

Example Data

Query	Positive document
Appellant calls in question legality of the judgment rendered by High Court confirming his conviction for offence and sentence of imprisonment for life as awarded by the learned Sessions Judge. Background facts as unfolded during trial by the prosecution are essentially as follows. One P1 (hereinafter referred to as the 'deceased') was having industry and he employed a number of girls. The accused used to make fun of the girls/workers outside the factory and this was objected to by the deceased... [500 / 3,569 chars]	Title: Attempt to murder Desc: Whoever does any act with such intention or knowledge, and under such circumstances that, if he by that act caused death, he would be guilty or murder, shall be punished with imprisonment of either description for a term which may extend to ten years, and shall also be liable to fine; and if hurt is caused to any person by such act, the offender shall be liable either to 1 [imprisonment for life], or to such punishment as is hereinbefore mentioned. Attempts by life convicts2 [When any person offending under this section is under sentence of 3 [imprisonment for life], he may, if hurt is caused, be punished with death.] Illustrations (a) A shoots at Z with intention to kill him, under such circumstances that, if death ensued. A would be guilty of murder. A is liable to punishment under this section. (b) A, with the intention of causing the death of a child of tender years, exposes it in a desert place. A has committed the offence defined by this section, th... [1,000 / 1,973 chars]
This appeal, by special leave, has been preferred against the judgment and order dated 23 February 2005 of the High Court (Aurangabad Bench), by which the appeal preferred by the appellants was dismissed and their conviction and sentence of 7 years RI imposed thereunder was affirmed. The deceased P1 was daughter of PW1. P2 resident of village Sanjkheda and she was married to appellant no. 1 P3 son of P4 about two and half years prior to the date of incident which took place on 15 September 1991.... [500 / 3,266 chars]	Title: Dowry death Desc: (1) Where the death of a woman is caused by any burns or bodily injury or occurs otherwise than under normal circumstances within seven years of her marriage and it is shown that soon before her death she was subjected to cruelty or harassment by her husband or any relative of her husband for, or in connection with, any demand for dowry, such death shall be called "dowry death", and such husband or relative shall be deemed to have caused her death. Explanation.-For the purpose of this sub-section, "dowry" shall have the same meaning as in section 2 of the Dowry Prohibition Act, 1961 (28 of 1961). (2) Whoever commits dowry death shall be punished with imprisonment for a term which shall not be less than seven years but which may extend to imprisonment for life.] Inserted by Act 43 of 1986, section 10 (w.e.f. 19-11-1986). [856 chars]
The appellant before us was examined as prime witness in the trial of T.R. on the file of the Special Judge against the first respondent. The trial ended in conviction against the first respondent and when the appeal filed by him came to be heard by the High Court the appellant had become a Cabinet Minister. On account of the disparaging remarks made by the Appellate Judge the appellant tendered his resignation and demitted office for maintaining democratic traditions. It is in that backgroud th... [500 / 2,857 chars]	Title: Certain laws not to be affected by this Act Desc: Nothing in this Act shall affect the provisions of any Act for punishing mutiny and desertion of officers, soldiers, sailors or airmen in the service of the Government of India or the provisions of any special or local law.] Substituted by the A.O. 1950, for the original section. [337 chars]

Source Reference Table

Title	Year	Type	URL
Overview of the FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance	2019	task paper	https://ceur-ws.org/Vol-2517/T1-1.pdf
AILA 2019 Precedent & Statute Retrieval Task	2020	dataset release	https://zenodo.org/records/4063986
mteb/AILA_statutes	2024	dataset card	https://huggingface.co/datasets/mteb/AILA_statutes

Dataset Information

Field	Value
Nano set	NanoMMTEB-v2
Backing dataset	NanoMMTEB-v2
Task / split	ailastatutes
Hugging Face dataset	hakari-bench/NanoMMTEB-v2
Language	en
Category	natural_language
Queries	50
Documents	82
Positive qrels	217
Positives / query avg	4.34
Positives / query min	2
Positives / query median	4.50
Positives / query max	5
Multi-positive queries	50 (100.00%)
Query length avg chars	3,038.42
Document length avg chars	1,972.63

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.2070	0.6600	1.0000	top-500
Dense	`harrier_oss_v1_270m`	0.2725	0.7600	1.0000	top-500
Reranking hybrid	`reranking_hybrid`	0.2557	0.7400	1.0000	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: test
Train/eval overlap audit: not_audited
Leakage note: do not train on this Nano split's scenarios, qrels, or positive statute text
Multi-positive training: multi_positive_objective
Useful training data: fact-to-statute retrieval pairs, statutory legal entailment data, legal issue spotting examples, adjacent statute hard negatives