HAKARI-Bench

NanoLaw / NanoAILAStatutes

Overview

NanoLaw / NanoAILAStatutes is an English legal statute-retrieval task based on the AILA 2019 Artificial Intelligence for Legal Assistance track. Each query is a long Indian legal factual scenario, and the corpus contains statutory provisions with titles and descriptive legal text. The task is multi-label: a single scenario can require several applicable statutes. The Nano split has 50 queries, 82 documents, and 217 positive qrel rows. Every query has multiple positive statutes. Current diagnostics show that all candidate profiles have full recall@100 because the corpus is small, while dense retrieval provides the best top-10 ranking, reranking_hybrid is second, and BM25 is the weakest top-10 profile.

Details

What the Original Data Measures

The FIRE 2019 AILA overview describes a statute retrieval task in which systems receive 50 factual legal scenarios and identify the most relevant statutes. The paper reports that the statute pool was built from frequently cited Indian legal sections, reduced after removing repealed entries, and represented with titles and descriptions. The MTEB AILA_statutes card follows the same task framing: given a situation, retrieve applicable statutory provisions.

This task measures fact-to-statute mapping. It differs from precedent retrieval because the target documents are not case judgments but statutory provisions. The model must identify governing legal rules from long factual narratives, often where the scenario implies the statute rather than naming it directly.

Observed Data Profile

The Nano split contains 50 queries, 82 statute documents, and 217 positive qrel rows. Every query is multi-positive. Positives per query average 4.34, with a minimum of 2, a median of 4.5, and a maximum of 5. Queries average 3,038.42 characters, while statute documents average 1,972.63 characters.
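The per-query positive counts above can be recomputed from the qrel rows. The sketch below assumes the qrels are available as (query_id, doc_id) pairs with one row per positive; that layout, the function name, and the toy rows are illustrative assumptions, not the dataset's actual schema.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query(qrels):
    """Summarize how many positive documents each query has.

    `qrels` is assumed to be a list of (query_id, doc_id) pairs,
    one pair per positive judgment.
    """
    counts = Counter(query_id for query_id, _doc_id in qrels)
    values = list(counts.values())
    return {
        "queries": len(values),
        "positive_rows": len(qrels),
        "avg": mean(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
        "multi_positive": sum(1 for v in values if v > 1),
    }


# Toy rows for illustration only; these are not taken from the dataset.
toy_qrels = [("q1", "s1"), ("q1", "s2"), ("q2", "s1"), ("q2", "s3"), ("q2", "s4")]
stats = positives_per_query(toy_qrels)
```

Run on the real qrels, the same summary should reproduce the reported figures (50 queries, 217 rows, average 4.34) if the row layout matches the assumption.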

The same long legal scenarios used in the AILA precedent task appear here, but the relevance target changes. Instead of finding analogous cases, the model must identify provisions such as attempt to murder, dowry death, criminal conspiracy, compulsory registration of documents, or other applicable statutory sections. The documents are shorter than judgments, but their wording is formal and provision-specific.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset covers all 82 statutes for each query and achieves nDCG@10 = 0.2070, hit@10 = 0.6600, and recall@100 = 1.0000. BM25 has perfect top-100 coverage because the corpus is smaller than the candidate limit, but its top-10 ordering is weak. Long factual narratives often describe conduct, procedure, or legal consequences without repeating the exact title or wording of the relevant statute.

Sparse matching helps when the scenario explicitly names an offence, procedure, or legal term that also appears in the statute. It struggles when applicability depends on legal classification: the model must infer that facts about injury, dowry, conspiracy, registration, public duty, or evidence correspond to a specific statutory provision. This makes BM25 a coverage baseline rather than a strong final ranker.

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset also covers all 82 statutes per query. It achieves nDCG@10 = 0.2711, hit@10 = 0.7600, and recall@100 = 1.0000, making it the strongest observed profile by top-10 ranking. Dense retrieval is better suited to mapping long fact patterns to short legal provisions because it can use semantic similarity between the scenario and statutory concepts.

The absolute nDCG@10 remains modest, which reflects the task's difficulty. A scenario can activate multiple provisions, and many statutes are conceptually near one another. The model must rank a set of applicable sections above neighboring but non-governing provisions. Dense retrieval helps, but legal entailment and issue spotting are still harder than generic semantic matching.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains all 82 documents per query, with no safeguard rows. It achieves nDCG@10 = 0.2564, hit@10 = 0.7400, and recall@100 = 1.0000. Hybrid retrieval improves over BM25 and approaches dense retrieval, but it does not surpass dense in top-10 rank quality.

This pattern suggests that lexical statute names and semantic applicability both matter, but dense evidence is more helpful for final ranking in this particular split. Since every method sees the full corpus in the candidate pool, the main question is ordering rather than recall. A reranker trained for legal issue spotting could plausibly improve substantially over all three candidate profiles.

Metric Interpretation for Model Researchers

This is a multi-positive statute retrieval task. Hit@10 measures whether at least one applicable statute appears in the first ten results. nDCG@10 rewards ranking several applicable provisions high, which is important because every query has between two and five positives. Recall@100 is saturated for all three profiles because the entire 82-document corpus fits within a 100-result cutoff, so any ranking that covers the corpus retrieves every positive.
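The three metrics can be sketched for a single query as follows. This is a minimal illustration that assumes binary relevance (a statute is either positive or not) and the common log2 discount for nDCG; the exact nDCG variant used to produce the reported numbers is not stated here, and the function names and toy ranking are hypothetical.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive document appears in the top k, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of the positive documents found in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: each positive at rank r contributes 1 / log2(r + 1)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc in enumerate(ranked[:k], start=1)
        if doc in positives
    )
    # The ideal ranking places all positives (up to k of them) first.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking for illustration only: positives sit at ranks 2 and 4.
toy_ranked = ["s3", "s1", "s9", "s2", "s7"]
toy_positives = {"s1", "s2"}

toy_hit = hit_at_k(toy_ranked, toy_positives)
toy_recall = recall_at_k(toy_ranked, toy_positives)
toy_ndcg = ndcg_at_k(toy_ranked, toy_positives)
```

In the toy case both hit@10 and recall@100 are 1.0 while nDCG@10 is only about 0.65, which mirrors the pattern in this task: coverage is complete, and the remaining signal is in the ordering.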

The metric pattern should therefore be read as a ranking-quality comparison, not a candidate-coverage comparison. Dense retrieval ranks applicable statutes best, hybrid follows, and BM25 has the weakest ordering despite full coverage.

Query and Relevance Type Tendencies

Queries are long legal fact patterns with procedural history, allegations, charges, and legal questions. Relevant documents are statutory provisions with titles and descriptive clause text. A positive statute is one that governs the scenario or is legally applicable to one of its issues.

The task rewards models that perform legal issue spotting: mapping facts to offence categories, procedural requirements, evidence rules, registration rules, or other statutory concepts. It is not enough to match shared words; the system must identify which provisions the facts legally trigger.

Representative Failure Modes

BM25 can rank an applicable statute low when the scenario implies it without naming it, or when multiple statutes share common legal vocabulary. Dense retrieval can confuse neighboring legal provisions that are semantically similar but apply to different factual elements. Hybrid retrieval can still rank a related statute above the governing one if exact terms and semantic concepts point in different directions.

Multi-positive relevance adds another issue: retrieving one applicable statute does not mean the statutory set is complete. Systems that only optimize for one best provision may underperform on nDCG when several provisions should be ranked high.

Training Data That May Help

Helpful training data includes fact-to-statute retrieval pairs, statutory legal entailment examples, Indian penal and procedure statute hard negatives, and multi-label legal issue spotting data. Training should preserve the long scenario format and the multi-positive nature of the task.

For comparable evaluation, training should exclude NanoAILAStatutes scenarios, qrels, and positive statute mappings. Synthetic data can help when it generates case-like factual narratives that require several applicable statutes and includes hard negatives from neighboring provisions.
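One simple way to enforce the exclusion is to drop training examples whose query text matches an evaluation scenario. The sketch below assumes training examples are dictionaries with a "query" field and uses exact matching after lowercasing and whitespace normalization; both the record layout and the matching rule are assumptions, and near-duplicate or paraphrased scenarios would need a stronger check.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return " ".join(text.lower().split())


def drop_overlapping_examples(train_examples, eval_queries):
    """Remove training examples whose query text equals an evaluation query.

    `train_examples` is assumed to be a list of dicts with a "query" key.
    Only exact matches after normalization are removed.
    """
    blocked = {normalize(query) for query in eval_queries}
    return [ex for ex in train_examples if normalize(ex["query"]) not in blocked]


# Toy records for illustration only.
toy_train = [
    {"query": "The  appellant was convicted by the Sessions Judge.", "statute": "s1"},
    {"query": "A dispute arose over an unregistered sale deed.", "statute": "s2"},
]
toy_eval = ["the appellant was convicted by the sessions judge."]
toy_clean = drop_overlapping_examples(toy_train, toy_eval)
```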

Model Improvement Notes

Dense retrievers should learn the relationship between factual elements and statutory requirements, especially where the statute title is not stated in the scenario. Sparse systems can improve with legal-aware term weighting for offence names, procedure words, and statutory section language. Rerankers should act more like legal issue spotters, checking whether each element of a statute is supported by the scenario.

For hybrid systems, this task suggests that sparse evidence is useful but should not dominate. Since coverage is already complete, the main improvement target is ranking the governing provisions ahead of merely related provisions.
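The idea that sparse evidence should contribute without dominating can be illustrated with a weighted score fusion. This is a generic sketch, not the method behind the reranking_hybrid profile, whose construction is not described here; the min-max normalization, the 0.7 dense weight, and the toy scores are all assumptions chosen for illustration.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(sparse_scores, dense_scores, dense_weight=0.7):
    """Rank documents by a weighted sum of normalized sparse and dense scores.

    A dense_weight above 0.5 lets dense evidence lead while sparse
    evidence still breaks near-ties. Missing scores count as 0.0.
    """
    sparse_n = min_max(sparse_scores)
    dense_n = min_max(dense_scores)
    docs = set(sparse_n) | set(dense_n)
    fused = {
        doc: (1.0 - dense_weight) * sparse_n.get(doc, 0.0)
        + dense_weight * dense_n.get(doc, 0.0)
        for doc in docs
    }
    # Sort by descending fused score, then by doc id for a stable order.
    return sorted(fused, key=lambda doc: (-fused[doc], doc))


# Toy scores for illustration only.
toy_sparse = {"s1": 12.0, "s2": 3.0, "s3": 7.5}
toy_dense = {"s1": 0.41, "s2": 0.62, "s3": 0.20}
dense_led = fuse(toy_sparse, toy_dense)
sparse_led = fuse(toy_sparse, toy_dense, dense_weight=0.3)
```

With the toy scores, the dense-led weighting ranks s2 first, while a sparse-led weighting ranks s1 first, so the weight directly controls which signal decides the top of the list.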

Example Data

Query | Positive document
Appellant calls in question legality of the judgment rendered by High Court confirming his conviction for offence and sentence of imprisonment for life as awarded by the learned Sessions Judge. Background facts as unfolded during trial by the prosecution are essentially as follows. One P1 (hereinafter referred to as the 'deceased') was having industry and he employed a number of girls. The accused used to make fun of the girls/workers outside the factory and this was objected to by the deceased... [500 / 3,569 chars] | Title: Attempt to murder Desc: Whoever does any act with such intention or knowledge, and under such circumstances that, if he by that act caused death, he would be guilty or murder, shall be punished with imprisonment of either description for a term which may extend to ten years, and shall also be liable to fine; and if hurt is caused to any person by such act, the offender shall be liable either to 1 [imprisonment for life], or to such punishment as is hereinbefore mentioned. Attempts by life convicts2 [When any person offending under this section is under sentence of 3 [imprisonment for life], he may, if hurt is caused, be punished with death.] Illustrations (a) A shoots at Z with intention to kill him, under such circumstances that, if death ensued. A would be guilty of murder. A is liable to punishment under this section. (b) A, with the intention of causing the death of a child of tender years, exposes it in a desert place. A has committed the offence defined by this section, th... [1,000 / 1,973 chars]
This appeal, by special leave, has been preferred against the judgment and order dated 23 February 2005 of the High Court (Aurangabad Bench), by which the appeal preferred by the appellants was dismissed and their conviction and sentence of 7 years RI imposed thereunder was affirmed. The deceased P1 was daughter of PW1. P2 resident of village Sanjkheda and she was married to appellant no. 1 P3 son of P4 about two and half years prior to the date of incident which took place on 15 September 1991.... [500 / 3,266 chars] | Title: Dowry death Desc: (1) Where the death of a woman is caused by any burns or bodily injury or occurs otherwise than under normal circumstances within seven years of her marriage and it is shown that soon before her death she was subjected to cruelty or harassment by her husband or any relative of her husband for, or in connection with, any demand for dowry, such death shall be called "dowry death", and such husband or relative shall be deemed to have caused her death. Explanation.-For the purpose of this sub-section, "dowry" shall have the same meaning as in section 2 of the Dowry Prohibition Act, 1961 (28 of 1961). (2) Whoever commits dowry death shall be punished with imprisonment for a term which shall not be less than seven years but which may extend to imprisonment for life.] Inserted by Act 43 of 1986, section 10 (w.e.f. 19-11-1986). [856 chars]
The appellant before us was examined as prime witness in the trial of T.R. on the file of the Special Judge against the first respondent. The trial ended in conviction against the first respondent and when the appeal filed by him came to be heard by the High Court the appellant had become a Cabinet Minister. On account of the disparaging remarks made by the Appellate Judge the appellant tendered his resignation and demitted office for maintaining democratic traditions. It is in that backgroud th... [500 / 2,857 chars] | Title: Certain laws not to be affected by this Act Desc: Nothing in this Act shall affect the provisions of any Act for punishing mutiny and desertion of officers, soldiers, sailors or airmen in the service of the Government of India or the provisions of any special or local law.] Substituted by the A.O. 1950, for the original section. [337 chars]

Source Reference Table

Title | Year | Type | URL
Overview of the FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance | 2019 | CEUR paper | https://ceur-ws.org/Vol-2517/T1-1.pdf
AILA 2019 Precedent & Statute Retrieval Task | 2020 | Zenodo dataset | https://doi.org/10.5281/zenodo.4063986

Dataset Information

Field | Value
Nano set | NanoLaw
Backing dataset | NanoLaw
Task / split | NanoAILAStatutes
Hugging Face dataset | hakari-bench/NanoLaw
Language | en
Category | natural_language
Queries | 50
Documents | 82
Positive qrels | 217
Positives / query avg | 4.34
Positives / query min | 2
Positives / query median | 4.50
Positives / query max | 5
Multi-positive queries | 50 (100.00%)
Query length avg chars | 3,038.42
Document length avg chars | 1,972.63

Candidate Subsets

Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates
BM25 | bm25 | 0.2070 | 0.6600 | 1.0000 | top-500
Dense | harrier_oss_v1_270m | 0.2711 | 0.7600 | 1.0000 | top-500
Reranking hybrid | reranking_hybrid | 0.2564 | 0.7400 | 1.0000 | top-100

Training and Leakage Metadata