HAKARI-Bench

NanoLaw

Overview

NanoLaw is a compact legal retrieval group spanning English, German, and Chinese legal data. It includes Indian precedent and statute retrieval, German legal passage and QA retrieval, Chinese criminal-case retrieval, LegalBench-derived consumer-contract and corporate-lobbying retrieval, and plain-English contract-summary retrieval.

The group is useful because legal retrieval is not a single search pattern. Some tasks map long fact scenarios to cases or statutes. Others match contract questions to clauses, bill descriptions to summaries, German questions to judgments, or Chinese criminal cases to related cases. A model can retrieve topically close documents and still be wrong if it misses jurisdiction, legal role, statutory function, contract obligation, procedural posture, or case analogy. The BM25, dense, and reranking_hybrid profiles each expose different legal matching behaviors.

What This Group Measures

NanoLaw draws from several legal NLP resources rather than a single benchmark. AILA-style tasks measure Indian legal assistance retrieval. GerDaLIR and LegalQuAD measure German legal information access. LeCaRDv2 measures Chinese legal case retrieval. LegalBench contributes consumer-contract and corporate lobbying tasks, and LegalSummarization turns contract simplification into summary-to-clause retrieval.

The shared measurement target is legal relevance. The positive document must support the requested legal relation, not merely share broad topic terms. That relation can be precedent analogy, statutory applicability, contractual right, legislative policy match, or related-case reasoning.

Task Families

Dataset Shape

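NanoLaw contains 8 task pages, 1,259 queries, 15,142 split-local documents, and 5,488 positive qrel rows, an average of 4.36 positives per query. Relevance density varies sharply. AILA, LeCaRDv2, and LegalSummarization are multi-positive, NanoGerDaLIRSmall is only lightly so (235 positives for 200 queries), and the LegalBench and LegalQuAD tasks are single-positive. NanoLeCaRDv2 dominates the qrel count: its 3,896 positives across 159 queries (about 24.5 per query) account for roughly 71% of all qrel rows.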
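The density figures above can be recomputed directly from the per-task counts. The following is a minimal sketch that hard-codes the query, document, and positive counts from the Task Summary table; the names `TASKS` and `positives_per_query` are illustrative and not part of any NanoLaw tooling.

```python
# (queries, docs, positives) copied from the Task Summary table.
TASKS = {
    "NanoAILACasedocs": (50, 186, 195),
    "NanoAILAStatutes": (50, 82, 217),
    "NanoGerDaLIRSmall": (200, 9969, 235),
    "NanoLeCaRDv2": (159, 3795, 3896),
    "NanoLegalBenchConsumerContractsQA": (200, 153, 200),
    "NanoLegalBenchCorporateLobbying": (200, 319, 200),
    "NanoLegalQuAD": (200, 200, 200),
    "NanoLegalSummarization": (200, 438, 345),
}


def positives_per_query(tasks):
    """Return the average number of positive qrels per query for each task."""
    return {name: pos / queries for name, (queries, _docs, pos) in tasks.items()}


density = positives_per_query(TASKS)
total_queries = sum(q for q, _, _ in TASKS.values())
total_docs = sum(d for _, d, _ in TASKS.values())
total_positives = sum(p for _, _, p in TASKS.values())

# Print tasks from densest to sparsest, then the group-level average.
for name, value in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{name:36s} {value:6.2f}")
print(f"group average: {total_positives / total_queries:.2f}")
```

Sorting by density makes it easy to see which tasks need recall-oriented metrics and which can be judged by whether the single positive appears near the top.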

The text profile is broad. AILA and LeCaRDv2 queries are long legal narratives. German legal documents can average around 19,000 characters. Contract and legislative tasks are shorter but require precise clause or bill matching. This makes NanoLaw both a legal reasoning benchmark and a long-document retrieval benchmark.

Retrieval Behavior

BM25 Profile

BM25 is strongest when legal formulas, bill phrases, German legal terms, or contract keywords repeat directly between query and document. It leads on NanoGerDaLIRSmall (0.5911 nDCG@10) and NanoLegalQuAD (0.7420), and is very strong on corporate lobbying (0.8955). This reflects the importance of exact legal vocabulary, citations, and statutory phrasing.

BM25 is weaker on AILA scenario-to-law tasks because long fact patterns imply statutory or precedent relevance without necessarily repeating the authority's language. It can also over-rank contract or case documents that share topic words but miss the decisive legal relation.

Dense Profile

Dense retrieval helps with legal paraphrase and fact-to-authority mapping. It improves on BM25 for both AILA tasks, LeCaRDv2, consumer-contract QA, corporate lobbying, and LegalSummarization, and it is the best profile on both AILA tasks and corporate lobbying. Dense retrieval is especially useful when the query is in plain English or factual narrative form and the target is written in legal or contractual language.

Dense retrieval is not always enough. On the German tasks it trails BM25 by a wide margin (0.2405 against 0.5911 on NanoGerDaLIRSmall, 0.5819 against 0.7420 on NanoLegalQuAD), which shows that exact legal terminology can outperform broad semantic matching. Legal retrieval often requires preserving precise words, names, sections, and citations.

Reranking Hybrid Profile

reranking_hybrid is best on NanoLeCaRDv2, NanoLegalBenchConsumerContractsQA, and NanoLegalSummarization. These tasks benefit from combining exact legal terms with semantic or analogical matching. Hybrid is also useful where a reranker needs candidate diversity from both sparse and dense retrieval.
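This page does not specify how reranking_hybrid combines sparse and dense candidates. As an illustration of how candidate lists from both retrievers can be merged before reranking, here is a minimal sketch of reciprocal rank fusion (RRF), a common fusion technique. It is an assumption chosen for illustration, not the benchmark's documented method; the function name and the constant `k=60` are likewise illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranked list.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Ties are broken by document id so the
    output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy candidate lists: a document found by both retrievers rises to the top,
# while documents found by only one retriever are still kept as candidates.
bm25_candidates = ["a", "b", "c"]
dense_candidates = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_candidates, dense_candidates])
print(fused)
```

The fused list keeps documents that only one retriever surfaced, which is the candidate diversity a downstream reranker depends on.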

For reranking, multi-positive legal tasks should be read with Recall@100 in mind, because a reranker can only reorder the candidates that first-stage retrieval returns. A system that retrieves one plausible case or clause may still miss other valid authorities.
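The point about missed authorities can be made concrete with a small Recall@k sketch. The function name, the toy document ids, and the choice to return 0.0 for a query with no positives are assumptions for illustration only.

```python
def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    retrieved = set(ranked_ids[:k])
    return len(retrieved & relevant) / len(relevant)


# A query with four valid related cases: the system ranks two of them
# highly but never retrieves the other two.
ranked = ["c1", "c7", "c3", "c9"]
relevant = {"c1", "c3", "c5", "c8"}
print(recall_at_k(ranked, relevant, k=3))
```

In this toy case the top-ranked result is relevant, so a precision-style metric would look healthy even though half of the valid authorities are unreachable for any reranker.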

Task Summary

| Task | Retrieval focus | Lang | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoAILACasedocs | legal fact pattern to precedent case | en | 50 | 186 | 195 | 0.2805 | 0.4003 | 0.3667 | Dense |
| NanoAILAStatutes | legal fact pattern to statute | en | 50 | 82 | 217 | 0.2070 | 0.2711 | 0.2564 | Dense |
| NanoGerDaLIRSmall | German legal passage to judgment | de | 200 | 9,969 | 235 | 0.5911 | 0.2405 | 0.4287 | BM25 |
| NanoLeCaRDv2 | Chinese criminal case to related cases | zh | 159 | 3,795 | 3,896 | 0.6528 | 0.6940 | 0.7225 | Reranking hybrid |
| NanoLegalBenchConsumerContractsQA | contract question to clause | en | 200 | 153 | 200 | 0.7556 | 0.7785 | 0.8054 | Reranking hybrid |
| NanoLegalBenchCorporateLobbying | policy description to bill summary | en | 200 | 319 | 200 | 0.8955 | 0.9108 | 0.9068 | Dense |
| NanoLegalQuAD | German legal question to judgment | de | 200 | 200 | 200 | 0.7420 | 0.5819 | 0.7043 | BM25 |
| NanoLegalSummarization | plain-English summary to contract snippet | en | 200 | 438 | 345 | 0.5678 | 0.5861 | 0.6085 | Reranking hybrid |
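The score columns above report nDCG@10. For readers who want to reproduce the metric on their own runs, here is a minimal sketch that assumes binary relevance (a document is either a positive or not). This page does not say whether graded relevance is used, so the binary gain and the function name are assumptions.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query.

    DCG sums 1 / log2(position + 1) over relevant documents in the top k
    (positions start at 1). The ideal DCG places all positives first.
    """
    relevant = set(relevant_ids)
    dcg = sum(
        1.0 / math.log2(index + 2)
        for index, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(index + 2) for index in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Single-positive query where the only relevant clause is ranked second.
print(ndcg_at_k(["x", "a", "y"], {"a"}))
```

Because the ideal ranking is capped at k positives, densely labeled tasks such as NanoLeCaRDv2 can reach a high nDCG@10 while still leaving many related cases outside the top 10.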

Interpretation Notes for Model Researchers

NanoLaw should be interpreted by jurisdiction and legal relation. English contract retrieval, Indian scenario-to-law retrieval, German legal judgment retrieval, and Chinese related-case retrieval have different relevance rules. One overall score can hide whether a model is learning legal vocabulary, jurisdiction-specific structure, or broader semantic analogy.
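One way to avoid relying on a single overall score is to average nDCG@10 per language before comparing profiles. The sketch below hard-codes the values from the Task Summary table; the names `SCORES`, `PROFILES`, and `macro_average_by_language` are illustrative, and an unweighted mean over tasks is only one possible aggregation.

```python
from collections import defaultdict

PROFILES = ("bm25", "dense", "reranking_hybrid")

# (lang, BM25, dense, reranking hybrid) nDCG@10 from the Task Summary table.
SCORES = {
    "NanoAILACasedocs": ("en", 0.2805, 0.4003, 0.3667),
    "NanoAILAStatutes": ("en", 0.2070, 0.2711, 0.2564),
    "NanoGerDaLIRSmall": ("de", 0.5911, 0.2405, 0.4287),
    "NanoLeCaRDv2": ("zh", 0.6528, 0.6940, 0.7225),
    "NanoLegalBenchConsumerContractsQA": ("en", 0.7556, 0.7785, 0.8054),
    "NanoLegalBenchCorporateLobbying": ("en", 0.8955, 0.9108, 0.9068),
    "NanoLegalQuAD": ("de", 0.7420, 0.5819, 0.7043),
    "NanoLegalSummarization": ("en", 0.5678, 0.5861, 0.6085),
}


def macro_average_by_language(scores):
    """Average each profile's nDCG@10 over the tasks of each language."""
    rows_by_lang = defaultdict(list)
    for lang, *values in scores.values():
        rows_by_lang[lang].append(values)
    return {
        lang: {
            profile: sum(row[i] for row in rows) / len(rows)
            for i, profile in enumerate(PROFILES)
        }
        for lang, rows in rows_by_lang.items()
    }


by_lang = macro_average_by_language(SCORES)
for lang, averages in sorted(by_lang.items()):
    best = max(averages, key=averages.get)
    summary = ", ".join(f"{p}={v:.4f}" for p, v in averages.items())
    print(f"{lang}: {summary} (best: {best})")
```

Even this breakdown is coarse: the English bucket mixes Indian scenario-to-law tasks with contract and legislative tasks, so per-task rows remain the primary reading.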

The BM25/dense split is instructive. BM25-led German tasks show the value of exact legal terminology. Dense-led AILA tasks show fact-to-authority semantic matching. Hybrid-led Chinese and contract tasks show that both exact legal anchors and semantic relevance are needed for candidate generation.

Training and Leakage Notes

Useful training data includes jurisdiction-specific case retrieval, fact-to-statute retrieval, citation prediction, German legal QA, Chinese related-case retrieval, consumer-contract QA, contract clause retrieval, and legislative search. Hard negatives should share statutes, charges, agencies, contract topics, or legal vocabulary while failing the decisive legal relation.

Exclude NanoLaw evaluation queries, positives, qrels, legal cases, statutes, contract clauses, bill summaries, and direct synthetic variants. Legal datasets often reuse public benchmark splits, so source and text-overlap audits are necessary before training.
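A text-overlap audit can start with something as simple as shared word n-grams. The sketch below is a hypothetical starting point, not a NanoLaw tool: the function names, the n-gram length, and the threshold are all assumptions to tune. Whitespace tokenization does not segment Chinese text, so the LeCaRDv2 data would need character n-grams or a proper tokenizer instead.

```python
def word_ngrams(text, n=8):
    """Return the set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(train_text, eval_text, n=8):
    """Share of the evaluation text's n-grams that also occur in the training text."""
    eval_grams = word_ngrams(eval_text, n)
    if not eval_grams:
        return 0.0
    return len(eval_grams & word_ngrams(train_text, n)) / len(eval_grams)


def flag_leaks(train_texts, eval_texts, n=8, threshold=0.5):
    """List (train_index, eval_index, ratio) for pairs at or above the threshold.

    This compares every pair, so it is quadratic and only suitable for
    small samples; a larger audit would need an index or hashing scheme.
    """
    flagged = []
    for eval_index, eval_text in enumerate(eval_texts):
        for train_index, train_text in enumerate(train_texts):
            ratio = overlap_ratio(train_text, eval_text, n)
            if ratio >= threshold:
                flagged.append((train_index, eval_index, ratio))
    return flagged


train = ["the tenant shall pay rent on the first day of each month without demand"]
evaluation = [
    "the tenant shall pay rent on the first day of each month",
    "the landlord may enter the premises after giving written notice",
]
print(flag_leaks(train, evaluation, n=4))
```

Measuring overlap relative to the evaluation text means a short evaluation clause embedded in a longer training document is still flagged.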

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| Overview of the FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance | 2019 | paper | https://ceur-ws.org/Vol-2517/T1-1.pdf |
| LeCaRDv2: A Large-Scale Chinese Legal Case Retrieval Dataset | 2023 | paper | https://arxiv.org/abs/2310.17609 |
| LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models | 2023 | paper | https://arxiv.org/abs/2308.11462 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 8 |
| Queries | 1,259 |
| Split-local documents | 15,142 |
| Positive qrels | 5,488 |
| Languages | de, en, zh |
| Categories | natural_language |
| Positives / query avg | 4.36 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoAILACasedocs | NanoLaw | en | natural_language | 50 | 186 | 195 | 0.2805 | 0.4003 | 0.3667 | Dense |
| NanoAILAStatutes | NanoLaw | en | natural_language | 50 | 82 | 217 | 0.2070 | 0.2711 | 0.2564 | Dense |
| NanoGerDaLIRSmall | NanoLaw | de | natural_language | 200 | 9,969 | 235 | 0.5911 | 0.2405 | 0.4287 | BM25 |
| NanoLeCaRDv2 | NanoLaw | zh | natural_language | 159 | 3,795 | 3,896 | 0.6528 | 0.6940 | 0.7225 | Reranking hybrid |
| NanoLegalBenchConsumerContractsQA | NanoLaw | en | natural_language | 200 | 153 | 200 | 0.7556 | 0.7785 | 0.8054 | Reranking hybrid |
| NanoLegalBenchCorporateLobbying | NanoLaw | en | natural_language | 200 | 319 | 200 | 0.8955 | 0.9108 | 0.9068 | Dense |
| NanoLegalQuAD | NanoLaw | de | natural_language | 200 | 200 | 200 | 0.7420 | 0.5819 | 0.7043 | BM25 |
| NanoLegalSummarization | NanoLaw | en | natural_language | 200 | 438 | 345 | 0.5678 | 0.5861 | 0.6085 | Reranking hybrid |