NanoMuPLeR

Overview

NanoMuPLeR is a parallel multilingual legal retrieval benchmark based on MuPLeR-retrieval, with one same-language task per language. It derives from European Union legal text and covers 14 European languages: Greek, English, Spanish, Finnish, French, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Slovak, Slovenian, and Swedish. Each split pairs synthetic legal queries with DGT-Acquis-derived parallel legal passages in the same language.

The group contains 2,800 queries, 140,000 task-local documents, and 2,800 positive qrel rows. Every language has exactly 200 queries, 10,000 documents, and one positive per query. This parallel construction makes NanoMuPLeR useful for comparing whether a retrieval model preserves legal-search quality across languages, scripts, morphology, and translation variation.

What This Group Measures

The group measures focused legal passage retrieval in a controlled multilingual setup. Queries ask about legal conditions, treaty interpretation, state aid, procurement, import duties, nuclear policy, pre-accession rules, and EU institutional procedure. Documents are medium-length legal passages rather than full acts. The relevance relation is exact: the model must find the one passage that answers the legal query in the same language.

Because the underlying passages and questions are parallel across languages, the group is not simply a collection of unrelated monolingual legal tasks. It tests whether a model can maintain retrieval quality when legal terminology is expressed through different morphology, word order, scripts, and translation choices. This is especially useful for diagnosing English-centric models on EU legal text.

Task Families

Parallel EU legal retrieval: all fourteen tasks retrieve same-language legal passages derived from DGT-Acquis.
Single-positive passage ranking: every query has exactly one relevant passage, so nDCG@10 and hit@10 depend only on the rank at which that one passage is returned.
Multilingual legal terminology: the tasks preserve EU legal references, dates, directive numbers, percentages, institutions, and member-state names across languages.
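
As an illustration of the single-positive case, the sketch below computes nDCG@10 and hit@10 for one query. It assumes binary relevance with a gain of 1 and the usual log2 rank discount; the function name and the document ids are hypothetical, and the official scores may come from a different evaluation library.

```python
import math

def single_positive_metrics(ranked_doc_ids, positive_id, k=10):
    """Return (nDCG@k, hit@k) for a query with exactly one relevant document."""
    top_k = ranked_doc_ids[:k]
    if positive_id not in top_k:
        return 0.0, 0.0
    rank = top_k.index(positive_id) + 1  # 1-based rank of the positive
    # With a single positive of gain 1, the ideal DCG is 1 / log2(2) = 1,
    # so nDCG reduces to the discounted gain at the observed rank.
    return 1.0 / math.log2(rank + 1), 1.0

# Hypothetical ranking in which the positive "d9" is returned at rank 3.
ndcg, hit = single_positive_metrics(["d7", "d3", "d9", "d1"], "d9")
print(ndcg, hit)  # 0.5 1.0
```

Under these assumptions a positive at rank 1 scores 1.0, rank 3 scores 0.5, and anything below rank 10 scores 0 on both metrics.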

Dataset Shape

The group is highly regular. Each language split has 200 queries, 10,000 candidate passages, and 200 qrel rows. The group-level document count is the sum of language-local pools; the parallel construction means many passages have translated counterparts across languages, but evaluation is same-language within each split.
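
The regular shape can be checked mechanically before running an evaluation. The sketch below is a minimal sanity check under an assumed in-memory layout (queries and documents as id-to-text dictionaries, qrels as a list of positive pairs); the function name and layout are hypothetical and not part of any published loader.

```python
from collections import Counter

def check_split_shape(queries, docs, qrels, expected_queries=200, expected_docs=10_000):
    """Return a list of problems; an empty list means the split is regular.

    Hypothetical layout: queries and docs map id -> text, and qrels is a
    list of (query_id, doc_id) pairs holding only positive judgments.
    """
    problems = []
    if len(queries) != expected_queries:
        problems.append(f"expected {expected_queries} queries, found {len(queries)}")
    if len(docs) != expected_docs:
        problems.append(f"expected {expected_docs} documents, found {len(docs)}")
    positives = Counter(query_id for query_id, _ in qrels)
    for query_id in queries:
        if positives[query_id] != 1:
            problems.append(f"query {query_id} has {positives[query_id]} positives")
    for query_id, doc_id in qrels:
        if query_id not in queries or doc_id not in docs:
            problems.append(f"qrel ({query_id}, {doc_id}) points outside the split")
    return problems

# Tiny synthetic split with the same structure, scaled down for illustration.
queries = {"q1": "query text", "q2": "query text"}
docs = {"d1": "passage", "d2": "passage", "d3": "passage"}
qrels = [("q1", "d2"), ("q2", "d3")]
print(check_split_shape(queries, docs, qrels, expected_queries=2, expected_docs=3))  # []
```

With the default arguments the same function checks the 200-query, 10,000-document shape described above.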

The queries are medium length and the documents are compact legal passages. This is not long-document retrieval, but the text is dense. Many wrong documents share legal vocabulary, EU institutions, and regulatory phrasing. A model must distinguish the exact actor, condition, threshold, date, article, or procedure that makes one passage relevant.

Retrieval Behavior

BM25 Profile

BM25 is strong but is not the best profile for any language in the current Nano data. The query-weighted BM25 nDCG@10 is 0.7994, and seven languages score above 0.82: Dutch, Swedish, Polish, Latvian, Spanish, Finnish, and Portuguese. This reflects the lexical nature of EU legal retrieval. Directive numbers, state names, institution names, percentages, and legal terms often appear in both the query and the relevant passage.

BM25 is weakest on English and Slovak, with English at 0.6453 nDCG@10 and Slovak at 0.7041. The English result is a useful warning: even in a legal dataset with many lexical anchors, synthetic queries can paraphrase the passage enough that exact term frequency alone is insufficient. BM25 is therefore a strong legal baseline, but not the best overall retrieval strategy.

Dense Profile

Dense retrieval with harrier-oss-270m is best for English and competitive for most other languages. English rises from 0.6453 BM25 nDCG@10 to 0.8477 dense nDCG@10, showing that embedding similarity helps when the legal query and passage use different wording. Dense also exceeds 0.83 for Spanish (0.8803), Dutch (0.8580), Swedish (0.8576), Portuguese (0.8552), and French (0.8329).

Dense is not uniformly better than BM25. It trails BM25 in Finnish, Lithuanian, Latvian, Dutch, Polish, and Slovenian. This suggests that exact legal wording and morphology-aware lexical evidence remain valuable, especially in languages where the dense model's representation does not fully capture the legal condition. The group-level dense nDCG@10 is 0.8158.

Reranking Hybrid Profile

The reranking hybrid profile is the strongest group-level profile: 0.8554 nDCG@10, 0.9300 hit@10, and 0.9914 recall@100. It is best for thirteen of the fourteen languages; English is the only exception, where dense retrieval leads. Hybrid retrieval works well here because legal passage search needs both exact legal anchors and semantic matching of the condition expressed in the query.

This group is one of the clearest examples where hybrid search is the preferred default. The task is single-positive, so recall@100 close to 1.0 means the combined candidate set almost always contains the right passage. The remaining challenge is top-rank ordering among legally similar passages with overlapping EU vocabulary.
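
This page does not say how the hybrid profile combines BM25 and dense candidates. As one generic way to merge two ranked lists, the sketch below uses reciprocal rank fusion (RRF); this is a swapped-in illustration, not the method behind the reported scores, and the constant k=60 and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first.

    Each document receives 1 / (k + rank) from every list that contains it.
    k=60 is a commonly used default, chosen here as an assumption.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-3 lists from a lexical and a dense retriever.
bm25_top = ["a", "b", "c"]
dense_top = ["c", "a", "d"]
print(reciprocal_rank_fusion([bm25_top, dense_top]))  # ['a', 'c', 'b', 'd']
```

Documents found by both retrievers ("a" and "c") rise above documents found by only one, which is the behavior that helps when a passage shares both legal anchors and meaning with the query. A reranker could then reorder the fused candidates.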

Task Summary

Task	Family	Language	Queries	Docs	Positives	Positives/query	BM25 nDCG@10	Dense nDCG@10	Reranking hybrid nDCG@10	Best profile
el	EU legal retrieval	`el`	200	10,000	200	1.00	0.7749	0.7834	0.8390	Reranking hybrid
en	EU legal retrieval	`en`	200	10,000	200	1.00	0.6453	0.8477	0.7986	Dense
es	EU legal retrieval	`es`	200	10,000	200	1.00	0.8302	0.8803	0.8862	Reranking hybrid
fi	EU legal retrieval	`fi`	200	10,000	200	1.00	0.8230	0.7955	0.8682	Reranking hybrid
fr	EU legal retrieval	`fr`	200	10,000	200	1.00	0.8179	0.8329	0.8628	Reranking hybrid
it	EU legal retrieval	`it`	200	10,000	200	1.00	0.7920	0.8257	0.8422	Reranking hybrid
lt	EU legal retrieval	`lt`	200	10,000	200	1.00	0.8115	0.7495	0.8442	Reranking hybrid
lv	EU legal retrieval	`lv`	200	10,000	200	1.00	0.8376	0.7910	0.8672	Reranking hybrid
nl	EU legal retrieval	`nl`	200	10,000	200	1.00	0.8909	0.8580	0.9072	Reranking hybrid
pl	EU legal retrieval	`pl`	200	10,000	200	1.00	0.8400	0.8299	0.8909	Reranking hybrid
pt	EU legal retrieval	`pt`	200	10,000	200	1.00	0.8222	0.8552	0.8895	Reranking hybrid
sk	EU legal retrieval	`sk`	200	10,000	200	1.00	0.7041	0.7714	0.7872	Reranking hybrid
sl	EU legal retrieval	`sl`	200	10,000	200	1.00	0.7455	0.7428	0.7983	Reranking hybrid
sv	EU legal retrieval	`sv`	200	10,000	200	1.00	0.8563	0.8576	0.8946	Reranking hybrid
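
The group-level figures quoted in the retrieval profiles can be recomputed from the rows above. The sketch below copies the three nDCG@10 columns from the table and takes a query-weighted mean; because every language has 200 queries, this equals the plain mean. The variable and function names are illustrative only.

```python
# (BM25, Dense, Reranking hybrid) nDCG@10 per language, copied from the table.
SCORES = {
    "el": (0.7749, 0.7834, 0.8390), "en": (0.6453, 0.8477, 0.7986),
    "es": (0.8302, 0.8803, 0.8862), "fi": (0.8230, 0.7955, 0.8682),
    "fr": (0.8179, 0.8329, 0.8628), "it": (0.7920, 0.8257, 0.8422),
    "lt": (0.8115, 0.7495, 0.8442), "lv": (0.8376, 0.7910, 0.8672),
    "nl": (0.8909, 0.8580, 0.9072), "pl": (0.8400, 0.8299, 0.8909),
    "pt": (0.8222, 0.8552, 0.8895), "sk": (0.7041, 0.7714, 0.7872),
    "sl": (0.7455, 0.7428, 0.7983), "sv": (0.8563, 0.8576, 0.8946),
}
QUERIES_PER_LANGUAGE = 200
PROFILES = ("BM25", "Dense", "Reranking hybrid")

def query_weighted_mean(column):
    """Average one score column, weighting each language by its query count."""
    total_queries = QUERIES_PER_LANGUAGE * len(SCORES)
    weighted = sum(row[column] * QUERIES_PER_LANGUAGE for row in SCORES.values())
    return weighted / total_queries

group = {name: round(query_weighted_mean(i), 4) for i, name in enumerate(PROFILES)}
best = {lang: PROFILES[row.index(max(row))] for lang, row in SCORES.items()}
dense_below_bm25 = [lang for lang, (bm25, dense, _) in SCORES.items() if dense < bm25]

print(group)  # {'BM25': 0.7994, 'Dense': 0.8158, 'Reranking hybrid': 0.8554}
print([lang for lang, name in best.items() if name != "Reranking hybrid"])  # ['en']
print(dense_below_bm25)  # ['fi', 'lt', 'lv', 'nl', 'pl', 'sl']
```

The output matches the group-level nDCG@10 values, the single dense-led language, and the six languages where dense trails BM25.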

Interpretation Notes for Model Researchers

NanoMuPLeR is a controlled test of multilingual legal retrieval. Because every language has the same query count, document count, and positive density, score differences are easier to interpret than in mixed-domain groups. Strong performance indicates that a model can match precise legal conditions across EU languages, not merely retrieve topical legal documents.

The retrieval-profile pattern is also clear. BM25 is a strong legal baseline, dense retrieval helps especially when the query paraphrases the passage, and hybrid retrieval is usually best because it combines exact legal anchors with semantic condition matching. English is the exception in this Nano slice, where dense retrieval is strongest.

Training and Leakage Notes

Useful training data includes non-overlapping EUR-Lex and DGT-Acquis passage retrieval, multilingual legal QA, legal passage reranking data, and parallel EU legal bitext with hard negatives from nearby provisions. Synthetic legal queries can help if they preserve the exact legal condition and do not turn the task into vague topical retrieval.

Leakage control should account for parallel documents. Training should exclude MuPLeR evaluation queries, positives, and translated equivalents across languages. Hard negatives should share institutions, dates, directives, and legal topics while differing in the actual condition, actor, threshold, or procedural rule.
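
The parallel-aware exclusion can be sketched as a filter over training examples. The example assumes each passage carries an identifier shared by all of its translations (called parallel_id here); that field, the record layout, and the function name are hypothetical, since this page does not describe the dataset schema.

```python
def filter_training_examples(train_examples, eval_parallel_ids, eval_queries):
    """Drop training examples that overlap the evaluation set.

    Hypothetical record layout: each example is a dict with "query", "lang",
    and "parallel_id", where parallel_id is shared by every translation of
    the same passage.
    """
    kept = []
    for example in train_examples:
        if example["parallel_id"] in eval_parallel_ids:
            continue  # evaluation positive or one of its translated equivalents
        if example["query"] in eval_queries:
            continue  # evaluation query reused verbatim
        kept.append(example)
    return kept

# Hypothetical data: passage "p1" is an evaluation positive in English.
eval_parallel_ids = {"p1"}
eval_queries = {"Which aid is exempt from notification?"}
train_examples = [
    {"query": "training query A", "lang": "fr", "parallel_id": "p1"},
    {"query": "training query B", "lang": "sv", "parallel_id": "p2"},
    {"query": "Which aid is exempt from notification?", "lang": "en", "parallel_id": "p3"},
]
kept = filter_training_examples(train_examples, eval_parallel_ids, eval_queries)
print([example["parallel_id"] for example in kept])  # ['p2']
```

The French example is removed even though the evaluation positive is English, because both share the same parallel identifier. Exact string matching on queries is the simplest possible check; translated or paraphrased evaluation queries would need a stronger match.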

Source Reference Table

Source	Year	Type	URL
mteb/MuPLeR-retrieval		dataset card	https://huggingface.co/datasets/mteb/MuPLeR-retrieval
DGT-Acquis		source corpus page	https://joint-research-centre.ec.europa.eu/language-technology-resources/dgt-acquis_en
An overview of the European Union's highly multilingual parallel corpora	2014	source reference paper	https://link.springer.com/article/10.1007/s10579-014-9277-0
Massive Text Embedding Benchmark		benchmark repository	https://github.com/embeddings-benchmark/mteb

Metadata Summary

Field	Value
Task pages	14
Queries	2,800
Split-local documents	140,000
Positive qrels	2,800
Languages	el, en, es, fi, fr, it, lt, lv, nl, pl, pt, sk, sl, sv
Categories	natural_language
Positives / query avg	1.00

Task Metadata Summary

Task	Backing dataset	Lang	Category	Queries	Docs	Positives	BM25 nDCG@10	Dense nDCG@10	Reranking hybrid nDCG@10	Best profile
el	NanoMuPLeR	el	natural_language	200	10,000	200	0.7749	0.7834	0.8390	Reranking hybrid
en	NanoMuPLeR	en	natural_language	200	10,000	200	0.6453	0.8477	0.7986	Dense
es	NanoMuPLeR	es	natural_language	200	10,000	200	0.8302	0.8803	0.8862	Reranking hybrid
fi	NanoMuPLeR	fi	natural_language	200	10,000	200	0.8230	0.7955	0.8682	Reranking hybrid
fr	NanoMuPLeR	fr	natural_language	200	10,000	200	0.8179	0.8329	0.8628	Reranking hybrid
it	NanoMuPLeR	it	natural_language	200	10,000	200	0.7920	0.8257	0.8422	Reranking hybrid
lt	NanoMuPLeR	lt	natural_language	200	10,000	200	0.8115	0.7495	0.8442	Reranking hybrid
lv	NanoMuPLeR	lv	natural_language	200	10,000	200	0.8376	0.7910	0.8672	Reranking hybrid
nl	NanoMuPLeR	nl	natural_language	200	10,000	200	0.8909	0.8580	0.9072	Reranking hybrid
pl	NanoMuPLeR	pl	natural_language	200	10,000	200	0.8400	0.8299	0.8909	Reranking hybrid
pt	NanoMuPLeR	pt	natural_language	200	10,000	200	0.8222	0.8552	0.8895	Reranking hybrid
sk	NanoMuPLeR	sk	natural_language	200	10,000	200	0.7041	0.7714	0.7872	Reranking hybrid
sl	NanoMuPLeR	sl	natural_language	200	10,000	200	0.7455	0.7428	0.7983	Reranking hybrid
sv	NanoMuPLeR	sv	natural_language	200	10,000	200	0.8563	0.8576	0.8946	Reranking hybrid