NanoMLDR
Overview
NanoMLDR is the compact Nano set for MLDR, a multilingual long-document retrieval benchmark. It covers 13 monolingual retrieval splits: Arabic, German, English, Spanish, French, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Thai, and Chinese. Each query is a question generated from a paragraph inside a long article, while the positive document is the full article rather than the short answer-bearing paragraph.
The group is useful because it isolates a difficult document-level retrieval problem. The query may point to one small region of a very long same-language document. A successful retriever must preserve language coverage, exact entity and phrase anchors, and enough long-document representation to select the whole source article. BM25 is the dominant profile for most languages in the current metadata, dense retrieval is weaker on long-document compression, and reranking_hybrid is useful where sparse and dense candidates recover different long documents.
What This Group Measures
M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation describes the MLDR long-document setting. The MLDR dataset construction samples long documents from multilingual sources, selects a paragraph, and generates a question from that paragraph. The retrieval target remains the full document.
NanoMLDR therefore measures monolingual long-document retrieval, not short passage retrieval and not cross-lingual transfer. The answer-bearing evidence may be a small part of the document, and the document itself may be a clean Wikipedia article, a noisy mC4 page, or a Wudao-style Chinese text.
Task Families
- Wikipedia-like long-document retrieval: many language splits retrieve long encyclopedia-style articles from generated questions.
- Noisy web long-document retrieval: German, Spanish, Thai, and Chinese can include noisier web or mixed-source documents.
- Script-diverse monolingual retrieval: Arabic, Hindi, Japanese, Korean, Thai, and Chinese test long-document retrieval under different segmentation and script conditions.
- Single-positive article selection: every split is single-positive, so the key question is whether the full source document is ranked near the top.
Dataset Shape
NanoMLDR contains 13 task pages, 2,089 queries, 55,585 split-local documents, and 2,089 positive qrel rows. Query counts vary by language, from 117 German queries to 200 English and Chinese queries. Every observed query has one positive full document.
Documents are long. English averages nearly 28,000 characters per document, while many European and Indic-language splits average around 12,000 to 15,000 characters. Japanese, Korean, and Thai have shorter character counts but are still long-document retrieval tasks. The group should be interpreted as document-level retrieval under multilingual source and noise variation.
Retrieval Behavior
BM25 Profile
BM25 is the best profile for nearly every NanoMLDR language in the current metadata. Portuguese, Spanish, French, Italian, and Russian are especially strong, suggesting that generated questions often preserve rare entities, phrases, dates, or topical terms from the source article. BM25 is also strong for Arabic, German, English, Japanese, Korean, and Chinese relative to dense retrieval.
Thai and Hindi are harder. Thai includes noisier web documents and weaker lexical anchoring. Hindi is the main split where hybrid beats BM25, suggesting that sparse and dense retrieval recover complementary signals.
Dense Profile
Dense retrieval is generally weaker than BM25 on NanoMLDR. The likely issue is long-document compression: a single embedding must represent an entire article while the query is grounded in one paragraph. Important rare terms can be diluted by the rest of the document.
Dense scores are still diagnostic. A model that improves dense retrieval here without losing BM25-like exact anchors is likely improving long-document representation rather than only short-passage semantics.
Reranking Hybrid Profile
reranking_hybrid usually sits between BM25 and dense. It helps when BM25 captures exact terms and dense retrieval captures broader semantic relation. Hindi is the clearest hybrid-led language in the current metadata; several other languages have hybrid scores that are meaningfully above dense but below BM25.
For reranker experiments, hybrid can be a safer candidate pool than dense alone because it preserves sparse long-document anchors. This matters when the full document is long and only a small region answers the question.
Language Summary
| Language | Task | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| Arabic | ar | 150 | 4,766 | 0.7604 | 0.4443 | 0.6181 | BM25 |
| German | de | 117 | 5,046 | 0.7138 | 0.4208 | 0.5773 | BM25 |
| English | en | 200 | 10,000 | 0.7254 | 0.4611 | 0.5916 | BM25 |
| Spanish | es | 176 | 3,312 | 0.9439 | 0.7844 | 0.8580 | BM25 |
| French | fr | 152 | 3,059 | 0.9125 | 0.7706 | 0.8421 | BM25 |
| Hindi | hi | 159 | 2,858 | 0.3184 | 0.3192 | 0.3883 | Reranking hybrid |
| Italian | it | 158 | 3,116 | 0.8884 | 0.6832 | 0.7807 | BM25 |
| Japanese | ja | 148 | 3,112 | 0.7589 | 0.5014 | 0.6452 | BM25 |
| Korean | ko | 177 | 3,087 | 0.6868 | 0.4120 | 0.5925 | BM25 |
| Portuguese | pt | 141 | 3,028 | 0.9503 | 0.7667 | 0.8565 | BM25 |
| Russian | ru | 160 | 3,125 | 0.8664 | 0.5992 | 0.6969 | BM25 |
| Thai | th | 151 | 3,199 | 0.3873 | 0.2671 | 0.3469 | BM25 |
| Chinese | zh | 200 | 7,877 | 0.7030 | 0.3392 | 0.4933 | BM25 |
Interpretation Notes for Model Researchers
NanoMLDR is a long-document retrieval benchmark first and a multilingual benchmark second. Strong results mean the model can identify a full document from a short question grounded in one paragraph. Dense models should not be judged only by passage-retrieval performance; this group tests whether their representations survive long-document aggregation.
The BM25 dominance is meaningful. It shows that exact rare terms and entities remain powerful when questions are generated from source paragraphs. Dense or hybrid improvements are most interesting in languages where BM25 is weak, such as Hindi and Thai, or where noisy web documents make exact matching less stable.
Training and Leakage Notes
Useful training data includes MLDR-style paragraph-grounded question/full article pairs, multilingual long-document QA, Wikipedia article retrieval, mC4/Wudao web-document retrieval, and hard negatives with overlapping entities, dates, locations, or template language. Training should preserve full-document targets rather than converting all examples to short passage retrieval.
Exclude NanoMLDR evaluation queries, positives, qrels, and source documents. If using public MLDR data, audit train/dev/test boundaries and article overlap before mixing examples into training.
Source Reference Table
| Source | Year | Type | URL |
| M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation | 2024 | paper | https://arxiv.org/abs/2402.03216 |
| MLDR dataset | dataset | https://huggingface.co/datasets/Shitao/MLDR |
Metadata Summary
| Field | Value |
| Task pages | 13 |
| Queries | 2,089 |
| Split-local documents | 55,585 |
| Positive qrels | 2,089 |
| Languages | ar, de, en, es, fr, hi, it, ja, ko, pt, ru, th, zh |
| Categories | natural_language |
| Positives / query avg | 1.00 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| ar | NanoMLDR | ar | natural_language | 150 | 4,766 | 150 | 0.7604 | 0.4443 | 0.6181 | BM25 |
| de | NanoMLDR | de | natural_language | 117 | 5,046 | 117 | 0.7138 | 0.4208 | 0.5773 | BM25 |
| en | NanoMLDR | en | natural_language | 200 | 10,000 | 200 | 0.7254 | 0.4611 | 0.5916 | BM25 |
| es | NanoMLDR | es | natural_language | 176 | 3,312 | 176 | 0.9439 | 0.7844 | 0.8580 | BM25 |
| fr | NanoMLDR | fr | natural_language | 152 | 3,059 | 152 | 0.9125 | 0.7706 | 0.8421 | BM25 |
| hi | NanoMLDR | hi | natural_language | 159 | 2,858 | 159 | 0.3184 | 0.3192 | 0.3883 | Reranking hybrid |
| it | NanoMLDR | it | natural_language | 158 | 3,116 | 158 | 0.8884 | 0.6832 | 0.7807 | BM25 |
| ja | NanoMLDR | ja | natural_language | 148 | 3,112 | 148 | 0.7589 | 0.5014 | 0.6452 | BM25 |
| ko | NanoMLDR | ko | natural_language | 177 | 3,087 | 177 | 0.6868 | 0.4120 | 0.5925 | BM25 |
| pt | NanoMLDR | pt | natural_language | 141 | 3,028 | 141 | 0.9503 | 0.7667 | 0.8565 | BM25 |
| ru | NanoMLDR | ru | natural_language | 160 | 3,125 | 160 | 0.8664 | 0.5992 | 0.6969 | BM25 |
| th | NanoMLDR | th | natural_language | 151 | 3,199 | 151 | 0.3873 | 0.2671 | 0.3469 | BM25 |
| zh | NanoMLDR | zh | natural_language | 200 | 7,877 | 200 | 0.7030 | 0.3392 | 0.4933 | BM25 |