HAKARI-Bench

NanoMLDR

Overview

NanoMLDR is the compact Nano set for MLDR, a multilingual long-document retrieval benchmark. It covers 13 monolingual retrieval splits: Arabic, German, English, Spanish, French, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Thai, and Chinese. Each query is a question generated from a paragraph inside a long article, while the positive document is the full article rather than the short answer-bearing paragraph.

The group is useful because it isolates a difficult document-level retrieval problem: the query may point to one small region of a very long same-language document. A successful retriever must preserve language coverage, exact entity and phrase anchors, and enough long-document representation to select the whole source article. In the current metadata, BM25 is the best profile for 12 of the 13 languages, dense retrieval is weaker, likely because a whole article must be compressed into one representation, and reranking_hybrid is useful where sparse and dense candidates recover different long documents.

What This Group Measures

The paper "M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation" describes the MLDR long-document setting. MLDR is constructed by sampling long documents from multilingual sources, selecting a paragraph from each, and generating a question from that paragraph. The retrieval target remains the full document.

NanoMLDR therefore measures monolingual long-document retrieval, not short passage retrieval and not cross-lingual transfer. The answer-bearing evidence may be a small part of the document, and the document itself may be a clean Wikipedia article, a noisy mC4 page, or a Wudao-style Chinese text.

Task Families

Dataset Shape

NanoMLDR contains 13 task pages, 2,089 queries, 55,585 split-local documents, and 2,089 positive qrel rows. Query counts vary by language, from 117 German queries to 200 English and Chinese queries. Every observed query has one positive full document.
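The group totals can be checked against the per-language counts. The sketch below is only an arithmetic check: the `COUNTS` dictionary is a hypothetical name, and its values are copied by hand from the Language Summary table rather than loaded from the dataset.

```python
# Per-language (queries, documents) counts, copied by hand from the
# Language Summary table. COUNTS is an illustrative name, not a dataset field.
COUNTS = {
    "ar": (150, 4766), "de": (117, 5046), "en": (200, 10000),
    "es": (176, 3312), "fr": (152, 3059), "hi": (159, 2858),
    "it": (158, 3116), "ja": (148, 3112), "ko": (177, 3087),
    "pt": (141, 3028), "ru": (160, 3125), "th": (151, 3199),
    "zh": (200, 7877),
}

total_queries = sum(q for q, _ in COUNTS.values())
total_docs = sum(d for _, d in COUNTS.values())

# Languages with the fewest and the most queries.
min_q = min(q for q, _ in COUNTS.values())
max_q = max(q for q, _ in COUNTS.values())
smallest = sorted(lang for lang, (q, _) in COUNTS.items() if q == min_q)
largest = sorted(lang for lang, (q, _) in COUNTS.items() if q == max_q)

print(len(COUNTS), total_queries, total_docs, smallest, largest)
```

Because every query has exactly one positive document, the number of positive qrel rows equals the query total.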

Documents are long. English averages nearly 28,000 characters per document, while many European and Indic-language splits average around 12,000 to 15,000 characters. Japanese, Korean, and Thai have shorter character counts but are still long-document retrieval tasks. The group should be interpreted as document-level retrieval under multilingual source and noise variation.

Retrieval Behavior

BM25 Profile

BM25 is the best profile for 12 of the 13 NanoMLDR languages in the current metadata. Portuguese, Spanish, French, Italian, and Russian are especially strong, with BM25 nDCG@10 between 0.87 and 0.95, suggesting that generated questions often preserve rare entities, phrases, dates, or topical terms from the source article. BM25 is also strong for Arabic, German, English, Japanese, Korean, and Chinese relative to dense retrieval.

Thai and Hindi are harder: BM25 nDCG@10 falls below 0.40 in both. Thai includes noisier web documents and weaker lexical anchoring, although BM25 is still its best profile. Hindi is the only split where hybrid beats BM25 (0.3883 vs. 0.3184), suggesting that sparse and dense retrieval recover complementary signals.
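The rare-term anchoring described above can be illustrated with a toy BM25 scorer. This is a minimal sketch, not the benchmark's BM25 implementation: the tokenization, the parameters `k1=1.2` and `b=0.75`, the non-negative IDF variant, and the three toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25.

    Uses the non-negative IDF variant log(1 + (N - df + 0.5) / (df + 0.5)).
    """
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))  # document frequency: count each term once per doc
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Toy corpus: document 1 is long, but it is the only one with the rare entity.
docs = [
    ["the", "city", "was", "founded", "long", "ago"],
    ["the", "port", "of", "zanzibar", "was", "founded", "by", "traders"] + ["filler"] * 40,
    ["the", "city", "was", "founded", "recently"],
]
query = ["founded", "zanzibar"]

scores = bm25_scores(query, docs)
best_index = scores.index(max(scores))
print(best_index, [round(s, 3) for s in scores])
```

In this toy setup the long document still ranks first: the rare term's high IDF outweighs the length-normalization penalty, which is the behavior the strong BM25 scores suggest.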

Dense Profile

Dense retrieval is weaker than BM25 in every NanoMLDR language except Hindi, where the two are effectively tied (0.3192 vs. 0.3184). The likely issue is long-document compression: a single embedding must represent an entire article while the query is grounded in one paragraph. Important rare terms can be diluted by the rest of the document.
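The dilution effect can be shown with a toy calculation. The sketch below assumes, purely for illustration, that a document embedding behaves like the mean of its chunk vectors; real encoders do not work this way, and the three-dimensional vectors are invented for the example.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pool(vectors):
    """Average a list of vectors dimension by dimension."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical chunk vectors for one long document. Only the first chunk
# matches the query; the other chunks cover unrelated content.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
query_vector = [1.0, 0.0, 0.0]

pooled_score = cosine(query_vector, mean_pool(chunk_vectors))
best_chunk_score = max(cosine(query_vector, c) for c in chunk_vectors)
print(round(pooled_score, 3), round(best_chunk_score, 3))
```

Under this assumption the pooled document scores well below its best chunk, and the gap widens as more unrelated chunks are added.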

Dense scores are still diagnostic. A model that improves dense retrieval here without losing BM25-like exact anchors is likely improving long-document representation rather than only short-passage semantics.

Reranking Hybrid Profile

reranking_hybrid sits between BM25 and dense in 12 of the 13 languages. It helps when BM25 captures exact terms and dense retrieval captures broader semantic relation. Hindi is the only hybrid-led language in the current metadata; in every other language, the hybrid score is meaningfully above dense but below BM25.

For reranker experiments, hybrid can be a safer candidate pool than dense alone because it preserves sparse long-document anchors. This matters when the full document is long and only a small region answers the question.
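One common way to build a mixed candidate pool is reciprocal rank fusion (RRF). The source does not state how the reranking_hybrid profile combines its candidates, so the sketch below is an assumption, not a description of that profile. The function name, the constant `k=60`, and the document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]

# Hypothetical top-3 lists from a sparse and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d5", "d3", "d9"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

A document found by both retrievers ("d3" here) rises to the top, while documents found by only one retriever stay in the pool for a reranker to judge.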

Language Summary

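| Language | Task | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Arabic | ar | 150 | 4,766 | 0.7604 | 0.4443 | 0.6181 | BM25 |
| German | de | 117 | 5,046 | 0.7138 | 0.4208 | 0.5773 | BM25 |
| English | en | 200 | 10,000 | 0.7254 | 0.4611 | 0.5916 | BM25 |
| Spanish | es | 176 | 3,312 | 0.9439 | 0.7844 | 0.8580 | BM25 |
| French | fr | 152 | 3,059 | 0.9125 | 0.7706 | 0.8421 | BM25 |
| Hindi | hi | 159 | 2,858 | 0.3184 | 0.3192 | 0.3883 | Reranking hybrid |
| Italian | it | 158 | 3,116 | 0.8884 | 0.6832 | 0.7807 | BM25 |
| Japanese | ja | 148 | 3,112 | 0.7589 | 0.5014 | 0.6452 | BM25 |
| Korean | ko | 177 | 3,087 | 0.6868 | 0.4120 | 0.5925 | BM25 |
| Portuguese | pt | 141 | 3,028 | 0.9503 | 0.7667 | 0.8565 | BM25 |
| Russian | ru | 160 | 3,125 | 0.8664 | 0.5992 | 0.6969 | BM25 |
| Thai | th | 151 | 3,199 | 0.3873 | 0.2671 | 0.3469 | BM25 |
| Chinese | zh | 200 | 7,877 | 0.7030 | 0.3392 | 0.4933 | BM25 |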
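Since every query has exactly one positive document, nDCG@10 depends only on the rank of that document. The sketch below assumes binary relevance and the standard log2 discount; the benchmark's exact metric implementation is not stated here, and the function name and document IDs are illustrative.

```python
import math

def ndcg_at_10_single_positive(ranked_ids, positive_id):
    """nDCG@10 for a query with exactly one relevant document.

    With one positive of relevance 1, the ideal DCG is 1 / log2(2) = 1,
    so nDCG@10 is 1 / log2(rank + 1) when the positive is in the top 10
    and 0 otherwise.
    """
    for rank, doc_id in enumerate(ranked_ids[:10], start=1):
        if doc_id == positive_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Hypothetical ranked list of 12 document IDs.
ranking = [f"doc{i}" for i in range(1, 13)]

at_rank_1 = ndcg_at_10_single_positive(ranking, "doc1")
at_rank_3 = ndcg_at_10_single_positive(ranking, "doc3")
outside_top_10 = ndcg_at_10_single_positive(ranking, "doc11")
print(at_rank_1, at_rank_3, outside_top_10)
```

Under these assumptions, a positive at rank 3 already halves the per-query score, so the averages in the table are sensitive to how often the positive lands at rank 1.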

Interpretation Notes for Model Researchers

NanoMLDR is a long-document retrieval benchmark first and a multilingual benchmark second. Strong results mean the model can identify a full document from a short question grounded in one paragraph. Dense models should not be judged only by passage-retrieval performance; this group tests whether their representations survive long-document aggregation.

The BM25 dominance is meaningful. It shows that exact rare terms and entities remain powerful when questions are generated from source paragraphs. Dense or hybrid improvements are most interesting in languages where BM25 is weak, such as Hindi and Thai, or where noisy web documents make exact matching less stable.

Training and Leakage Notes

Useful training data includes MLDR-style paragraph-grounded question/full article pairs, multilingual long-document QA, Wikipedia article retrieval, mC4/Wudao web-document retrieval, and hard negatives with overlapping entities, dates, locations, or template language. Training should preserve full-document targets rather than converting all examples to short passage retrieval.

Exclude NanoMLDR evaluation queries, positives, qrels, and source documents. If using public MLDR data, audit train/dev/test boundaries and article overlap before mixing examples into training.
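A first-pass exclusion filter can be sketched as follows. It is a minimal example under stated assumptions: training data is a list of (query, document) text pairs, and overlap means an exact match after lowercasing and whitespace normalization. The function names and sample texts are hypothetical, and near-duplicate or article-level overlap needs a stronger check than this.

```python
import hashlib

def fingerprint(text):
    """Hash text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def filter_training_pairs(train_pairs, eval_queries, eval_docs):
    """Drop training pairs whose query or document matches evaluation data."""
    blocked_queries = {fingerprint(q) for q in eval_queries}
    blocked_docs = {fingerprint(d) for d in eval_docs}
    return [
        (query, doc)
        for query, doc in train_pairs
        if fingerprint(query) not in blocked_queries
        and fingerprint(doc) not in blocked_docs
    ]

# Hypothetical data: three of the four training pairs overlap with evaluation.
eval_queries = ["Who founded the port?"]
eval_docs = ["article c   text"]
train_pairs = [
    ("Who founded the port?", "Article A text"),
    ("  who FOUNDED the port? ", "Article B text"),
    ("What is the capital?", "Article C text"),
    ("When was it built?", "Article D text"),
]

kept = filter_training_pairs(train_pairs, eval_queries, eval_docs)
print(kept)
```

Exact-match hashing is cheap enough to run over a full training set, but it will miss paraphrased queries and partially overlapping articles.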

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| M3-Embedding: Multi-Linguality, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation | 2024 | paper | https://arxiv.org/abs/2402.03216 |
| MLDR dataset | | dataset | https://huggingface.co/datasets/Shitao/MLDR |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 13 |
| Queries | 2,089 |
| Split-local documents | 55,585 |
| Positive qrels | 2,089 |
| Languages | ar, de, en, es, fr, hi, it, ja, ko, pt, ru, th, zh |
| Categories | natural_language |
| Positives / query avg | 1.00 |

Task Metadata Summary

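| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ar | NanoMLDR | ar | natural_language | 150 | 4,766 | 150 | 0.7604 | 0.4443 | 0.6181 | BM25 |
| de | NanoMLDR | de | natural_language | 117 | 5,046 | 117 | 0.7138 | 0.4208 | 0.5773 | BM25 |
| en | NanoMLDR | en | natural_language | 200 | 10,000 | 200 | 0.7254 | 0.4611 | 0.5916 | BM25 |
| es | NanoMLDR | es | natural_language | 176 | 3,312 | 176 | 0.9439 | 0.7844 | 0.8580 | BM25 |
| fr | NanoMLDR | fr | natural_language | 152 | 3,059 | 152 | 0.9125 | 0.7706 | 0.8421 | BM25 |
| hi | NanoMLDR | hi | natural_language | 159 | 2,858 | 159 | 0.3184 | 0.3192 | 0.3883 | Reranking hybrid |
| it | NanoMLDR | it | natural_language | 158 | 3,116 | 158 | 0.8884 | 0.6832 | 0.7807 | BM25 |
| ja | NanoMLDR | ja | natural_language | 148 | 3,112 | 148 | 0.7589 | 0.5014 | 0.6452 | BM25 |
| ko | NanoMLDR | ko | natural_language | 177 | 3,087 | 177 | 0.6868 | 0.4120 | 0.5925 | BM25 |
| pt | NanoMLDR | pt | natural_language | 141 | 3,028 | 141 | 0.9503 | 0.7667 | 0.8565 | BM25 |
| ru | NanoMLDR | ru | natural_language | 160 | 3,125 | 160 | 0.8664 | 0.5992 | 0.6969 | BM25 |
| th | NanoMLDR | th | natural_language | 151 | 3,199 | 151 | 0.3873 | 0.2671 | 0.3469 | BM25 |
| zh | NanoMLDR | zh | natural_language | 200 | 7,877 | 200 | 0.7030 | 0.3392 | 0.4933 | BM25 |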