NanoMIRACL
Overview
NanoMIRACL is a language-specific Nano benchmark for MIRACL, a multilingual ad hoc retrieval benchmark built around Wikipedia passage retrieval. The original MIRACL work covers eighteen languages and asks a monolingual retrieval question in each split: an Arabic query retrieves Arabic passages, a Japanese query retrieves Japanese passages, and so on. This group keeps that retrieval setting while making the task small enough to inspect one language at a time.
The group is valuable because it holds the high-level task constant while changing script, morphology, tokenization behavior, resource level, and Wikipedia coverage. The model is not translating and is not answering from a fixed article. It must rank the passage that contains the answer-bearing evidence for a short natural-language question. In the current Nano metadata, BM25 is often a strong lexical anchor, dense retrieval from harrier_oss_v1_270m is usually the best top-rank signal, and reranking_hybrid gives the broadest top-100 candidate coverage.
What This Group Measures
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages and its arXiv version, Making a MIRACL, describe MIRACL as a multilingual retrieval benchmark with native-language queries, Wikipedia passages, and relevance judgments. NanoMIRACL should be read as a compact monolingual passage-retrieval suite derived from that benchmark.
The shared relevance relation is evidence retrieval: the positive passage should answer or directly support the query. The task is therefore different from generic semantic similarity. A high-scoring retriever must keep exact entity names, dates, and article-title clues when they matter, while also recognizing the relation asked by the question. This is especially important in languages where segmentation, inflection, named-entity spelling, or short query length can change sparse and dense behavior.
Task Families
- Wikipedia evidence retrieval: all 18 tasks use monolingual factual question-to-passage retrieval over Wikipedia-style passages.
- Short-query entity retrieval pressure: Japanese, Chinese, Korean, Thai, Arabic, Persian, and several European-language splits often depend on short entity-heavy questions.
- Multi-positive passage ranking: most splits have more positive qrels than queries, so models are rewarded for ranking several acceptable passages, not only for finding one exact page.
- Low-resource and script-diverse evaluation: Bengali, Telugu, Swahili, Yoruba, Thai, and Persian expose failure modes that may be hidden by English-centric retrieval tuning.
Dataset Shape
NanoMIRACL contains 18 task pages, 3,519 queries, 180,000 split-local documents, and 8,071 positive qrel rows. Each split has 10,000 documents in the current metadata; this is a sum over language-local candidate pools, not a deduplicated corpus size. Most languages have 200 queries, while Yoruba has 119. Positive density differs substantially: Telugu averages close to one positive per query, while Spanish averages more than four.
Queries are short, with a query-weighted mean around 37.6 characters. Documents are compact passages, with a document-weighted mean around 353 characters. CJK splits have much shorter character counts than European-language splits, so raw character length should not be compared as if it were token length. The group is best interpreted as eighteen parallel retrieval conditions rather than one large multilingual pool.
Retrieval Behavior
BM25 Profile
BM25 is strongest when the query contains rare words, entity names, article titles, dates, or other exact anchors that survive tokenization. Finnish, Spanish, English, Indonesian, and Japanese are among the stronger sparse profiles in the current metadata. BM25 is less reliable when the relevant passage shares many terms with non-answering neighbors or when segmentation and morphology make exact matching brittle.
At the group level, BM25 is not merely a weak baseline. It provides high top-100 coverage and identifies tasks where lexical memorization or exact surface-form handling is still central. For model researchers, BM25-led or BM25-competitive splits should be treated as warnings that dense-only retrieval may be discarding useful exact-match evidence.
Dense Profile
Dense retrieval with harrier_oss_v1_270m is the best nDCG@10 profile for most NanoMIRACL languages. It is especially useful when the query and passage express the same answer relation with different wording, or when the relevant passage is not the one with the highest exact lexical overlap. Dense retrieval improves the interpretation of short questions because it can rank evidence passages by answerability rather than by surface-term count alone.
The dense profile is still not a full replacement for sparse retrieval. Rare names, spellings, and transliterated entities can be smoothed away by embedding similarity. The most useful comparison is therefore not simply BM25 versus dense, but which languages are dense-led, which are sparse-competitive, and which lose top-100 coverage under dense retrieval.
Reranking Hybrid Profile
reranking_hybrid combines the retrieval strengths needed for top-100 reranker candidate generation. It is not always the best nDCG@10 sorter, but it often has the safest Recall@100 because it can keep positives found by either lexical or dense retrieval. Indonesian and Korean are examples where the hybrid profile is best by nDCG@10 in the current metadata.
For reranker experiments, this profile should be read as the practical candidate pool. If dense has the best nDCG@10 but hybrid has better recall, the task is telling a clear story: first-stage dense ranking is good, but a reranker benefits from candidates recovered by both sparse and dense retrieval.
Language Summary
| Language | Task | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| Arabic | ar | 200 | 10,000 | 386 | 0.6352 | 0.8223 | 0.7514 | Dense |
| Bengali | bn | 200 | 10,000 | 407 | 0.5033 | 0.7661 | 0.6537 | Dense |
| German | de | 200 | 10,000 | 538 | 0.5172 | 0.7389 | 0.6418 | Dense |
| English | en | 200 | 10,000 | 560 | 0.6774 | 0.7721 | 0.7474 | Dense |
| Spanish | es | 200 | 10,000 | 934 | 0.6861 | 0.7793 | 0.7478 | Dense |
| Persian | fa | 200 | 10,000 | 427 | 0.5788 | 0.6476 | 0.6334 | Dense |
| Finnish | fi | 200 | 10,000 | 328 | 0.7734 | 0.8634 | 0.8332 | Dense |
| French | fr | 200 | 10,000 | 417 | 0.4658 | 0.6828 | 0.5896 | Dense |
| Hindi | hi | 200 | 10,000 | 410 | 0.3037 | 0.6847 | 0.5174 | Dense |
| Indonesian | id | 200 | 10,000 | 654 | 0.6773 | 0.7076 | 0.7171 | Reranking hybrid |
| Japanese | ja | 200 | 10,000 | 373 | 0.6601 | 0.7745 | 0.7223 | Dense |
| Korean | ko | 200 | 10,000 | 508 | 0.4994 | 0.6910 | 0.7026 | Reranking hybrid |
| Russian | ru | 200 | 10,000 | 555 | 0.5887 | 0.7693 | 0.6816 | Dense |
| Swahili | sw | 200 | 10,000 | 405 | 0.5852 | 0.7872 | 0.7292 | Dense |
| Telugu | te | 200 | 10,000 | 211 | 0.5292 | 0.8720 | 0.6953 | Dense |
| Thai | th | 200 | 10,000 | 343 | 0.6229 | 0.8101 | 0.7296 | Dense |
| Yoruba | yo | 119 | 10,000 | 144 | 0.5816 | 0.8416 | 0.7651 | Dense |
| Chinese | zh | 200 | 10,000 | 471 | 0.4022 | 0.7191 | 0.5619 | Dense |
Interpretation Notes for Model Researchers
Read NanoMIRACL as a controlled multilingual retrieval comparison. The task family is stable, but the tokenizer, script, entity distribution, and Wikipedia coverage change by language. A model that improves only English, Spanish, or French may be learning resource-rich Wikipedia retrieval rather than robust multilingual passage retrieval. Conversely, gains on Japanese, Chinese, Korean, Thai, Bengali, Telugu, Swahili, or Yoruba may reveal improvements in segmentation, multilingual representation, or low-resource transfer.
The most informative comparisons are the profile switches. Dense-led languages show where semantic answerability helps. BM25-competitive languages show where surface forms remain essential. Hybrid-led languages suggest complementarity between exact anchors and embedding similarity. Because many queries have multiple positives, nDCG@10 and Recall@100 should be inspected together: a model can find one acceptable passage while still ranking the broader positive set poorly.
Training and Leakage Notes
Useful training data includes non-overlapping MIRACL training data, language-matched Wikipedia question-to-passage pairs, open-domain QA evidence retrieval data, and hard negatives drawn from same-article or same-entity passages. Training should preserve the monolingual design: the query and document should stay in the same language unless a separate cross-lingual experiment is being run.
Leakage control is important. Exclude NanoMIRACL evaluation queries, qrels, positive passages, and direct translations of evaluation examples. Upstream MIRACL development or test rows should be checked for overlap before use. Synthetic data should preserve named entities, dates, numerals, aliases, orthography, and article-title conventions, and should include hard negatives that share surface terms but do not answer the specific relation.
Source Reference Table
| Source | Year | Type | URL |
| Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages | 2022 | paper | https://arxiv.org/abs/2210.09984 |
| MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages | 2023 | paper | https://aclanthology.org/2023.tacl-1.63/ |
| MIRACL GitHub repository | project | https://github.com/project-miracl/miracl | |
| MIRACL corpus dataset | dataset | https://huggingface.co/datasets/miracl/miracl-corpus | |
| MIRACL source queries and qrels | dataset | https://huggingface.co/datasets/miracl/miracl |
Metadata Summary
| Field | Value |
| Task pages | 18 |
| Queries | 3,519 |
| Split-local documents | 180,000 |
| Positive qrels | 8,071 |
| Languages | ar, bn, de, en, es, fa, fi, fr, hi, id, ja, ko, multilingual, ru, te, th, zh |
| Categories | natural_language |
| Positives / query avg | 2.29 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| ar | NanoMIRACL | ar | natural_language | 200 | 10,000 | 386 | 0.6352 | 0.8223 | 0.7514 | Dense |
| bn | NanoMIRACL | bn | natural_language | 200 | 10,000 | 407 | 0.5033 | 0.7661 | 0.6537 | Dense |
| de | NanoMIRACL | de | natural_language | 200 | 10,000 | 538 | 0.5172 | 0.7389 | 0.6418 | Dense |
| en | NanoMIRACL | en | natural_language | 200 | 10,000 | 560 | 0.6774 | 0.7721 | 0.7474 | Dense |
| es | NanoMIRACL | es | natural_language | 200 | 10,000 | 934 | 0.6861 | 0.7793 | 0.7478 | Dense |
| fa | NanoMIRACL | fa | natural_language | 200 | 10,000 | 427 | 0.5788 | 0.6476 | 0.6334 | Dense |
| fi | NanoMIRACL | fi | natural_language | 200 | 10,000 | 328 | 0.7734 | 0.8634 | 0.8332 | Dense |
| fr | NanoMIRACL | fr | natural_language | 200 | 10,000 | 417 | 0.4658 | 0.6828 | 0.5896 | Dense |
| hi | NanoMIRACL | hi | natural_language | 200 | 10,000 | 410 | 0.3037 | 0.6847 | 0.5174 | Dense |
| id | NanoMIRACL | id | natural_language | 200 | 10,000 | 654 | 0.6773 | 0.7076 | 0.7171 | Reranking hybrid |
| ja | NanoMIRACL | ja | natural_language | 200 | 10,000 | 373 | 0.6601 | 0.7745 | 0.7223 | Dense |
| ko | NanoMIRACL | ko | natural_language | 200 | 10,000 | 508 | 0.4994 | 0.6910 | 0.7026 | Reranking hybrid |
| ru | NanoMIRACL | ru | natural_language | 200 | 10,000 | 555 | 0.5887 | 0.7693 | 0.6816 | Dense |
| sw | NanoMIRACL | multilingual | natural_language | 200 | 10,000 | 405 | 0.5852 | 0.7872 | 0.7292 | Dense |
| te | NanoMIRACL | te | natural_language | 200 | 10,000 | 211 | 0.5292 | 0.8720 | 0.6953 | Dense |
| th | NanoMIRACL | th | natural_language | 200 | 10,000 | 343 | 0.6229 | 0.8101 | 0.7296 | Dense |
| yo | NanoMIRACL | multilingual | natural_language | 119 | 10,000 | 144 | 0.5816 | 0.8416 | 0.7651 | Dense |
| zh | NanoMIRACL | zh | natural_language | 200 | 10,000 | 471 | 0.4022 | 0.7191 | 0.5619 | Dense |