HAKARI-Bench

NanoMIRACL

Overview

NanoMIRACL is a language-specific Nano benchmark for MIRACL, a multilingual ad hoc retrieval benchmark built around Wikipedia passage retrieval. The original MIRACL work covers eighteen languages and asks a monolingual retrieval question in each split: an Arabic query retrieves Arabic passages, a Japanese query retrieves Japanese passages, and so on. This group keeps that retrieval setting while making the task small enough to inspect one language at a time.

The group is valuable because it holds the high-level task constant while changing script, morphology, tokenization behavior, resource level, and Wikipedia coverage. The model is not translating and is not answering from a fixed article. It must rank the passages that contain the answer-bearing evidence for a short natural-language question near the top of the list. In the current Nano metadata, BM25 is often a strong lexical anchor, dense retrieval from harrier_oss_v1_270m is usually the best top-rank signal, and reranking_hybrid gives the broadest top-100 candidate coverage.

What This Group Measures

The 2023 paper "MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages" and its 2022 arXiv version, "Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages", describe MIRACL as a multilingual retrieval benchmark with native-language queries, Wikipedia passages, and relevance judgments. NanoMIRACL should be read as a compact monolingual passage-retrieval suite derived from that benchmark.

The shared relevance relation is evidence retrieval: the positive passage should answer or directly support the query. The task is therefore different from generic semantic similarity. A high-scoring retriever must keep exact entity names, dates, and article-title clues when they matter, while also recognizing the relation asked by the question. This is especially important in languages where segmentation, inflection, named-entity spelling, or short query length can change sparse and dense behavior.

Task Families

Dataset Shape

NanoMIRACL contains 18 task pages, 3,519 queries, 180,000 split-local documents, and 8,071 positive qrel rows. Each split has 10,000 documents in the current metadata, so the 180,000 total is a sum over language-local candidate pools, not a deduplicated corpus size. Seventeen languages have 200 queries each, while Yoruba has 119. Positive density differs substantially: Telugu averages close to one positive per query (211 positives for 200 queries), while Spanish averages more than four (934 for 200).
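
The totals and positive densities above can be re-derived from the per-language counts. The following is a minimal sketch, assuming the query and positive counts are copied by hand from the Language Summary table on this page; the variable names are illustrative and not part of any NanoMIRACL tooling.

```python
# Per-language (queries, positive qrel rows), copied from the Language Summary table.
SPLITS = {
    "ar": (200, 386), "bn": (200, 407), "de": (200, 538), "en": (200, 560),
    "es": (200, 934), "fa": (200, 427), "fi": (200, 328), "fr": (200, 417),
    "hi": (200, 410), "id": (200, 654), "ja": (200, 373), "ko": (200, 508),
    "ru": (200, 555), "sw": (200, 405), "te": (200, 211), "th": (200, 343),
    "yo": (119, 144), "zh": (200, 471),
}
DOCS_PER_SPLIT = 10_000  # every split has the same pool size in the current metadata

total_queries = sum(q for q, _ in SPLITS.values())
total_positives = sum(p for _, p in SPLITS.values())
total_docs = DOCS_PER_SPLIT * len(SPLITS)  # a sum of pools, not a deduplicated corpus

# Positives per query for each language.
density = {lang: p / q for lang, (q, p) in SPLITS.items()}

print(total_queries, total_docs, total_positives)
print(round(total_positives / total_queries, 2))
for lang in sorted(density, key=density.get):
    print(f"{lang}: {density[lang]:.2f} positives per query")
```

Sorting by density places Telugu at the sparse end and Spanish at the dense end, which matters when comparing nDCG@10 across languages with very different numbers of positives.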

Queries are short, with a query-weighted mean around 37.6 characters. Documents are compact passages, with a document-weighted mean around 353 characters. CJK splits have much shorter character counts than European-language splits, so raw character length should not be compared as if it were token length. The group is best interpreted as eighteen parallel retrieval conditions rather than one large multilingual pool.

Retrieval Behavior

BM25 Profile

BM25 is strongest when the query contains rare words, entity names, article titles, dates, or other exact anchors that survive tokenization. Finnish, Spanish, English, Indonesian, and Japanese are among the stronger sparse profiles in the current metadata. BM25 is less reliable when the relevant passage shares many terms with non-answering neighbors or when segmentation and morphology make exact matching brittle.

At the group level, BM25 is not merely a weak baseline. It provides high top-100 coverage and identifies tasks where lexical memorization or exact surface-form handling is still central. For model researchers, BM25-led or BM25-competitive splits should be treated as warnings that dense-only retrieval may be discarding useful exact-match evidence.
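
To make the exact-anchor behavior concrete, here is a minimal sketch of textbook Okapi BM25 scoring. It is not the benchmark's BM25 implementation: the whitespace tokenization, the `k1` and `b` values, and the toy documents are all assumptions for illustration, and whitespace splitting in particular would not work for unsegmented scripts such as Japanese, Chinese, or Thai.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue  # only exact surface-form matches contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy English passages; real NanoMIRACL passages are Wikipedia text.
docs = [
    "helsinki is the capital of finland".split(),
    "finland borders sweden norway and russia".split(),
    "the capital of sweden is stockholm".split(),
]
query = "capital of finland".split()
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
print(best, [round(s, 3) for s in scores])
```

The third passage shares two of the three query terms but does not answer the question, which is the kind of non-answering neighbor that makes sparse ranking less reliable.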

Dense Profile

Dense retrieval with harrier_oss_v1_270m is the best nDCG@10 profile for 16 of the 18 NanoMIRACL languages. It is especially useful when the query and passage express the same answer relation with different wording, or when the relevant passage is not the one with the highest exact lexical overlap. Dense retrieval handles short questions better because it can rank evidence passages by answerability rather than by surface-term count alone.

The dense profile is still not a full replacement for sparse retrieval. Rare names, spellings, and transliterated entities can be smoothed away by embedding similarity. The most useful comparison is therefore not simply BM25 versus dense, but which languages are dense-led, which are sparse-competitive, and which lose top-100 coverage under dense retrieval.
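
The dense-led versus sparse-competitive comparison can be sketched directly from the nDCG@10 values in the Language Summary table. The 0.10 cutoff below is a hypothetical threshold chosen only for illustration, and Recall@100 is left out because this page does not list per-language recall values.

```python
from collections import Counter

# (BM25, dense, reranking_hybrid) nDCG@10, copied from the Language Summary table.
NDCG10 = {
    "ar": (0.6352, 0.8223, 0.7514), "bn": (0.5033, 0.7661, 0.6537),
    "de": (0.5172, 0.7389, 0.6418), "en": (0.6774, 0.7721, 0.7474),
    "es": (0.6861, 0.7793, 0.7478), "fa": (0.5788, 0.6476, 0.6334),
    "fi": (0.7734, 0.8634, 0.8332), "fr": (0.4658, 0.6828, 0.5896),
    "hi": (0.3037, 0.6847, 0.5174), "id": (0.6773, 0.7076, 0.7171),
    "ja": (0.6601, 0.7745, 0.7223), "ko": (0.4994, 0.6910, 0.7026),
    "ru": (0.5887, 0.7693, 0.6816), "sw": (0.5852, 0.7872, 0.7292),
    "te": (0.5292, 0.8720, 0.6953), "th": (0.6229, 0.8101, 0.7296),
    "yo": (0.5816, 0.8416, 0.7651), "zh": (0.4022, 0.7191, 0.5619),
}
PROFILES = ("BM25", "Dense", "Reranking hybrid")
GAP_THRESHOLD = 0.10  # hypothetical: a smaller dense lead counts as sparse-competitive

# Dense lead over BM25 per language.
gaps = {lang: dense - bm25 for lang, (bm25, dense, _) in NDCG10.items()}
sparse_competitive = sorted(lang for lang, gap in gaps.items() if gap < GAP_THRESHOLD)

# Profile with the highest nDCG@10 per language.
best_profile = {
    lang: PROFILES[max(range(3), key=scores.__getitem__)]
    for lang, scores in NDCG10.items()
}

print(Counter(best_profile.values()))
print("sparse-competitive:", sparse_competitive)
for lang in sorted(gaps, key=gaps.get):
    print(f"{lang}: dense lead {gaps[lang]:+.4f}")
```

Under this assumed cutoff, English, Spanish, Persian, Finnish, and Indonesian come out as sparse-competitive, while Hindi shows the largest dense lead. A different threshold would move the boundary, so the grouping is a reading aid rather than a fixed classification.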

Reranking Hybrid Profile

reranking_hybrid combines the retrieval strengths needed for top-100 reranker candidate generation. It is not always the best nDCG@10 sorter, but it often has the safest Recall@100 because it can keep positives found by either lexical or dense retrieval. Indonesian and Korean are the two languages where the hybrid profile is best by nDCG@10 in the current metadata.

For reranker experiments, this profile should be read as the practical candidate pool. If dense has the best nDCG@10 but hybrid has better recall, the task is telling a clear story: first-stage dense ranking is good, but a reranker benefits from candidates recovered by both sparse and dense retrieval.
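
This page does not specify how reranking_hybrid merges its inputs. As one plausible illustration of keeping positives found by either retriever, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique swapped in here as an assumption. The function name, the `k=60` constant, and the document IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked lists of doc IDs with RRF and return the top_n fused IDs."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_n]

# Hypothetical top results from a sparse and a dense first stage.
bm25_top = ["d3", "d1", "d7"]
dense_top = ["d1", "d9", "d3"]

candidates = reciprocal_rank_fusion([bm25_top, dense_top])
print(candidates)  # ['d1', 'd3', 'd9', 'd7']
```

Documents returned by both retrievers rise to the top, while `d7` and `d9`, each found by only one retriever, still stay in the candidate pool for the reranker.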

Language Summary

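| Language | Task | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arabic | ar | 200 | 10,000 | 386 | 0.6352 | 0.8223 | 0.7514 | Dense |
| Bengali | bn | 200 | 10,000 | 407 | 0.5033 | 0.7661 | 0.6537 | Dense |
| German | de | 200 | 10,000 | 538 | 0.5172 | 0.7389 | 0.6418 | Dense |
| English | en | 200 | 10,000 | 560 | 0.6774 | 0.7721 | 0.7474 | Dense |
| Spanish | es | 200 | 10,000 | 934 | 0.6861 | 0.7793 | 0.7478 | Dense |
| Persian | fa | 200 | 10,000 | 427 | 0.5788 | 0.6476 | 0.6334 | Dense |
| Finnish | fi | 200 | 10,000 | 328 | 0.7734 | 0.8634 | 0.8332 | Dense |
| French | fr | 200 | 10,000 | 417 | 0.4658 | 0.6828 | 0.5896 | Dense |
| Hindi | hi | 200 | 10,000 | 410 | 0.3037 | 0.6847 | 0.5174 | Dense |
| Indonesian | id | 200 | 10,000 | 654 | 0.6773 | 0.7076 | 0.7171 | Reranking hybrid |
| Japanese | ja | 200 | 10,000 | 373 | 0.6601 | 0.7745 | 0.7223 | Dense |
| Korean | ko | 200 | 10,000 | 508 | 0.4994 | 0.6910 | 0.7026 | Reranking hybrid |
| Russian | ru | 200 | 10,000 | 555 | 0.5887 | 0.7693 | 0.6816 | Dense |
| Swahili | sw | 200 | 10,000 | 405 | 0.5852 | 0.7872 | 0.7292 | Dense |
| Telugu | te | 200 | 10,000 | 211 | 0.5292 | 0.8720 | 0.6953 | Dense |
| Thai | th | 200 | 10,000 | 343 | 0.6229 | 0.8101 | 0.7296 | Dense |
| Yoruba | yo | 119 | 10,000 | 144 | 0.5816 | 0.8416 | 0.7651 | Dense |
| Chinese | zh | 200 | 10,000 | 471 | 0.4022 | 0.7191 | 0.5619 | Dense |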

Interpretation Notes for Model Researchers

Read NanoMIRACL as a controlled multilingual retrieval comparison. The task family is stable, but the tokenizer, script, entity distribution, and Wikipedia coverage change by language. A model that improves only English, Spanish, or French may be learning resource-rich Wikipedia retrieval rather than robust multilingual passage retrieval. Conversely, gains on Japanese, Chinese, Korean, Thai, Bengali, Telugu, Swahili, or Yoruba may reveal improvements in segmentation, multilingual representation, or low-resource transfer.

The most informative comparisons are the profile switches. Dense-led languages show where semantic answerability helps. BM25-competitive languages show where surface forms remain essential. Hybrid-led languages suggest complementarity between exact anchors and embedding similarity. Because many queries have multiple positives, nDCG@10 and Recall@100 should be inspected together: a model can find one acceptable passage while still ranking the broader positive set poorly.
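
The interaction between multiple positives and the two metrics can be sketched as follows. This is a minimal illustration assuming binary relevance (every positive has gain 1); the function names and the toy ranking are hypothetical and do not reproduce the benchmark's evaluation code.

```python
import math

def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k: each positive passage has gain 1."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    # The ideal ranking places all positives (up to k) at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0

def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of the positive set found in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & positives) / len(positives)

# One query with four positives; the system finds only one, at rank 1.
positives = {"p1", "p2", "p3", "p4"}
ranked = ["p1"] + [f"n{i}" for i in range(150)]

ndcg = ndcg_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
print(round(ndcg, 2), recall)  # 0.39 0.25
```

The answer at rank 1 looks like a success for a single-answer reading, yet nDCG@10 is about 0.39 and Recall@100 is 0.25 because three of the four positives are missing from the ranking.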

Training and Leakage Notes

Useful training data includes non-overlapping MIRACL training data, language-matched Wikipedia question-to-passage pairs, open-domain QA evidence retrieval data, and hard negatives drawn from same-article or same-entity passages. Training should preserve the monolingual design: the query and document should stay in the same language unless a separate cross-lingual experiment is being run.

Leakage control is important. Exclude NanoMIRACL evaluation queries, qrels, positive passages, and direct translations of evaluation examples. Upstream MIRACL development or test rows should be checked for overlap before use. Synthetic data should preserve named entities, dates, numerals, aliases, orthography, and article-title conventions, and should include hard negatives that share surface terms but do not answer the specific relation.
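
A first-pass overlap check can be sketched as an exact-match filter after light normalization. The row schema (`query` and `passage` keys), the function names, and the sample strings are hypothetical. This catches only verbatim or near-verbatim copies, so translations and paraphrases of evaluation examples would need a separate check.

```python
import unicodedata

def normalize(text):
    """NFKC-normalize, casefold, and collapse whitespace before comparison."""
    return " ".join(unicodedata.normalize("NFKC", text).casefold().split())

def drop_leaky_rows(train_rows, eval_queries, eval_positive_passages):
    """Remove training rows whose query or passage matches evaluation data."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_positive_passages}
    kept = []
    for row in train_rows:
        if normalize(row["query"]) in blocked_queries:
            continue  # evaluation query reused in training
        if normalize(row["passage"]) in blocked_passages:
            continue  # evaluation positive passage reused in training
        kept.append(row)
    return kept

# Hypothetical evaluation and training examples.
eval_queries = ["Who founded Helsinki?"]
eval_positives = ["Helsinki was founded by Gustav I of Sweden in 1550."]
train_rows = [
    {"query": "who founded  helsinki?", "passage": "Some other passage."},
    {"query": "When was Turku founded?", "passage": "Turku is the oldest city in Finland."},
    {"query": "Founding of Helsinki", "passage": "Helsinki was founded by Gustav I of Sweden in 1550."},
]

clean = drop_leaky_rows(train_rows, eval_queries, eval_positives)
print(len(clean), clean[0]["query"])
```

Only the Turku row survives: the first row repeats an evaluation query with different casing and spacing, and the third reuses an evaluation positive passage.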

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages | 2022 | paper | https://arxiv.org/abs/2210.09984 |
| MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages | 2023 | paper | https://aclanthology.org/2023.tacl-1.63/ |
| MIRACL GitHub repository | | project | https://github.com/project-miracl/miracl |
| MIRACL corpus dataset | | dataset | https://huggingface.co/datasets/miracl/miracl-corpus |
| MIRACL source queries and qrels | | dataset | https://huggingface.co/datasets/miracl/miracl |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 18 |
| Queries | 3,519 |
| Split-local documents | 180,000 |
| Positive qrels | 8,071 |
| Languages | ar, bn, de, en, es, fa, fi, fr, hi, id, ja, ko, multilingual, ru, te, th, zh |
| Categories | natural_language |
| Positives / query avg | 2.29 |

Task Metadata Summary

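| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ar | NanoMIRACL | ar | natural_language | 200 | 10,000 | 386 | 0.6352 | 0.8223 | 0.7514 | Dense |
| bn | NanoMIRACL | bn | natural_language | 200 | 10,000 | 407 | 0.5033 | 0.7661 | 0.6537 | Dense |
| de | NanoMIRACL | de | natural_language | 200 | 10,000 | 538 | 0.5172 | 0.7389 | 0.6418 | Dense |
| en | NanoMIRACL | en | natural_language | 200 | 10,000 | 560 | 0.6774 | 0.7721 | 0.7474 | Dense |
| es | NanoMIRACL | es | natural_language | 200 | 10,000 | 934 | 0.6861 | 0.7793 | 0.7478 | Dense |
| fa | NanoMIRACL | fa | natural_language | 200 | 10,000 | 427 | 0.5788 | 0.6476 | 0.6334 | Dense |
| fi | NanoMIRACL | fi | natural_language | 200 | 10,000 | 328 | 0.7734 | 0.8634 | 0.8332 | Dense |
| fr | NanoMIRACL | fr | natural_language | 200 | 10,000 | 417 | 0.4658 | 0.6828 | 0.5896 | Dense |
| hi | NanoMIRACL | hi | natural_language | 200 | 10,000 | 410 | 0.3037 | 0.6847 | 0.5174 | Dense |
| id | NanoMIRACL | id | natural_language | 200 | 10,000 | 654 | 0.6773 | 0.7076 | 0.7171 | Reranking hybrid |
| ja | NanoMIRACL | ja | natural_language | 200 | 10,000 | 373 | 0.6601 | 0.7745 | 0.7223 | Dense |
| ko | NanoMIRACL | ko | natural_language | 200 | 10,000 | 508 | 0.4994 | 0.6910 | 0.7026 | Reranking hybrid |
| ru | NanoMIRACL | ru | natural_language | 200 | 10,000 | 555 | 0.5887 | 0.7693 | 0.6816 | Dense |
| sw | NanoMIRACL | multilingual | natural_language | 200 | 10,000 | 405 | 0.5852 | 0.7872 | 0.7292 | Dense |
| te | NanoMIRACL | te | natural_language | 200 | 10,000 | 211 | 0.5292 | 0.8720 | 0.6953 | Dense |
| th | NanoMIRACL | th | natural_language | 200 | 10,000 | 343 | 0.6229 | 0.8101 | 0.7296 | Dense |
| yo | NanoMIRACL | multilingual | natural_language | 119 | 10,000 | 144 | 0.5816 | 0.8416 | 0.7651 | Dense |
| zh | NanoMIRACL | zh | natural_language | 200 | 10,000 | 471 | 0.4022 | 0.7191 | 0.5619 | Dense |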