HAKARI-Bench

NanoMMTEB-v2

Overview

NanoMMTEB-v2 is a compact retrieval group drawn from multilingual MTEB/MMTEB retrieval tasks. It is intentionally heterogeneous. A single group combines legal statute retrieval, counterargument retrieval, multilingual reading comprehension, Chinese COVID policy search, FAQ-style retrieval, legal bill retrieval, long-context passkey retrieval, MIRACL, MLQA, scientific related-paper retrieval, spatial and temporal reasoning, StackOverflow QA, StatCan dialogue-to-table retrieval, TREC-COVID, Danish Twitter advice retrieval, multilingual Wikipedia QA, and WinoGrande-style referent retrieval.

The group is useful as a mixed-domain multilingual stress test. It does not isolate one source benchmark, one language, or one relevance relation. BM25 identifies tasks where answer text, legal terms, or web vocabulary repeat; dense retrieval identifies semantic, multilingual, and reasoning-style gains; reranking_hybrid highlights tasks where exact anchors and semantic candidates recover different positives.

What This Group Measures

MMTEB (Massive Multilingual Text Embedding Benchmark) expands the MTEB framework to a wide multilingual task inventory. MTEB (Massive Text Embedding Benchmark) provides the retrieval interface used by many source tasks. NanoMMTEB-v2 is a Nano-style retrieval subset that samples a diverse set of those tasks.

The group measures robustness under task heterogeneity. Some tasks are normal passage retrieval; others convert legal, reasoning, dialogue, FAQ, or long-context problems into retrieval. The common format is query, corpus, qrels, and candidate rankings, but the relevance relation changes sharply by task.

Task Families

Dataset Shape

NanoMMTEB-v2 contains 18 task pages, 3,248 queries, 116,569 split-local documents, and 9,408 positive qrel rows, or about 2.90 positives per query on average. The group mixes single-positive tasks with strongly multi-positive ones. TREC-COVID is the extreme case at roughly 90 positives per query (4,527 positives over 50 queries), while SCIDOCS, AILAStatutes, Twitter Hjerne, MIRACL, and StatCan also require multi-positive interpretation.
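The multi-positive structure can be checked directly from the per-task counts in the Task Metadata Summary. The following is a minimal sketch with counts copied from that table for a handful of tasks; the dictionary layout and the helper name `positives_per_query` are illustrative assumptions, not part of any benchmark tooling.

```python
# (queries, positive qrel rows) copied from the Task Metadata Summary.
# The dict layout and helper name are illustrative assumptions.
COUNTS = {
    "treccovid": (50, 4527),
    "scidocs": (200, 986),
    "ailastatutes": (50, 217),
    "twitter_hjerne": (77, 262),
    "miracl": (200, 444),
    "statcan_dialogue_dataset": (200, 313),
    "hagrid": (200, 200),
}


def positives_per_query(counts):
    """Return the average number of positive qrel rows per query for each task."""
    return {task: positives / queries for task, (queries, positives) in counts.items()}


ratios = positives_per_query(COUNTS)

# Print tasks from most to least multi-positive.
for task, ratio in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{task:28s} {ratio:6.2f}")

# Group-level average from the Metadata Summary totals.
group_avg = 9408 / 3248
print(f"{'group average':28s} {group_avg:6.2f}")
```

A ratio of exactly 1.0 marks a single-positive task; anything well above 1.0 is a signal to look at recall-oriented metrics as well as top-10 ranking quality.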

Text shape varies from one-word or short answer candidates to long legal queries and long passkey documents. statcan_dialogue_dataset uses dialogue logs and table metadata; lembpasskey uses long synthetic documents; legal statute retrieval uses long fact scenarios; reasoning tasks can have very short answer documents. This diversity is the point of the group.

Retrieval Behavior

BM25 Profile

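BM25 scores highest in absolute terms on tasks with direct answer or term repetition: lembpasskey, hagrid, wikipedia_multilingual, corporate lobbying, StackOverflow QA, Chinese COVID retrieval, and MIRACL, with nDCG@10 ranging from 0.5760 (MIRACL) to 0.9963 (lembpasskey). These tasks often expose names, answer terms, exact legal or policy words, or distinctive support text. BM25 is the best profile on only three of them: lembpasskey, hagrid, and covid.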

BM25 is weak on tasks where the answer must be inferred or represented through a different format: statcan_dialogue_dataset, temp_reason_l1, mlqa, belebele, and scidocs. In these cases, exact term overlap is not enough or the target text is too short to carry many lexical anchors.

Dense Profile

Dense retrieval is strongest on many low-BM25 tasks. It improves Belebele, MLQA, StatCan, Twitter Hjerne, SCIDOCS, MIRACL, StackOverflow QA, and several legal or scientific tasks by matching semantic intent or multilingual evidence where exact overlap is weak. This group is therefore useful for identifying whether dense models handle heterogeneous retrieval relations, not just one passage-search format.

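Dense retrieval can still lose exact anchors. Passkey, COVID policy retrieval, FAQ-style retrieval, and short-answer reasoning tasks can depend on exact tokens, dates, names, or answer strings; in the task summary, dense trails BM25 on lembpasskey, covid, hagrid, and wino_grande. Legal bill retrieval leans on exact terms too, although dense still edges out BM25 there (0.9110 vs 0.8955).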

Reranking Hybrid Profile

reranking_hybrid is the best profile on three tasks, spart_qa, treccovid, and wino_grande, and remains competitive on many others. These are cases where sparse and dense candidate sets provide complementary evidence. In WinoGrande and spatial reasoning, a short answer may require both context matching and exact candidate discrimination.

For reranker experiments, this group is valuable because candidate-generation failure modes differ by task. The same reranker pool will face long documents, short answer strings, multilingual passages, code answers, legal text, and scientific abstracts.
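One way to inspect candidate-generation failure modes is to split each query's positives by which first-stage list retrieved them. The sketch below assumes the reranker pool is the plain union of a BM25 top-k list and a dense top-k list; this document does not specify how reranking_hybrid builds its pool, so treat that as an assumption. The function name and document IDs are hypothetical.

```python
def coverage_breakdown(positives, sparse_topk, dense_topk):
    """Split one query's positives by which candidate list retrieved them."""
    positives = set(positives)
    sparse_hits = positives & set(sparse_topk)
    dense_hits = positives & set(dense_topk)
    return {
        "both": sparse_hits & dense_hits,
        "sparse_only": sparse_hits - dense_hits,
        "dense_only": dense_hits - sparse_hits,
        "missed": positives - sparse_hits - dense_hits,
    }


# Hypothetical document IDs for a single query.
positives = ["d1", "d2", "d3", "d4"]
bm25_top = ["d1", "d9", "d2", "d7"]
dense_top = ["d2", "d3", "d8", "d6"]

breakdown = coverage_breakdown(positives, bm25_top, dense_top)

# Recall of the pooled (union) candidate set: the ceiling for any reranker.
union_recall = 1 - len(breakdown["missed"]) / len(positives)

for bucket, doc_ids in breakdown.items():
    print(bucket, sorted(doc_ids))
print("union recall:", union_recall)
```

Large `sparse_only` and `dense_only` buckets indicate complementary candidate sets, while a large `missed` bucket means no reranker can recover those positives from this pool.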

Task Summary

| Task | Language | Retrieval focus | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ailastatutes | en | legal scenario to statute | 50 | 82 | 0.2070 | 0.2725 | 0.2557 | Dense |
| argu_ana | en | argument to counterargument | 199 | 8,626 | 0.3464 | 0.3998 | 0.3716 | Dense |
| belebele | multilingual | reading comprehension to answer | 376 | 10,000 | 0.0903 | 0.2781 | 0.1782 | Dense |
| covid | zh | COVID query to policy/news passage | 200 | 10,000 | 0.7888 | 0.7592 | 0.7873 | BM25 |
| hagrid | en | FAQ-style query to answer | 200 | 493 | 0.9814 | 0.9570 | 0.9639 | BM25 |
| legal_bench_corporate_lobbying | en | policy description to bill summary | 200 | 319 | 0.8955 | 0.9110 | 0.9080 | Dense |
| lembpasskey | en | passkey query to long context | 100 | 100 | 0.9963 | 0.8463 | 0.8525 | BM25 |
| miracl | multilingual | question to Wikipedia passage | 200 | 10,000 | 0.5760 | 0.7775 | 0.6942 | Dense |
| mlqa | multilingual | multilingual QA to evidence | 196 | 10,000 | 0.0390 | 0.0959 | 0.0534 | Dense |
| scidocs | en | paper title to related document | 200 | 10,000 | 0.2067 | 0.2773 | 0.2590 | Dense |
| spart_qa | en | spatial reasoning query to answer | 200 | 1,592 | 0.1848 | 0.2591 | 0.3382 | Reranking hybrid |
| stack_overflow_qa | en | developer question to answer | 200 | 10,000 | 0.7970 | 0.8886 | 0.8457 | Dense |
| statcan_dialogue_dataset | multilingual | dialogue to statistical table | 200 | 10,000 | 0.0112 | 0.2731 | 0.1564 | Dense |
| temp_reason_l1 | multilingual | temporal reasoning to date answer | 200 | 10,000 | 0.0161 | 0.0488 | 0.0134 | Dense |
| treccovid | en | COVID topic to biomedical abstracts | 50 | 10,000 | 0.3627 | 0.4266 | 0.4505 | Reranking hybrid |
| twitter_hjerne | da | Danish tweet to advice response | 77 | 262 | 0.2395 | 0.6243 | 0.4402 | Dense |
| wikipedia_multilingual | multilingual | question to Wikipedia answer passage | 200 | 10,000 | 0.9425 | 0.9624 | 0.9452 | Dense |
| wino_grande | en | pronoun reasoning context to referent | 200 | 5,095 | 0.5084 | 0.4940 | 0.6139 | Reranking hybrid |
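The Best profile column is consistent with picking the profile that has the highest nDCG@10 in each row. That reading is inferred from the table, not stated in it. A small sketch under that assumption, using three rows copied from the table; the helper names `best_profile` and `margin` are illustrative.

```python
PROFILES = ("BM25", "Dense", "Reranking hybrid")

# (BM25, Dense, Reranking hybrid) nDCG@10, copied from the Task Summary.
ROWS = {
    "lembpasskey": (0.9963, 0.8463, 0.8525),
    "statcan_dialogue_dataset": (0.0112, 0.2731, 0.1564),
    "wino_grande": (0.5084, 0.4940, 0.6139),
}


def best_profile(bm25, dense, hybrid):
    """Return the name of the profile with the highest nDCG@10."""
    scores = dict(zip(PROFILES, (bm25, dense, hybrid)))
    return max(scores, key=scores.get)


def margin(bm25, dense, hybrid):
    """Gap between the best and second-best nDCG@10 for one task."""
    ordered = sorted((bm25, dense, hybrid), reverse=True)
    return ordered[0] - ordered[1]


labels = {task: best_profile(*scores) for task, scores in ROWS.items()}

for task, scores in ROWS.items():
    print(f"{task:28s} {labels[task]:18s} margin={margin(*scores):.4f}")
```

Reporting the margin next to the label helps separate decisive wins from near ties, which the label alone hides.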

Interpretation Notes for Model Researchers

NanoMMTEB-v2 should be read as a stress test for breadth. It does not tell a single story about one language or one retrieval relation. Instead, it exposes whether a model is robust across legal, QA, reasoning, scientific, code, social, and long-context formats.

The most useful analysis is task-family based. BM25-heavy tasks test exact anchors and short answer matching. Dense-heavy tasks test semantic and multilingual transfer. Hybrid-led tasks test complementarity and candidate coverage. Because several tasks are multi-positive, Recall@100 should be read alongside nDCG@10.
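To illustrate why the two metrics diverge on multi-positive tasks, here is a minimal sketch of nDCG@10 and Recall@100 for a single query. It assumes binary relevance (every qrel row counts as gain 1); source tasks with graded qrels would need graded gains instead. The function names and document IDs are hypothetical, and this is not the evaluation code behind the reported numbers.

```python
import math


def ndcg_at_k(ranking, positives, k=10):
    """Binary-relevance nDCG@k for one query."""
    positives = set(positives)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranking[:k])
        if doc_id in positives
    )
    # Ideal ranking places min(|positives|, k) positives at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranking, positives, k=100):
    """Fraction of a query's positives found in the top k."""
    positives = set(positives)
    if not positives:
        return 0.0
    return len(positives & set(ranking[:k])) / len(positives)


# Hypothetical query with 20 positives: two land in the top 10,
# the other 18 sit at ranks 11 to 28.
positives = [f"p{i}" for i in range(20)]
ranking = (
    ["p0", "n0", "n1", "n2", "p1"]
    + [f"n{i}" for i in range(3, 8)]
    + [f"p{i}" for i in range(2, 20)]
)

ndcg10 = ndcg_at_k(ranking, positives, k=10)
recall100 = recall_at_k(ranking, positives, k=100)
print(f"nDCG@10={ndcg10:.4f} Recall@100={recall100:.4f}")
```

In this example every positive is retrievable within the top 100, yet nDCG@10 stays low because only two of the twenty positives reach the first page. A reranker fed this candidate list has plenty of room to improve, which nDCG@10 alone would not reveal.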

Training and Leakage Notes

Useful training data must be task-matched: statute retrieval for legal tasks, multilingual QA evidence for Belebele/MIRACL/MLQA, scientific and biomedical retrieval for SCIDOCS/TREC-COVID, code QA for StackOverflow, and reasoning data for spatial, temporal, and WinoGrande-style tasks. Pooling everything into one generic similarity objective can erase the distinctions this group tests.

Exclude NanoMMTEB-v2 evaluation queries, positives, qrels, answer candidates, long contexts, and source rows. Because the group mixes public benchmark tasks, upstream evaluation splits should be audited carefully before use in training.
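A minimal sketch of such an exclusion step, assuming training rows are dictionaries with `query` and `positive` text fields (a hypothetical schema) and that the evaluation queries, positives, and answer candidates are available as plain strings. It only catches exact matches after whitespace and case normalization; near-duplicates and translated copies would need fuzzy or embedding-based checks on top.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of case-folded, whitespace-collapsed text for exact-match checks."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_training_rows(train_rows, eval_texts):
    """Drop training rows whose query or positive matches any evaluation text."""
    blocked = {fingerprint(text) for text in eval_texts}
    return [
        row
        for row in train_rows
        if fingerprint(row["query"]) not in blocked
        and fingerprint(row["positive"]) not in blocked
    ]


# Hypothetical evaluation texts and training rows.
eval_texts = ["What is the passkey?", "The passkey is 12345."]
train_rows = [
    {"query": "  what is  the PASSKEY?", "positive": "Some other passage."},
    {"query": "How do I sort a list?", "positive": "the passkey is 12345."},
    {"query": "Unrelated question", "positive": "Unrelated passage"},
]

clean_rows = filter_training_rows(train_rows, eval_texts)
print(len(train_rows) - len(clean_rows), "rows removed")
```

Hashing keeps the blocklist compact when the evaluation side includes long contexts, and the same fingerprints can be reused to audit upstream splits before they are mixed into training data.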

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| MMTEB: Massive Multilingual Text Embedding Benchmark | 2025 | paper | https://arxiv.org/abs/2502.13595 |
| MTEB: Massive Text Embedding Benchmark | 2022 | paper | https://arxiv.org/abs/2210.07316 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 18 |
| Queries | 3,248 |
| Split-local documents | 116,569 |
| Positive qrels | 9,408 |
| Languages | da, en, multilingual, zh |
| Categories | natural_language |
| Positives / query avg | 2.90 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ailastatutes | NanoMMTEB-v2 | en | natural_language | 50 | 82 | 217 | 0.2070 | 0.2725 | 0.2557 | Dense |
| argu_ana | NanoMMTEB-v2 | en | natural_language | 199 | 8,626 | 199 | 0.3464 | 0.3998 | 0.3716 | Dense |
| belebele | NanoMMTEB-v2 | multilingual | natural_language | 376 | 10,000 | 376 | 0.0903 | 0.2781 | 0.1782 | Dense |
| covid | NanoMMTEB-v2 | zh | natural_language | 200 | 10,000 | 204 | 0.7888 | 0.7592 | 0.7873 | BM25 |
| hagrid | NanoMMTEB-v2 | en | natural_language | 200 | 493 | 200 | 0.9814 | 0.9570 | 0.9639 | BM25 |
| legal_bench_corporate_lobbying | NanoMMTEB-v2 | en | natural_language | 200 | 319 | 200 | 0.8955 | 0.9110 | 0.9080 | Dense |
| lembpasskey | NanoMMTEB-v2 | en | natural_language | 100 | 100 | 100 | 0.9963 | 0.8463 | 0.8525 | BM25 |
| miracl | NanoMMTEB-v2 | multilingual | natural_language | 200 | 10,000 | 444 | 0.5760 | 0.7775 | 0.6942 | Dense |
| mlqa | NanoMMTEB-v2 | multilingual | natural_language | 196 | 10,000 | 196 | 0.0390 | 0.0959 | 0.0534 | Dense |
| scidocs | NanoMMTEB-v2 | en | natural_language | 200 | 10,000 | 986 | 0.2067 | 0.2773 | 0.2590 | Dense |
| spart_qa | NanoMMTEB-v2 | en | natural_language | 200 | 1,592 | 384 | 0.1848 | 0.2591 | 0.3382 | Reranking hybrid |
| stack_overflow_qa | NanoMMTEB-v2 | en | natural_language | 200 | 10,000 | 200 | 0.7970 | 0.8886 | 0.8457 | Dense |
| statcan_dialogue_dataset | NanoMMTEB-v2 | multilingual | natural_language | 200 | 10,000 | 313 | 0.0112 | 0.2731 | 0.1564 | Dense |
| temp_reason_l1 | NanoMMTEB-v2 | multilingual | natural_language | 200 | 10,000 | 200 | 0.0161 | 0.0488 | 0.0134 | Dense |
| treccovid | NanoMMTEB-v2 | en | natural_language | 50 | 10,000 | 4,527 | 0.3627 | 0.4266 | 0.4505 | Reranking hybrid |
| twitter_hjerne | NanoMMTEB-v2 | da | natural_language | 77 | 262 | 262 | 0.2395 | 0.6243 | 0.4402 | Dense |
| wikipedia_multilingual | NanoMMTEB-v2 | multilingual | natural_language | 200 | 10,000 | 200 | 0.9425 | 0.9624 | 0.9452 | Dense |
| wino_grande | NanoMMTEB-v2 | en | natural_language | 200 | 5,095 | 200 | 0.5084 | 0.4940 | 0.6139 | Reranking hybrid |