HAKARI-Bench

NanoMTEB-Misc

Overview

NanoMTEB-Misc is a mixed multilingual retrieval group for NanoMTEB-family tasks that do not belong cleanly to one language-specific benchmark set. It combines NeuCLIR 2022 Persian, Russian, and Chinese news retrieval; RuSciBench Russian scientific citation and co-citation retrieval; EuroPIRQ English, Finnish, and Portuguese legal passage retrieval; and German-French CLSD translation-pair retrieval from WMT19 and WMT21.

The group contains 1,636 queries, 99,624 task-local documents, and 7,538 positive qrel rows. It should be read as a stress test for mixed relevance definitions rather than as one coherent domain benchmark. Some tasks have broad multi-positive news or citation relevance, while others are single-positive legal question retrieval or cross-lingual translation-equivalence retrieval.

What This Group Measures

The group measures robustness across source families. The NeuCLIR tasks ask a model to retrieve many relevant target-language news articles from Persian, Russian, or Chinese topic statements. RuSciBench tasks use Russian scientific paper relations derived from citation graphs. EuroPIRQ tasks retrieve EU legal or administrative passages from synthetic questions in English, Finnish, and Portuguese. CLSD tasks retrieve the true German-French or French-German translation counterpart among close distractors.

Because the relevance relation changes from task to task, aggregate scores are only a starting point. A model can look strong by excelling at cross-lingual sentence retrieval while still struggling with broad news relevance or citation graph retrieval. Conversely, a sparse system can be competitive on legal passages while failing on translation-pair retrieval where exact word overlap is mostly unavailable.

Task Families

Dataset Shape

The group has twelve task pages. The NeuCLIR and RuSciBench tasks are multi-positive: NeuCLIR topics can have dozens of relevant news articles, and RuSciBench tasks use five positive papers per query. EuroPIRQ and CLSD are single-positive, so a query is expected to retrieve one target passage or translation counterpart.
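
The difference between single-positive and multi-positive tasks shows up directly in how the ranking metrics behave. The following is a minimal sketch of nDCG@10 and hit@10 for one query, assuming binary relevance (a document is either a positive or not); the function names and document IDs are illustrative and are not taken from the benchmark's own evaluation code.

```python
import math


def ndcg_at_k(ranked_ids, positive_ids, k=10):
    """Binary-relevance nDCG@k for a single query (assumed relevance model)."""
    positives = set(positive_ids)
    # Discounted gain of every positive found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ranking places all positives (up to k of them) first.
    ideal_hits = min(k, len(positives))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def hit_at_k(ranked_ids, positive_ids, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    positives = set(positive_ids)
    return 1.0 if any(doc_id in positives for doc_id in ranked_ids[:k]) else 0.0


# Single-positive query (EuroPIRQ or CLSD style): the positive sits at rank 2.
single = ndcg_at_k(["d7", "d1", "d3"], ["d1"])

# Multi-positive query (RuSciBench style, five positives): two are retrieved.
multi = ndcg_at_k(["d1", "x9", "d2"], ["d1", "d2", "d3", "d4", "d5"])
```

With one positive, nDCG@10 depends only on the rank of that document. With several positives, a query can register a hit at rank 1 and still score well below 1.0 because the remaining positives are missing from the top ten.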

Text length and corpus shape vary sharply. RuSciBench queries are long title-plus-abstract scientific texts. NeuCLIR documents are long news articles. EuroPIRQ documents are formal legal paragraphs. CLSD documents are short translated sentences. This variation makes NanoMTEB-Misc especially sensitive to tokenization, truncation, multilingual representation quality, and whether a model was trained on sentence-pair or document-retrieval supervision.

Retrieval Behavior

BM25 Profile

BM25 is strongest on the EuroPIRQ legal tasks. It is the best profile for fi and pt, and on en it is within 0.003 nDCG@10 of the best profile (0.9414 vs. 0.9438 for the reranking hybrid). The legal questions often preserve distinctive terms, institutions, dates, or entities from the target passage, which gives sparse retrieval a clear lexical signal. BM25 also remains useful for RuSciBench, where Russian title and abstract terms overlap with related scientific papers.

BM25 is weak on the CLSD translation tasks, especially compared with dense retrieval. This is expected because cross-lingual sentence retrieval gives BM25 few shared tokens beyond names, numbers, and international terms. It is also limited on NeuCLIR Chinese and Persian, where broad topic relevance and long news articles do not reduce to direct token overlap. The query-weighted BM25 nDCG@10 is 0.4700, making it a useful baseline but not the dominant retrieval profile for this group.

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest query-weighted profile: 0.7842 nDCG@10 and 0.9273 hit@10. Its advantage is most visible on the CLSD tasks. The four German-French translation retrieval tasks score between 0.8954 and 0.9574 nDCG@10 with dense retrieval, while BM25 ranges from 0.2204 to 0.4658 on the same tasks. This suggests that the dense model captures cross-lingual semantic equivalence much better than lexical overlap can.

Dense is also best on the Persian and Chinese NeuCLIR tasks and on cite_ru. For NeuCLIR, embedding similarity helps connect information needs with relevant news articles even when vocabulary differs. For citation retrieval, dense similarity captures topic and abstract-level relatedness. Dense is weaker than BM25 on all three EuroPIRQ tasks (en, fi, pt) and slightly weaker than hybrid on 2022_ru, en, and cocite_ru, but its overall lead makes NanoMTEB-Misc a strong multilingual semantic retrieval diagnostic.

Reranking Hybrid Profile

The reranking hybrid profile has the best query-weighted recall@100 at 0.9019, but it is not the best nDCG@10 profile overall. It is best on 2022_ru, en, and cocite_ru, where sparse and dense retrieval appear to recover complementary candidates. The Russian NeuCLIR and co-citation tasks are good examples of hybrid search helping when exact terms and semantic relatedness both matter.

Hybrid underperforms dense on the CLSD tasks because dense retrieval already captures the translation relation very strongly, while sparse evidence is weak. It also trails dense on Persian and Chinese NeuCLIR. The pattern is therefore not "hybrid always wins"; rather, hybrid is useful for candidate coverage and mixed evidence retrieval, while dense ranking is often better for cross-lingual semantic equivalence.
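
The source does not say how the reranking hybrid profile combines sparse and dense candidates. As an illustration of the candidate-coverage point only, here is a minimal sketch of reciprocal rank fusion (RRF), one common way to merge two rankings. The choice of RRF, the constant k=60, and the document IDs are all assumptions, not the benchmark's actual pipeline.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs with RRF (assumed fusion method)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

The fused list contains every candidate that either retriever found, which is why fusion tends to help recall. When one source is mostly noise, as BM25 is on the CLSD tasks, its candidates can push good dense results down the list.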

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2022_fa | NeuCLIR news retrieval | fa | 45 | 8,882 | 1,131 | 25.13 | 0.2600 | 0.4915 | 0.4138 | Dense |
| 2022_ru | NeuCLIR news retrieval | ru | 44 | 8,722 | 1,664 | 37.82 | 0.3490 | 0.5807 | 0.6011 | Reranking hybrid |
| 2022_zh | NeuCLIR news retrieval | zh | 47 | 10,000 | 1,643 | 34.96 | 0.2931 | 0.5101 | 0.4072 | Dense |
| cite_ru | Scientific citation retrieval | ru | 200 | 10,000 | 1,000 | 5.00 | 0.5566 | 0.6182 | 0.6134 | Dense |
| cocite_ru | Scientific co-citation retrieval | ru | 200 | 10,000 | 1,000 | 5.00 | 0.3920 | 0.4249 | 0.4346 | Reranking hybrid |
| en | EU legal passage retrieval | en | 100 | 9,422 | 100 | 1.00 | 0.9414 | 0.9255 | 0.9438 | Reranking hybrid |
| fi | EU legal passage retrieval | fi | 100 | 9,422 | 100 | 1.00 | 0.9092 | 0.8542 | 0.8813 | BM25 |
| pt | EU legal passage retrieval | pt | 100 | 9,517 | 100 | 1.00 | 0.9186 | 0.8623 | 0.8901 | BM25 |
| wmt19_de_fr | Translation retrieval | multilingual | 200 | 7,364 | 200 | 1.00 | 0.2204 | 0.9151 | 0.5447 | Dense |
| wmt19_fr_de | Translation retrieval | multilingual | 200 | 7,365 | 200 | 1.00 | 0.3078 | 0.9574 | 0.6054 | Dense |
| wmt21_de_fr | Translation retrieval | multilingual | 200 | 4,465 | 200 | 1.00 | 0.3127 | 0.9249 | 0.5988 | Dense |
| wmt21_fr_de | Translation retrieval | multilingual | 200 | 4,465 | 200 | 1.00 | 0.4658 | 0.8954 | 0.6999 | Dense |
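
The query-weighted figures quoted in the profile sections can be approximated from the per-task rows above. The sketch below assumes that "query-weighted" means each task's nDCG@10 is weighted by its query count; that reading is an assumption. The values are copied from the Task Summary, and the variable and function names are illustrative.

```python
PROFILES = ("bm25", "dense", "hybrid")

# (task, queries, BM25, dense, reranking hybrid) nDCG@10 from the Task Summary.
ROWS = [
    ("2022_fa", 45, 0.2600, 0.4915, 0.4138),
    ("2022_ru", 44, 0.3490, 0.5807, 0.6011),
    ("2022_zh", 47, 0.2931, 0.5101, 0.4072),
    ("cite_ru", 200, 0.5566, 0.6182, 0.6134),
    ("cocite_ru", 200, 0.3920, 0.4249, 0.4346),
    ("en", 100, 0.9414, 0.9255, 0.9438),
    ("fi", 100, 0.9092, 0.8542, 0.8813),
    ("pt", 100, 0.9186, 0.8623, 0.8901),
    ("wmt19_de_fr", 200, 0.2204, 0.9151, 0.5447),
    ("wmt19_fr_de", 200, 0.3078, 0.9574, 0.6054),
    ("wmt21_de_fr", 200, 0.3127, 0.9249, 0.5988),
    ("wmt21_fr_de", 200, 0.4658, 0.8954, 0.6999),
]


def query_weighted_mean(rows, profile):
    """Average a profile's per-task score, weighting each task by its queries."""
    column = 2 + PROFILES.index(profile)
    total_queries = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_queries


total_queries = sum(row[1] for row in ROWS)
bm25_mean = query_weighted_mean(ROWS, "bm25")
dense_mean = query_weighted_mean(ROWS, "dense")
hybrid_mean = query_weighted_mean(ROWS, "hybrid")
```

Because the per-task scores are already rounded to four decimals, the recomputed means should land close to the reported 0.4700 (BM25) and 0.7842 (dense) rather than match them digit for digit. The 800 CLSD queries make up almost half of the 1,636 queries, so they dominate any query-weighted aggregate.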

Interpretation Notes for Model Researchers

NanoMTEB-Misc is a good place to look for failure modes hidden by average scores. Dense models that handle German-French sentence equivalence well can score very highly on the CLSD block, but that does not imply strong NeuCLIR news retrieval or Russian citation retrieval. Sparse systems can look strong on EuroPIRQ legal passages while failing cross-lingual translation retrieval.

The group also contains two different kinds of multi-positive relevance. NeuCLIR evaluates broad ad hoc relevance with dozens of positives per topic, while RuSciBench uses graph-derived scientific relations with five positives per query. These are not the same retrieval problem. Researchers should inspect task-family means and not rely solely on the overall NanoMTEB-Misc score.
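
One way to inspect task-family means is to group the per-task scores by source family before averaging. The sketch below uses the BM25 and dense nDCG@10 values reported in this card and weights each task by its query count within its family. The weighting choice and all names in the code are assumptions made for illustration.

```python
from collections import defaultdict

# (task, family, queries, BM25 nDCG@10, dense nDCG@10) as reported in this card.
TASKS = [
    ("2022_fa", "NeuCLIR", 45, 0.2600, 0.4915),
    ("2022_ru", "NeuCLIR", 44, 0.3490, 0.5807),
    ("2022_zh", "NeuCLIR", 47, 0.2931, 0.5101),
    ("cite_ru", "RuSciBench", 200, 0.5566, 0.6182),
    ("cocite_ru", "RuSciBench", 200, 0.3920, 0.4249),
    ("en", "EuroPIRQ", 100, 0.9414, 0.9255),
    ("fi", "EuroPIRQ", 100, 0.9092, 0.8542),
    ("pt", "EuroPIRQ", 100, 0.9186, 0.8623),
    ("wmt19_de_fr", "CLSD", 200, 0.2204, 0.9151),
    ("wmt19_fr_de", "CLSD", 200, 0.3078, 0.9574),
    ("wmt21_de_fr", "CLSD", 200, 0.3127, 0.9249),
    ("wmt21_fr_de", "CLSD", 200, 0.4658, 0.8954),
]


def family_means(tasks):
    """Query-weighted BM25 and dense means per source family."""
    sums = defaultdict(lambda: {"queries": 0, "bm25": 0.0, "dense": 0.0})
    for _task, family, queries, bm25, dense in tasks:
        sums[family]["queries"] += queries
        sums[family]["bm25"] += queries * bm25
        sums[family]["dense"] += queries * dense
    return {
        family: {
            "bm25": acc["bm25"] / acc["queries"],
            "dense": acc["dense"] / acc["queries"],
        }
        for family, acc in sums.items()
    }


means = family_means(TASKS)
for family, scores in means.items():
    print(f"{family:10s} bm25={scores['bm25']:.4f} dense={scores['dense']:.4f}")
```

Broken out this way, the families point in opposite directions: BM25 leads dense on EuroPIRQ, while dense leads BM25 by a wide margin on CLSD. The overall score averages that disagreement away.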

Training and Leakage Notes

Training should be source-family specific. NeuCLIR benefits from multilingual news retrieval and same-event hard negatives. RuSciBench benefits from citation, co-citation, and scientific abstract representation learning. EuroPIRQ benefits from EU legal question-passage pairs and legal boilerplate negatives. CLSD benefits from German-French bitext retrieval and semantically close translation distractors.

Leakage control should exclude Nano queries, qrels, and positive documents from NeuCLIR, RuSciBench, EuroPIRQ, and CLSD/WMT-derived sources. Synthetic data should preserve the original relevance relation: broad news topics with many articles, citation-graph relations, legal question-to-passage mapping, and true translation equivalence with close cross-lingual negatives.
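
A minimal sketch of such a leakage filter is shown below. It assumes training data is a list of dicts with "query" and "doc_id" fields and that the Nano queries and positive document IDs are available as plain collections. These field names, the normalization rule, and the sample values are hypothetical.

```python
def normalize(text):
    """Case-fold and collapse whitespace so trivial variants still match."""
    return " ".join(text.casefold().split())


def filter_training_pairs(pairs, nano_queries, nano_positive_doc_ids):
    """Drop training pairs whose query or document overlaps the Nano eval data."""
    blocked_queries = {normalize(query) for query in nano_queries}
    blocked_docs = set(nano_positive_doc_ids)
    kept = []
    for pair in pairs:
        if normalize(pair["query"]) in blocked_queries:
            continue  # query text appears in the evaluation set
        if pair["doc_id"] in blocked_docs:
            continue  # document is a positive in the evaluation qrels
        kept.append(pair)
    return kept


# Hypothetical example data.
training_pairs = [
    {"query": "What does Article 5 require?", "doc_id": "train-001"},
    {"query": "  what does article 5 REQUIRE? ", "doc_id": "train-002"},
    {"query": "Who appoints the committee?", "doc_id": "eval-doc-17"},
]
clean_pairs = filter_training_pairs(
    training_pairs,
    nano_queries=["What does Article 5 require?"],
    nano_positive_doc_ids=["eval-doc-17"],
)
```

Exact matching on normalized text and IDs is only a first pass. Near-duplicate documents and paraphrased queries from the same upstream sources would need fuzzier matching, which this sketch does not attempt.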

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| Overview of the TREC 2022 NeuCLIR Track | 2023 | source task paper | https://arxiv.org/abs/2304.12367 |
| NeuCLIR official site | | project page | https://neuclir.github.io/ |
| RuSciBench: Open Benchmark for Russian and English Scientific Document Representations | 2024 | source task paper | https://doi.org/10.1134/S1064562424602191 |
| EuroPIRQ-retrieval | 2025 | dataset card | https://huggingface.co/datasets/eherra/EuroPIRQ-retrieval |
| MMTEB: Massive Multilingual Text Embedding Benchmark | 2025 | benchmark paper | https://arxiv.org/abs/2502.13595 |
| Cross-Lingual Semantic Discrimination for Building Better Multilingual Embeddings | 2025 | source task paper | https://arxiv.org/abs/2502.08638 |
| Andrianos/clsd_wmt19_21 | | dataset card | https://huggingface.co/datasets/Andrianos/clsd_wmt19_21 |
| mteb/NeuCLIR2022RetrievalHardNegatives | | dataset card | https://huggingface.co/datasets/mteb/NeuCLIR2022RetrievalHardNegatives |
| mlsa-iai-msu-lab/ru_sci_bench_cite_retrieval | | dataset card | https://huggingface.co/datasets/mlsa-iai-msu-lab/ru_sci_bench_cite_retrieval |
| mlsa-iai-msu-lab/ru_sci_bench_cocite_retrieval | | dataset card | https://huggingface.co/datasets/mlsa-iai-msu-lab/ru_sci_bench_cocite_retrieval |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 12 |
| Queries | 1,636 |
| Split-local documents | 99,624 |
| Positive qrels | 7,538 |
| Languages | en, fa, fi, multilingual, pt, ru, zh |
| Categories | natural_language |
| Positives / query avg | 4.61 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2022_fa | NanoMTEB-Misc | fa | natural_language | 45 | 8,882 | 1,131 | 0.2600 | 0.4915 | 0.4138 | Dense |
| 2022_ru | NanoMTEB-Misc | ru | natural_language | 44 | 8,722 | 1,664 | 0.3490 | 0.5807 | 0.6011 | Reranking hybrid |
| 2022_zh | NanoMTEB-Misc | zh | natural_language | 47 | 10,000 | 1,643 | 0.2931 | 0.5101 | 0.4072 | Dense |
| cite_ru | NanoMTEB-Misc | ru | natural_language | 200 | 10,000 | 1,000 | 0.5566 | 0.6182 | 0.6134 | Dense |
| cocite_ru | NanoMTEB-Misc | ru | natural_language | 200 | 10,000 | 1,000 | 0.3920 | 0.4249 | 0.4346 | Reranking hybrid |
| en | NanoMTEB-Misc | en | natural_language | 100 | 9,422 | 100 | 0.9414 | 0.9255 | 0.9438 | Reranking hybrid |
| fi | NanoMTEB-Misc | fi | natural_language | 100 | 9,422 | 100 | 0.9092 | 0.8542 | 0.8813 | BM25 |
| pt | NanoMTEB-Misc | pt | natural_language | 100 | 9,517 | 100 | 0.9186 | 0.8623 | 0.8901 | BM25 |
| wmt19_de_fr | NanoMTEB-Misc | multilingual | natural_language | 200 | 7,364 | 200 | 0.2204 | 0.9151 | 0.5447 | Dense |
| wmt19_fr_de | NanoMTEB-Misc | multilingual | natural_language | 200 | 7,365 | 200 | 0.3078 | 0.9574 | 0.6054 | Dense |
| wmt21_de_fr | NanoMTEB-Misc | multilingual | natural_language | 200 | 4,465 | 200 | 0.3127 | 0.9249 | 0.5988 | Dense |
| wmt21_fr_de | NanoMTEB-Misc | multilingual | natural_language | 200 | 4,465 | 200 | 0.4658 | 0.8954 | 0.6999 | Dense |