HAKARI-Bench

NanoMTEB-Spanish

Overview

NanoMTEB-Spanish is a compact Spanish and Spanish-English retrieval group. It covers complex entity-answer QA, Spanish Wikipedia passage retrieval, Spanish consumer-health passage and document retrieval, and product question answering in Spanish-English, English-Spanish, and Spanish-Spanish directions. The group is useful because the target is not always a Spanish paragraph with obvious word overlap: some positives are short entity answers, health passages, or compact product snippets.

The group contains 1,334 queries, 25,262 task-local documents, and 4,806 positive qrel rows. It is multi-positive overall, with MIRACL, Spanish Passage Retrieval, and xPQA contributing multiple relevant documents or snippets per query. This makes the group a good diagnostic for Spanish retrieval systems that need to combine semantic answerability, domain evidence, and cross-lingual product matching.

What This Group Measures

The group measures several Spanish retrieval relations. mintaka_es maps Spanish complex questions to short answer strings or entity names. miracl_es retrieves Spanish Wikipedia passages for information needs. spanish_passage_s2_p retrieves full Spanish health web pages, while spanish_passage_s2_s retrieves shorter answer passages for the same consumer-health setting. The three xPQA tasks retrieve product answer snippets across Spanish-English and monolingual Spanish directions.

This mixture separates lexical Spanish passage retrieval from semantic and cross-lingual retrieval. A model can do well on MIRACL or health pages because query terms overlap with passages, but still fail on product QA where snippets are short and may be in another language. Conversely, a cross-lingual dense model can be strong on xPQA while still needing exact medical terms and entities for health retrieval.

Task Families

Dataset Shape

The group has seven task pages. mintaka_es is single-positive, while the other six tasks have multiple positives per query on average. The Spanish Passage Retrieval tasks have the densest relevance sets, with about 5.96 and 7.35 positives per query. miracl_es averages 4.67 positives per query, and the xPQA tasks average about 2.3 to 2.5 positives per query.
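The positives-per-query figures above can be reproduced from qrel rows. The following is a minimal sketch, assuming qrels are available as `(query_id, doc_id, relevance)` tuples; that row layout and the function name `positives_per_query` are illustrative assumptions, not the benchmark's own loader.

```python
from collections import Counter

def positives_per_query(qrels):
    """Average number of positive documents per query.

    `qrels` is an iterable of (query_id, doc_id, relevance) rows (assumed
    layout). Rows with relevance <= 0 are ignored, and queries without any
    positive row do not enter the average.
    """
    counts = Counter(qid for qid, _doc, rel in qrels if rel > 0)
    return sum(counts.values()) / len(counts) if counts else 0.0

# Toy qrels for illustration only, not taken from the benchmark.
toy_qrels = [
    ("q1", "d1", 1), ("q1", "d2", 1), ("q1", "d3", 1),
    ("q2", "d4", 1),
    ("q3", "d5", 1), ("q3", "d6", 1),
]
avg = positives_per_query(toy_qrels)  # 6 positives over 3 queries

# The same ratio from the published (positives, queries) counts per task.
task_counts = {
    "miracl_es": (934, 200),
    "spanish_passage_s2_p": (996, 167),
    "spanish_passage_s2_s": (1228, 167),
}
ratios = {task: round(pos / q, 2) for task, (pos, q) in task_counts.items()}
```

With the published counts, the ratios come out at 4.67, 5.96, and 7.35, matching the figures quoted above.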

Document length varies sharply. Mintaka positives are very short answer strings. xPQA snippets are compact product answers. MIRACL uses mid-length Wikipedia passages. The health s2_p split uses long full web pages, while s2_s uses shorter answer passages. The group therefore tests how retrieval systems behave when the target unit changes from entity string to snippet to passage to full page.

Retrieval Behavior

BM25 Profile

BM25 is not the best profile on any task in this group. On mintaka_es it reaches 0.2502 nDCG@10 against 0.3614 for dense, even though the relevant answer strings often contain names, titles, or entities that can be matched directly when they also appear in the query. BM25 is reasonably strong on Spanish health retrieval and MIRACL because Spanish queries often share medical terms, entities, or topical words with the relevant pages and passages. spanish_passage_s2_p is the one task where BM25 beats dense, 0.5129 against 0.4719 nDCG@10.

BM25 struggles on cross-lingual product QA. xpqa_eng_spa and xpqa_spa_eng score 0.0986 and 0.1227 nDCG@10, because the question and the answer snippets may be in different languages and the snippets are too short for sparse overlap to recover many relevant items. At group level, BM25 reaches 0.3599 query-weighted nDCG@10, which is useful but clearly below the 0.5100 of dense retrieval.
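The scores quoted here are nDCG@10 values, and recall@100 is also reported for some tasks. As a reminder of what these metrics reward on multi-positive tasks, here is a minimal sketch. It assumes binary relevance (gain 1 for any positive document, 0 otherwise), which may not match the benchmark's exact evaluation code; the function names are illustrative.

```python
import math

def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

def ndcg_at_k(ranked_doc_ids, positives, k=10):
    """nDCG@k with binary gains (assumption: every positive has gain 1)."""
    gains = [1.0 if doc in positives else 0.0 for doc in ranked_doc_ids[:k]]
    ideal = [1.0] * min(len(positives), k)
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def recall_at_k(ranked_doc_ids, positives, k=100):
    """Fraction of the positives that appear in the top k results."""
    if not positives:
        return 0.0
    hits = sum(1 for doc in ranked_doc_ids[:k] if doc in positives)
    return hits / len(positives)

# Toy ranking for illustration: two positives, found at ranks 2 and 4.
ranked = ["d3", "d1", "d9", "d2"]
positives = {"d1", "d2"}
score = ndcg_at_k(ranked, positives)
```

Because the ideal ranking places every positive first, a query with many positives is only scored fully when several of them reach the top 10, which is why the multi-positive tasks are harder to saturate than the single-positive mintaka_es.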

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest query-weighted profile for the group at 0.5100 nDCG@10. It is best for mintaka_es, miracl_es, spanish_passage_s2_s, xpqa_eng_spa, xpqa_spa_eng, and xpqa_spa_spa. The cross-lingual product QA gains are especially large: xpqa_spa_eng rises from 0.1227 BM25 nDCG@10 to 0.4872 dense nDCG@10, and xpqa_eng_spa rises from 0.0986 to 0.3104.

Dense retrieval is also strong for answer passage retrieval and MIRACL, where it can connect Spanish questions to semantically relevant passages even when surface wording differs. Its one clear weakness is spanish_passage_s2_p, where full health pages and medical lexical anchors favor hybrid or BM25 more than dense alone.
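The group-level BM25 and dense figures are query-weighted means of the per-task scores. The sketch below recomputes them from the per-task query counts and nDCG@10 values reported for this group; the helper name `query_weighted_mean` is an illustrative assumption.

```python
# task: (queries, BM25 nDCG@10, dense nDCG@10), copied from the task summary.
TASKS = {
    "mintaka_es":           (200, 0.2502, 0.3614),
    "miracl_es":            (200, 0.5620, 0.7481),
    "spanish_passage_s2_p": (167, 0.5129, 0.4719),
    "spanish_passage_s2_s": (167, 0.5458, 0.6398),
    "xpqa_eng_spa":         (200, 0.0986, 0.3104),
    "xpqa_spa_eng":         (200, 0.1227, 0.4872),
    "xpqa_spa_spa":         (200, 0.4829, 0.5667),
}

def query_weighted_mean(tasks, column):
    """Mean of one score column, weighting each task by its query count."""
    total_queries = sum(row[0] for row in tasks.values())
    weighted = sum(row[0] * row[column] for row in tasks.values())
    return weighted / total_queries

bm25_group = query_weighted_mean(TASKS, 1)
dense_group = query_weighted_mean(TASKS, 2)
print(f"BM25 {bm25_group:.4f}  dense {dense_group:.4f}")
```

Because the inputs are per-task scores already rounded to four decimals, the recomputed values can differ from the quoted 0.3599 and 0.5100 in the last digit.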

Reranking Hybrid Profile

The reranking hybrid profile is best for spanish_passage_s2_p, reaching 0.6220 nDCG@10 and the highest recall@100 for that task. This is the expected pattern for full-page health retrieval: sparse evidence finds medical terms and entities, while dense evidence helps with question intent and related concepts. Hybrid is also close to dense on spanish_passage_s2_s, miracl_es, and xpqa_spa_spa.

Hybrid does not dominate the cross-lingual xPQA tasks. It trails dense sharply on xpqa_eng_spa and xpqa_spa_eng, where sparse evidence contributes little because the query and answer may be in different languages. The group therefore shows a clean division: hybrid is useful for long Spanish health pages, while dense retrieval is more important for short cross-lingual product snippets and semantic answer matching.
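This page does not specify how the reranking hybrid profile combines sparse and dense evidence. To illustrate the general idea of merging the two signals, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique swapped in here for illustration only. The function name, the constant `k=60`, and the toy document ids are assumptions, not the benchmark's pipeline.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids with RRF.

    Each document scores sum(1 / (k + rank)) over the lists that contain it.
    Illustrative stand-in only; not the benchmark's hybrid implementation.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical rankings: the sparse run favours exact medical terms,
# the dense run favours question intent.
bm25_ranking = ["d_both", "d_term", "d_noise"]
dense_ranking = ["d_intent", "d_both", "d_other"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

A document ranked highly by both runs rises to the top, which matches the long health-page case. When one run is close to noise, as sparse retrieval is on the cross-lingual xPQA tasks, fusion has little to add over the dense ranking.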

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mintaka_es | Entity-answer retrieval | multilingual | 200 | 1,693 | 200 | 1.00 | 0.2502 | 0.3614 | 0.2721 | Dense |
| miracl_es | Wikipedia retrieval | es | 200 | 10,000 | 934 | 4.67 | 0.5620 | 0.7481 | 0.7042 | Dense |
| spanish_passage_s2_p | Health page retrieval | es | 167 | 7,501 | 996 | 5.96 | 0.5129 | 0.4719 | 0.6220 | Reranking hybrid |
| spanish_passage_s2_s | Health passage retrieval | es | 167 | 250 | 1,228 | 7.35 | 0.5458 | 0.6398 | 0.6333 | Dense |
| xpqa_eng_spa | Product QA retrieval | multilingual | 200 | 1,936 | 491 | 2.46 | 0.0986 | 0.3104 | 0.1428 | Dense |
| xpqa_spa_eng | Product QA retrieval | multilingual | 200 | 1,941 | 469 | 2.34 | 0.1227 | 0.4872 | 0.1444 | Dense |
| xpqa_spa_spa | Product QA retrieval | es | 200 | 1,941 | 488 | 2.44 | 0.4829 | 0.5667 | 0.5582 | Dense |

Interpretation Notes for Model Researchers

NanoMTEB-Spanish is a useful diagnostic for whether a model's Spanish retrieval strength comes from lexical overlap, semantic matching, or cross-lingual alignment. Dense retrieval dominates the group because it handles short answers, MIRACL passages, and xPQA snippets better than sparse retrieval. Hybrid is most valuable on full-page health retrieval, where exact medical terminology and semantic question intent both matter.

The cross-lingual xPQA tasks should be inspected separately from the Spanish monolingual tasks. A model can improve Spanish passage retrieval without improving Spanish-English product QA. Similarly, strong product QA does not guarantee good retrieval over long health pages. Per-task analysis is necessary before interpreting the aggregate score.
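One way to carry out the per-task inspection is to look at the dense-minus-BM25 gap for each task, grouped by language tag, instead of at the aggregate alone. The sketch below uses the per-task scores reported for this group; the grouping helper is an illustrative assumption. Note that mintaka_es carries the multilingual tag in the task metadata, so it lands in the same bucket as the cross-lingual xPQA tasks.

```python
# task: (language tag, BM25 nDCG@10, dense nDCG@10), from the task summary.
SCORES = {
    "mintaka_es":           ("multilingual", 0.2502, 0.3614),
    "miracl_es":            ("es",           0.5620, 0.7481),
    "spanish_passage_s2_p": ("es",           0.5129, 0.4719),
    "spanish_passage_s2_s": ("es",           0.5458, 0.6398),
    "xpqa_eng_spa":         ("multilingual", 0.0986, 0.3104),
    "xpqa_spa_eng":         ("multilingual", 0.1227, 0.4872),
    "xpqa_spa_spa":         ("es",           0.4829, 0.5667),
}

def dense_gain_by_language(scores):
    """Map each language tag to {task: dense nDCG@10 minus BM25 nDCG@10}."""
    groups = {}
    for task, (lang, bm25, dense) in scores.items():
        groups.setdefault(lang, {})[task] = round(dense - bm25, 4)
    return groups

gains = dense_gain_by_language(SCORES)
for lang, per_task in gains.items():
    for task, gain in sorted(per_task.items(), key=lambda item: -item[1]):
        print(f"{lang:12s} {task:22s} {gain:+.4f}")
```

The largest gaps fall on xpqa_spa_eng and xpqa_eng_spa, while spanish_passage_s2_p is the only task with a negative gap, which the aggregate score alone would hide.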

Training and Leakage Notes

Useful training data includes non-overlapping Mintaka examples, Spanish Wikidata-style entity QA, MIRACL Spanish training data, Spanish Wikipedia question-passage pairs, Spanish consumer-health QA, medical FAQ retrieval, document-level health web retrieval, and product QA ranking data in Spanish and English. Multi-positive behavior should be preserved for MIRACL, Spanish Passage Retrieval, and xPQA.

Leakage control should exclude Nano evaluation queries, qrels, answer strings, positive passages, health pages, and product snippets. Synthetic examples should preserve entity names, medical terms, product model numbers, quantities, dimensions, compatibility terms, yes/no polarity, and customer-reported facts. Hard negatives should come from the same entity type, medical topic, product category, or answer family.
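A simple first pass at the exclusion step is exact matching after text normalization. The sketch below is a minimal example under stated assumptions: training examples are dicts with `query` and `positive` keys, and `eval_texts` holds the Nano evaluation queries, answer strings, and positive texts. These field names and helpers are hypothetical, and near-duplicate leakage would need fuzzy or embedding-based matching on top.

```python
import unicodedata

def normalize(text):
    """Casefold, strip accents, and collapse whitespace for matching only.

    Accent stripping also maps "ñ" to "n", which makes matching deliberately
    aggressive. The stored training text itself is never modified.
    """
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())

def drop_leaked(train_examples, eval_texts):
    """Keep only examples whose query and positive are absent from eval data."""
    blocked = {normalize(t) for t in eval_texts}
    return [
        ex for ex in train_examples
        if normalize(ex["query"]) not in blocked
        and normalize(ex["positive"]) not in blocked
    ]

# Toy data for illustration only, not taken from the benchmark.
eval_texts = ["¿Qué es la diabetes?"]
train = [
    {"query": "¿Que es la  diabetes?", "positive": "Texto de ejemplo A."},
    {"query": "¿Cómo se trata la gripe?", "positive": "Texto de ejemplo B."},
]
kept = drop_leaked(train, eval_texts)
```

In this toy run the first training example is dropped because its query matches an evaluation query once accents and spacing are normalized, and the second is kept with its original accented text intact.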

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering | 2022 | source task paper | https://arxiv.org/abs/2210.01613 |
| Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages | 2023 | source task paper | https://arxiv.org/abs/2210.09984 |
| A Test Collection for Passage Retrieval Evaluation of Spanish Health-Related Resources | 2019 | source task paper | https://doi.org/10.1007/978-3-030-15719-7_19 |
| Spanish Passage Retrieval dataset page | | project page | https://mklab.iti.gr/results/spanish-passage-retrieval-dataset/ |
| xPQA: Cross-Lingual Product Question Answering across 12 Languages | 2023 | source task paper | https://arxiv.org/abs/2305.09249 |
| mteb/MintakaRetrieval | | dataset card | https://huggingface.co/datasets/mteb/MintakaRetrieval |
| mteb/MIRACLRetrievalHardNegatives | | dataset card | https://huggingface.co/datasets/mteb/MIRACLRetrievalHardNegatives |
| mteb/SpanishPassageRetrievalS2P | | dataset card | https://huggingface.co/datasets/mteb/SpanishPassageRetrievalS2P |
| mteb/XPQARetrieval | | dataset card | https://huggingface.co/datasets/mteb/XPQARetrieval |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 7 |
| Queries | 1,334 |
| Split-local documents | 25,262 |
| Positive qrels | 4,806 |
| Languages | es, multilingual |
| Categories | natural_language |
| Positives / query avg | 3.60 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| mintaka_es | NanoMTEB-Spanish | multilingual | natural_language | 200 | 1,693 | 200 | 0.2502 | 0.3614 | 0.2721 | Dense |
| miracl_es | NanoMTEB-Spanish | es | natural_language | 200 | 10,000 | 934 | 0.5620 | 0.7481 | 0.7042 | Dense |
| spanish_passage_s2_p | NanoMTEB-Spanish | es | natural_language | 167 | 7,501 | 996 | 0.5129 | 0.4719 | 0.6220 | Reranking hybrid |
| spanish_passage_s2_s | NanoMTEB-Spanish | es | natural_language | 167 | 250 | 1,228 | 0.5458 | 0.6398 | 0.6333 | Dense |
| xpqa_eng_spa | NanoMTEB-Spanish | multilingual | natural_language | 200 | 1,936 | 491 | 0.0986 | 0.3104 | 0.1428 | Dense |
| xpqa_spa_eng | NanoMTEB-Spanish | multilingual | natural_language | 200 | 1,941 | 469 | 0.1227 | 0.4872 | 0.1444 | Dense |
| xpqa_spa_spa | NanoMTEB-Spanish | es | natural_language | 200 | 1,941 | 488 | 0.4829 | 0.5667 | 0.5582 | Dense |