NanoMTEB-Spanish
Overview
NanoMTEB-Spanish is a compact Spanish and Spanish-English retrieval group. It covers complex entity-answer QA, Spanish Wikipedia passage retrieval, Spanish consumer-health passage and document retrieval, and product question answering in Spanish-English, English-Spanish, and Spanish-Spanish directions. The group is useful because the target is not always a Spanish paragraph with obvious word overlap: some positives are short entity answers, health passages, or compact product snippets.
The group contains 1,334 queries, 25,262 task-local documents, and 4,806 positive qrel rows. It is multi-positive overall, with MIRACL, Spanish Passage Retrieval, and xPQA contributing multiple relevant documents or snippets per query. This makes the group a good diagnostic for Spanish retrieval systems that need to combine semantic answerability, domain evidence, and cross-lingual product matching.
What This Group Measures
The group measures several Spanish retrieval relations. mintaka_es maps Spanish complex questions to short answer strings or entity names. miracl_es retrieves Spanish Wikipedia passages for information needs. spanish_passage_s2_p retrieves full Spanish health web pages, while spanish_passage_s2_s retrieves shorter answer passages for the same consumer-health setting. The three xPQA tasks retrieve product answer snippets across Spanish-English and monolingual Spanish directions.
This mixture separates lexical Spanish passage retrieval from semantic and cross-lingual retrieval. A model can do well on MIRACL or health pages because query terms overlap with passages, but still fail on product QA where snippets are short and may be in another language. Conversely, a cross-lingual dense model can be strong on xPQA and still fall short on health retrieval, where exact medical terms and entities matter.
Task Families
- Complex entity-answer retrieval: mintaka_es retrieves short canonical answers for Spanish complex questions.
- Wikipedia retrieval: miracl_es retrieves Spanish Wikipedia passages.
- Consumer-health retrieval: spanish_passage_s2_p retrieves full pages, and spanish_passage_s2_s retrieves answer passages.
- Product QA retrieval: xpqa_eng_spa, xpqa_spa_eng, and xpqa_spa_spa retrieve compact product answer snippets across Spanish and English.
Dataset Shape
The group has seven task pages. mintaka_es is single-positive, while the other six tasks have multiple positives per query on average. The Spanish Passage Retrieval tasks have the densest relevance sets, with about 5.96 and 7.35 positives per query. miracl_es averages 4.67 positives per query, and the xPQA tasks average about 2.3 to 2.5 positives per query.
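The per-task ratios follow directly from the query and positive counts in the Task Summary table. A minimal sketch of that arithmetic, with the counts copied from the table; the names `COUNTS` and `positives_per_query` are chosen only for illustration:

```python
# Per-task (queries, positive qrel rows), copied from the Task Summary table.
COUNTS = {
    "mintaka_es": (200, 200),
    "miracl_es": (200, 934),
    "spanish_passage_s2_p": (167, 996),
    "spanish_passage_s2_s": (167, 1228),
    "xpqa_eng_spa": (200, 491),
    "xpqa_spa_eng": (200, 469),
    "xpqa_spa_spa": (200, 488),
}


def positives_per_query(counts):
    """Return {task: positives / queries} for each task."""
    return {task: pos / q for task, (q, pos) in counts.items()}


per_task = positives_per_query(COUNTS)
total_queries = sum(q for q, _ in COUNTS.values())
total_positives = sum(pos for _, pos in COUNTS.values())
group_average = total_positives / total_queries

for task, ratio in per_task.items():
    print(f"{task}: {ratio:.2f}")
print(f"group: {total_positives} / {total_queries} = {group_average:.2f}")
```

Exact half-way ratios such as 469 / 200 may differ from the table in the last printed digit, depending on the rounding mode.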
Document length varies sharply. Mintaka positives are very short answer strings. xPQA snippets are compact product answers. MIRACL uses mid-length Wikipedia passages. The health s2_p split uses long full web pages, while s2_s uses shorter answer passages. The group therefore tests how retrieval systems behave when the target unit changes from entity string to snippet to passage to full page.
Retrieval Behavior
BM25 Profile
BM25 is not the best profile for any task in the group. On mintaka_es it reaches 0.2502 nDCG@10: sparse matching helps only when the names, titles, or entities in the short answer strings also appear in the query, and dense retrieval is stronger there at 0.3614. BM25 is reasonably strong on Spanish health retrieval and MIRACL because Spanish queries often share medical terms, entities, or topical words with the relevant pages and passages. spanish_passage_s2_p is the one task where BM25 beats dense, 0.5129 versus 0.4719 nDCG@10, although the reranking hybrid is higher still.
BM25 struggles on cross-lingual product QA. xpqa_eng_spa and xpqa_spa_eng score 0.0986 and 0.1227 nDCG@10, because the question and answer snippets may be in different languages and are too short for sparse overlap to recover many relevant items. At group level, BM25 reaches 0.3599 query-weighted nDCG@10, which is useful but clearly below dense retrieval.
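nDCG@10 rewards a ranking for placing relevant documents near the top and normalizes by the best ranking achievable for that query. A minimal sketch, assuming binary relevance (every positive qrel row has gain 1) and unique document ids in the ranking. The function name and the example ids are illustrative, and the evaluation code behind the reported scores may differ in detail:

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """nDCG@k with binary gains; assumes ranked_doc_ids has no duplicates."""
    relevant = set(relevant_ids)
    # Discounted gain of the relevant documents found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in relevant
    )
    # Ideal ranking: all relevant documents (up to k) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


ranking = ["d4", "d2", "d9"]    # hypothetical system output, best first
positives = {"d2", "d9", "d5"}  # hypothetical multi-positive qrels
score = ndcg_at_k(ranking, positives)
print(f"nDCG@10 = {score:.4f}")
```

With multiple positives per query, a system is penalized both for ranking positives low and for missing some of them entirely, which is why short cross-lingual snippets that BM25 cannot match at all pull its score down so sharply.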
Dense Profile
Dense retrieval with harrier-oss-270m is the strongest query-weighted profile for the group at 0.5100 nDCG@10. It is best for mintaka_es, miracl_es, spanish_passage_s2_s, xpqa_eng_spa, xpqa_spa_eng, and xpqa_spa_spa. The cross-lingual product QA gains are especially large: xpqa_spa_eng rises from 0.1227 BM25 nDCG@10 to 0.4872 dense nDCG@10, and xpqa_eng_spa rises from 0.0986 to 0.3104.
Dense retrieval is also strong for answer passage retrieval and MIRACL, where it can connect Spanish questions to semantically relevant passages even when surface wording differs. Its one clear weakness is spanish_passage_s2_p, where full health pages and medical lexical anchors favor hybrid or BM25 more than dense alone.
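The group-level figures (0.3599 for BM25 and 0.5100 for dense) are query-weighted averages: each task's nDCG@10 is weighted by its number of queries. A minimal sketch of that calculation, using the per-task values from the Task Summary table; the names `TASKS`, `PROFILES`, and `query_weighted_ndcg` are illustrative:

```python
# (queries, BM25, dense, reranking hybrid) nDCG@10 per task,
# copied from the Task Summary table.
TASKS = {
    "mintaka_es": (200, 0.2502, 0.3614, 0.2721),
    "miracl_es": (200, 0.5620, 0.7481, 0.7042),
    "spanish_passage_s2_p": (167, 0.5129, 0.4719, 0.6220),
    "spanish_passage_s2_s": (167, 0.5458, 0.6398, 0.6333),
    "xpqa_eng_spa": (200, 0.0986, 0.3104, 0.1428),
    "xpqa_spa_eng": (200, 0.1227, 0.4872, 0.1444),
    "xpqa_spa_spa": (200, 0.4829, 0.5667, 0.5582),
}
PROFILES = ("bm25", "dense", "hybrid")


def query_weighted_ndcg(tasks):
    """Average each profile's per-task score, weighted by query count."""
    total_queries = sum(row[0] for row in tasks.values())
    return {
        name: sum(row[0] * row[i + 1] for row in tasks.values()) / total_queries
        for i, name in enumerate(PROFILES)
    }


group_scores = query_weighted_ndcg(TASKS)
for name, score in group_scores.items():
    print(f"{name}: {score:.4f}")
```

Because the per-task scores are already rounded to four decimals, the recomputed averages may differ from the reported ones in the last digit.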
Reranking Hybrid Profile
The reranking hybrid profile is best for spanish_passage_s2_p, reaching 0.6220 nDCG@10 and the highest recall@100 for that task. This is the expected pattern for full-page health retrieval: sparse evidence finds medical terms and entities, while dense evidence helps with question intent and related concepts. Hybrid is also close to dense on spanish_passage_s2_s, miracl_es, and xpqa_spa_spa.
Hybrid does not dominate the cross-lingual xPQA tasks. It trails dense sharply on xpqa_eng_spa and xpqa_spa_eng, where sparse evidence contributes little because the query and answer may be in different languages. The group therefore shows a clean division: hybrid is useful for long Spanish health pages, while dense retrieval is more important for short cross-lingual product snippets and semantic answer matching.
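This page does not describe how the reranking hybrid profile combines sparse and dense evidence. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common way to merge two ranked lists; it is a swapped-in technique, not the method behind the reported scores. The constant `k=60` is a conventional default, and the document ids are hypothetical:

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; a higher fused score ranks first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


sparse = ["d3", "d1", "d7"]  # hypothetical BM25 order, best first
dense = ["d1", "d9", "d3"]   # hypothetical dense order, best first
fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

Documents that both retrievers rank highly rise to the top. When one list is nearly uninformative, as sparse retrieval is on the cross-lingual xPQA tasks, a fusion like this can dilute the stronger list, which is consistent with hybrid trailing dense there.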
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| mintaka_es | Entity-answer retrieval | multilingual | 200 | 1,693 | 200 | 1.00 | 0.2502 | 0.3614 | 0.2721 | Dense |
| miracl_es | Wikipedia retrieval | es | 200 | 10,000 | 934 | 4.67 | 0.5620 | 0.7481 | 0.7042 | Dense |
| spanish_passage_s2_p | Health page retrieval | es | 167 | 7,501 | 996 | 5.96 | 0.5129 | 0.4719 | 0.6220 | Reranking hybrid |
| spanish_passage_s2_s | Health passage retrieval | es | 167 | 250 | 1,228 | 7.35 | 0.5458 | 0.6398 | 0.6333 | Dense |
| xpqa_eng_spa | Product QA retrieval | multilingual | 200 | 1,936 | 491 | 2.46 | 0.0986 | 0.3104 | 0.1428 | Dense |
| xpqa_spa_eng | Product QA retrieval | multilingual | 200 | 1,941 | 469 | 2.34 | 0.1227 | 0.4872 | 0.1444 | Dense |
| xpqa_spa_spa | Product QA retrieval | es | 200 | 1,941 | 488 | 2.44 | 0.4829 | 0.5667 | 0.5582 | Dense |
Interpretation Notes for Model Researchers
NanoMTEB-Spanish is a useful diagnostic for whether a model's Spanish retrieval strength comes from lexical overlap, semantic matching, or cross-lingual alignment. Dense retrieval dominates the group because it handles short answers, MIRACL passages, and xPQA snippets better than sparse retrieval. Hybrid is most valuable on full-page health retrieval, where exact medical terminology and semantic question intent both matter.
The cross-lingual xPQA tasks should be inspected separately from the Spanish monolingual tasks. A model can improve Spanish passage retrieval without improving Spanish-English product QA. Similarly, strong product QA does not guarantee good retrieval over long health pages. Per-task analysis is necessary before interpreting the aggregate score.
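One way to make the per-task inspection concrete is to compare the dense-minus-BM25 gap on the two cross-lingual xPQA tasks with the gap on the remaining tasks. A minimal sketch using the scores from the Task Summary table; the split into `CROSS_LINGUAL` and "monolingual" is an assumption made for this example, and the means are unweighted:

```python
# (BM25, dense) nDCG@10 per task, copied from the Task Summary table.
SCORES = {
    "mintaka_es": (0.2502, 0.3614),
    "miracl_es": (0.5620, 0.7481),
    "spanish_passage_s2_p": (0.5129, 0.4719),
    "spanish_passage_s2_s": (0.5458, 0.6398),
    "xpqa_eng_spa": (0.0986, 0.3104),
    "xpqa_spa_eng": (0.1227, 0.4872),
    "xpqa_spa_spa": (0.4829, 0.5667),
}
# Assumed split: tasks whose query and answer languages differ.
CROSS_LINGUAL = {"xpqa_eng_spa", "xpqa_spa_eng"}


def mean_dense_gain(scores, cross_lingual):
    """Unweighted mean of (dense - BM25) for each subset of tasks."""
    gaps = {"cross_lingual": [], "monolingual": []}
    for task, (bm25, dense) in scores.items():
        key = "cross_lingual" if task in cross_lingual else "monolingual"
        gaps[key].append(dense - bm25)
    return {key: sum(vals) / len(vals) for key, vals in gaps.items()}


gains = mean_dense_gain(SCORES, CROSS_LINGUAL)
for key, value in gains.items():
    print(f"{key}: {value:+.4f}")
```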
Training and Leakage Notes
Useful training data includes non-overlapping Mintaka examples, Spanish Wikidata-style entity QA, MIRACL Spanish training data, Spanish Wikipedia question-passage pairs, Spanish consumer-health QA, medical FAQ retrieval, document-level health web retrieval, and product QA ranking data in Spanish and English. Multi-positive behavior should be preserved for MIRACL, Spanish Passage Retrieval, and xPQA.
Leakage control should exclude Nano evaluation queries, qrels, answer strings, positive passages, health pages, and product snippets. Synthetic examples should preserve entity names, medical terms, product model numbers, quantities, dimensions, compatibility terms, yes/no polarity, and customer-reported facts. Hard negatives should come from the same entity type, medical topic, product category, or answer family.
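The exclusion step can be sketched as an exact-match filter over normalized text. This is a minimal illustration, assuming training data is a list of (query, passage) pairs; `normalize` and `filter_leaked` are hypothetical helpers, and a real pipeline would likely add near-duplicate detection on top:

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def filter_leaked(train_pairs, eval_queries, eval_positives):
    """Drop pairs whose query or passage matches evaluation text."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_positives}
    return [
        (q, p)
        for q, p in train_pairs
        if normalize(q) not in blocked_queries
        and normalize(p) not in blocked_passages
    ]


train = [
    ("¿Qué es la diabetes?", "La diabetes es una enfermedad crónica."),
    ("¿Cuánto pesa?", "Pesa 2 kg."),
]
kept = filter_leaked(train, ["que es la diabetes"], [])
print(kept)
```

Accent stripping makes the filter deliberately aggressive: it also maps ñ to n, so it may block a few pairs that differ only in diacritics, which is the safer error for leakage control.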
Source Reference Table
| Source | Year | Type | URL |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering | 2022 | source task paper | https://arxiv.org/abs/2210.01613 |
| Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages | 2023 | source task paper | https://arxiv.org/abs/2210.09984 |
| A Test Collection for Passage Retrieval Evaluation of Spanish Health-Related Resources | 2019 | source task paper | https://doi.org/10.1007/978-3-030-15719-7_19 |
| Spanish Passage Retrieval dataset page | | project page | https://mklab.iti.gr/results/spanish-passage-retrieval-dataset/ |
| xPQA: Cross-Lingual Product Question Answering across 12 Languages | 2023 | source task paper | https://arxiv.org/abs/2305.09249 |
| mteb/MintakaRetrieval | | dataset card | https://huggingface.co/datasets/mteb/MintakaRetrieval |
| mteb/MIRACLRetrievalHardNegatives | | dataset card | https://huggingface.co/datasets/mteb/MIRACLRetrievalHardNegatives |
| mteb/SpanishPassageRetrievalS2P | | dataset card | https://huggingface.co/datasets/mteb/SpanishPassageRetrievalS2P |
| mteb/XPQARetrieval | | dataset card | https://huggingface.co/datasets/mteb/XPQARetrieval |
Metadata Summary
| Field | Value |
| Task pages | 7 |
| Queries | 1,334 |
| Task-local documents | 25,262 |
| Positive qrels | 4,806 |
| Languages | es, multilingual |
| Categories | natural_language |
| Positives / query avg | 3.60 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| mintaka_es | NanoMTEB-Spanish | multilingual | natural_language | 200 | 1,693 | 200 | 0.2502 | 0.3614 | 0.2721 | Dense |
| miracl_es | NanoMTEB-Spanish | es | natural_language | 200 | 10,000 | 934 | 0.5620 | 0.7481 | 0.7042 | Dense |
| spanish_passage_s2_p | NanoMTEB-Spanish | es | natural_language | 167 | 7,501 | 996 | 0.5129 | 0.4719 | 0.6220 | Reranking hybrid |
| spanish_passage_s2_s | NanoMTEB-Spanish | es | natural_language | 167 | 250 | 1,228 | 0.5458 | 0.6398 | 0.6333 | Dense |
| xpqa_eng_spa | NanoMTEB-Spanish | multilingual | natural_language | 200 | 1,936 | 491 | 0.0986 | 0.3104 | 0.1428 | Dense |
| xpqa_spa_eng | NanoMTEB-Spanish | multilingual | natural_language | 200 | 1,941 | 469 | 0.1227 | 0.4872 | 0.1444 | Dense |
| xpqa_spa_spa | NanoMTEB-Spanish | es | natural_language | 200 | 1,941 | 488 | 0.4829 | 0.5667 | 0.5582 | Dense |