NanoMTEB-Spanish / spanish_passage_s2_s
Overview
spanish_passage_s2_s is the passage-level version of the Spanish Passage Retrieval health dataset. The same Spanish consumer-health questions are used as queries, but the documents are concise answer-bearing passages rather than full web pages. This makes the task closer to direct answer-passage retrieval for Spanish health information needs about baby care, vaccination, breastfeeding, emergencies, and low back pain.
The Nano split contains 167 queries, 250 passage documents, and 1,228 positive relevance judgments. Queries average about 68 characters, while passages average about 442 characters. Almost every query is multi-positive: 165 of 167 queries have more than one relevant passage, with an average of 7.35 positives and a maximum of 20. Compared with the full-page S2P variant, S2S removes much of the page-level noise and asks the model to rank short passages that explicitly answer the question.
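The per-query figures above follow directly from the qrels (for instance, 1,228 / 167 ≈ 7.35). A minimal sketch of how such a profile could be computed, assuming the qrels are available as a mapping from query id to a set of positive document ids; the function name and the toy ids are hypothetical and are not taken from the dataset:

```python
from statistics import mean, median

def positives_profile(qrels):
    """Summarise positives per query from a {query_id: set(doc_ids)} mapping."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "min": min(counts),
        "max": max(counts),
        "multi_positive_share": multi / len(counts),
    }

# Toy qrels with hypothetical ids, not the real Nano split.
toy_qrels = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d2"},
    "q3": {"d4", "d5"},
}
profile = positives_profile(toy_qrels)
print(profile)
```

Run against the real qrels, the same function should reproduce the counts listed in the Dataset Information table, provided the qrels are loaded in this shape.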
Details
What the Original Data Measures
The Spanish Passage Retrieval collection provides manually assessed passage-level relevance for Spanish health questions. The S2S variant uses those passages as the retrieval units. This means the positive document should directly answer the user's information need, rather than merely contain an answer somewhere inside a longer page.
The task is still multi-positive. Several passages can answer the same question, sometimes with different wording, levels of detail, or source pages. A good retriever should recover multiple valid answer passages.
Observed Data Profile
The passages are short health explanations written for lay readers. Examples discuss the benefits of breastfeeding, when to introduce complementary foods, how frequently a newborn should feed, which vaccines are publicly financed, and how vaccines prevent infectious disease. Other passages cover lumbago, back injury causes, pediatric checkups, and newborn weight.
Because the corpus is small and answer-focused, exact answer terms are more concentrated than in S2P. At the same time, paraphrase still matters: "darle el pecho", "amamantar", "lactancia materna", and related expressions may all refer to the same information need.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.5458, hit@10 of 0.9401, and recall@100 of 0.9438. The lexical baseline is strong because passages are short and often contain the key health terms from the query. With less long-document noise, BM25 can match question terms to answer passages more reliably than in full-page retrieval.
The remaining limitation is paraphrase and consumer-medical style mismatch. A user may ask in everyday language, while the passage uses a more formal phrase. BM25 can also over-rank passages that share a topic term but answer a different subquestion.
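The paraphrase limitation can be illustrated with a small Okapi BM25 scorer. This is a sketch only: the tokenizer, the parameter values (k1 = 1.2, b = 0.75), and the three toy sentences are assumptions chosen for illustration, and they are not the configuration behind the reported BM25 numbers.

```python
import math
from collections import Counter

PUNCT = "¿?¡!.,;:"

def tokenize(text):
    # Naive lowercase whitespace tokenizer; no stemming or stopword removal.
    tokens = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in tokens if t]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy sentences written for this example, not passages from the corpus.
docs = [
    "La lactancia materna aporta nutrientes al recién nacido.",
    "Amamantar protege al bebé frente a infecciones.",
    "Las vacunas previenen enfermedades infecciosas.",
]
scores = bm25_scores("beneficios de la lactancia materna", docs)
print(scores)
```

In this toy setup the second sentence is about breastfeeding but shares no token with the query, so it receives the same zero score as the unrelated vaccine sentence. Stemming or synonym expansion would be needed for a lexical scorer to separate the two.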
Dense Evaluation Profile
The dense harrier-oss-270m run (config harrier_oss_v1_270m) is strongest for top-10 ranking, with nDCG@10 of 0.6398, hit@10 of 0.9701, and recall@100 of 0.9902. Dense retrieval benefits from the shorter answer-passage unit: the embedding can focus on the actual answer rather than a full web page with mixed content.
This profile shows that semantic matching is valuable in Spanish health QA. Dense retrieval can connect layperson questions with concise answer passages even when wording differs. It also preserves nearly all positives within the top 100.
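Dense ranking of this kind reduces to scoring passages by embedding similarity. The sketch below assumes cosine similarity and uses hand-written three-dimensional vectors as stand-ins for real embeddings; the ids and values are hypothetical and no actual model is called.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs, top_k=10):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical toy embeddings: the two breastfeeding passages point in a
# similar direction even though their surface wording differs.
doc_vecs = {
    "lactancia": [0.9, 0.1, 0.0],
    "amamantar": [0.8, 0.2, 0.1],
    "vacunas": [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]
ranking = rank_by_cosine(query_vec, doc_vecs, top_k=2)
print(ranking)
```

Both breastfeeding passages rank above the vaccine passage here because the toy vectors encode topic rather than surface form, which is the behaviour the paragraph above attributes to dense retrieval.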
Reranking Hybrid Evaluation Profile
reranking_hybrid reports nDCG@10 of 0.6333, hit@10 of 0.9701, and recall@100 of 0.9919. Candidate lists contain exactly 100 items with no safeguard rows. Hybrid retrieval ties dense exactly at hit@10 (0.9701), trails it by 0.0065 in nDCG@10, and leads it by 0.0017 in recall@100.
This makes S2S a balanced dense/hybrid task. Dense is marginally better for final top ordering, while hybrid is marginally better for preserving positives. Because the corpus is small and answer-focused, both semantic and lexical signals work well.
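This card does not state how the hybrid candidate lists are built. As one common way to combine lexical and dense rankings, the sketch below uses reciprocal rank fusion (RRF); the choice of RRF, the constant k = 60, and the toy rankings are all assumptions and may differ from the actual reranking_hybrid pipeline.

```python
def reciprocal_rank_fusion(rankings, k=60, top_k=100):
    """Fuse several ranked lists of doc ids into one list of at most top_k ids."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_k]

# Hypothetical toy rankings for one query.
bm25_ranking = ["d1", "d3", "d2"]
dense_ranking = ["d2", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Fusion keeps any document that either ranker surfaced, which is consistent with the slightly higher recall@100 of the hybrid profile. It can also demote a document that only one ranker placed first, which is one plausible reason for a small nDCG@10 loss.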
Metric Interpretation for Model Researchers
This split is dense-favorable for nDCG@10 (0.6398 vs. 0.6333 for hybrid) and hybrid-favorable by a very small margin for recall@100 (0.9919 vs. 0.9902). BM25 is strong but clearly behind the semantic profiles, trailing dense by about 0.09 nDCG@10. The contrast with S2P is important: when the retrieval unit is the answer passage rather than the full page, dense retrieval becomes much more effective.
The task has many positives per query, so top-10 ranking should be interpreted as ranking a set of valid passages. Hit@10 is high for all methods, but nDCG@10 reveals whether the better passages are ranked earlier and whether several positives appear near the top.
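The difference between hit@10 and nDCG@10 on a multi-positive query can be made concrete with a small example. The sketch assumes binary relevance (every positive has gain 1) and the standard log2 discount; the evaluation code behind the reported numbers may differ in detail, and the document ids are hypothetical.

```python
import math

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k for one query."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in positives)
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)

# Two rankings that retrieve the same three positives at different depths.
positives = {"d1", "d2", "d3"}
early = ["d1", "d2", "d3", "x1", "x2"]
late = ["x1", "x2", "d1", "d2", "d3"]

for name, ranked in [("early", early), ("late", late)]:
    print(name, hit_at_k(ranked, positives), recall_at_k(ranked, positives),
          round(ndcg_at_k(ranked, positives), 4))
```

Both rankings have the same hit@10 and recall@100, but the ranking that places the positives first has a higher nDCG@10. On a split where hit@10 is above 0.94 for every profile, nDCG@10 is therefore the more discriminative number.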
Query and Relevance Type Tendencies
Representative queries ask about breast milk benefits, the timing of complementary foods, breastfeeding frequency, free vaccines, and vaccination for infectious disease prevention. Relevant passages directly answer the question in one or a few paragraphs. Many are written in educational or public-health language.
The model should understand both medical vocabulary and layperson phrasing. It should also retrieve multiple passages when the same question has several valid explanatory answers.
Representative Failure Modes
BM25 may miss paraphrased answer passages or retrieve a passage from the same health topic that answers a different question. Dense retrieval may retrieve a semantically related health passage that is not specific enough. Hybrid retrieval can include both lexical and semantic near misses, although the small corpus limits the damage.
Another failure mode is under-ranking secondary positives. Since many queries have more than five positives, a model may retrieve one excellent passage but miss other valid answers that use different wording or focus on a different detail.
Training Data That May Help
Useful training data includes Spanish medical FAQ passage retrieval pairs, consumer-health question-answer sentence pairs, multi-positive Spanish health retrieval examples, and paraphrase-rich data about baby care, vaccination, and low back pain. Training should exclude PRES evaluation questions, qrels, and overlapping passage text.
Hard negatives should be passages from the same health topic that answer a different subquestion. These are essential for teaching the model to distinguish broad topical relevance from direct answer relevance.
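A minimal sketch of same-topic hard-negative selection follows. It assumes each query and passage carries a topic label, which this card does not confirm for the training data; the function name, the label fields, and the toy ids are hypothetical.

```python
def same_topic_hard_negatives(query_id, qrels, query_topics, doc_topics, max_negatives=4):
    """Pick passages that share the query's topic but are not judged positive."""
    positives = qrels[query_id]
    topic = query_topics[query_id]
    candidates = sorted(
        doc_id
        for doc_id, doc_topic in doc_topics.items()
        if doc_topic == topic and doc_id not in positives
    )
    return candidates[:max_negatives]

# Toy data with hypothetical ids and topic labels.
qrels = {"q1": {"d1"}}
query_topics = {"q1": "lactancia"}
doc_topics = {
    "d1": "lactancia",  # judged positive, so never a negative
    "d2": "lactancia",  # same topic, different subquestion
    "d3": "vacunas",    # different topic, an easy negative
    "d4": "lactancia",
}
negatives = same_topic_hard_negatives("q1", qrels, query_topics, doc_topics)
print(negatives)
```

With incomplete relevance judgments, some same-topic candidates may be unjudged positives, so in practice a filtering step (for example, dropping candidates that a strong scorer rates close to the known positives) is often added.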
Model Improvement Notes
Dense models can improve by focusing on Spanish health paraphrases, layperson-to-medical wording, and multi-positive passage ranking. Sparse systems benefit from the short passage unit but need synonym and morphology handling to close the gap. Hybrid systems are robust for candidate preservation, while dense retrieval is slightly better for final top ordering.
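One way to train for multi-positive passage ranking is a contrastive loss averaged over every positive of a query, so that a single well-ranked passage cannot hide poorly scored secondary positives. The formulation below is one such option and is an assumption for illustration, not the objective prescribed by this card; the temperature and the toy scores are hypothetical.

```python
import math

def multi_positive_loss(scores, positive_ids, temperature=0.05):
    """Average over positives of -log(exp(s_p/t) / (exp(s_p/t) + sum_n exp(s_n/t)))."""
    negatives = [s / temperature for d, s in scores.items() if d not in positive_ids]
    losses = []
    for doc_id in positive_ids:
        pos = scores[doc_id] / temperature
        shift = max([pos] + negatives)  # subtract the max for numerical stability
        numerator = math.exp(pos - shift)
        denominator = numerator + sum(math.exp(n - shift) for n in negatives)
        losses.append(-math.log(numerator / denominator))
    return sum(losses) / len(losses)

positive_ids = {"d1", "d2"}
# Both positives score above the negative n1.
both_high = {"d1": 0.8, "d2": 0.8, "n1": 0.4}
# The secondary positive d2 scores below the negative n1.
one_low = {"d1": 0.8, "d2": 0.3, "n1": 0.4}

loss_both_high = multi_positive_loss(both_high, positive_ids)
loss_one_low = multi_positive_loss(one_low, positive_ids)
print(loss_both_high, loss_one_low)
```

The loss stays near zero when both positives outscore the negative and grows when the secondary positive falls below it, which targets the under-ranking of secondary positives described under Representative Failure Modes.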
For downstream use, S2S is a good benchmark for answer-passage retrieval in consumer health. The best models should retrieve concise, directly useful passages rather than broad health pages.
Example Data
| Query | Positive document |
| --- | --- |
| ¿Cuáles son los beneficios de la leche materna? [47 chars] | En la misma se reconoce que la lactancia materna es el mejor modo de proporcionar al recién nacido los nutrientes que necesita durante los primeros meses de vida. [162 chars] |
| ¿Cuándo debo introducir alimentos complementarios aparte de la lactancia materna? [81 chars] | Durante los primeros 6 meses de vida el bebé solamente necesita tomar leche materna. Es recomendable utilizar la edad corregida para comenzar a introducir el resto de alimentos, individualizando según las necesidades. No es conveniente introducir alimentación complementaria antes de los 4 meses de edad corregida. [314 chars] |
| ¿Tendría que darle el pecho a mi bebé siempre que me lo pida? [61 chars] | Durante el primer mes de vida, su recién nacido debería alimentarse entre ocho y 12 veces al día. [97 chars] |
Source Reference Table
| Source | What it contributes |
| --- | --- |
| ECIR paper | Original Spanish health retrieval test collection. |
| Project page | Passage-level relevance and topic description. |
| Source dataset card | Public dataset packaging. |
| MTEB task card | S2S retrieval formulation. |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoMTEB-Spanish |
| Backing dataset | NanoMTEB-Spanish |
| Task / split | spanish_passage_s2_s |
| Hugging Face dataset | hakari-bench/NanoMTEB-Spanish |
| Language | es |
| Category | natural_language |
| Queries | 167 |
| Documents | 250 |
| Positive qrels | 1,228 |
| Positives / query avg | 7.35 |
| Positives / query min | 1 |
| Positives / query median | 6.00 |
| Positives / query max | 20 |
| Multi-positive queries | 165 (98.80%) |
| Query length avg chars | 67.56 |
| Document length avg chars | 442.43 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.5458 | 0.9401 | 0.9438 | top-500 |
| Dense | harrier_oss_v1_270m | 0.6398 | 0.9701 | 0.9902 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.6333 | 0.9701 | 0.9919 | top-100 |
Training and Leakage Metadata
- Original train split: not_found
- Evaluation split origin: test
- Train/eval overlap audit: not_audited
- Leakage note: exclude PRES evaluation questions, qrels, and Spanish health passages likely to overlap with Nano
- Multi-positive training: multi_positive_objective
- Useful training data: Spanish medical FAQ passage retrieval pairs, consumer-health question-answer sentence pairs, multi-positive Spanish health retrieval examples, paraphrase-rich baby care, vaccination, and low back pain data
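Since the train/eval overlap audit is listed as not_audited, a basic exact-match filter is a reasonable first step before training. The sketch below assumes training data is a list of (question, passage) pairs and that the evaluation queries and passages are available as plain strings; the function names and the toy training pairs are hypothetical.

```python
import unicodedata

def normalize(text):
    """Lowercase, strip accents, and collapse whitespace for exact-match checks."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.split())

def drop_overlapping_pairs(train_pairs, eval_queries, eval_passages):
    """Remove training pairs whose question or passage matches evaluation text."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_passages}
    return [
        (q, p)
        for q, p in train_pairs
        if normalize(q) not in blocked_queries and normalize(p) not in blocked_passages
    ]

eval_queries = ["¿Cuáles son los beneficios de la leche materna?"]
eval_passages = [
    "Durante el primer mes de vida, su recién nacido debería alimentarse entre ocho y 12 veces al día."
]
# Hypothetical training pairs; the first repeats an evaluation question
# with different casing, accents, and spacing.
train_pairs = [
    ("¿cuales son los beneficios de la  leche materna?", "texto a"),
    ("¿Qué vacunas son gratuitas?", "texto b"),
]
kept = drop_overlapping_pairs(train_pairs, eval_queries, eval_passages)
print(kept)
```

This only catches matches that are identical after normalization. Near-duplicate passages, such as excerpts of the same source page, would need a fuzzy check like n-gram overlap on top of it.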