NanoMTEB-Spanish / xpqa_spa_spa
Overview
xpqa_spa_spa is the Spanish-to-Spanish xPQA retrieval split. Both queries and candidate documents are Spanish. The task models product question answering in an e-commerce setting: a user asks a short Spanish question about a product, and the retriever must find short Spanish answer snippets that address the requested detail. Unlike the cross-lingual xPQA directions, this split keeps language constant, so exact product vocabulary can help more.
The Nano split contains 200 queries, 1,941 documents, and 488 positive relevance judgments. Queries average about 45 characters, while documents average about 68 characters. The average positives per query is 2.44, and 127 queries have multiple positives. Many documents are direct answer snippets with yes/no polarity, quantities, materials, model codes, dimensions, and compatibility claims.
Details
What the Original Data Measures
xPQA collects product questions and candidate answer information from e-commerce data. The retrieval objective is to rank answer snippets that fully or partially answer the question. Even in the monolingual Spanish condition, this is not generic semantic search: relevance depends on product-specific facts, such as whether a pack has 12 units, whether a material is stainless steel, whether an item is real silver, or whether a product fits a specific use.
The task rewards models that preserve concrete purchase and usage details. A related product snippet is not enough unless it answers the exact question.
Observed Data Profile
Queries are compact product questions, often asking yes/no or either/or details. Candidate documents are also short and answer-like. Many begin with Sí, No, or Un cliente ha dicho, followed by the relevant product fact. This makes polarity, attribute names, quantities, and product constraints central to retrieval.
Examples include questions about whether a pack of three straps contains different sizes, whether sizes run large or tight, whether a guitar model is acoustic or electro-acoustic, how to choose a size, and whether a pack includes 12 units.
BM25 Evaluation Profile
BM25 is moderately strong, with nDCG@10 of 0.4829, hit@10 of 0.7000, and recall@100 of 0.7766. Because both query and answer are Spanish, lexical overlap on product terms, quantities, sizes, and materials helps substantially. This is much easier for BM25 than the cross-lingual XPQA directions.
BM25 still misses many positives. Answer snippets may use different wording from the question, or a customer answer may imply the property indirectly. Exact overlap can also retrieve wrong snippets from the same product category or same item.
Dense Evaluation Profile
The dense harrier-oss-270m run is strongest, with nDCG@10 of 0.5667, hit@10 of 0.7650, and recall@100 of 0.8975. Dense retrieval improves because it can match product-question intent to answer snippets even when the wording differs. It can connect a question about "tallas grandes o justas" to a snippet describing a tight, body-shaping fit.
The dense gain shows that monolingual product QA is still semantic. Product answers are short, informal, and often indirect. Dense retrieval is better at connecting the user's requested attribute to the answer evidence.
Reranking Hybrid Evaluation Profile
reranking_hybrid reports nDCG@10 of 0.5582, hit@10 of 0.7400, and recall@100 of 0.8832. Candidate lists contain 100 to 101 items, and 20 rows use the positive safeguard. Hybrid retrieval is close to dense but slightly lower across the reported metrics.
This suggests that lexical evidence is useful but dense retrieval already captures most of the relevant product-answer relation. Hybrid search may still be useful for a reranker because it preserves product codes and exact quantities, but dense is the best direct first-stage profile.
Metric Interpretation for Model Researchers
This split is dense-favorable, with BM25 providing a much stronger baseline than in the cross-lingual directions. The gap between BM25 and dense reflects paraphrase and answer-style mismatch rather than translation. Hybrid retrieval is near dense but does not surpass it.
Because many queries have multiple positives, recall@100 remains important. A product question can be answered by several snippets, and systems should retrieve more than one valid answer when available. nDCG@10 measures whether the most useful snippets appear early.
Query and Relevance Type Tendencies
Representative queries ask whether a pack contains three same-size straps, whether sizes are large or tight, whether a guitar model is electro-acoustic, how to choose a wrist size, and whether a pack contains 12 units. Relevant snippets answer with short factual claims, measurements, or customer advice.
The task is practical and detail-oriented. It rewards preserving quantities, polarity, model type, material, compatibility, and fit. Broad product similarity is not enough.
Representative Failure Modes
BM25 may retrieve snippets sharing a product term but answering a different attribute. Dense retrieval may retrieve semantically plausible snippets from the same product category that are not specific enough. Hybrid retrieval can overvalue shared numbers or product identifiers when they do not answer the query.
Polarity confusion is a common risk. A snippet with Sí or No is only useful if it answers the exact property asked. A model also needs to distinguish between product title information and customer-experience evidence.
Training Data That May Help
Useful training data includes Spanish xPQA train examples, Spanish e-commerce QA pairs, customer-question to answer-snippet retrieval pairs, and same-product or same-category hard negatives. Training should exclude xPQA test examples, Nano queries, qrels, and positive snippets.
Hard negatives should share the same product or category but answer another detail. Examples with wrong quantities, wrong fit, different material, or incompatible model variants are especially valuable.
Model Improvement Notes
Dense models can improve through Spanish product-domain supervision and stronger modeling of polarity, quantities, units, and compatibility. Sparse systems can improve by preserving model codes and numeric tokens, but they need paraphrase robustness for answer snippets. Hybrid systems are useful for recall and reranking, though dense retrieval is the best direct profile here.
For evaluation, this split measures monolingual e-commerce answer retrieval. The strongest systems should rank short answer snippets by whether they answer the exact product question.
Example Data
| Query | Positive document |
| el pack de 3 cintas, ¿es una de cada tamaño o las 3 del mismo tamaño? [69 chars] | El paquete contiene 3 piezas de 120 cm de largo. [48 chars] |
| que son tallas grandes o justas? [32 chars] | Son de talla ajustada moldeando la curvatura del cuerpo. [56 chars] |
| és el modelo acústico o electro acústico? [41 chars] | Este producto es una guitarra electroacústica. [46 chars] |
Source Reference Table
| Source | What it contributes |
| xPQA paper | Original product QA dataset and ranking objective. |
| MTEB paper | Benchmark context. |
| MTEB task card | Retrieval packaging. |
Dataset Information
| Field | Value |
| Nano set | NanoMTEB-Spanish |
| Backing dataset | NanoMTEB-Spanish |
| Task / split | xpqa_spa_spa |
| Hugging Face dataset | hakari-bench/NanoMTEB-Spanish |
| Language | es |
| Category | natural_language |
| Queries | 200 |
| Documents | 1,941 |
| Positive qrels | 488 |
| Positives / query avg | 2.44 |
| Positives / query min | 1 |
| Positives / query median | 2.00 |
| Positives / query max | 5 |
| Multi-positive queries | 127 (63.50%) |
| Query length avg chars | 45.16 |
| Document length avg chars | 68.28 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.4829 | 0.7000 | 0.7766 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5667 | 0.7650 | 0.8975 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.5582 | 0.7400 | 0.8832 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: test
- Train/eval overlap audit: not_audited
- Leakage note: exclude xPQA test examples, Nano queries, qrels, and positive product snippets
- Multi-positive training: multi_positive_objective
- Useful training data: Spanish xPQA train examples, Spanish e-commerce QA pairs, customer-question to answer-snippet retrieval pairs, same-product and same-category hard negatives