NanoMTEB-French / xpqa_eng_fra

Overview

xpqa_eng_fra is a cross-lingual product-question retrieval task from xPQA. Queries are French product questions, while documents are English product answer candidates or snippets. The Nano split contains 200 queries, 1,674 documents, and 451 positive qrels. Unlike single-answer QA tasks, this split has an average of 2.255 positives per query, and 52.5% of queries have multiple positive snippets. The task is a strong diagnostic for cross-lingual retrieval in e-commerce: the model must connect French customer wording to English specifications, reviews, warranties, dimensions, compatibility statements, and other product facts.

Details

What the Original Data Measures

xPQA: Cross-Lingual Product Question Answering across 12 Languages introduced a cross-lingual product QA benchmark where non-English questions are matched to English candidates containing answer evidence. The benchmark is domain-specific: product questions often ask about compatibility, size, warranty, materials, setup, accessories, or subjective experience. These cues differ from general web QA and from Wikipedia-style retrieval.

In xpqa_eng_fra, the retrieval model sees French questions and English candidate documents. The goal is not only translation, but answerability: a positive snippet must contain enough information to answer the product question.

Observed Data Profile

The split has 200 mostly French queries, 1,674 mostly English documents, and 451 positive judgments. Queries average 54.61 characters, and documents average 137.30 characters. Each query has between one and five positives, with a median of two. Documents are short product snippets, sometimes written as customer answers and sometimes as metadata-like fields such as warranty descriptions or dimensions.

The multiple-positive structure matters. A model does not need to retrieve a single canonical answer; it should retrieve any snippet that answers the question. This makes recall valuable, but the cross-lingual direction and short snippets make candidate generation difficult.

BM25 Evaluation Profile

BM25 is weak in this cross-lingual direction, reaching nDCG@10 of 0.1061, hit@10 of 0.2050, and recall@100 of 0.3149. This is expected: French query terms rarely match English answer text except through shared product codes, brand names, units, numbers, and occasional cognates. BM25 can surface useful candidates when the question contains a model name or dimension, but most of the answerability signal is not lexical within a single language.

The low recall@100 means a BM25-only top-100 pool would be a poor base for reranking. Many positive snippets never reach the candidate set, so a reranker would be bounded before it can use semantic evidence.

Dense Evaluation Profile

Dense retrieval is much stronger, with nDCG@10 of 0.3639, hit@10 of 0.5850, and recall@100 of 0.7384. This indicates that harrier-oss-270m can bridge a meaningful amount of French-to-English product semantics. It can connect questions about an Android box, a Fitbit extension, blue-light protection, or phone compatibility to English snippets that answer those questions.

The task remains difficult because product language is terse and specific. Dense models must distinguish compatible from incompatible, material from dimension, and evidence-bearing answers from vague product descriptions. Strong performance likely reflects both multilingual alignment and product-domain representation.

Reranking Hybrid Evaluation Profile

The reranking_hybrid profile reaches nDCG@10 of 0.1775, hit@10 of 0.3200, and recall@100 of 0.6918. It improves substantially over BM25 recall, but it does not approach the dense-only top-10 quality. The candidate lists contain 100 to 101 entries, and 49 rows use the rank-101 safeguard to force a relevant candidate into the pool.

This pattern shows that hybrid search recovers part of the dense signal while retaining useful lexical anchors, but lexical evidence can still dilute ranking quality in a strongly cross-lingual task. For downstream reranking, the hybrid pool is more usable than BM25 alone, but dense candidates are the cleaner starting point for this split.

Metric Interpretation for Model Researchers

xpqa_eng_fra is dense-favorable and cross-lingual. BM25 is limited by language mismatch, dense retrieval is the leading candidate source, and reranking_hybrid lands between them. Recall@100 is particularly important because more than half the queries have multiple positives; a strong candidate generator should retrieve at least one answerable snippet and preferably several for robust reranking.

nDCG@10 measures whether answerable snippets are ranked early, while hit@10 measures whether any positive is available to a top-k consumer. Because positives are short and domain-specific, improvements should be interpreted as progress in multilingual product QA retrieval rather than broad passage retrieval alone.

Query and Relevance Type Tendencies

Queries are customer-style French product questions. They often ask about warranty, volume, dimensions, compatibility, materials, accessories, setup, and subjective advantages. Documents are short English snippets that may come from product metadata, customer answers, or review-like evidence.

Relevance is answerability-based. A document is positive if it contains enough information to answer the question, not merely because it mentions the same product. Multiple snippets can be relevant for the same query when they provide the same answer or complementary evidence.

Representative Failure Modes

BM25 fails when translation is required or when the useful English answer uses different vocabulary from the French question. Dense retrieval can fail by matching the product category but missing the specific property being asked, such as compatibility, material, or warranty. Hybrid retrieval can over-rank snippets with shared numbers or product terms even when they do not answer the question.

Another common failure mode is confusing vague product descriptions with direct answers. For this task, relevance depends on whether the snippet resolves the customer's question, not whether it describes the product generally.

Training Data That May Help

Useful training data includes xPQA train examples, French-to-English product QA retrieval pairs, bilingual e-commerce FAQ data, and hard negatives from the same product category. Training should exclude xPQA test examples, Nano queries, qrels, and positive product candidates.

Synthetic data should pair French product questions with English snippets containing concrete evidence: warranty fields, dimensions, compatibility, volume, accessory lists, material descriptions, and customer-use statements. Multi-positive training is appropriate because several snippets can answer the same question.

Model Improvement Notes

Models should combine multilingual alignment with product-domain specificity. Dense encoders need to preserve small but decisive details such as units, negation, compatibility constraints, and material terms. Rerankers should be trained to judge answerability directly, because topical similarity to the same product is not enough.

Example Data

Query	Positive document
bonjour, quels sont les avantages de cette box android, comparée aux autres ? merci [83 chars]	i have had several different android boxes and find this one of the best // easy to set up and lots of memory storage. [118 chars]
sur quel produit fitbit avez vous essayé cette extension ? [58 chars]	this worked great as an extender for the fitbit charge. [55 chars]
bonjour, la vitre est-elle en verre ou en plastique? [52 chars]	the front transparent plastic is a good protect the pictures. [61 chars]

Source Reference Table

Title	Year	Type	URL
xPQA: Cross-Lingual Product Question Answering across 12 Languages	2023	Paper	https://arxiv.org/abs/2305.09249
MTEB: Massive Text Embedding Benchmark	2023	Paper	https://arxiv.org/abs/2210.07316
mteb/XPQARetrieval	2025	Dataset card	https://huggingface.co/datasets/mteb/XPQARetrieval

Dataset Information

Field	Value
Nano set	NanoMTEB-French
Backing dataset	NanoMTEB-French
Task / split	xpqa_eng_fra
Hugging Face dataset	hakari-bench/NanoMTEB-French
Language	multilingual
Category	natural_language
Queries	200
Documents	1,674
Positive qrels	451
Positives / query avg	2.25
Positives / query min	1
Positives / query median	2.00
Positives / query max	5
Multi-positive queries	105 (52.50%)
Query length avg chars	54.61
Document length avg chars	137.30

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.1061	0.2050	0.3149	top-500
Dense	`harrier_oss_v1_270m`	0.3639	0.5850	0.7384	top-500
Reranking hybrid	`reranking_hybrid`	0.1775	0.3200	0.6918	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: test
Train/eval overlap audit: not_audited
Leakage note: exclude xPQA test examples, Nano queries, qrels, and positive product candidates
Multi-positive training: multi_positive_objective
Useful training data: xPQA train examples, French-to-English product QA retrieval pairs, bilingual e-commerce FAQ data, hard negatives from the same product category