NanoMTEB-French / fquad

Overview

fquad is a French passage-retrieval task derived from FQuAD, a French question-answering benchmark built from French Wikipedia. Each query is a French reading-comprehension question, and the positive document is the Wikipedia passage that contains the evidence needed to answer it. The Nano split contains 200 queries, 269 candidate documents, and 200 positive qrels, with exactly one positive passage per query. Because the document pool is small and the passages are article-like paragraphs with explicit entities, dates, and definitions, this task is a compact test of whether a retrieval model can map a French question to the answer-bearing encyclopedic passage. It is especially useful for studying the boundary between lexical matching and semantic QA retrieval in a native French setting.

Details

What the Original Data Measures

The paper "FQuAD: French Question Answering Dataset" introduced a SQuAD-style reading-comprehension benchmark for French, with native French questions and answer spans over French Wikipedia passages. FQuAD2.0 later added unanswerable examples, but this Nano retrieval task focuses on answer-bearing passages: the model is not asked to extract the answer span, only to retrieve the passage that contains the evidence.

The resulting retrieval problem is narrower than open-domain web search but closer to how QA retrieval is used in a retrieve-then-rerank or retrieve-then-read pipeline. Queries are short natural questions, while documents preserve enough surrounding context to make the answer identifiable. A good model needs to recognize French question forms, named entities, paraphrased predicates, and local topical context.

Observed Data Profile

The Nano split has 200 French queries, 269 documents, and 200 positive judgments. Query text averages 56.21 characters, while documents average 898.31 characters, so documents are roughly 16 times longer than the questions. Documents usually begin with an article-title prefix followed by compact encyclopedic prose. Examples cover biographical, historical, film, religious, political, and scientific topics.

The single-positive setup makes each query a direct evidence-location task. A retriever cannot benefit from many interchangeable positives, but the small candidate pool means the target passage is often lexically close to the query. At 269 documents, the corpus is also smaller than the nominal top-500 candidate depth, so the BM25 and dense candidate lists effectively cover the whole corpus for every query.

BM25 Evaluation Profile

BM25 is very strong on this task: the dataset-provided BM25 candidates reach nDCG@10 of 0.8899, hit@10 of 0.9700, and recall@100 of 1.0000. The result shows that exact or near-exact term overlap is highly informative for FQuAD-style French retrieval. Many questions reuse salient nouns, named entities, or answer-bearing phrases from the positive passage, so BM25 can often rank the right paragraph near the top without modeling deep semantics.

This does not mean the task is trivial. Errors are more likely when a question asks for a relation that is expressed with different wording in the passage, or when several paragraphs share the same article title and entities. Still, the dominant retrieval signal is lexical specificity: rare names, film titles, institution names, and dates tend to anchor the correct document.
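To make the lexical signal described in this section concrete, the following is a minimal Okapi BM25 sketch. It is not the implementation behind the dataset-provided `bm25` candidates: the tokenizer, the `k1` and `b` values, and the function names are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on non-word characters (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def bm25_scores(query: str, docs: List[str], k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with Okapi BM25.

    k1 and b are common default values, assumed here rather than taken
    from the benchmark. Repeated query terms are counted once.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df: Counter = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

Rare terms such as a surname receive a high inverse document frequency, which is why a question that repeats an entity from its positive passage tends to rank that passage first. A passage with no query terms at all scores zero, which is the paraphrase failure case noted above.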

Dense Evaluation Profile

The dense candidate column (config `harrier_oss_v1_270m`) is weaker than BM25 here, with nDCG@10 of 0.8102, hit@10 of 0.9250, and recall@100 of 0.9600. Dense retrieval still performs well, indicating that the semantic relationship between French questions and answer-bearing passages is learnable. However, the gap relative to BM25 suggests that embedding similarity sometimes smooths over the exact lexical cues that distinguish one Wikipedia paragraph from another.

For model researchers, this is a useful diagnostic: a dense model that improves substantially on the dense baseline here likely handles French entity grounding, question intent, and long-passage pooling well. A model that underperforms BM25 may still be semantically fluent, but may not preserve enough fine-grained entity and phrase identity for compact QA retrieval.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate set reaches nDCG@10 of 0.8666, hit@10 of 0.9500, and recall@100 of 1.0000. It matches BM25's perfect recall@100 while ranking slightly below BM25 at the top of the list. This is the expected shape for a hybrid search emulation when lexical matching is already very strong: the hybrid list protects recall and combines semantic and lexical evidence, but it does not automatically beat the best lexical ordering for the first ten positions.

There are no safeguard rank-101 rows for this task, because the positive documents are already covered in the top-100 hybrid candidates. The hybrid profile is therefore best read as a high-recall candidate pool for later reranking rather than as proof that dense semantics dominate the task.
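The card does not state how the reranking_hybrid list is fused. As an illustration only, reciprocal rank fusion (RRF) is one common way to combine a lexical and a dense ranking into a single top-100 pool. The function name, the constant `k = 60`, and the tie-breaking rule below are assumptions, not the benchmark's method.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[str]], k: int = 60, depth: int = 100
) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list that contains it.
    k = 60 is a conventional choice, assumed here for illustration.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by document id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:depth]
```

A fusion of this kind keeps any document that either list ranks highly, which is consistent with the hybrid pool matching BM25's recall@100 without necessarily reproducing BM25's top-10 order.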

Metric Interpretation for Model Researchers

fquad is a BM25-favorable French QA retrieval task. The most informative comparison is not absolute nDCG@10 alone, but the gaps between BM25, dense, and hybrid behavior. BM25 leads at top-10 ranking, dense loses some recall at depth 100, and reranking_hybrid restores recall while staying close to BM25. A reranker evaluated on this candidate pool should be expected to exploit both exact entity overlap and question-passage entailment.

Because each query has one positive, nDCG@10 and hit@10 are easy to interpret: they largely measure whether the correct passage appears early. Recall@100 measures whether a downstream reranker has any chance to recover the correct answer. On this split, recall@100 shortfalls are a concern for the dense candidates (0.9600) rather than for BM25 or hybrid candidate generation (both 1.0000).
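With exactly one positive per query and binary relevance, the three reported metrics reduce to simple functions of the positive's rank. The sketch below shows that reduction; the function names are illustrative and are not taken from the benchmark's evaluation code.

```python
import math
from typing import Optional, Sequence


def positive_rank(ranked_ids: Sequence[str], positive_id: str) -> Optional[int]:
    """Return the 1-based rank of the positive document, or None if it is absent."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == positive_id:
            return rank
    return None


def ndcg_at_k(rank: Optional[int], k: int = 10) -> float:
    """Single-positive nDCG@k: the ideal DCG is 1, so nDCG = 1 / log2(1 + rank)."""
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(1 + rank)


def hit_at_k(rank: Optional[int], k: int = 10) -> float:
    """1.0 if the positive appears in the top k, else 0.0.

    With one positive per query, recall@k is the same quantity,
    so recall@100 can be computed as hit_at_k(rank, k=100).
    """
    return 1.0 if rank is not None and rank <= k else 0.0
```

For example, a positive at rank 1 scores 1.0 on nDCG@10, while a positive at rank 2 scores 1 / log2(3), about 0.631. Averaging these per-query values over the 200 queries gives the dataset-level figures.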

Query and Relevance Type Tendencies

Queries are extractive French questions asking who, when, where, why, or what relationship holds. The relevant document normally states the answer directly inside a Wikipedia-style paragraph. Some questions are entity-heavy, while others depend on recognizing a paraphrased event or role.

Relevance is evidence-based rather than topical-only. A paragraph about the same person or film is not enough if it does not contain the required fact. This makes same-article hard negatives useful: they share many surface terms but lack the precise answer evidence.
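Same-article hard negatives can be mined by grouping passages on their article-title prefix. The sketch below assumes that each document text starts with a prefix of the form `<article-slug>_<number>_<number>`, a pattern inferred only from the example rows in this card; the helper names and the corpus layout (a dict from document id to text) are hypothetical.

```python
import re
from typing import Dict, List, Optional

# Assumed prefix shape, e.g. "casablanca-(film)_7_9 Ce sont ...".
PREFIX_RE = re.compile(r"^(?P<slug>\S+)_\d+_\d+(?=\s|$)")


def article_slug(doc_text: str) -> Optional[str]:
    """Extract the article slug from the document prefix, or None if it does not match."""
    match = PREFIX_RE.match(doc_text)
    return match.group("slug") if match else None


def same_article_negatives(positive_id: str, corpus: Dict[str, str]) -> List[str]:
    """Return ids of passages that share the positive's article slug but are not the positive."""
    slug = article_slug(corpus[positive_id])
    if slug is None:
        return []
    return sorted(
        doc_id
        for doc_id, text in corpus.items()
        if doc_id != positive_id and article_slug(text) == slug
    )
```

Passages returned this way share the article's entities and vocabulary, so they are topically relevant but, under the single-positive qrels, not answer-bearing for the query.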

Representative Failure Modes

BM25 can fail when the query uses a paraphrase and the positive passage uses a different formulation, or when several candidate passages share the same entity name and broad topic. Dense retrieval can fail in the opposite direction by retrieving a semantically related paragraph that lacks the exact answer-bearing fact. Both approaches can confuse adjacent passages from the same article when the title prefix dominates the representation.

Hybrid retrieval is less likely to miss the target entirely, but its top ranks can still be pulled toward passages that are lexically similar or semantically near but not answer-bearing. This makes the task a good candidate for testing cross-encoder rerankers and late-interaction models over a compact top-100 pool.

Training Data That May Help

Useful training data includes non-overlapping FQuAD train examples, French Wikipedia QA retrieval pairs, same-article hard negatives, and French extractive QA paraphrases. Training should exclude FQuAD test examples, Nano queries, qrels, and positive French Wikipedia passages likely to overlap with the evaluation split.
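A first-pass exclusion filter can be sketched as exact matching after text normalization. This is an assumption-laden illustration, not an audit procedure defined by the benchmark: it only catches verbatim or near-verbatim overlap, and the function names and the pair layout (question, passage) are hypothetical.

```python
import unicodedata
from typing import Iterable, List, Tuple


def normalize(text: str) -> str:
    """Normalize Unicode, fold case, and collapse whitespace for comparison."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def filter_leaky_pairs(
    train_pairs: Iterable[Tuple[str, str]],
    eval_queries: Iterable[str],
    eval_passages: Iterable[str],
) -> List[Tuple[str, str]]:
    """Drop training pairs whose question or passage matches the evaluation split."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_passages}
    return [
        (question, passage)
        for question, passage in train_pairs
        if normalize(question) not in blocked_queries
        and normalize(passage) not in blocked_passages
    ]
```

NFKC normalization maps the non-breaking space that French typography places before a question mark to an ordinary space, so questions that differ only in that character still match. Paraphrased or re-segmented passages would need fuzzier matching than this sketch provides.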

Synthetic data can be useful if it keeps the evidence structure explicit: generate French Wikipedia-style paragraphs with entities, dates, definitions, and roles, then generate answerable French questions whose answer is stated in the paragraph. The most valuable negatives are not random documents, but nearby paragraphs from the same topic that do not answer the question.

Model Improvement Notes

Models should preserve exact French entity names while also handling question paraphrase. Strong passage pooling matters because the answer evidence may be a small span inside a much longer paragraph. For dense models, contrastive training with same-article negatives can reduce over-reliance on topical similarity. For rerankers, explicit attention to answer-bearing spans should improve top-10 ordering beyond what candidate recall alone provides.
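One common form of contrastive objective is the InfoNCE loss, sketched below for a single query. The card does not prescribe a loss, so the choice of InfoNCE, the temperature value, and the function name are assumptions; the scores stand for query-passage similarities such as cosine similarity.

```python
import math
from typing import Sequence


def info_nce_loss(
    pos_score: float, neg_scores: Sequence[float], temperature: float = 0.05
) -> float:
    """InfoNCE loss for one query: negative log-softmax of the positive over all candidates.

    The temperature of 0.05 is an illustrative value, not a recommendation.
    """
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum_exp - logits[0]
```

A same-article negative that scores close to the positive contributes far more loss than a random passage with low similarity, which is the mechanism by which such negatives push the model to separate answer-bearing passages from merely topical ones.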

Example Data

Query	Positive document
Quand est-ce que Pierre Lambert est proche des Jésuites ? [57 chars]	pierre-lambert-de-la-motte_2_36 La spiritualité de Pierre Lambert de La Motte évolue tout au long de sa vie. Il est marqué par son époque et principalement par le centralisme issue de l'Église post tridentine, qui conduit à la codification de bon nombre d'actes de piété, mais aussi par le rigorisme naissant du XVIIe siècle, avec une recherche de l'austérité. Les deux écoles de spiritualité qui le marquent sont celle des Jésuites dont il est très proche quand il travaille à Rouen, mais aussi ce qui deviendra l'École française de spiritualité, qui recherche l'imitation du Christ, et le rôle important donné à la sanctification des prêtres. Ces sources de la formation spirituelle de Pierre Lambert le marquent profondément, même si des évolutions sont visibles au cours de sa vie. [786 chars]
Comment se nomme le frère de Carnot ? [37 chars]	sadi-carnot-(physicien)_12_8 Parmi ses écrits posthumes, un manuscrit intitulé Recherche d’une formule propre à représenter la puissance motrice de la vapeur d’eau, rédigé entre novembre 1819 et mars 1827 mais probablement après les Réflexions, fut conservé. Dans celui-ci il ébauchait la première loi de la thermodynamique, en tentant de préciser le lien entre travail et chaleur. Cette note fut finalement publiée en 1878, c’est-à-dire trop tardivement pour pouvoir influer sur le développement de la science, par Hippolyte Carnot, dans un volume édité en hommage à son frère dans lequel il inséra une « Notice biographique sur Sadi Carnot ». C’est sans doute au printemps 1832 que Sadi découvre le principe de l’équivalence et qu’il reprend, dans de brèves notes, les conclusions d’un long mémoire, qui fut finalement détruit par Hippolyte. Ces notes, publiées également en 1878, indiquent qu’il avait alors renoncé à la théorie du calorique qui imprégnait encore son essai de 1824, et au sujet de... [1,000 / 1,382 chars]
Pour quoi sont réputés les deux frères engagés par Wallis ? [59 chars]	casablanca-(film)_7_9 Ce sont Julius J. et Philip G. Epstein qui sont engagés par Wallis pour adapter la pièce au grand écran. Réputés pour leur esprit ironique, les deux frères introduisent plusieurs personnages secondaires hauts en couleurs ainsi que des dialogues donnant un ton fascinant aux conversations entre les protagonistes du film. Malgré leur empreinte sur le film, les Epstein quittent rapidement le projet pour se consacrer à la série de films de propagande commandés par le Gouvernement américain et réalisés pour la plupart par Frank Capra, Why We Fight. À ce moment de la production, le scénario en est arrêté au flashback, ce qui représente à peu près la moitié du film, et ne possède pas de fil narratif évident. [732 chars]

Source Reference Table

Title	Year	Type	URL
FQuAD: French Question Answering Dataset	2020	Paper	https://arxiv.org/abs/2002.06071
FQuAD2.0: French Question Answering and knowing that you know nothing	2021	Paper	https://arxiv.org/abs/2109.13209
manu/fquad2_test	2024	Dataset card	https://huggingface.co/datasets/manu/fquad2_test

Dataset Information

Field	Value
Nano set	NanoMTEB-French
Backing dataset	NanoMTEB-French
Task / split	fquad
Hugging Face dataset	hakari-bench/NanoMTEB-French
Language	fr
Category	natural_language
Queries	200
Documents	269
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	56.21
Document length avg chars	898.31

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.8899	0.9700	1.0000	top-500
Dense	`harrier_oss_v1_270m`	0.8102	0.9250	0.9600	top-500
Reranking hybrid	`reranking_hybrid`	0.8666	0.9500	1.0000	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: test
Train/eval overlap audit: not_audited
Leakage note: exclude FQuAD test examples, Nano queries, qrels, and positive French Wikipedia passages likely to overlap with this evaluation
Multi-positive training: single_positive_question_document_focus
Useful training data: non-overlapping FQuAD train examples, French Wikipedia QA retrieval pairs, same-article hard negatives, French extractive QA paraphrases