HAKARI-Bench

NanoMIRACL / fr

Overview

NanoMIRACL / fr is the French split of the MIRACL-style multilingual monolingual retrieval benchmark. French queries retrieve French Wikipedia passages, not translated evidence. The Nano split has 200 queries, 10,000 documents, and 417 positive qrel rows. The task is notable because BM25 keeps high candidate coverage but ranks direct evidence relatively poorly, while dense retrieval substantially improves top-rank quality. Current diagnostics show dense retrieval as the strongest nDCG@10 profile (0.6828) and reranking_hybrid as the strongest hit@10 (0.9250) and recall@100 (0.9976) profile.

Details

What the Original Data Measures

MIRACL was introduced as a multilingual ad hoc retrieval benchmark over Wikipedia passages. Its design is monolingual: French queries retrieve French passages from French Wikipedia. The benchmark emphasizes natural-language questions, passage-level evidence, and human relevance judgments.

French is one of the languages that MIRACL added beyond those inherited from the earlier Mr. TyDi/TyDi QA sources. It should therefore be read as a MIRACL-created French Wikipedia retrieval task, not as translated English retrieval. The relevant item is the French passage that supports the question, rather than a direct answer string or an article-level label.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 417 positive qrel rows. Positives per query average 2.085 (417 / 200), with a minimum of 1, a median of 2, and a maximum of 7. There are 123 multi-positive queries, representing 61.5 percent of the split. Queries average 43.26 characters, while documents average 385.31 characters.
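The profile figures above are simple aggregates over the positive qrel rows. A minimal sketch of how such a summary could be computed follows, assuming the qrels are available as (query_id, doc_id) pairs; the function name `positives_profile` and the toy ids are hypothetical and not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) positive pairs."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi_positive = sum(1 for count in per_query if count > 1)
    return {
        "queries": len(per_query),
        "positive_qrels": len(qrels),
        "avg": mean(per_query),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive_share": multi_positive / len(per_query),
    }


# Toy qrels with hypothetical ids (assumption: not real dataset rows).
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"),
    ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
]
profile = positives_profile(toy_qrels)

# The reported split-level figures follow from the same arithmetic.
reported_avg = 417 / 200          # positives per query
reported_multi_share = 123 / 200  # share of multi-positive queries
```

Applied to the real qrels, the same aggregation should reproduce the reported average and the 61.5 percent multi-positive share, provided each qrel row is a distinct positive pair.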

The examples are ordinary French questions using forms such as Qui, Quel, Quelle, Quels, Quelles, Quand, Comment, Combien, Pourquoi, and Qu’est-ce. Topics include people, places, science, music, politics, history, mathematics, demographics, food, religion, electricity generation, biomedical data, geography, and definitions.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.4658, hit@10 = 0.9000, and recall@100 = 0.9832. BM25 is good at keeping French positives somewhere in the candidate pool. Exact entities, place names, topical nouns, and technical terms often surface the right neighborhood.

The weak point is ranking. BM25 frequently retrieves plausible French passages that share topic words or morphology but do not express the requested relation. This makes the split difficult for sparse retrieval even though recall@100 is high. The model must go beyond matching fleuve, chimie, café, or morphologie and identify the passage that directly answers the question.

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.6828, hit@10 = 0.9200, and recall@100 = 0.9113. Dense retrieval is the strongest observed profile by nDCG@10. It improves top-rank quality by matching question intent and passage evidence more directly than surface overlap alone.

The tradeoff is lower recall@100 than BM25 and hybrid retrieval. Dense retrieval is better at selecting strong top evidence, but it leaves more judged positives outside the top-100 candidate set. For French, this makes dense retrieval a strong ranker but not the most complete candidate generator.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains exactly 100 candidates per query, with no safeguard rows. It achieves nDCG@10 = 0.5896, hit@10 = 0.9250, and recall@100 = 0.9976. Hybrid retrieval is below dense retrieval by nDCG@10, but it has the best hit@10 and the strongest top-100 positive coverage.

This profile is useful for reranking pipelines. BM25 contributes exact French surface forms and names, while dense retrieval contributes semantic relation matching. The combined candidate set almost fully preserves judged positives, but a downstream reranker is still needed to reach dense-level or better top-rank ordering.
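The source does not state how the reranking_hybrid candidate set is built. As one common way to combine a lexical and a dense ranking, the sketch below uses reciprocal rank fusion (RRF); this is a swapped-in technique for illustration, and the function name, the constant `k=60`, and the toy document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    Ties are broken by doc id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Toy rankings with hypothetical ids (assumption: not real dataset rows).
bm25_ranking = ["d3", "d1", "d8"]
dense_ranking = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank well rise to the top, while documents found by only one retriever still stay in the pool, which is the coverage property a reranking stage relies on.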

Metric Interpretation for Model Researchers

This task is multi-positive for 61.5 percent of queries. Hit@10 measures whether at least one relevant passage appears near the top. nDCG@10 rewards ranking relevant passages high, and recall@100 measures how much of the judged positive set survives for reranking.
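The three metrics can be sketched as follows, assuming binary relevance (a passage is either a judged positive or not) and a ranked list without duplicate ids. The benchmark's exact implementation is not given in the source, so the function names and details such as the log2 discount are assumptions based on the standard definitions.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of judged positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    ideal = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / ideal if ideal else 0.0


# Toy query with three positives, two of them retrieved (hypothetical ids).
positives = {"d1", "d2", "d3"}
ranked = ["d9", "d1", "d7", "d2"]
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

In this toy case hit@10 is satisfied by a single positive, while recall@100 and nDCG@10 are both penalized for the missing third positive. That difference is why the multi-positive share matters when reading the three metrics together.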

The French split clearly separates ranking quality from candidate coverage. Dense retrieval is best for top-rank ordering, BM25 keeps many positives but ranks them less well, and reranking_hybrid gives the strongest recall. A good French retrieval system should combine dense semantic ranking with lexical coverage and then rerank for evidence specificity.
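The separation between ranking quality and coverage can be read directly from the reported figures. The short sketch below restates the numbers given for the three profiles and picks the best profile per metric; the dictionary layout and the helper name `best_profile` are illustrative assumptions.

```python
# Reported metrics for the three candidate profiles (values from this page).
reported = {
    "bm25": {"ndcg@10": 0.4658, "hit@10": 0.9000, "recall@100": 0.9832},
    "harrier_oss_v1_270m": {"ndcg@10": 0.6828, "hit@10": 0.9200, "recall@100": 0.9113},
    "reranking_hybrid": {"ndcg@10": 0.5896, "hit@10": 0.9250, "recall@100": 0.9976},
}


def best_profile(metric):
    """Return the profile name with the highest value for the given metric."""
    return max(reported, key=lambda name: reported[name][metric])


best = {metric: best_profile(metric) for metric in ("ndcg@10", "hit@10", "recall@100")}

# Gap between dense and BM25 top-rank quality, rounded to four decimals.
ndcg_gain_dense_over_bm25 = round(
    reported["harrier_oss_v1_270m"]["ndcg@10"] - reported["bm25"]["ndcg@10"], 4
)
```

No single profile wins every column: the dense profile leads on nDCG@10, and reranking_hybrid leads on hit@10 and recall@100.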

Query and Relevance Type Tendencies

Queries are concise French information needs about definitions, people, places, dates, quantities, causes, roles, classifications, and scientific or geographic facts. They often include an obvious topic word, but the relevant passage must state the requested relation.

Relevant documents are French Wikipedia passages with title context and answer-bearing prose. The task rewards sense disambiguation, entity matching, question-form understanding, and passage-level relation selection.

Representative Failure Modes

BM25 can retrieve a passage that shares the query's key term but carries the wrong sense or the wrong answer relation. A question about the Orinoco river can retrieve an unrelated song or named-entity passage containing Orénoque. A question about branches of chemistry can retrieve language or chemistry-adjacent pages before the passage that describes the relevant scientific domains. A question about the largest coffee producer can retrieve coffee-terroir or country-specific pages before a general coffee passage with the answer. A question about flower morphology can retrieve linguistic morphology pages instead of botanical evidence.

Dense retrieval can fail by choosing a semantically related passage that lacks the exact fact. Hybrid retrieval reduces missing positives but still needs a reranker to prefer the direct evidence passage among many plausible French candidates.

Training Data That May Help

Useful training data includes non-overlapping MIRACL French training data, French Wikipedia question-to-passage retrieval pairs, French open-domain QA evidence retrieval datasets, and French entity-attribute supervision for places, dates, counts, professions, definitions, causes, and classifications. Hard negatives should include homonymous and near-topic French passages.
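One simple way to obtain near-topic hard negatives is to take highly ranked retrieval candidates that are not judged positive. The sketch below illustrates that idea only; the function name, its parameters, and the toy ids are hypothetical, and unjudged passages may in fact be relevant, so mined negatives may need filtering.

```python
def mine_hard_negatives(ranked_candidates, positives, max_negatives=4, skip_top=0):
    """Pick the highest-ranked candidates that are not judged positives.

    skip_top optionally ignores the first few ranks, a common heuristic for
    reducing the risk of picking unjudged passages that are actually relevant.
    """
    negatives = []
    for doc_id in ranked_candidates[skip_top:]:
        if doc_id not in positives:
            negatives.append(doc_id)
        if len(negatives) == max_negatives:
            break
    return negatives


# Toy candidate list with hypothetical ids (assumption: not real dataset rows).
ranked_candidates = ["d7", "d1", "d4", "d9", "d2"]
judged_positives = {"d1", "d2"}
hard_negatives = mine_hard_negatives(ranked_candidates, judged_positives, max_negatives=2)
```

Because lexical candidates tend to share topic words with the query, negatives mined from a BM25-style ranking are likely to resemble the homonymous and near-topic passages described above.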

Synthetic data can help when it creates French Wikipedia-style passages with titles, aliases, dates, places, demographic facts, definitions, roles, and factual evidence. Generated questions should use varied Qui, Quel, Quelle, Quels, Quelles, Quand, Comment, Combien, Pourquoi, and Qu’est-ce forms. Comparable evaluation should exclude upstream development/test data or other MIRACL-derived examples likely to overlap with this Nano split.
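A basic overlap check can be sketched as a normalized exact-match filter on query text. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the second training query is invented for the example, and exact matching will not catch paraphrases or passage-level overlap.

```python
import unicodedata


def normalize_query(text):
    """Normalize Unicode form, case, apostrophes, and whitespace for comparison."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = text.replace("’", "'")  # typographic apostrophe to ASCII apostrophe
    return " ".join(text.split())


def drop_overlapping(train_queries, eval_queries):
    """Keep only training queries whose normalized text is absent from the eval set."""
    blocked = {normalize_query(query) for query in eval_queries}
    return [query for query in train_queries if normalize_query(query) not in blocked]


eval_queries = ["Qu’est-ce qu’un cardinal fait?"]
train_queries = [
    "Qu'est-ce qu'un  cardinal fait?",        # same query, different typography
    "Quelle est la capitale de la France ?",  # hypothetical non-overlapping query
]
kept = drop_overlapping(train_queries, eval_queries)
```

A stricter pipeline would likely also compare passage ids or passage text against the evaluation corpus, since the same evidence can be reached from differently worded questions.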

Model Improvement Notes

Dense retrievers should preserve their French top-rank advantage while recovering more lexical candidate coverage. Sparse systems should improve sense disambiguation and relation-aware term weighting rather than relying only on surface overlap. Rerankers should penalize topic-near but non-answering passages, especially for ambiguous terms and scientific or geographic questions.

For hybrid systems, NanoMIRACL / fr supports a high-coverage candidate stage followed by strong reranking. The hybrid profile nearly covers all positives, while the dense profile shows the level of semantic ranking quality needed at the top of the list.

Example Data

| Query | Positive document |
| --- | --- |
| Qu’est-ce qu’un cardinal fait? [30 chars] | Cardinal (religion) En fait, la nomination de cardinaux est une indication politique sur le pontificat en cours et la future élection, les cardinaux étant chargés d'élire le pape. Dans l'histoire, elle a aussi été une manière d'honorer les cadets de grandes familles royales ou nobles et de récompenser des proches. Cet état de fait était désigné sous le nom de népotisme, du latin "nepos", le neveu. Le pape choisissait un de ses neveux qu'il créait cardinal afin de faire entrer sa parenté dans la « carrière » ecclésiastique. [529 chars] |
| Comment s’appelle la femme de Pablo Picasso? [44 chars] | Pablo Picasso Son fils Paulo naît le . Durant l'été, il s'installe avec Olga et Paulo à Fontainebleau. Il y peint les "Femmes à la fontaine" (Paris, musée Picasso et New York, ) et "Les Trois Musiciens" (New York, et Philadelphie ). Cette même année, le musée de Grenoble obtient du peintre le premier tableau pour exposition dans une collection publique française ("Femme lisant"), représentant sa femme Olga Khokhlova. En , lors d'un séjour à Dinard sur la côte nord de la Bretagne, il peint "Deux femmes courant sur la plage" ("La Course", Paris, Musée Picasso). Puis, en décembre, il réalise le décor pour L'"Antigone" de Cocteau, créée par Charles Dullin au théâtre de l'Atelier. En 1923, il fait un nouveau séjour estival sur la Côte d'Azur, au cap d'Antibes, et peint "La Flûte de Pan" (Paris, Musée Picasso). Pendant l'été 1924, il séjourne à la villa La Vigie à Juan-les-Pins (Côte d'Azur), il fait son "Carnet de dessins abstraits" et peint "Paul en arlequin" (Paris, musée Picasso). [994 chars] |
| Qu’est-ce que c’est un référendum? [34 chars] | Référendum Le référendum appartient au domaine du droit : on ne peut décider par référendum que des lois. Dans les régimes de démocratie représentative, des parlementaires discutent et amendent les lois. Le référendum, selon les conceptions du juriste Raymond Carré de Malberg, a pour objet de limiter et de contrôler ce pouvoir. Si , il est sain que le compromis que les parlementaires ont trouvé entre les divers intérêts et opinions soit soumis au corps électoral. La place du référendum dans la hiérarchie des pouvoirs pose un problème pratique. Les lois sont soumises à un contrôle de conformité à la Constitution, qui notamment protège les minorités. Comment le référendum se place-t-il par rapport à cette norme ? [721 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages | 2022 | paper | https://arxiv.org/abs/2210.09984 |
| MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages | 2023 | paper | https://aclanthology.org/2023.tacl-1.63/ |
| MIRACL GitHub repository | | project repository | https://github.com/project-miracl/miracl |
| miracl/miracl-corpus | | dataset card | https://huggingface.co/datasets/miracl/miracl-corpus |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoMIRACL |
| Backing dataset | NanoMIRACL |
| Task / split | fr |
| Hugging Face dataset | hakari-bench/NanoMIRACL |
| Language | fr |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 417 |
| Positives / query avg | 2.08 |
| Positives / query min | 1 |
| Positives / query median | 2.00 |
| Positives / query max | 7 |
| Multi-positive queries | 123 (61.50%) |
| Query length avg chars | 43.26 |
| Document length avg chars | 385.31 |

Candidate Subsets

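| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.4658 | 0.9000 | 0.9832 | top-500 |
| Dense | harrier_oss_v1_270m | 0.6828 | 0.9200 | 0.9113 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.5896 | 0.9250 | 0.9976 | top-100 |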

Training and Leakage Metadata