HAKARI-Bench

MNanoBEIR / NanoBEIR-fr / NanoMSMARCO

Overview

This task is the French NanoBEIR version of MS MARCO, a web question answering and passage ranking benchmark built from real search-engine queries. The original MS MARCO dataset contains anonymized user questions paired with answer-bearing web passages, so the questions are often short, practical, and underspecified. In this NanoBEIR slice, French translated web-search style questions must retrieve French translated answer passages from 5,043 candidate documents. The task contains 50 queries and 50 positive relevance judgments, with exactly one positive passage per query. It is a compact diagnostic for short-query answer retrieval across everyday information needs, definitions, media questions, consumer topics, and practical web search phrasing.

Details

What the Original Data Measures

MS MARCO measures retrieval for real user information needs rather than questions written to match a known paragraph. In retrieval form, the relevant document is the passage that directly answers the question. This often requires connecting a short query to explanatory wording in an answer passage. A page that repeats a term but does not resolve the user's intent is not sufficient.

Observed Data Profile

The French Nano task has 50 queries, 5,043 documents, and 50 positives. Every query has one positive passage. Query length averages about 45 characters, and documents average about 374 characters. Example questions ask what rumination syndrome is, who sang "Here I Go Again", what role Cameron Boyce plays in Liv and Maddie, where most large deserts are located, and what a policing slang term means. Documents are concise French translated web answer passages.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.461, Hit@10 of 0.620, and Recall@100 of 0.880. Sparse retrieval gives useful coverage because many questions contain distinctive names, titles, or terms. However, it does not always rank the answer passage high because web questions often need answerability rather than surface repetition. BM25 can retrieve passages sharing an entity or term while missing the passage that actually answers the question.

Dense Evaluation Profile

The dense harrier-oss-270m baseline is strongest by top-10 ranking, with nDCG@10 of 0.575, Hit@10 of 0.720, and Recall@100 of 0.940. This shows that semantic answer matching is central for this French MS MARCO slice. Dense retrieval can connect short questions to answer passages that explain, define, or resolve the intent using different wording. It also improves candidate coverage over BM25 in this sample.

Reranking Hybrid Evaluation Profile

The reranking_hybrid profile reaches nDCG@10 of 0.531, Hit@10 of 0.700, and Recall@100 of 0.940, with three safeguard rows at 101 candidates. It matches dense Recall@100 but trails dense top-10 ranking. This suggests that hybrid search is useful as a robust candidate generator, while dense ranking is better at ordering the single answer passage near the top. BM25 still contributes exact names and terms, especially for entity-heavy queries.

Metric Interpretation for Model Researchers

Because each query has one positive, Hit@10 and Recall@100 are direct evidence-finding measures. nDCG@10 is the clearest signal for first-page quality. The contrast between dense and hybrid is useful: both have high recall, but dense places positives higher. For reranking pipelines, hybrid candidates may be sufficient; for direct retrieval, dense ordering is stronger.

Query and Relevance Type Tendencies

Queries are short French user questions or search prompts. Relevant documents are concise answer-bearing passages. Some queries are definition requests, some ask about media or names, and others ask practical facts. Hard negatives may mention the same entity or topic but answer a different question. The task rewards answerability, paraphrase handling, and precise entity matching.

Representative Failure Modes

BM25 can over-rank passages with repeated query terms but no answer. Dense retrieval can retrieve passages that answer a nearby but different question. Hybrid retrieval improves coverage but may still put a lexical distractor ahead of the answer. Failure analysis should check whether the retrieved passage directly satisfies the query.

Training and Leakage Considerations

Training should exclude MS MARCO, BEIR, NanoBEIR, and translated records likely to overlap with these queries or passages. Useful non-overlapping data includes MS MARCO-style passage-ranking pairs, French or multilingual web QA retrieval data, search-query-to-answer-passage pairs, and noisy real user-question datasets. Synthetic data should generate short French questions from concise non-evaluation web answer passages.

Model Improvement Signals

Strong models should improve answerability ranking while preserving exact entity recall. Useful training signals include short-query paraphrases, definition questions, same-entity hard negatives, and multilingual web QA examples. Hybrid systems should use sparse matching for names and rare terms while dense retrieval handles explanatory answer wording.

Example Data

QueryPositive document
Qu'est-ce que le syndrome de rumination ? [41 chars]Syndrome de rumination. Le syndrome de rumination, également appelé merycisme, est un trouble alimentaire non spécifié qui provoque la régurgitation des aliments. Bien qu'il ne soit pas identifié comme un trouble alimentaire spécifique dans le DSM-IV, certains critères ont été définis pour diagnostiquer ce trouble. [316 chars]
Qui a chanté "Here I Go Again" ? [32 chars]Pour d'autres usages, voir Here I Go Again (homonymie). "Here I Go Again" est une chanson du groupe de rock britannique Whitesnake. Parue à l'origine sur leur album de 1982, Saints & Sinners, la chanson a été réenregistrée pour leur album éponyme de 1987, Whitesnake. La chanson a été réenregistrée la même année dans une nouvelle version mixée pour la radio. [359 chars]
Quel rôle joue Cameron Boyce dans la série Liv et Maddie ? [58 chars]Préparez-vous à bien rire, les amis. Dans un EXCLUSIF avant-première de l'épisode du 19 avril de Liv & Maddie intitulé “Prom-A-Rooney.” Évidemment. Dans cet extra hilarant, on voit l'acteur de Jessie, Cameron Boyce, apparaître dans une autre série Disney pour rencontrer Maddie (Shelby Wulfert). Son personnage est, euh, excentrique ! [334 chars]

Source Reference Table

LabelURL

| MS MARCO paper (https://arxiv.org/abs/1611.09268) | | MS MARCO dataset site (https://microsoft.github.io/msmarco/Datasets.html) | | BEIR benchmark (https://github.com/beir-cellar/beir) | | MMTEB benchmark (https://arxiv.org/abs/2502.13595) | | NanoBEIR dataset (https://huggingface.co/collections/zeta-alpha-ai/nanobeir) |

Dataset Information

FieldValue
Nano setMNanoBEIR
Backing datasetNanoBEIR-fr
Task / splitNanoMSMARCO
Hugging Face datasethakari-bench/NanoBEIR-fr
Languagefr
Categorynatural_language
Queries50
Documents5,043
Positive qrels50
Positives / query avg1.00
Positives / query min1
Positives / query median1.00
Positives / query max1
Multi-positive queries0 (0.00%)
Query length avg chars44.62
Document length avg chars373.74

Candidate Subsets

ProfileConfignDCG@10Hit@10Recall@100Candidates
BM25bm250.46080.62000.8800top-500
Denseharrier_oss_v1_270m0.57480.72000.9400top-500
Reranking hybridreranking_hybrid0.53120.70000.9400top-100