MNanoBEIR / NanoBEIR-it / NanoMSMARCO
Overview
NanoBEIR-it__NanoMSMARCO is the Italian NanoBEIR version of MS MARCO passage retrieval, a benchmark built from real web-search questions and answer-bearing passages. The task asks a retrieval model to rank Italian-translated passages for short Italian-translated user queries. This Nano split has 50 queries, 5,043 documents, and 50 positive qrels; unlike multi-positive evidence tasks, each query has exactly one positive passage. It is therefore a focused test of whether the retriever can identify the single passage that directly answers a compact question. The observed results show a clear dense-retrieval advantage: relative to BM25, the dense profile raises nDCG@10 from 0.3957 to 0.5087 and Recall@100 from 0.8800 to 0.9800, so semantic matching improves both top-10 quality and top-100 coverage over lexical matching alone.
Details
What the Original Data Measures
MS MARCO was introduced as a large-scale machine reading comprehension and web passage retrieval dataset built from human-generated search queries. In BEIR, the passage retrieval version is used as a zero-shot retrieval benchmark: the model receives a natural web query and must rank passages that contain an answer. The Italian NanoBEIR task preserves that intent through translated queries and translated answer passages. It is less about retrieving all evidence for a complex topic and more about matching a short information need to a concise answer-bearing paragraph.
Observed Data Profile
The task contains 50 queries and 5,043 documents. The qrels contain one positive for each query: the average, minimum, median, and maximum positives per query are all 1, and there are no multi-positive queries. Queries are short, averaging 42.04 characters, while documents average 356.59 characters. The examples cover definition questions, entity questions, entertainment questions, geography, and word-meaning queries. This profile resembles ordinary web search more than scientific or argumentative retrieval: a user asks a compact question, and the best passage often answers with wording that may not repeat the query exactly.
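The profile figures above can be recomputed from any qrels mapping and text collection. The following is a minimal sketch; the function names and the toy data are illustrative assumptions, not part of the dataset tooling.

```python
from statistics import mean, median

def qrels_profile(qrels):
    """Summarise positives per query for a {query_id: {doc_id, ...}} mapping."""
    counts = [len(doc_ids) for doc_ids in qrels.values()]
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive_queries": sum(1 for c in counts if c > 1),
    }

def avg_chars(texts):
    """Average text length in characters (Unicode code points)."""
    return mean(len(t) for t in texts)

# Hypothetical toy qrels, not taken from the dataset.
toy_qrels = {"q1": {"d3"}, "q2": {"d7"}, "q3": {"d1"}}
profile = qrels_profile(toy_qrels)
```

On a single-positive split such as this one, `avg`, `min`, `median`, and `max` all come out as 1 and `multi_positive_queries` is 0.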
BM25 Evaluation Profile
The BM25 top-500 subset reaches nDCG@10 = 0.3957, hit@10 = 0.6000, and Recall@100 = 0.8800. BM25 places the positive within the first 100 candidates for 44 of the 50 queries, but only 30 positives reach the top 10, so its top-10 ranking quality is modest. This is expected for web questions where the answer passage may use paraphrase, explanatory wording, or a different grammatical form than the query. Exact term overlap is still useful for entities, titles, and rare words, but it is not enough to consistently push the answer-bearing passage into the first few ranks.
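The paraphrase weakness described above is easy to reproduce with a small Okapi BM25 scorer. This is a sketch only: the source does not state which BM25 implementation, tokenizer, or parameters produced the reported numbers, so the whitespace tokenizer, `k1 = 1.2`, `b = 0.75`, and the invented Italian sentences are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption); a real Italian
    # pipeline would likely handle punctuation, stop words, or stemming.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Invented toy sentences, not dataset passages.
query = "cosa significa felice"
docs = [
    "essere contento vuol dire provare gioia",  # paraphrased answer, no shared term
    "felice è anche un nome proprio italiano",  # shares "felice" but does not answer
    "la capitale della francia è parigi",       # unrelated
]
scores = bm25_scores(query, docs)
```

In this toy case the paraphrased answer shares no term with the query and scores zero, while the passage that merely repeats "felice" receives a positive score.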
Dense Evaluation Profile
The dense harrier-oss-270m top-500 subset (config harrier_oss_v1_270m) reaches nDCG@10 = 0.5087, hit@10 = 0.7000, and Recall@100 = 0.9800. This is the strongest single profile for the task: relative to BM25 it gains 0.1130 nDCG@10, 0.10 hit@10, and 0.10 Recall@100, improving both ranking quality and top-100 coverage. The result indicates that embedding similarity is well matched to Italian MS MARCO-style queries: definitions, paraphrased answers, and implicit question-answer relationships benefit from semantic retrieval. For model researchers, this task is a useful diagnostic for whether a multilingual embedding model can map a short web query and an answer paragraph into a shared semantic space without relying only on repeated words.
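Dense retrieval of this kind typically ranks passages by the similarity between a query embedding and each passage embedding. The sketch below assumes cosine similarity and uses made-up 3-dimensional vectors; the source does not specify the similarity function or how the embeddings were produced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs, top_k=10):
    """Return (doc_id, score) pairs sorted by descending cosine similarity."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings; a real model would emit far more dimensions.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "answer_passage": [0.8, 0.2, 0.1],
    "related_passage": [0.4, 0.6, 0.2],
    "unrelated_passage": [0.0, 0.1, 0.9],
}
ranking = rank_by_cosine(query_vec, doc_vecs)
```

No term overlap is involved: the ordering depends only on vector geometry, which is why paraphrased answers can still rank first.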
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses 100 to 101 candidates per query and reaches nDCG@10 = 0.4781, hit@10 = 0.6800, and Recall@100 = 0.9800. One query uses the rank-101 safeguard. The hybrid subset matches dense retrieval on top-100 coverage, but its nDCG@10 is lower than that of dense retrieval alone (0.4781 versus 0.5087). Combining lexical and dense candidates therefore keeps the relevant passage available for reranking, but the fused ranking is not always as clean as the dense ordering in the first 10 positions. In this task, hybrid search is best viewed as a high-coverage candidate pool rather than the strongest first-stage ranker.
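A hybrid candidate pool of this shape can be sketched with reciprocal rank fusion (RRF). Two points are assumptions rather than facts from the source: RRF with `k = 60` is only one common fusion method, and the "rank-101 safeguard" is read here as appending a positive that fell outside the fused top 100. The function names are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def build_pool(bm25_ranking, dense_ranking, positive_id, pool_size=100):
    """Keep the fused top `pool_size`; append the positive if it is missing."""
    pool = reciprocal_rank_fusion([bm25_ranking, dense_ranking])[:pool_size]
    if positive_id not in pool:
        pool.append(positive_id)  # assumed safeguard: pool grows to pool_size + 1
    return pool

# Toy rankings with a small pool size so the safeguard triggers.
bm25_ranking = ["d2", "d5", "d1", "d9"]
dense_ranking = ["d1", "d2", "d7", "d5"]
pool = build_pool(bm25_ranking, dense_ranking, positive_id="d7", pool_size=3)
```

Under this reading, a query whose positive misses the fused cut-off ends up with one extra candidate, which would be consistent with pools of 100 to 101 candidates and a Recall@100 just below 1.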
Metric Interpretation for Model Researchers
The main pattern is dense dominance. BM25 has acceptable Recall@100, so lexical signals are not irrelevant, but the modest nDCG@10 (0.3957) shows that exact matching often places answer passages below distractors. Dense retrieval improves both top-10 hit rate and rank ordering, which is the most important behavior for single-positive web passage retrieval. The hybrid profile is still valuable because it preserves dense-level Recall@100 while adding lexical coverage, but it does not beat dense ranking at nDCG@10 on this split. A strong reranker trained for this task should therefore use the hybrid pool for coverage and then restore the semantic answer match near the top.
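With exactly one positive per query, all three reported metrics reduce to functions of the rank at which that positive appears. The sketch below assumes binary gain and the standard log2 discount for nDCG; the source does not state the exact nDCG variant, and the ranks are hypothetical.

```python
import math

def ndcg_at_k_single(rank, k=10):
    """nDCG@k when exactly one relevant document exists.

    `rank` is the 1-based position of the positive, or None if it was not
    retrieved. The ideal DCG is 1 (positive at rank 1), so nDCG equals DCG.
    """
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def evaluate(ranks):
    """Average nDCG@10, hit@10, and Recall@100 over per-query positive ranks."""
    n = len(ranks)
    return {
        "ndcg@10": sum(ndcg_at_k_single(r, 10) for r in ranks) / n,
        "hit@10": sum(1 for r in ranks if r is not None and r <= 10) / n,
        "recall@100": sum(1 for r in ranks if r is not None and r <= 100) / n,
    }

# Hypothetical positive ranks for four queries (None = not retrieved).
metrics = evaluate([1, 3, 42, None])
```

In this setting Recall@100 is simply the fraction of queries whose positive lands in the top 100, and hit@10 is the same quantity at cut-off 10.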
Query and Relevance Type Tendencies
The sample questions are concise and answer-seeking: "what is" questions, "who sang" questions, actor-role questions, location questions, and lexical meaning questions. The positive documents usually contain direct explanations, short encyclopedia-like statements, or snippets from web-style answer pages. Relevance is often determined by whether the passage answers the exact information need, not merely whether it discusses the same entity. This makes near-topic distractors a serious source of error.
Representative Failure Modes
BM25 can miss the best passage when the query asks a short question and the answer is phrased as an explanatory sentence rather than a term repetition. Dense retrieval can retrieve semantically close but non-answering passages, especially for broad entities or common concepts. Hybrid search can include the right passage but still rank a lexical distractor above it if the distractor shares more surface terms with the query. Because each query has only one positive, a single rank swap has a large effect on nDCG@10.
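The size of the rank-swap effect can be worked out directly. Assuming binary gain and the standard log2 discount (the source does not state the nDCG variant), a query whose single positive sits at rank r contributes:

```latex
\mathrm{nDCG@10}(r) =
\begin{cases}
  \dfrac{1}{\log_2(1 + r)} & \text{if } r \le 10, \\[6pt]
  0 & \text{otherwise,}
\end{cases}
\qquad
\mathrm{nDCG@10}(1) = 1, \quad
\mathrm{nDCG@10}(2) = \frac{1}{\log_2 3} \approx 0.631, \quad
\mathrm{nDCG@10}(3) = \frac{1}{\log_2 4} = 0.5 .

% Effect on the 50-query mean when one positive slips from rank 1 to rank 2:
\Delta = \frac{1 - 0.631}{50} \approx 0.0074 .
```

One distractor overtaking the positive at the top therefore removes about 37% of that query's score, and roughly 0.007 from the mean over 50 queries.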
Training Data That May Help
Useful training data includes non-overlapping web QA retrieval pairs, Italian search-query logs, multilingual passage retrieval data, and answer-bearing question-passage pairs. Hard negatives should include passages that share the entity or topic but do not answer the question. Training and tuning should avoid overlap with MS MARCO, BEIR, NanoBEIR, and translated passages from this benchmark.
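One way to obtain such hard negatives is to take the highest-ranked non-positive passages from a first-stage retriever, and to filter training pairs against the benchmark queries. The sketch below is an assumption about how this could be done, with hypothetical function names; exact-match filtering is only a first pass and would not catch paraphrased or translated overlap.

```python
def mine_hard_negatives(ranked_doc_ids, positive_ids, num_negatives=4, skip_top=0):
    """Pick the highest-ranked non-positive passages as hard negatives."""
    negatives = []
    for doc_id in ranked_doc_ids[skip_top:]:
        if doc_id in positive_ids:
            continue
        negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives

def drop_benchmark_overlap(pairs, benchmark_queries):
    """Remove (query, passage) pairs whose normalised query is in the benchmark."""
    blocked = {q.strip().lower() for q in benchmark_queries}
    return [(q, p) for q, p in pairs if q.strip().lower() not in blocked]

# Toy ranking and pairs, invented for illustration.
hard_negatives = mine_hard_negatives(["d4", "d1", "d8", "d3"], {"d1"}, num_negatives=2)
clean_pairs = drop_benchmark_overlap(
    [("Cos'è la fotosintesi?", "p1"), ("Chi ha scritto il Decameron?", "p2")],
    ["cos'è la fotosintesi?"],
)
```

Top-ranked non-positives can include unlabeled passages that do answer the question, so `skip_top` or a manual check may be worth considering before using them as negatives.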
Model Improvement Notes
This task rewards semantic answer matching more than broad topical similarity. Improvements should prioritize concise query understanding, paraphrase handling, and hard-negative separation between answer-bearing and merely related passages. A practical system can use hybrid candidate generation for coverage, but the final ranking model needs a strong answer-selection signal to beat the dense baseline at the top of the list.
Example Data
| Query | Positive document |
| --- | --- |
| Cos'è la sindrome da ruminazione? [33 chars] | Sindrome da Ruminazione. La sindrome da ruminazione, nota anche come mericismo, è un tipo di disturbo alimentare non altrimenti specificato che provoca la rigurgitazione del cibo. Sebbene non sia identificata come un disturbo alimentare specifico nel DSM-IV, sono stati definiti alcuni parametri per diagnosticare il disturbo. [326 chars] |
| Chi ha cantato "Ecco che vado di nuovo"? [40 chars] | Per altri significati, vedi Here I Go Again (disambiguazione). Here I Go Again è una canzone del gruppo rock britannico Whitesnake. Pubblicata originariamente nell'album del 1982 Saints & Sinners, la canzone è stata registrata nuovamente per l'album omonimo del 1987 Whitesnake. Quell'anno è stata registrata nuovamente in una versione radio-mix. [346 chars] |
| Chi interpreta Cameron Boyce in "Liv e Maddie"? [47 chars] | Preparatevi a sbellicarvi dalle risate, ragazzi. In un'anteprima esclusiva dell'episodio del 19 aprile di 'Liv & Maddie' intitolato 'Prom-A-Rooney.' Ovviamente. Nel divertentissimo clip, vediamo Cameron Boyce, il protagonista di 'Jessie,' fare un salto in un altro show Disney per incontrare Maddie (Shelby Wulfert). Il suo personaggio è, beh, eccentrico! [355 chars] |
Source Reference Table
| Title | Year | Type | URL |
| --- | --- | --- | --- |
| MS MARCO: A Human Generated Machine Reading Comprehension Dataset | 2016 | task paper | https://arxiv.org/abs/1611.09268 |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | benchmark paper | https://arxiv.org/abs/2104.08663 |
| MMTEB: Massive Multilingual Text Embedding Benchmark | 2025 | benchmark paper | https://arxiv.org/abs/2502.13595 |
| NanoBEIR: Smaller BEIR dataset subsets | 2024 | dataset collection | https://huggingface.co/collections/zeta-alpha-ai/nanobeir |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | MNanoBEIR |
| Backing dataset | NanoBEIR-it |
| Task / split | NanoMSMARCO |
| Hugging Face dataset | hakari-bench/NanoBEIR-it |
| Language | it |
| Category | natural_language |
| Queries | 50 |
| Documents | 5,043 |
| Positive qrels | 50 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 42.04 |
| Document length avg chars | 356.59 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.3957 | 0.6000 | 0.8800 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5087 | 0.7000 | 0.9800 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4781 | 0.6800 | 0.9800 | top-100 (101 for one query) |
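Because each query has exactly one positive, the Hit@10 and Recall@100 rates in the table translate directly into query counts out of 50. The short sketch below does that conversion; the dictionary layout and function name are illustrative assumptions.

```python
QUERIES = 50  # from the Dataset Information table

# Rates copied from the Candidate Subsets table.
profiles = {
    "bm25": {"hit@10": 0.6000, "recall@100": 0.8800},
    "harrier_oss_v1_270m": {"hit@10": 0.7000, "recall@100": 0.9800},
    "reranking_hybrid": {"hit@10": 0.6800, "recall@100": 0.9800},
}

def to_query_counts(rates, n_queries=QUERIES):
    # With one positive per query, each rate is a fraction of queries.
    return {name: round(rate * n_queries) for name, rate in rates.items()}

counts = {name: to_query_counts(rates) for name, rates in profiles.items()}
```

Read this way, BM25 places 30 positives in the top 10 and 44 in the top 100, the dense profile 35 and 49, and the hybrid profile 34 and 49.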