NanoMMTEB-v2 / hagrid

Overview

NanoMMTEB-v2 / hagrid is an English information-seeking retrieval task from HAGRID. Queries are short fact-seeking questions, and documents are concise answer passages with citation-style markers. The Nano split has 200 queries, 493 documents, and 200 positive qrel rows, with exactly one positive document per query. Current diagnostics show all retrieval profiles as very strong, with BM25 best on nDCG@10 and hit@10, reranking_hybrid best on recall@100, and dense retrieval close behind. The task is mostly about exact attributable evidence selection among compact factual snippets.

Details

What the Original Data Measures

HAGRID is a human-LLM collaborative dataset for generative information seeking with attribution. It builds on information-seeking questions and relevant passages, then adds generated answers and human judgments of informativeness and attribution. In this retrieval version, systems retrieve passages that can support a cited answer.

The task measures whether a retriever can find direct factual evidence for a question. The positive passage should explicitly support the answer, not merely mention the same entity.

Observed Data Profile

The Nano split contains 200 queries, 493 documents, and 200 positive qrel rows. Every query has exactly one positive document. Queries average 38.36 characters, while documents average 229.57 characters.

Queries are short English fact questions. Documents are compact answer snippets, often with citation markers such as bracketed source numbers. Examples ask about the Australian Football League, Nausicaa of the Valley of the Wind, Abi Branning, Loretta Lynn, and the Cincinnati Bengals.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset contains all 493 documents per query and achieves nDCG@10 = 0.9814, hit@10 = 0.9950, and recall@100 = 0.9950. BM25 is the strongest top-rank profile.

This reflects the factual QA nature of the task. Queries and positive passages often share core entity names and answer attributes. Exact lexical matching is therefore highly effective, and the remaining challenge is distinguishing near-duplicate snippets about the same entity or adjacent fact.

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains all 493 documents per query and achieves nDCG@10 = 0.9570, hit@10 = 0.9650, and recall@100 = 0.9800. Dense retrieval is also very strong, but slightly below BM25.

Dense similarity helps match question intent and paraphrased evidence, but it can blur closely related facts. For example, passages about the same person, team, film, or television character may be semantically close while only one contains the requested answer.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains 100 candidates per query and achieves nDCG@10 = 0.9639, hit@10 = 0.9800, and recall@100 = 1.0000. Hybrid retrieval has perfect recall@100 and sits between BM25 and dense retrieval for top-rank quality.

The profile suggests that hybrid retrieval is an excellent candidate generator: it keeps every positive available for reranking. However, BM25 alone remains the best observed final ranker because exact entity and attribute overlap is so strong.

Metric Interpretation for Model Researchers

This task is single-positive: each question has one annotated supporting passage. Hit@10 measures whether that passage appears near the top. nDCG@10 is sensitive to its exact rank, and recall@100 measures whether it remains available to a reranker.

Because the corpus is small and the passages are short, high scores should be expected from competent lexical and dense retrievers. The most useful signal is whether a model can avoid confusing same-entity passages that answer different attributes.

Query and Relevance Type Tendencies

Queries are short English questions about entities, dates, counts, places, adaptations, status, or biographical facts. Relevant documents are concise answer passages that explicitly state the requested fact and often include citation-like markers.

The task rewards entity recognition, attribute matching, and exact answer support. It does not require long-context retrieval, but it does require distinguishing the cited supporting snippet from nearby factual snippets.

Representative Failure Modes

BM25 can fail when several snippets repeat the same entity name but answer different questions. Dense retrieval can fail by selecting a semantically related snippet about the same entity without the requested attribute. Hybrid retrieval can preserve the correct passage in the candidate pool while ranking a nearby same-entity answer above it.

Rerankers should verify answer support directly: the passage should contain the number, place, date, adaptation source, or yes-no status asked by the query.

Training Data That May Help

Useful training data includes open-domain QA retrieval pairs, attributable answer support selection data, quote retrieval, non-overlapping MIRACL or Wikipedia question-passage pairs, and same-entity factual hard negatives. The Nano split's questions, qrels, and cited answer passages should be excluded from training.

Synthetic data can generate short factual questions and concise answer-bearing passages with citation-like markers. Negatives should mention the same entity but answer a different attribute, date, location, count, or status. Positives should explicitly support the answer.

Model Improvement Notes

Sparse systems should preserve entity and attribute terms. Dense retrievers should strengthen fine-grained factual discrimination within the same entity cluster. Rerankers should compare the question's requested slot against the candidate passage.

For hybrid systems, NanoMMTEB-v2 / hagrid is a near-ceiling candidate generation task. reranking_hybrid gives perfect recall@100, so the remaining work is precise top-rank ordering among compact factual snippets.

Example Data

Query	Positive document
How many clubs are in the Australian Football League? [53 chars]	The Australian Football League consists of 18 clubs [1][2] [58 chars]
What was the film Nausicaä of the Valley of the Wind adapted from? [66 chars]	Nausicaä of the Valley of the Wind was adapted from the manga series of the same name written and illustrated by Hayao Miyazaki. [4] [132 chars]
Is Abi Branning still a character on EastEnders? [48 chars]	No, Abi Branning is not a regular character on EastEnders as she was killed off in January 2018 after falling from the roof of The Queen Victoria pub [2]. However, she did make a few guest appearances in 2018, with one being in July and another being over the Christmas period [1]. [281 chars]

Source Reference Table

Title	Year	Type	URL
HAGRID: A Human-LLM Collaborative Dataset for Generative Information-Seeking with Attribution	2023	task paper	https://arxiv.org/abs/2307.16883
HAGRID GitHub repository	2023	project page	https://github.com/project-miracl/hagrid
mteb/HagridRetrieval	2024	dataset card	https://huggingface.co/datasets/mteb/HagridRetrieval

Dataset Information

Field	Value
Nano set	NanoMMTEB-v2
Backing dataset	NanoMMTEB-v2
Task / split	hagrid
Hugging Face dataset	hakari-bench/NanoMMTEB-v2
Language	en
Category	natural_language
Queries	200
Documents	493
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	38.36
Document length avg chars	229.57

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.9814	0.9950	0.9950	top-500
Dense	`harrier_oss_v1_270m`	0.9570	0.9650	0.9800	top-500
Reranking hybrid	`reranking_hybrid`	0.9639	0.9800	1.0000	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: dev
Train/eval overlap audit: not_audited
Leakage note: do not train on this Nano split's questions, qrels, or cited answer passages
Multi-positive training: single_positive_question_document_focus
Useful training data: open-domain QA retrieval pairs, attributable answer support selection data, non-overlapping MIRACL or Wikipedia question-passage pairs, same-entity factual hard negatives