NanoMTEB-v2 / hotpot_qa

Overview

NanoMTEB-v2 / hotpot_qa is a multi-hop question-to-Wikipedia retrieval task derived from HotpotQA. Queries are natural-language questions, and each query has two relevant supporting passages. The original HotpotQA benchmark was built for explainable multi-hop question answering over Wikipedia, with annotated supporting facts that require combining evidence across pages. This Nano retrieval split evaluates whether a model can retrieve those support passages from a 10,000-document candidate pool. It is useful for studying bridge-entity retrieval, multi-positive ranking, and whether first-stage systems can find both parts of a two-hop evidence chain.

Details

What the Original Data Measures

HotpotQA measures question answering that often requires linking two Wikipedia entities or facts. In the retrieval version, the answer itself is not the target; the model must retrieve the supporting passages needed to answer the question. This makes the task different from ordinary single-passage QA retrieval.

Every query in this Nano split has exactly two positives. A strong retrieval system should not only find one obvious entity page, but also recover the second supporting passage that completes the reasoning chain.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 400 positive qrel rows. Each query has exactly 2 positives, so all 200 queries are multi-positive. Queries average 95.83 characters, while documents average 421.20 characters.

The examples include questions about films, lighthouse lamps, shared ancestry, screenwriters, and actors. Many questions explicitly mention one entity and ask for a relation that requires another page or linked fact. Documents are short Wikipedia-style passages with titles.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.8950, hit@10 of 1.0000, and recall@100 of 0.9725. This is a very strong sparse profile. Many queries contain entity names, work titles, or distinctive phrases that appear in at least one supporting passage.

BM25 is especially good at finding the first hop when a bridge entity is stated in the query. Its remaining weakness is balanced multi-hop coverage: a system can get a hit by retrieving one support passage while still missing or under-ranking the second.
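
The gap between a hit and full two-hop coverage can be measured directly. The following is a minimal sketch, not the benchmark's own tooling: the function name `support_coverage_at_k` and the dictionary layouts for rankings and qrels are assumptions made for illustration. It reports the share of queries with at least one positive in the top k alongside the share with every positive in the top k.

```python
from typing import Dict, List, Set


def support_coverage_at_k(
    rankings: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> Dict[str, float]:
    """Share of queries with at least one / with all positives in the top k.

    rankings: query id -> ranked document ids (assumed layout).
    qrels:    query id -> set of positive document ids (assumed layout).
    """
    any_hit = 0
    full_hit = 0
    for qid, positives in qrels.items():
        top_k = set(rankings.get(qid, [])[:k])
        found = len(positives & top_k)
        any_hit += found >= 1                 # one supporting passage is enough
        full_hit += found == len(positives)   # both supporting passages required
    n = len(qrels)
    return {"hit": any_hit / n, "full_support": full_hit / n}


if __name__ == "__main__":
    # Toy data: q1 has both positives in the top 3, q2 has only one.
    toy_rankings = {"q1": ["a", "x", "b"], "q2": ["c", "y", "z"]}
    toy_qrels = {"q1": {"a", "b"}, "q2": {"c", "d"}}
    print(support_coverage_at_k(toy_rankings, toy_qrels, k=3))
```

On a split where every query has exactly two positives, the difference between the two values is the fraction of queries where only one hop was retrieved.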

Dense Evaluation Profile

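The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.8904, hit@10 of 0.9850, and recall@100 of 0.9700. Dense retrieval is also very strong, but it trails BM25 slightly on all three metrics in this Nano sample (by 0.0046 nDCG@10, 0.0150 hit@10, and 0.0025 recall@100). A hit@10 of 0.9850 corresponds to 3 of the 200 queries having no positive in the dense top 10. The result suggests that explicit entity overlap is highly informative here.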

Dense retrieval remains useful because some second-hop passages may be semantically connected rather than lexically obvious. However, if the query names a bridge entity directly, sparse matching can be hard to beat.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates with no safeguard positives. It reaches nDCG@10 of 0.9156, hit@10 of 1.0000, and recall@100 of 0.9975. This is the strongest of the three profiles on nDCG@10 and recall@100, and it matches BM25 on hit@10. The hybrid pool captures the exact-entity strengths of BM25 and the semantic bridge coverage of dense retrieval.

For reranking, this task is a good example of why hybrid search matters even when individual first-stage systems are strong. The combined pool nearly saturates supporting-passage coverage (recall@100 of 0.9975 corresponds to 399 of the 400 positives), giving a reranker the opportunity to place both positives early.
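
This card does not state how the reranking_hybrid pool is constructed. As an illustration only, the sketch below merges a sparse and a dense ranking with reciprocal rank fusion (RRF), a common fusion method used here as a stand-in. The function name `rrf_pool`, the constant `k=60`, and the pool size are assumptions, not the benchmark's configuration.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_pool(rankings: List[List[str]], pool_size: int = 100, k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a conventional default, assumed here.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by document id.
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:pool_size]


if __name__ == "__main__":
    bm25_ranking = ["a", "b", "c"]    # toy sparse ranking
    dense_ranking = ["c", "a", "d"]   # toy dense ranking
    print(rrf_pool([bm25_ranking, dense_ranking], pool_size=3))
```

Documents ranked well by both systems rise to the top, while documents found by only one system still enter the pool, which is the property that lets a fused pool keep both lexical entity matches and semantic bridge candidates.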

Metric Interpretation for Model Researchers

Because every query has two positives, hit@10 alone is not enough. A system can score a hit with one supporting passage while still failing to retrieve the full evidence set. nDCG@10 and recall@100 should be read together: nDCG@10 reflects whether positives are ranked near the top, while recall@100 shows whether both supporting passages are available for downstream multi-hop reasoning.
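
To make the difference between these metrics concrete, here is a minimal per-query sketch. It assumes binary relevance and the standard log2 discount for nDCG; the card does not specify the exact formulation used to produce the reported scores, so treat the function definitions as illustrative.

```python
import math
from typing import List, Set


def hit_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """1.0 if any positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def recall_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Fraction of the positives that appear in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Binary-gain nDCG with a log2 rank discount (assumed formulation)."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(len(positives), k)))
    return dcg / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy query: first positive at rank 1, second positive at rank 11.
    ranked = ["p1"] + [f"n{i}" for i in range(1, 10)] + ["p2"]
    positives = {"p1", "p2"}
    print(hit_at_k(ranked, positives, 10))
    print(recall_at_k(ranked, positives, 10))
    print(round(ndcg_at_k(ranked, positives, 10), 4))
    print(recall_at_k(ranked, positives, 100))
```

In the toy query, hit@10 is perfect even though only half of the evidence set is in the top 10, and nDCG@10 is penalized for the missing second positive.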

The near-ceiling scores mean this split is not primarily a hard candidate-generation benchmark. Its value is in checking multi-hop support coverage and rank ordering under hard negatives.

Query and Relevance Type Tendencies

Queries are English multi-hop questions, often naming one entity and asking about a related entity, location, date, role, or work. Relevant documents are short Wikipedia passages. The two positives usually correspond to the support pages required to answer the question.

The relevance relation is supporting evidence for multi-hop QA. Topical similarity is not enough unless the passage contributes to the answer chain.

Representative Failure Modes

Common failures include retrieving only one hop, over-ranking an entity page that is mentioned in the query but not sufficient for the answer, missing the bridge page, and confusing similarly named works or people. Dense systems may retrieve semantically related pages that do not complete the reasoning chain; sparse systems may over-focus on the explicitly named entity.

Training Data That May Help

Useful training data includes HotpotQA supporting-fact retrieval pairs, multi-hop Wikipedia QA data, entity-linking retrieval data, and hard negatives that mention one entity but lack the needed relation. Multi-positive training is required because both supporting passages matter.
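
One way to use both positives during training is a softmax-based contrastive loss averaged over the positives of each query. The sketch below is a plain-Python illustration under assumptions: the function name `multi_positive_nll`, the `temperature` parameter, and the single-query interface are hypothetical and not taken from any specific training recipe.

```python
import math
from typing import Sequence


def multi_positive_nll(
    scores: Sequence[float],
    positive_indices: Sequence[int],
    temperature: float = 1.0,
) -> float:
    """Average negative log-likelihood of the positives under a softmax.

    scores:           query-candidate similarity for each candidate.
    positive_indices: positions of the supporting passages in `scores`.
    """
    scaled = [score / temperature for score in scores]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    log_z = max_scaled + math.log(sum(math.exp(s - max_scaled) for s in scaled))
    log_probs = [scaled[i] - log_z for i in positive_indices]
    return -sum(log_probs) / len(log_probs)


if __name__ == "__main__":
    # Candidates 0 and 1 are the two supporting passages; 2 and 3 are negatives.
    both_high = multi_positive_nll([5.0, 5.0, 0.0, 0.0], [0, 1])
    one_high = multi_positive_nll([5.0, 0.0, 0.0, 0.0], [0, 1])
    print(both_high, one_high)
```

Scoring only one supporting passage highly yields a larger loss than scoring both highly, so the objective rewards retrieving the full evidence set. With two positives sharing one softmax, the loss is bounded below by log 2, which is reached when the positives split the probability mass evenly.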

Model Improvement Notes

Models should optimize for evidence-set retrieval, not just first-hit retrieval. Candidate generation should preserve both lexical entity matches and semantic bridge candidates. Rerankers should learn to identify complementary support passages rather than ranking several passages about the same first-hop entity.

Example Data

Query	Positive document
The Soul of Buddha is a 1918 American silent romance film shot in a borough that is the western terminus of what? [114 chars]	The Soul of Buddha The Soul of Buddha is a 1918 American silent romance film directed by J. Gordon Edwards and starring Theda Bara, who also wrote the film's story. The film was produced by Fox Film Corporation and shot at the Fox Studio in Fort Lee, New Jersey. [263 chars]
The lamp used in many lighthouses is similiar to this type of lamp patented in 1780 by Aimé Argand? [99 chars]	Lewis lamp The Lewis lamp is a type of light fixture used in lighthouses. It was invented by Winslow Lewis who patented the design in 1810. The primary marketing point of the Lewis lamp was that it used less than half the oil of the prior oil lamps which they replaced. The lamp used a similar design to an Argand lamp, adding a parabolic reflector behind the lamp and a magnifying lens made from 4 in green bottle glass in front of the lamp. A similar variant using a parabolic reflector was created by the inventor of the Argand lamp, Aimé Argand. While the Argand variant became widely used by European lighthouses, the Lewis lamp design was selected by the United States for use in American lighthouses. [708 chars]
What is the shared country of ancestry between Art Laboe and Scout Tufankjian? [78 chars]	Art Laboe Art Laboe (born Arthur Egnoian on August 7, 1925) is an Armenian American disc jockey, songwriter, record producer, and radio station owner, generally credited with coining the term "Oldies But Goodies". [214 chars]

Source Reference Table

Title	Year	Type	URL
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering	2018	source task paper	https://arxiv.org/abs/1809.09600
MTEB: Massive Text Embedding Benchmark	2023	benchmark paper	https://arxiv.org/abs/2210.07316
mteb/HotpotQA_test_top_250_only_w_correct-v2		dataset card	https://huggingface.co/datasets/mteb/HotpotQA_test_top_250_only_w_correct-v2

Dataset Information

Field	Value
Nano set	NanoMTEB-v2
Backing dataset	NanoMTEB-v2
Task / split	hotpot_qa
Hugging Face dataset	hakari-bench/NanoMTEB-v2
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	400
Positives / query avg	2.00
Positives / query min	2
Positives / query median	2.00
Positives / query max	2
Multi-positive queries	200 (100.00%)
Query length avg chars	95.83
Document length avg chars	421.20

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.8950	1.0000	0.9725	top-500
Dense	`harrier_oss_v1_270m`	0.8904	0.9850	0.9700	top-500
Reranking hybrid	`reranking_hybrid`	0.9156	1.0000	0.9975	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: MTEB HotpotQA hard-negative test split
Train/eval overlap audit: not_audited
Leakage note: exclude NanoMTEB-v2 hotpot_qa questions and supporting passages
Multi-positive training: required
Useful training data: HotpotQA supporting-fact retrieval pairs, multi-hop Wikipedia QA data, entity bridge hard negatives
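
Because the train/eval overlap has not been audited, a simple exclusion filter can be applied before training. The following is a minimal sketch, assuming training rows are dictionaries with hypothetical `question` and `passages` fields. It only catches exact matches after light normalization, so paraphrased or near-duplicate text would need a stronger check.

```python
import re
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    without_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", without_punct).strip()


def drop_leaked_rows(
    train_rows: Iterable[Dict],
    eval_questions: Iterable[str],
    eval_passages: Iterable[str],
) -> List[Dict]:
    """Keep only training rows that share no question or passage with the eval split.

    The row layout ({"question": str, "passages": [str, ...]}) is an assumption.
    """
    blocked_questions = {normalize(q) for q in eval_questions}
    blocked_passages = {normalize(p) for p in eval_passages}
    kept = []
    for row in train_rows:
        if normalize(row["question"]) in blocked_questions:
            continue  # same question as an evaluation query
        if any(normalize(p) in blocked_passages for p in row["passages"]):
            continue  # reuses an evaluation supporting passage
        kept.append(row)
    return kept


if __name__ == "__main__":
    toy_train = [
        {"question": "Who directed  Film X?", "passages": ["A passage."]},
        {"question": "Another question?", "passages": ["Film X is a film."]},
        {"question": "Safe question?", "passages": ["Unrelated passage."]},
    ]
    kept = drop_leaked_rows(toy_train, ["who directed film x"], ["Film X is a film."])
    print([row["question"] for row in kept])
```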