NanoMTEB-v2 / fever

Overview

NanoMTEB-v2 / fever is a claim-to-evidence retrieval task derived from FEVER. Queries are short factual claims, and relevant documents are Wikipedia passages that provide evidence for or against those claims. The original FEVER benchmark was introduced for fact extraction and verification: a system must retrieve evidence and then decide whether a claim is supported, refuted, or unverifiable. This Nano retrieval split focuses on the evidence-retrieval step using 200 claims over 10,000 candidate passages. It is a comparatively lexical, entity-centered fact-checking task where claims often contain named entities that also appear in the correct evidence passage.

Details

What the Original Data Measures

FEVER measures whether systems can retrieve and reason over Wikipedia evidence for factual claims. In the retrieval conversion, the model is not asked to output the final verification label; it must find the passages that contain the relevant evidence. A relevant passage may support or refute the claim, but it must be evidentially connected to the specific entity and relation in the query.

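Because the evaluation pool comes from the MTEB hard-negative version of FEVER, candidates include plausible but non-relevant Wikipedia passages rather than only random negatives. This makes the task more informative for retrieval evaluation, since a model must separate true evidence from near-miss passages.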

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 229 positive qrel rows. Each query has 1.145 positives on average, with a median of 1 and a maximum of 4. There are 25 multi-positive queries, or 12.5% of the query set. Queries average 50.56 characters, while documents average 565.98 characters.
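
The profile figures above follow from simple counts over the positive qrel rows: 229 / 200 = 1.145 positives per query, and 25 / 200 = 12.5% multi-positive queries. The sketch below shows one way such a profile could be computed. It assumes qrels are available as `(query_id, doc_id)` pairs; the function name `qrel_profile` and the toy IDs are illustrative and not part of the dataset.

```python
from collections import Counter
from statistics import mean, median

def qrel_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) pairs (assumed layout)."""
    per_query = Counter(query_id for query_id, _ in qrels)
    counts = list(per_query.values())
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_rows": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_rate": multi / len(counts),
    }

# Toy qrels with made-up IDs: q2 has two positives, the others have one.
toy_qrels = [("q1", "d1"), ("q2", "d2"), ("q2", "d3"), ("q3", "d4")]
profile = qrel_profile(toy_qrels)
print(profile)
```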

The examples include claims about films, locations, companies, actors, and historical figures. Documents usually begin with a Wikipedia title followed by a passage, so entity names and aliases are strong retrieval anchors.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.8893, hit@10 of 0.9950, and recall@100 of 0.9869. This is a very strong sparse profile. The claims often include distinctive named entities, and the evidence passages repeat those entities in titles or opening sentences.

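BM25's remaining difficulty is exact evidence selection: the gap between hit@10 (0.9950) and nDCG@10 (0.8893) shows that the evidence passage is almost always in the top 10 but is not always ranked first. A passage about the correct entity may not contain the specific fact needed for the claim, and a claim can be false while still sharing many words with a related article. Even so, with recall@100 of 0.9869, sparse lexical retrieval is already near ceiling as a candidate generator on this Nano split.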
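
To make the entity-overlap effect concrete, here is a minimal Okapi BM25 sketch over a toy corpus. It is not the configuration behind the reported numbers: the tokenizer, the parameters `k1=1.5` and `b=0.75`, the IDF variant, and the abridged toy passages are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercased word tokens; an assumed, deliberately simple tokenizer.
    return re.findall(r"\w+", text.lower())

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs (doc_id -> text) for a query with Okapi BM25 (assumed parameters)."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized.values() for term in set(toks))

    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Abridged, illustrative passages (not verbatim dataset rows).
toy_docs = {
    "d1": "Salt River Valley is a valley on the Salt River in central Arizona",
    "d2": "The Mississippi River flows through the central United States",
    "d3": "Sky UK is a British telecommunications company",
}
ranking = bm25_rank("Salt River Valley is on the Mississippi River.", toy_docs)
print(ranking)
```

In this toy case the passage about the claimed entity ranks first even though the claim is false, because the entity terms dominate the score. That is the behavior a claim-to-evidence retriever needs here, since refuting evidence also counts as relevant.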

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.9652, hit@10 of 0.9800, and recall@100 of 0.9738. Dense retrieval has the best nDCG@10 of the three profiles, although its hit@10 (0.9800 versus 0.9950, or 196 versus 199 of the 200 queries) and recall@100 (0.9738 versus 0.9869) are slightly below BM25. This suggests that dense representations rank the correct evidence passage higher when they retrieve it, while BM25 has slightly broader coverage.

The dense advantage at the top rank likely comes from modeling claim semantics beyond entity names. It can prefer passages that match the relation or role in the claim rather than passages that only repeat the entity.
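
As a rough picture of how a dense profile orders candidates, the sketch below ranks documents by cosine similarity between a query embedding and document embeddings. The two-dimensional vectors are made-up placeholders, not outputs of harrier_oss_v1_270m, and the scoring function used by that model's pipeline is an assumption here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs, top_k=500):
    """Return doc IDs sorted by descending cosine similarity to the query."""
    scored = sorted(
        doc_vecs.items(),
        key=lambda item: (-cosine(query_vec, item[1]), item[0]),
    )
    return [doc_id for doc_id, _ in scored[:top_k]]

# Placeholder embeddings; a real encoder would produce these vectors.
toy_query_vec = [1.0, 0.0]
toy_doc_vecs = {"d1": [0.9, 0.1], "d2": [0.1, 0.9], "d3": [0.5, 0.5]}
dense_ranking = rank_by_cosine(toy_query_vec, toy_doc_vecs)
print(dense_ranking)
```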

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates with no safeguard positives. It reaches nDCG@10 of 0.9450, hit@10 of 0.9950, and recall@100 of 0.9869. Because the pool holds only 100 candidates, recall@100 here is the recall of the entire pool. The hybrid pool matches BM25's coverage and improves on BM25's nDCG@10 (0.9450 versus 0.8893), though dense retrieval remains strongest by nDCG@10 (0.9652).

This pattern is useful for reranking: BM25 supplies excellent entity coverage, dense retrieval improves semantic ordering, and their hybrid gives a reliable candidate set for a reranker that can judge evidential relevance.
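
This card does not state how the hybrid pool is fused. One common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF), sketched below purely as an assumption. The constant `k=60`, the function name `rrf_fuse`, and the toy rankings are illustrative, not the method used to build `reranking_hybrid`.

```python
def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse ranked doc-ID lists with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists that contain it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]

# Toy rankings with made-up IDs.
toy_bm25 = ["d1", "d2", "d3"]
toy_dense = ["d3", "d1", "d4"]
fused = rrf_fuse([toy_bm25, toy_dense], top_n=3)
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever still enter the pool, which is the coverage-plus-ordering behavior described above.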

Metric Interpretation for Model Researchers

Scores are high, so this task is partly a sanity check for English Wikipedia evidence retrieval. A weak model may fail to connect claims and titles, but strong systems should approach ceiling. Small differences in nDCG@10 still matter because they show whether a model ranks the exact evidence passage first rather than merely retrieving it somewhere in the candidate list.
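
For reference, the three reported metrics can be sketched with the standard binary-relevance definitions below. The exact evaluation code behind the numbers in this card is not specified, so treat the function names and tie handling as assumptions.

```python
import math

def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    ideal_hits = min(len(positives), k)
    if ideal_hits == 0:
        return 0.0
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg

def hit_at_k(ranked_ids, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked_ids[:k]))

def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & set(positives)) / len(positives)

# Toy query with one positive ranked second (made-up IDs).
toy_ranking = ["d7", "d2", "d9"]
toy_positives = {"d2"}
toy_ndcg = ndcg_at_k(toy_ranking, toy_positives)
toy_hit = hit_at_k(toy_ranking, toy_positives)
toy_recall = recall_at_k(toy_ranking, toy_positives)
print(toy_ndcg, toy_hit, toy_recall)
```

With a single positive, nDCG@10 reduces to 1 / log2(rank + 1), so moving the evidence from rank 1 to rank 2 drops the per-query score from 1.0 to about 0.63 while hit@10 and recall@100 stay at 1.0. This is why nDCG@10 separates the profiles more than the coverage metrics do.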

Only 12.5% of queries have more than one positive, so most claims are effectively single-evidence retrieval cases. Exact top-rank placement is therefore important.

Query and Relevance Type Tendencies

Queries are short declarative factual claims. Relevant documents are Wikipedia passages about the claim's entities, works, places, or historical figures. Many queries require matching a named entity and a specific predicate such as award count, location, occupation, company type, or cause of death.

The relevance relation is evidential, not just topical. A page about the entity may still be wrong if it lacks evidence for the claim.

Representative Failure Modes

Common failures include retrieving the correct entity page but the wrong passage, matching an entity without matching the predicate, confusing related works or people, and ignoring negation or numeric constraints. Dense models may occasionally prefer semantically broad passages, while sparse systems may over-rank pages with heavy entity-term overlap.

Training Data That May Help

Useful training data includes FEVER claim-evidence pairs, Wikipedia entity retrieval data, fact-checking evidence datasets, and hard negatives with the same named entities but different predicates. Training examples from this evaluation split should be excluded.
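
A minimal sketch of the exclusion step, assuming training data is a list of `(claim, evidence_doc_id)` pairs and that evaluation claims and evidence document IDs are available as plain collections. The normalization rule and all names here are hypothetical; a real audit may need fuzzier matching.

```python
import re

def normalize_claim(text):
    # Lowercase, drop punctuation, and collapse whitespace so trivial variants match.
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def drop_eval_overlap(train_pairs, eval_claims, eval_doc_ids=frozenset()):
    """Remove training pairs whose claim or evidence document appears in the eval split."""
    blocked_claims = {normalize_claim(claim) for claim in eval_claims}
    return [
        (claim, doc_id)
        for claim, doc_id in train_pairs
        if normalize_claim(claim) not in blocked_claims
        and doc_id not in eval_doc_ids
    ]

# The first eval claim is taken from the example data; training rows and IDs are made up.
toy_eval_claims = ["Salt River Valley is on the Mississippi River."]
toy_train_pairs = [
    ("salt river valley is on the  Mississippi River", "wiki_a"),
    ("A hypothetical training claim.", "wiki_b"),
    ("Another hypothetical claim.", "wiki_c"),
]
kept = drop_eval_overlap(toy_train_pairs, toy_eval_claims, eval_doc_ids={"wiki_c"})
print(kept)
```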

Model Improvement Notes

The most important improvements are exact evidence discrimination and relation-aware ranking. Candidate generation is already strong, so rerankers should focus on whether the passage actually supports or refutes the claim. Hard negatives should share the same entity while changing relation, date, role, award, location, or membership facts.
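
One simple way to collect same-entity hard negatives is to scan a ranked candidate list for non-positive passages that mention the claim's entity. The sketch below assumes the entity string is already known and that candidates arrive as `(doc_id, text)` pairs; the function name and toy passages are illustrative only.

```python
def mine_hard_negatives(entity, positives, candidates, limit=5):
    """Pick non-positive candidates whose text mentions the entity (case-insensitive)."""
    needle = entity.lower()
    picked = []
    for doc_id, text in candidates:
        if len(picked) >= limit:
            break
        if doc_id in positives:
            continue  # never use labeled evidence as a negative
        if needle in text.lower():
            picked.append(doc_id)
    return picked

# Made-up candidate passages, ordered as a retriever might return them.
toy_candidates = [
    ("d1", "Salt River Valley The Salt River Valley is an extensive valley in Arizona"),
    ("d2", "A placeholder passage about a dam on the Salt River"),
    ("d3", "A placeholder passage about the Mississippi River"),
]
hard_negatives = mine_hard_negatives("Salt River", {"d1"}, toy_candidates)
print(hard_negatives)
```

Unlabeled passages that mention the entity can still contain valid evidence, so negatives mined this way may include false negatives and are worth spot-checking before training.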

Example Data

Query	Positive document
One Flew Over the Cuckoo's Nest only won one Academy Award. [59 chars]	One Flew Over the Cuckoo's Nest (film) One Flew Over the Cuckoo 's Nest is a 1975 American comedy-drama film directed by Miloš Forman , based on the 1962 novel One Flew Over the Cuckoo 's Nest by Ken Kesey . The film stars Jack Nicholson and features a supporting cast of Louise Fletcher , William Redfield , Will Sampson , and Brad Dourif . Considered to be one of the greatest films ever made , One Flew Over the Cuckoo 's Nest is No. 33 on the American Film Institute 's 100 Years ... 100 Movies list . The film was the second to win all five major Academy Awards ( Best Picture , Actor in Lead Role , Actress in Lead Role , Director , and Screenplay ) following It Happened One Night in 1934 , an accomplishment not repeated until 1991 by The Silence of the Lambs . It also won numerous Golden Globe and BAFTA Awards . In 1993 , the film was deemed `` culturally , historically , or aesthetically significant '' by the United States Library of Congress and selected for preservation in the Nation... [1,000 / 1,023 chars]
Salt River Valley is on the Mississippi River. [46 chars]	Salt River Valley The Salt River Valley is an extensive valley on the Salt River in central Arizona , which contains the Phoenix Metropolitan Area . Although this geographic term still identifies the area , the name `` Valley of the Sun '' popularly replaced the usage starting in the early 1930s for purposes of boosterism . A common dust for testing air filter efficiency was derived from top soil dust from the Salt River Valley , referred to as Arizona Dust . The dust was found to include small abrasive particles . [525 chars]
Sky UK is a British telecommunications company. [47 chars]	United Kingdom The United Kingdom of Great Britain and Northern Ireland , commonly known as the United Kingdom ( UK ) or Britain , is a sovereign country in western Europe . Lying off the north-western coast of the European mainland , the United Kingdom includes the island of Great Britain , the north-eastern part of the island of Ireland and many smaller islands . Northern Ireland is the only part of the United Kingdom that shares a land border with another sovereign statethe Republic of Ireland.Although Northern Ireland is the only part of the UK that shares a land border with another sovereign state , two of its Overseas Territories also share land borders with other sovereign countries . Gibraltar shares a border with Spain , while the Sovereign Base Areas of Akrotiri and Dhekelia share borders with the Republic of Cyprus , the Turkish Republic of Northern Cyprus and the UN buffer zone separating the two Cypriot polities . Apart from this land border , the United Kingdom is surroun... [1,000 / 5,000 chars]

Source Reference Table

Title	Year	Type	URL
FEVER: a Large-scale Dataset for Fact Extraction and VERification	2018	source task paper	https://arxiv.org/abs/1803.05355
MTEB: Massive Text Embedding Benchmark	2023	benchmark paper	https://arxiv.org/abs/2210.07316
mteb/FEVER_test_top_250_only_w_correct-v2		dataset card	https://huggingface.co/datasets/mteb/FEVER_test_top_250_only_w_correct-v2

Dataset Information

Field	Value
Nano set	NanoMTEB-v2
Backing dataset	NanoMTEB-v2
Task / split	fever
Hugging Face dataset	hakari-bench/NanoMTEB-v2
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	229
Positives / query avg	1.145
Positives / query min	1
Positives / query median	1.00
Positives / query max	4
Multi-positive queries	25 (12.50%)
Query length avg chars	50.56
Document length avg chars	565.98

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.8893	0.9950	0.9869	top-500
Dense	`harrier_oss_v1_270m`	0.9652	0.9800	0.9738	top-500
Reranking hybrid	`reranking_hybrid`	0.9450	0.9950	0.9869	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: MTEB FEVER hard-negative test split
Train/eval overlap audit: not_audited
Leakage note: exclude NanoMTEB-v2 fever claims and evidence pages
Multi-positive training: optional
Useful training data: FEVER claim-evidence pairs, Wikipedia evidence retrieval data, fact-checking hard negatives