HAKARI-Bench

NanoMTEB-Dutch / fever

Overview

fever is the Dutch FEVER retrieval task from BEIR-NL. Queries are Dutch translations of factual claims, and documents are Dutch-translated Wikipedia evidence passages. The Nano split contains 200 queries, 10,000 documents, and 233 positive qrel rows. Most queries have one positive, but 27 queries have multiple positives, and the maximum number of positives for one query is six. It evaluates whether a retrieval model can find evidence-bearing passages for fact verification.

Retrieval scores are very high across the BM25, dense, and reranking_hybrid candidate sources. BM25 has a marginally higher nDCG@10 (0.9221, against 0.9215 for hybrid and 0.9207 for dense), while the hybrid pool has the highest recall@100 (0.9785) and matches BM25's hit@10 (0.9800). Dense retrieval is also strong but trails on hit@10 and recall@100. The task is useful for studying evidence retrieval where entity overlap is powerful but not sufficient: a model must retrieve passages that support or refute a claim, not merely pages that mention the same entity.

Details

What the Original Data Measures

The paper "FEVER: a Large-scale Dataset for Fact Extraction and Verification" introduced FEVER as a fact-verification dataset. Its claims were generated by altering Wikipedia sentences and were then verified against Wikipedia evidence. Claims are labeled as supported, refuted, or not enough information, and the evidence often comes from one or more Wikipedia sentences. In BEIR, FEVER is used as a retrieval task: the claim is the query and the model retrieves evidence-bearing Wikipedia passages.

BEIR-NL translates public BEIR datasets into Dutch. This split should therefore be read as Dutch-translated FEVER evidence retrieval, not as a natively written Dutch fact-checking corpus. Translation preserves much of the entity structure, while changing the language surface around claims and evidence passages.

Observed Data Profile

The split has 200 claims and 10,000 documents. There are 233 positive qrel rows, with an average of 1.165 positives per query. The median is one positive, but 13.5% of queries have more than one positive. Queries average 54.87 characters, while documents average 445.71 characters and usually contain a Wikipedia page title plus a short explanatory passage.
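
The profile figures above follow directly from the qrel rows: 233 / 200 = 1.165 positives per query and 27 / 200 = 13.5% multi-positive queries. A minimal sketch of that bookkeeping is shown below. The function name `qrel_profile` and the `(query_id, doc_id)` pair layout are assumptions for illustration, not the benchmark's actual loader or schema.

```python
from collections import Counter
from statistics import median


def qrel_profile(qrels):
    """Summarize positive qrels given as unique (query_id, doc_id) pairs.

    The pair layout is a hypothetical input format chosen for this sketch.
    """
    per_query = Counter(query_id for query_id, _ in qrels)
    counts = list(per_query.values())
    n_queries = len(counts)
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": n_queries,
        "positive_rows": sum(counts),
        "avg_positives": sum(counts) / n_queries,
        "median_positives": median(counts),
        "max_positives": max(counts),
        "multi_positive_share": multi / n_queries,
    }
```

Run on the real qrels, the output should reproduce the values reported in this section if the rows are unique pairs.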

Representative claims involve the Duke of York, Burbank, Adobe Photoshop, Joseph Merrick, and Vic Mensa. Some claims are supported by the positive passage, while others are contradicted by it. The retrieval task is not to predict the label directly; it is to retrieve the evidence needed for verification.

BM25 Evaluation Profile

BM25 reaches nDCG@10 = 0.9221, hit@10 = 0.9800, and recall@100 = 0.9700 over top-500 candidate lists. This is a very strong lexical baseline. FEVER claims often contain the named entity that identifies the relevant Wikipedia page, and the evidence passage repeats key names, dates, titles, or relations.

The remaining difficulty is evidence specificity. A lexically matching page can still be insufficient if the claim asks for a particular date, role, relation, or negated fact. BM25 can retrieve the right article title but may not always place the most useful evidence passage first when several passages share the same entity.

Dense Evaluation Profile

Dense retrieval with harrier_oss_v1_270m reaches nDCG@10 = 0.9207, hit@10 = 0.9400, and recall@100 = 0.9313. Its nDCG@10 is nearly tied with BM25 and hybrid retrieval, but its hit@10 is 4 points lower and its recall@100 is roughly 4 to 5 points lower. This suggests that the dense model captures claim-evidence meaning well, but exact entity and date matching remain critical in this FEVER-style corpus.

Dense retrieval is most valuable when the evidence relation is paraphrased or when the claim's wording differs from the passage. Its likely failure mode is semantic overgeneralization: retrieving a passage about the right entity but not the specific fact needed to verify the claim.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate column reaches nDCG@10 = 0.9215, hit@10 = 0.9800, and recall@100 = 0.9785, with exactly 100 candidates per query and no rank-101 safeguard rows. It has the best recall@100 and matches BM25's hit@10, while BM25 has a tiny nDCG@10 advantage. The hybrid profile is therefore the best candidate source for reranking, especially because some queries have multiple positives.

Hybrid search works well here because sparse and dense signals reinforce each other. BM25 captures exact entity and phrase overlap, while dense retrieval can add semantically related evidence passages. A reranker can then focus on which passage actually supports or refutes the claim.
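
This page does not state how the reranking_hybrid pool is built. As an illustration of how sparse and dense rankings can reinforce each other, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique that is swapped in here as an assumption. The function name `rrf_fuse`, the constant `k=60`, and the `top_n=100` cutoff are illustrative choices, not the benchmark's configuration.

```python
def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by both BM25 and the dense model rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]
```

A document that appears in only one list still enters the pool, which is one way a fused pool can reach higher recall than either source alone.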

Metric Interpretation for Model Researchers

Unlike the single-positive tasks, this split has 233 positives for 200 queries. nDCG@10 can reward ranking any relevant evidence passage highly, and recall@100 should be interpreted across all positive qrels, not just a single target per query. Multi-positive training objectives are more appropriate than forcing one evidence passage to represent the entire claim.
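
The reported recall@100 values are consistent with micro-averaging over the 233 qrel rows: 228 / 233 ≈ 0.9785 (hybrid), 226 / 233 ≈ 0.9700 (BM25), and 217 / 233 ≈ 0.9313 (dense). That reading is an inference from the arithmetic, not a documented evaluation procedure. Under that assumption, and assuming binary relevance, the three metrics can be sketched as follows. The function names and the dict-based `runs` and `qrels` layouts are hypothetical.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k for one query (ranked: list of doc ids)."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(positives), k)))
    return dcg / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def micro_recall_at_k(runs, qrels, k=100):
    """Share of all positive qrel rows retrieved in the top k, pooled over queries.

    runs: {query_id: ranked doc ids}; qrels: {query_id: set of positive doc ids}.
    """
    found = sum(len(set(runs.get(q, [])[:k]) & pos) for q, pos in qrels.items())
    total = sum(len(pos) for pos in qrels.values())
    return found / total
```

With multi-positive queries, a query can score a hit@10 of 1.0 while still missing some of its positives, which is why hit rate and recall can diverge on this split.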

The near-tie among BM25, dense, and hybrid in nDCG@10 means that small differences should be interpreted carefully. The broader conclusion is that entity-aware retrieval is already strong, and remaining improvements come from evidence selection and relation-level matching.

Query and Relevance Type Tendencies

Queries are Dutch factual claims. They often contain a named entity plus a predicate such as a birth date, title, occupation, location, release fact, or historical relation. Relevant documents are Wikipedia evidence passages that can support or refute the claim.

Relevance is evidence bearing. A document about the same entity is not always enough; the passage must contain the fact needed to verify the claim. For multi-positive claims, several passages may independently provide enough evidence or contribute related evidence.

Representative Failure Modes

BM25 can fail when the right entity has many candidate passages and the evidence relation is not the most lexically obvious one. Dense retrieval can fail by retrieving a semantically related entity page or same-topic passage that lacks the verifying fact. Hybrid retrieval can still surface entity-near distractors, especially when several passages share names and dates.

Hard negatives should be related Wikipedia passages that mention the same entity but do not support or refute the claim. These are more useful than random negatives because the task is already easy at the entity-matching level.
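
One rough way to collect such negatives is to walk a retriever's candidate list and keep the highest-ranked non-positive passages that mention the claim's entity. The sketch below assumes a hypothetical `corpus` dict of doc id to text and an `entity` string supplied by the caller; neither is part of the benchmark's schema. Because qrels can be incomplete, some mined passages may in fact be unlabeled evidence, so the output is best treated as candidates to review rather than guaranteed negatives.

```python
def mine_hard_negatives(candidates, positives, corpus, entity, n=5):
    """Return up to n top-ranked non-positive doc ids whose text mentions entity.

    candidates: doc ids in retrieval order (e.g. from BM25 or a hybrid pool).
    positives:  set of doc ids labeled relevant for the claim.
    corpus:     {doc_id: passage text} (hypothetical layout).
    """
    needle = entity.casefold()
    negatives = []
    for doc_id in candidates:
        if len(negatives) >= n:
            break
        if doc_id in positives:
            continue  # never use labeled evidence as a negative
        if needle in corpus[doc_id].casefold():
            negatives.append(doc_id)
    return negatives
```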

Training Data That May Help

Useful training data includes official FEVER training claim-evidence pairs with overlap removed, Dutch and multilingual fact-verification retrieval data, non-overlapping Dutch Wikipedia claim-evidence pairs, and hard negatives from related Wikipedia pages. Training should exclude the translated FEVER test queries, qrels, and positive evidence passages used by this Nano split.
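
A minimal sketch of the exclusion step, assuming training data is held as (claim, passage) text pairs and that leakage is detected by exact match after case and whitespace normalization. The function names are hypothetical. Exact matching is only a first pass: it will not catch a training claim that is a differently worded translation of the same English FEVER claim, so matching on original FEVER identifiers, where available, would be a stronger check.

```python
def normalize(text):
    """Case-fold and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.casefold().split())


def drop_leaked_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Remove training pairs whose claim or passage appears in the evaluation split."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    return [
        (claim, passage)
        for claim, passage in train_pairs
        if normalize(claim) not in blocked_queries
        and normalize(passage) not in blocked_docs
    ]
```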

Synthetic data can be generated from Dutch encyclopedia passages outside the evaluation set. Create factual claims that are explicitly supported or contradicted by the passage, preserving entities, dates, and relations. Multi-positive examples should be retained when several passages genuinely support the same claim.

Model Improvement Notes

Improving this task requires evidence-aware reranking rather than basic entity retrieval. Models should learn to distinguish passages that merely mention an entity from passages that verify the claim's specific relation. Date, title, occupation, location, and negation cues are especially important.

For rerankers, the hybrid pool is a strong starting point. It provides high candidate coverage, and the remaining challenge is relation-level evidence selection. Multi-positive supervision should be preserved so that the reranker does not penalize alternative valid evidence passages.
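
One way to keep multi-positive supervision is a softmax loss that credits the total probability mass placed on all positives, $-\log\big(\sum_{p \in P} e^{s_p} / \sum_{j} e^{s_j}\big)$, so the model is not penalized for preferring one valid evidence passage over another. This is one possible formulation offered as an assumption, not the objective used by any model reported here. The function name `multi_positive_nll` is hypothetical.

```python
import math


def multi_positive_nll(scores, positive_idx):
    """Negative log of the softmax mass assigned to all positive candidates.

    scores:       reranker scores for one query's candidate list.
    positive_idx: indices of the candidates labeled as positives.
    """
    if not positive_idx:
        raise ValueError("at least one positive index is required")
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    positive_mass = sum(exps[i] for i in positive_idx)
    return -math.log(positive_mass / sum(exps))
```

With a single positive, this reduces to the usual softmax cross-entropy over the candidate list.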

Example Data

| Query | Positive document |
| --- | --- |
| De huidige Hertog van York is een persoon. [42 chars] | Hertog van York De Hertog van York is een adellijke titel in de Peerage van het Verenigd Koninkrijk. Sinds de 15e eeuw is deze titel, wanneer verleend, meestal gegeven aan de tweede zoon van Engelse (later Britse) monarchen. De equivalente titel in de Schotse peerage was Hertog van Albany. Aanvankelijk verleend in de 14e eeuw in de Peerage van Engeland, is de titel Hertog van York acht keer gecreëerd. Daarnaast is de titel Hertog van York en Albany drie keer gecreëerd. Dit gebeurde in de 18e eeuw, na de unificatie van het Koninkrijk Engeland en het Koninkrijk Schotland in 1707 tot één verenigd rijk. De dubbele naamgeving werd gedaan zodat een territoriale aanduiding uit elk van de voorheen afzonderlijke rijken kon worden opgenomen. De huidige Hertog van York is Prins Andrew, de tweede zoon van Koningin Elizabeth II. Prins Andrew heeft momenteel geen mannelijke erfgenamen en is sinds zijn scheiding in 1996 ongehuwd. [931 chars] |
| Burbank, Californië is altijd volledig verstoken geweest van industrie. [71 chars] | Burbank, Californië Burbank is een stad in Los Angeles County in Zuid-Californië, Verenigde Staten, 19 km ten noordwesten van het centrum van Los Angeles. De bevolking bedroeg bij de volkstelling van 2010 103.340 inwoners. Bekend als de "Mediahoofdstad van de Wereld" en slechts een paar kilometer ten noordoosten van Hollywood, hebben talloze media- en entertainmentbedrijven hun hoofdkantoor of belangrijke productiefaciliteiten in Burbank, waaronder The Walt Disney Company, Warner Bros. Entertainment, Nickelodeon Animation Studios, NBC, Cartoon Network Studios met de West Coast-tak van Cartoon Network, en Insomniac Games. De stad herbergt ook Bob Hope Airport. Het was de locatie van Lockheed's Skunk Works, dat enkele van de meest geheime en technologisch geavanceerde vliegtuigen produceerde, waaronder de U-2 spionagevliegtuigen die in oktober 1962 de raketonderdelen van de Sovjet-Unie op Cuba ontdekten. Burbank bestaat uit twee verschillende gebieden: een centrum/voetheuvelgedeelte, in... [1,000 / 1,459 chars] |
| Er is software, genaamd Adobe Photoshop, waarvan de versies met een nummer worden aangeduid. [92 chars] | Adobe Photoshop Adobe Photoshop is een rastergrafische editor ontwikkeld en uitgegeven door Adobe Systems voor macOS en Windows. Photoshop werd in 1988 gecreëerd door Thomas en John Knoll. Sindsdien is het de facto industriestandaard geworden in rastergrafische bewerking, zodanig dat het woord "photoshoppen" een werkwoord is geworden, zoals in "een afbeelding photoshoppen", "photoshoppen" en "photoshop-wedstrijd", hoewel Adobe dit gebruik afraadt. Het kan rasterafbeeldingen in meerdere lagen bewerken en samenstellen en ondersteunt maskers, alfakanaalcompositing en verschillende kleurmodellen, waaronder RGB, CMYK, CIELAB, spotkleur en duotonen. Photoshop heeft brede ondersteuning voor grafische bestandsformaten, maar gebruikt ook zijn eigen PSD- en PSB-bestandsformaten die alle bovengenoemde functies ondersteunen. Naast rastergraphics heeft het beperkte mogelijkheden om tekst, vectorgraphics (vooral via clipping paths), 3D-graphics en video te bewerken of te renderen. De functieset van... [1,000 / 2,249 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| FEVER: a Large-scale Dataset for Fact Extraction and Verification | 2018 | ACL paper | https://aclanthology.org/N18-1074/ |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | arXiv paper | https://arxiv.org/abs/2104.08663 |
| BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language | 2025 | ACL paper | https://aclanthology.org/2025.bucc-1.5/ |
| clips/beir-nl-fever | | dataset card | https://huggingface.co/datasets/clips/beir-nl-fever |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoMTEB-Dutch |
| Backing dataset | NanoMTEB-Dutch |
| Task / split | fever |
| Hugging Face dataset | hakari-bench/NanoMTEB-Dutch |
| Language | nl |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 233 |
| Positives / query avg | 1.17 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 6 |
| Multi-positive queries | 27 (13.50%) |
| Query length avg chars | 54.87 |
| Document length avg chars | 445.71 |

Candidate Subsets

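| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.9221 | 0.9800 | 0.9700 | top-500 |
| Dense | harrier_oss_v1_270m | 0.9207 | 0.9400 | 0.9313 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.9215 | 0.9800 | 0.9785 | top-100 |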

Training and Leakage Metadata