MNanoBEIR / NanoBEIR-sv / NanoFEVER

Overview

NanoFEVER in the Swedish NanoBEIR slice is a Wikipedia evidence retrieval task derived from FEVER. The queries are factual claims translated into Swedish, and the corpus contains evidence passages translated into Swedish. The retrieval goal is to find passages that can verify the claim, rather than simply retrieve a page on the same topic. This compact task is useful for evaluating claim-to-evidence retrieval, entity grounding, and factual search behavior in a multilingual setting.

Details

What the Original Data Measures

FEVER was designed for fact extraction and verification over Wikipedia. In the retrieval step, a system receives a claim and must retrieve evidence passages that contain enough information to assess it. The relevant passage is often centered on a named entity, event, work, person, or place, but relevance depends on the relation asserted in the claim, not only on the topic.

In the Swedish translated version, the model must handle factual claims expressed in Swedish while many named entities, media titles, and organizations may retain international forms. This makes the task a mix of exact entity matching and evidence-sensitive semantic retrieval. A strong model should retrieve the passage that resolves the claim, not just a passage where the same entity name appears.

Observed Data Profile

The task contains 50 queries, 4,996 documents, and 57 relevance judgments. Most queries have one positive passage, with an average of 1.14 positives per query. The minimum is 1, the median is 1.0, the maximum is 3, and 6 queries have multiple positives, or 12.0% of the query set. This is therefore mostly a single-evidence retrieval task.
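The figures above can be recomputed from a list of positive judgments. The following is a minimal sketch, assuming qrels are available as (query id, document id) pairs; the function name and the synthetic ids are hypothetical. The toy data uses 44 queries with one positive, 5 with two, and 1 with three, which is the only split consistent with the reported totals (50 queries, 57 judgments, 6 multi-positive queries, maximum 3).

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) pairs."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for c in per_query if c > 1)
    return {
        "queries": len(per_query),
        "positives": sum(per_query),
        "avg": mean(per_query),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi": multi,
        "multi_share": multi / len(per_query),
    }


# Synthetic qrels (hypothetical ids): 44 single, 5 double, 1 triple positive.
toy_qrels = []
for i in range(50):
    n_pos = 1 if i < 44 else (2 if i < 49 else 3)
    toy_qrels += [(f"q{i}", f"d{i}_{j}") for j in range(n_pos)]

profile = positives_profile(toy_qrels)
print(profile)
```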

Queries average 44.64 characters, while documents average 1,166.66 characters. The claims are short and often contain distinctive entities or factual assertions. The relevant documents are much longer Wikipedia-style passages, so the model must locate an evidence-bearing passage inside a larger topical context.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.7512, hit@10 of 0.9200, and recall@100 of 0.9649 using the top-500 BM25 candidate subset. This is a very strong lexical baseline. FEVER-style claims often contain named entities and specific wording that also appears in the evidence passage, and BM25 can exploit those anchors effectively.

The remaining gap shows that exact terms are not enough for perfect ranking. Some claims require connecting a relation or factual attribute, and there may be several passages about the same entity. BM25 can find the right neighborhood, but it may still rank a related entity passage above the one that directly verifies the claim.

Dense Evaluation Profile

The dense harrier-oss-270m run (config `harrier_oss_v1_270m`) reaches nDCG@10 of 0.8570, hit@10 of 0.9400, and recall@100 of 0.9298. Dense retrieval improves top-rank quality over BM25 (nDCG@10 of 0.8570 versus 0.7512), although its recall@100 is lower (0.9298 versus 0.9649). This suggests that embedding similarity is especially useful for ordering the most relevant evidence passages near the top.

The dense advantage likely comes from matching the factual meaning of the claim rather than only matching entity terms. It can distinguish a passage about a work, person, or location that better supports the claim from other passages with similar lexical content. The lower recall@100 suggests that exact lexical anchors still help recover some candidates that dense retrieval may miss.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate set reaches nDCG@10 of 0.8153, hit@10 of 0.9800, and recall@100 of 1.0000. It uses exactly 100 candidates per query, with no rank-101 safeguard rows. The hybrid profile has the strongest hit@10 and complete recall@100, while dense retrieval has the strongest nDCG@10.

This is a clear case where hybrid search improves evidence coverage by combining lexical and dense sources. The full recall@100 means that every relevant passage is present in the hybrid candidate pool. However, the dense ranking orders the top 10 better (nDCG@10 of 0.8570 versus 0.8153). For a two-stage system, reranking_hybrid is an excellent candidate source; for first-stage ranking alone, dense retrieval has the strongest top-10 ordering.
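This card does not state how the reranking_hybrid pool is merged. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common way to combine a lexical and a dense ranking into one candidate list. The function name, the constant k=60, and the toy document ids are assumptions, not the benchmark's method.

```python
def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by doc id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Toy rankings (hypothetical ids): each source finds a document the other misses.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "d", "a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
```

The fused list contains the union of both sources, which is the property that lets a hybrid pool reach higher recall than either source alone.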

Metric Interpretation for Model Researchers

Because most queries have a single positive, nDCG@10 and hit@10 directly reflect whether the evidence passage is usable without deep reranking. recall@100 indicates whether a downstream verifier or reranker has access to the evidence. In this task, all three methods are comparatively strong, but they reveal different tradeoffs.
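For reference, the three metrics can be sketched per query as follows, assuming binary relevance and a ranked list of document ids. The function names are hypothetical, and the benchmark's exact implementation (for example, how per-query values are averaged) is not specified here.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k: DCG of the hits divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Share of the query's positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


# Toy query (hypothetical ids): the single positive sits at rank 2.
ranked = ["d3", "d1", "d2"]
positives = {"d1"}
print(ndcg_at_k(ranked, positives), hit_at_k(ranked, positives), recall_at_k(ranked, positives))
```

With a single positive, nDCG@10 reduces to 1 / log2(rank + 1) when the positive is in the top 10, so a positive at rank 2 scores about 0.63 while hit@10 stays at 1.0. This is why the two metrics can rank the same systems differently.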

BM25 shows that Swedish FEVER has strong lexical anchors. Dense retrieval shows that semantic ordering improves first-page quality. reranking_hybrid shows that combining the two is best for candidate completeness. A model researcher can use this task to test whether improvements come from finding missing evidence, ranking already-found evidence higher, or both.
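One way to separate the two sources of improvement is to bucket each positive by where it lands in a baseline run and in a candidate run. The sketch below is a hypothetical diagnostic, assuming each run maps a query id to a ranked list of document ids; the bucket names and helper names are illustrative.

```python
from collections import Counter


def evidence_status(ranked, positive, top_k=10, pool_k=100):
    """Bucket a positive as 'top10', 'pool' (ranks 11-100), or 'missing'."""
    if positive in ranked[:top_k]:
        return "top10"
    if positive in ranked[:pool_k]:
        return "pool"
    return "missing"


def status_transitions(baseline_run, candidate_run, qrels):
    """Count (baseline status, candidate status) pairs over all positives."""
    counts = Counter()
    for query_id, positive in qrels:
        before = evidence_status(baseline_run.get(query_id, []), positive)
        after = evidence_status(candidate_run.get(query_id, []), positive)
        counts[(before, after)] += 1
    return counts


# Toy runs (hypothetical ids).
toy_qrels = [("q1", "a"), ("q2", "b")]
baseline_run = {"q1": ["x"] * 10 + ["a"], "q2": ["y"]}
candidate_run = {"q1": ["a"], "q2": ["b"]}
moves = status_transitions(baseline_run, candidate_run, toy_qrels)
print(moves)
```

Transitions out of "missing" indicate newly found evidence, while transitions from "pool" to "top10" indicate better ranking of evidence that was already retrieved.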

Query and Relevance Type Tendencies

Queries are short factual claims such as whether Keith Godchaux knew of the Grateful Dead, whether a TV show is a sitcom, whether advanced aircraft were made in Burbank, whether Nero is a human, or whether Scream 2 is only a German film. Relevant documents are Wikipedia-style passages that provide the factual context.

The task rewards precise entity grounding. Many claims are easy to mishandle if the model retrieves the right entity but the wrong fact. It also rewards handling of negation-like or exclusivity wording, such as "only," because the relevant passage may need to contradict or qualify the claim rather than simply repeat it.

Representative Failure Modes

Likely failures include retrieving a passage about the correct entity that lacks the target fact, confusing similarly named works or people, over-ranking broad entity pages over specific evidence, and missing translated claim wording when entity names remain unchanged. BM25 may overvalue repeated names, while dense retrieval may occasionally miss exact lexical candidates with unusual titles or names.

Training Data That May Help

Useful training data includes claim-evidence retrieval, Wikipedia evidence mining, multilingual fact-checking, and hard negatives that share the same entity but do not verify the claim. Swedish Wikipedia and translated fact-checking examples can improve language coverage. For rerankers, same-entity negatives are particularly valuable because the main difficulty is often factual relation, not broad topic.
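Same-entity hard negatives can be mined with a simple string filter as a starting point. The sketch below is a rough illustration, assuming the corpus is a mapping from document id to text and that the entity string is already known for each claim; the helper name and the toy passages are hypothetical, and a real pipeline would likely need entity linking rather than substring matching.

```python
def same_entity_negatives(entity, positives, corpus, limit=5):
    """Return ids of non-positive documents that mention the entity string."""
    needle = entity.casefold()
    negatives = [
        doc_id
        for doc_id, text in corpus.items()
        if doc_id not in positives and needle in text.casefold()
    ]
    return negatives[:limit]


# Toy corpus (hypothetical ids and texts).
toy_corpus = {
    "d1": "Scream 2 är en amerikansk slasherfilm från 1997.",
    "d2": "Skådespelaren medverkade i Scream 2 och flera andra filmer.",
    "d3": "Burbank är en stad i Los Angeles County.",
}
hard_negatives = same_entity_negatives("Scream 2", {"d1"}, toy_corpus)
print(hard_negatives)
```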

Model Improvement Notes

A model targeting this task should preserve strong named-entity recall while improving factual relation ranking. Sparse systems should maintain exact entity and title matching. Dense systems should improve coverage for rare names while keeping their strong top-rank semantic ordering. Hybrid systems are well aligned with the task, especially when followed by a verifier or reranker that can inspect the evidence relation.

Example Data

Query	Positive document
Keith Godchaux kände till Grateful Dead [39 chars]	The Grateful Dead var ett amerikanskt rockband som bildades 1965 i Palo Alto, Kalifornien. Bandet, som bestod av fem till sju medlemmar, är känt för sin unika och eklektiska stil, som blandade element från rock, psychedelia, experimentell musik, modal jazz, country, folk, bluegrass, blues, reggae och space rock. De var också kända för sina långa instrumentala jam-sessioner och sin lojala fanskara, som kallades "Deadheads". Som Lenny Kaye skriver, "Deras musik berör områden som de flesta andra band inte ens vet att de finns." Dessa olika influenser smälte samman till en mångsidig och psykedelisk helhet som gjorde The Grateful Dead till "pionjärerna och gudfäderna inom jam-band-världen". Bandet rankades som nummer 57 av Rolling Stone Magazine i deras lista över "The Greatest Artists of All Time". The Grateful Dead blev invalda i Rock and Roll Hall of Fame 1994, och en inspelning av deras konsert den 8 maj 1977 på Cornell Universitys Barton Hall lades till i National Recording Registry på... [1,000 / 2,895 chars]
Taarak Mehta Ka Ooltah Chashmah är en sitcom. [45 chars]	Taarak Mehta Ka Ooltah Chashmah (Engelska: Taarak Mehtas olika perspektiv) är Indiens längsta löpande sitcom, producerad av Neela Tele Films Private Limited. Serien sändes för första gången den 28 juli 2008. Den sänds från måndag till fredag klockan 20:30, med upprepad sändning klockan 23:00 och dagen därpå klockan 15:00 på SAB TV. Serien började sin omgång på Sony Pal den 2 november 2015 klockan 16:30 och 20:00 varje dag. Serien är baserad på kolumnen Duniya Ne Oondha Chashma, skriven av kolumnisten och journalisten Taarak Mehta för den gujaratiska veckotidningen Chitralekha. [583 chars]
Hemliga och tekniskt avancerade flygplan tillverkades i Burbank, Kalifornien. [77 chars]	Burbank är en stad i Los Angeles County i södra Kalifornien, USA, cirka 19 kilometer nordväst om centrala Los Angeles. Vid folkräkningen 2010 hade staden 103 340 invånare. Staden marknadsförs som "Världens Mediehuvudstad" och ligger bara några mil nordost om Hollywood. Många medie- och underhållningsföretag har sitt huvudkontor eller betydande produktionsanläggningar i Burbank, bland annat The Walt Disney Company, Warner Bros. Entertainment, Nickelodeon Animation Studios, NBC, Cartoon Network Studios med Cartoon Networks västkustkontor, och Insomniac Games. Staden är också hem för Bob Hope Airport. Här fanns tidigare Lockheed's Skunk Works, som producerade några av de mest hemliga och tekniskt avancerade flygplanen, inklusive U-2-spionplanen som avslöjade sovjetiska missilkomponenter på Kuba i oktober 1962. Burbank består av två distinkta områden: en central/bergsdel i Verdugo Mountains fot och en slättlandsdel. Burbank är den östligaste staden i San Fernando Valley. Grannstaden Glenda... [1,000 / 1,280 chars]

Source Reference Table

Item	Reference
Original dataset	FEVER
Retrieval benchmark framing	BEIR
Multilingual benchmark context	MMTEB
NanoBEIR collection	NanoBEIR on Hugging Face
NanoBEIR-sv dataset	hakari-bench/NanoBEIR-sv

Representative query and positive evidence snippets:

Query	Positive document snippet
Keith Godchaux kände till Grateful Dead	The Grateful Dead var ett amerikanskt rockband som bildades 1965 i Palo Alto, Kalifornien...
Taarak Mehta Ka Ooltah Chashmah är en sitcom.	Taarak Mehta Ka Ooltah Chashmah är Indiens längsta löpande sitcom...
Hemliga och tekniskt avancerade flygplan tillverkades i Burbank, Kalifornien.	Burbank är en stad i Los Angeles County i södra Kalifornien, USA...
Nero är en människa	Den julisk-claudiska dynastin syftar på de första fem romerska kejsarna...
Scream 2 är enbart en tysk film.	Scream 2 är en amerikansk slasherfilm från 1997, regisserad av Wes Craven...

Dataset Information

Field	Value
Nano set	MNanoBEIR
Backing dataset	NanoBEIR-sv
Task / split	NanoFEVER
Hugging Face dataset	hakari-bench/NanoBEIR-sv
Language	sv
Category	natural_language
Queries	50
Documents	4,996
Positive qrels	57
Positives / query avg	1.14
Positives / query min	1
Positives / query median	1.00
Positives / query max	3
Multi-positive queries	6 (12.00%)
Query length avg chars	44.64
Document length avg chars	1,166.66

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.7512	0.9200	0.9649	top-500
Dense	`harrier_oss_v1_270m`	0.8570	0.9400	0.9298	top-500
Reranking hybrid	`reranking_hybrid`	0.8153	0.9800	1.0000	top-100
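
The rates in this table can be read back as counts. The sketch below is a small worked example using only the numbers above: it picks the best profile per metric and converts hit@10 to a number of queries out of 50. It also converts recall@100 to a number of judgments out of 57, which assumes recall is pooled over all judgments; that assumption is consistent with the reported values (55/57 ≈ 0.9649, 53/57 ≈ 0.9298) but is not stated by the source.

```python
QUERIES = 50
QRELS = 57

# Values copied from the Candidate Subsets table, keyed by config name.
profiles = {
    "bm25": {"ndcg@10": 0.7512, "hit@10": 0.9200, "recall@100": 0.9649},
    "harrier_oss_v1_270m": {"ndcg@10": 0.8570, "hit@10": 0.9400, "recall@100": 0.9298},
    "reranking_hybrid": {"ndcg@10": 0.8153, "hit@10": 0.9800, "recall@100": 1.0000},
}

# Best profile for each metric.
best = {
    metric: max(profiles, key=lambda name: profiles[name][metric])
    for metric in ("ndcg@10", "hit@10", "recall@100")
}

# Queries with at least one positive in the top 10.
queries_hit = {name: round(m["hit@10"] * QUERIES) for name, m in profiles.items()}

# Judgments found in the top 100, ASSUMING recall is pooled over all qrels.
qrels_found = {name: round(m["recall@100"] * QRELS) for name, m in profiles.items()}

print(best)
print(queries_hit)
print(qrels_found)
```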