MNanoBEIR / NanoBEIR-sr / NanoFEVER
Overview
NanoBEIR-sr NanoFEVER is a Serbian factual-claim evidence retrieval task derived from FEVER. Queries are short translated claims, and documents are translated Wikipedia-style evidence passages. The retrieval goal is to find the passage that can verify the claim before any support or refute decision. The task is useful for evaluating multilingual retrieval systems that must combine named-entity matching, semantic evidence matching, and precise claim-passage alignment in Serbian.
Details
What the Original Data Measures
FEVER was built for fact extraction and verification over Wikipedia, where systems retrieve evidence for claims before assigning verification labels. In BEIR, the evidence retrieval step is evaluated independently. The MNanoBEIR Serbian version keeps that structure after translation. It measures whether a retriever can connect a compact Serbian claim to the Wikipedia passage that contains the evidence needed for verification.
Observed Data Profile
This Nano subset contains 50 queries, 4,996 documents, and 57 positive qrels. Most queries have one positive, with a small multi-positive tail. The average is 1.14 positives per query, with a minimum of 1, median of 1.00, and maximum of 3. Six queries are multi-positive, covering 12.0% of the task. Queries average 46.14 characters, while documents average 1,184.60 characters. This creates a short-claim to long-evidence retrieval setting where early ranking of the correct passage is important.
BM25 Evaluation Profile
BM25 uses the bm25 top-500 candidate subset. It reaches nDCG@10 0.6486, hit@10 0.7800, and recall@100 0.8596. BM25 benefits from named entities, titles, and factual terms that appear in both claims and evidence passages. However, Serbian translated claims still require more than exact overlap in some cases. A passage can mention the right entity while lacking the specific fact, and title transliteration or paraphrase can reduce lexical matching. BM25 is a useful candidate generator but leaves room for semantic evidence ranking.
Dense Evaluation Profile
Dense retrieval uses the harrier_oss_v1_270m top-500 candidate subset. It scores nDCG@10 0.7611, hit@10 0.9000, and recall@100 0.8947, outperforming BM25 across all three reported metrics. Dense retrieval is better at connecting claim meaning to evidence context, especially when the passage contains the answer through paraphrase, description, or a longer explanation. This profile shows that Serbian FEVER-style retrieval benefits strongly from embedding similarity, while still retaining value from exact entity cues.
Reranking Hybrid Evaluation Profile
The reranking hybrid subset uses reranking_hybrid with top-100 candidates and an optional rank-101 safeguard. Candidate counts range from 100 to 101, with a mean of 100.04 and 2 safeguard rows. It reaches nDCG@10 0.7191, hit@10 0.8600, and recall@100 0.9474. Hybrid retrieval has the best top-100 coverage but does not match dense early ranking. This makes it a strong reranking input: the pool contains more positives, while the final ordering needs a model that can identify which passage actually verifies the claim.
Metric Interpretation for Model Researchers
Because most queries have one positive, hit@10 is close to a query-level first-page success signal, and recall@100 indicates whether the evidence is available to a reranker. nDCG@10 is the key early-ranking measure. Dense retrieval is strongest for ranking, while reranking hybrid is strongest for coverage. Researchers can use this task to study the tradeoff between semantic claim-evidence ranking and candidate recall in an entity-heavy fact-checking setting.
Query and Relevance Type Tendencies
Queries are short factual claims about people, media works, locations, historical figures, and films. Relevant documents are Wikipedia passages that contain the verification evidence. Examples include claims about Keith Godchaux and the Grateful Dead, a sitcom, advanced aircraft in Burbank, Nero, and the film Scream 2. The task favors models that preserve entity identity, title matching, and relation-specific evidence.
Representative Failure Modes
BM25 may retrieve the right entity page but not the passage that verifies the claim. Dense models may retrieve semantically related passages about the same entity or work while missing the requested fact. Hybrid retrieval improves candidate coverage but can include both lexical and semantic distractors. Serbian translation and transliteration can also create variants of names and titles that affect matching.
Training Data That May Help
Helpful training data includes non-overlapping claim-evidence retrieval, Serbian Wikipedia evidence mining, multilingual fact-checking, entity-centric QA, and hard-negative evidence selection. Hard negatives should come from related entities or neighboring events that share terms but fail to verify the claim. Training should exclude FEVER, BEIR, NanoBEIR, and direct translations of evaluation claims or evidence pages.
Model Improvement Notes
NanoFEVER-sr is a strong benchmark for answer and evidence-aware retrieval. Dense retrieval gives the best early ordering, while reranking hybrid gives the best evidence coverage. Improvements should focus on entity disambiguation, alias and transliteration handling, relation matching, and rerankers that check whether the passage actually verifies the claim. A practical system would use hybrid candidates for recall and a claim-evidence reranker for final ordering.
Example Data
| Query | Positive document |
| Kith Godčo je poznavao Grateful Dead. [37 chars] | Grateful Dead je bila američka rok grupa osnovana 1965. godine u Palo Altu u Kaliforniji. Sa sastavom koji je varirao od kvinteta do septeta, bend je poznat po svom jedinstvenom i eklektičnom stilu, koji je spajao elemente roka, psihedelije, eksperimentalne muzike, modalnog džeza, kantrija, folka, blugrejsa, bluza, regija i spejs roka, po živim nastupima sa dugim instrumentalnim džemovima i po svojoj posvećenoj bazi obožavatelja poznatih kao "Deadheads". "Njihova muzika", piše Lenny Kaye, "dodiruje temelje koje većina drugih grupa ni ne zna da postoje." Ovi različiti uticaji su destilisani u raznoliku i psihedeličnu celinu koja je učinila Grateful Dead "pionirskim kumovima sveta džem bendova". Časopis Rolling Stone svrstao je bend na 57. mesto u svojoj tematskoj svesci "Najveći umetnici svih vremena". Bend je primljen u Kuću slavnih rokenrola 1994. godine, a snimak njihovog nastupa od 8. maja 1977. na Barton Holu Univerziteta Cornell dodan je u Nacionalni registar snimaka Kongresne bib... [1,000 / 2,888 chars] |
| "Taarak Mehta Ka Ooltah Chashmah" je sitkom. [44 chars] | "Taarak Mehta Ka Ooltah Chashmah" (na engleskom: "Taarak Mehta's Different Perspective") je najduže trajuća indijska sitkom serija koju proizvodi Neela Tele Films Private Limited. Serija je počela sa emitovanjem 28. jula 2008. godine. Emituje se od ponedeljka do petka u 20:30, sa ponovnim prikazivanjem u 23:00 i sledećeg dana u 15:00 na SAB TV. Serija je počela sa reprizama na Sony Pal-u od 2. novembra 2015. svakodnevno u 16:30 i 20:00. Serija je bazirana na kolumni "Duniya Ne Oondha Chashma" koju je pisao kolumnista i novinar Taarak Mehta za gudžaratski nedeljni časopis Chitralekha. [590 chars] |
| Tajni i tehnološki napredni avioni proizvođeni su u Burbanku u Kaliforniji. [75 chars] | Burbank je grad u okrugu Los Anđeles u južnoj Kaliforniji, Sjedinjene Države, 19 km severozapadno od centra Los Anđelesa. Prema popisu iz 2010. godine, stanovništvo je iznosilo 103.340. Poznat kao "Medijska prestonica sveta" i udaljen samo nekoliko milja severoistočno od Holivuda, brojne medijske i zabavne kompanije imaju sedište ili značajne proizvodne objekte u Burbanku, uključujući The Walt Disney Company, Warner Bros. Entertainment, Nickelodeon Animation Studios, NBC, Cartoon Network Studios sa zapadnjačkom podružnicom Cartoon Network-a i Insomniac Games. Grad je takođe dom aerodroma Bob Hope. Bio je lokacija Lokidovog Skunk Works-a, koji je proizvodio neke od najtajnijih i tehnološki najnaprednijih aviona, uključujući špijunske avione U-2 koji su otkrili komponente sovjetskih raketa na Kubi u oktobru 1962. godine. Burbank se sastoji od dva različita područja: gradskog/predgorskog dela, u podnožju planina Verdugo, i ravničarskog dela. Burbank je najistočniji grad u dolini San Ferna... [1,000 / 1,321 chars] |
Source Reference Table
| Title | Year | Type | URL |
| FEVER: a Large-scale Dataset for Fact Extraction and VERification | 2018 | task paper | https://arxiv.org/abs/1803.05355 |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | benchmark paper | https://arxiv.org/abs/2104.08663 |
| MMTEB: Massive Multilingual Text Embedding Benchmark | 2025 | benchmark paper | https://arxiv.org/abs/2502.13595 |
| NanoBEIR: Smaller BEIR dataset subsets | 2024 | dataset collection | https://huggingface.co/collections/zeta-alpha-ai/nanobeir |
Dataset Information
| Field | Value |
| Nano set | MNanoBEIR |
| Backing dataset | NanoBEIR-sr |
| Task / split | NanoFEVER |
| Hugging Face dataset | hakari-bench/NanoBEIR-sr |
| Language | sr |
| Category | natural_language |
| Queries | 50 |
| Documents | 4,996 |
| Positive qrels | 57 |
| Positives / query avg | 1.14 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 3 |
| Multi-positive queries | 6 (12.00%) |
| Query length avg chars | 46.14 |
| Document length avg chars | 1,184.60 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.6486 | 0.7800 | 0.8596 | top-500 |
| Dense | harrier_oss_v1_270m | 0.7611 | 0.9000 | 0.8947 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.7191 | 0.8600 | 0.9474 | top-100 |