HAKARI-Bench

NanoMTEB-Dutch / sci_fact_nl

Overview

sci_fact_nl is the Dutch SciFact retrieval task from BEIR-NL. Queries are Dutch translations of scientific claims, and documents are translated scientific-paper abstracts. The Nano split contains 200 queries, 5,183 documents, and 226 positive qrel rows. Most claims have one positive abstract, but 16 queries have multiple positives, with at most five for a single query. The task evaluates scientific evidence retrieval for claim verification.

The task is harder than ordinary entity retrieval because relevance depends on whether an abstract supports or refutes a precise claim. BM25 is a useful baseline because gene names, diseases, interventions, and technical phrases often overlap between claim and abstract. Dense retrieval with harrier_oss_v1_270m has the strongest nDCG@10 (0.6758) and hit@10 (0.8300), while reranking_hybrid has the highest recall@100 (0.9558). The split therefore shows dense retrieval helping with scientific relation matching, and hybrid search providing the broadest candidate coverage for reranking.

Details

What the Original Data Measures

Fact or Fiction: Verifying Scientific Claims introduced SciFact as a scientific claim-verification dataset with expert-written claims, evidence abstracts, support/refute labels, and evidence rationales. BEIR uses SciFact as a retrieval task: given a scientific claim, retrieve abstracts that provide evidence for or against it.

BEIR-NL translates public BEIR datasets into Dutch. This split is therefore Dutch-translated scientific evidence retrieval, not a natively authored Dutch scientific-claim corpus. Scientific terminology, names, abbreviations, and measurement expressions often remain important even after translation.

Observed Data Profile

The split contains 200 claims and 5,183 abstracts. Queries average 100.13 characters, much longer than ordinary web questions, and documents average 1,640.32 characters. Abstracts include titles, methods, measurements, interventions, populations, and findings. The claim usually states a specific scientific relation or result.
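The profile figures above are simple counts and averages. A minimal sketch of how such a profile could be computed is shown here, assuming queries and documents are held as id-to-text dictionaries and qrels as a mapping from query id to a set of positive document ids. The function name `profile` and the toy data are illustrative assumptions, not part of the benchmark tooling.

```python
from statistics import mean, median


def profile(queries, docs, qrels):
    """Summarize a retrieval split.

    queries, docs: dict mapping id -> text (assumed layout).
    qrels: dict mapping query id -> set of positive document ids.
    """
    per_query = [len(positives) for positives in qrels.values()]
    return {
        "queries": len(queries),
        "documents": len(docs),
        "positive_qrels": sum(per_query),
        "positives_per_query_avg": mean(per_query),
        "positives_per_query_median": median(per_query),
        "positives_per_query_max": max(per_query),
        "multi_positive_queries": sum(1 for n in per_query if n > 1),
        "query_len_avg_chars": mean(len(text) for text in queries.values()),
        "doc_len_avg_chars": mean(len(text) for text in docs.values()),
    }


# Toy data for illustration only; these are not rows from the split.
toy_queries = {"q1": "CRP voorspelt mortaliteit.", "q2": "Obesitas is erfelijk."}
toy_docs = {"d1": "abstract een", "d2": "abstract twee", "d3": "abstract drie"}
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}}

stats = profile(toy_queries, toy_docs, toy_qrels)
print(stats)
```

On the real split, the same arithmetic gives 226 / 200 = 1.13 positives per query on average.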

Representative claims concern metastatic colorectal cancer treatment, CRP as a predictor after coronary bypass surgery, the role of arginine 90 in p150 interaction with EB1, whether obesity is determined only by environmental factors, and the effect of febrile seizures on later epilepsy. These are precise claim-evidence relationships, not broad topic searches.

BM25 Evaluation Profile

BM25 reaches nDCG@10 = 0.6160, hit@10 = 0.7900, and recall@100 = 0.8363 over top-500 candidate lists. Sparse retrieval works reasonably well because claims and abstracts often share technical terms, disease names, genes, interventions, or outcome measures. Exact terminology is a real signal in scientific retrieval.

BM25's limitation is relation specificity. An abstract can share the same entities but report a different outcome, study design, or conclusion. A claim about increased risk, reduced effectiveness, or molecular interaction requires matching the finding, not just the vocabulary.
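The vocabulary-versus-finding problem can be illustrated with a small Okapi BM25 scorer. This is a sketch under stated assumptions: the whitespace tokenizer, the parameters `k1=1.2` and `b=0.75`, and the toy abstracts are all illustrative, since the source does not state which BM25 implementation, tokenizer, or parameters produced the reported scores.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption, not the benchmark's).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (id -> text) against `query` with Okapi BM25."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores


# Toy abstracts: the distractor shares four of five claim terms
# but reports a different outcome (infection instead of mortality).
toy_docs = {
    "evidence": "crp voorspelt mortaliteit na cabg",
    "distractor": "crp voorspelt infectie na cabg",
    "other": "obesitas en omgevingsfactoren",
}
scores = bm25_scores("crp voorspelt mortaliteit na cabg", toy_docs)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

In this toy case the distractor scores well above the unrelated abstract purely on shared entities, which is the behavior that makes entity-sharing abstracts hard negatives for sparse retrieval.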

Dense Evaluation Profile

Dense retrieval with harrier_oss_v1_270m reaches nDCG@10 = 0.6758, hit@10 = 0.8300, and recall@100 = 0.9336. Of the three candidate profiles, it ranks positives highest within the top 10. It appears to capture scientific claim-to-abstract semantics better than BM25, especially when the evidence is phrased differently from the claim.

The remaining dense errors are likely terminology-sharing hard negatives. Scientific abstracts can be close in embedding space because they mention the same disease, gene, or method while supporting a different conclusion. A strong model must preserve directionality, measurement, and evidence relation.
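Dense retrieval of this kind typically ranks documents by the similarity between a query embedding and each document embedding. The sketch below uses cosine similarity over hand-written toy vectors. The vectors are made up for illustration and are not outputs of harrier_oss_v1_270m, and the source does not state which similarity function the benchmark uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def dense_top_k(query_vec, doc_vecs, k=10):
    """Return the ids of the k documents most similar to the query vector."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional embeddings, for illustration only.
query_vec = [1.0, 0.0, 0.2]
doc_vecs = {
    "evidence": [0.9, 0.1, 0.3],
    "distractor": [0.8, 0.5, 0.1],  # same topic, different finding
    "unrelated": [0.0, 1.0, 0.0],
}

top = dense_top_k(query_vec, doc_vecs, k=2)
print(top)
```

The distractor sits close to the evidence abstract in this toy space, which mirrors the terminology-sharing hard negatives described above.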

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate column reaches nDCG@10 = 0.6709, hit@10 = 0.8200, and recall@100 = 0.9558, with 100 to 101 candidates per query and 10 rank-101 safeguard rows. It has the highest recall@100 (0.9558 versus 0.9336 for dense retrieval), while dense retrieval has slightly better top-10 ranking (nDCG@10 of 0.6758 versus 0.6709). This makes the hybrid list the best candidate pool for reranking.

Hybrid retrieval combines BM25's exact scientific terminology with dense semantic evidence matching. A reranker can then decide whether the candidate abstract actually supports or refutes the claim, rather than merely sharing scientific vocabulary.
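One common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The source does not say how reranking_hybrid is built, so the sketch below is an assumption that shows the general idea, not the benchmark's actual fusion method. The constant `k=60`, the `top_n` cutoff, and the document ids are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of document ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    Ties are broken by document id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Hypothetical candidate lists for one claim.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents found by both retrievers rise to the top, while documents found by only one are still kept in the pool, which is why a fused list can reach higher recall than either source alone.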

Metric Interpretation for Model Researchers

The task has 226 positives for 200 queries, so it is mostly, but not entirely, single-positive: 16 queries have more than one positive. nDCG@10 and hit@10 are useful for first-stage ranking, while recall@100 measures candidate availability for a reranker. Dense retrieval is the best top-rank signal; hybrid retrieval gives broader coverage.
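The three metrics can be sketched for a single query as follows, assuming binary relevance and the standard log2 discount for nDCG. The source does not state how per-query values are averaged into the reported numbers, so this sketch stops at the per-query level.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k for one query."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(len(positives), k)))
    return dcg / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def recall_at_k(ranked, positives, k=100):
    """Fraction of this query's positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


# Toy query with two positives, found at ranks 2 and 4.
ranked = ["d9", "d1", "d4", "d2"]
positives = {"d1", "d2"}

print(ndcg_at_k(ranked, positives))       # about 0.651
print(hit_at_k(ranked, positives))        # 1.0
print(recall_at_k(ranked, positives, 2))  # 0.5
```

For a single-positive query, hit@10 and recall@10 coincide; they differ only on the multi-positive queries.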

The important comparison is not only sparse versus dense. It is whether the system can retrieve evidence that matches the claim's scientific relation. Entity overlap alone is not enough.

Query and Relevance Type Tendencies

Queries are precise scientific claims. They often contain biomedical entities, interventions, outcomes, molecular relations, or statistical conclusions. Relevant documents are abstracts that support or refute the claim.

Relevance is evidence-bearing: a document about the same disease or gene is not relevant unless it contains the finding needed to verify the claim.

Representative Failure Modes

BM25 can fail by retrieving abstracts with shared terminology but incompatible findings. Dense retrieval can fail by overgeneralizing among abstracts about the same topic. Hybrid retrieval can include both the right evidence and terminology-sharing distractors, so reranking remains important.

Hard negatives should share diseases, genes, methods, or interventions while changing the result or relation. These negatives are critical for claim verification retrieval.
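A simple way to find such candidates is to pick unlabeled abstracts with high lexical overlap with the claim. The sketch below uses Jaccard overlap on whitespace tokens. The function names, the `min_overlap` threshold, and the toy texts are illustrative assumptions. Unlabeled abstracts are not guaranteed to be true negatives, so mined candidates may still need filtering.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercase token sets of two texts."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def mine_hard_negatives(claim, docs, positives, n=5, min_overlap=0.2):
    """Return up to n non-positive doc ids that share the most vocabulary with the claim."""
    candidates = [
        (jaccard(claim, text), doc_id)
        for doc_id, text in docs.items()
        if doc_id not in positives
    ]
    candidates = [c for c in candidates if c[0] >= min_overlap]
    candidates.sort(key=lambda c: (-c[0], c[1]))
    return [doc_id for _, doc_id in candidates[:n]]


# Toy corpus, for illustration only.
toy_docs = {
    "pos": "crp voorspelt mortaliteit na cabg operatie",
    "hard": "crp voorspelt infectie na cabg",  # same entities, different outcome
    "easy": "obesitas en omgevingsfactoren",
}
hard_negatives = mine_hard_negatives(
    "crp voorspelt mortaliteit na cabg", toy_docs, positives={"pos"}
)
print(hard_negatives)
```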

Training Data That May Help

Useful training data includes official SciFact training data with overlap removed, scientific claim-verification retrieval datasets, biomedical abstract retrieval pairs, and Dutch or multilingual scientific evidence pairs. Training should exclude translated SciFact test claims, qrels, and evidence abstracts used by this Nano split.
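The exclusion step can be sketched as a filter over training pairs. The record layout (`claim` and `doc_id` keys) and the function names are assumptions for illustration. Exact text matching will not catch a claim that appears in another language or translation, so matching on original SciFact identifiers, where available, is likely to be more reliable.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def remove_leakage(train_pairs, eval_claims, eval_doc_ids):
    """Drop training pairs whose claim text or document id appears in the evaluation split.

    train_pairs: list of dicts with "claim" and "doc_id" keys (assumed layout).
    """
    blocked_claims = {normalize(claim) for claim in eval_claims}
    blocked_docs = set(eval_doc_ids)
    return [
        pair
        for pair in train_pairs
        if normalize(pair["claim"]) not in blocked_claims
        and pair["doc_id"] not in blocked_docs
    ]


# Toy records, for illustration only.
train_pairs = [
    {"claim": "CRP  voorspelt mortaliteit.", "doc_id": "t1"},  # claim overlaps eval
    {"claim": "Een andere bewering.", "doc_id": "e7"},         # document overlaps eval
    {"claim": "Een veilige bewering.", "doc_id": "t3"},
]
clean = remove_leakage(
    train_pairs,
    eval_claims=["crp voorspelt mortaliteit."],
    eval_doc_ids=["e7"],
)
print(clean)
```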

Synthetic data can be generated from non-evaluation scientific abstracts. Create precise Dutch claims that are supported or refuted by explicit findings. Hard negatives should share terminology but imply a different result or relation.

Model Improvement Notes

Improving this task requires scientific relation modeling. Dense encoders should capture claim direction, entity roles, and outcome wording. Rerankers should compare the claim to the abstract's findings and not stop at keyword or topic overlap.

Hybrid retrieval is useful as a high-recall source, but final quality depends on evidence-aware ranking.

Example Data

Query | Positive document
--- | ---
Metastatische colorectale kanker behandeld met enkelvoudige fluoropyrimidinen resulteerde in verminderde werkzaamheid en lagere kwaliteit van leven in vergelijking met oxaliplatine-gebaseerde chemotherapie bij oudere patiënten. [227 chars] | Chemotherapieopties bij oudere en kwetsbare patiënten met metastatische colorectale kanker (MRC FOCUS2): een open-label, gerandomiseerde factorieel onderzoek ACHTERGROND Oudere en kwetsbare patiënten met kanker, hoewel vaak behandeld met chemotherapie, zijn ondervertegenwoordigd in klinische studies. We hebben FOCUS2 ontworpen om chemotherapieopties met gereduceerde dosering te onderzoeken en om objectieve voorspellers van de uitkomst bij kwetsbare patiënten met gevorderde colorectale kanker te zoeken. METHODEN We voerden een open, 2 × 2 factorieel onderzoek uit in 61 Britse centra voor patiënten met eerder onbehandelde gevorderde colorectale kanker die niet geschikt werden geacht voor chemotherapie met volledige dosis. Na een uitgebreide gezondheidsbeoordeling (CHA) werden patiënten gerandomiseerd door minimalisatie aan: 48-uurs intraveneus fluorouracil met levofolinate (groep A); oxaliplatine en fluorouracil (groep B); capecitabine (groep C); of oxaliplatine en capecitabine (groep D)... [1,000 / 3,389 chars]
CRP is geen voorspeller van postoperatieve mortaliteit na een coronaire arteriële bypass graft (CABG) operatie. [111 chars] | Beoordeling van de kosteneffectiviteit van het gebruik van prognostische biomarkers met beslissingsmodellen: een casestudy naar prioritering van patiënten in afwachting van coronaire bypass-chirurgie DOEL De effectiviteit en kosteneffectiviteit bepalen van het gebruik van informatie van circulerende biomarkers om het prioriteringsproces te informeren van patiënten met stabiele angina pectoris die wachten op een coronaire bypass-operatie. ONTWERP Beslissingsanalytisch model dat vier prioriteitsstrategieën zonder biomarkers vergelijkt (geen formele prioritering, twee urgentiescores en een risico-score) en drie strategieën gebaseerd op een risico-score met biomarkers: een routinematig beoordeelde biomarker (geschatte glomerulaire filtratiesnelheid), een nieuwe biomarker (C-reactief proteïne) of beide. De volgorde waarin een coronaire bypass-operatie in een cohort van patiënten werd uitgevoerd, werd bepaald door elke prioriteitsstrategie, en de gemiddelde levenslange kosten en kwaliteitsge... [1,000 / 3,250 chars]
Arginine 90 in p150n is belangrijk voor de interactie met EB1. [62 chars] | Structurele basis voor de activatie van microtubulusassemblage door het EB1 en p150Glued complex. Plus-eind trackende eiwitten, zoals EB1 en het dyneïne/dynactine complex, reguleren microtubulusdynamiek. Aangenomen wordt dat deze eiwitten microtubuli stabiliseren door een plus-eind complex te vormen aan de groeiende uiteinden van microtubuli, met slecht gedefinieerde mechanismen. Hier rapporteren we de kristalstructuur van twee componenten van het plus-eind complex, het carboxy-terminale dimerisatiedomein van EB1 en het microtubulusbindende (CAP-Gly) domein van de dynactine-subeenheid p150Glued. Elk molecuul van het EB1 dimeer bevat twee helixen die een geconserveerd vier-helix bundel vormen, terwijl het ook p150Glued bindingsplaatsen in zijn flexibele staartregio verschaft. Door kristallografie, NMR en mutatieanalyses te combineren, onthullen onze studies de kritische interagerende elementen van zowel EB1 als p150Glued, waarvan mutatie de microtubuluspolymerisatie-activiteit verandert... [1,000 / 1,328 chars]

Source Reference Table

Title | Year | Type | URL
--- | --- | --- | ---
Fact or Fiction: Verifying Scientific Claims | 2020 | arXiv paper | https://arxiv.org/abs/2004.14974
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | arXiv paper | https://arxiv.org/abs/2104.08663
BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language | 2025 | ACL paper | https://aclanthology.org/2025.bucc-1.5/
clips/beir-nl-scifact |  | dataset card | https://huggingface.co/datasets/clips/beir-nl-scifact

Dataset Information

Field | Value
--- | ---
Nano set | NanoMTEB-Dutch
Backing dataset | NanoMTEB-Dutch
Task / split | sci_fact_nl
Hugging Face dataset | hakari-bench/NanoMTEB-Dutch
Language | nl
Category | natural_language
Queries | 200
Documents | 5,183
Positive qrels | 226
Positives / query avg | 1.13
Positives / query min | 1
Positives / query median | 1.00
Positives / query max | 5
Multi-positive queries | 16 (8.00%)
Query length avg chars | 100.13
Document length avg chars | 1,640.32

Candidate Subsets

Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates
--- | --- | --- | --- | --- | ---
BM25 | bm25 | 0.6160 | 0.7900 | 0.8363 | top-500
Dense | harrier_oss_v1_270m | 0.6758 | 0.8300 | 0.9336 | top-500
Reranking hybrid | reranking_hybrid | 0.6709 | 0.8200 | 0.9558 | top-100
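
The hit@10 rates can be read as query counts. The short sketch below converts them, assuming hit@10 is the fraction of the 200 queries with at least one positive in the top 10 (the usual definition, which the source does not spell out).

```python
QUERIES = 200  # query count reported for this split

# hit@10 values as reported for each candidate profile.
hit_at_10 = {
    "bm25": 0.7900,
    "harrier_oss_v1_270m": 0.8300,
    "reranking_hybrid": 0.8200,
}

# Queries with at least one positive in the top 10, and queries without one.
hit_counts = {name: round(rate * QUERIES) for name, rate in hit_at_10.items()}
missed_counts = {name: QUERIES - hits for name, hits in hit_counts.items()}

print(hit_counts)     # {'bm25': 158, 'harrier_oss_v1_270m': 166, 'reranking_hybrid': 164}
print(missed_counts)  # {'bm25': 42, 'harrier_oss_v1_270m': 34, 'reranking_hybrid': 36}
```

Under that assumption, the gap between dense retrieval and BM25 at hit@10 is 8 queries, and the gap between dense retrieval and the hybrid list is 2 queries.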

Training and Leakage Metadata