MNanoBEIR / NanoBEIR-no / NanoSciFact

Overview

NanoBEIR-no NanoSciFact is a Norwegian scientific claim verification retrieval task derived from SciFact. Queries are translated scientific claims, and documents are translated scientific abstracts that provide evidence supporting or refuting those claims. The task differs from general scientific related-paper retrieval because relevance depends on whether the abstract contains evidence for a specific claim. It is therefore a compact benchmark for scientific evidence retrieval, handling of biomedical terminology, and claim-to-abstract matching in a multilingual setting.

Details

What the Original Data Measures

SciFact was introduced for verifying expert-written scientific claims against evidence abstracts, with support and refute labels plus rationales. In BEIR, SciFact is evaluated as a retrieval task: the system must first retrieve the abstracts that contain the evidence needed for verification. The MNanoBEIR Norwegian version keeps that claim-to-evidence structure after translation. It measures whether models can connect precise scientific assertions to abstracts that report the relevant findings, mechanisms, or experimental outcomes.

Observed Data Profile

This Nano subset contains 50 queries, 2,919 documents, and 56 positive qrels. Most queries have one positive, while 4 queries have multiple positives. The average is 1.12 positives per query, with a minimum of 1, median of 1.00, and maximum of 4. Queries average 96.18 characters and are often technical scientific claims. Documents are long abstracts averaging 1,424.51 characters. The task therefore requires matching a compact claim to a long abstract that contains the correct scientific evidence.
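The profile figures above can be recomputed from a qrels mapping. The following is a minimal sketch, assuming qrels are held as a dict from query ID to a set of positive document IDs; the function name `positives_profile` and the toy IDs are hypothetical and do not come from the dataset.

```python
from statistics import mean, median

def positives_profile(qrels):
    """Summarize positives per query from {query_id: set(doc_ids)}."""
    counts = [len(docs) for docs in qrels.values()]
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        # Queries with more than one positive document.
        "multi_positive_queries": sum(1 for c in counts if c > 1),
    }

# Toy qrels with hypothetical IDs, not the real dataset.
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}, "q3": {"d4"}}
profile = positives_profile(toy_qrels)
```

Applied to the counts reported above, the same arithmetic gives 56 / 50 = 1.12 positives per query.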

BM25 Evaluation Profile

BM25 uses the bm25 top-500 candidate subset. It reaches nDCG@10 0.5652, hit@10 0.7200, and recall@100 0.8571. This is a strong lexical baseline for a scientific task: a positive abstract appears in the top 10 for 72% of queries. Claims and abstracts often share biomedical entities, abbreviations, proteins, diseases, or technical terms, giving BM25 useful anchors. However, evidence retrieval still requires more than term matching: abstracts may share terminology while reporting different findings, and a claim may be expressed as a paraphrase of the evidence. BM25 is a useful candidate generator but can confuse same-domain abstracts with true evidence.
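To illustrate why shared technical terms act as anchors, here is a minimal Okapi BM25 sketch. The source does not specify which BM25 implementation, tokenizer, or parameters were used, so the whitespace tokenization, `k1=1.5`, `b=0.75`, the smoothed IDF form, and the toy documents are all assumptions.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]  # naive whitespace tokens
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy documents (hypothetical), loosely modeled on the example data.
docs = [
    "ly49q regulerer membranrafter i neutrofiler",
    "antiretroviral behandling og tuberkulose",
    "neutrofiler ved betennelse",
]
scores = bm25_scores("ly49q neutrofiler", docs)
```

In this toy case the third document scores above zero only because it shares the term "neutrofiler", which is the same-domain distractor behavior described above.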

Dense Evaluation Profile

Dense retrieval uses the harrier_oss_v1_270m top-500 candidate subset. It scores nDCG@10 0.6217, hit@10 0.7600, and recall@100 0.9107, improving over BM25 on every metric. Dense retrieval helps because scientific claims often express a conclusion, while the evidence abstract may describe the experiment, mechanism, or result in different words. Embedding similarity can connect those forms of meaning better than exact lexical overlap alone. The remaining errors likely come from highly technical distinctions, negation, and abstracts that share entities but support different findings.
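Embedding-based ranking reduces to sorting documents by vector similarity to the query. The sketch below assumes cosine similarity over precomputed vectors; it does not load harrier_oss_v1_270m, and the two-dimensional vectors and function names are hypothetical stand-ins for real embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_cosine(query_vec, doc_vecs, k=10):
    """Return the top-k document IDs ordered by similarity to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]

# Toy vectors (hypothetical), not produced by any real model.
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}
top = rank_by_cosine([1.0, 0.0], doc_vecs, k=3)
```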

Reranking Hybrid Evaluation Profile

The reranking hybrid subset uses the reranking_hybrid config with top-100 candidates and an optional rank-101 safeguard. Candidate counts range from 100 to 101, with a mean of 100.08 and 4 safeguard rows. It reaches nDCG@10 0.6137, hit@10 0.7600, and recall@100 0.9286. The hybrid pool has the best recall@100 of the three profiles and matches dense on hit@10, while dense has a slightly higher nDCG@10 (0.6217 vs. 0.6137). The margins over dense are small, but the pattern is consistent with hybrid search acting as complementary evidence collection: BM25 contributes technical term coverage, dense retrieval contributes semantic claim-evidence matching, and the combined pool gives a reranker access to more positives.
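The source does not say how the hybrid pool is built or what exactly the rank-101 safeguard does. As one possible illustration, the sketch below merges two rankings with reciprocal rank fusion (a swapped-in technique, not confirmed as the method used here) and treats the safeguard as appending one extra document when it is missing from the pool. The function name, the `k=60` constant, and the `safeguard_doc` parameter are assumptions.

```python
def rrf_pool(rankings, top_k=100, k=60, safeguard_doc=None):
    """Fuse ranked lists with reciprocal rank fusion and keep top_k."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score; break ties by doc ID for deterministic output.
    pool = sorted(scores, key=lambda d: (-scores[d], d))[:top_k]
    # Hypothetical safeguard: append one extra candidate if it is absent.
    if safeguard_doc is not None and safeguard_doc not in pool:
        pool.append(safeguard_doc)
    return pool

# Toy rankings (hypothetical document IDs).
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
pool = rrf_pool([bm25_ranking, dense_ranking], top_k=3)
pool_safe = rrf_pool([bm25_ranking, dense_ranking], top_k=3, safeguard_doc="d")
```

The reported mean pool size follows directly from the counts: with 50 queries and 4 safeguard rows, (46 × 100 + 4 × 101) / 50 = 100.08.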

Metric Interpretation for Model Researchers

Because most queries have exactly one positive, hit@10 is nearly equivalent to recall@10 and can be read as a query-level success rate, while recall@100 indicates whether the evidence abstract is available for reranking. nDCG@10 matters because downstream verification needs the evidence near the top, not buried in a long candidate list. In the observed scores, dense retrieval is best for early ordering (nDCG@10 0.6217) and reranking hybrid is best for coverage (recall@100 0.9286), though both margins are small on a 50-query set. For research, this task is useful for separating scientific semantic understanding from exact terminology matching.
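The three metrics discussed above can be computed per query as in the following minimal sketch. It assumes binary relevance (a document is either a positive qrel or not); the source does not state how per-query values are aggregated, and the function names are illustrative.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of positives found in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(ranked[:k])
        if d in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal

# Toy ranking (hypothetical IDs): the single positive sits at rank 2.
ranked = ["d3", "d1", "d2"]
positives = {"d1"}
hit = hit_at_k(ranked, positives)
rec = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

With a single positive at rank 2, hit@10 and recall@100 are both 1.0, but nDCG@10 drops to 1 / log2(3), which is why nDCG@10 is the more sensitive signal for early ordering.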

Query and Relevance Type Tendencies

Queries are atomic scientific claims, often about biological mechanisms, clinical interventions, gene expression, screening methods, or disease-related findings. Relevant documents are abstracts that contain evidence for or against the claim. The relation is evidence-specific: a document about the same protein, disease, or intervention is not enough unless it reports the relevant result. This favors models that can represent scientific predicates, experimental context, and the direction of a claim.

Representative Failure Modes

BM25 may retrieve abstracts that share rare biomedical terms but do not verify the claim. Dense systems may retrieve semantically related abstracts that are too broad or that discuss the same pathway without the specific finding. Hybrid systems improve coverage but can still mix true evidence with same-domain distractors. Translation may introduce additional difficulty for abbreviations, technical names, and nuanced scientific phrasing, especially where negation or causality is important.

Training Data That May Help

Helpful training data includes non-overlapping scientific fact verification, claim-evidence retrieval, biomedical abstract retrieval, scientific NLI, clinical trial evidence selection, and multilingual scientific retrieval. Hard negatives should come from the same discipline and share key terms while lacking the claim's specific finding. To avoid test-set leakage, training should exclude SciFact, BEIR, NanoBEIR, and overlapping translated abstracts.
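One simple way to find negatives that share key terms with a claim is to rank non-positive documents by term overlap. This is a rough sketch under stated assumptions: whitespace tokenization, raw overlap counts as the hardness signal, and hypothetical names and toy texts. Mined negatives should still be checked, since an unlabeled document can turn out to be true evidence.

```python
def mine_hard_negatives(claim, corpus, positives, n=2):
    """Pick non-positive documents sharing the most terms with the claim."""
    claim_terms = set(claim.lower().split())
    scored = []
    for doc_id, text in corpus.items():
        if doc_id in positives:
            continue  # never use a labeled positive as a negative
        overlap = len(claim_terms & set(text.lower().split()))
        if overlap > 0:
            scored.append((overlap, doc_id))
    # Highest overlap first; break ties by doc ID for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:n]]

# Toy corpus (hypothetical texts and IDs).
corpus = {
    "d1": "ly49q regulerer membranrafter i neutrofiler",
    "d2": "neutrofiler og ly49q uttrykk",
    "d3": "tuberkulose behandling",
    "d4": "neutrofiler ved betennelse",
}
negatives = mine_hard_negatives("ly49q regulerer neutrofiler", corpus, {"d1"})
```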

Model Improvement Notes

NanoBEIR-no NanoSciFact is a compact test of scientific evidence retrieval. Dense retrieval is strongest for ranking, while reranking hybrid provides the best top-100 evidence coverage. Improvements should focus on biomedical and scientific-domain embeddings, claim predicate modeling, negation and causal relation handling, and rerankers that compare a claim against abstract-level evidence. A strong retrieval system should combine exact technical term sensitivity with semantic evidence matching.

Example Data

Query	Positive document
Ly49Q styrer organiseringen av neutrofilmigrering til betennelsesområder ved å regulere membranraftfunksjoner. [110 chars]	Neutrofiler gjennomgår rask polarisering og rettet bevegelse for å trenge inn i infeksjons- og betennelsesområder. Her viser vi at en inhiberende MHC I-reseptor, Ly49Q, var avgjørende for rask polarisering og vevsinfiltrering av neutrofiler. Under stasjonær tilstand hemmet Ly49Q neutrofiladhæsjon ved å forhindre dannelse av fokalkomplekser, sannsynligvis ved å hemme Src- og PI3-kinaser. Imidlertid, i nærvær av betennelsesstimuli, medierte Ly49Q rask neutrofilpolarisering og vevsinfiltrering på en måte som er avhengig av ITIM-domener. Disse motstridende funksjonene synes å være mediert av forskjellig bruk av effektor fosfatase SHP-1 og SHP-2. Ly49Q-avhengig polarisering og migrasjon ble påvirket av Ly49Q-regulering av membranraftfunksjoner. Vi foreslår at Ly49Q er avgjørende for å omstille neutrofiler til deres polariserte morfologi og rask migrasjon ved betennelse, gjennom sin spatiotemporale regulering av membranrafter og raft-assosierte signalmolekyler. [969 chars]
Antiretroviral behandling reduserer forekomsten av tuberkulose i ulike CD4-nivåer. [82 chars]	BAKGRUNN Humant immunsviktvirus (HIV) infeksjon er den sterkeste risikofaktoren for å utvikle tuberkulose og har bidratt til en økning i forekomsten, spesielt i sub-Sahara-Afrika. I 2010 var det anslått 1,1 millioner nye tilfeller av tuberkulose blant de 34 millioner menneskene som levde med HIV i verden. Antiretroviral behandling har betydelig potensial for å forebygge HIV-relatert tuberkulose. Vi gjennomførte en systematisk gjennomgang av studier som analyserte effekten av antiretroviral behandling på forekomsten av tuberkulose hos voksne med HIV-infeksjon. METODER OG FUNN Vi gjennomførte systematiske søk i PubMed, Embase, African Index Medicus, LILACS og kliniske forsøksregister. Tilfeldig kontrollerte studier, prospektive kohortstudier og retrospektive kohortstudier ble inkludert hvis de sammenlignet forekomsten av tuberkulose basert på antiretroviral behandlingsstatus hos HIV-infiserte voksne i utviklingsland i en median på over 6 måneder. For meta-analysene var det fire kategorie... [1,000 / 2,211 chars]
Rask oppregulering og høyere basal ekspresjon av interferon-induserte gener reduserer overlevelsen av granulære celler i hjernen som er infisert med Vest-Nil-virus. [164 chars]	Selv om neuroners følsomhet for mikrobiell infeksjon i hjernen er en viktig faktor for klinisk utfall, er det lite kjent om de molekylære faktorer som styrer denne sårbarheten. Her viser vi at to typer neuroner fra forskjellige hjerneregioner viste forskjellig tillatelse til replikasjon av flere positive-strengede RNA-virus. Granulær cellene i lillehjernen og kortikale neuroner fra storhjernen har unike medfødte immunprogrammer som gir forskjellig følsomhet for viral infeksjon både ex vivo og in vivo. Ved å introdusere gener i kortikale neuroner som var mer uttrykt i granulær cellene, identifiserte vi tre interferon-stimulerte gener (ISG; Ifi27, Irg1 og Rsad2 (også kjent som Viperin)) som medierte antivirale effekter mot forskjellige nevrotrope virus. Videre fant vi at den epigenetiske tilstanden og miRNA-mediert regulering av ISG korrelerer med forbedret antiviral respons i granulær cellene. Dermed har neuroner fra evolusjonært forskjellige hjerneregioner unike medfødte immunsignature... [1,000 / 1,072 chars]

Source Reference Table

Title	Year	Type	URL
Fact or Fiction: Verifying Scientific Claims	2020	task paper	https://arxiv.org/abs/2004.14974
SciFact repository		project page	https://github.com/allenai/scifact
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models	2021	benchmark paper	https://arxiv.org/abs/2104.08663
MMTEB: Massive Multilingual Text Embedding Benchmark	2025	benchmark paper	https://arxiv.org/abs/2502.13595
NanoBEIR: Smaller BEIR dataset subsets	2024	dataset collection	https://huggingface.co/collections/zeta-alpha-ai/nanobeir

Dataset Information

Field	Value
Nano set	MNanoBEIR
Backing dataset	NanoBEIR-no
Task / split	NanoSciFact
Hugging Face dataset	hakari-bench/NanoBEIR-no
Language	no
Category	natural_language
Queries	50
Documents	2,919
Positive qrels	56
Positives / query avg	1.12
Positives / query min	1
Positives / query median	1.00
Positives / query max	4
Multi-positive queries	4 (8.00%)
Query length avg chars	96.18
Document length avg chars	1,424.51

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.5652	0.7200	0.8571	top-500
Dense	`harrier_oss_v1_270m`	0.6217	0.7600	0.9107	top-500
Reranking hybrid	`reranking_hybrid`	0.6137	0.7600	0.9286	top-100