NanoMedical / NanoSciFact

Overview

NanoMedical / NanoSciFact is an English scientific claim-evidence retrieval task derived from SciFact. Queries are atomic biomedical or scientific claims, and documents are research abstracts that contain evidence supporting or refuting those claims. The original SciFact benchmark was introduced for scientific claim verification, including abstract retrieval, support/refute classification, and rationale selection. This Nano split focuses on the retrieval component: given a claim, retrieve the abstract that contains the evidence. It is useful for evaluating whether retrieval models can connect compact scientific claims to evidence-bearing abstracts while preserving directionality, mechanism, population, and outcome details.

Details

What the Original Data Measures

SciFact measures scientific claim verification. Claims are written from citation sentences (citances) in scientific papers, and abstracts are labeled according to whether they support, refute, or provide no information for the claim. The full task includes rationale identification, but this Nano task isolates the evidence-retrieval step.

The retrieval target is not a general paper about the same topic. A relevant abstract must contain evidence for or against the specific claim.

Observed Data Profile

The Nano split contains 200 queries, 5,183 documents, and 226 positive qrel rows. Queries have 1.13 positives on average, with a median of 1 and a maximum of 5. There are 16 multi-positive queries, or 8.0% of the set. Queries are short, averaging about 90 characters, while documents average 1,499.41 characters.
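
The per-query figures above follow directly from the qrel rows: 226 / 200 = 1.13 positives per query, and 16 / 200 = 8.0% multi-positive queries. A minimal sketch of that computation, assuming the qrels are available as (query_id, doc_id) pairs. The function name and the toy IDs are hypothetical and not part of the dataset tooling.

```python
from collections import Counter
from statistics import mean, median

def positives_profile(qrels):
    """Summarise positives per query from (query_id, doc_id) pairs."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for c in per_query if c > 1)
    return {
        "queries": len(per_query),
        "positive_rows": len(qrels),
        "avg": mean(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive": multi,
        "multi_positive_share": multi / len(per_query),
    }

# Toy qrels (hypothetical IDs): q1 has two positives, q2 and q3 have one each.
toy_qrels = [("q1", "d1"), ("q1", "d7"), ("q2", "d3"), ("q3", "d9")]
profile = positives_profile(toy_qrels)
```

Run on the real qrels, the same function should reproduce the values in the Dataset Information table, provided each qrel row is one positive pair.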

The examples include claims about metastatic colorectal cancer treatment, CRP and CABG mortality, p150 and EB1 interaction, obesity genetics, and febrile seizures. Documents are long passages that concatenate the article title with the abstract, typically covering methods, results, and conclusions.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.7017, hit@10 of 0.8650, and recall@100 of 0.9425. BM25 is strong because claims often include distinctive gene names, clinical terms, interventions, or outcomes that also appear in the evidence abstract.

The main difficulty is not finding the topic but matching the exact evidence relation. Same-topic abstracts may differ in direction, population, mechanism, or outcome. Sparse retrieval can over-rank an abstract that shares terminology but does not support or refute the claim.
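
To make the term-overlap behaviour concrete, here is a minimal BM25 sketch in pure Python. It is an illustration only: the card does not say which BM25 implementation or parameters produced the reported scores, so the whitespace tokenizer, k1 = 1.5, b = 0.75, the IDF variant, and the three toy documents are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Lowercased whitespace tokenisation; real systems use richer analysis.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (assumed k1, b)."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    n_docs = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Non-negative IDF variant (an assumption, as used by Lucene).
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (invented text, not taken from the dataset).
docs = [
    "crp did not predict mortality after cabg surgery",  # evidence-bearing
    "cost of measuring crp before cabg surgery",         # same topic, no mortality evidence
    "febrile seizures in young children",                # unrelated
]
query = "crp is not predictive of mortality after cabg"
scores = bm25_scores(query, docs)
```

In this toy corpus the same-topic cost abstract still receives a positive score purely from shared terms such as crp and cabg, while the unrelated abstract scores zero. With longer real abstracts and more shared terminology, that margin can shrink or invert, which is the over-ranking risk described above.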

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.7334, hit@10 of 0.8800, and recall@100 of 0.9336. Dense retrieval improves top-rank quality and hit rate over BM25, though BM25 has slightly higher recall@100.

This suggests that semantic matching helps rank evidence-bearing abstracts above same-topic negatives. Dense models still need to preserve fine-grained scientific directionality and not collapse all related abstracts together.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 5 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.7506, hit@10 of 0.8850, and recall@100 of 0.9779. It is the strongest of the three profiles on every reported metric, combining BM25's exact terminology coverage with dense retrieval's semantic ranking.

The hybrid result is well suited for downstream reranking and verification. It exposes nearly all positives while placing evidence abstracts early enough for a reranker to inspect.
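
The card does not describe how the hybrid candidate list is built. As one plausible sketch, the code below fuses a BM25 ranking and a dense ranking with reciprocal rank fusion (RRF), truncates to the top k, and appends one missing positive at rank k + 1. RRF, the constant k = 60, the single-positive safeguard rule, and both function names are assumptions made for illustration, not the benchmark's documented procedure.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc IDs."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def candidates_with_safeguard(fused, positives, top_k=100):
    """Keep the top_k fused candidates; if a positive is missing, append one
    at position top_k + 1 so a reranker can still see it (assumed behaviour)."""
    head = fused[:top_k]
    missing = sorted(p for p in positives if p not in head)
    return head + missing[:1]

# Toy rankings with hypothetical doc IDs.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
candidates = candidates_with_safeguard(fused, positives={"d4"}, top_k=2)
```

Under this reading, a safeguard positive sits outside the top 100 and therefore does not count toward recall@100, which is consistent with recall staying below 1.0 for the hybrid subset.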

Metric Interpretation for Model Researchers

Most queries have one positive abstract, so nDCG@10 largely reflects whether the correct evidence abstract is ranked early. Recall@100 matters for evidence verification pipelines because the verifier cannot classify a claim if the evidence abstract is absent.

The hybrid profile beats both single-retriever profiles on every reported metric, which suggests that exact scientific terms and semantic relation matching are complementary.
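
For reference, the three metrics can be sketched as follows. The card does not state how the reported scores were computed or aggregated, so this assumes binary relevance and per-query scores that are then averaged over queries. The function names and toy IDs are hypothetical.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of the query's positives found in the top k."""
    return sum(1 for d in ranked[:k] if d in positives) / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """nDCG with binary gains: DCG of the ranking over the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(ranked[:k])
        if d in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal

# Toy query with a single positive ranked second (hypothetical IDs).
ranked = ["d5", "d2", "d9"]
positives = {"d2"}
ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
```

With a single positive, nDCG@10 reduces to 1 / log2(rank + 1) when the positive is in the top 10 and 0 otherwise, which is why it tracks the rank of the correct evidence abstract so closely on this split.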

Query and Relevance Type Tendencies

Queries are compact scientific claims with biomedical entities, mechanisms, interventions, associations, or clinical outcomes. Relevant documents are abstracts containing evidence for or against the claim.

The relevance relation is evidence sufficiency. A related paper is not enough unless it contains evidence bearing on the claim.

Representative Failure Modes

Common failures include retrieving same-topic abstracts with the wrong outcome, missing negation or directionality, confusing related genes or pathways, and ignoring population or experimental condition differences. Sparse systems over-match terminology; dense systems may over-match broad semantic similarity.

Training Data That May Help

Useful training data includes non-overlapping scientific claim-evidence pairs, biomedical citation-to-abstract retrieval data, SciFact-style rationale and verification data outside the evaluation split, and same-topic biomedical hard negatives. Training should exclude SciFact evaluation claims, positive abstracts, and near-duplicate claims derived from the same citances.
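
One simple way to approximate the near-duplicate exclusion is a token-overlap filter between candidate training claims and evaluation claims. This is a sketch under stated assumptions: token-set Jaccard similarity and the 0.8 threshold are illustrative choices, the function names are hypothetical, and a real audit would also need to compare positive abstracts and source citances.

```python
import re

def token_set(text):
    # Lowercase alphanumeric tokens, so punctuation does not affect overlap.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between the token sets of two strings."""
    tokens_a, tokens_b = token_set(a), token_set(b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

def filter_training_claims(train_claims, eval_claims, threshold=0.8):
    """Drop training claims whose similarity to any eval claim reaches the threshold."""
    return [
        claim
        for claim in train_claims
        if all(jaccard(claim, ev) < threshold for ev in eval_claims)
    ]

eval_claims = [
    "CRP is not predictive of postoperative mortality following CABG surgery.",
]
train_claims = [
    "CRP is not predictive of mortality following CABG surgery.",  # near-duplicate
    "Obesity is partly determined by genetic factors.",            # unrelated
]
kept = filter_training_claims(train_claims, eval_claims)
```

Token overlap will miss paraphrased duplicates, so it is best treated as a first pass rather than a full leakage audit.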

Model Improvement Notes

Models should learn the evidence relation, not just the topic. Hard negatives should share disease, gene, intervention, or method terms while changing direction, condition, population, or outcome. Rerankers should be trained with support/refute-aware examples even if the retrieval metric itself is label-agnostic.
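
A common starting point for such hard negatives is to take the highest-ranked candidates from a first-stage retriever that are not labeled positive, since they tend to share terminology with the claim. The sketch below assumes a ranked list of doc IDs per training claim. The function name, the skip_top parameter, and the toy IDs are hypothetical.

```python
def mine_hard_negatives(ranked, positives, n=3, skip_top=0):
    """Return up to n top-ranked candidates that are not labeled positive.

    skip_top ignores the first few ranks, a hedge against unlabeled
    positives that often sit at the very top of the list.
    """
    negatives = [d for d in ranked[skip_top:] if d not in positives]
    return negatives[:n]

# Toy ranking for one training claim (hypothetical IDs).
ranked = ["d2", "d8", "d1", "d5"]
hard_negatives = mine_hard_negatives(ranked, positives={"d1"}, n=2)
```

Rank-based mining only guarantees topical similarity. Checking that a negative actually differs in direction, condition, population, or outcome still requires labels or manual review.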

Example Data

Query	Positive document
Metastatic colorectal cancer treated with a single agent fluoropyrimidines resulted in reduced efficacy and lower quality of life when compared with oxaliplatin-based chemotherapy in elderly patients. [200 chars]	Chemotherapy options in elderly and frail patients with metastatic colorectal cancer (MRC FOCUS2): an open-label, randomised factorial trial BACKGROUND Elderly and frail patients with cancer, although often treated with chemotherapy, are under-represented in clinical trials. We designed FOCUS2 to investigate reduced-dose chemotherapy options and to seek objective predictors of outcome in frail patients with advanced colorectal cancer. METHODS We undertook an open, 2 × 2 factorial trial in 61 UK centres for patients with previously untreated advanced colorectal cancer who were considered unfit for full-dose chemotherapy. After comprehensive health assessment (CHA), patients were randomly assigned by minimisation to: 48-h intravenous fluorouracil with levofolinate (group A); oxaliplatin and fluorouracil (group B); capecitabine (group C); or oxaliplatin and capecitabine (group D). Treatment allocation was not masked. Starting doses were 80% of standard doses, with discretionary escalation... [1,000 / 3,063 chars]
CRP is not predictive of postoperative mortality following Coronary Artery Bypass Graft (CABG) surgery. [103 chars]	Assessing the cost effectiveness of using prognostic biomarkers with decision models: case study in prioritising patients waiting for coronary artery surgery OBJECTIVE To determine the effectiveness and cost effectiveness of using information from circulating biomarkers to inform the prioritisation process of patients with stable angina awaiting coronary artery bypass graft surgery. DESIGN Decision analytical model comparing four prioritisation strategies without biomarkers (no formal prioritisation, two urgency scores, and a risk score) and three strategies based on a risk score using biomarkers: a routinely assessed biomarker (estimated glomerular filtration rate), a novel biomarker (C reactive protein), or both. The order in which to perform coronary artery bypass grafting in a cohort of patients was determined by each prioritisation strategy, and mean lifetime costs and quality adjusted life years (QALYs) were compared. DATA SOURCES Swedish Coronary Angiography and Angioplasty Regi... [1,000 / 2,937 chars]
Arginine 90 in p150n is important for interaction with EB1. [59 chars]	Structural basis for the activation of microtubule assembly by the EB1 and p150Glued complex. Plus-end tracking proteins, such as EB1 and the dynein/dynactin complex, regulate microtubule dynamics. These proteins are thought to stabilize microtubules by forming a plus-end complex at microtubule growing ends with ill-defined mechanisms. Here we report the crystal structure of two plus-end complex components, the carboxy-terminal dimerization domain of EB1 and the microtubule binding (CAP-Gly) domain of the dynactin subunit p150Glued. Each molecule of the EB1 dimer contains two helices forming a conserved four-helix bundle, while also providing p150Glued binding sites in its flexible tail region. Combining crystallography, NMR, and mutational analyses, our studies reveal the critical interacting elements of both EB1 and p150Glued, whose mutation alters microtubule polymerization activity. Moreover, removal of the key flexible tail from EB1 activates microtubule assembly by EB1 alone, sug... [1,000 / 1,198 chars]

Source Reference Table

Title	Year	Type	URL
Fact or Fiction: Verifying Scientific Claims	2020	arXiv paper	https://arxiv.org/abs/2004.14974
Fact or Fiction: Verifying Scientific Claims	2020	ACL Anthology paper	https://aclanthology.org/2020.emnlp-main.609/

Dataset Information

Field	Value
Nano set	NanoMedical
Backing dataset	NanoMedical
Task / split	NanoSciFact
Hugging Face dataset	hakari-bench/NanoMedical
Language	en
Category	natural_language
Queries	200
Documents	5,183
Positive qrels	226
Positives / query avg	1.13
Positives / query min	1
Positives / query median	1.00
Positives / query max	5
Multi-positive queries	16 (8.00%)
Query length avg chars	90.06
Document length avg chars	1,499.41

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.7017	0.8650	0.9425	top-500
Dense	`harrier_oss_v1_270m`	0.7334	0.8800	0.9336	top-500
Reranking hybrid	`reranking_hybrid`	0.7506	0.8850	0.9779	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: SciFact evidence retrieval split sampled into NanoMedical
Train/eval overlap audit: not_audited
Leakage note: exclude SciFact evaluation claims, positive abstracts, and near-duplicate claims derived from the same source citances
Multi-positive training: mostly single-positive with limited multi-positive support
Useful training data: non-overlapping scientific claim-evidence pairs, biomedical citation-to-abstract retrieval data, SciFact-style rationale and verification data outside the evaluation split, same-topic biomedical hard negatives