HAKARI-Bench

NanoMedical / NanoSciFact

Overview

NanoMedical / NanoSciFact is an English scientific claim-evidence retrieval task derived from SciFact. Queries are atomic biomedical or scientific claims, and documents are research abstracts that contain evidence supporting or refuting those claims. The original SciFact benchmark was introduced for scientific claim verification, including abstract retrieval, support/refute classification, and rationale selection. This Nano split focuses on the retrieval component: given a claim, retrieve the abstract or abstracts that contain the evidence. It is useful for evaluating whether retrieval models can connect compact scientific claims to evidence-bearing abstracts while preserving directionality, mechanism, population, and outcome details.

Details

What the Original Data Measures

SciFact measures scientific claim verification. Claims are written from citation sentences (citances) in scientific papers, and abstracts are labeled according to whether they support the claim, refute it, or provide no information about it. The full task includes rationale identification, but this Nano task isolates the evidence-retrieval step.

The retrieval target is not a general paper about the same topic. A relevant abstract must contain evidence for or against the specific claim.

Observed Data Profile

The Nano split contains 200 queries, 5,183 documents, and 226 positive qrel rows. Queries have 1.13 positives on average, with a median of 1 and a maximum of 5. There are 16 multi-positive queries, or 8.0% of the set. Queries average roughly 90 characters, while documents average roughly 1,500 characters.
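The per-query statistics above can be recomputed from any qrels mapping. The following is a minimal sketch, assuming qrels are held as a dictionary from query id to a set of positive document ids; the function name `qrel_profile` and the toy data are hypothetical and not part of the dataset tooling.

```python
from statistics import mean, median


def qrel_profile(qrels):
    """Summarize positives per query from {query_id: set_of_positive_doc_ids}."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_rows": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_share": multi / len(counts),
    }


# Toy qrels for illustration only; these ids are not taken from the dataset.
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}, "q3": {"d4"}, "q4": {"d5"}}
profile = qrel_profile(toy_qrels)
```

Applied to the reported counts, the same arithmetic gives 226 / 200 = 1.13 positives per query and 16 / 200 = 8.0% multi-positive queries.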

The examples include claims about metastatic colorectal cancer treatment, CRP and CABG mortality, the p150-EB1 interaction, obesity genetics, and febrile seizures. Each document is a long passage that joins an article title with its abstract, typically covering methods, results, and conclusions.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.7017, hit@10 of 0.8650, and recall@100 of 0.9425. BM25 is strong because claims often include distinctive gene names, clinical terms, interventions, or outcomes that also appear in the evidence abstract.

The main difficulty is not finding the topic but matching the exact evidence relation. Same-topic abstracts may differ in direction, population, mechanism, or outcome. Sparse retrieval can over-rank an abstract that shares terminology but does not support or refute the claim.
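A toy illustration of this over-matching, using a plain term-overlap score as a stand-in for BM25 (it is not BM25 itself). The claim and both abstracts are invented for the example, and the helper names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphanumeric tokens as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim, abstract):
    """Fraction of claim tokens that also occur in the abstract."""
    claim_tokens = tokens(claim)
    return len(claim_tokens & tokens(abstract)) / len(claim_tokens)


# Invented texts: one abstract bears on the claim, the other only shares its vocabulary.
claim = "Drug X reduces mortality in elderly patients"
supports = "In elderly patients, drug X reduces mortality compared with placebo."
off_target = "In elderly patients, drug X reduces nausea; mortality was not measured."

score_supports = overlap(claim, supports)
score_off_target = overlap(claim, off_target)
```

Both abstracts cover every claim term, so a purely lexical score cannot separate them, even though only the first contains evidence about the claimed outcome.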

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.7334, hit@10 of 0.8800, and recall@100 of 0.9336. Dense retrieval improves top-rank quality and hit rate over BM25, though BM25 has slightly higher recall@100.

This suggests that semantic matching helps rank evidence-bearing abstracts above same-topic negatives. Dense models still need to preserve fine-grained scientific directionality and not collapse all related abstracts together.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 5 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.7506, hit@10 of 0.8850, and recall@100 of 0.9779. This is the strongest of the three profiles on every reported metric, combining BM25's exact terminology coverage with dense retrieval's semantic ranking.
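The source does not state how the hybrid candidate list is built. As one plausible sketch, the code below fuses a sparse and a dense ranking with reciprocal rank fusion (RRF), truncates to a fixed depth, and appends one positive at rank depth + 1 when no positive survives the cut. The use of RRF, the constant `k=60`, the safeguard condition, and both function names are assumptions made for illustration.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def candidates_with_safeguard(fused, positives, depth=100):
    """Keep the top `depth` docs; if no positive is among them, append one after the cut.

    Assumption: the safeguard fires only when the query has no positive in the top list.
    """
    top = fused[:depth]
    if positives and not (set(top) & positives):
        top.append(sorted(positives)[0])
    return top


# Toy rankings for illustration only.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = rrf_fuse([bm25_ranking, dense_ranking])

covered = candidates_with_safeguard(fused, {"d3"}, depth=2)
guarded = candidates_with_safeguard(fused, {"d9"}, depth=2)
```

In the toy run, `covered` keeps only the two fused candidates because a positive is already present, while `guarded` gains a third entry for the otherwise missing positive.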

The hybrid result is well suited for downstream reranking and verification. It exposes nearly all positives while placing evidence abstracts early enough for a reranker to inspect.

Metric Interpretation for Model Researchers

Most queries have one positive abstract, so nDCG@10 largely reflects whether the correct evidence abstract is ranked early. Recall@100 matters for evidence verification pipelines because the verifier cannot classify a claim if the evidence abstract is absent.
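A minimal per-query sketch of the three reported metrics with binary relevance. The function names are hypothetical, and the benchmark's own evaluation code may differ in details such as tie handling or averaging.

```python
import math


def dcg_at_k(ranked, positives, k):
    """Binary-gain DCG over the first k ranked doc ids."""
    return sum(
        1.0 / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in positives
    )


def ndcg_at_k(ranked, positives, k=10):
    """DCG divided by the ideal DCG for this query's number of positives."""
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(positives), k) + 1))
    return dcg_at_k(ranked, positives, k) / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the first k results, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def recall_at_k(ranked, positives, k=100):
    """Share of this query's positives found in the first k results."""
    return len(set(ranked[:k]) & positives) / len(positives) if positives else 0.0


# Toy query with a single positive ranked second.
ranked = ["d7", "d2", "d9"]
positives = {"d2"}
ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
```

For a single-positive query, nDCG@10 reduces to 1 / log2(rank + 1), so a positive at rank 2 scores about 0.63 and one at rank 10 about 0.29. As a consistency check, and assuming recall is micro-averaged over qrel rows (an inference, not something the source states), five rank-101 positives out of 226 would leave 221 / 226 ≈ 0.9779 inside the top 100, which matches the reported hybrid recall@100.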

The strong hybrid profile suggests that exact scientific terms and semantic relation matching are both valuable.

Query and Relevance Type Tendencies

Queries are compact scientific claims with biomedical entities, mechanisms, interventions, associations, or clinical outcomes. Relevant documents are abstracts containing evidence for or against the claim.

The relevance relation is evidence sufficiency. A related paper is not enough unless it contains evidence bearing on the claim.

Representative Failure Modes

Common failures include retrieving same-topic abstracts with the wrong outcome, missing negation or directionality, confusing related genes or pathways, and ignoring population or experimental condition differences. Sparse systems over-match terminology; dense systems may over-match broad semantic similarity.

Training Data That May Help

Useful training data includes non-overlapping scientific claim-evidence pairs, biomedical citation-to-abstract retrieval data, SciFact-style rationale and verification data outside the evaluation split, and same-topic biomedical hard negatives. Training should exclude SciFact evaluation claims, positive abstracts, and near-duplicate claims derived from the same citances.
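One way to apply these exclusions is a filter over candidate training pairs. This is a minimal sketch, assuming training data arrives as (claim, doc_id) pairs; the function names, the token-level Jaccard test for near-duplicates, and the 0.8 threshold are illustrative assumptions, not a prescribed procedure.

```python
import re


def normalize(text):
    """Lowercase and keep alphanumeric tokens, joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Token-set Jaccard similarity between two normalized strings."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def filter_training_pairs(train_pairs, eval_claims, eval_positive_doc_ids, threshold=0.8):
    """Drop pairs whose claim nearly matches an evaluation claim
    or whose document is an evaluation positive."""
    eval_norm = [normalize(c) for c in eval_claims]
    kept = []
    for claim, doc_id in train_pairs:
        norm = normalize(claim)
        near_dup = any(jaccard(norm, e) >= threshold for e in eval_norm)
        if near_dup or doc_id in eval_positive_doc_ids:
            continue
        kept.append((claim, doc_id))
    return kept


# Toy data for illustration only.
eval_claims = ["CRP is not predictive of postoperative mortality following CABG surgery."]
eval_positive_doc_ids = {"e7"}
train_pairs = [
    ("CRP is not predictive of postoperative mortality following CABG surgery", "t1"),
    ("Statins lower LDL cholesterol.", "e7"),
    ("Aspirin reduces fever in adults.", "t3"),
]
kept = filter_training_pairs(train_pairs, eval_claims, eval_positive_doc_ids)
```

Token overlap will miss paraphrased claims derived from the same citance, so a filter like this is a lower bound on leakage control rather than a guarantee.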

Model Improvement Notes

Models should learn evidence relation, not just topic. Hard negatives should share disease, gene, intervention, or method terms while changing direction, condition, population, or outcome. Rerankers should be trained with support/refute-aware examples even if the retrieval metric itself is label-agnostic.
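A common way to obtain same-topic hard negatives is to take highly ranked retrieved documents that are not labeled positive. The sketch below assumes a ranked candidate list is already available for each claim; `build_triples` and its parameters are hypothetical names chosen for this example.

```python
def build_triples(claim, ranked_doc_ids, positives, n_neg=2, skip_top=0):
    """Pair each positive with the highest-ranked non-positive candidates.

    `skip_top` optionally ignores the first few ranks, which are the most
    likely to hide unlabeled positives.
    """
    negatives = [d for d in ranked_doc_ids[skip_top:] if d not in positives][:n_neg]
    return [(claim, pos, neg) for pos in sorted(positives) for neg in negatives]


# Toy ranking for illustration only.
triples = build_triples(
    claim="example claim",
    ranked_doc_ids=["d3", "d1", "d8", "d5"],
    positives={"d1"},
)
```

Because unlabeled abstracts can still contain evidence for the claim, mined negatives are best treated as noisy. Rank-based mining also does not guarantee that a negative differs in direction, condition, population, or outcome, so curated contrastive examples remain useful.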

Example Data

| Query | Positive document |
| --- | --- |
| Metastatic colorectal cancer treated with a single agent fluoropyrimidines resulted in reduced efficacy and lower quality of life when compared with oxaliplatin-based chemotherapy in elderly patients. [200 chars] | Chemotherapy options in elderly and frail patients with metastatic colorectal cancer (MRC FOCUS2): an open-label, randomised factorial trial BACKGROUND Elderly and frail patients with cancer, although often treated with chemotherapy, are under-represented in clinical trials. We designed FOCUS2 to investigate reduced-dose chemotherapy options and to seek objective predictors of outcome in frail patients with advanced colorectal cancer. METHODS We undertook an open, 2 × 2 factorial trial in 61 UK centres for patients with previously untreated advanced colorectal cancer who were considered unfit for full-dose chemotherapy. After comprehensive health assessment (CHA), patients were randomly assigned by minimisation to: 48-h intravenous fluorouracil with levofolinate (group A); oxaliplatin and fluorouracil (group B); capecitabine (group C); or oxaliplatin and capecitabine (group D). Treatment allocation was not masked. Starting doses were 80% of standard doses, with discretionary escalation... [1,000 / 3,063 chars] |
| CRP is not predictive of postoperative mortality following Coronary Artery Bypass Graft (CABG) surgery. [103 chars] | Assessing the cost effectiveness of using prognostic biomarkers with decision models: case study in prioritising patients waiting for coronary artery surgery OBJECTIVE To determine the effectiveness and cost effectiveness of using information from circulating biomarkers to inform the prioritisation process of patients with stable angina awaiting coronary artery bypass graft surgery. DESIGN Decision analytical model comparing four prioritisation strategies without biomarkers (no formal prioritisation, two urgency scores, and a risk score) and three strategies based on a risk score using biomarkers: a routinely assessed biomarker (estimated glomerular filtration rate), a novel biomarker (C reactive protein), or both. The order in which to perform coronary artery bypass grafting in a cohort of patients was determined by each prioritisation strategy, and mean lifetime costs and quality adjusted life years (QALYs) were compared. DATA SOURCES Swedish Coronary Angiography and Angioplasty Regi... [1,000 / 2,937 chars] |
| Arginine 90 in p150n is important for interaction with EB1. [59 chars] | Structural basis for the activation of microtubule assembly by the EB1 and p150Glued complex. Plus-end tracking proteins, such as EB1 and the dynein/dynactin complex, regulate microtubule dynamics. These proteins are thought to stabilize microtubules by forming a plus-end complex at microtubule growing ends with ill-defined mechanisms. Here we report the crystal structure of two plus-end complex components, the carboxy-terminal dimerization domain of EB1 and the microtubule binding (CAP-Gly) domain of the dynactin subunit p150Glued. Each molecule of the EB1 dimer contains two helices forming a conserved four-helix bundle, while also providing p150Glued binding sites in its flexible tail region. Combining crystallography, NMR, and mutational analyses, our studies reveal the critical interacting elements of both EB1 and p150Glued, whose mutation alters microtubule polymerization activity. Moreover, removal of the key flexible tail from EB1 activates microtubule assembly by EB1 alone, sug... [1,000 / 1,198 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| Fact or Fiction: Verifying Scientific Claims | 2020 | arXiv paper | https://arxiv.org/abs/2004.14974 |
| Fact or Fiction: Verifying Scientific Claims | 2020 | ACL Anthology paper | https://aclanthology.org/2020.emnlp-main.609/ |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoMedical |
| Backing dataset | NanoMedical |
| Task / split | NanoSciFact |
| Hugging Face dataset | hakari-bench/NanoMedical |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 5,183 |
| Positive qrels | 226 |
| Positives / query avg | 1.13 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 5 |
| Multi-positive queries | 16 (8.00%) |
| Query length avg chars | 90.06 |
| Document length avg chars | 1,499.41 |

Candidate Subsets

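| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.7017 | 0.8650 | 0.9425 | top-500 |
| Dense | harrier_oss_v1_270m | 0.7334 | 0.8800 | 0.9336 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.7506 | 0.8850 | 0.9779 | top-100 |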

Training and Leakage Metadata