HAKARI-Bench

NanoIFIR / NanoIFIRNFCorpus

Overview

NanoIFIRNFCorpus is an English medical and nutrition literature retrieval task in NanoIFIR. The queries are layperson-style health and nutrition topics, and the documents are medical research article titles and abstracts.

This task evaluates retrieval across the lay-to-biomedical vocabulary gap. The user-facing query may use accessible health language, while relevant documents use PubMed-style scientific terminology, mechanisms, exposures, endpoints, and trial language.

Details

What the Original Data Measures

IFIR uses NFCorpus in a health-related expert retrieval setting, where the goal is to retrieve scientific literature tailored to a research or information need.

NFCorpus is a medical learning-to-rank dataset built from NutritionFacts.org pages written in lay English and linked to PubMed or PMC research articles. The original task emphasizes the lexical gap between consumer health topics and biomedical literature, with graded or multi-positive relevance derived from links between health content and scientific articles.

Observed Data Profile

This Nano split contains 86 queries, 3,593 documents, and 242 positive qrels. Queries have 2.81 positives on average, with a minimum of 1, a median of 3.0, and a maximum of 8. There are 64 multi-positive queries, or 74.42% of the split. Queries average 37.84 characters, and documents average 1,589.52 characters.
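The positives-per-query figures above can be recomputed from any qrels mapping. The following is a minimal sketch, assuming qrels are held as a dictionary from query id to a set of positive document ids; the `positives_profile` helper and the toy qrels are hypothetical and not taken from the split. The last two lines only cross-check the reported average and multi-positive share from the counts quoted above.

```python
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query for a {query_id: set(doc_id)} mapping."""
    counts = [len(doc_ids) for doc_ids in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive_share": multi / len(counts),
    }


# Toy, made-up qrels purely for illustration (not data from the split).
toy_qrels = {
    "q1": {"d1"},
    "q2": {"d2", "d3", "d4"},
    "q3": {"d5", "d6"},
    "q4": {"d7", "d8", "d9", "d10"},
}
profile = positives_profile(toy_qrels)

# Cross-check of the reported aggregates from the quoted counts.
reported_avg = round(242 / 86, 2)            # positives per query
reported_multi_pct = round(100 * 64 / 86, 2)  # multi-positive share in percent
```

With the quoted counts, `reported_avg` and `reported_multi_pct` reproduce the 2.81 and 74.42% stated above.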

Observed queries include topics such as curcumin safety, ulcerative colitis prevention with diet, autophagy and longevity, the Swank diet for multiple sclerosis, and curcumin bioavailability. Documents are PubMed-like titles and abstracts.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.3338, hit@10 of 0.6628, and recall@100 of 0.6488 with a top-500 candidate pool. Lexical matching is useful when consumer and biomedical wording overlaps, such as curcumin, ulcerative colitis, multiple sclerosis, or diet.

The task remains difficult for BM25 because many relevant abstracts use technical terms that differ from the lay query. A consumer phrase such as "live longer" may correspond to mechanisms involving mTOR, autophagy, aging, or nutrient signaling rather than exact phrase overlap.
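The vocabulary gap can be made concrete with a small BM25 scorer. This is a sketch under stated assumptions: whitespace tokenization, the common non-negative IDF variant, default-style `k1` and `b` values, and two made-up toy documents. None of these are confirmed as the benchmark's actual BM25 configuration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stopwords.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            # Non-negative IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents written for this example (not corpus entries).
docs = [
    "mtor signaling couples nutrient abundance with cell growth and ageing",
    "tips to live longer with daily walking",
]
scores = bm25_scores("exploiting autophagy to live longer", docs)
```

In this toy setup the mechanistic document shares no term with the query and scores zero, while the lay-language document receives a positive score from "to", "live", and "longer". That is the failure pattern described above.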

Dense Evaluation Profile

The dense harrier-oss-270m profile reaches nDCG@10 of 0.4580, hit@10 of 0.7326, and recall@100 of 0.8306. It is the strongest of the three profiles on every reported metric, ahead of both BM25 and the reranking hybrid subset.

This shows that embedding similarity helps bridge lay health wording and scientific abstracts. Dense retrieval can connect consumer topics to biomedical mechanisms, trial endpoints, disease categories, and molecular terminology that are not obvious lexical matches.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.4108, hit@10 of 0.7209, and recall@100 of 0.7975. It uses 100 candidates per query, with nine rank-101 safeguard positives.

Hybrid retrieval is strong but below dense retrieval. Lexical anchors help for named nutrients or diseases, but dense semantic matching is more important overall. The hybrid pool remains useful for reranking because it combines term precision with broad biomedical semantic coverage.
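The card does not describe how the hybrid pool is built or what the safeguard does. The sketch below is one plausible reading, with every detail an assumption: reciprocal rank fusion (RRF) stands in for the unspecified fusion method, the pool is cut at a fixed size, and "rank-101 safeguard positives" is read as positives that fell outside the pool and were appended after the last candidate. The function names `rrf_fuse` and `build_pool` are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_pool(bm25_ranking, dense_ranking, positives, pool_size=100):
    """Return (pool, safeguards) for one query.

    Assumption: positives missing from the fused top `pool_size` are
    appended after the last candidate so a reranker can still see them.
    """
    fused = rrf_fuse([bm25_ranking, dense_ranking])[:pool_size]
    in_pool = set(fused)
    safeguards = sorted(p for p in positives if p not in in_pool)
    return fused + safeguards, safeguards


# Toy rankings with a pool size of 3 to keep the example readable.
pool, safeguards = build_pool(
    bm25_ranking=["a", "b", "c", "d"],
    dense_ranking=["c", "e", "a", "f"],
    positives={"c", "f"},
    pool_size=3,
)
```

Here the positive "f" falls outside the fused top 3 and is appended as a safeguard. A pool built this way guarantees that every positive is available to the reranker, which is worth keeping in mind when comparing reranked scores across queries.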

Metric Interpretation for Model Researchers

NanoIFIRNFCorpus is a dense-favored health literature retrieval task. The main benchmark signal is whether a model can bridge lay topics to biomedical evidence. BM25 is a useful baseline, but dense retrieval covers far more of the relevant documents: recall@100 rises from 0.6488 to 0.8306, and nDCG@10 from 0.3338 to 0.4580.

Because most queries have multiple positives, recall@100 matters. nDCG@10 measures whether the model ranks scientifically useful abstracts early enough for evidence review.
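For reference, the three reported metrics can be computed per query as follows. This is a minimal sketch assuming binary relevance (a document is either a positive or not) and the standard log2 discount; the card does not state whether the benchmark uses graded gains, and the function names and toy ranking are illustrative.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of the query's positives found in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy query: three positives, two of them retrieved at ranks 2 and 4.
ranked = ["d3", "d1", "d9", "d2"]
positives = {"d1", "d2", "d7"}
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

The toy query shows why the two metrics answer different questions: recall only counts that two of three positives were retrieved, while nDCG also penalizes their placement at ranks 2 and 4 instead of ranks 1 and 2. Benchmark-level numbers are the mean of these per-query values.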

Query and Relevance Type Tendencies

Queries are short health or nutrition titles written in accessible language. Documents are longer biomedical abstracts with scientific terminology, methods, results, and mechanistic claims.

The relevance relation is scientific support for the health topic. A relevant abstract may discuss a nutrient, disease, mechanism, trial, or epidemiological association related to the query.

Representative Failure Modes

BM25 may miss relevant abstracts that use different scientific terms from the lay query. Dense retrieval may retrieve medically adjacent abstracts that do not actually address the consumer health topic. Hybrid retrieval can still over-rank articles that share a nutrient or disease but answer a different mechanistic question.

Short queries also create ambiguity. A title phrased as a broad health claim may map to several mechanisms, interventions, or outcomes.

Training Data That May Help

Useful training data includes non-overlapping NFCorpus train pairs, PubMed abstract retrieval pairs, consumer-health to biomedical query rewriting, and same-topic biomedical hard negatives.

Training should preserve multiple relevant medical abstracts where available and exclude NanoIFIRNFCorpus queries, qrels, and positive PubMed or PMC abstracts.
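The exclusion rule can be applied as a simple filter over candidate training pairs. This is a sketch, assuming each pair is a dictionary with hypothetical `query` and `doc_id` fields and that queries are compared after lowercasing and whitespace normalization. Real pipelines may also need near-duplicate detection on the abstract text, which this example does not attempt.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def filter_training_pairs(train_pairs, eval_queries, eval_positive_doc_ids):
    """Drop pairs whose query or document overlaps the evaluation split."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = set(eval_positive_doc_ids)
    return [
        pair
        for pair in train_pairs
        if normalize(pair["query"]) not in blocked_queries
        and pair["doc_id"] not in blocked_docs
    ]


# Toy pairs with made-up ids; field names are assumptions for illustration.
train_pairs = [
    {"query": "Preventing Ulcerative  Colitis with diet", "doc_id": "MED-1"},
    {"query": "Benefits of flaxseed", "doc_id": "MED-2"},
    {"query": "Vitamin D and bone health", "doc_id": "MED-3"},
]
kept = filter_training_pairs(
    train_pairs,
    eval_queries=["Preventing Ulcerative Colitis with Diet"],
    eval_positive_doc_ids={"MED-2"},
)
```

Only the third pair survives: the first matches an evaluation query after normalization, and the second points at an evaluation positive. Other positives for a retained training query are left untouched, so multi-positive supervision is preserved.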

Model Improvement Notes

Improving this task requires biomedical semantic matching and lay-language query understanding. Models should represent nutrients, diseases, mechanisms, clinical outcomes, and study designs, while also handling consumer phrasing.

For reranking, the model should verify that the abstract scientifically addresses the query topic, not merely that it mentions a related nutrient or disease.

Example Data

| Query | Positive document |
| --- | --- |
| Who Should be Careful About Curcumin? [37 chars] | Curcumin as "Curecumin": from kitchen to clinic. Although turmeric (Curcuma longa; an Indian spice) has been described in Ayurveda, as a treatment for inflammatory diseases and is referred by different names in different cultures, the active principle called curcumin or diferuloylmethane, a yellow pigment present in turmeric (curry powder) has been shown to exhibit numerous activities. Extensive research over the last half century has revealed several important functions of curcumin. It binds to a variety of proteins and inhibits the activity of various kinases. By modulating the activation of various transcription factors, curcumin regulates the expression of inflammatory enzymes, cytokines, adhesion molecules, and cell survival proteins. Curcumin also downregulates cyclin D1, cyclin E and MDM2; and upregulates p21, p27, and p53. Various preclinical cell culture and animal studies suggest that curcumin has potential as an antiproliferative, anti-invasive, and antiangiogenic agent; as... [1,000 / 1,773 chars] |
| Preventing Ulcerative Colitis with Diet [39 chars] | A diet high in fat and meat but low in dietary fibre increases the genotoxic potential of 'faecal water'. To determine the effects of different diets on the genotoxicity of human faecal water, a diet rich in fat, meat and sugar but poor in vegetables and free of wholemeal products (diet 1) was consumed by seven healthy volunteers over a period of 12 days. One week after the end of this period, the volunteers started to consume a diet enriched with vegetables and wholemeal products but poor in fat and meat (diet 2) over a second period of 12 days. The genotoxic effect of faecal waters obtained after both diets was assessed with the single cell gel electrophoresis (Comet assay) using the human colon adenocarcinoma cell line HT29 clone 19a as a target. The fluorescence and length of the tails of the comet images reflects the degree of DNA damage in single cells. The mean DNA damage, expressed as the ratio of tail intensity (fluorescence in the tail) to total intensity of the comet after i... [1,000 / 1,604 chars] |
| Exploiting Autophagy to Live Longer [35 chars] | mTOR: from growth signal integration to cancer, diabetes and ageing Preface In all eukaryotes, the target of rapamycin (TOR) signaling pathway couples energy and nutrient abundance to the execution of cell growth and division, owing to the ability of TOR protein kinase to simultaneously sense energy, nutrients and stress, and, in metazoan, growth factors. Mammalian TOR complexes 1 and 2 (mTORC1 and mTORC2) exert their actions by regulating other important kinases, such as S6K and Akt. In the last few years, a significant advance in our understanding of the regulation and functions of mTOR has revealed its critical involvement in the onset and progression of diabetes, cancer and ageing. [694 chars] |

Source Reference Table

| Source | Role |
| --- | --- |
| IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval | Expert-domain instruction-following IR benchmark paper. |
| NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Original NFCorpus medical retrieval paper. |
| NFCorpus project page | Original dataset project page. |
| hakari-bench/NanoIFIR | Nano benchmark dataset containing this split. |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoIFIR |
| Backing dataset | NanoIFIR |
| Task / split | NanoIFIRNFCorpus |
| Hugging Face dataset | hakari-bench/NanoIFIR |
| Language | en |
| Category | natural_language |
| Queries | 86 |
| Documents | 3,593 |
| Positive qrels | 242 |
| Positives / query avg | 2.81 |
| Positives / query min | 1 |
| Positives / query median | 3.00 |
| Positives / query max | 8 |
| Multi-positive queries | 64 (74.42%) |
| Query length avg chars | 37.84 |
| Document length avg chars | 1,589.52 |

Candidate Subsets

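| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.3338 | 0.6628 | 0.6488 | top-500 |
| Dense | harrier_oss_v1_270m | 0.4580 | 0.7326 | 0.8306 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4108 | 0.7209 | 0.7975 | top-100 |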

Training and Leakage Metadata