NanoMedical / NanoCUREv1

Overview

NanoMedical / NanoCUREv1 is an English clinical passage retrieval task derived from CURE, a benchmark for clinical understanding and retrieval evaluation. Queries are healthcare-provider-style clinical questions, and relevant documents are biomedical article passages that contain evidence for diagnosis, treatment, contraindication, measurement, or clinical implications. The original CURE dataset was designed for point-of-care retrieval across multiple medical domains, including dentistry, dermatology, gastroenterology, genetics, neurology, orthopedics, otorhinolaryngology, plastic surgery, psychiatry, pulmonology, and related specialties. This Nano split is strongly multi-positive, making it a useful test of both evidence ranking and broad clinical candidate coverage.

Details

What the Original Data Measures

CURE measures clinical retrieval for healthcare providers. The original benchmark contains expert-written queries and article-derived passages, with relevance labels for passages that answer or partially address the clinical information need. Unlike general biomedical search, CURE is oriented toward practical clinical questions, including treatment choice, surgical technique, contraindications, diagnosis, and specialty-specific evidence.

The source corpus is drawn from biomedical articles, so documents usually contain article titles followed by evidence-bearing passages. The task requires linking a concise clinical question to the relevant passage content, not only to a shared medical topic.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 5,181 positive qrel rows. Queries have 25.905 positives on average, with a median of 18 and a maximum of 100. There are 171 multi-positive queries, or 85.5% of the query set. Queries average 75.89 characters, while documents average 604.21 characters.

The examples include intermaxillary fixation screw placement, 3D printed splints in orthognathic surgery, endoscopic treatment of massive arterial epistaxis, temporomandibular joint symptoms, and tooth whitening compounds. Many documents contain specialist terminology, abbreviations, study-design language, and clinical outcome statements.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.4693, hit@10 of 0.9000, and recall@100 of 0.5314. BM25 is useful because clinical questions often repeat procedure names, anatomy, abbreviations, and treatment terms. It can usually find at least one relevant passage for many queries.

The main limitation is fine-grained clinical relevance. Shared medical terminology does not guarantee that a passage answers the exact clinical question. Abbreviations can be ambiguous, and same-topic passages may differ by indication, contraindication, patient population, outcome, or procedure step.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.5003, hit@10 of 0.8700, and recall@100 of 0.5862. Dense retrieval improves nDCG@10 and recall@100 over BM25, but BM25 has a slightly higher hit@10. This indicates complementary strengths: dense retrieval better captures clinical meaning and passage-level evidence, while sparse retrieval can still find exact terminology quickly.

The dense gains are important because many questions ask for implications or clinical relations that are not expressed with identical wording in the passage.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 14 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.5262, hit@10 of 0.9000, and recall@100 of 0.6126. This is the strongest overall profile, combining BM25's exact clinical term coverage with dense retrieval's semantic evidence matching.

The hybrid pool is therefore a strong candidate source for reranking. It exposes more relevant clinical passages while preserving high first-page hit behavior.

Metric Interpretation for Model Researchers

This is a many-positive clinical retrieval task. Hit@10 indicates whether the system finds at least one useful passage, but recall@100 is essential because a clinical question may have many valid passages across studies. nDCG@10 measures whether high-quality evidence appears early enough for a provider-facing search workflow.

Hybrid retrieval is the best candidate-generation baseline in this split, while dense retrieval provides the strongest standalone semantic signal.

Query and Relevance Type Tendencies

Queries are concise clinical questions, often asking "which", "what", "how", or "is" about procedures, symptoms, contraindications, measurements, or treatment implications. Relevant documents are biomedical article-title plus passage snippets.

The relevance relation is clinical answerability. A passage should answer the clinical information need or supply relevant evidence, not merely mention the same disease or procedure.

Representative Failure Modes

Common failures include abbreviation ambiguity, retrieving a same-procedure passage with the wrong clinical relation, confusing indication and contraindication, missing specialty-specific terminology, and over-ranking broad review passages. Sparse systems may over-match procedure names; dense systems may under-rank exact acronyms or rare specialist terms.

Training Data That May Help

Useful training data includes non-overlapping clinical question-to-passage pairs, biomedical evidence retrieval data, medical QA retrieval data with passage-level grounding, and clinical abbreviation or specialty-specific hard-negative training. CURE evaluation queries, CURE positive passages, and near-duplicate mined biomedical passages should be excluded for clean evaluation.

Model Improvement Notes

Models should preserve exact clinical terminology while improving semantic relation matching. Hard negatives should share the same medical topic but differ by diagnosis, treatment, contraindication, patient context, or outcome. Multi-positive training is important because many questions have numerous relevant passages.

Example Data

Query	Positive document
Which are the factors that should be taken in consideration when deciding the location of IMF screws placement? [111 chars]	The Use of MMF Screws: Surgical Technique, Indications, Contraindications, and Common Problems in Review of the Literature The anatomical site for the placement of MMF screws is chosen with respect to a given fracture location, the dentition, the extent of surgical exposure, the availability and the quality of bone in the direct proximity of the fracture line. [362 chars]
Which are the disadvantages of 3D printed splints in orthognathic surgery? [74 chars]	Comparison between Additive and Subtractive CAD-CAM Technique to Produce Orthognathic Surgical Splints: A Personalized Approach The findings of the present investigation would suggest that surgical splints are more accurate when produced by milling technology, according to the greater percentage of matching found in relation to the original digital project. Clinicians should be aware of this when referring to the lab technician for the construction of the appliance. [470 chars]
Which are the advantages of endoscopic approach to treat massive arterial epistaxis? [84 chars]	Success Rate of Endoscopic Sphenopalatine Artery Ligation for the Management of Refractory Posterior Epistaxis Patients in a Tertiary Care Hospital: A Descriptive Cross-sectional Study The findings of the study conclude that ESPAL has a high success rate in patients with intractile posterior epistaxis. From our study, we would like to recommend that endoscopic sphenopalatine artery ligation or cauterization should be preferred as first-line treatment for posterior epistaxis. This study will be beneficial for the development of knowledge by healthcare professionals for the management of posterior epistaxis. [613 chars]

Source Reference Table

Title	Year	Type	URL
CURE: A Dataset for Clinical Understanding & Retrieval Evaluation	2024	arXiv paper	https://arxiv.org/abs/2412.06954
CURE: A Dataset for Clinical Understanding & Retrieval Evaluation	2025	KDD proceedings DOI	https://doi.org/10.1145/3711896.3737435
clinia/CUREv1	2024	source dataset	https://huggingface.co/datasets/clinia/CUREv1

Dataset Information

Field	Value
Nano set	NanoMedical
Backing dataset	NanoMedical
Task / split	NanoCUREv1
Hugging Face dataset	hakari-bench/NanoMedical
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	5,181
Positives / query avg	25.91
Positives / query min	1
Positives / query median	18.00
Positives / query max	100
Multi-positive queries	171 (85.50%)
Query length avg chars	75.89
Document length avg chars	604.21

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4693	0.9000	0.5314	top-500
Dense	`harrier_oss_v1_270m`	0.5003	0.8700	0.5862	top-500
Reranking hybrid	`reranking_hybrid`	0.5262	0.9000	0.6126	top-100

Training and Leakage Metadata

Original train split: unavailable
Evaluation split origin: CURE benchmark test collection sampled into NanoMedical
Train/eval overlap audit: not_audited
Leakage note: exclude CURE evaluation queries, CURE positive passages, and near-duplicate mined biomedical passages when training for clean evaluation
Multi-positive training: train with multi-positive labels and same-topic hard negatives
Useful training data: non-overlapping clinical question-to-passage retrieval pairs, biomedical evidence retrieval data grounded in article passages, medical QA retrieval data with passage-level evidence, clinical abbreviation and specialty-specific hard-negative training