HAKARI-Bench

NanoR2MED

Overview

NanoR2MED is the Nano task group for R2MED, a reasoning-driven medical retrieval benchmark. It contains eight English retrieval tasks spanning biomedical StackExchange-style reference search, diagnostic and examination evidence retrieval, treatment evidence retrieval, and clinical case retrieval. The group is deliberately difficult because many queries require an implicit medical, scientific, or clinical inference before the relevant passage can be identified.

The group contains 876 queries, 80,000 task-local documents, and 2,678 positive qrel rows. Every task uses a 10,000-document candidate pool, but query count and positive density vary. NanoR2MED should be treated as a research evaluation resource, not as a clinical decision system.

What This Group Measures

R2MED focuses on reasoning-driven medical retrieval: relevance is not simply semantic similarity between a query and a document. A retriever may need to infer a biological concept, diagnosis, examination, treatment decision, or same-diagnosis case group before it can find the supporting evidence. The Nano group preserves this structure across eight compact splits.

The Q&A reference tasks retrieve answer-supporting references for bioinformatics, biology, and medical-sciences questions. The diagnostic and examination tasks retrieve evidence for clinical vignettes. The treatment task retrieves PubMed Central passages that support management reasoning. The clinical case tasks retrieve similar cases where relevance is mediated by diagnosis or clinical similarity rather than surface symptom overlap alone.

Task Families

Dataset Shape

All tasks are English and each uses 10,000 candidate documents. The group is multi-positive, with an average of 3.06 positives per query. NanoR2MEDMedQADiag has the highest average positive density at 4.42 positives per query, while the clinical case and treatment splits range from 2.10 (NanoR2MEDPMCTreatment) to 3.54 (NanoR2MEDIIYiClinical) positives per query.
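The per-task densities and the group average follow directly from the query and positive counts. The following is a minimal sketch with the counts copied from the Task Summary table in this document; the dictionary and function names are illustrative and not part of any benchmark tooling.

```python
# (queries, positive qrel rows) per task, copied from the Task Summary table.
TASK_COUNTS = {
    "NanoR2MEDBioinformatics": (77, 226),
    "NanoR2MEDBiology": (103, 374),
    "NanoR2MEDIIYiClinical": (129, 457),
    "NanoR2MEDMedQADiag": (118, 522),
    "NanoR2MEDMedXpertQAExam": (97, 292),
    "NanoR2MEDMedicalSciences": (88, 244),
    "NanoR2MEDPMCClinical": (114, 248),
    "NanoR2MEDPMCTreatment": (150, 315),
}


def positives_per_query(counts):
    """Average number of positives per query for each task."""
    return {task: positives / queries for task, (queries, positives) in counts.items()}


def group_positives_per_query(counts):
    """Group average: total positives divided by total queries."""
    total_queries = sum(queries for queries, _ in counts.values())
    total_positives = sum(positives for _, positives in counts.values())
    return total_positives / total_queries


density = positives_per_query(TASK_COUNTS)
densest_task = max(density, key=density.get)
group_density = group_positives_per_query(TASK_COUNTS)
print(densest_task, round(density[densest_task], 2))
print(round(group_density, 2))
```

The group figure is a ratio of totals, so it weights tasks by query count rather than averaging the eight per-task densities.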

Query length varies heavily. Some Q&A tasks are long but still question-like, while NanoR2MEDIIYiClinical and NanoR2MEDPMCTreatment use long structured case summaries. Documents range from biomedical reference passages to long clinical records. This makes the group sensitive to both medical entity understanding and long-context representation.

Retrieval Behavior

BM25 Profile

BM25 is not the best profile for any task in the current Nano data. It is strongest on NanoR2MEDPMCClinical, where case reports often repeat distinctive anatomy, disease, imaging, and symptom terms. It is weakest on exam and diagnostic evidence retrieval, especially NanoR2MEDMedXpertQAExam, where surface overlap does not reveal the examination or diagnostic bridge.
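To illustrate why surface overlap can miss the bridge, here is a small sketch of Okapi BM25 scoring on toy text. Everything in it is an assumption made for illustration: the whitespace tokenization, the parameters `k1=1.5` and `b=0.75`, the smoothed idf form, and the two invented documents, none of which come from NanoR2MED. The benchmark's actual BM25 configuration is not stated in this document.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy example (invented text): the second document answers the clinical
# question but shares no surface terms with the query.
query = "fever and rash after a tick bite".split()
docs = [
    "tick bite followed by fever and rash in a hiker".split(),
    "doxycycline is first line therapy for early lyme disease".split(),
]
scores = bm25_scores(query, docs)
print(scores)
```

The second toy document receives a score of zero because it shares no tokens with the query, even though it requires only one inference step (the likely diagnosis) to become relevant.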

The group-level BM25 nDCG@10 is 0.2110. This low value is expected for a reasoning-driven medical benchmark. Sparse retrieval can find medically related documents, but it often misses the latent concept: a diagnosis hidden in a vignette, a treatment implication, or the practical resource needed for a bioinformatics problem.
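For readers unfamiliar with the metric, nDCG@10 over multi-positive qrels can be sketched as below. This assumes binary relevance and the standard log2 rank discount; the exact evaluator behind the reported numbers is not described in this document, and the function name is illustrative.

```python
import math


def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """nDCG@k with binary gains: 1 for a positive document, 0 otherwise."""
    positives = set(positive_ids)
    # Discounted gain of the positives actually retrieved in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in positives
    )
    # Ideal ranking places all positives (up to k) at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


perfect = ndcg_at_k(["a", "b", "x"], ["a", "b"])
second_place = ndcg_at_k(["x", "a"], ["a"])
print(perfect, second_place)
```

Because the ideal DCG grows with the number of positives, a query with several positives is penalized for each one missing from the top 10, not only for the first.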

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest query-weighted profile at 0.2980 nDCG@10. It is best for five tasks: Bioinformatics, Biology, Medical Sciences, MedXpertQA Exam, and PMC Treatment, a set that covers all three Q&A-style biomedical reference tasks. Dense retrieval helps bridge paraphrase and implicit answerability, especially when query wording differs from the evidence passage.
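The query-weighted group scores can be reproduced from the per-task values in the Task Summary table. This is a minimal sketch; the row layout and function name are illustrative, and because the inputs are already rounded to four decimals the result may differ in the last digit from a score computed on raw per-query values.

```python
# (queries, BM25, dense, reranking hybrid) nDCG@10 per task, from the Task Summary table.
TASK_SCORES = {
    "NanoR2MEDBioinformatics": (77, 0.2189, 0.3425, 0.2623),
    "NanoR2MEDBiology": (103, 0.3455, 0.4953, 0.4722),
    "NanoR2MEDIIYiClinical": (129, 0.1482, 0.1870, 0.1975),
    "NanoR2MEDMedQADiag": (118, 0.0700, 0.1254, 0.1406),
    "NanoR2MEDMedXpertQAExam": (97, 0.0277, 0.1599, 0.0979),
    "NanoR2MEDMedicalSciences": (88, 0.2140, 0.3567, 0.3320),
    "NanoR2MEDPMCClinical": (114, 0.3933, 0.3584, 0.4477),
    "NanoR2MEDPMCTreatment": (150, 0.2580, 0.3801, 0.3555),
}


def query_weighted_mean(task_scores, column):
    """Average a per-task score column, weighting each task by its query count."""
    total_queries = sum(row[0] for row in task_scores.values())
    weighted_sum = sum(row[0] * row[column] for row in task_scores.values())
    return weighted_sum / total_queries


bm25 = query_weighted_mean(TASK_SCORES, 1)
dense = query_weighted_mean(TASK_SCORES, 2)
hybrid = query_weighted_mean(TASK_SCORES, 3)
print(round(bm25, 4), round(dense, 4), round(hybrid, 4))
```

Weighting by query count means that larger tasks such as NanoR2MEDPMCTreatment (150 queries) influence the group score more than smaller ones such as NanoR2MEDBioinformatics (77 queries).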

Absolute scores remain modest. Even dense retrieval struggles with diagnostic and examination tasks because the model must infer an intermediate medical concept before retrieving evidence. NanoR2MED is therefore a hard benchmark even for embedding-based retrieval.

Reranking Hybrid Profile

The reranking hybrid profile is best for NanoR2MEDIIYiClinical, NanoR2MEDMedQADiag, and NanoR2MEDPMCClinical, and it has the strongest group-level hit@10 and recall@100. This suggests that hybrid candidate generation helps when exact clinical terms and semantic case similarity are both useful, especially in case retrieval and diagnosis-oriented tasks.
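This document does not say how the hybrid profile merges sparse and dense candidates. As one common option, and purely as an assumption here, reciprocal rank fusion (RRF) combines ranked lists using only rank positions. The constant `k=60` and the function name are illustrative choices, not benchmark settings.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of document ids into one candidate list."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

A fused list of this kind would then be passed to a reranker; documents found by either retriever survive into the candidate pool, which is consistent with the higher recall@100 reported for the hybrid profile.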

Hybrid is below dense on the group-level nDCG@10, but its recall@100 is higher. For R2MED-style retrieval, that distinction matters: hybrid search may be a good first-stage retriever for medically plausible candidates, while dense or specialized reranking may still be needed for final top-10 ordering.
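The difference between candidate coverage and top-10 ordering can be made concrete with the two coverage metrics named above. A minimal sketch, assuming binary relevance and the usual definitions of hit@k and recall@k; the function names are illustrative.

```python
def hit_at_k(ranked_doc_ids, positive_ids, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if set(ranked_doc_ids[:k]) & set(positive_ids) else 0.0


def recall_at_k(ranked_doc_ids, positive_ids, k=100):
    """Fraction of a query's positives that appear in the top k."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    return len(positives & set(ranked_doc_ids[:k])) / len(positives)


ranked = ["d7", "d2", "d9"]
positives = ["d2", "d5"]
print(hit_at_k(ranked, positives), recall_at_k(ranked, positives))
```

Neither metric depends on where inside the top k a positive lands, which is why a profile can lead on recall@100 while trailing on nDCG@10.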

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoR2MEDBioinformatics | Biomedical reference retrieval | en | 77 | 10,000 | 226 | 2.94 | 0.2189 | 0.3425 | 0.2623 | Dense |
| NanoR2MEDBiology | Biomedical reference retrieval | en | 103 | 10,000 | 374 | 3.63 | 0.3455 | 0.4953 | 0.4722 | Dense |
| NanoR2MEDIIYiClinical | Clinical case retrieval | en | 129 | 10,000 | 457 | 3.54 | 0.1482 | 0.1870 | 0.1975 | Reranking hybrid |
| NanoR2MEDMedQADiag | Diagnostic evidence retrieval | en | 118 | 10,000 | 522 | 4.42 | 0.0700 | 0.1254 | 0.1406 | Reranking hybrid |
| NanoR2MEDMedXpertQAExam | Examination evidence retrieval | en | 97 | 10,000 | 292 | 3.01 | 0.0277 | 0.1599 | 0.0979 | Dense |
| NanoR2MEDMedicalSciences | Biomedical reference retrieval | en | 88 | 10,000 | 244 | 2.77 | 0.2140 | 0.3567 | 0.3320 | Dense |
| NanoR2MEDPMCClinical | Clinical case retrieval | en | 114 | 10,000 | 248 | 2.18 | 0.3933 | 0.3584 | 0.4477 | Reranking hybrid |
| NanoR2MEDPMCTreatment | Treatment evidence retrieval | en | 150 | 10,000 | 315 | 2.10 | 0.2580 | 0.3801 | 0.3555 | Dense |

Interpretation Notes for Model Researchers

NanoR2MED is a hard reasoning benchmark. Low BM25 scores do not merely indicate poor tokenization; they show that relevance often depends on diagnosis, treatment, case similarity, or biomedical concept inference. Dense retrieval is the strongest single profile, but hybrid retrieval improves candidate coverage and leads several clinical case or diagnostic tasks.

Researchers should inspect task families separately. A model that improves Bioinformatics or Biology may be learning biomedical reference retrieval, while improvement on MedQA diagnosis or PMC Clinical may indicate better clinical reasoning or case similarity. The aggregate score alone hides these differences.
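One way to look past the aggregate is to group the per-task scores by family. The sketch below uses the values from the Task Summary table; weighting tasks by query count inside each family is an assumption of this example, and the names are illustrative.

```python
from collections import defaultdict

# (family, queries, BM25, dense, reranking hybrid) nDCG@10 rows from the Task Summary table.
ROWS = [
    ("Biomedical reference retrieval", 77, 0.2189, 0.3425, 0.2623),
    ("Biomedical reference retrieval", 103, 0.3455, 0.4953, 0.4722),
    ("Clinical case retrieval", 129, 0.1482, 0.1870, 0.1975),
    ("Diagnostic evidence retrieval", 118, 0.0700, 0.1254, 0.1406),
    ("Examination evidence retrieval", 97, 0.0277, 0.1599, 0.0979),
    ("Biomedical reference retrieval", 88, 0.2140, 0.3567, 0.3320),
    ("Clinical case retrieval", 114, 0.3933, 0.3584, 0.4477),
    ("Treatment evidence retrieval", 150, 0.2580, 0.3801, 0.3555),
]
PROFILES = ("BM25", "Dense", "Reranking hybrid")


def family_scores(rows):
    """Query-weighted nDCG@10 per family and profile."""
    weighted = defaultdict(lambda: [0.0, 0.0, 0.0])
    queries = defaultdict(int)
    for family, n_queries, *scores in rows:
        queries[family] += n_queries
        for i, score in enumerate(scores):
            weighted[family][i] += n_queries * score
    return {
        family: {p: weighted[family][i] / queries[family] for i, p in enumerate(PROFILES)}
        for family in weighted
    }


def best_profile_by_family(rows):
    """Name of the highest-scoring profile for each family."""
    return {f: max(s, key=s.get) for f, s in family_scores(rows).items()}


best = best_profile_by_family(ROWS)
for family, profile in best.items():
    print(family, "->", profile)
```

In this breakdown the clinical case and diagnostic families favor the reranking hybrid profile while the reference, examination, and treatment families favor dense retrieval, a split that the single group-level number does not show.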

Training and Leakage Notes

Useful training data includes biomedical Q&A reference pairs, tool-documentation retrieval, medical exam vignettes with evidence passages, diagnosis-labeled case matching, PubMed Central treatment evidence, PICO-style intervention retrieval, and hard negatives that share symptoms or disease terms but support a different clinical conclusion.

Leakage control should exclude the following from any training set: NanoR2MED evaluation queries, qrels, positive documents, same-source near duplicates, public benchmark examples, and clinical case records with overlapping diagnoses and text. Synthetic data should preserve the reasoning bridge rather than reduce to generic symptom-passage similarity.
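A near-duplicate filter is one practical piece of such leakage control. The sketch below is an assumption-laden illustration, not a prescribed procedure: it compares word 5-gram shingles with Jaccard similarity and drops training texts above a threshold of 0.8. The shingle size, the threshold, and all function names are hypothetical choices, and the example sentences are invented.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return re.sub(r"\s+", " ", re.sub(r"[^a-z0-9\s]", " ", text.lower())).strip()


def shingles(text, n=5):
    """Set of word n-grams; short texts fall back to a single shingle."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_training_texts(train_texts, eval_texts, threshold=0.8):
    """Keep only training texts that are not near duplicates of evaluation texts."""
    eval_shingles = [shingles(text) for text in eval_texts]
    kept = []
    for text in train_texts:
        candidate = shingles(text)
        if any(jaccard(candidate, ev) >= threshold for ev in eval_shingles):
            continue  # too close to an evaluation query or positive document
        kept.append(text)
    return kept


eval_texts = ["A 45-year-old man presents with chest pain and shortness of breath."]
train_texts = [
    "a 45 year old man presents with chest pain and shortness of breath",
    "Differential expression analysis usually starts from raw read counts.",
]
kept = filter_training_texts(train_texts, eval_texts)
print(kept)
```

Text-level filtering of this kind does not catch case records that share a diagnosis but differ in wording, so diagnosis-level or source-level exclusion would still be needed alongside it.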

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| R2MED: A Benchmark for Reasoning-Driven Medical Retrieval | 2025 | benchmark paper | https://arxiv.org/abs/2505.14558 |
| R2MED project page | 2025 | project page | https://r2med.github.io/ |
| R2MED GitHub repository | 2025 | source repository | https://github.com/R2MED/R2MED |
| R2MED source datasets | 2025 | dataset collection | https://huggingface.co/R2MED |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 8 |
| Queries | 876 |
| Split-local documents | 80,000 |
| Positive qrels | 2,678 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 3.06 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoR2MEDBioinformatics | NanoR2MED | en | natural_language | 77 | 10,000 | 226 | 0.2189 | 0.3425 | 0.2623 | Dense |
| NanoR2MEDBiology | NanoR2MED | en | natural_language | 103 | 10,000 | 374 | 0.3455 | 0.4953 | 0.4722 | Dense |
| NanoR2MEDIIYiClinical | NanoR2MED | en | natural_language | 129 | 10,000 | 457 | 0.1482 | 0.1870 | 0.1975 | Reranking hybrid |
| NanoR2MEDMedicalSciences | NanoR2MED | en | natural_language | 88 | 10,000 | 244 | 0.2140 | 0.3567 | 0.3320 | Dense |
| NanoR2MEDMedQADiag | NanoR2MED | en | natural_language | 118 | 10,000 | 522 | 0.0700 | 0.1254 | 0.1406 | Reranking hybrid |
| NanoR2MEDMedXpertQAExam | NanoR2MED | en | natural_language | 97 | 10,000 | 292 | 0.0277 | 0.1599 | 0.0979 | Dense |
| NanoR2MEDPMCClinical | NanoR2MED | en | natural_language | 114 | 10,000 | 248 | 0.3933 | 0.3584 | 0.4477 | Reranking hybrid |
| NanoR2MEDPMCTreatment | NanoR2MED | en | natural_language | 150 | 10,000 | 315 | 0.2580 | 0.3801 | 0.3555 | Dense |