NanoR2MED
Overview
NanoR2MED is the Nano task group for R2MED, a reasoning-driven medical retrieval benchmark. It contains eight English retrieval tasks spanning biomedical StackExchange-style reference search, diagnostic and examination evidence retrieval, treatment evidence retrieval, and clinical case retrieval. The group is deliberately difficult because many queries require an implicit medical, scientific, or clinical inference before the relevant passage can be identified.
The group contains 876 queries, 80,000 task-local documents, and 2,678 positive qrel rows. Every task uses a 10,000-document candidate pool, but query count and positive density vary. NanoR2MED should be treated as a research evaluation resource, not as a clinical decision system.
What This Group Measures
R2MED focuses on reasoning-driven medical retrieval: relevance is not simply semantic similarity between a query and a document. A retriever may need to infer a biological concept, diagnosis, examination, treatment decision, or same-diagnosis case group before it can find the supporting evidence. The Nano group preserves this structure across eight compact splits.
The Q&A reference tasks retrieve answer-supporting references for bioinformatics, biology, and medical-sciences questions. The diagnostic and examination tasks retrieve evidence for clinical vignettes. The treatment task retrieves PubMed Central passages that support management reasoning. The clinical case tasks retrieve similar cases where relevance is mediated by diagnosis or clinical similarity rather than surface symptom overlap alone.
Task Families
- Biomedical Q&A reference retrieval: NanoR2MEDBioinformatics, NanoR2MEDBiology, and NanoR2MEDMedicalSciences retrieve supporting references for biomedical community questions.
- Diagnostic and examination evidence retrieval: NanoR2MEDMedQADiag and NanoR2MEDMedXpertQAExam retrieve medical evidence for exam-style vignettes.
- Treatment evidence retrieval: NanoR2MEDPMCTreatment retrieves PMC passages for treatment or management reasoning.
- Clinical case retrieval: NanoR2MEDPMCClinical and NanoR2MEDIIYiClinical retrieve clinically similar cases.
Dataset Shape
All tasks are English and each uses 10,000 candidate documents. The group is multi-positive, with an average of 3.06 positives per query. NanoR2MEDMedQADiag has the highest average positive density at 4.42 positives per query, while the clinical and treatment splits generally have two to four positives per query.
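The density figures can be rechecked from the per-task counts. The sketch below is only illustrative arithmetic: the query and positive counts are copied from the Task Summary table, and the names `TASK_COUNTS` and `positives_per_query` are made up for this example.

```python
# (queries, positive qrel rows) per task, copied from the Task Summary table.
TASK_COUNTS = {
    "NanoR2MEDBioinformatics": (77, 226),
    "NanoR2MEDBiology": (103, 374),
    "NanoR2MEDIIYiClinical": (129, 457),
    "NanoR2MEDMedQADiag": (118, 522),
    "NanoR2MEDMedXpertQAExam": (97, 292),
    "NanoR2MEDMedicalSciences": (88, 244),
    "NanoR2MEDPMCClinical": (114, 248),
    "NanoR2MEDPMCTreatment": (150, 315),
}


def positives_per_query(counts):
    """Return {task: positives / queries} for every task."""
    return {task: pos / queries for task, (queries, pos) in counts.items()}


density = positives_per_query(TASK_COUNTS)
total_queries = sum(q for q, _ in TASK_COUNTS.values())
total_positives = sum(p for _, p in TASK_COUNTS.values())
group_density = total_positives / total_queries

# Print tasks from densest to sparsest, then the group-level average.
for task, value in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{task:26s} {value:.2f}")
print(f"group: {total_positives} / {total_queries} = {group_density:.2f}")
```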
Query length varies heavily. Queries in some Q&A tasks are long but still question-like, while NanoR2MEDIIYiClinical and NanoR2MEDPMCTreatment use long structured case summaries. Documents range from biomedical reference passages to long clinical records. This makes the group sensitive to both medical entity understanding and long-context representation.
Retrieval Behavior
BM25 Profile
BM25 is not the best profile for any task in the current Nano data. It is strongest on NanoR2MEDPMCClinical, where case reports often repeat distinctive anatomy, disease, imaging, and symptom terms. It is weakest on exam and diagnostic evidence retrieval, especially NanoR2MEDMedXpertQAExam, where surface overlap does not reveal the examination or diagnostic bridge.
The group-level BM25 nDCG@10 is 0.2110. This low value is expected for a reasoning-driven medical benchmark. Sparse retrieval can find medically related documents, but it often misses the latent concept: a diagnosis hidden in a vignette, a treatment implication, or the practical resource needed for a bioinformatics problem.
Dense Profile
Dense retrieval with harrier-oss-270m is the strongest query-weighted profile at 0.2980 nDCG@10. It is best for five tasks: Bioinformatics, Biology, Medical Sciences, MedXpertQA Exam, and PMC Treatment, which covers all three Q&A-style biomedical reference tasks. Dense retrieval helps bridge paraphrase and implicit answerability, especially when query wording differs from the evidence passage.
Absolute scores remain modest. Even dense retrieval struggles with diagnostic and examination tasks because the model must infer an intermediate medical concept before retrieving evidence. NanoR2MED is therefore a hard benchmark even for embedding-based retrieval.
Reranking Hybrid Profile
The reranking hybrid profile is best for NanoR2MEDIIYiClinical, NanoR2MEDMedQADiag, and NanoR2MEDPMCClinical, and it has the strongest group-level hit@10 and recall@100. This suggests that hybrid candidate generation helps when exact clinical terms and semantic case similarity are both useful, especially in case retrieval and diagnosis-oriented tasks.
Hybrid is below dense on the group-level nDCG@10, but its recall@100 is higher. For R2MED-style retrieval, that distinction matters: hybrid search may be a good first-stage retriever for medically plausible candidates, while dense or specialized reranking may still be needed for final top-10 ordering.
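The difference between top-10 ordering and candidate coverage is easier to see on a toy query. The sketch below assumes binary relevance and the common log2-discounted nDCG formulation; this card does not state the exact evaluation code, so treat the formulas as an assumption. The document IDs and both rankings are invented for illustration.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k with a log2 rank discount (assumed formulation)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of the positives that appear in the top k results."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & positives) / len(positives)


positives = {"d1", "d2", "d3"}  # hypothetical positives for one query

# Ranking A: two positives at the very top, the third missing from the top 100.
ranking_a = ["d1", "d2"] + [f"n{i}" for i in range(98)]

# Ranking B: all three positives retrieved, but only at ranks 20, 30, and 40.
ranking_b = [f"n{i}" for i in range(100)]
ranking_b[19], ranking_b[29], ranking_b[39] = "d1", "d2", "d3"

for name, ranking in (("A", ranking_a), ("B", ranking_b)):
    print(
        f"ranking {name}: nDCG@10={ndcg_at_k(ranking, positives):.4f} "
        f"recall@100={recall_at_k(ranking, positives):.4f}"
    )
```

Ranking A wins on nDCG@10 and ranking B wins on recall@100, which is the same trade-off described above between a profile that orders the top 10 well and one that is better suited to first-stage candidate generation.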
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoR2MEDBioinformatics | Biomedical reference retrieval | en | 77 | 10,000 | 226 | 2.94 | 0.2189 | 0.3425 | 0.2623 | Dense |
| NanoR2MEDBiology | Biomedical reference retrieval | en | 103 | 10,000 | 374 | 3.63 | 0.3455 | 0.4953 | 0.4722 | Dense |
| NanoR2MEDIIYiClinical | Clinical case retrieval | en | 129 | 10,000 | 457 | 3.54 | 0.1482 | 0.1870 | 0.1975 | Reranking hybrid |
| NanoR2MEDMedQADiag | Diagnostic evidence retrieval | en | 118 | 10,000 | 522 | 4.42 | 0.0700 | 0.1254 | 0.1406 | Reranking hybrid |
| NanoR2MEDMedXpertQAExam | Examination evidence retrieval | en | 97 | 10,000 | 292 | 3.01 | 0.0277 | 0.1599 | 0.0979 | Dense |
| NanoR2MEDMedicalSciences | Biomedical reference retrieval | en | 88 | 10,000 | 244 | 2.77 | 0.2140 | 0.3567 | 0.3320 | Dense |
| NanoR2MEDPMCClinical | Clinical case retrieval | en | 114 | 10,000 | 248 | 2.18 | 0.3933 | 0.3584 | 0.4477 | Reranking hybrid |
| NanoR2MEDPMCTreatment | Treatment evidence retrieval | en | 150 | 10,000 | 315 | 2.10 | 0.2580 | 0.3801 | 0.3555 | Dense |
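
The group-level nDCG@10 figures quoted in the retrieval profiles are query-weighted, so they can be approximately reproduced from the table rows. The sketch below copies the per-task values from the table; because those values are already rounded to four decimals, the result may differ from the published group figure in the last digit. The helper names are illustrative only.

```python
PROFILES = ("bm25", "dense", "hybrid")

# task: (queries, BM25, dense, reranking hybrid) nDCG@10, copied from the table.
TASKS = {
    "NanoR2MEDBioinformatics": (77, 0.2189, 0.3425, 0.2623),
    "NanoR2MEDBiology": (103, 0.3455, 0.4953, 0.4722),
    "NanoR2MEDIIYiClinical": (129, 0.1482, 0.1870, 0.1975),
    "NanoR2MEDMedQADiag": (118, 0.0700, 0.1254, 0.1406),
    "NanoR2MEDMedXpertQAExam": (97, 0.0277, 0.1599, 0.0979),
    "NanoR2MEDMedicalSciences": (88, 0.2140, 0.3567, 0.3320),
    "NanoR2MEDPMCClinical": (114, 0.3933, 0.3584, 0.4477),
    "NanoR2MEDPMCTreatment": (150, 0.2580, 0.3801, 0.3555),
}


def query_weighted_ndcg(tasks):
    """Average each profile's per-task nDCG@10, weighting tasks by query count."""
    total_queries = sum(row[0] for row in tasks.values())
    return {
        name: sum(row[0] * row[i + 1] for row in tasks.values()) / total_queries
        for i, name in enumerate(PROFILES)
    }


def count_wins(tasks):
    """Count how many tasks each profile leads on nDCG@10."""
    wins = {name: 0 for name in PROFILES}
    for row in tasks.values():
        scores = row[1:]
        wins[PROFILES[scores.index(max(scores))]] += 1
    return wins


group = query_weighted_ndcg(TASKS)
wins = count_wins(TASKS)
for name in PROFILES:
    print(f"{name:7s} nDCG@10={group[name]:.4f}  best on {wins[name]} task(s)")
```

Weighting by query count rather than taking a plain mean over tasks keeps the 150-query treatment split from counting the same as the 77-query bioinformatics split.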
Interpretation Notes for Model Researchers
NanoR2MED is a hard reasoning benchmark. Low BM25 scores do not merely indicate poor tokenization; they show that relevance often depends on diagnosis, treatment, case similarity, or biomedical concept inference. Dense retrieval is the strongest single profile on group-level nDCG@10, but the reranking hybrid profile improves candidate coverage (higher recall@100) and leads both clinical case tasks and NanoR2MEDMedQADiag.
Researchers should inspect task families separately. A model that improves Bioinformatics or Biology may be learning biomedical reference retrieval, while improvement on MedQA diagnosis or PMC Clinical may indicate better clinical reasoning or case similarity. The aggregate score alone hides these differences.
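One way to look at families separately is to aggregate the per-task scores by family before comparing profiles. The sketch below groups tasks using the four families listed under Task Families and weights by query count; the per-task values are copied from the Task Summary table, and the short family labels and function name are illustrative choices, not part of the benchmark.

```python
from collections import defaultdict

# (task, family, queries, BM25, dense, reranking hybrid) nDCG@10 from the Task Summary table.
ROWS = [
    ("NanoR2MEDBioinformatics", "qa_reference", 77, 0.2189, 0.3425, 0.2623),
    ("NanoR2MEDBiology", "qa_reference", 103, 0.3455, 0.4953, 0.4722),
    ("NanoR2MEDMedicalSciences", "qa_reference", 88, 0.2140, 0.3567, 0.3320),
    ("NanoR2MEDMedQADiag", "diag_exam", 118, 0.0700, 0.1254, 0.1406),
    ("NanoR2MEDMedXpertQAExam", "diag_exam", 97, 0.0277, 0.1599, 0.0979),
    ("NanoR2MEDPMCTreatment", "treatment", 150, 0.2580, 0.3801, 0.3555),
    ("NanoR2MEDPMCClinical", "clinical_case", 114, 0.3933, 0.3584, 0.4477),
    ("NanoR2MEDIIYiClinical", "clinical_case", 129, 0.1482, 0.1870, 0.1975),
]


def family_scores(rows):
    """Query-weighted nDCG@10 per family for each retrieval profile."""
    totals = defaultdict(lambda: [0, 0.0, 0.0, 0.0])
    for _task, family, queries, bm25, dense, hybrid in rows:
        acc = totals[family]
        acc[0] += queries
        acc[1] += queries * bm25
        acc[2] += queries * dense
        acc[3] += queries * hybrid
    return {
        family: {"queries": q, "bm25": b / q, "dense": d / q, "hybrid": h / q}
        for family, (q, b, d, h) in totals.items()
    }


by_family = family_scores(ROWS)
for family, s in by_family.items():
    print(
        f"{family:14s} queries={s['queries']:3d} bm25={s['bm25']:.4f} "
        f"dense={s['dense']:.4f} hybrid={s['hybrid']:.4f}"
    )
```

A breakdown like this makes it visible when a model gains on reference retrieval while staying flat on clinical case or diagnostic retrieval, which a single aggregate number would average away.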
Training and Leakage Notes
Useful training data includes biomedical Q&A reference pairs, tool-documentation retrieval, medical exam vignettes with evidence passages, diagnosis-labeled case matching, PubMed Central treatment evidence, PICO-style intervention retrieval, and hard negatives that share symptoms or disease terms but support a different clinical conclusion.
Leakage control should exclude NanoR2MED evaluation queries, qrels, positive documents, same-source near duplicates, public benchmark examples, and clinical case records with overlapping diagnoses and text. Synthetic data should preserve the reasoning bridge rather than producing generic symptom-passage similarity.
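A simple starting point for the exclusion step is a near-duplicate filter over training pairs. The sketch below uses word n-gram shingling with Jaccard overlap, which is one possible technique and not a procedure described by the benchmark authors. The function names, the shingle size of 5, the 0.5 threshold, and the example texts are all assumptions chosen for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse every non-alphanumeric run to a single space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def shingles(text, n=5):
    """Set of word n-grams; texts shorter than n words yield one shingle."""
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two shingle sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_training_pairs(train_pairs, eval_texts, threshold=0.5, n=5):
    """Drop (query, passage) pairs that near-duplicate any evaluation text.

    eval_texts should hold evaluation queries and positive documents.
    The threshold and n are illustrative and need tuning per corpus.
    """
    eval_shingles = [shingles(text, n) for text in eval_texts]
    kept = []
    for query, passage in train_pairs:
        candidates = (shingles(query, n), shingles(passage, n))
        leaked = any(
            jaccard(cand, ev) >= threshold
            for cand in candidates
            for ev in eval_shingles
        )
        if not leaked:
            kept.append((query, passage))
    return kept


# Invented example texts, not taken from the benchmark.
eval_texts = ["A 45-year-old man presents with chest pain radiating to the left arm."]
train_pairs = [
    ("A 45-year-old man presents with chest pain radiating to the left arm!",
     "Passage about acute coronary syndrome management."),
    ("What tool converts one alignment file format to another?",
     "A reference passage about file format conversion."),
]
kept = filter_training_pairs(train_pairs, eval_texts)
print(len(kept), "of", len(train_pairs), "pairs kept")
```

This comparison is quadratic in the number of texts, so a larger corpus would likely need hashing or an approximate index. Lexical overlap also does not catch case records that share a diagnosis but are worded differently, so diagnosis-level deduplication needs a separate check.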
Source Reference Table
| Source | Year | Type | URL |
| --- | --- | --- | --- |
| R2MED: A Benchmark for Reasoning-Driven Medical Retrieval | 2025 | benchmark paper | https://arxiv.org/abs/2505.14558 |
| R2MED project page | 2025 | project page | https://r2med.github.io/ |
| R2MED GitHub repository | 2025 | source repository | https://github.com/R2MED/R2MED |
| R2MED source datasets | 2025 | dataset collection | https://huggingface.co/R2MED |
Metadata Summary
| Field | Value |
| --- | --- |
| Task pages | 8 |
| Queries | 876 |
| Split-local documents | 80,000 |
| Positive qrels | 2,678 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 3.06 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoR2MEDBioinformatics | NanoR2MED | en | natural_language | 77 | 10,000 | 226 | 0.2189 | 0.3425 | 0.2623 | Dense |
| NanoR2MEDBiology | NanoR2MED | en | natural_language | 103 | 10,000 | 374 | 0.3455 | 0.4953 | 0.4722 | Dense |
| NanoR2MEDIIYiClinical | NanoR2MED | en | natural_language | 129 | 10,000 | 457 | 0.1482 | 0.1870 | 0.1975 | Reranking hybrid |
| NanoR2MEDMedicalSciences | NanoR2MED | en | natural_language | 88 | 10,000 | 244 | 0.2140 | 0.3567 | 0.3320 | Dense |
| NanoR2MEDMedQADiag | NanoR2MED | en | natural_language | 118 | 10,000 | 522 | 0.0700 | 0.1254 | 0.1406 | Reranking hybrid |
| NanoR2MEDMedXpertQAExam | NanoR2MED | en | natural_language | 97 | 10,000 | 292 | 0.0277 | 0.1599 | 0.0979 | Dense |
| NanoR2MEDPMCClinical | NanoR2MED | en | natural_language | 114 | 10,000 | 248 | 0.3933 | 0.3584 | 0.4477 | Reranking hybrid |
| NanoR2MEDPMCTreatment | NanoR2MED | en | natural_language | 150 | 10,000 | 315 | 0.2580 | 0.3801 | 0.3555 | Dense |