NanoR2MED / NanoR2MEDMedicalSciences

Overview

NanoR2MEDMedicalSciences is an English Q&A reference retrieval task from R2MED. Queries are Medical Sciences StackExchange posts about health, physiology, nutrition, symptoms, and medical practice. Documents are external webpages or medical-reference passages that support accepted answers. The task tests whether a retriever can map consumer-facing or practitioner-facing questions to the specific evidence needed for an answer. Dense retrieval is strongest on every reported metric (nDCG@10 of 0.3567), BM25 is weakest but not negligible (0.2140), and the hybrid pool sits in between (0.3320): it clearly improves on BM25 but trails dense retrieval in both top-rank quality and recall@100.

Details

What the Original Data Measures

R2MED positions its Q&A reference retrieval tasks as reasoning-driven retrieval: the relevant document supports the answer to a forum question, even when it does not directly paraphrase the query. Medical Sciences uses StackExchange posts with accepted or highly upvoted answers and external links.

The benchmark pipeline expands candidate positives from query, answer, and reasoning-path retrieval views, then filters them through relevance assessment and expert review. The answer acts as an implicit bridge between the user question and supporting evidence.

Observed Data Profile

The Nano split contains 88 queries, 10,000 documents, and 244 positive qrel rows. Queries average 477.62 characters, and documents average 678.60 characters. Questions often include user context, misconceptions, or forum boilerplate.

Each query has 2.77 positives on average, with a median of 2 and a maximum of 8. Multi-positive queries account for 58 of 88 queries, or 65.91%. Examples ask whether microwave cooking is less healthy, whether acne can increase during weight loss, why pills are harder to swallow than food, how often to drink water, and whether protein supplements provide usable nutrients.
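The per-query statistics above can be recomputed from the qrels. The following is a minimal sketch, assuming qrels are available as unique (query_id, doc_id) rows; the function name and the toy IDs are hypothetical and do not come from the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from unique (query_id, doc_id) rows."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for c in per_query if c > 1)
    return {
        "queries": len(per_query),
        "positives": len(qrels),
        "avg": round(mean(per_query), 2),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive_pct": round(100 * multi / len(per_query), 2),
    }


# Toy qrels with hypothetical IDs, not rows from the real split.
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"),
    ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
]
profile = positives_profile(toy_qrels)

# Sanity check of the reported figures from their raw counts.
reported_avg = round(244 / 88, 2)
reported_multi_pct = round(100 * 58 / 88, 2)
```

Applied to the reported counts, 244 positives over 88 queries gives the stated average of 2.77, and 58 of 88 gives 65.91%.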

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.2140, hit@10 of 0.4091, and recall@100 of 0.6598. BM25 is useful when a question contains distinctive terms such as microwave, acne, swallowing, water intake, or amino acids.

Its limitations come from everyday phrasing and broad health vocabulary. A query may use common words while the positive passage discusses a mechanism, physiological process, or nutrition concept. BM25 can find the topic but miss the evidence that actually supports the answer.
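A toy illustration of this failure mode follows. It is a sketch under stated assumptions: whitespace tokenization, a Lucene-style BM25 formula, and k1=1.2, b=0.75 as defaults. The benchmark's actual BM25 configuration is not stated here, and the query and documents are invented for the example.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with a basic BM25 variant."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(query.lower().split())

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


# Invented example: a topical page shares query words, the evidence does not.
query = "is microwave food less healthy"
docs = [
    "microwave food safety tips for microwave ovens",                  # topical
    "water soluble vitamins leach into cooking water during boiling",  # evidence
    "daily water intake guidance",                                     # unrelated
]
scores = bm25_scores(query, docs)
```

The evidence passage shares no token with the query, so it scores zero, while the topical page receives a positive score from "microwave" and "food" alone.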

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.3567, hit@10 of 0.7045, and recall@100 of 0.8197. Dense retrieval improves over BM25 on all reported metrics: by 0.1427 in nDCG@10, 0.2954 in hit@10, and 0.1599 in recall@100.

This indicates that embedding similarity is better at linking informal medical questions to explanatory passages. Dense retrieval can connect a question about microwave health to nutrient loss, or a pill-swallowing question to bolus mechanics, even when surface overlap is limited.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with three rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.3320, hit@10 of 0.6136, and recall@100 of 0.7910. Hybrid retrieval beats BM25 on all three metrics but trails dense retrieval in both top-rank quality (nDCG@10, hit@10) and coverage (recall@100).

The result suggests that sparse (BM25) signals add some useful lexical anchors but also introduce candidates that are topically similar without being answer-supporting. Dense retrieval is the stronger standalone first-stage method for this split.
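How the reranking_hybrid pool is actually constructed is not described here. Purely as an illustration, one common way to merge a sparse and a dense ranking is reciprocal rank fusion (RRF). The sketch below is an assumption, not the benchmark's method; the function name, the constant k=60, and the document IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with RRF (illustrative only)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))[:top_n]


# Hypothetical rankings: BM25 favors a topical page, dense favors the evidence.
bm25_ranking = ["d_topic", "d_evidence", "d_other"]
dense_ranking = ["d_evidence", "d_mechanism", "d_topic"]
hybrid_pool = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

In this toy case the evidence document ends up first because both rankers place it high, but the topical page still sits directly behind it. That is the kind of topically similar candidate a downstream reranker has to push down.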

Metric Interpretation for Model Researchers

This split is multi-positive. nDCG@10 measures how well several supporting passages are ranked early, hit@10 measures whether at least one useful source is retrieved near the top, and recall@100 measures evidence coverage for reranking.
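The three metrics can be sketched for a single query as follows. This assumes binary relevance and the standard log2 discount for nDCG; the exact variant used to produce the reported numbers is not confirmed here, and the document IDs are hypothetical.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k for one query."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(positives), k)))
    return dcg / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of positives found in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


# One query with three positives, two of which are retrieved.
ranked = ["d3", "d1", "d9", "d2"]
positives = {"d1", "d2", "d7"}

ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
```

With several positives per query, hit@k saturates as soon as one positive is found, while nDCG@k and recall@k keep rewarding each additional positive. This is why the three numbers can diverge on a multi-positive split. Split-level scores would be the mean of these per-query values.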

For Medical Sciences, dense retrieval is the primary baseline to beat. Hybrid retrieval remains useful as a coverage-oriented comparison, but a good reranker must prioritize answer-supporting mechanisms over broad health-topic overlap.

Query and Relevance Type Tendencies

Queries are natural-language health questions, often written by non-specialists. Relevant passages can be medical encyclopedia sections, physiology explanations, nutrition references, or evidence summaries. The relevant document usually supports the accepted answer rather than directly mirroring the question.

Relevance depends on the specific mechanism, population, or recommendation. A passage about the same symptom or food category is not sufficient if it does not answer the user's question.

Representative Failure Modes

Common failures include retrieving broad pages about COVID, allergies, computers, food, or nutrition when the answer requires a specific mechanism; matching forum boilerplate instead of the medical intent; and selecting passages with the same everyday word but a different recommendation. BM25 is vulnerable to broad health terms; dense retrieval can still overgeneralize among adjacent mechanisms.

Training Data That May Help

Useful training data includes non-overlapping Medical Sciences StackExchange answer-link retrieval, consumer health QA with cited evidence, medical encyclopedia and physiology section retrieval, and hard negatives from adjacent health topics. Evaluation queries, qrels, and positive passages should be excluded.
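The exclusion step can be sketched as a simple filter. This is a minimal example, assuming training data is a list of (query, document) text pairs and that exact matching after case and whitespace normalization is enough. Near-duplicates would need fuzzy matching, which is not shown. All names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def drop_leaky_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Remove training pairs whose query or document appears in the eval set."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    return [
        (query, doc)
        for query, doc in train_pairs
        if normalize(query) not in blocked_queries
        and normalize(doc) not in blocked_docs
    ]


# Invented example data.
eval_queries = ["Is food prepared in a microwave oven less healthy?"]
eval_positive_docs = ["Water-soluble vitamins can leach into cooking water."]
train_pairs = [
    ("is food prepared in a  microwave oven less healthy?", "Some other passage."),
    ("How much sleep do adults need?", "water-soluble vitamins can leach into cooking water."),
    ("How much sleep do adults need?", "Adults generally need regular sleep."),
]
clean_pairs = drop_leaky_pairs(train_pairs, eval_queries, eval_positive_docs)
```

Only the third pair survives: the first reuses an evaluation query and the second reuses an evaluation positive passage.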

Model Improvement Notes

Models should learn to map lay medical questions to evidence passages that support the actual answer. Multi-positive objectives are useful because several sources can support one response. Hard negatives should share common health vocabulary while differing in mechanism, population, or recommendation.
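One possible multi-positive objective is a contrastive loss that sums probability mass over all positives for a query. The sketch below is an assumption about what such an objective could look like, not a prescribed training recipe. The function names, the temperature of 0.05, and the toy scores are hypothetical.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def multi_positive_loss(scores, positive_idx, temperature=0.05):
    """-log of the softmax mass assigned to all positives for one query.

    scores: similarity of the query to each candidate (positives + negatives).
    positive_idx: indices of the candidates that are positives.
    """
    if not positive_idx:
        raise ValueError("at least one positive is required")
    logits = [s / temperature for s in scores]
    positive_logits = [logits[i] for i in positive_idx]
    return _logsumexp(logits) - _logsumexp(positive_logits)


# Toy similarities for four candidates of one query.
scores = [0.8, 0.3, 0.7, 0.1]
loss_when_positives_rank_high = multi_positive_loss(scores, {0, 2})
loss_when_positives_rank_low = multi_positive_loss(scores, {1, 3})
```

The loss is near zero when the positives jointly dominate the candidate list and grows when hard negatives outscore them, so hard negatives from adjacent health topics directly shape the gradient.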

Example Data

Query	Positive document
Is food prepared in a microwave oven less healthy?/nThere are people who avoid preparing their food in microwave ovens for various health-related reasons. The claims most often stated are: Microwave radiation is harmful. Microwaving destroys vitamins and other nutrients. Is there any scientific evidence to suggest that microwaved food is less healthy compared to food prepared in more conventional ways? [405 chars]	How much water you're cooking with. Water-soluble vitamins like B and C can leach out of the vegetables and into the cooking water. For this reason, cooking methods that use little or no water are the best ones. Boiling (unless you're making a stew and so will be consuming the vitamin-filled water) is not a great choice. Microwaving and flash-steaming score high. Fat-soluble vitamins, such as A, D, E, K and carotenoids, are less affected by this factor. How much heat you're cooking with. Heat has been proven to degrade nutrients. Of course, it's tough to avoid heat when cooking. To retain as much of the good stuff as possible, simply limit the amount of time the vegetables are exposed to heat. Which brings us to … [729 chars]
Increase in acne during weight loss. Is it normal?/nI'm 100kg male. I'm losing 1kg-2kg per week with cardio and diet. Recently I noticed an increase in acnes around my arm and back. Also my forehad is more oily than usual. Are these changes normal for someone who is losing weight? I am overweight because of junk food and I still eat junk food. 1/3 of my daily intake is chips, pizza etc. but it is limited to 500cal a day on average. [435 chars]	Once the bacteria is on the scene, it attracts white blood cells to help fight back. This is what ultimately causes the inflammation that we refer to as a pimple. Blocked pores, dirty pores and an excess of oil production can all lead to acne. Some people may notice that exercise and perspiration causes an outbreak. There's long been a rumor that a good sweat will actually clean out your pores, but science says that's not the case. Sweat glands and oil pores are two different things, so not only does sweat not clean out oil pores, but it might actually make things worse. For one, irritants like dust and dirt are more likely to stick to moist skin, which can lead to clogged pores. [691 chars]
Why is it so much harder to swallow pills than it is to swallow food?/nI don't have any real trouble swallowing pills, and I do it several times a day. But when I try to swallow a pill without food or water in my mouth, it is a bit tricky. We're not talking about huge horse pills either, just regular, relatively small pills. I can swallow a whole raw oyster, which is the size of hundreds of pills combined, but a single little capsule or tablet is too much for me to consume without food or water?... [500 / 927 chars]	(A) The bolus is held between the anterior surface of the tongue and hard palate, in a “swallow ready” position (end of oral preparatory stage). The tongue presses against the palate both in front of and behind the bolus to prevent spillage. (B) The bolus is propelled from the oral cavity to the pharynx through the fauces (Oral propulsive stage). The anterior tongue pushes the bolus against the hard palate just behind the upper incisors while posterior tongue drops away from the palate. (C-D) Pharyngeal stage. (C) The soft palate elevates, closing off the nasopharynx. The area of tongue-palate contact spreads posteriorly, squeezing the bolus into the pharynx. The larynx is displaced upward and forward as the epiglottis tilts backward. (D) The upper esophageal sphincter opens. [786 chars]

Source Reference Table

Title	Year	Type	URL
R2MED: A Benchmark for Reasoning-Driven Medical Retrieval	2025	arXiv paper	https://arxiv.org/abs/2505.14558
R2MED project page	2025	project page	https://r2med.github.io/
R2MED GitHub repository	2025	source repository	https://github.com/R2MED/R2MED
R2MED/Medical-Sciences	2025	dataset card	https://huggingface.co/datasets/R2MED/Medical-Sciences

Dataset Information

Field	Value
Nano set	NanoR2MED
Backing dataset	NanoR2MED
Task / split	NanoR2MEDMedicalSciences
Hugging Face dataset	hakari-bench/NanoR2MED
Language	en
Category	natural_language
Queries	88
Documents	10,000
Positive qrels	244
Positives / query avg	2.77
Positives / query min	1
Positives / query median	2.00
Positives / query max	8
Multi-positive queries	58 (65.91%)
Query length avg chars	477.62
Document length avg chars	678.60

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.2140	0.4091	0.6598	top-500
Dense	`harrier_oss_v1_270m`	0.3567	0.7045	0.8197	top-500
Reranking hybrid	`reranking_hybrid`	0.3320	0.6136	0.7910	top-100

Training and Leakage Metadata

Original train split: not_found
Evaluation split origin: R2MED benchmark release sampled into NanoR2MED
Train/eval overlap audit: not_audited
Leakage note: exclude R2MED Medical Sciences evaluation queries, qrels, and positive passages
Multi-positive training: multi_positive_objective
Useful training data: non-overlapping Medical Sciences StackExchange answer-link retrieval, consumer health QA with cited evidence, medical encyclopedia and physiology section retrieval, hard negatives from adjacent health topics