NanoMedical
Overview
NanoMedical is a multilingual medical, biomedical, and public-health retrieval group. It covers Chinese medical consultation answer selection, clinician-facing clinical passage retrieval, consumer medical FAQ retrieval, nutrition and health literature search, public-health FAQ retrieval in Arabic, scientific claim evidence retrieval, and COVID-19 literature retrieval in English and Polish. The group is useful because it treats medical retrieval as several distinct evidence-matching problems rather than as one uniform domain.
The group contains 1,586 queries, 66,052 task-local documents, and 10,438 positive qrel rows. It should be read as a retrieval benchmark, not as a clinical decision tool. Some tasks retrieve scientific abstracts, some retrieve public guidance, and some retrieve online consultation answers. Those settings have different risks, document styles, and training requirements.
What This Group Measures
NanoMedical measures whether retrieval systems can connect medical questions, claims, and information needs to the right evidence surface. NanoCUREv1 retrieves biomedical passages for clinician-oriented questions. NanoNFCorpus maps short lay health topics to scientific articles. NanoSciFact and NanoSciFactPL retrieve abstracts that support or refute biomedical claims. NanoTRECCOVID and NanoTRECCOVIDPL retrieve COVID-19 literature records. NanoMedicalQA and NanoPublicHealthQA retrieve trusted-source answers, while NanoCmedqa and NanoCMedQAv2reranking retrieve Chinese consultation answers.
The group also measures multilingual robustness. English, Chinese, Arabic, and Polish are all present, and the translated Polish scientific tasks behave differently from their English counterparts. A model that handles English medical abstracts well may still fail on Chinese patient-style questions or Arabic public-health FAQ wording.
Task Families
- Clinical passage retrieval: NanoCUREv1 retrieves biomedical passages for clinician-oriented questions.
- Consumer medical QA and FAQ retrieval: NanoMedicalQA and NanoPublicHealthQA retrieve guidance or public-health answers.
- Chinese consultation answer retrieval: NanoCmedqa and NanoCMedQAv2reranking retrieve answer candidates for patient-style Chinese questions.
- Medical and nutrition literature retrieval: NanoNFCorpus retrieves biomedical literature for lay health topics.
- Scientific claim evidence retrieval: NanoSciFact and NanoSciFactPL retrieve biomedical abstracts for claims.
- COVID-19 literature retrieval: NanoTRECCOVID and NanoTRECCOVIDPL retrieve pandemic literature records.
Dataset Shape
The group has ten task pages. Positive density varies widely. NanoCUREv1 averages 25.91 positives per query and NanoNFCorpus averages 18.59, so those tasks evaluate many-relevant-document ranking. NanoMedicalQA, NanoPublicHealthQA, and both TREC-COVID Nano splits are single-positive in the metadata. The Chinese consultation tasks and SciFact variants sit between those extremes.
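The density figures above follow directly from the query and positive counts in the Task Summary table. The short Python sketch below recomputes them and separates single-positive tasks from multi-positive ones. The `TASKS` list and the `positives_per_query` helper are illustrative names, not part of any NanoMedical tooling.

```python
# (task, queries, positive qrel rows), copied from the Task Summary table.
TASKS = [
    ("NanoCMedQAv2reranking", 200, 377),
    ("NanoCUREv1", 200, 5181),
    ("NanoCmedqa", 200, 324),
    ("NanoMedicalQA", 200, 200),
    ("NanoNFCorpus", 200, 3718),
    ("NanoPublicHealthQA", 86, 86),
    ("NanoSciFact", 200, 226),
    ("NanoSciFactPL", 200, 226),
    ("NanoTRECCOVID", 50, 50),
    ("NanoTRECCOVIDPL", 50, 50),
]


def positives_per_query(queries: int, positives: int) -> float:
    """Average number of positive qrel rows per query for one task."""
    return positives / queries


density = {name: positives_per_query(q, p) for name, q, p in TASKS}

# Tasks where every query has exactly one positive on average.
single_positive = sorted(name for name, d in density.items() if d == 1.0)

# Group-level density: total positives over total queries.
group_density = sum(p for _, _, p in TASKS) / sum(q for _, q, _ in TASKS)

for name, d in sorted(density.items(), key=lambda item: -item[1]):
    print(f"{name:24s} {d:.3f}")
print(f"group average: {group_density:.3f}")
```

Sorting by density puts NanoCUREv1 and NanoNFCorpus at the top, which is a quick way to see which tasks need ranking metrics that reward retrieving many relevant documents.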
Document types are also different. Chinese consultation answers are short. Public-health and MedicalQA answers are medium-length guidance passages. NFCorpus, SciFact, and TREC-COVID use scientific or biomedical article records. CURE uses clinical passages with many positives per query. The group therefore tests answer selection, evidence retrieval, and literature search at the same time.
Retrieval Behavior
BM25 Profile
BM25 has the best nDCG@10 only on NanoTRECCOVID (0.3983, against 0.3875 for dense and 0.3193 for the reranking hybrid), where exact COVID-19 terminology, intervention terms, and biomedical phrases provide useful sparse anchors. It is also strong on NanoSciFact, NanoSciFactPL, and NanoPublicHealthQA, where disease names, claim terms, and public-health phrasing often overlap with the target document.
BM25 is much weaker on Chinese consultation answer selection (0.1669 on NanoCmedqa, 0.1527 on NanoCMedQAv2reranking) and on broad health literature retrieval (0.2921 on NanoNFCorpus). NanoCmedqa and NanoCMedQAv2reranking require matching patient symptoms to useful advice, not only shared Chinese terms. NanoNFCorpus uses short lay topics against technical literature, so many documents share medical terms but differ in actual relevance. The group-level BM25 nDCG@10 is 0.4288.
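To illustrate why exact terminology helps BM25 while paraphrased patient wording does not, here is a minimal Okapi BM25 sketch. It assumes whitespace tokenization, the common defaults `k1 = 1.2` and `b = 0.75`, a non-negative IDF variant, and three made-up documents. The benchmark's actual BM25 configuration and tokenizers are not described here, and Chinese or Arabic text would need language-appropriate tokenization.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Hypothetical documents, for illustration only.
docs = [
    "remdesivir treatment outcomes in hospitalized covid-19 patients",
    "rest and fluids usually help when you feel feverish and tired",
    "dietary fiber intake and cardiovascular risk",
]

# Query that reuses the document's terminology.
term_scores = bm25_scores("remdesivir covid-19 outcomes", docs)
# Lay paraphrase with no shared tokens with the relevant advice passage.
lay_scores = bm25_scores("what should I do about a high temperature", docs)

print(term_scores)
print(lay_scores)
```

The terminology-heavy query gives a positive score only to the first document, while the lay paraphrase shares no tokens with the advice passage it is really about and scores zero everywhere. That is the failure mode described above for patient-style questions.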
Dense Profile
Dense retrieval with harrier-oss-270m is best for four tasks: NanoCMedQAv2reranking, NanoCmedqa, NanoMedicalQA, and NanoPublicHealthQA. These tasks rely on answerability and semantic matching between a question and an answer passage or consultation reply. The gains on Chinese medical QA are especially clear: both Chinese tasks roughly double nDCG@10 compared with BM25 (0.1527 to 0.3209 on NanoCMedQAv2reranking and 0.1669 to 0.3380 on NanoCmedqa).
Dense has the highest group-level nDCG@10 at 0.5138, against 0.4288 for BM25. Its gains over BM25 on NanoMedicalQA (0.5439 to 0.7308) and NanoPublicHealthQA (0.7379 to 0.8176) suggest that semantic answer retrieval is important for patient- and public-facing medical questions. Dense does not lead on any of the clinical-passage, literature, or claim-evidence tasks, where exact biomedical terms and hybrid candidate coverage remain important.
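The group-level figures quoted for BM25 (0.4288) and dense (0.5138) are consistent with a query-weighted mean of the per-task nDCG@10 values, in which each task contributes in proportion to its number of queries. That weighting is inferred from the numbers rather than stated on this page, so the sketch below is best read as a consistency check. `ROWS` and `query_weighted_mean` are illustrative names.

```python
# (task, queries, BM25, dense, reranking hybrid) nDCG@10 from the Task Summary table.
ROWS = [
    ("NanoCMedQAv2reranking", 200, 0.1527, 0.3209, 0.2529),
    ("NanoCUREv1", 200, 0.4693, 0.5003, 0.5262),
    ("NanoCmedqa", 200, 0.1669, 0.3380, 0.2591),
    ("NanoMedicalQA", 200, 0.5439, 0.7308, 0.6510),
    ("NanoNFCorpus", 200, 0.2921, 0.3070, 0.3182),
    ("NanoPublicHealthQA", 86, 0.7379, 0.8176, 0.7847),
    ("NanoSciFact", 200, 0.7017, 0.7334, 0.7506),
    ("NanoSciFactPL", 200, 0.5750, 0.6061, 0.6538),
    ("NanoTRECCOVID", 50, 0.3983, 0.3875, 0.3193),
    ("NanoTRECCOVIDPL", 50, 0.3266, 0.3585, 0.3864),
]


def query_weighted_mean(rows, column: int) -> float:
    """Mean of a score column where each task is weighted by its query count."""
    total_queries = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_queries


def macro_mean(rows, column: int) -> float:
    """Unweighted mean over tasks, shown only for contrast."""
    return sum(row[column] for row in rows) / len(rows)


bm25_group = query_weighted_mean(ROWS, 2)
dense_group = query_weighted_mean(ROWS, 3)
hybrid_group = query_weighted_mean(ROWS, 4)

print(f"BM25   weighted={bm25_group:.4f}  macro={macro_mean(ROWS, 2):.4f}")
print(f"Dense  weighted={dense_group:.4f}  macro={macro_mean(ROWS, 3):.4f}")
print(f"Hybrid weighted={hybrid_group:.4f}  macro={macro_mean(ROWS, 4):.4f}")
```

Because the two TREC-COVID tasks have 50 queries each and NanoPublicHealthQA has 86, a query-weighted mean gives them less influence than an unweighted mean over tasks would. This is worth keeping in mind when comparing aggregate scores across reports.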
Reranking Hybrid Profile
The reranking hybrid profile is best for NanoCUREv1, NanoNFCorpus, NanoSciFact, NanoSciFactPL, and NanoTRECCOVIDPL. These are mostly scientific or literature-style tasks where exact biomedical terms and semantic relatedness both matter. Hybrid also has the best query-weighted recall@100 at 0.7507, which is important for tasks with many positives per query.
Hybrid is not the best profile for Chinese consultation QA or public-health FAQ retrieval, where dense semantic matching leads top-10 ranking. It also trails both BM25 and dense on English NanoTRECCOVID (0.3193 against 0.3983 and 0.3875). The practical reading is that hybrid is strong for biomedical candidate generation and claim/literature retrieval, while dense retrieval is more important for question-to-answer medical matching.
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| NanoCMedQAv2reranking | Chinese medical answer retrieval | zh | 200 | 10,000 | 377 | 1.89 | 0.1527 | 0.3209 | 0.2529 | Dense |
| NanoCUREv1 | Clinical passage retrieval | en | 200 | 10,000 | 5,181 | 25.91 | 0.4693 | 0.5003 | 0.5262 | Reranking hybrid |
| NanoCmedqa | Chinese consultation answer retrieval | zh | 200 | 10,000 | 324 | 1.62 | 0.1669 | 0.3380 | 0.2591 | Dense |
| NanoMedicalQA | Medical FAQ retrieval | en | 200 | 2,007 | 200 | 1.00 | 0.5439 | 0.7308 | 0.6510 | Dense |
| NanoNFCorpus | Biomedical literature retrieval | en | 200 | 3,593 | 3,718 | 18.59 | 0.2921 | 0.3070 | 0.3182 | Reranking hybrid |
| NanoPublicHealthQA | Public-health FAQ retrieval | ar | 86 | 86 | 86 | 1.00 | 0.7379 | 0.8176 | 0.7847 | Dense |
| NanoSciFact | Scientific claim evidence retrieval | en | 200 | 5,183 | 226 | 1.13 | 0.7017 | 0.7334 | 0.7506 | Reranking hybrid |
| NanoSciFactPL | Scientific claim evidence retrieval | pl | 200 | 5,183 | 226 | 1.13 | 0.5750 | 0.6061 | 0.6538 | Reranking hybrid |
| NanoTRECCOVID | COVID-19 literature retrieval | en | 50 | 10,000 | 50 | 1.00 | 0.3983 | 0.3875 | 0.3193 | BM25 |
| NanoTRECCOVIDPL | COVID-19 literature retrieval | pl | 50 | 10,000 | 50 | 1.00 | 0.3266 | 0.3585 | 0.3864 | Reranking hybrid |
Interpretation Notes for Model Researchers
NanoMedical should be interpreted by retrieval surface. Dense-led gains on Chinese consultation, MedicalQA, and Arabic PublicHealthQA indicate better question-to-answer matching. Hybrid-led gains on CURE, NFCorpus, SciFact, and Polish TREC-COVID indicate better combination of biomedical term matching and semantic evidence retrieval. BM25 remains a meaningful baseline where exact biomedical terminology is central, but it leads on only one of the ten tasks.
The high positive density of CURE and NFCorpus also changes what "good" means. For those tasks, retrieving one relevant passage is not enough; ranking many relevant biomedical documents early matters. Medical model comparisons should therefore inspect per-task nDCG@10 and recall@100 instead of relying only on the aggregate group score.
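The effect of positive density on the metrics can be made concrete with a small sketch of nDCG@10 and recall@100. It assumes binary relevance (every positive qrel row has gain 1), which may differ from the grading used by the benchmark's own evaluation. The document IDs and the 20-positive query are hypothetical.

```python
import math
from typing import Sequence, Set


def ndcg_at_k(ranked_ids: Sequence[str], positives: Set[str], k: int = 10) -> float:
    """Binary-gain nDCG@k: positives have gain 1, all other documents gain 0."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    # Ideal ranking places as many positives as fit in the top k.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids: Sequence[str], positives: Set[str], k: int = 100) -> float:
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & positives) / len(positives)


# One relevant document at rank 1, followed by 99 non-relevant documents.
ranking = ["d1"] + [f"x{i}" for i in range(99)]

single_positive = {"d1"}                          # e.g. a single-positive FAQ query
multi_positive = {f"d{i}" for i in range(1, 21)}  # hypothetical query with 20 positives

print(ndcg_at_k(ranking, single_positive), recall_at_k(ranking, single_positive))
print(ndcg_at_k(ranking, multi_positive), recall_at_k(ranking, multi_positive))
```

The same ranking is perfect for the single-positive query but scores far lower on both metrics for the 20-positive query, which is why per-task inspection matters for CURE and NFCorpus.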
Training and Leakage Notes
Useful training data includes clinical question-passage pairs, PubMed and PMC retrieval, CORD-19 judgments, NFCorpus-style health-topic supervision, SciFact-style claim-evidence pairs, trusted medical FAQ data, Arabic public-health QA, Chinese medical consultation QA, and Polish biomedical retrieval pairs. Multi-positive tasks should preserve all relevant passages or abstracts when possible.
Leakage control should exclude Nano evaluation queries, qrels, positive documents, answer strings, consultation replies, source FAQ pages, and translated variants. Medical datasets often contain repeated guidance templates or translated near-duplicates, so overlap checks should include semantic and document-level duplication, not just exact query text.
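A first-pass overlap check along these lines can be sketched as follows. It flags training texts that match an evaluation text exactly after normalization, or that share most of their word 3-grams with one. The 0.8 threshold, the shingle size, and all function names are assumptions chosen for illustration. Whitespace shingles do not work for unsegmented Chinese text, and this check cannot detect translated or paraphrased duplicates, which need embedding-based or cross-lingual comparison.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, apply NFKC, replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of word n-grams of the normalized text."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_leaks(train_texts, eval_texts, threshold: float = 0.8):
    """Return (train_index, reason) for training texts that overlap evaluation texts."""
    eval_normalized = {normalize(t) for t in eval_texts}
    eval_shingles = [shingles(t) for t in eval_texts]
    flagged = []
    for idx, text in enumerate(train_texts):
        if normalize(text) in eval_normalized:
            flagged.append((idx, "exact"))
            continue
        own = shingles(text)
        if any(jaccard(own, other) >= threshold for other in eval_shingles):
            flagged.append((idx, "near-duplicate"))
    return flagged


# Hypothetical texts for illustration.
eval_texts = ["Wash your hands with soap and water for at least 20 seconds."]
train_texts = [
    "wash your hands with soap and water for at least 20 seconds",
    "Wash your hands with soap and warm water for at least 20 seconds every time.",
    "Dietary fiber lowers cholesterol.",
]

print(flag_leaks(train_texts, eval_texts))
print(flag_leaks(train_texts, eval_texts, threshold=0.5))
```

The pairwise comparison is quadratic in the number of texts, so a larger corpus would typically need MinHash or another approximate index. The same functions can be run over queries, positive documents, and answer strings separately.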
Source Reference Table
| Source | Year | Type | URL |
| CURE: A Dataset for Clinical Understanding & Retrieval Evaluation | 2025 | benchmark paper | https://doi.org/10.1145/3711896.3737435 |
| A Full-Text Learning to Rank Dataset for Medical Information Retrieval | 2016 | source task paper | http://www.cl.uni-heidelberg.de/~riezler/publications/papers/ECIR2016.pdf |
| Searching for Scientific Evidence in a Pandemic: An Overview of TREC-COVID | 2021 | source task paper | https://arxiv.org/abs/2104.09632 |
| BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language | 2024 | benchmark paper | https://aclanthology.org/2024.lrec-main.194/ |
| Fact or Fiction: Verifying Scientific Claims | 2020 | source task paper | https://aclanthology.org/2020.emnlp-main.609/ |
| A Question-Entailment Approach to Question Answering | 2019 | source task paper | https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4 |
| publichealth-qa | 2024 | dataset card | https://huggingface.co/datasets/xhluca/publichealth-qa |
| DuReader_retrieval | 2022 | source task paper | https://aclanthology.org/2022.emnlp-main.357/ |
| Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection | 2018 | source task paper | https://doi.org/10.1109/ACCESS.2018.2883637 |
Metadata Summary
| Field | Value |
| Task pages | 10 |
| Queries | 1,586 |
| Split-local documents | 66,052 |
| Positive qrels | 10,438 |
| Languages | ar, en, pl, zh |
| Categories | natural_language |
| Positives / query avg | 6.58 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| NanoCmedqa | NanoMedical | zh | natural_language | 200 | 10,000 | 324 | 0.1669 | 0.3380 | 0.2591 | Dense |
| NanoCMedQAv2reranking | NanoMedical | zh | natural_language | 200 | 10,000 | 377 | 0.1527 | 0.3209 | 0.2529 | Dense |
| NanoCUREv1 | NanoMedical | en | natural_language | 200 | 10,000 | 5,181 | 0.4693 | 0.5003 | 0.5262 | Reranking hybrid |
| NanoMedicalQA | NanoMedical | en | natural_language | 200 | 2,007 | 200 | 0.5439 | 0.7308 | 0.6510 | Dense |
| NanoNFCorpus | NanoMedical | en | natural_language | 200 | 3,593 | 3,718 | 0.2921 | 0.3070 | 0.3182 | Reranking hybrid |
| NanoPublicHealthQA | NanoMedical | ar | natural_language | 86 | 86 | 86 | 0.7379 | 0.8176 | 0.7847 | Dense |
| NanoSciFact | NanoMedical | en | natural_language | 200 | 5,183 | 226 | 0.7017 | 0.7334 | 0.7506 | Reranking hybrid |
| NanoSciFactPL | NanoMedical | pl | natural_language | 200 | 5,183 | 226 | 0.5750 | 0.6061 | 0.6538 | Reranking hybrid |
| NanoTRECCOVID | NanoMedical | en | natural_language | 50 | 10,000 | 50 | 0.3983 | 0.3875 | 0.3193 | BM25 |
| NanoTRECCOVIDPL | NanoMedical | pl | natural_language | 50 | 10,000 | 50 | 0.3266 | 0.3585 | 0.3864 | Reranking hybrid |