HAKARI-Bench

NanoMedical

Overview

NanoMedical is a multilingual medical, biomedical, and public-health retrieval group. It covers Chinese medical consultation answer selection, clinician-facing clinical passage retrieval, consumer medical FAQ retrieval, nutrition and health literature search, public-health FAQ retrieval in Arabic, scientific claim evidence retrieval, and COVID-19 literature retrieval in English and Polish. The group is useful because it treats medical retrieval as several different evidence-matching problems rather than one domain.

The group contains 1,586 queries, 66,052 task-local documents, and 10,438 positive qrel rows. It should be read as a retrieval benchmark, not as a clinical decision tool. Some tasks retrieve scientific abstracts, some retrieve public guidance, and some retrieve online consultation answers. Those settings have different risks, document styles, and training requirements.

What This Group Measures

NanoMedical measures whether retrieval systems can connect medical questions, claims, and information needs to the right evidence surface. NanoCUREv1 retrieves biomedical passages for clinician-oriented questions. NanoNFCorpus maps short lay health topics to scientific articles. NanoSciFact and NanoSciFactPL retrieve abstracts that support or refute biomedical claims. NanoTRECCOVID and NanoTRECCOVIDPL retrieve COVID-19 literature records. NanoMedicalQA and NanoPublicHealthQA retrieve trusted-source answers, while NanoCmedqa and NanoCMedQAv2reranking retrieve Chinese consultation answers.

The group also measures multilingual robustness. English, Chinese, Arabic, and Polish are all present, and the translated Polish scientific tasks behave differently from their English counterparts: NanoSciFactPL scores below NanoSciFact under every profile, and the best profile for TREC-COVID shifts from BM25 in English to the reranking hybrid in Polish. A model that handles English medical abstracts well may still fail on Chinese patient-style questions or Arabic public-health FAQ wording.

Task Families

Dataset Shape

The group has ten task pages. Positive density varies widely. NanoCUREv1 averages 25.91 positives per query and NanoNFCorpus averages 18.59, so those tasks evaluate many-relevant-document ranking. NanoMedicalQA, NanoPublicHealthQA, and both TREC-COVID Nano splits are single-positive in the metadata. The Chinese consultation tasks and SciFact variants sit between those extremes.
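The density figures above follow directly from the per-task query and positive counts. The sketch below recomputes them; the counts are copied from the Task Summary table, while the names `COUNTS` and `positives_per_query` are illustrative and not part of any benchmark tooling.

```python
# Per-task (queries, positive qrel rows), copied from the Task Summary table.
COUNTS = {
    "NanoCMedQAv2reranking": (200, 377),
    "NanoCUREv1": (200, 5181),
    "NanoCmedqa": (200, 324),
    "NanoMedicalQA": (200, 200),
    "NanoNFCorpus": (200, 3718),
    "NanoPublicHealthQA": (86, 86),
    "NanoSciFact": (200, 226),
    "NanoSciFactPL": (200, 226),
    "NanoTRECCOVID": (50, 50),
    "NanoTRECCOVIDPL": (50, 50),
}


def positives_per_query(counts):
    """Return the average number of positive qrel rows per query for each task."""
    return {task: positives / queries for task, (queries, positives) in counts.items()}


density = positives_per_query(COUNTS)

# Tasks whose metadata lists exactly one positive per query.
single_positive = sorted(task for task, d in density.items() if d == 1.0)

# Group-level totals and the overall positives-per-query average.
total_queries = sum(q for q, _ in COUNTS.values())
total_positives = sum(p for _, p in COUNTS.values())
group_density = total_positives / total_queries

for task, d in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{task:24s} {d:7.3f}")
print(f"{'group':24s} {group_density:7.3f}")
```

Sorting by density makes the split visible: two many-positive tasks at the top, four single-positive tasks at the bottom, and the Chinese consultation and SciFact tasks in between.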

Document types also differ. Chinese consultation answers are short. Public-health and MedicalQA answers are medium-length guidance passages. NFCorpus, SciFact, and TREC-COVID use scientific or biomedical article records. CURE uses clinical passages with many positives per query. The group therefore tests answer selection, evidence retrieval, and literature search at the same time.

Retrieval Behavior

BM25 Profile

BM25 is the best nDCG@10 profile only for NanoTRECCOVID, where exact COVID-19 terminology, intervention terms, and biomedical phrases provide useful sparse anchors. It is also strong on NanoSciFact, NanoSciFactPL, and NanoPublicHealthQA, where disease names, claim terms, and public-health phrasing often overlap with the target document.

BM25 is much weaker on Chinese consultation answer selection and broad health literature retrieval. NanoCmedqa and NanoCMedQAv2reranking require matching patient symptoms to useful advice, not only shared Chinese terms. NanoNFCorpus uses short lay topics against technical literature, so many documents share medical terms but differ in actual relevance. The group-level BM25 nDCG@10 is 0.4288.

Dense Profile

Dense retrieval with harrier-oss-270m is best for four tasks: NanoCMedQAv2reranking, NanoCmedqa, NanoMedicalQA, and NanoPublicHealthQA. These tasks rely on answerability and semantic matching between a question and an answer passage or consultation reply. The gains on Chinese medical QA are especially clear: both Chinese tasks roughly double nDCG@10 compared with BM25.

Dense has the highest group-level nDCG@10 at 0.5138. It improves on BM25 for NanoMedicalQA (0.5439 to 0.7308) and NanoPublicHealthQA (0.7379 to 0.8176), suggesting that semantic answer retrieval is important for patient- and public-facing medical questions. Dense is not the best profile on any of the scientific literature or claim-evidence tasks, where exact biomedical terms and hybrid candidate coverage remain important.
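The group-level figures quoted for BM25 (0.4288) and dense (0.5138) are consistent with a query-weighted mean of the per-task nDCG@10 values. The page does not state the aggregation rule for nDCG@10 explicitly, so the sketch below is an assumption that happens to reproduce both numbers; `ROWS` and `query_weighted_mean` are illustrative names, and the scores are copied from the Task Summary table.

```python
# task: (queries, BM25, dense, reranking hybrid) nDCG@10 from the Task Summary table.
ROWS = {
    "NanoCMedQAv2reranking": (200, 0.1527, 0.3209, 0.2529),
    "NanoCUREv1": (200, 0.4693, 0.5003, 0.5262),
    "NanoCmedqa": (200, 0.1669, 0.3380, 0.2591),
    "NanoMedicalQA": (200, 0.5439, 0.7308, 0.6510),
    "NanoNFCorpus": (200, 0.2921, 0.3070, 0.3182),
    "NanoPublicHealthQA": (86, 0.7379, 0.8176, 0.7847),
    "NanoSciFact": (200, 0.7017, 0.7334, 0.7506),
    "NanoSciFactPL": (200, 0.5750, 0.6061, 0.6538),
    "NanoTRECCOVID": (50, 0.3983, 0.3875, 0.3193),
    "NanoTRECCOVIDPL": (50, 0.3266, 0.3585, 0.3864),
}

# Column index of each profile inside a ROWS tuple.
COLUMNS = {"BM25": 1, "Dense": 2, "Reranking hybrid": 3}


def query_weighted_mean(rows, column):
    """Average a per-task score, weighting each task by its query count."""
    total_queries = sum(row[0] for row in rows.values())
    weighted_sum = sum(row[0] * row[column] for row in rows.values())
    return weighted_sum / total_queries


group_scores = {name: query_weighted_mean(ROWS, col) for name, col in COLUMNS.items()}
for name, score in group_scores.items():
    print(f"{name:18s} {score:.4f}")
```

Under this weighting the 50-query TREC-COVID tasks and the 86-query NanoPublicHealthQA task contribute less than the 200-query tasks, which is one reason the aggregate should be read alongside the per-task scores.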

Reranking Hybrid Profile

The reranking hybrid profile is best for NanoCUREv1, NanoNFCorpus, NanoSciFact, NanoSciFactPL, and NanoTRECCOVIDPL. These are mostly scientific or literature-style tasks where exact biomedical terms and semantic relatedness both matter. Hybrid also has the best query-weighted recall@100 at 0.7507, which is important for tasks with many positives per query.

Hybrid is not the best profile for Chinese consultation QA or public-health FAQ retrieval, where dense semantic matching leads top-10 ranking. It is also below BM25 on English NanoTRECCOVID. The practical reading is that hybrid is strong for biomedical candidate generation and claim/literature retrieval, while dense retrieval is more important for question-to-answer medical matching.
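The per-task winners described in the three profile sections can be recomputed from the per-task nDCG@10 scores. The following is a minimal sketch; `SCORES`, `PROFILES`, and `best_profile` are illustrative names, and the values are copied from the Task Summary table.

```python
from collections import Counter

PROFILES = ("BM25", "Dense", "Reranking hybrid")

# task: nDCG@10 for (BM25, dense, reranking hybrid), from the Task Summary table.
SCORES = {
    "NanoCMedQAv2reranking": (0.1527, 0.3209, 0.2529),
    "NanoCUREv1": (0.4693, 0.5003, 0.5262),
    "NanoCmedqa": (0.1669, 0.3380, 0.2591),
    "NanoMedicalQA": (0.5439, 0.7308, 0.6510),
    "NanoNFCorpus": (0.2921, 0.3070, 0.3182),
    "NanoPublicHealthQA": (0.7379, 0.8176, 0.7847),
    "NanoSciFact": (0.7017, 0.7334, 0.7506),
    "NanoSciFactPL": (0.5750, 0.6061, 0.6538),
    "NanoTRECCOVID": (0.3983, 0.3875, 0.3193),
    "NanoTRECCOVIDPL": (0.3266, 0.3585, 0.3864),
}


def best_profile(scores):
    """Map each task to the profile with the highest nDCG@10."""
    best = {}
    for task, values in scores.items():
        top_index = max(range(len(PROFILES)), key=lambda i: values[i])
        best[task] = PROFILES[top_index]
    return best


best = best_profile(SCORES)
wins = Counter(best.values())

for task, profile in best.items():
    print(f"{task:24s} {profile}")
print(dict(wins))
```

Counting wins this way hides margins, so a close result such as NanoNFCorpus (0.3070 dense against 0.3182 hybrid) is treated the same as a large gap such as NanoCmedqa.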

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoCMedQAv2reranking | Chinese medical answer retrieval | zh | 200 | 10,000 | 377 | 1.89 | 0.1527 | 0.3209 | 0.2529 | Dense |
| NanoCUREv1 | Clinical passage retrieval | en | 200 | 10,000 | 5,181 | 25.91 | 0.4693 | 0.5003 | 0.5262 | Reranking hybrid |
| NanoCmedqa | Chinese consultation answer retrieval | zh | 200 | 10,000 | 324 | 1.62 | 0.1669 | 0.3380 | 0.2591 | Dense |
| NanoMedicalQA | Medical FAQ retrieval | en | 200 | 2,007 | 200 | 1.00 | 0.5439 | 0.7308 | 0.6510 | Dense |
| NanoNFCorpus | Biomedical literature retrieval | en | 200 | 3,593 | 3,718 | 18.59 | 0.2921 | 0.3070 | 0.3182 | Reranking hybrid |
| NanoPublicHealthQA | Public-health FAQ retrieval | ar | 86 | 86 | 86 | 1.00 | 0.7379 | 0.8176 | 0.7847 | Dense |
| NanoSciFact | Scientific claim evidence retrieval | en | 200 | 5,183 | 226 | 1.13 | 0.7017 | 0.7334 | 0.7506 | Reranking hybrid |
| NanoSciFactPL | Scientific claim evidence retrieval | pl | 200 | 5,183 | 226 | 1.13 | 0.5750 | 0.6061 | 0.6538 | Reranking hybrid |
| NanoTRECCOVID | COVID-19 literature retrieval | en | 50 | 10,000 | 50 | 1.00 | 0.3983 | 0.3875 | 0.3193 | BM25 |
| NanoTRECCOVIDPL | COVID-19 literature retrieval | pl | 50 | 10,000 | 50 | 1.00 | 0.3266 | 0.3585 | 0.3864 | Reranking hybrid |

Interpretation Notes for Model Researchers

NanoMedical should be interpreted by retrieval surface. Dense-led gains on Chinese consultation, MedicalQA, and Arabic PublicHealthQA indicate better question-to-answer matching. Hybrid-led gains on CURE, NFCorpus, SciFact, and Polish TREC-COVID indicate a better combination of biomedical term matching and semantic evidence retrieval. BM25 remains a meaningful baseline where exact biomedical terminology is central, but it is the best profile on only one of the ten tasks (English NanoTRECCOVID).

The high positive density of CURE and NFCorpus also changes what "good" means. For those tasks, retrieving one relevant passage is not enough; ranking many relevant biomedical documents early matters. Medical model comparisons should therefore inspect per-task nDCG@10 and recall@100 instead of relying only on the aggregate group score.
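To make the effect of positive density concrete, here is a minimal sketch of nDCG@10 and recall@100 under binary relevance. It assumes every positive qrel row counts as relevance 1; the benchmark's actual metric implementation is not described on this page and may differ (for example, by using graded relevance). All function names and document IDs are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    ideal_hits = min(len(positives), k)
    if ideal_hits == 0:
        return 0.0
    # Rank positions are 0-based here, so the discount is log2(position + 2).
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg


def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of all positives that appear in the top k results."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & set(positives)) / len(positives)


# One relevant document at rank 1, followed by 99 non-relevant documents.
ranking = ["p0"] + [f"n{i}" for i in range(99)]

single_positive = {"p0"}                     # e.g. a single-positive FAQ query
many_positives = {f"p{i}" for i in range(25)}  # e.g. a CURE-like query

print(ndcg_at_k(ranking, single_positive), recall_at_k(ranking, single_positive))
print(ndcg_at_k(ranking, many_positives), recall_at_k(ranking, many_positives))
```

The same ranking is perfect for the single-positive query but scores far lower on both metrics for the 25-positive query, because the ideal ranking there fills all ten top positions with relevant documents.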

Training and Leakage Notes

Useful training data includes clinical question-passage pairs, PubMed and PMC retrieval, CORD-19 judgments, NFCorpus-style health-topic supervision, SciFact-style claim-evidence pairs, trusted medical FAQ data, Arabic public-health QA, Chinese medical consultation QA, and Polish biomedical retrieval pairs. Multi-positive tasks should preserve all relevant passages or abstracts when possible.

Leakage control should exclude Nano evaluation queries, qrels, positive documents, answer strings, consultation replies, source FAQ pages, and translated variants. Medical datasets often contain repeated guidance templates or translated near-duplicates, so overlap checks should include semantic and document-level duplication, not just exact query text.
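A lexical near-duplicate check is one way to go beyond exact query matching. The sketch below compares character n-gram sets with Jaccard similarity, which avoids language-specific tokenization for Chinese and Arabic text. It is an illustration only: the function names, the 5-gram size, and the 0.8 threshold are assumptions rather than settings from the benchmark, and semantic duplication (paraphrases, translations) would additionally need an embedding-based comparison that is not shown here.

```python
import re
import unicodedata


def normalize(text):
    """Apply Unicode NFKC normalization, case folding, and whitespace collapsing."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def char_ngrams(text, n=5):
    """Return the set of character n-grams of the normalized text."""
    text = normalize(text)
    if not text:
        return set()
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def flag_leaks(train_texts, eval_texts, threshold=0.8, n=5):
    """Return (train_index, eval_text) pairs whose n-gram overlap meets the threshold."""
    eval_grams = [(text, char_ngrams(text, n)) for text in eval_texts]
    flagged = []
    for index, train_text in enumerate(train_texts):
        grams = char_ngrams(train_text, n)
        for eval_text, grams_eval in eval_grams:
            if jaccard(grams, grams_eval) >= threshold:
                flagged.append((index, eval_text))
                break  # one match is enough to exclude this training item
    return flagged


eval_queries = ["What are the symptoms of influenza?"]
train_queries = [
    "what are the  symptoms of Influenza?",  # differs only in case and spacing
    "How is anemia treated in children?",
]
leaks = flag_leaks(train_queries, eval_queries)
print(leaks)
```

The pairwise loop is quadratic in the number of texts, so a real corpus would likely need an index such as MinHash or blocking by shared n-grams; the same comparison can be applied to positive documents and answer strings, not only to queries.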

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| CURE: A Dataset for Clinical Understanding & Retrieval Evaluation | 2025 | benchmark paper | https://doi.org/10.1145/3711896.3737435 |
| A Full-Text Learning to Rank Dataset for Medical Information Retrieval | 2016 | source task paper | http://www.cl.uni-heidelberg.de/~riezler/publications/papers/ECIR2016.pdf |
| Searching for Scientific Evidence in a Pandemic: An Overview of TREC-COVID | 2021 | source task paper | https://arxiv.org/abs/2104.09632 |
| BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language | 2024 | benchmark paper | https://aclanthology.org/2024.lrec-main.194/ |
| Fact or Fiction: Verifying Scientific Claims | 2020 | source task paper | https://aclanthology.org/2020.emnlp-main.609/ |
| A Question-Entailment Approach to Question Answering | 2019 | source task paper | https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4 |
| publichealth-qa | 2024 | dataset card | https://huggingface.co/datasets/xhluca/publichealth-qa |
| DuReader_retrieval | 2022 | source task paper | https://aclanthology.org/2022.emnlp-main.357/ |
| Multi-Scale Attentive Interaction Networks for Chinese Medical Question Answer Selection | 2018 | source task paper | https://doi.org/10.1109/ACCESS.2018.2883637 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 10 |
| Queries | 1,586 |
| Split-local documents | 66,052 |
| Positive qrels | 10,438 |
| Languages | ar, en, pl, zh |
| Categories | natural_language |
| Positives / query avg | 6.58 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoCmedqa | NanoMedical | zh | natural_language | 200 | 10,000 | 324 | 0.1669 | 0.3380 | 0.2591 | Dense |
| NanoCMedQAv2reranking | NanoMedical | zh | natural_language | 200 | 10,000 | 377 | 0.1527 | 0.3209 | 0.2529 | Dense |
| NanoCUREv1 | NanoMedical | en | natural_language | 200 | 10,000 | 5,181 | 0.4693 | 0.5003 | 0.5262 | Reranking hybrid |
| NanoMedicalQA | NanoMedical | en | natural_language | 200 | 2,007 | 200 | 0.5439 | 0.7308 | 0.6510 | Dense |
| NanoNFCorpus | NanoMedical | en | natural_language | 200 | 3,593 | 3,718 | 0.2921 | 0.3070 | 0.3182 | Reranking hybrid |
| NanoPublicHealthQA | NanoMedical | ar | natural_language | 86 | 86 | 86 | 0.7379 | 0.8176 | 0.7847 | Dense |
| NanoSciFact | NanoMedical | en | natural_language | 200 | 5,183 | 226 | 0.7017 | 0.7334 | 0.7506 | Reranking hybrid |
| NanoSciFactPL | NanoMedical | pl | natural_language | 200 | 5,183 | 226 | 0.5750 | 0.6061 | 0.6538 | Reranking hybrid |
| NanoTRECCOVID | NanoMedical | en | natural_language | 50 | 10,000 | 50 | 0.3983 | 0.3875 | 0.3193 | BM25 |
| NanoTRECCOVIDPL | NanoMedical | pl | natural_language | 50 | 10,000 | 50 | 0.3266 | 0.3585 | 0.3864 | Reranking hybrid |