HAKARI-Bench

NanoMedical / NanoMedicalQA

Overview

NanoMedical / NanoMedicalQA is an English medical FAQ retrieval task. Queries are short consumer-health questions, and candidate documents are answer passages from trusted sources. The task draws on medical question answering work based on Recognizing Question Entailment (RQE) and on MedQuAD-style trusted medical QA resources. In this Nano split, each query has exactly one positive document, so the main challenge is retrieving the right answer type for a disease or condition. The task is useful for studying consumer-health retrieval in which disease names are strong lexical anchors but relevance depends on whether the passage answers a question about symptoms, diagnosis, prevention, treatment, or definition.

Details

What the Original Data Measures

The source paper studies medical question answering through question entailment and reliable answer retrieval. It introduces MedQuAD, a large collection of medical question-answer pairs from trusted sources, and emphasizes matching user questions to already answered medical questions and their answers.

This retrieval task presents the same problem in direct form: the query is a medical question, and the target document is an answer-bearing passage. The answer text is consumer-facing guidance rather than a scientific abstract.

Observed Data Profile

The Nano split contains 200 queries, 2,007 documents, and 200 positive qrel rows. Each query has exactly one positive. Queries average 54.23 characters, while documents average 1,102.43 characters.
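The profile figures above are simple aggregates, and a minimal sketch of how they could be recomputed is shown here. The input layout (a query dict, a document dict, and a list of qrel pairs) and the function name `profile` are assumptions made for illustration, not the benchmark's actual loader or schema.

```python
from statistics import mean, median


def profile(queries, documents, qrels):
    """Summarise a retrieval split.

    queries:   {query_id: text}          (assumed layout)
    documents: {doc_id: text}            (assumed layout)
    qrels:     list of (query_id, doc_id) positive pairs
    """
    positives = {qid: 0 for qid in queries}
    for qid, _doc_id in qrels:
        positives[qid] += 1
    counts = list(positives.values())
    return {
        "queries": len(queries),
        "documents": len(documents),
        "positive_qrels": len(qrels),
        "positives_per_query_avg": mean(counts),
        "positives_per_query_min": min(counts),
        "positives_per_query_median": median(counts),
        "positives_per_query_max": max(counts),
        "multi_positive_queries": sum(c > 1 for c in counts),
        "query_len_avg_chars": mean(len(t) for t in queries.values()),
        "doc_len_avg_chars": mean(len(t) for t in documents.values()),
    }


if __name__ == "__main__":
    toy_queries = {
        "q1": "What are the symptoms of Nocardiosis ?",
        "q2": "How to diagnose Parasites - Zoonotic Hookworm ?",
    }
    toy_docs = {"d1": "abcd", "d2": "ab"}  # placeholder texts
    toy_qrels = [("q1", "d1"), ("q2", "d2")]
    print(profile(toy_queries, toy_docs, toy_qrels))
```

Lengths are counted in Python characters, which is assumed to match the "chars" figures reported for this split.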

The examples ask about nocardiosis symptoms, babesiosis treatment, zoonotic hookworm diagnosis, lymphatic filariasis prevention, and hookworm prevention. Many queries follow FAQ templates such as What are the symptoms, How to diagnose, How to prevent, or What are the treatments.
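Because the queries follow a small number of FAQ templates, the answer type and the condition name can often be separated with simple patterns. The sketch below is illustrative only: the pattern list is inferred from the example queries named above, and `parse_query` is a hypothetical helper, not part of the benchmark.

```python
import re

# Hypothetical template patterns, inferred from the example queries.
ANSWER_TYPE_PATTERNS = [
    ("symptoms", re.compile(r"^what are the symptoms of\b", re.I)),
    ("treatment", re.compile(r"^what are the treatments for\b", re.I)),
    ("diagnosis", re.compile(r"^how to diagnose\b", re.I)),
    ("prevention", re.compile(r"^how to prevent\b", re.I)),
]


def parse_query(query):
    """Return (answer_type, condition).

    answer_type is "other" when no known template matches.
    """
    text = query.strip()
    for answer_type, pattern in ANSWER_TYPE_PATTERNS:
        match = pattern.match(text)
        if match:
            # Everything after the template prefix is treated as the condition.
            condition = text[match.end():].strip(" ?")
            return answer_type, condition
    return "other", text.rstrip(" ?")


if __name__ == "__main__":
    for q in [
        "What are the symptoms of Nocardiosis ?",
        "What are the treatments for Parasites - Babesiosis ?",
        "How to diagnose Parasites - Zoonotic Hookworm ?",
    ]:
        print(parse_query(q))
```

Tagging queries this way makes it possible to break results down per answer type, which is where this task is expected to be hardest.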

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.5439, hit@10 of 0.8550, and recall@100 of 0.9200. BM25 performs well because disease names, parasite names, and answer-type words often repeat in the answer passage.

Its main weakness is answer-type confusion. A passage about the correct disease may describe symptoms when the query asks for treatment, or define a condition when the query asks for prevention. Sparse matching can find the disease neighborhood but may not select the exact FAQ section.
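The effect can be illustrated with a generic Okapi BM25 scorer. This is a minimal sketch and not the benchmark's BM25 configuration: the tokenizer, the Lucene-style IDF, and the parameters `k1 = 1.2` and `b = 0.75` are assumptions, and the three toy passages are invented for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


if __name__ == "__main__":
    toy_docs = [
        "Babesiosis symptoms include fever and chills.",       # same disease, wrong type
        "Effective treatments for babesiosis are available.",  # same disease, right type
        "Hookworm larvae live in contaminated soil.",          # different disease
    ]
    query = "What are the treatments for Babesiosis ?"
    for doc, score in zip(toy_docs, bm25_scores(query, toy_docs)):
        print(f"{score:.3f}  {doc}")
```

Both babesiosis passages receive a non-zero score from the shared disease name alone. In real passages of very different lengths and term frequencies, that shared signal can outweigh the few answer-type words that separate them.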

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.7308, hit@10 of 0.8850, and recall@100 of 0.9250. Dense retrieval is clearly stronger than BM25 in top-rank quality (nDCG@10 of 0.7308 against 0.5439), while hit@10 and recall@100 are only slightly higher (0.8850 against 0.8550, and 0.9250 against 0.9200).

This indicates that semantic modeling helps distinguish answer types and match question intent. Dense retrieval can better rank a treatment answer above a definition answer when both share the same condition name.

Reranking Hybrid Evaluation Profile

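The reranking_hybrid subset uses top-100 candidates, with 6 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.6510, hit@10 of 0.8650, and recall@100 of 0.9700. The recall figure is consistent with the safeguard count: 0.9700 of 200 queries is 194 queries with the positive inside the top 100, leaving 6 outside it. Hybrid retrieval has the best recall@100, but dense retrieval remains strongest by nDCG@10 and hit@10.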

This shows that lexical and dense signals are complementary for coverage, while dense retrieval is better at top-rank answer-type selection. A reranker can benefit from the hybrid pool if it explicitly learns FAQ answer categories.
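How such a pool could be assembled is sketched below. The source does not state how the reranking_hybrid candidates are fused, so this example assumes reciprocal rank fusion (RRF) with the commonly used constant `k = 60`, and assumes the safeguard simply appends a missing positive directly after the pool. The function names are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of doc ids with reciprocal rank fusion (assumed method)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_pool(bm25_ranking, dense_ranking, positive_id, pool_size=100):
    """Return the top pool_size fused candidates.

    If the positive is missing, it is appended at rank pool_size + 1
    (the "safeguard" position, assumed behaviour).
    """
    pool = rrf_fuse([bm25_ranking, dense_ranking])[:pool_size]
    if positive_id not in pool:
        pool.append(positive_id)
    return pool


if __name__ == "__main__":
    bm25 = ["a", "b", "c"]
    dense = ["b", "a", "d"]
    print(build_pool(bm25, dense, positive_id="c", pool_size=2))
```

With a safeguard of this kind, a reranker always sees the positive, but recall@100 still counts only positives that reached the top 100 on their own.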

Metric Interpretation for Model Researchers

This is a single-positive task, so nDCG@10 and hit@10 are direct measures of whether the one correct answer passage appears early. Recall@100 shows candidate-generation completeness for reranking.
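With exactly one positive per query, the ideal DCG is 1, so nDCG@10 for a query reduces to 1 / log2(1 + r) when the positive sits at rank r <= 10 and to 0 otherwise. The sketch below computes the three reported metrics from a list of positive ranks. The function name and the use of `None` for "not retrieved" are illustrative assumptions.

```python
import math


def single_positive_metrics(ranks, ndcg_k=10, hit_k=10, recall_k=100):
    """Compute nDCG@k, hit@k and recall@k for a single-positive task.

    ranks: 1-based rank of the positive for each query,
           or None when the positive was not retrieved.
    """
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    # One positive per query means IDCG = 1 / log2(2) = 1.
    ndcg = sum(1.0 / math.log2(1 + r) for r in found if r <= ndcg_k) / n
    hit = sum(1 for r in found if r <= hit_k) / n
    recall = sum(1 for r in found if r <= recall_k) / n
    return {"ndcg": ndcg, "hit": hit, "recall": recall}


if __name__ == "__main__":
    # Positive at rank 1, rank 3, rank 11, and not retrieved.
    print(single_positive_metrics([1, 3, 11, None]))
```

A positive at rank 3 contributes only 0.5 to nDCG@10 while counting fully toward hit@10, which is why two systems with similar hit@10 can differ widely in nDCG@10.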

Because questions are templated, high aggregate scores can hide answer-type errors. Researchers should check whether failures retrieve the right condition but the wrong FAQ section.

Query and Relevance Type Tendencies

Queries are concise medical FAQ questions about symptoms, treatments, diagnosis, prevention, definitions, or causes. Relevant documents are longer consumer-health guidance passages from trusted medical sources.

Relevance requires the exact answer type for the condition. A same-disease passage is not enough if it answers a different question.

Representative Failure Modes

Common failures include retrieving the correct disease but wrong answer type, confusing similar parasites or infections, ranking broad overview passages above specific treatment or diagnosis passages, and over-weighting repeated organism names. Dense systems can still miss exact disease variants; sparse systems often miss the question intent.
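A simple way to separate these failure modes during error analysis is to compare the condition and answer type of the query with those of the top-ranked passage. The sketch below assumes that both have already been labeled with a (condition, answer_type) pair. Such labels are hypothetical metadata for this example and are not stated to exist in the dataset.

```python
from collections import Counter


def classify_result(query_label, top_doc_label):
    """Compare (condition, answer_type) of a query and its top-ranked passage."""
    q_condition, q_type = query_label
    d_condition, d_type = top_doc_label
    same_condition = q_condition.lower() == d_condition.lower()
    same_type = q_type == d_type
    if same_condition and same_type:
        return "correct"
    if same_condition:
        return "right_condition_wrong_answer_type"
    if same_type:
        return "wrong_condition_right_answer_type"
    return "wrong_condition_wrong_answer_type"


def tally(results):
    """results: iterable of (query_label, top_doc_label) pairs."""
    return Counter(classify_result(q, d) for q, d in results)


if __name__ == "__main__":
    toy_results = [
        (("Babesiosis", "treatment"), ("Babesiosis", "treatment")),
        (("Babesiosis", "treatment"), ("Babesiosis", "symptoms")),
        (("Zoonotic Hookworm", "diagnosis"), ("Hookworm", "diagnosis")),
    ]
    print(tally(toy_results))
```

The third toy case shows the disease-variant confusion: the answer type matches, but the passage is about a different, similarly named infection.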

Training Data That May Help

Useful training data includes non-overlapping medical FAQ retrieval pairs, consumer-health question-answer datasets, MedQuAD-style trusted-source QA pairs, and answer-type reranking data for definition, diagnosis, prevention, symptoms, and treatment. Overlapping MedQuAD pages and near-duplicate templated questions should be excluded for clean evaluation.
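A minimal leakage filter along these lines is sketched below. It only catches questions and answers that become identical after lowercasing and punctuation removal, so it is a first pass rather than a complete near-duplicate detector. The function names and the (question, answer) pair layout are assumptions for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse text to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def filter_training_pairs(train_pairs, eval_queries, eval_documents):
    """Drop training pairs that overlap the evaluation split.

    train_pairs:    list of (question, answer) strings
    eval_queries:   iterable of evaluation query strings
    eval_documents: iterable of evaluation document strings
    """
    seen_questions = {normalize(q) for q in eval_queries}
    seen_answers = {normalize(d) for d in eval_documents}
    kept = []
    for question, answer in train_pairs:
        if normalize(question) in seen_questions:
            continue  # same templated question as an evaluation query
        if normalize(answer) in seen_answers:
            continue  # same answer passage as an evaluation document
        kept.append((question, answer))
    return kept


if __name__ == "__main__":
    train = [
        ("what are the symptoms of nocardiosis?", "Some answer text."),
        ("How to prevent malaria ?", "Another answer text."),
    ]
    kept = filter_training_pairs(
        train,
        eval_queries=["What are the symptoms of Nocardiosis ?"],
        eval_documents=["An evaluation passage."],
    )
    print(kept)
```

Filtering by source page, where page identifiers are available, would be a stricter complement to this text-level check.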

Model Improvement Notes

Models should learn both disease matching and answer-type discrimination. Hard negatives should use the same disease name but a different answer type. Rerankers should treat question words such as diagnose, prevent, symptoms, and treatments as central relevance signals, not incidental tokens.
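The hard-negative rule above can be sketched as a simple filter over labeled passages. The passage fields `doc_id`, `condition`, and `answer_type` are hypothetical labels introduced for this example; how they would be obtained is outside the scope of the sketch.

```python
def mine_hard_negatives(query_condition, query_answer_type, passages):
    """Return doc ids about the same condition but with a different answer type.

    passages: list of dicts with hypothetical keys
              "doc_id", "condition", "answer_type".
    """
    target = query_condition.lower()
    return [
        p["doc_id"]
        for p in passages
        if p["condition"].lower() == target
        and p["answer_type"] != query_answer_type
    ]


if __name__ == "__main__":
    toy_passages = [
        {"doc_id": "d1", "condition": "Babesiosis", "answer_type": "treatment"},
        {"doc_id": "d2", "condition": "Babesiosis", "answer_type": "symptoms"},
        {"doc_id": "d3", "condition": "Babesiosis", "answer_type": "prevention"},
        {"doc_id": "d4", "condition": "Nocardiosis", "answer_type": "symptoms"},
    ]
    print(mine_hard_negatives("Babesiosis", "treatment", toy_passages))
```

Negatives chosen this way share the strongest lexical anchor with the positive, so the model can only separate them by attending to the answer-type words in the question.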

Example Data

Query | Positive document
What are the symptoms of Nocardiosis ? [38 chars] | The symptoms of nocardiosis vary depending on which part of your body is affected. Nocardiosis infection most commonly occurs in the lung. If your lungs are infected, you can experience: - Fever - Weight loss - Night sweats - Cough - Chest pain - Pneumonia When lung infections occur, the infection commonly spreads to the brain. If your central nervous system (brain and spinal cord) is infected, you can experience: - Headache - Weakness - Confusion - Seizures (sudden, abnormal electrical activity in the brain) Skin infections can occur when open wounds or cuts come into contact with contaminated soil. If your skin is affected, you can experience: - Ulcers - Nodules sometimes draining and spreading along lymph nodes [823 chars]
What are the treatments for Parasites - Babesiosis ? [52 chars] | Effective treatments are available. People who do not have any symptoms or signs of babesiosis usually do not need to be treated. Before considering treatment, the first step is to make sure the diagnosis is correct. For more information, people should talk to their health care provider. More on: Resources for Health Professionals: Treatment [358 chars]
How to diagnose Parasites - Zoonotic Hookworm ? [47 chars] | Cutaneous larva migrans (CLM) is a clinical diagnosis based on the presence of the characteristic signs and symptoms, and exposure history to zoonotic hookworm. For example, the diagnosis can be made based on finding red, raised tracks in the skin that are very itchy. This is usually found on the feet or lower part of the legs on persons who have recently traveled to tropical areas and spent time at the beach. There is no blood test for zoonotic hookworm infection. Persons who think they have CLM should consult their health care provider for accurate diagnosis. [567 chars]

Source Reference Table

Title | Year | Type | URL
A question-entailment approach to question answering | 2019 | BMC Bioinformatics article | https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-019-3119-4
A question-entailment approach to question answering | 2019 | DOI | https://doi.org/10.1186/s12859-019-3119-4

Dataset Information

Field | Value
Nano set | NanoMedical
Backing dataset | NanoMedical
Task / split | NanoMedicalQA
Hugging Face dataset | hakari-bench/NanoMedical
Language | en
Category | natural_language
Queries | 200
Documents | 2,007
Positive qrels | 200
Positives / query avg | 1.00
Positives / query min | 1
Positives / query median | 1.00
Positives / query max | 1
Multi-positive queries | 0 (0.00%)
Query length avg chars | 54.23
Document length avg chars | 1,102.43

Candidate Subsets

Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates
BM25 | bm25 | 0.5439 | 0.8550 | 0.9200 | top-500
Dense | harrier_oss_v1_270m | 0.7308 | 0.8850 | 0.9250 | top-500
Reranking hybrid | reranking_hybrid | 0.6510 | 0.8650 | 0.9700 | top-100

Training and Leakage Metadata