HAKARI-Bench

NanoMMTEB-v2 / treccovid

Overview

NanoMMTEB-v2 / treccovid is an English biomedical ad-hoc retrieval task from the TREC-COVID challenge. Queries are COVID-19 information needs, and documents are scientific article titles and abstracts from pandemic literature. The Nano split has 50 queries, 10,000 documents, and 4,527 positive qrel rows. Every query is heavily multi-positive, averaging 90.54 relevant articles. Current diagnostics show reranking_hybrid as the strongest profile, dense retrieval second, and BM25 third. All three profiles place at least one relevant article in the top 10 for at least 92% of queries (hit@10 of 0.9200 or higher), but ranking many relevant biomedical articles remains difficult: the best nDCG@10 is 0.4505.

Details

What the Original Data Measures

TREC-COVID constructed an information retrieval test collection over the CORD-19 literature during the COVID-19 pandemic. It used evolving search topics, pooled results from participating systems, and relevance judgments from biomedical assessors. The task reflects real pandemic information needs where terminology, evidence, and literature coverage changed quickly.

This retrieval task measures biomedical literature search. A model must return articles addressing questions about treatments, transmission, testing, serology, biomarkers, epidemiology, social distancing, or clinical outcomes.

Observed Data Profile

The Nano split contains 50 queries, 10,000 documents, and 4,527 positive qrel rows. Every query has multiple positives: the average is 90.54 positives per query, with a minimum of 22, a median of 100, and a maximum of 100. Because the median equals the maximum, at least half of the queries have exactly 100 positives. Queries average 69.24 characters, while documents average 1,321.57 characters.
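The per-query statistics above can be recomputed from a qrels listing. The following is a minimal sketch, assuming the qrels are available as (query_id, doc_id) pairs with one pair per positive row; the function name and the toy IDs are illustrative and are not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query(qrels):
    """Summarize how many positive documents each query has.

    qrels: iterable of (query_id, doc_id) pairs, one per positive row.
    """
    counts = Counter(query_id for query_id, _ in qrels)
    values = list(counts.values())
    return {
        "queries": len(values),
        "avg": mean(values),
        "min": min(values),
        "median": median(values),
        "max": max(values),
    }


# Toy qrels with hypothetical IDs, not rows from the benchmark.
toy_qrels = [("q1", "d1"), ("q1", "d2"), ("q1", "d3"), ("q2", "d4")]
toy_stats = positives_per_query(toy_qrels)
print(toy_stats)

# Consistency check on the reported figures: 4,527 rows over 50 queries.
reported_avg = 4527 / 50
print(round(reported_avg, 2))
```

The last two lines only confirm that the reported average follows from the reported row and query counts.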

Documents are biomedical title/abstract records. Example topics ask about dexamethasone treatment, coronavirus stability on surfaces, social distancing, serological tests, and biomarkers predicting severe clinical course.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.3627, hit@10 = 0.9200, and recall@100 = 0.2319. BM25 is a useful biomedical baseline because disease names, drug names, mechanistic terms, and exact topic wording often appear in article abstracts.

However, BM25 is the weakest of the three provided profiles. Many relevant articles discuss the same biomedical evidence need using different terminology, abbreviations, or clinical framing. Ranking among dozens of relevant articles requires more than term frequency.

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.4266, hit@10 = 0.9400, and recall@100 = 0.2576. Dense retrieval improves over BM25 across all reported metrics.

This suggests that embedding similarity helps connect information needs to biomedical abstracts when wording differs. Dense retrieval can capture relationships among interventions, outcomes, pathogens, evidence types, and clinical mechanisms, though recall@100 remains limited because each query has many positives.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains 100 candidates per query and achieves nDCG@10 = 0.4505, hit@10 = 0.9800, and recall@100 = 0.2761. Hybrid retrieval is the strongest observed profile across the reported metrics.

This is a clear hybrid-search success case. BM25 contributes exact biomedical terms, while dense retrieval contributes semantic evidence matching. Because the pool holds only 100 candidates per query, recall@100 = 0.2761 means it covers roughly 28% of each query's positives on average. That is still a minority, since there are many relevant articles, but the hybrid profile ranks strong evidence near the top more effectively than either component alone.
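The source does not state how the reranking_hybrid pool is built or scored. As an illustration of how a sparse and a dense ranking can be combined in general, here is a minimal sketch of reciprocal rank fusion (RRF), a common fusion technique that is swapped in here as an assumption. The function name, the constant k = 60 (the value usually quoted for RRF), and the toy document IDs are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    rankings: list of ranked lists (best first), e.g. [bm25_ids, dense_ids].
    Each document scores sum(1 / (k + rank)) over the lists containing it.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by doc ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Toy rankings with hypothetical doc IDs.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers rank well accumulate score from both lists, which is one way exact-term and semantic signals can reinforce each other.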

Metric Interpretation for Model Researchers

This is an extremely multi-positive task. Hit@10 is easy to satisfy because every query has at least 22 relevant documents. nDCG@10 is more informative because it rewards putting relevant biomedical evidence early. Recall@100 should be read relative to the large positive set: the median query has 100 positives, so a 100-candidate list reaches full recall on such a query only if every candidate is relevant.

Researchers should focus on top-rank evidence quality and diversity, not only whether any relevant article is found.
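To make the difference between these metrics concrete, the following is a minimal sketch of hit@k, recall@k, and nDCG@k. It assumes binary relevance (a document is either in the positive set or not); the benchmark's own implementation may differ, for example in how gains are weighted. All function names and toy IDs are illustrative.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any of the top-k documents is relevant, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    found = sum(1 for doc in ranked[:k] if doc in positives)
    return found / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: DCG of the list divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in positives
    )
    ideal_hits = min(k, len(positives))
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query with 100 positives (hypothetical IDs d0..d99).
positives = {f"d{i}" for i in range(100)}

# One relevant article at rank 10: hit@10 is satisfied, nDCG@10 is low.
top10 = [f"x{i}" for i in range(9)] + ["d0"]
print(hit_at_k(top10, positives), ndcg_at_k(top10, positives))

# 25 relevant articles among 100 candidates: recall@100 is 0.25.
top100 = [f"d{i}" for i in range(25)] + [f"x{i}" for i in range(75)]
print(recall_at_k(top100, positives))
```

The first toy list shows why hit@10 alone says little here: a single relevant article at rank 10 earns full hit credit but very little nDCG@10.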

Query and Relevance Type Tendencies

Queries are English biomedical information needs about COVID-19. They mention treatments, virus stability, distancing, antibody tests, biomarkers, clinical course, transmission, prevention, and public-health measures. Relevant documents are article titles and abstracts that address the evidence need.

The task rewards both exact biomedical terminology and semantic understanding of populations, interventions, outcomes, mechanisms, and evidence type.

Representative Failure Modes

BM25 can over-rank articles that share disease terms but address a different clinical question. Dense retrieval can retrieve broadly related COVID-19 papers without matching the specific intervention, outcome, or population. Hybrid retrieval can still miss relevant diversity when many articles satisfy the same topic.

Rerankers should distinguish clinical intervention, mechanistic evidence, observational findings, diagnostics, and epidemiological claims.

Training Data That May Help

Useful training data includes biomedical literature retrieval, CORD-19 and PubMed title-abstract retrieval, TREC-style judged biomedical topics outside this split, and disease- or drug-matched hard negatives. The Nano split's TREC-COVID topics, qrels, and judged article records should be excluded from training.
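The exclusion rule can be applied as a simple filter before training. This is a minimal sketch, assuming training examples are dictionaries with hypothetical "query" and "doc_id" fields and that the evaluation queries and judged document IDs are available as plain collections. Real pipelines may also need near-duplicate detection, which this sketch does not attempt.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def drop_leaked(train_examples, eval_queries, eval_doc_ids):
    """Remove training examples that reuse an eval query or a judged doc."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = set(eval_doc_ids)
    return [
        ex
        for ex in train_examples
        if normalize(ex["query"]) not in blocked_queries
        and ex["doc_id"] not in blocked_docs
    ]


# Toy data with hypothetical IDs and field names.
eval_queries = ["how long does coronavirus remain stable on surfaces?"]
eval_doc_ids = {"doc-eval-1"}
train_examples = [
    {"query": "How long does coronavirus remain stable on  surfaces?", "doc_id": "doc-a"},
    {"query": "remdesivir trial outcomes", "doc_id": "doc-eval-1"},
    {"query": "influenza vaccine efficacy in older adults", "doc_id": "doc-b"},
]
clean = drop_leaked(train_examples, eval_queries, eval_doc_ids)
print([ex["doc_id"] for ex in clean])
```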

Synthetic data can generate biomedical titles and abstracts with clinical and mechanistic evidence. Questions should ask COVID-19 information needs about populations, interventions, outcomes, and mechanisms. Hard negatives should share SARS-CoV-2 terminology but differ in clinical or mechanistic focus.

Model Improvement Notes

Dense retrievers should improve biomedical synonymy, abbreviation handling, and evidence-type discrimination. Sparse systems should preserve exact disease, drug, and outcome terms. Rerankers should evaluate whether the article addresses the specific PICO-style (population, intervention, comparison, outcome) evidence need.

For hybrid systems, NanoMMTEB-v2 / treccovid is a strong positive case: reranking_hybrid gives the best nDCG@10, hit@10, and recall@100. The next challenge is ranking diverse high-quality evidence among many relevant articles.

Example Data

| Query | Positive document |
| --- | --- |
| what evidence is there for dexamethasone as a treatment for COVID-19? [69 chars] | Targeting inflammation and cytokine storm in COVID-19 [53 chars] |
| how long does coronavirus remain stable on surfaces? [53 chars] | Body fluids may contribute to human-to-human transmission of severe acute respiratory syndrome coronavirus 2: evidence and practical experience BACKGROUND: In December 2019, an unbelievable outbreak of pneumonia associated with coronavirus was reported in the city of Wuhan, Hubei Province. This virus was called severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). Although much effort has been spent on clarifying the transmission route of SARS-CoV-2, but, very little evidence is available regarding the relationship between human body fluids and transmission of SARS-CoV-2 virus. Considerable evidence from hospital in Wuhan indicates that strict rules to avoid occupational exposure to patients’ body fluids in healthcare settings, particularly among every medical staff, limited person-to-person transmission of nosocomial infections by direct or indirect contact. CONCLUSION: We tried to provide important information for understanding the possible transmission routes of SARS-CoV-2 v... [1,000 / 1,172 chars] |
| has social distancing had an impact on slowing the spread of COVID-19? [70 chars] | A first study on the impact of current and future control measures on the spread of COVID-19 in Germany The novel coronavirus (SARS-CoV-2), identified in China at the end of December 2019 and causing the disease COVID-19, has meanwhile led to outbreaks all over the globe, with about 571,700 confirmed cases and about 26,500 deaths as of March 28th, 2020. We present here the preliminary results of a mathematical study directed at informing on the possible application or lifting of control measures in Germany. The developed mathematical models allow to study the spread of COVID-19 among the population in Germany and to asses the impact of non-pharmaceutical interventions. The overall goal is to suggest strategies for the mitigation of the current outbreak, slowing down the spread of the virus and thus reducing the peak in daily diagnosed cases, the demand for hospitalization or intensive care units admissions, and eventually fatalities. [948 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| TREC-COVID: Constructing a Pandemic Information Retrieval Test Collection | 2020 | task paper | https://arxiv.org/abs/2005.04474 |
| NIST TREC-COVID challenge page | 2020 | challenge page | https://ir.nist.gov/covidSubmit/index.html |
| mteb/trec-covid | 2024 | dataset card | https://huggingface.co/datasets/mteb/trec-covid |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoMMTEB-v2 |
| Backing dataset | NanoMMTEB-v2 |
| Task / split | treccovid |
| Hugging Face dataset | hakari-bench/NanoMMTEB-v2 |
| Language | en |
| Category | natural_language |
| Queries | 50 |
| Documents | 10,000 |
| Positive qrels | 4,527 |
| Positives / query avg | 90.54 |
| Positives / query min | 22 |
| Positives / query median | 100.00 |
| Positives / query max | 100 |
| Multi-positive queries | 50 (100.00%) |
| Query length avg chars | 69.24 |
| Document length avg chars | 1,321.57 |

Candidate Subsets

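| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.3627 | 0.9200 | 0.2319 | top-500 |
| Dense | harrier_oss_v1_270m | 0.4266 | 0.9400 | 0.2576 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4505 | 0.9800 | 0.2761 | top-100 |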
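The gaps between profiles are easier to judge as relative gains and as query counts. The sketch below uses only the figures from the Candidate Subsets table. It treats hit@10 as the fraction of the 50 queries with at least one relevant article in the top 10; the helper names are illustrative.

```python
NUM_QUERIES = 50

# Values copied from the Candidate Subsets table.
profiles = {
    "bm25": {"ndcg@10": 0.3627, "hit@10": 0.9200, "recall@100": 0.2319},
    "harrier_oss_v1_270m": {"ndcg@10": 0.4266, "hit@10": 0.9400, "recall@100": 0.2576},
    "reranking_hybrid": {"ndcg@10": 0.4505, "hit@10": 0.9800, "recall@100": 0.2761},
}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old


def queries_with_hit(hit_rate, num_queries=NUM_QUERIES):
    """Convert a hit@10 rate into a whole number of queries."""
    return round(hit_rate * num_queries)


for name, metrics in profiles.items():
    gain = relative_gain(metrics["ndcg@10"], profiles["bm25"]["ndcg@10"])
    hits = queries_with_hit(metrics["hit@10"])
    print(f"{name}: nDCG@10 gain over BM25 {gain:.1f}%, hit@10 on {hits}/{NUM_QUERIES} queries")
```

With only 50 queries, each query accounts for 0.02 of hit@10, so differences in that column correspond to a handful of queries.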

Training and Leakage Metadata