NanoMTEB-Dutch / nfcorpus_nl
Overview
nfcorpus_nl is the Dutch NFCorpus retrieval task from BEIR-NL. Queries are short consumer-health or nutrition topics, and documents are translated medical or biomedical passages. The Nano split contains 199 queries, 3,593 documents, and 5,880 positive qrel rows. It is strongly multi-positive: the average query has 29.55 positives, the median is 15, and 181 of 199 queries have more than one positive.
This task is very different from single-positive retrieval. Queries are often extremely short, averaging only 18.51 characters, while documents average 1,743.72 characters and can contain technical biomedical terminology. BM25 has the best nDCG@10, dense retrieval has better recall@100 than BM25, and reranking_hybrid has the best hit@10 and recall@100. Because each query can map to many relevant medical documents, recall should be interpreted as coverage over a relevance set rather than success on one target passage.
Details
What the Original Data Measures
NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval describes a medical information retrieval dataset built from NutritionFacts.org topics linked to PubMed and PubMed Central evidence. The original task focuses on the gap between lay health information needs and biomedical literature, using relevance signals that can connect one topic to many documents.
BEIR includes NFCorpus as a medical retrieval task, and BEIR-NL translates public BEIR datasets into Dutch. This Nano task should therefore be read as translated medical retrieval. Some technical names, biomedical terms, and English or multilingual artifacts can remain visible, while the surrounding text is Dutch-translated.
Observed Data Profile
The split has 199 queries and 3,593 documents. It contains 5,880 positive qrels, which is much denser than most Nano retrieval tasks. The maximum number of positives for a single query is 100. Documents are long biomedical or health passages, often resembling translated abstracts with exposure, outcome, method, or population details.
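The density figures above can be recomputed from the qrels with a few lines of Python. The sketch below is only an illustration: it assumes the qrels are available as unique `(query_id, doc_id)` pairs, and the function name `positives_profile` and the toy ids are hypothetical, not part of the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from unique (query_id, doc_id) pairs."""
    per_query = Counter(query_id for query_id, _ in qrels)
    counts = list(per_query.values())
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "min": min(counts),
        "max": max(counts),
        # Queries with more than one positive document.
        "multi_positive": sum(1 for c in counts if c > 1),
    }


# Toy qrels with hypothetical ids, not taken from the dataset.
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"), ("q1", "d3"),
    ("q2", "d4"),
    ("q3", "d1"), ("q3", "d5"),
]
profile = positives_profile(toy_qrels)
```

On the full split, the same calculation should reproduce the reported values, for example 5,880 / 199 ≈ 29.55 positives per query and 181 / 199 ≈ 90.95% multi-positive queries.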
Representative queries include short topics such as: bagels; grapes; Dr. Walter Willett; chronic headache and pork parasites; and Native Americans. The positive documents cover topics such as poppy seeds and opiate testing, plant polyphenols and cognition, coconut oil and lipid profiles, neurocysticercosis, and diet-related disease. The query-to-document bridge is often conceptual rather than a direct wording match.
BM25 Evaluation Profile
BM25 reaches nDCG@10 = 0.2683, hit@10 = 0.6181, and recall@100 = 0.1371 over top-500 candidate lists. BM25 is the best nDCG@10 source, although its margin over reranking_hybrid (0.2656) and dense retrieval (0.2590) is small. This suggests that exact medical, nutrition, and food terms are valuable when they appear in both query and document: BM25 can rank some highly relevant documents near the top when there is direct term overlap.
The low recall@100 is the more important warning. Because each query can have many relevant documents, recovering one or a few lexical matches does not cover the full relevance set. Short lay queries often do not share terminology with technical biomedical abstracts, and translations can introduce additional variation. BM25 is useful but incomplete.
Dense Evaluation Profile
Dense retrieval with harrier_oss_v1_270m reaches nDCG@10 = 0.2590, hit@10 = 0.6181, and recall@100 = 0.1757. Dense retrieval has slightly lower nDCG@10 than BM25 but better recall@100. This suggests that dense similarity helps recover medically related documents that do not share exact query terms, even if its top ranking is not as sharp as BM25 for lexical matches.
Dense retrieval should be especially useful for lay-to-technical mappings: food names to biomedical exposures, symptoms to diseases, or health topics to research abstracts. Its failure mode is broad concept matching. It may retrieve documents that are medically related but not relevant under the original NFCorpus relevance judgments.
Reranking Hybrid Evaluation Profile
The reranking_hybrid candidate column reaches nDCG@10 = 0.2656, hit@10 = 0.6231, and recall@100 = 0.1815, with 100 to 101 candidates per query and 37 rank-101 safeguard rows. It has the best hit@10 and recall@100, while BM25 has the highest nDCG@10. This pattern shows the tradeoff between sharp lexical top ranking and broader semantic coverage.
Hybrid search is useful because medical retrieval needs both exact terminology and concept matching. BM25 can preserve precise biomedical terms, while dense retrieval can bridge lay wording and technical passages. A reranker can then decide which medically related candidates are genuinely relevant.
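One common way to merge a lexical and a dense ranking into a single candidate list is reciprocal rank fusion (RRF). The source does not state how the reranking_hybrid candidates are built, so the sketch below illustrates the general idea and is not the benchmark's method. The function name, the constant `k = 60` (a commonly used default), and the toy document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into one list of at most top_n ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), then by doc id so ties are deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Toy rankings with hypothetical doc ids.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d4", "d1"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents found by both retrievers (`d1` and `d3` here) rise to the top, while documents found by only one retriever remain in the pool for the reranker to judge.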
Metric Interpretation for Model Researchers
This task should not be read like a single-positive benchmark. With 5,880 positive qrels for 199 queries, recall@100 can be low even when hit@10 is moderate, because the system must cover many relevant documents. nDCG@10 reflects the quality of the top-ranked subset, while recall@100 reflects how much of the broad relevance set is captured.
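The gap between hit@10 and recall@100 is easy to see on a toy example. The sketch below assumes binary relevance (a document is either a positive or not) and uses hypothetical function names and ids. The benchmark's own metric implementation is not specified in this card and may differ, for example by using graded relevance for nDCG.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal if ideal else 0.0


# Toy query with 30 positives (hypothetical ids); only 4 are retrieved in the top 100.
positives = {f"p{i}" for i in range(30)}
ranked = ["n0", "n1", "p0"] + [f"n{i}" for i in range(2, 96)] + ["p1", "p2", "p3"]

hit = hit_at_k(ranked, positives)        # one positive in the top 10 is enough
recall = recall_at_k(ranked, positives)  # 4 of 30 positives are covered
ndcg = ndcg_at_k(ranked, positives)
```

In this toy case the query counts as a full hit at 10, yet only 4 of its 30 positives are covered in the top 100, which mirrors the pattern of moderate hit@10 and low recall@100 reported for this task.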
Multi-positive training objectives are a better match than single-positive contrastive sampling. Models should learn to rank several relevant biomedical documents for one lay topic, not only the nearest passage.
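As an illustration of what a multi-positive objective can look like, the sketch below computes a contrastive loss in which all positives of a query share the numerator. This is one of several possible formulations and is not prescribed by the source. The function names, the temperature value, and the toy scores are assumptions.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def multi_positive_loss(scores, positive_mask, temperature=0.05):
    """-log( sum over positives of exp(s/t) / sum over all candidates of exp(s/t) )."""
    if not any(positive_mask):
        raise ValueError("at least one positive is required")
    scaled = [s / temperature for s in scores]
    positives = [s for s, is_pos in zip(scaled, positive_mask) if is_pos]
    return _logsumexp(scaled) - _logsumexp(positives)


# Toy query-document similarity scores (hypothetical values).
scores = [0.9, 0.8, 0.1, 0.0]
loss_good = multi_positive_loss(scores, [True, True, False, False])
loss_bad = multi_positive_loss(scores, [False, False, True, True])
```

The loss is small when the positives collectively hold most of the probability mass and grows when negatives outscore them, so a model is not penalized for ranking a second or third relevant document highly.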
Query and Relevance Type Tendencies
Queries are very short health, nutrition, food, symptom, or named-entity topics. Documents are long biomedical passages or abstracts. Relevant documents may vary in specificity: some are direct evidence for a topic, while others are related medical sources connected through the original NFCorpus judgments.
The relevance type is topic-to-evidence coverage. A query such as a food name or symptom can map to many relevant medical documents. Retrieval systems should therefore handle broad but controlled semantic expansion.
Representative Failure Modes
BM25 can fail when lay terms and biomedical terminology diverge, or when many relevant documents do not repeat the short query token. Dense retrieval can fail by retrieving broadly related medical passages that are not judged relevant. Hybrid retrieval can still miss much of the positive set because only about 100 candidates per query (100 to 101 in this split) are retained for reranking.
Hard negatives should be concept-near: related diseases, compounds, foods, or interventions that are medically adjacent but not relevant to the query's topic.
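A simple way to obtain concept-near negatives is to take the highest-ranked retrieved documents that are not labeled positive for the query. The sketch below assumes a ranked candidate list and a set of positive ids. The function name, the `skip_top` parameter, and the toy ids are hypothetical.

```python
def mine_hard_negatives(ranked_candidates, positives, num_negatives=5, skip_top=0):
    """Return the highest-ranked non-positive doc ids, optionally skipping the very top."""
    negatives = []
    for doc_id in ranked_candidates[skip_top:]:
        if doc_id not in positives:
            negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives


# Toy ranking with hypothetical doc ids.
ranked_candidates = ["d1", "d2", "d3", "d4", "d5"]
positives = {"d2", "d4"}
hard_negatives = mine_hard_negatives(ranked_candidates, positives, num_negatives=2)
```

Because relevance judgments may be incomplete, some top-ranked unlabeled documents could in fact be relevant. Skipping the first few ranks with `skip_top` is one hedge against such false negatives.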
Training Data That May Help
Useful training data includes official NFCorpus training data with overlap removed, Dutch biomedical QA and evidence retrieval pairs, non-overlapping health topic to medical article pairs, and multilingual medical retrieval data adapted to Dutch. Training should exclude translated NFCorpus test queries, qrels, and medical documents used by this Nano split.
Synthetic data can be generated from Dutch biomedical abstracts or patient-facing health passages outside the evaluation set. Generate short lay health topics and questions that map to multiple technical passages. Include hard negatives from adjacent conditions, compounds, interventions, or food topics.
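The exclusion rule described in this section can be sketched as a text-level filter over candidate training pairs. This is a minimal first pass that assumes training data is a list of `(query, document)` strings. The function names and toy strings are hypothetical, and exact matching after normalization will not catch paraphrased or differently translated copies, so matching on original NFCorpus ids would be more reliable where those ids are available.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def remove_eval_overlap(train_pairs, eval_queries, eval_documents):
    """Drop training pairs whose query or document also appears in the evaluation set."""
    eval_q = {normalize(q) for q in eval_queries}
    eval_d = {normalize(d) for d in eval_documents}
    return [
        (query, doc)
        for query, doc in train_pairs
        if normalize(query) not in eval_q and normalize(doc) not in eval_d
    ]


# Toy data with hypothetical strings.
train_pairs = [
    ("Druiven", "tekst over polyfenolen"),
    ("hoofdpijn", "tekst over migraine"),
    ("vezels", "Evaluatie  document"),
]
clean_pairs = remove_eval_overlap(
    train_pairs,
    eval_queries=["druiven"],
    eval_documents=["evaluatie document"],
)
```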
Model Improvement Notes
Improving this task requires lay-to-biomedical semantic bridging and multi-positive ranking. Dense encoders should learn medical synonymy and concept relations, while sparse signals should preserve exact terminology for top precision. Rerankers should not assume one best answer; several documents may be relevant to the same health topic.
The task is a good stress test for retrieval coverage. A model can look acceptable by hit@10 while still covering only a small fraction of the relevant medical evidence.
Example Data
| Query | Positive document |
| bagels [6 chars] | Papaverzaadproducten en opiaten drugstesten – waar staan we nu? Zaden van de opiumpapaverplant worden legaal verkocht en veel geconsumeerd als voedsel. Door contaminatie tijdens de oogst kunnen de zaden morfine en andere opiaatalkaloïden bevatten. Het doel van deze studie is de toxicologie van papaverzaadproducten te beoordelen met betrekking tot de invloed op opiaten drugstesten. Een computergestuurde literatuurstudie resulteerde in 95 geïdentificeerde referenties. Normale consumptie van papaverzaad wordt over het algemeen als veilig beschouwd. Tijdens de voedselverwerking wordt het morfinegehalte aanzienlijk verlaagd (tot 90%). De mogelijkheid van vals-positieve opiaten drugstesten na inname van papavervoedsel bestaat. Er zijn geen eenduidige markers beschikbaar om inname van papavervoedsel te onderscheiden van heroïne- of farmaceutische morfinegebruik. Dit is ook een probleem in heroïne-geassisteerde onderhoudsprogramma's. Een fundamentele eis in dergelijke substitutieprogramma's is... [1,000 / 1,902 chars] |
| druiven [7 chars] | Een beslist prikkelend idee: de potentiële rol van plantaardige polyfenolen bij de behandeling van leeftijdsgebonden cognitieve stoornissen. Tegenwoordig lijden tientallen miljoenen ouderen wereldwijd aan dementie. Hoewel de pathogenese van dementie complex en onvolledig begrepen is, kan het, althans tot op zekere hoogte, een gevolg zijn van systemische vasculaire pathologie. Het metabool syndroom en de afzonderlijke componenten ervan induceren een pro-inflammatoire toestand die bloedvaten beschadigt. Deze toestand van chronische ontsteking kan de vasculatuur van de hersenen beschadigen of direct neurotoxisch zijn. Er zijn verbanden vastgesteld tussen het metabool syndroom, de bestanddelen ervan en dementie. Er is ook een verband waargenomen tussen bepaalde dieetfactoren, zoals bestanddelen van het 'mediterrane dieet', en het metabool syndroom; soortgelijke verbanden zijn opgemerkt tussen deze dieetfactoren en dementie. Fruchtsappen en -extracten worden onderzocht als behandelingen voo... [1,000 / 1,953 chars] |
| Dr. Walter Willett [18 chars] | Cocountolie voorspelt een gunstig lipidenprofiel bij premenopauzale vrouwen in de Filipijnen Cocountolie is een veelgebruikte eetbare olie in veel landen, en er is gemengd bewijs voor de effecten ervan op lipidenprofielen en het risico op hart- en vaatziekten. Hier onderzoeken we het verband tussen consumptie van cocountolie en lipidenprofielen in een cohort van 1839 Filipijnse vrouwen (leeftijd 35–69 jaar) die deelnamen aan het Cebu Longitudinal Health and Nutrition Survey, een community-based studie in Metropolitan Cebu City. De inname van cocountolie werd gemeten als individuele inname van cocountolie, berekend met behulp van twee 24-uurs voedselinnames (9,54 ± 8,92 gram). Cholesterolprofielen werden gemeten in plasmamonsters die werden verzameld na een nachtelijk vasten. Gemiddelde lipidewaarden in dit monster waren totaal cholesterol (TC) (186,52 ± 38,86 mg/dL), high-density lipoprotein cholesterol (HDL-c) (40,85 ± 10,30 mg/dL), low-density lipoprotein cholesterol (LDL-c) (119,42... [1,000 / 1,491 chars] |
Source Reference Table
| Title | Year | Type | URL |
| NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | 2016 | paper PDF | https://www.cl.uni-heidelberg.de/~riezler/publications/papers/ECIR2016.pdf |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | arXiv paper | https://arxiv.org/abs/2104.08663 |
| BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language | 2025 | ACL paper | https://aclanthology.org/2025.bucc-1.5/ |
| clips/beir-nl-nfcorpus | n/a | dataset card | https://huggingface.co/datasets/clips/beir-nl-nfcorpus |
Dataset Information
| Field | Value |
| Nano set | NanoMTEB-Dutch |
| Backing dataset | NanoMTEB-Dutch |
| Task / split | nfcorpus_nl |
| Hugging Face dataset | hakari-bench/NanoMTEB-Dutch |
| Language | multilingual |
| Category | natural_language |
| Queries | 199 |
| Documents | 3,593 |
| Positive qrels | 5,880 |
| Positives / query avg | 29.55 |
| Positives / query min | 1 |
| Positives / query median | 15.00 |
| Positives / query max | 100 |
| Multi-positive queries | 181 (90.95%) |
| Query length avg chars | 18.51 |
| Document length avg chars | 1,743.72 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.2683 | 0.6181 | 0.1371 | top-500 |
| Dense | harrier_oss_v1_270m | 0.2590 | 0.6181 | 0.1757 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.2656 | 0.6231 | 0.1815 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: translated BEIR-NL NFCorpus test split from clips/beir-nl-nfcorpus
- Train/eval overlap audit: not_audited
- Leakage note: Exclude translated NFCorpus test queries, qrels, and medical documents used by this Nano split.
- Multi-positive training: multi_positive_objective
- Useful training data: official NFCorpus training data with overlap removed, Dutch biomedical QA and evidence retrieval pairs, non-overlapping health topic to medical article pairs, multilingual medical retrieval data adapted to Dutch