NanoMTEB-Dutch
Overview
NanoMTEB-Dutch is a compact Dutch retrieval group covering translated BEIR-NL tasks, native Dutch MTEB tasks, cross-lingual Belebele retrieval, legal and medical retrieval, scientific evidence retrieval, news and tender retrieval, web FAQ retrieval, and Dutch Wikipedia retrieval. Ten of the twenty-seven tasks are Dutch CQADupStack duplicate-question splits, but the group is broader than duplicate QA.
The group should be read as both a Dutch-language benchmark and a translation robustness benchmark. Some tasks are native Dutch resources, such as legal, news, tender, FAQ, or bibliography retrieval. Others carry BEIR-style relevance relations into Dutch. BM25 exposes exact Dutch terms and named entities; dense retrieval tests paraphrase, translation, and cross-lingual matching; reranking_hybrid is useful when sparse and dense candidates recover different positives.
What This Group Measures
NanoMTEB-Dutch follows Dutch retrieval coverage assembled for MTEB-NL and BEIR-NL. It includes translated BEIR-style datasets, cross-lingual Belebele directions, Dutch legal resources, Dutch news, public procurement, Flemish academic bibliography, FAQ retrieval, and Wikipedia retrieval.
The group measures Dutch retrieval across task semantics. A relevant document may be a duplicate question, a statute article, a FAQ answer, a news article, a procurement notice, a scientific paper, a medical abstract, or a Wikipedia passage. A model must preserve both Dutch lexical details and the original task relation after translation or cross-lingual mapping.
Task Families
- Duplicate and argument retrieval:
argu_ana_nl, the tencqadupstack_*tasks, andquora. - Cross-lingual and monolingual reading comprehension: the three Belebele directions.
- Legal retrieval:
b_bsardnlandlegal_qanl. - Open-domain and encyclopedic retrieval:
nq,fever, andwikipedia_multilingual_nl. - Scientific and medical retrieval:
nfcorpus_nl,sci_fact_nl, andscidocs_nl. - News, procurement, bibliography, and FAQ retrieval:
dutch_news_articles,open_tender,vabb, andweb_faq_nld.
Dataset Shape
NanoMTEB-Dutch contains 27 task pages, 5,299 queries, 227,987 split-local documents, and 13,018 positive qrel rows. Most tasks have 200 queries. The group has a mix of single-positive and multi-positive tasks; NFCorpus, SCIDOCS, Quora, LegalQA-NL, bBSARD, FEVER, and NQ require multi-positive interpretation.
Text lengths and formats are diverse. Argument queries are long, FAQ and web queries are short, legal tasks contain formal statute language, CQADupStack contains technical forum text, and scientific tasks contain paper or abstract language. This makes task-family breakdown more useful than one aggregate Dutch score.
Retrieval Behavior
BM25 Profile
BM25 is strongest on tasks with direct named entities, titles, article terms, or Dutch surface overlap. FEVER, Dutch news, Wikipedia, Quora, LegalQA-NL, WebFAQ, OpenTender, and VABB all show substantial sparse signal. BM25 is also useful on same-language Belebele where question and passage share Dutch answer context.
BM25 is weaker on cross-lingual Belebele, bBSARD, SCIDOCS, and many CQADupStack technical duplicate tasks. These require language alignment, paraphrase, legal concept mapping, scientific relatedness, or duplicate intent beyond exact word overlap.
Dense Profile
Dense retrieval is the best profile for many Dutch tasks, especially cross-lingual Belebele, Quora, WebFAQ, Wikipedia, VABB, NQ, SciFact, and several CQADupStack splits. It helps when the Dutch query and document express the same intent with different words, or when translated benchmark artifacts weaken literal matching.
Dense retrieval still needs exact anchors. Legal articles, procurement notices, scientific terms, and technical forum posts can depend on specific names, codes, product terms, or statute phrases. Dense gains should be checked against candidate recall for those anchors.
Reranking Hybrid Profile
reranking_hybrid is strongest on same-language Belebele, LegalQA-NL, and several CQADupStack tasks. It is most useful where exact Dutch terms and dense paraphrase matching are both needed. For FAQ, Quora, and Wikipedia-style tasks, hybrid remains competitive even when dense has the best nDCG@10.
For reranker experiments, multi-positive tasks such as NFCorpus, SCIDOCS, Quora, and legal retrieval should be read with recall and candidate diversity in mind, not only first-hit success.
Task Summary
| Task | Family | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| argu_ana_nl | argument retrieval | 199 | 8,624 | 0.2970 | 0.3723 | 0.3529 | Dense |
| b_bsardnl | legal retrieval | 200 | 10,000 | 0.1249 | 0.2749 | 0.2234 | Dense |
| belebele_eng_latn_nld_latn | cross-lingual QA | 200 | 488 | 0.4738 | 0.8918 | 0.6283 | Dense |
| belebele_nld_latn_eng_latn | cross-lingual QA | 200 | 488 | 0.3288 | 0.9306 | 0.4456 | Dense |
| belebele_nld_latn_nld_latn | Dutch QA | 200 | 488 | 0.8364 | 0.8899 | 0.8999 | Reranking hybrid |
| cqadupstack_android | duplicate question | 200 | 10,000 | 0.2944 | 0.3862 | 0.3836 | Dense |
| cqadupstack_english | duplicate question | 200 | 10,000 | 0.2769 | 0.3587 | 0.3248 | Dense |
| cqadupstack_gis | duplicate question | 200 | 10,000 | 0.2790 | 0.3202 | 0.3272 | Reranking hybrid |
| cqadupstack_mathematica | duplicate question | 200 | 10,000 | 0.1826 | 0.1992 | 0.2181 | Reranking hybrid |
| cqadupstack_physics | duplicate question | 200 | 10,000 | 0.3269 | 0.4020 | 0.3756 | Dense |
| cqadupstack_programmers | duplicate question | 200 | 10,000 | 0.2991 | 0.3906 | 0.3638 | Dense |
| cqadupstack_stats | duplicate question | 200 | 10,000 | 0.2827 | 0.3224 | 0.3337 | Reranking hybrid |
| cqadupstack_tex | duplicate question | 200 | 10,000 | 0.2106 | 0.2611 | 0.2926 | Reranking hybrid |
| cqadupstack_webmasters | duplicate question | 200 | 10,000 | 0.2307 | 0.2947 | 0.2968 | Reranking hybrid |
| cqadupstack_wordpress | duplicate question | 200 | 10,000 | 0.2608 | 0.3057 | 0.3371 | Reranking hybrid |
| dutch_news_articles | news retrieval | 200 | 10,000 | 0.8868 | 0.8996 | 0.8954 | Dense |
| fever | evidence retrieval | 200 | 10,000 | 0.9221 | 0.9207 | 0.9215 | BM25 |
| legal_qanl | legal QA retrieval | 102 | 10,000 | 0.8143 | 0.8050 | 0.8455 | Reranking hybrid |
| nfcorpus_nl | biomedical retrieval | 199 | 3,593 | 0.2683 | 0.2590 | 0.2656 | BM25 |
| nq | open-domain QA | 200 | 10,000 | 0.4505 | 0.6335 | 0.5473 | Dense |
| open_tender | tender retrieval | 199 | 10,000 | 0.6712 | 0.6044 | 0.6556 | BM25 |
| quora | duplicate question | 200 | 10,000 | 0.8391 | 0.9289 | 0.8772 | Dense |
| sci_fact_nl | scientific evidence | 200 | 5,183 | 0.6160 | 0.6758 | 0.6709 | Dense |
| scidocs_nl | related scientific documents | 200 | 10,000 | 0.1335 | 0.2264 | 0.1835 | Dense |
| vabb | bibliography retrieval | 200 | 9,123 | 0.6952 | 0.7804 | 0.7540 | Dense |
| web_faq_nld | FAQ retrieval | 200 | 10,000 | 0.7698 | 0.8776 | 0.8442 | Dense |
| wikipedia_multilingual_nl | Wikipedia QA retrieval | 200 | 10,000 | 0.8444 | 0.8948 | 0.8840 | Dense |
Interpretation Notes for Model Researchers
NanoMTEB-Dutch should be interpreted by source type. Translated BEIR tasks stress whether the original relevance relation survives in Dutch. Native Dutch legal, FAQ, news, tender, and bibliography tasks stress domain vocabulary and local data conventions. Cross-lingual Belebele should be read separately from same-language Dutch retrieval.
The BM25/dense profile is a useful diagnostic. BM25-led tasks show direct Dutch surface anchors. Dense-led tasks show paraphrase, translation, or semantic intent matching. Hybrid-led tasks show that both exact terms and semantic alignment are needed for candidate generation.
Training and Leakage Notes
Useful training data includes Dutch search logs, legal question-article pairs, FAQ retrieval, translated BEIR-NL training data, duplicate-question pairs, scientific and biomedical retrieval, and cross-lingual QA pairs. Hard negatives should be drawn from same legal article families, same technical forums, same scientific topic, or same answer-bearing article.
Exclude NanoMTEB-Dutch evaluation queries, positives, qrels, duplicate clusters, translated test examples, statute articles, and source rows. Cross-lingual tasks should avoid direct translations of evaluation queries as synthetic training seeds.
Source Reference Table
| Source | Year | Type | URL |
| MTEB: Massive Text Embedding Benchmark | 2022 | paper | https://arxiv.org/abs/2210.07316 |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | paper | https://arxiv.org/abs/2104.08663 |
| Belebele: a Parallel Reading Comprehension Dataset in 122 Language Variants | 2023 | paper | https://arxiv.org/abs/2308.16884 |
Metadata Summary
| Field | Value |
| Task pages | 27 |
| Queries | 5,299 |
| Split-local documents | 227,987 |
| Positive qrels | 13,018 |
| Languages | multilingual, nl |
| Categories | natural_language |
| Positives / query avg | 2.46 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| argu_ana_nl | NanoMTEB-Dutch | nl | natural_language | 199 | 8,624 | 199 | 0.2970 | 0.3723 | 0.3529 | Dense |
| b_bsardnl | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 923 | 0.1249 | 0.2749 | 0.2234 | Dense |
| belebele_eng_latn_nld_latn | NanoMTEB-Dutch | multilingual | natural_language | 200 | 488 | 200 | 0.4738 | 0.8918 | 0.6283 | Dense |
| belebele_nld_latn_eng_latn | NanoMTEB-Dutch | multilingual | natural_language | 200 | 488 | 200 | 0.3288 | 0.9306 | 0.4456 | Dense |
| belebele_nld_latn_nld_latn | NanoMTEB-Dutch | nl | natural_language | 200 | 488 | 200 | 0.8364 | 0.8899 | 0.8999 | Reranking hybrid |
| cqadupstack_android | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2944 | 0.3862 | 0.3836 | Dense |
| cqadupstack_english | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2769 | 0.3587 | 0.3248 | Dense |
| cqadupstack_gis | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2790 | 0.3202 | 0.3272 | Reranking hybrid |
| cqadupstack_mathematica | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.1826 | 0.1992 | 0.2181 | Reranking hybrid |
| cqadupstack_physics | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.3269 | 0.4020 | 0.3756 | Dense |
| cqadupstack_programmers | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2991 | 0.3906 | 0.3638 | Dense |
| cqadupstack_stats | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2827 | 0.3224 | 0.3337 | Reranking hybrid |
| cqadupstack_tex | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2106 | 0.2611 | 0.2926 | Reranking hybrid |
| cqadupstack_webmasters | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2307 | 0.2947 | 0.2968 | Reranking hybrid |
| cqadupstack_wordpress | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.2608 | 0.3057 | 0.3371 | Reranking hybrid |
| dutch_news_articles | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.8868 | 0.8996 | 0.8954 | Dense |
| fever | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 233 | 0.9221 | 0.9207 | 0.9215 | BM25 |
| legal_qanl | NanoMTEB-Dutch | nl | natural_language | 102 | 10,000 | 157 | 0.8143 | 0.8050 | 0.8455 | Reranking hybrid |
| nfcorpus_nl | NanoMTEB-Dutch | multilingual | natural_language | 199 | 3,593 | 5,880 | 0.2683 | 0.2590 | 0.2656 | BM25 |
| nq | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 242 | 0.4505 | 0.6335 | 0.5473 | Dense |
| open_tender | NanoMTEB-Dutch | nl | natural_language | 199 | 10,000 | 199 | 0.6712 | 0.6044 | 0.6556 | BM25 |
| quora | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 573 | 0.8391 | 0.9289 | 0.8772 | Dense |
| sci_fact_nl | NanoMTEB-Dutch | nl | natural_language | 200 | 5,183 | 226 | 0.6160 | 0.6758 | 0.6709 | Dense |
| scidocs_nl | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 986 | 0.1335 | 0.2264 | 0.1835 | Dense |
| vabb | NanoMTEB-Dutch | nl | natural_language | 200 | 9,123 | 200 | 0.6952 | 0.7804 | 0.7540 | Dense |
| web_faq_nld | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.7698 | 0.8776 | 0.8442 | Dense |
| wikipedia_multilingual_nl | NanoMTEB-Dutch | nl | natural_language | 200 | 10,000 | 200 | 0.8444 | 0.8948 | 0.8840 | Dense |