HAKARI-Bench

NanoMTEB-Dutch / wikipedia_multilingual_nl

Overview

wikipedia_multilingual_nl is the Dutch subset of WikipediaRetrievalMultilingual from MTEB-NL. Queries are Dutch synthetic information-seeking questions grounded in Wikipedia articles, and documents are Dutch Wikipedia-style passages. The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows, with exactly one positive passage per query. It evaluates broad Dutch encyclopedic passage retrieval.

The task has high scores across all candidate sources. BM25 is strong because synthetic questions often contain entities and terms that appear in the positive passage. Dense retrieval with harrier_oss_v1_270m is strongest in nDCG@10 and hit@10, while reranking_hybrid has the highest recall@100. The task is useful for measuring whether a model can retrieve the exact passage that answers a grounded factual question, not merely a page about the same entity.

Details

What the Original Data Measures

The MTEB-NL and E5-NL paper describes WikipediaRetrievalMultilingual as a multilingual retrieval dataset built from Wikipedia-grounded questions generated by a multilingual LLM. The task is designed to resemble SQuAD-style passage retrieval. Source metadata points to the ellamind/wikipedia-2023-11-retrieval-multilingual-queries dataset and MTEB's WikipediaRetrievalMultilingual.

No standalone paper for this exact dataset was confirmed. The task should be read as synthetic but grounded encyclopedic retrieval. The questions are more directly tied to the positive passage than organic web search queries, but they still test relation and entity matching.

Observed Data Profile

The split has 200 queries over 10,000 documents. Queries average 63.54 characters, and documents average 381.01 characters. Documents are concise Wikipedia-style passages with entities, dates, facts, definitions, and relations.

Representative examples ask what a "mandiel" is and who wears it, the conditions for a two-sample t-test, methods for pain control in pancreatitis, who Bongo was in Apenheul, and when the last wire-radio broadcast in Delft took place. The positive passage usually contains the answer explicitly.

BM25 Evaluation Profile

BM25 reaches nDCG@10 = 0.8444, hit@10 = 0.9250, and recall@100 = 0.9600 over top-500 candidate lists. This is a high lexical baseline. The generated questions often reuse distinctive entities, technical terms, or answer-bearing phrases from the positive passage.
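To make the lexical-overlap argument concrete, the following is a minimal BM25 scoring sketch. It is not the implementation behind the reported numbers: the whitespace tokenizer, the parameters k1 = 1.5 and b = 0.75, and the two short distractor passages are assumptions made for illustration. Only the first passage is shortened from the example data.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens with surrounding punctuation stripped.
    tokens = (t.strip('.,?!"()').lower() for t in text.split())
    return [t for t in tokens if t]


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with an Okapi-style BM25."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    'De druzenvrouwen dragen een "mandiel" (transparante losse witte sluier).',
    "De druzen zijn een religieuze gemeenschap in het Midden-Oosten.",  # toy distractor
    "Een sluier is een kledingstuk dat het hoofd bedekt.",  # toy distractor
]
query = 'Wat is een "mandiel" en wie dragen het?'
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

In this toy setup the rare terms "mandiel" and "dragen" carry most of the score, which mirrors how a generated question that reuses distinctive words from its source passage favors BM25.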

BM25's remaining errors are likely passage-selection errors. A candidate can share an entity or topic but not contain the answer to the specific question. Relation words such as who, when, which, and what condition still matter.

Dense Evaluation Profile

Dense retrieval with harrier_oss_v1_270m reaches nDCG@10 = 0.8948, hit@10 = 0.9550, and recall@100 = 0.9750. Dense retrieval is the strongest top-ranked candidate source. It improves over BM25 by matching the question's semantic relation to the answer-bearing passage, especially when wording differs.
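As a rough illustration of how a dense retriever orders passages, the sketch below ranks documents by cosine similarity between a query vector and passage vectors. The three-dimensional vectors and the passage names are invented for this example; real embeddings from harrier_oss_v1_270m would be much higher-dimensional, and how that model is loaded or queried is not covered here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return document ids sorted from most to least similar to the query."""
    return sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )


# Hypothetical toy embeddings, not produced by any real model.
doc_vecs = {
    "answer_passage": [0.9, 0.1, 0.3],
    "same_entity_passage": [0.6, 0.7, 0.1],
    "unrelated_passage": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.2]
ranking = rank_by_cosine(query_vec, doc_vecs)
```

The same-entity passage lands second in this toy ranking, which is the kind of near miss the task is meant to expose.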

To score well here, a dense retriever cannot merely retrieve a passage on the right topic; it must select the passage that supports the answer. Remaining errors likely involve same-entity passages or closely related factual passages.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate column reaches nDCG@10 = 0.8840, hit@10 = 0.9500, and recall@100 = 0.9950, with 100 to 101 candidates per query and one rank-101 safeguard row. Hybrid retrieval has the best recall@100, while dense retrieval has the best top-10 ranking.

This makes hybrid search a strong candidate source for reranking. BM25 contributes exact entities and terms, and dense retrieval contributes semantic question-passage matching. A reranker can then choose the exact answer-bearing passage among same-entity candidates.
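The source does not state how the reranking_hybrid candidate lists are built. One common way to merge a lexical and a dense ranking is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the method used for this benchmark. The constant k = 60 and the document ids are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of document ids into one candidate list."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document id for determinism.
    ordered = sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical first-stage outputs for one query.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d9", "d3"]
hybrid = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents found by both retrievers rise to the top, while documents found by only one are still kept, which is why a fused pool tends to have higher recall than either source alone.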

Metric Interpretation for Model Researchers

This is a single-positive benchmark, so nDCG@10 directly reflects the rank of the answer passage. Hit@10 measures whether the answer passage is visible in a short list, and recall@100 measures reranking coverage. Dense retrieval is the best first-stage ranker; hybrid retrieval is almost complete as a top-100 candidate pool.
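With exactly one positive per query, the ideal DCG is 1, so per-query nDCG@10 reduces to 1 / log2(1 + rank) when the positive is in the top 10 and 0 otherwise. The sketch below computes the three metrics for one query under that assumption, using binary relevance; the function name and document ids are hypothetical.

```python
import math


def single_positive_metrics(ranked_ids, positive_id):
    """Return (nDCG@10, hit@10, recall@100) for a query with one positive."""
    try:
        rank = ranked_ids.index(positive_id) + 1  # 1-based rank
    except ValueError:
        rank = None  # the positive was not retrieved at all
    in_top10 = rank is not None and rank <= 10
    in_top100 = rank is not None and rank <= 100
    # Ideal DCG is 1 for a single positive, so no further normalization is needed.
    ndcg10 = 1.0 / math.log2(1 + rank) if in_top10 else 0.0
    return ndcg10, float(in_top10), float(in_top100)


ranked = ["d5", "d2", "d8"]
at_rank_1 = single_positive_metrics(ranked, "d5")
at_rank_3 = single_positive_metrics(ranked, "d8")
missing = single_positive_metrics(ranked, "d0")
```

A positive at rank 1 scores 1.0 and a positive at rank 3 scores 0.5, so the benchmark-level nDCG@10 is simply the mean of these per-query values.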

Because scores are high, the task is best used to compare fine passage-ranking differences among otherwise strong Dutch retrieval models.

Query and Relevance Type Tendencies

Queries are Dutch factual questions. They ask for definitions, conditions, people, dates, entities, and factual relations. Relevant documents are concise Wikipedia passages that explicitly answer the question.

Relevance is answer-bearing passage identity. A passage about the same entity is not sufficient unless it contains the requested answer.

Representative Failure Modes

BM25 can fail when the same entity appears in multiple passages or when the question's relation is expressed differently. Dense retrieval can fail by retrieving semantically adjacent passages that do not contain the exact answer. Hybrid retrieval can include both exact lexical matches and semantic matches, requiring reranking among near positives.

Hard negatives should mention the same entity or topic but answer a different relation.

Training Data That May Help

Useful training data includes non-overlapping Dutch Wikipedia question-passage pairs, multilingual Wikipedia retrieval datasets, SQuAD-style synthetic QA pairs from non-evaluation passages, and entity-near hard negatives from Dutch Wikipedia. Training should exclude WikipediaRetrievalMultilingual Dutch test queries, positives, and qrels used by this Nano split.
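A minimal sketch of the exclusion step follows, assuming training pairs are dictionaries with hypothetical "query" and "passage" keys and that leakage is detected by exact match after light normalization. Near-duplicate detection (for example, overlapping passage windows) would need something stronger and is not shown.

```python
def normalize(text):
    # Lowercase and collapse whitespace so trivial variants still match.
    return " ".join(text.lower().split())


def drop_leaked_pairs(train_pairs, eval_queries, eval_passages):
    """Keep only pairs whose query and passage are both absent from evaluation."""
    seen_queries = {normalize(q) for q in eval_queries}
    seen_passages = {normalize(p) for p in eval_passages}
    return [
        pair
        for pair in train_pairs
        if normalize(pair["query"]) not in seen_queries
        and normalize(pair["passage"]) not in seen_passages
    ]


# Toy data; the strings are placeholders, not benchmark content.
eval_queries = ["Evaluatievraag A?"]
eval_passages = ["Evaluatiepassage A."]
train_pairs = [
    {"query": "evaluatievraag  a?", "passage": "Trainingspassage 1."},  # leaked query
    {"query": "Trainingsvraag 2?", "passage": "Evaluatiepassage A."},  # leaked passage
    {"query": "Trainingsvraag 3?", "passage": "Trainingspassage 3."},  # clean
]
clean_pairs = drop_leaked_pairs(train_pairs, eval_queries, eval_passages)
```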

Synthetic data can be generated from non-evaluation Dutch Wikipedia passages. Create factual questions grounded in one passage, varying who, when, which, what, and condition-style prompts. Hard negatives should share entities or topics without answering the requested relation.
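One simple way to approximate "same entity, different relation" negatives is to take the highest-ranked non-positive passages from a first-stage retriever, on the assumption that its top results tend to share entities with the query. The sketch below does this for one query; the function name, the parameters, and the document ids are hypothetical, and any manual or model-based check that a negative truly lacks the answer is left out.

```python
def mine_hard_negatives(ranked_ids, positive_id, num_negatives=4, skip_top=0):
    """Pick top-ranked non-positive candidates as hard negatives.

    skip_top drops the first few candidates, which some pipelines do to
    reduce the risk of sampling unlabeled positives.
    """
    candidates = [d for d in ranked_ids[skip_top:] if d != positive_id]
    return candidates[:num_negatives]


# Hypothetical first-stage ranking for a generated question.
ranked = ["p12", "p07", "p30", "p44", "p51", "p63"]
negatives = mine_hard_negatives(ranked, positive_id="p07", num_negatives=3)
```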

Model Improvement Notes

Improving this task requires precise answer-passage ranking. Dense encoders should preserve question relation and answerability. Rerankers should compare the query against the candidate passage and verify that the answer is explicit.

Hybrid retrieval is useful for near-complete candidate coverage, while dense retrieval currently provides the strongest top order.

Example Data

Query | Positive document
Wat is een "mandiel" en wie dragen het? [39 chars] | De druzenvrouwen dragen een "mandiel" (transparante losse witte sluier) vooral in het bijzijn van religieuze personen. Zij worden in alle aspecten als gelijkwaardig aan de mannen beschouwd. Het is hen mogelijk deel te hebben aan de "Raad van de Oudsten". [254 chars]
Wat zijn de voorwaarden voor het uitvoeren van een t-toets bij twee steekproeven? [81 chars] | In het geval van twee steekproeven dienen beide steekproeven uit een normale verdeling te komen. De twee steekproeven moeten óf onafhankelijk van elkaar zijn, óf zogenaamd gepaard zijn. In het geval van twee onafhankelijke steekproeven dienen bij toepassing van de standaard t-toets de beide populaties dezelfde variantie te hebben. Wanneer beide populaties een verschillende variantie [385 chars]
Welke methoden worden genoemd voor pijnbestrijding bij pancreatitis? [68 chars] | Pijnbestrijding via orale inname van pijnstillers, de (tijdelijke) verbranding -door alcohol- van de zenuwen rond het pancreas, een geïmplanteerde morfinepomp. [159 chars]

Source Reference Table

Title | Year | Type | URL
MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch | 2025 | arXiv paper | https://arxiv.org/abs/2509.12340
ellamind/wikipedia-2023-11-retrieval-multilingual-queries | | dataset card | https://huggingface.co/datasets/ellamind/wikipedia-2023-11-retrieval-multilingual-queries
mteb/WikipediaRetrievalMultilingual | | dataset card | https://huggingface.co/datasets/mteb/WikipediaRetrievalMultilingual
MTEB project repository | | repository | https://github.com/embeddings-benchmark/mteb

Dataset Information

Field | Value
Nano set | NanoMTEB-Dutch
Backing dataset | NanoMTEB-Dutch
Task / split | wikipedia_multilingual_nl
Hugging Face dataset | hakari-bench/NanoMTEB-Dutch
Language | nl
Category | natural_language
Queries | 200
Documents | 10,000
Positive qrels | 200
Positives / query avg | 1.00
Positives / query min | 1
Positives / query median | 1.00
Positives / query max | 1
Multi-positive queries | 0 (0.00%)
Query length avg chars | 63.54
Document length avg chars | 381.01

Candidate Subsets

Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates
BM25 | bm25 | 0.8444 | 0.9250 | 0.9600 | top-500
Dense | harrier_oss_v1_270m | 0.8948 | 0.9550 | 0.9750 | top-500
Reranking hybrid | reranking_hybrid | 0.8840 | 0.9500 | 0.9950 | top-100
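Because every query has exactly one positive, each rate in the table converts directly to a number of queries. The short calculation below does that conversion, assuming the reported values are plain means over the 200 queries rounded to four decimals.

```python
NUM_QUERIES = 200

# Values copied from the Candidate Subsets table.
recall_at_100 = {
    "bm25": 0.9600,
    "harrier_oss_v1_270m": 0.9750,
    "reranking_hybrid": 0.9950,
}
hit_at_10 = {
    "bm25": 0.9250,
    "harrier_oss_v1_270m": 0.9550,
    "reranking_hybrid": 0.9500,
}

# Queries whose positive is missing from the top 100 candidates.
missed_at_100 = {
    name: NUM_QUERIES - round(rate * NUM_QUERIES)
    for name, rate in recall_at_100.items()
}
# Queries whose positive appears in the top 10.
found_in_top10 = {
    name: round(rate * NUM_QUERIES) for name, rate in hit_at_10.items()
}
```

Under that assumption, BM25 misses the positive in the top 100 for 8 queries, dense retrieval for 5, and the hybrid pool for 1.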

Training and Leakage Metadata