HAKARI-Bench

NanoMTEB-Dutch / scidocs_nl

Overview

scidocs_nl is the Dutch SCIDOCS retrieval task from BEIR-NL. Queries are Dutch-translated scientific-paper titles, and documents are translated paper titles and abstracts. The Nano split contains 200 queries, 10,000 documents, and 986 positive qrel rows. All 200 queries are multi-positive: the average is 4.93 positives per query and the median is five.
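The per-query figures above can be recomputed from qrel rows. The following is a minimal sketch, assuming qrels are available as (query_id, doc_id) pairs; the function name and the toy ids are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median

def positives_per_query(qrels):
    """Summarize positives per query from (query_id, doc_id) qrel rows."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    return {
        "queries": len(per_query),
        "positives": sum(per_query),
        "avg": mean(per_query),
        "median": median(per_query),
        "min": min(per_query),
        "max": max(per_query),
        # Queries with more than one positive count as multi-positive.
        "multi_positive": sum(1 for n in per_query if n > 1),
    }

# Toy qrels with hypothetical ids, not rows from the dataset.
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"), ("q1", "d3"),
    ("q2", "d4"), ("q2", "d5"), ("q2", "d6"), ("q2", "d7"), ("q2", "d8"),
]
stats = positives_per_query(toy_qrels)
print(stats)
```

Applied to the real split, the same arithmetic gives 986 / 200 = 4.93 positives per query.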

This is one of the hardest Dutch retrieval tasks in the current batch. The relevance relation is scientific relatedness (citation, co-citation, or paper recommendation), not answer containment. BM25 is weak because related papers may share a method, background problem, or citation context without sharing title terms. Dense retrieval with harrier_oss_v1_270m is the strongest of the individual final orders on nDCG@10 (0.2264), hit@10 (0.6400), and recall@100 (0.4564). The reranking_hybrid profile improves over BM25 (nDCG@10 of 0.1835 versus 0.1335) but does not beat dense. The task strongly rewards scientific semantic representations.

Details

What the Original Data Measures

The SPECTER paper ("SPECTER: Document-level Representation Learning using Citation-informed Transformers") introduced SciDocs as a benchmark for scientific document representation, covering citation prediction, co-citation, recommendation, and classification. In BEIR-style retrieval, the query is a paper title, and relevant documents are scientifically related papers.

MTEB-NL describes SCIDOCS-NL as a machine-translated Dutch adaptation from BEIR-NL. The task asks a model to retrieve documents that are cited by, should be cited by, or are otherwise related to a query paper. This is closer to paper recommendation than ordinary fact retrieval.

Observed Data Profile

Queries average about 77.7 characters and are scientific titles. Documents average 1,331.57 characters and are abstract-like scientific records. Each query has between three and five positives, so the benchmark expects several related papers to appear near the top.

Representative topics include log mining for system management, search-result visualization, distributed caching protocols, architectural experience from a neurophysiological perspective, and robot control with gravity compensation. The related documents can be conceptually or citation-related rather than lexically similar to the title.

BM25 Evaluation Profile

BM25 reaches nDCG@10 = 0.1335, hit@10 = 0.4250, and recall@100 = 0.2698 over top-500 candidate lists. This low score is expected for citation-style scientific retrieval. A paper can cite or recommend another paper because it uses a related method, dataset, theoretical framing, or application area, even if the title words differ substantially.

BM25 succeeds mainly when query and positive share distinctive technical terms. It fails when the relation is methodological or bibliographic rather than lexical. The all-multi-positive setup also makes it harder: finding one overlapping abstract is not enough when several related papers should be ranked.
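The lexical gap described above can be illustrated with a small BM25 scorer. This is a minimal sketch, not the benchmark's BM25 configuration: the whitespace tokenizer, the parameters k1 = 1.5 and b = 0.75, the Lucene-style smoothed IDF, and the English toy texts are all assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in term_freq:
                continue  # terms absent from the document contribute nothing
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            tf = term_freq[term]
            length_norm = 1 - b + b * len(toks) / avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        scores.append(score)
    return scores

# Hypothetical toy texts, chosen only to show the lexical gap.
query = "consistent hashing for distributed caching"
docs = [
    "distributed caching protocols based on consistent hashing",  # shares terms
    "relieving hot spots in large content delivery networks",     # related, no shared terms
]
scores = bm25_scores(query, docs)
print(scores)
```

The second document is topically related to the query but shares no tokens with it, so its BM25 score is exactly zero. This is the failure pattern the profile describes for methodological or bibliographic relations.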

Dense Evaluation Profile

Dense retrieval with harrier_oss_v1_270m reaches nDCG@10 = 0.2264, hit@10 = 0.6400, and recall@100 = 0.4564. Dense retrieval is clearly stronger than BM25 because it can connect papers by topic, method, and contribution beyond exact title overlap. It is the best candidate profile for top-ranked results in this task.
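Dense ranking of this kind typically reduces to a nearest-neighbor search over embedding vectors. The sketch below assumes cosine similarity and uses hand-written three-dimensional toy vectors in place of real embeddings; it does not call harrier_oss_v1_270m, and the document labels are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_cosine(query_vec, doc_vecs, top_k=10):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for embeddings (assumption, not model output).
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "same_method": [0.8, 0.2, 0.1],
    "same_field": [0.4, 0.6, 0.2],
    "unrelated": [0.0, 0.1, 0.9],
}
ranking = rank_by_cosine(query_vec, doc_vecs)
print(ranking)
```

Because the score depends on vector direction rather than shared tokens, a document can rank highly with no title overlap at all.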

The absolute score remains low, which shows how difficult scientific recommendation is. A model must recognize relatedness among abstracts from the same research area while distinguishing different methods and contributions. Generic semantic similarity may still retrieve same-field papers that are not among the judged positives.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate column reaches nDCG@10 = 0.1835, hit@10 = 0.5700, and recall@100 = 0.4280, with 100 to 101 candidates per query and 26 rank-101 safeguard rows. It improves over BM25 but trails dense retrieval. This indicates that sparse lexical evidence adds some useful candidates but also brings many title-term distractors that do not match the citation-style relatedness relation.

For reranking, the hybrid pool is useful but not ideal as a final order. A reranker must learn that shared title terms are weaker evidence than method, problem, and citation-context similarity.
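The source does not state how the reranking_hybrid pool is assembled. As an illustration only, the sketch below merges a sparse and a dense ranking with reciprocal rank fusion (RRF), a common fusion technique swapped in here as an assumption; the constant k = 60, the function name, and the toy ids are likewise hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(fused, key=lambda doc_id: fused[doc_id], reverse=True)
    return ordered[:top_n]

# Toy rankings with hypothetical ids.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
hybrid = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(hybrid)
```

Documents found by both retrievers rise to the top, while documents found only by the sparse ranking still enter the pool. That second group is where title-term distractors can come from, which is why a reranker is needed on top of such a pool.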

Metric Interpretation for Model Researchers

Every query is multi-positive, so nDCG@10 and recall@100 should be read as related-paper set retrieval metrics. Hit@10 is less informative by itself because returning one related paper does not mean the system has captured the recommended set. Multi-positive or listwise training is strongly aligned with the benchmark.
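The difference between hit@10 and the set-oriented metrics can be made concrete with a small example. This is a sketch under stated assumptions: binary relevance gains, a log2 rank discount, and no duplicate ids in the ranking. The benchmark's exact metric implementation is not given in the source.

```python
import math

def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG@k with a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of the positives that appear in the top k."""
    if not positives:
        return 0.0
    found = sum(1 for doc_id in ranked[:k] if doc_id in positives)
    return found / len(positives)

# One query with five positives; only one is retrieved, at rank 1.
positives = {"p1", "p2", "p3", "p4", "p5"}
ranked = ["p1", "n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9"]
print(hit_at_k(ranked, positives), recall_at_k(ranked, positives), ndcg_at_k(ranked, positives))
```

In this toy case hit@10 is already at its maximum of 1.0, while recall is 0.2 and nDCG@10 is roughly 0.34, because four of the five related papers are still missing.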

The main lesson is that dense retrieval is essential here. BM25 is not a strong proxy for scientific relatedness, and hybrid search only helps if a reranker can suppress papers that match lexically but are not scientifically related.

Query and Relevance Type Tendencies

Queries are title-like scientific strings. Documents are paper records with titles and abstracts. Relevance indicates scientific relatedness: cited papers, co-cited papers, recommended papers, or papers that belong to the same research neighborhood.

The model must infer relatedness from method, task, dataset, field, and contribution. Exact word overlap is helpful but often insufficient.

Representative Failure Modes

BM25 fails when related papers have different titles or use different terminology for the same research area. Dense retrieval can fail by retrieving broadly same-field papers that are not citation-related. Hybrid retrieval can over-rank abstracts with shared title words but weak bibliographic relation.

Hard negatives should come from the same discipline or research problem but use a different method or contribution.
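One common way to obtain such negatives is to take highly ranked non-positive documents from an existing retriever. The sketch below assumes that a retriever ranking for the query is already available; the function name, the skip_top parameter, and the toy ids are hypothetical.

```python
def mine_hard_negatives(ranked_doc_ids, positives, num_negatives=4, skip_top=0):
    """Pick the highest-ranked documents that are not judged positives."""
    negatives = []
    for doc_id in ranked_doc_ids[skip_top:]:
        if doc_id in positives:
            continue  # never use a judged positive as a negative
        negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives

# Toy ranking with hypothetical ids.
ranked_doc_ids = ["p1", "n1", "p2", "n2", "n3"]
positives = {"p1", "p2"}
hard_negatives = mine_hard_negatives(ranked_doc_ids, positives, num_negatives=2)
print(hard_negatives)
```

Highly ranked non-positives may include related papers that were simply never judged, so a nonzero skip_top or a manual check may be worth considering before training on them.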

Training Data That May Help

Useful training data includes non-overlapping citation graph pairs, scientific paper recommendation datasets, title-to-cited-paper and title-to-abstract retrieval pairs, and multilingual scientific retrieval data with overlap removed. Training should exclude SCIDOCS-NL evaluation titles, qrels, and positive scientific documents used in this Nano split.
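The exclusion step can be sketched as a simple filter over training pairs. This is a minimal illustration, assuming training data is a list of (query_text, document_text) pairs and that matching on normalized exact text is sufficient; the function names are hypothetical, and near-duplicate or translated variants would need a fuzzier check.

```python
import re

def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[\W_]+", " ", text.lower()).strip()

def drop_leaked_pairs(train_pairs, eval_texts):
    """Remove training pairs whose query or document matches an evaluation text."""
    blocked = {normalize(text) for text in eval_texts}
    return [
        (query, doc)
        for query, doc in train_pairs
        if normalize(query) not in blocked and normalize(doc) not in blocked
    ]

# The evaluation title is one of the example queries; the training pairs are hypothetical.
eval_texts = ["Algoritmische Brokjes in Contentlevering"]
train_pairs = [
    ("algoritmische brokjes in contentlevering.", "some training document"),
    ("Another training title", "another training document"),
]
clean_pairs = drop_leaked_pairs(train_pairs, eval_texts)
print(clean_pairs)
```

In practice eval_texts would hold the evaluation query titles and the positive document texts of the Nano split, so that both sides of a leaked pair are caught.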

Synthetic data can create clusters of scientific titles and abstracts around a shared method, dataset, or research problem. Each query title should have several related-paper positives plus same-field hard negatives.

Model Improvement Notes

Improving this task requires scientific document representations trained on citation and recommendation signals. Dense encoders should model document-level relatedness rather than only local sentence semantics. Rerankers should compare methods, research problems, datasets, and contributions across title and abstract text.

This task is a strong diagnostic for whether a model can support scientific literature discovery in Dutch-translated settings.

Example Data

| Query | Positive document |
| --- | --- |
| Een geïntegreerd raamwerk voor het delven van logbestanden voor systeembeheer. [78 chars] | Machine learning in geautomatiseerde tekstcategorisering De geautomatiseerde categorisering (of classificatie) van teksten in vooraf gedefinieerde categorieën heeft de afgelopen 10 jaar een explosieve groei doorgemaakt, te danken aan de toegenomen beschikbaarheid van documenten in digitale vorm en de daaruit voortvloeiende behoefte om deze te organiseren. In de onderzoeksgemeenschap is de dominante aanpak van dit probleem gebaseerd op machine learning technieken: een algemeen inductief proces bouwt automatisch een classifier door, uit een set van vooraf geclassificeerde documenten, de kenmerken van de categorieën te leren. De voordelen van deze aanpak ten opzichte van de knowledge engineering aanpak (bestaande uit de handmatige definitie van een classifier door domeinexperts) zijn een zeer goede effectiviteit, aanzienlijke besparingen in termen van expertise-arbeidskracht, en eenvoudige overdraagbaarheid naar verschillende domeinen. Dit overzicht bespreekt de belangrijkste benaderingen... [1,000 / 1,249 chars] |
| Onderwerp-Relevantiekaart: Visualisatie voor Verbetering van het Begrip van Zoekresultaten [90 chars] | Ontwerpen voor explorerend zoeken op touchscreen-apparaten Explorerend zoeken confronteert gebruikers met uitdagingen bij het uitdrukken van zoekintenties, aangezien huidige zoekinterfaces het onderzoeken van resultatenlijsten vereisen om zoekrichtingen te identificeren, iteratief typen en het herformuleren van zoekopdrachten. We presenteren het ontwerp van Exploration Wall, een op aanraking gebaseerde zoekgebruikersinterface die incrementele exploratie en betekenisgeving van grote informatieruimten mogelijk maakt door entiteit zoeken, flexibel gebruik van resulterende entiteiten als zoekparameters en ruimtelijke configuratie van zoekstromen die voor interactie worden gevisualiseerd, te combineren. Entiteiten kunnen flexibel worden hergebruikt om nieuwe zoekstromen te wijzigen en te creëren, en gemanipuleerd worden om hun relaties met andere entiteiten te inspecteren. Gegevens bestaande uit op taken gebaseerde experimenten die Exploration Wall vergelijken met een conventionele zoekgebr... [1,000 / 1,428 chars] |
| Algoritmische Brokjes in Contentlevering [40 chars] | Consistente Hashing en Willekeurige Bomen: Gedistribueerde Cachingprotocollen voor het Verminderen van Hotspots op het World Wide Web We beschrijven een familie van cachingprotocollen voor gedistribueerde netwerken die kunnen worden gebruikt om het voorkomen van hotspots in het netwerk te verminderen of te elimineren. Onze protocollen zijn speciaal ontworpen voor gebruik met zeer grote netwerken zoals het internet, waar vertragingen veroorzaakt door hotspots ernstig kunnen zijn, en waar het niet haalbaar is voor elke server om complete informatie te hebben over de huidige staat van het hele netwerk. De protocollen zijn eenvoudig te implementeren met behulp van bestaande netwerkprotocollen zoals TCP/IP, en vereisen zeer weinig overhead. De protocollen werken met lokale controle, maken efficiënt gebruik van bestaande resources en schalen soepel naarmate het netwerk groeit. Onze cachingprotocollen zijn gebaseerd op een speciaal soort hashing dat we consistente hashing noemen. Ruwweg gezeg... [1,000 / 1,474 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| SPECTER: Document-level Representation Learning using Citation-informed Transformers | 2020 | arXiv paper | https://arxiv.org/abs/2004.07180 |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | arXiv paper | https://arxiv.org/abs/2104.08663 |
| MTEB-NL and E5-NL: Embedding Benchmark and Models for Dutch | 2025 | arXiv paper | https://arxiv.org/abs/2509.12340 |
| clips/beir-nl-scidocs | | dataset card | https://huggingface.co/datasets/clips/beir-nl-scidocs |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoMTEB-Dutch |
| Backing dataset | NanoMTEB-Dutch |
| Task / split | scidocs_nl |
| Hugging Face dataset | hakari-bench/NanoMTEB-Dutch |
| Language | nl |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 986 |
| Positives / query avg | 4.93 |
| Positives / query min | 3 |
| Positives / query median | 5.00 |
| Positives / query max | 5 |
| Multi-positive queries | 200 (100.00%) |
| Query length avg chars | 77.72 |
| Document length avg chars | 1,331.57 |

Candidate Subsets

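| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.1335 | 0.4250 | 0.2698 | top-500 |
| Dense | harrier_oss_v1_270m | 0.2264 | 0.6400 | 0.4564 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.1835 | 0.5700 | 0.4280 | top-100 |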

Training and Leakage Metadata