HAKARI-Bench

NanoMTEB-German

Overview

NanoMTEB-German is a compact German retrieval group covering five tasks from the MTEB and multilingual MTEB ecosystem. It brings together legal retrieval, open-domain question answering, reading-comprehension context retrieval, municipal service search, and e-commerce product retrieval. The group is useful because it does not describe one single German retrieval problem: it tests whether a model can move between formal legal prose, encyclopedic passages, citizen-facing administrative language, and short marketplace labels.

The benchmark contains 982 queries, 23,455 task-local documents, and 4,959 positive qrel rows. Most tasks are single-positive retrieval tasks, but xmarket_de is heavily multi-positive, with category queries linked to many acceptable product records. This mixture makes the group a good diagnostic for German retrieval systems that need both precise answer-bearing passage retrieval and broader many-relevant-item ranking.

What This Group Measures

The central question is whether a model can retrieve German documents under different relevance relations. ger_da_lir asks the model to find legal decisions from legal passages, where exact legal terminology and citation-like phrasing are important. german_dpr and german_qu_ad both use German Wikipedia question-answering data, but they behave differently: german_dpr rewards semantic answerability more strongly, while german_qu_ad is almost solved by strong lexical and context overlap. gov_service maps natural user questions to Munich service pages, testing intent matching in administrative German. xmarket_de maps category-like German queries to product metadata, testing high-recall ranking with many relevant items.

For researchers, the value of the group is the contrast among retrieval signals. Some tasks are dominated by sparse lexical matching, some by dense semantic matching, and none in the current Nano slice is best served by the reranking hybrid profile at nDCG@10. That does not make hybrid unimportant: hybrid still improves top-100 coverage and hit@10 at group level, but it is not the leading nDCG@10 profile for any individual task in this group.

Task Families

Dataset Shape

The group has five task pages. Four tasks are marked as German (de), while xmarket_de is marked multilingual because product names, brand strings, and category labels often mix German with international terms. The group-level document count is the sum of task-local pools, not a deduplicated shared corpus.

Three tasks are strictly single-positive: german_dpr, german_qu_ad, and gov_service. ger_da_lir is nearly single-positive with 235 positives for 200 queries. xmarket_de is the outlier: 182 queries have 4,124 positives, or about 22.66 positives per query. This difference matters when interpreting recall and nDCG. A single missed positive can define failure in the QA tasks, while marketplace ranking is more about placing many acceptable products early.
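The per-task ratios and group totals quoted above follow directly from the query, document, and positive counts listed for each task. The sketch below recomputes them; the constant and function names are illustrative and not part of any benchmark tooling.

```python
# Per-task counts as reported for this group: (queries, docs, positives).
TASK_COUNTS = {
    "ger_da_lir": (200, 10_000, 235),
    "german_dpr": (200, 2_876, 200),
    "german_qu_ad": (200, 474, 200),
    "gov_service": (200, 105, 200),
    "xmarket_de": (182, 10_000, 4_124),
}


def positives_per_query(counts):
    """Return the average number of positive qrel rows per query for each task."""
    return {task: pos / queries for task, (queries, _docs, pos) in counts.items()}


def group_totals(counts):
    """Sum queries, task-local documents, and positives across tasks (no deduplication)."""
    queries = sum(q for q, _, _ in counts.values())
    docs = sum(d for _, d, _ in counts.values())
    positives = sum(p for _, _, p in counts.values())
    return queries, docs, positives


if __name__ == "__main__":
    for task, ratio in positives_per_query(TASK_COUNTS).items():
        print(f"{task}: {ratio:.2f} positives per query")
    queries, docs, positives = group_totals(TASK_COUNTS)
    print(f"group: {queries} queries, {docs} docs, {positives} positives, "
          f"{positives / queries:.2f} positives per query")
```

Because the document total is a plain sum of task-local pools, it says nothing about how many documents would remain after deduplication across tasks.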

Retrieval Behavior

BM25 Profile

BM25 is the best nDCG@10 profile for ger_da_lir and german_qu_ad. This is consistent with tasks where query wording, entities, legal terms, and passage phrases carry direct relevance evidence. In german_qu_ad, BM25 reaches 0.9458 nDCG@10, narrowly ahead of hybrid (0.9427) and dense (0.9321). In ger_da_lir, BM25 scores 0.5360 while dense falls to 0.2920, showing that the legal task in this Nano slice strongly rewards exact German legal vocabulary and surface-form anchors.

BM25 is weaker on german_dpr, gov_service, and xmarket_de. Those tasks require more paraphrase, intent matching, or category-product association than literal token overlap alone can provide. At group level, BM25 remains a strong baseline because the legal and GermanQuAD tasks together contribute 400 of the 982 queries and are lexically friendly, but it should not be treated as a universal German retrieval solution.

Dense Profile

Dense retrieval with harrier-oss-270m is the best profile for german_dpr, gov_service, and xmarket_de. The GermanDPR result is the clearest case: dense reaches 0.7837 nDCG@10 against BM25 at 0.4647, which indicates that the task depends on semantic answerability and paraphrase between question and passage. gov_service behaves similarly, with dense at 0.7903 against BM25 at 0.6132, because citizen questions and official service descriptions often use different wording for the same intent.

Dense is only slightly ahead on xmarket_de, where all profiles are low: 0.2268 for dense, 0.2210 for hybrid, and 0.2012 for BM25. This suggests that embedding similarity helps with category-product relatedness, but the task remains difficult because relevance is broad, product metadata is short, and many positives compete within the top ranks.

Reranking Hybrid Profile

The reranking hybrid column combines sparse and dense evidence to emulate a hybrid-search candidate set. In this group it is not the best nDCG@10 profile for any individual task, but it is still informative. It improves group-level hit@10 and recall@100 relative to dense alone, and it stays close to the best profile on german_qu_ad and xmarket_de. That pattern indicates that hybrid retrieval is recovering complementary candidates even when the final top-10 ordering is not the strongest.

The main caution is visible in german_dpr and gov_service, where hybrid trails dense at nDCG@10: 0.6120 against 0.7837 on german_dpr and 0.6959 against 0.7903 on gov_service. For these semantic-answerability tasks, adding sparse evidence can dilute dense ranking in the top positions. For German systems, this group therefore argues for tuning hybrid fusion per task family rather than assuming that combined retrieval always dominates both components.
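One way to make per-family tuning concrete is to expose the sparse/dense balance as a parameter. The sketch below uses weighted reciprocal rank fusion (RRF), which is an assumption: this page does not state how the reranking hybrid column combines its candidates. The function name, the constant k=60, and the example weights are illustrative and not tuned on this benchmark.

```python
from collections import defaultdict


def weighted_rrf(sparse_ranking, dense_ranking, sparse_weight=0.5, k=60, top_n=10):
    """Fuse two ranked lists of doc ids with weighted reciprocal rank fusion.

    sparse_weight=1.0 reproduces the sparse order, 0.0 reproduces the dense order.
    This is an assumed fusion rule, not the benchmark's own hybrid method.
    """
    dense_weight = 1.0 - sparse_weight
    scores = defaultdict(float)
    for weight, ranking in ((sparse_weight, sparse_ranking), (dense_weight, dense_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties are broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Hypothetical starting weights per task family, to be tuned on held-out data.
SPARSE_WEIGHT_BY_FAMILY = {
    "legal": 0.8,          # lexically driven, as ger_da_lir appears to be
    "qa_passage": 0.2,     # semantically driven, as german_dpr appears to be
}

if __name__ == "__main__":
    bm25_top = ["d3", "d1", "d7"]
    dense_top = ["d1", "d9", "d3"]
    for family, weight in SPARSE_WEIGHT_BY_FAMILY.items():
        print(family, weighted_rrf(bm25_top, dense_top, sparse_weight=weight))
```

Sweeping sparse_weight on a held-out split per task family would show whether the hybrid gap on german_dpr and gov_service comes from the fusion balance or from the candidates themselves.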

Task Summary

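| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| ger_da_lir | Legal retrieval | de | 200 | 10,000 | 235 | 1.18 | 0.5360 | 0.2920 | 0.4461 | BM25 |
| german_dpr | QA passage retrieval | de | 200 | 2,876 | 200 | 1.00 | 0.4647 | 0.7837 | 0.6120 | Dense |
| german_qu_ad | Reading-comprehension retrieval | de | 200 | 474 | 200 | 1.00 | 0.9458 | 0.9321 | 0.9427 | BM25 |
| gov_service | Public service retrieval | de | 200 | 105 | 200 | 1.00 | 0.6132 | 0.7903 | 0.6959 | Dense |
| xmarket_de | Marketplace retrieval | multilingual | 182 | 10,000 | 4,124 | 22.66 | 0.2012 | 0.2268 | 0.2210 | Dense |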

Interpretation Notes for Model Researchers

NanoMTEB-German should be read as a diagnostic group, not as a single aggregate German score. BM25-led performance on ger_da_lir and german_qu_ad suggests that exact lexical evidence remains critical for legal and context-overlap retrieval. Dense-led performance on german_dpr, gov_service, and xmarket_de suggests that German semantic retrieval quality matters for question answering, public-service intent matching, and category-product association.
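The sketch below shows how an aggregate can mask the per-task split described above. It takes the reported nDCG@10 values, picks the best profile per task, and computes an unweighted mean over the five tasks. The unweighted mean is an assumption made for illustration; the benchmark's own group aggregation is not specified here and may differ.

```python
# nDCG@10 per task and profile, as reported for this group.
NDCG_AT_10 = {
    "ger_da_lir": {"bm25": 0.5360, "dense": 0.2920, "hybrid": 0.4461},
    "german_dpr": {"bm25": 0.4647, "dense": 0.7837, "hybrid": 0.6120},
    "german_qu_ad": {"bm25": 0.9458, "dense": 0.9321, "hybrid": 0.9427},
    "gov_service": {"bm25": 0.6132, "dense": 0.7903, "hybrid": 0.6959},
    "xmarket_de": {"bm25": 0.2012, "dense": 0.2268, "hybrid": 0.2210},
}


def best_profile(scores):
    """Return the profile name with the highest score for one task."""
    return max(scores, key=scores.get)


def macro_mean(table, profile):
    """Unweighted mean of one profile across tasks (an assumed aggregation)."""
    return sum(row[profile] for row in table.values()) / len(table)


if __name__ == "__main__":
    for task, scores in NDCG_AT_10.items():
        winner = best_profile(scores)
        runner_up = sorted(scores.values(), reverse=True)[1]
        print(f"{task}: best={winner} margin={scores[winner] - runner_up:.4f}")
    for profile in ("bm25", "dense", "hybrid"):
        print(f"{profile}: unweighted mean nDCG@10 = {macro_mean(NDCG_AT_10, profile):.4f}")
```

Under this assumed averaging, dense has the highest mean even though it trails BM25 by 0.2440 on ger_da_lir, which is the kind of detail a single group score hides.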

The group also separates single-positive evaluation from many-positive evaluation. Improvements on xmarket_de may reflect better category coverage and product clustering, while improvements on german_dpr or gov_service usually mean better ranking of one target passage or page. When comparing models, inspect per-task nDCG@10 and recall@100 before interpreting the group mean.
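To see why the same metric behaves differently on single-positive and many-positive tasks, it helps to write the two metrics out. The sketch below assumes binary relevance and the common log2-discount form of nDCG; the evaluation code actually used for this benchmark is not shown on this page and may differ in details such as tie handling or how the recall denominator is defined.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-gain nDCG@k: DCG of the ranking divided by the DCG of an ideal ranking."""
    positives = set(positives)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    # An ideal ranking places min(len(positives), k) positives at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of all positives that appear in the top k (denominator is not capped at k)."""
    positives = set(positives)
    if not positives:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in positives)
    return hits / len(positives)


if __name__ == "__main__":
    # Single-positive query: the score depends only on where the one target lands.
    print(ndcg_at_k(["x", "y", "target"], {"target"}))
    # Many-positive query: 22 positives, 5 of them ranked first, then 5 non-relevant docs.
    many = {f"p{i}" for i in range(22)}
    ranking = [f"p{i}" for i in range(5)] + [f"n{i}" for i in range(5)]
    print(ndcg_at_k(ranking, many), recall_at_k(ranking, many))
```

With one positive, nDCG@10 is a function of a single rank. With about 22 positives per query, as in xmarket_de, the ideal top 10 is entirely relevant, so a ranking can place several correct products early and still score well below 1.0.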

Training and Leakage Notes

Training data for this group should be separated by retrieval family. German Wikipedia QA pairs, GermanQuAD-style context retrieval pairs, German municipal FAQ and service descriptions, German legal case retrieval pairs, and e-commerce category-product pairs are all useful, but they exercise different relevance relations. Mixing them without labels can hide whether gains come from legal lexical matching, semantic QA retrieval, public-service intent matching, or marketplace categorization.

Leakage control should exclude evaluation queries, qrels, and positive documents from GerDaLIR, GermanDPR, GermanQuAD, LHM-Dienstleistungen-QA, and XMarket-derived data. Synthetic augmentation should preserve named entities, legal terms, service names, product names, numbers, and category labels. Hard negatives are especially useful when they share surface terms but differ in the actual legal issue, answer, service, or product category.
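A minimal sketch of the exclusion step is shown below, assuming training data is available as (query, document) text pairs and that evaluation queries and positive documents can be loaded as plain strings. The function names and the normalization rule are assumptions for illustration. Exact matching after normalization only catches verbatim or near-verbatim overlap; paraphrased or partially overlapping items need a separate near-duplicate check.

```python
import re
import unicodedata


def normalize(text):
    """Normalize text for overlap checks: Unicode NFKC, case folding, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return re.sub(r"\s+", " ", text).strip()


def filter_training_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Drop training pairs whose query or document matches evaluation data after normalization."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    kept = []
    for query, doc in train_pairs:
        if normalize(query) in blocked_queries or normalize(doc) in blocked_docs:
            continue  # overlaps with the evaluation set
        kept.append((query, doc))
    return kept


if __name__ == "__main__":
    # Toy strings for illustration only; they are not taken from the benchmark.
    train = [
        ("wo  beantrage ich einen reisepass?", "Den Reisepass beantragen Sie im Bürgerbüro."),
        ("Wie melde ich ein Gewerbe an?", "Die Gewerbeanmeldung erfolgt beim Gewerbeamt."),
    ]
    eval_queries = ["Wo beantrage ich einen Reisepass?"]
    eval_docs = []
    print(filter_training_pairs(train, eval_queries, eval_docs))
```

Case folding maps "ß" to "ss", which makes the match slightly more aggressive for German text; that is usually acceptable for a leakage filter, where false removals cost less than missed overlap.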

Source Reference Table

| Source | Year | Type | URL |
|---|---|---|---|
| GerDaLIR: A German Dataset for Legal Information Retrieval | 2021 | paper | https://aclanthology.org/2021.nllp-1.13/ |
| GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval | 2021 | paper | https://arxiv.org/abs/2104.12741 |
| Cross-Market Product Recommendation | 2021 | paper | https://arxiv.org/abs/2109.05929 |
| MTEB: Massive Text Embedding Benchmark | 2023 | paper | https://arxiv.org/abs/2210.07316 |
| MMTEB arXiv | | paper | https://arxiv.org/abs/2502.13595 |
| GerDaLIR GitHub | | dataset or project page | https://github.com/lavis-nlp/GerDaLIR |
| mteb/GerDaLIR | | dataset or project page | https://huggingface.co/datasets/mteb/GerDaLIR |
| mteb/GermanDPR | | dataset or project page | https://huggingface.co/datasets/mteb/GermanDPR |
| deepset/germandpr | | dataset or project page | https://huggingface.co/datasets/deepset/germandpr |
| deepset/germanquad | | dataset or project page | https://huggingface.co/datasets/deepset/germanquad |
| it-at-m/LHM-Dienstleistungen-QA | 2022 | dataset or project page | https://huggingface.co/datasets/it-at-m/LHM-Dienstleistungen-QA |

Metadata Summary

| Field | Value |
|---|---|
| Task pages | 5 |
| Queries | 982 |
| Split-local documents | 23,455 |
| Positive qrels | 4,959 |
| Languages | de, multilingual |
| Categories | natural_language |
| Positives / query avg | 5.05 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| ger_da_lir | NanoMTEB-German | de | natural_language | 200 | 10,000 | 235 | 0.5360 | 0.2920 | 0.4461 | BM25 |
| german_dpr | NanoMTEB-German | de | natural_language | 200 | 2,876 | 200 | 0.4647 | 0.7837 | 0.6120 | Dense |
| german_qu_ad | NanoMTEB-German | de | natural_language | 200 | 474 | 200 | 0.9458 | 0.9321 | 0.9427 | BM25 |
| gov_service | NanoMTEB-German | de | natural_language | 200 | 105 | 200 | 0.6132 | 0.7903 | 0.6959 | Dense |
| xmarket_de | NanoMTEB-German | multilingual | natural_language | 182 | 10,000 | 4,124 | 0.2012 | 0.2268 | 0.2210 | Dense |