NanoMTEB-German
Overview
NanoMTEB-German is a compact German retrieval group covering five tasks from the MTEB and multilingual MTEB ecosystem. It brings together legal retrieval, open-domain question answering, reading-comprehension context retrieval, municipal service search, and e-commerce product retrieval. The group is useful because it does not describe one single German retrieval problem: it tests whether a model can move between formal legal prose, encyclopedic passages, citizen-facing administrative language, and short marketplace labels.
The benchmark contains 982 queries, 23,455 task-local documents, and 4,959 positive qrel rows. Most tasks are single-positive retrieval tasks, but xmarket_de is heavily multi-positive, with category queries linked to many acceptable product records. This mixture makes the group a good diagnostic for German retrieval systems that need both precise answer-bearing passage retrieval and broader many-relevant-item ranking.
What This Group Measures
The central question is whether a model can retrieve German documents under different relevance relations. ger_da_lir asks the model to find legal decisions from legal passages, where exact legal terminology and citation-like phrasing are important. german_dpr and german_qu_ad use German Wikipedia question-answering data, but they behave differently: one rewards semantic answerability more strongly, while the other is almost solved by strong lexical and context overlap. gov_service maps natural user questions to Munich service pages, testing intent matching in administrative German. xmarket_de maps category-like German queries to product metadata, testing high-recall ranking with many relevant items.
For researchers, the value of the group is the contrast among retrieval signals. Some tasks are dominated by sparse lexical matching, some by dense semantic matching, and none in the current Nano slice is best served by the reranking hybrid profile at nDCG@10. That does not make hybrid unimportant: hybrid still improves top-100 coverage and hit@10 at group level, but it is not the leading nDCG@10 profile for any individual task in this group.
Task Families
- Legal retrieval:
ger_da_lirretrieves German legal decisions from legal passages. It is long-document retrieval with strong terminology and citation signals. - German QA passage retrieval:
german_dprretrieves answer-bearing German Wikipedia passages for open-domain questions. - German reading-comprehension retrieval:
german_qu_adretrieves the GermanQuAD context passage for each question, with very high scores across all retrieval profiles. - Public service retrieval:
gov_serviceretrieves Munich municipal service pages from citizen questions. - Marketplace retrieval:
xmarket_deretrieves product records from German category labels and short product-oriented queries.
Dataset Shape
The group has five task pages. Four tasks are marked as German (de), while xmarket_de is marked multilingual because product names, brand strings, and category labels often mix German with international terms. The group-level document count is the sum of task-local pools, not a deduplicated shared corpus.
Three tasks are strictly single-positive: german_dpr, german_qu_ad, and gov_service. ger_da_lir is nearly single-positive with 235 positives for 200 queries. xmarket_de is the outlier: 182 queries have 4,124 positives, or about 22.66 positives per query. This difference matters when interpreting recall and nDCG. A single missed positive can define failure in the QA tasks, while marketplace ranking is more about placing many acceptable products early.
Retrieval Behavior
BM25 Profile
BM25 is the best nDCG@10 profile for ger_da_lir and german_qu_ad. This is consistent with tasks where query wording, entities, legal terms, and passage phrases carry direct relevance evidence. In german_qu_ad, BM25 reaches 0.9458 nDCG@10, essentially matching dense and hybrid. In ger_da_lir, BM25 scores 0.5360 while dense falls to 0.2920, showing that the legal task in this Nano slice strongly rewards exact German legal vocabulary and surface-form anchors.
BM25 is weaker on german_dpr, gov_service, and xmarket_de. Those tasks require more paraphrase, intent matching, or category-product association than literal token overlap alone can provide. At group level, BM25 remains a strong baseline because the legal and GermanQuAD tasks are sizeable and lexically friendly, but it should not be treated as a universal German retrieval solution.
Dense Profile
Dense retrieval with harrier-oss-270m is the best profile for german_dpr, gov_service, and xmarket_de. The GermanDPR result is the clearest case: dense reaches 0.7837 nDCG@10 against BM25 at 0.4647, which indicates that the task depends on semantic answerability and paraphrase between question and passage. gov_service behaves similarly, with dense at 0.7903 against BM25 at 0.6132, because citizen questions and official service descriptions often use different wording for the same intent.
Dense is only slightly ahead on xmarket_de, where all profiles are low: 0.2268 for dense, 0.2210 for hybrid, and 0.2012 for BM25. This suggests that embedding similarity helps with category-product relatedness, but the task remains difficult because relevance is broad, product metadata is short, and many positives compete within the top ranks.
Reranking Hybrid Profile
The reranking hybrid column combines sparse and dense evidence to emulate a hybrid-search candidate set. In this group it is not the best nDCG@10 profile for any individual task, but it is still informative. It improves group-level hit@10 and recall@100 relative to dense alone, and it stays close to the best profile on german_qu_ad and xmarket_de. That pattern indicates that hybrid retrieval is recovering complementary candidates even when the final top-10 ordering is not the strongest.
The main caution is visible in german_dpr and gov_service: hybrid trails dense by a noticeable margin at nDCG@10. For these semantic-answerability tasks, adding sparse evidence can dilute dense ranking in the top positions. For German systems, this group therefore argues for tuning hybrid fusion per task family rather than assuming that combined retrieval always dominates both components.
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| ger_da_lir | Legal retrieval | de | 200 | 10,000 | 235 | 1.18 | 0.5360 | 0.2920 | 0.4461 | BM25 |
| german_dpr | QA passage retrieval | de | 200 | 2,876 | 200 | 1.00 | 0.4647 | 0.7837 | 0.6120 | Dense |
| german_qu_ad | Reading-comprehension retrieval | de | 200 | 474 | 200 | 1.00 | 0.9458 | 0.9321 | 0.9427 | BM25 |
| gov_service | Public service retrieval | de | 200 | 105 | 200 | 1.00 | 0.6132 | 0.7903 | 0.6959 | Dense |
| xmarket_de | Marketplace retrieval | multilingual | 182 | 10,000 | 4,124 | 22.66 | 0.2012 | 0.2268 | 0.2210 | Dense |
Interpretation Notes for Model Researchers
NanoMTEB-German should be read as a diagnostic group, not as a single aggregate German score. BM25-led performance on ger_da_lir and german_qu_ad suggests that exact lexical evidence remains critical for legal and context-overlap retrieval. Dense-led performance on german_dpr, gov_service, and xmarket_de suggests that German semantic retrieval quality matters for question answering, public-service intent matching, and category-product association.
The group also separates single-positive evaluation from many-positive evaluation. Improvements on xmarket_de may reflect better category coverage and product clustering, while improvements on german_dpr or gov_service usually mean better ranking of one target passage or page. When comparing models, inspect per-task nDCG@10 and recall@100 before interpreting the group mean.
Training and Leakage Notes
Training data for this group should be separated by retrieval family. German Wikipedia QA pairs, GermanQuAD-style context retrieval pairs, German municipal FAQ and service descriptions, German legal case retrieval pairs, and e-commerce category-product pairs are all useful, but they exercise different relevance relations. Mixing them without labels can hide whether gains come from legal lexical matching, semantic QA retrieval, public-service intent matching, or marketplace categorization.
Leakage control should exclude evaluation queries, qrels, and positive documents from GerDaLIR, GermanDPR, GermanQuAD, LHM-Dienstleistungen-QA, and XMarket-derived data. Synthetic augmentation should preserve named entities, legal terms, service names, product names, numbers, and category labels. Hard negatives are especially useful when they share surface terms but differ in the actual legal issue, answer, service, or product category.
Source Reference Table
| Source | Year | Type | URL |
| GerDaLIR: A German Dataset for Legal Information Retrieval | 2021 | paper | https://aclanthology.org/2021.nllp-1.13/ |
| GermanQuAD and GermanDPR: Improving Non-English Question Answering and Passage Retrieval | 2021 | paper | https://arxiv.org/abs/2104.12741 |
| Cross-Market Product Recommendation | 2021 | paper | https://arxiv.org/abs/2109.05929 |
| MTEB: Massive Text Embedding Benchmark | 2023 | paper | https://arxiv.org/abs/2210.07316 |
| MMTEB arXiv | paper | https://arxiv.org/abs/2502.13595 | |
| GerDaLIR GitHub | dataset or project page | https://github.com/lavis-nlp/GerDaLIR | |
| mteb/GerDaLIR | dataset or project page | https://huggingface.co/datasets/mteb/GerDaLIR | |
| mteb/GermanDPR | dataset or project page | https://huggingface.co/datasets/mteb/GermanDPR | |
| deepset/germandpr | dataset or project page | https://huggingface.co/datasets/deepset/germandpr | |
| deepset/germanquad | dataset or project page | https://huggingface.co/datasets/deepset/germanquad | |
| it-at-m/LHM-Dienstleistungen-QA | 2022 | dataset or project page | https://huggingface.co/datasets/it-at-m/LHM-Dienstleistungen-QA |
Metadata Summary
| Field | Value |
| Task pages | 5 |
| Queries | 982 |
| Split-local documents | 23,455 |
| Positive qrels | 4,959 |
| Languages | de, multilingual |
| Categories | natural_language |
| Positives / query avg | 5.05 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| ger_da_lir | NanoMTEB-German | de | natural_language | 200 | 10,000 | 235 | 0.5360 | 0.2920 | 0.4461 | BM25 |
| german_dpr | NanoMTEB-German | de | natural_language | 200 | 2,876 | 200 | 0.4647 | 0.7837 | 0.6120 | Dense |
| german_qu_ad | NanoMTEB-German | de | natural_language | 200 | 474 | 200 | 0.9458 | 0.9321 | 0.9427 | BM25 |
| gov_service | NanoMTEB-German | de | natural_language | 200 | 105 | 200 | 0.6132 | 0.7903 | 0.6959 | Dense |
| xmarket_de | NanoMTEB-German | multilingual | natural_language | 182 | 10,000 | 4,124 | 0.2012 | 0.2268 | 0.2210 | Dense |