NanoMTEB-Scandinavian
Overview
NanoMTEB-Scandinavian is a compact retrieval group for Danish, Norwegian, and Swedish tasks from the Scandinavian Embedding Benchmark ecosystem. It covers fact verification, extractive QA answer selection, encyclopedia article lookup, FAQ retrieval, news retrieval, and informal social question answering. The group is small in task count, but it is not a single-domain benchmark: it moves from highly lexical title or evidence retrieval to conversational answer retrieval where lexical overlap is much weaker.
The group contains 1,273 queries, 9,737 task-local documents, and 1,753 positive qrel rows. Most tasks are monolingual within one Scandinavian language, while the group as a whole is multilingual because it spans Danish, Norwegian, and Swedish. Its value is that it tests whether a model can handle closely related North Germanic languages while respecting the different relevance relation that each source task defines.
What This Group Measures
The benchmark measures retrieval after several non-retrieval datasets have been adapted into query-document ranking tasks. dan_fever retrieves Danish evidence snippets for factual claims. nor_quad retrieves short Norwegian answer strings for questions. snl retrieves Store norske leksikon article text from Norwegian headwords. swe_faq retrieves Swedish authority FAQ answers. swedn and tv2_nordretrieval retrieve news summaries or articles from headlines or short summaries. twitter_hjerne retrieves Danish answer tweets for informal question tweets.
The group is therefore useful for separating lexical retrieval from semantic answer retrieval. Some tasks expose strong named entities, titles, places, or dates. Others require matching a question to a concise answer or an informal reply that does not repeat the same words. This creates a clear diagnostic contrast between BM25, dense retrieval, and hybrid candidate generation.
Task Families
- Fact verification: dan_fever retrieves Danish evidence for factual claims.
- Answer selection: nor_quad, swe_faq, and twitter_hjerne retrieve answer strings, FAQ answers, or social-media replies.
- Encyclopedia retrieval: snl retrieves Norwegian encyclopedia articles from short headwords.
- News retrieval: swedn and tv2_nordretrieval retrieve Swedish and Danish news documents from headlines or summaries.
- Multi-positive retrieval: nor_quad, swedn, and twitter_hjerne include more than one relevant document for some queries.
Dataset Shape
The group has seven task pages. dan_fever, snl, swe_faq, and tv2_nordretrieval are single-positive in the Nano split. nor_quad averages 1.48 positives per query, swedn has exactly two positives per query, and twitter_hjerne averages 3.40 positives per query. The document pools are small compared with many Nano sets, ranging from 262 documents for twitter_hjerne to 2,522 for dan_fever.
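The per-task averages above follow directly from the counts in the Task Summary table. A minimal sketch of that arithmetic is shown here; the dictionary and function names are illustrative and are not part of any NanoMTEB tooling.

```python
# Counts copied from the Task Summary table: (queries, documents, positive qrels).
TASK_COUNTS = {
    "dan_fever": (200, 2522, 200),
    "nor_quad": (196, 1048, 291),
    "snl": (200, 1300, 200),
    "swe_faq": (200, 511, 200),
    "swedn": (200, 2046, 400),
    "tv2_nordretrieval": (200, 2048, 200),
    "twitter_hjerne": (77, 262, 262),
}


def positives_per_query(counts):
    """Average number of positive qrels per query for each task."""
    return {
        task: round(positives / queries, 2)
        for task, (queries, _docs, positives) in counts.items()
    }


def group_totals(counts):
    """Sum queries, documents, and positives across all tasks."""
    queries = sum(c[0] for c in counts.values())
    docs = sum(c[1] for c in counts.values())
    positives = sum(c[2] for c in counts.values())
    return queries, docs, positives


if __name__ == "__main__":
    for task, avg in positives_per_query(TASK_COUNTS).items():
        print(f"{task}: {avg:.2f} positives per query")
    print("group totals (queries, docs, positives):", group_totals(TASK_COUNTS))
```

Tasks with an average above 1.00 are the multi-positive ones, which matters for how ranking metrics are normalized per query.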
Text length differs by source. snl has very short title-like queries and long encyclopedia articles. nor_quad has compact questions and short answer documents. twitter_hjerne has long informal question tweets and shorter answer tweets. The news tasks retrieve longer article-style documents, while swe_faq uses public-sector answer text. These differences should be kept visible when comparing model behavior.
Retrieval Behavior
BM25 Profile
BM25 is the best nDCG@10 profile only for dan_fever, but it is highly competitive on several lexical tasks. dan_fever reaches 0.8856 nDCG@10, snl reaches 0.8781, and tv2_nordretrieval reaches 0.8957. These tasks expose strong surface evidence: claims, article headwords, local place names, dates, and news-specific entities.
BM25 is much weaker on answer-selection tasks. nor_quad scores only 0.1118 nDCG@10 because the positive document is often a short answer string that does not repeat the question wording. twitter_hjerne also challenges BM25 because informal replies can be useful without sharing many tokens with the question tweet. This makes the group a good reminder that strong Scandinavian lexical matching does not imply strong answer retrieval.
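For readers who want to see how a per-query nDCG@10 score is formed, the following is a minimal sketch that assumes binary relevance and a ranked list of unique document ids. The evaluator behind the reported numbers is not described here, so treat the function as an illustration of the metric rather than the exact scoring code.

```python
import math


def ndcg_at_k(ranked_doc_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k for a single query.

    ranked_doc_ids: document ids ordered best-first (assumed unique).
    relevant_ids: ids of all positive documents for the query.
    """
    relevant = set(relevant_ids)
    # Discounted gain: a hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in relevant
    )
    # Ideal ranking places every positive (up to k of them) at the top.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    # Hypothetical ids: a short answer string ranked 4th scores well below 1.0.
    print(ndcg_at_k(["d7", "d2", "d9", "answer"], ["answer"]))
```

With a single positive, the score depends only on the rank of that one document, which is why short answer strings that BM25 ranks low pull nor_quad down so sharply.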
Dense Profile
Dense retrieval with harrier-oss-270m is the best profile for six of the seven tasks: nor_quad, snl, swe_faq, swedn, tv2_nordretrieval, and twitter_hjerne. Its largest gains are on answer and social retrieval. nor_quad rises from 0.1118 BM25 nDCG@10 to 0.2378 dense nDCG@10, and twitter_hjerne rises from 0.2395 to 0.6211, well above the hybrid score of 0.4480. Dense also improves swe_faq (0.5449 to 0.6488), where FAQ answers can express policy or advice without repeating the user's wording.
Dense is not a replacement for lexical retrieval in every case. dan_fever slightly favors BM25 (0.8856 versus 0.8630 nDCG@10), indicating that claim-evidence overlap remains valuable. Still, dense retrieval has the highest query-weighted nDCG@10 for the group at 0.7278, making it the strongest single profile for this Nano slice.
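The query-weighted figure can be reproduced from the per-task rows of the Task Summary table. The sketch below assumes that "query-weighted" means each task's nDCG@10 is weighted by its query count; the row layout and function name are illustrative.

```python
# (task, queries, BM25 nDCG@10, dense nDCG@10, reranking hybrid nDCG@10)
ROWS = [
    ("dan_fever", 200, 0.8856, 0.8630, 0.8832),
    ("nor_quad", 196, 0.1118, 0.2378, 0.1301),
    ("snl", 200, 0.8781, 0.9599, 0.9024),
    ("swe_faq", 200, 0.5449, 0.6488, 0.6395),
    ("swedn", 200, 0.7081, 0.7757, 0.7398),
    ("tv2_nordretrieval", 200, 0.8957, 0.9127, 0.8998),
    ("twitter_hjerne", 77, 0.2395, 0.6211, 0.4480),
]

PROFILE_COLUMN = {"bm25": 2, "dense": 3, "hybrid": 4}


def query_weighted_ndcg(rows, profile):
    """Average a profile's nDCG@10 over tasks, weighting each task by its query count."""
    column = PROFILE_COLUMN[profile]
    total_queries = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_queries


if __name__ == "__main__":
    for profile in PROFILE_COLUMN:
        print(profile, round(query_weighted_ndcg(ROWS, profile), 4))
```

Because twitter_hjerne has only 77 queries, it contributes less to this average than a plain mean over tasks would give it.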
Reranking Hybrid Profile
The reranking hybrid profile is not the best profile for any individual task in this group, but it has the strongest recall@100 at 0.8878. It stays close to the best profile on dan_fever, snl, swe_faq, swedn, and tv2_nordretrieval, which means the combined candidate set is often good even when the top-10 order is not optimal.
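To make the recall argument concrete, the sketch below computes recall@k and shows how a merged candidate list can recover positives that each retriever finds only in part. How the hybrid profile actually builds its candidates is not described here; the round-robin `interleave` helper is a hypothetical stand-in, and the document ids are invented.

```python
from itertools import zip_longest


def recall_at_k(ranked_doc_ids, relevant_ids, k=100):
    """Fraction of a query's positives that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked_doc_ids[:k])) / len(relevant)


def interleave(first, second):
    """Round-robin merge of two rankings, keeping the first occurrence of each id."""
    merged, seen = [], set()
    for pair in zip_longest(first, second):
        for doc_id in pair:
            if doc_id is not None and doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


if __name__ == "__main__":
    bm25 = ["d1", "d2", "d3"]    # hypothetical lexical ranking
    dense = ["d4", "d1", "d5"]   # hypothetical dense ranking
    positives = {"d3", "d5"}
    print(recall_at_k(bm25, positives, k=5))
    print(recall_at_k(dense, positives, k=5))
    print(recall_at_k(interleave(bm25, dense), positives, k=5))
```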
The weak spot is answer selection. Hybrid trails dense on nor_quad and twitter_hjerne, suggesting that sparse evidence can dilute dense semantic signals when the relevant answer or reply shares little vocabulary with the query. For Scandinavian retrieval systems, this group supports hybrid candidate generation, but it also argues for task-aware reranking or fusion when the target is a short answer rather than a passage with lexical anchors.
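One way to experiment with task-aware fusion is weighted reciprocal rank fusion (RRF), where the dense ranking gets more weight on answer-selection tasks. This is a swapped-in technique for illustration: the fusion or reranking method behind the reported hybrid profile is not specified here, and the weights, the constant `k=60`, and the function name are assumptions.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights=None, k=60):
    """Fuse several rankings with weighted reciprocal rank fusion.

    rankings: dict mapping a retriever name to its best-first list of doc ids.
    weights: optional dict mapping retriever name to a multiplier (default 1.0).
    k: smoothing constant; larger values flatten the effect of rank differences.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for name, ranked in rankings.items():
        weight = weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)
    # Sort by fused score, breaking ties by doc id so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    rankings = {"bm25": ["a", "b", "c"], "dense": ["c", "b", "a"]}
    print(weighted_rrf(rankings))                  # equal trust in both rankings
    print(weighted_rrf(rankings, {"dense": 2.0}))  # lean on dense for short answers
```

A per-task weight table (for example, a higher dense weight for nor_quad and twitter_hjerne than for dan_fever) would be one hypothetical way to encode the task awareness described above; suitable values would need to be tuned on held-out data.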
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dan_fever | Fact verification | da | 200 | 2,522 | 200 | 1.00 | 0.8856 | 0.8630 | 0.8832 | BM25 |
| nor_quad | Answer selection | no | 196 | 1,048 | 291 | 1.48 | 0.1118 | 0.2378 | 0.1301 | Dense |
| snl | Encyclopedia retrieval | no | 200 | 1,300 | 200 | 1.00 | 0.8781 | 0.9599 | 0.9024 | Dense |
| swe_faq | FAQ answer retrieval | sv | 200 | 511 | 200 | 1.00 | 0.5449 | 0.6488 | 0.6395 | Dense |
| swedn | News retrieval | sv | 200 | 2,046 | 400 | 2.00 | 0.7081 | 0.7757 | 0.7398 | Dense |
| tv2_nordretrieval | News retrieval | da | 200 | 2,048 | 200 | 1.00 | 0.8957 | 0.9127 | 0.8998 | Dense |
| twitter_hjerne | Social QA retrieval | da | 77 | 262 | 262 | 3.40 | 0.2395 | 0.6211 | 0.4480 | Dense |
Interpretation Notes for Model Researchers
NanoMTEB-Scandinavian is strongest as a contrast set. High BM25 performance on dan_fever, snl, and tv2_nordretrieval reflects tasks with entities, titles, claims, and news terms. Dense gains on nor_quad, swe_faq, and twitter_hjerne reflect answerability and intent matching. A model that only does well on lexical tasks may not be useful for Scandinavian FAQ or social QA.
The group also tests related-language coverage. Danish, Norwegian, and Swedish share vocabulary and morphology, but the retrieval sources differ enough that a single language-level conclusion can be misleading. Researchers should compare fact verification, answer selection, encyclopedia, and news subtasks separately before interpreting the aggregate score.
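The caution about language-level conclusions can be checked against the Task Summary table by looking at the spread of scores within each language. A minimal sketch using the BM25 nDCG@10 column; the row layout and function name are illustrative.

```python
from collections import defaultdict

# (task, language, BM25 nDCG@10) copied from the Task Summary table.
BM25_ROWS = [
    ("dan_fever", "da", 0.8856),
    ("nor_quad", "no", 0.1118),
    ("snl", "no", 0.8781),
    ("swe_faq", "sv", 0.5449),
    ("swedn", "sv", 0.7081),
    ("tv2_nordretrieval", "da", 0.8957),
    ("twitter_hjerne", "da", 0.2395),
]


def score_spread_by_language(rows):
    """Return {language: (lowest score, highest score)} across that language's tasks."""
    by_language = defaultdict(list)
    for _task, language, score in rows:
        by_language[language].append(score)
    return {lang: (min(scores), max(scores)) for lang, scores in by_language.items()}


if __name__ == "__main__":
    for lang, (low, high) in sorted(score_spread_by_language(BM25_ROWS).items()):
        print(f"{lang}: BM25 nDCG@10 ranges from {low:.4f} to {high:.4f}")
```

The within-language range for Danish and Norwegian is wider than the gap between any two language averages, which is why per-task reporting is the safer default.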
Training and Leakage Notes
Useful training data includes Danish claim-evidence pairs, Norwegian extractive QA and answer-selection pairs, Store norske leksikon title/article data, Swedish FAQ and public-sector help-center pairs, Danish and Swedish headline-to-article pairs, and Danish social QA threads. For swedn and twitter_hjerne, multi-positive training should be preserved instead of forcing one canonical answer.
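Preserving multiple positives during training can be sketched as a contrastive loss whose numerator sums over every positive for a query instead of a single canonical one. This is one common formulation and is only an assumption here; no training recipe is prescribed by this group, and the temperature value and function name are illustrative.

```python
import math


def multi_positive_info_nce(scores, positive_indices, temperature=0.05):
    """Contrastive loss for one query with one or more positives.

    scores: similarity between the query and each candidate document.
    positive_indices: positions in `scores` that are relevant documents.
    Returns -log(sum of softmax probabilities assigned to the positives).
    """
    if not positive_indices:
        raise ValueError("at least one positive is required")
    logits = [score / temperature for score in scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_all = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    log_pos = max_logit + math.log(
        sum(math.exp(logits[i] - max_logit) for i in positive_indices)
    )
    return log_all - log_pos


if __name__ == "__main__":
    # Hypothetical similarities: two valid answer tweets and one unrelated tweet.
    scores = [0.9, 0.8, 0.1]
    print(multi_positive_info_nce(scores, [0, 1]))  # both positives kept
    print(multi_positive_info_nce(scores, [0]))     # second positive treated as a negative
```

Dropping a true positive from `positive_indices` turns it into a negative and raises the loss, which pushes the model away from an answer that the qrels consider correct.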
Leakage control should exclude Nano evaluation queries, qrels, positives, tweet threads, news article pairs, and near duplicates from SEB-related sources. Hard negatives should be close within the same genre: nearby encyclopedia articles, same-event news stories, related claims with changed entities or dates, similar FAQ answers, or answer tweets from adjacent topics.
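A minimal sketch of the near-duplicate part of leakage control follows, assuming character n-gram Jaccard similarity on normalized text. The threshold of 0.8, the n-gram size, and all function names are illustrative choices, not a documented procedure; the pairwise comparison is quadratic and is only meant for evaluation sets of Nano size.

```python
import re
import unicodedata


def normalize(text):
    """Case-fold, strip punctuation, and collapse whitespace (keeps æ, ø, å, ä, ö)."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def char_ngrams(text, n=5):
    """Set of character n-grams of the normalized text."""
    normalized = normalize(text)
    return {normalized[i:i + n] for i in range(max(len(normalized) - n + 1, 1))}


def jaccard(a, b):
    """Jaccard similarity between two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_training_texts(train_texts, eval_texts, threshold=0.8):
    """Drop training texts that match or nearly match any evaluation text."""
    eval_grams = [char_ngrams(text) for text in eval_texts]
    kept = []
    for text in train_texts:
        grams = char_ngrams(text)
        if any(jaccard(grams, other) >= threshold for other in eval_grams):
            continue  # exact or near duplicate of an evaluation query or positive
        kept.append(text)
    return kept


if __name__ == "__main__":
    eval_texts = ["Hvad er hovedstaden i Danmark?"]
    train_texts = ["hvad er hovedstaden i Danmark", "Hvor ligger Bergen?"]
    print(filter_training_texts(train_texts, eval_texts))
```

The same similarity function could also be used in the other direction to pick hard negatives: candidates from the same genre that are close to a positive but fall below the duplicate threshold.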
Source Reference Table
Metadata Summary
| Field | Value |
| --- | --- |
| Task pages | 7 |
| Queries | 1,273 |
| Split-local documents | 9,737 |
| Positive qrels | 1,753 |
| Languages | da, no, sv |
| Categories | natural_language |
| Positives / query avg | 1.38 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dan_fever | NanoMTEB-Scandinavian | da | natural_language | 200 | 2,522 | 200 | 0.8856 | 0.8630 | 0.8832 | BM25 |
| nor_quad | NanoMTEB-Scandinavian | no | natural_language | 196 | 1,048 | 291 | 0.1118 | 0.2378 | 0.1301 | Dense |
| snl | NanoMTEB-Scandinavian | no | natural_language | 200 | 1,300 | 200 | 0.8781 | 0.9599 | 0.9024 | Dense |
| swe_faq | NanoMTEB-Scandinavian | sv | natural_language | 200 | 511 | 200 | 0.5449 | 0.6488 | 0.6395 | Dense |
| swedn | NanoMTEB-Scandinavian | sv | natural_language | 200 | 2,046 | 400 | 0.7081 | 0.7757 | 0.7398 | Dense |
| tv2_nordretrieval | NanoMTEB-Scandinavian | da | natural_language | 200 | 2,048 | 200 | 0.8957 | 0.9127 | 0.8998 | Dense |
| twitter_hjerne | NanoMTEB-Scandinavian | da | natural_language | 77 | 262 | 262 | 0.2395 | 0.6211 | 0.4480 | Dense |