HAKARI-Bench

NanoMTEB-Scandinavian

Overview

NanoMTEB-Scandinavian is a compact retrieval group for Danish, Norwegian, and Swedish tasks from the Scandinavian Embedding Benchmark ecosystem. It covers fact verification, extractive QA answer selection, encyclopedia article lookup, FAQ retrieval, news retrieval, and informal social question answering. The group is small in task count, but it is not a single-domain benchmark: it moves from highly lexical title or evidence retrieval to conversational answer retrieval where lexical overlap is much weaker.

The group contains 1,273 queries, 9,737 task-local documents, and 1,753 positive qrel rows. Most tasks are monolingual within one Scandinavian language, while the group as a whole is multilingual because it spans Danish, Norwegian, and Swedish. Its value lies in testing whether a model can handle closely related North Germanic languages while respecting the distinct relevance relation that each source task defines.

What This Group Measures

The benchmark measures retrieval after several non-retrieval datasets have been adapted into query-document ranking tasks. dan_fever retrieves Danish evidence snippets for factual claims. nor_quad retrieves short Norwegian answer strings for questions. snl retrieves Store norske leksikon article text from Norwegian headwords. swe_faq retrieves Swedish authority FAQ answers. swedn and tv2_nordretrieval retrieve news summaries or articles from headlines or short summaries. twitter_hjerne retrieves Danish answer tweets for informal question tweets.

The group is therefore useful for separating lexical retrieval from semantic answer retrieval. Some tasks expose strong named entities, titles, places, or dates. Others require matching a question to a concise answer or an informal reply that does not repeat the same words. This creates a clear diagnostic contrast between BM25, dense retrieval, and hybrid candidate generation.

Task Families

Dataset Shape

The group has seven task pages. dan_fever, snl, swe_faq, and tv2_nordretrieval are single-positive in the Nano split. nor_quad averages 1.48 positives per query, swedn has exactly two positives per query, and twitter_hjerne averages 3.40 positives per query. The document pools are small compared with many Nano sets, ranging from 262 documents for twitter_hjerne to 2,522 for dan_fever.
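The positives-per-query figures above follow directly from the per-task query and qrel counts. The short sketch below recomputes them; the dictionary layout and function names are illustrative, not part of any benchmark tooling, and the counts are copied from the task summary table.

```python
# Per-task (queries, positive qrel rows), copied from the task summary table.
COUNTS = {
    "dan_fever": (200, 200),
    "nor_quad": (196, 291),
    "snl": (200, 200),
    "swe_faq": (200, 200),
    "swedn": (200, 400),
    "tv2_nordretrieval": (200, 200),
    "twitter_hjerne": (77, 262),
}


def positives_per_query(counts):
    """Return the average number of positive qrel rows per query for each task."""
    return {task: positives / queries for task, (queries, positives) in counts.items()}


def group_average(counts):
    """Return positives per query over the whole group (total positives / total queries)."""
    total_queries = sum(queries for queries, _ in counts.values())
    total_positives = sum(positives for _, positives in counts.values())
    return total_positives / total_queries


if __name__ == "__main__":
    for task, ratio in positives_per_query(COUNTS).items():
        print(f"{task}: {ratio:.2f}")
    print(f"group: {group_average(COUNTS):.2f}")
```

The group-level value is a ratio of totals, so the 77-query twitter_hjerne task contributes less to it than its 3.40 average might suggest.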

Text length differs by source. snl has very short title-like queries and long encyclopedia articles. nor_quad has compact questions and short answer documents. twitter_hjerne has long informal question tweets and shorter answer tweets. The news tasks retrieve longer article-style documents, while swe_faq uses public-sector answer text. These differences should be kept visible when comparing model behavior.

Retrieval Behavior

BM25 Profile

BM25 is the best nDCG@10 profile only for dan_fever, but it is highly competitive on several lexical tasks. dan_fever reaches 0.8856 nDCG@10, snl reaches 0.8781, and tv2_nordretrieval reaches 0.8957. These tasks expose strong surface evidence: claims, article headwords, local place names, dates, and news-specific entities.

BM25 is much weaker on answer-selection tasks. nor_quad scores only 0.1118 nDCG@10 because the positive document is often a short answer string that does not repeat the question wording. twitter_hjerne also challenges BM25 because informal replies can be useful without sharing many tokens with the question tweet. This makes the group a good reminder that strong Scandinavian lexical matching does not imply strong answer retrieval.
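For readers who want to relate these scores to individual rankings, here is a minimal sketch of nDCG@10. It assumes binary relevance (a document is either a qrel positive or not) and the common log2 rank discount; the exact evaluation code behind the reported numbers is not shown in this page, so treat this as an illustration rather than the reference implementation.

```python
import math


def dcg_at_k(gains, k=10):
    """Discounted cumulative gain over the first k positions (rank 1 has discount log2(2) = 1)."""
    return sum(gain / math.log2(rank + 2) for rank, gain in enumerate(gains[:k]))


def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """nDCG@k with binary gains: 1.0 for a qrel positive, 0.0 otherwise."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    gains = [1.0 if doc_id in positives else 0.0 for doc_id in ranked_doc_ids]
    # The ideal ranking places every positive (up to k of them) at the top.
    ideal_gains = [1.0] * min(len(positives), k)
    return dcg_at_k(gains, k) / dcg_at_k(ideal_gains, k)


if __name__ == "__main__":
    # One positive retrieved at rank 2 instead of rank 1.
    print(ndcg_at_k(["d7", "d3", "d9"], ["d3"]))
```

Under this definition a multi-positive task such as twitter_hjerne only reaches 1.0 when all of a query's positives are ranked ahead of every non-positive, which is a stricter target than in the single-positive tasks.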

Dense Profile

Dense retrieval with harrier-oss-270m is the best profile for six tasks: nor_quad, snl, swe_faq, swedn, tv2_nordretrieval, and twitter_hjerne. Its largest gains are on answer and social retrieval. nor_quad rises from 0.1118 BM25 nDCG@10 to 0.2378 dense nDCG@10, and twitter_hjerne rises from 0.2395 to 0.6211, with hybrid (0.4480) remaining well below dense. Dense also improves swe_faq, where FAQ answers can express policy or advice without repeating the user's wording.

Dense is not a replacement for lexical retrieval in every case. dan_fever slightly favors BM25, indicating that claim-evidence overlap remains valuable. Still, dense retrieval has the strongest query-weighted nDCG@10 for the group at 0.7278, making it the most important single profile for this Nano slice.
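The query-weighted figure can be reproduced from the per-task scores by weighting each task's nDCG@10 by its query count. The sketch below does this for all three profiles; the data structure and function name are illustrative, and the numbers are copied from the task summary table.

```python
# task: (queries, BM25, dense, reranking hybrid) nDCG@10, copied from the task summary table.
SCORES = {
    "dan_fever": (200, 0.8856, 0.8630, 0.8832),
    "nor_quad": (196, 0.1118, 0.2378, 0.1301),
    "snl": (200, 0.8781, 0.9599, 0.9024),
    "swe_faq": (200, 0.5449, 0.6488, 0.6395),
    "swedn": (200, 0.7081, 0.7757, 0.7398),
    "tv2_nordretrieval": (200, 0.8957, 0.9127, 0.8998),
    "twitter_hjerne": (77, 0.2395, 0.6211, 0.4480),
}

PROFILE_COLUMN = {"bm25": 1, "dense": 2, "hybrid": 3}


def query_weighted_ndcg(scores, profile):
    """Average a profile's per-task nDCG@10, weighting each task by its number of queries."""
    column = PROFILE_COLUMN[profile]
    total_queries = sum(row[0] for row in scores.values())
    weighted_sum = sum(row[0] * row[column] for row in scores.values())
    return weighted_sum / total_queries


if __name__ == "__main__":
    for profile in PROFILE_COLUMN:
        print(f"{profile}: {query_weighted_ndcg(SCORES, profile):.4f}")
```

Because six of the seven tasks have roughly 200 queries, the weighted mean stays close to a plain task average; only twitter_hjerne, with 77 queries, is noticeably down-weighted.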

Reranking Hybrid Profile

The reranking hybrid profile is not the best profile for any individual task in this group, but it has the strongest recall@100 at 0.8878. It stays close to the best profile on dan_fever, snl, swe_faq, swedn, and tv2_nordretrieval, which means the combined candidate set is often good even when the top-10 order is not optimal.

The weak spot is answer selection. Hybrid trails dense on nor_quad and twitter_hjerne, suggesting that sparse evidence can dilute dense semantic signals when the relevant answer or reply shares little vocabulary with the query. For Scandinavian retrieval systems, this group supports hybrid candidate generation, but it also argues for task-aware reranking or fusion when the target is a short answer rather than a passage with lexical anchors.
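One way to picture task-aware fusion is a weighted reciprocal rank fusion (RRF) in which the dense list gets more weight on short-answer tasks. This page does not state which fusion rule the reranking hybrid profile uses, so the sketch below is an assumption throughout: RRF itself, the constant k=60, and the weight values are illustrative choices, not the benchmark's method.

```python
from collections import defaultdict


def weighted_rrf(sparse_ranking, dense_ranking, sparse_weight=1.0, dense_weight=1.0, k=60):
    """Fuse two ranked lists of document ids with weighted reciprocal rank fusion.

    Each list contributes weight / (k + rank) per document, with rank starting at 1.
    Ties are broken by document id so the output is deterministic.
    """
    scores = defaultdict(float)
    for weight, ranking in ((sparse_weight, sparse_ranking), (dense_weight, dense_ranking)):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    bm25_top = ["a", "b", "c"]   # hypothetical sparse ranking
    dense_top = ["c", "d", "a"]  # hypothetical dense ranking
    # Hypothetical setting for an answer-selection task: trust the dense list twice as much.
    print(weighted_rrf(bm25_top, dense_top, dense_weight=2.0))
```

Raising dense_weight for tasks like nor_quad or twitter_hjerne keeps the sparse list as a source of extra candidates while limiting how far it can push lexically similar non-answers up the fused ranking.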

Task Summary

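| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dan_fever | Fact verification | da | 200 | 2,522 | 200 | 1.00 | 0.8856 | 0.8630 | 0.8832 | BM25 |
| nor_quad | Answer selection | no | 196 | 1,048 | 291 | 1.48 | 0.1118 | 0.2378 | 0.1301 | Dense |
| snl | Encyclopedia retrieval | no | 200 | 1,300 | 200 | 1.00 | 0.8781 | 0.9599 | 0.9024 | Dense |
| swe_faq | FAQ answer retrieval | sv | 200 | 511 | 200 | 1.00 | 0.5449 | 0.6488 | 0.6395 | Dense |
| swedn | News retrieval | sv | 200 | 2,046 | 400 | 2.00 | 0.7081 | 0.7757 | 0.7398 | Dense |
| tv2_nordretrieval | News retrieval | da | 200 | 2,048 | 200 | 1.00 | 0.8957 | 0.9127 | 0.8998 | Dense |
| twitter_hjerne | Social QA retrieval | da | 77 | 262 | 262 | 3.40 | 0.2395 | 0.6211 | 0.4480 | Dense |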

Interpretation Notes for Model Researchers

NanoMTEB-Scandinavian is strongest as a contrast set. High BM25 performance on dan_fever, snl, and tv2_nordretrieval reflects tasks with entities, titles, claims, and news terms. Dense gains on nor_quad, swe_faq, and twitter_hjerne reflect answerability and intent matching. A model that only does well on lexical tasks may not be useful for Scandinavian FAQ or social QA.

The group also tests related-language coverage. Danish, Norwegian, and Swedish share vocabulary and morphology, but the retrieval sources differ enough that a single language-level conclusion can be misleading. Researchers should compare fact verification, answer selection, encyclopedia, and news subtasks separately before interpreting the aggregate score.

Training and Leakage Notes

Useful training data includes Danish claim-evidence pairs, Norwegian extractive QA and answer-selection pairs, Store norske leksikon title/article data, Swedish FAQ and public-sector help-center pairs, Danish and Swedish headline-to-article pairs, and Danish social QA threads. For swedn and twitter_hjerne, multi-positive training should be preserved instead of forcing one canonical answer.

Leakage control should exclude Nano evaluation queries, qrels, positives, tweet threads, news article pairs, and near duplicates from SEB-related sources. Hard negatives should be close within the same genre: nearby encyclopedia articles, same-event news stories, related claims with changed entities or dates, similar FAQ answers, or answer tweets from adjacent topics.
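A simple starting point for the exclusion step is to drop any training pair whose query or document closely matches an evaluation text. The sketch below uses token-level Jaccard similarity on normalized text; the function names, the 0.8 threshold, and the choice of Jaccard are all assumptions made for illustration, and a real pipeline would likely need stronger near-duplicate detection and an index instead of the pairwise loop.

```python
import re
import unicodedata


def normalize(text):
    """Casefold, apply NFKC, replace punctuation with spaces, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    text = re.sub(r"[^\w\s]", " ", text)  # \w is Unicode-aware, so æ, ø, å, ä, ö survive
    return " ".join(text.split())


def token_jaccard(a, b):
    """Jaccard similarity between the token sets of two normalized texts."""
    tokens_a, tokens_b = set(normalize(a).split()), set(normalize(b).split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def filter_training_pairs(train_pairs, eval_texts, threshold=0.8):
    """Keep (query, document) pairs where neither side is a near duplicate of an eval text.

    eval_texts should hold the Nano evaluation queries and their positive documents.
    The threshold is a hypothetical default and should be tuned per source.
    """
    kept = []
    for query, document in train_pairs:
        leaked = any(
            token_jaccard(side, eval_text) >= threshold
            for side in (query, document)
            for eval_text in eval_texts
        )
        if not leaked:
            kept.append((query, document))
    return kept


if __name__ == "__main__":
    eval_texts = ["Hvad er hovedstaden i Danmark?"]
    train = [
        ("hvad er hovedstaden i danmark", "København"),
        ("Hvor ligger Oslo?", "I Norge"),
    ]
    print(filter_training_pairs(train, eval_texts))
```

The same similarity function could be reused in the opposite direction for hard-negative mining, by preferring same-genre candidates that score high against a query's positive without crossing the duplicate threshold.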

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| The Scandinavian Embedding Benchmarks | 2024 | benchmark paper | https://arxiv.org/abs/2406.02396 |
| DanFEVER: claim verification dataset for Danish | 2021 | source task paper | https://aclanthology.org/2021.nodalida-main.47/ |
| NorQuAD: Norwegian Question Answering Dataset | 2023 | source task paper | https://aclanthology.org/2023.nodalida-1.17/ |
| Superlim: A Swedish Language Understanding Evaluation Benchmark | 2023 | benchmark paper | https://aclanthology.org/2023.emnlp-main.506/ |
| SweDN resource page | | dataset page | https://spraakbanken.gu.se/en/resources/swedn |
| Nordjylland News datasheet | | dataset page | https://www.foundationmodels.dk/data/nordjyllandnews/nordjyllandnews.html |
| #Twitterhjerne dataset card | | dataset card | https://huggingface.co/datasets/sorenmulli/da-hashtag-twitterhjerne |
| strombergnlp/danfever | | dataset card | https://huggingface.co/datasets/strombergnlp/danfever |
| mteb/norquad_retrieval | | dataset card | https://huggingface.co/datasets/mteb/norquad_retrieval |
| mteb/SweFaqRetrieval | | dataset card | https://huggingface.co/datasets/mteb/SweFaqRetrieval |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 7 |
| Queries | 1,273 |
| Split-local documents | 9,737 |
| Positive qrels | 1,753 |
| Languages | da, no, sv |
| Categories | natural_language |
| Positives / query avg | 1.38 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| dan_fever | NanoMTEB-Scandinavian | da | natural_language | 200 | 2,522 | 200 | 0.8856 | 0.8630 | 0.8832 | BM25 |
| nor_quad | NanoMTEB-Scandinavian | no | natural_language | 196 | 1,048 | 291 | 0.1118 | 0.2378 | 0.1301 | Dense |
| snl | NanoMTEB-Scandinavian | no | natural_language | 200 | 1,300 | 200 | 0.8781 | 0.9599 | 0.9024 | Dense |
| swe_faq | NanoMTEB-Scandinavian | sv | natural_language | 200 | 511 | 200 | 0.5449 | 0.6488 | 0.6395 | Dense |
| swedn | NanoMTEB-Scandinavian | sv | natural_language | 200 | 2,046 | 400 | 0.7081 | 0.7757 | 0.7398 | Dense |
| tv2_nordretrieval | NanoMTEB-Scandinavian | da | natural_language | 200 | 2,048 | 200 | 0.8957 | 0.9127 | 0.8998 | Dense |
| twitter_hjerne | NanoMTEB-Scandinavian | da | natural_language | 77 | 262 | 262 | 0.2395 | 0.6211 | 0.4480 | Dense |