NanoMTEB-Scandinavian / swe_faq
Overview
swe_faq is the Swedish NanoMTEB-Scandinavian retrieval adaptation of SweFAQ, a dataset described in the SuperLim Swedish language understanding benchmark. SweFAQ contains frequently asked questions and answers from Swedish public-authority websites, including practical administrative domains such as social insurance, taxes, child support, parental benefits, disability support, and public services. In this retrieval task, a Swedish user question must retrieve the corresponding government-style answer.
The Nano split contains 200 queries, 511 documents, and exactly 200 positive relevance judgments. Each query has one positive answer. Queries average about 73 characters, while answer documents average about 320 characters. The observed questions involve Försäkringskassan, underhållsbidrag, föräldrapenning, bilstöd, work injury compensation, child benefits across EU/EES borders, and LSS. The task rewards matching a citizen's situation to the exact administrative answer.
Details
What the Original Data Measures
SuperLim describes SweFAQ as a Swedish FAQ dataset from public authorities. The retrieval adaptation uses the question as the query and the corresponding answer as the relevant document. This is practical QA retrieval: the answer should directly resolve the user's administrative question.
Relevance depends on more than benefit-category overlap. Two questions can both discuss child benefit or parental leave but differ in eligibility, country coordination, timing, payment, or exceptions. The model must identify the exact policy scenario being asked about.
Observed Data Profile
The corpus is small, but the answer documents contain dense policy language. Queries are usually full user questions with personal conditions, such as working in Sweden while family lives in another EU/EES country, or asking whether another parent abroad can receive parental benefit. Answers often include legal conditions, exceptions, or procedural constraints.
Each query has a single positive, so precise top ranking matters. Many candidates may discuss the same authority or benefit, but only one answer directly fits the question.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.5449, hit@10 of 0.7500, and recall@100 of 0.9050. Lexical matching is useful because Swedish administrative terms are specific. Words such as Försäkringskassan, vårdbidrag, barnbidrag, föräldrapenning, LSS, and arbetsskada often appear in both the question and answer.
The limitation is scenario matching. A question may use everyday wording while the answer uses formal policy language. Multiple answers can share the same benefit term but address different conditions. BM25 retrieves many positives within the top 100, but it is less reliable at placing the exact answer first.
Dense Evaluation Profile
The dense harrier-oss-270m run is strongest at top ranks, with nDCG@10 of 0.6488, hit@10 of 0.8100, and recall@100 of 0.9400. Dense retrieval improves because it can represent the user's situation and the answer's administrative condition as semantically related, even when wording differs.
This is important for public-sector FAQ retrieval. Users may phrase questions in practical terms, while authority answers use standardized legal or procedural language. Dense similarity better bridges that style gap than term overlap alone.
Reranking Hybrid Evaluation Profile
reranking_hybrid reports nDCG@10 of 0.6395, hit@10 of 0.8000, and recall@100 of 0.9650. Candidate lists contain 100 to 101 items, and 7 rows use the positive safeguard. Hybrid retrieval has the best recall@100, while dense retrieval has slightly better top-10 ranking.
This pattern suggests that lexical administrative terms and semantic scenario matching are complementary. Hybrid search is attractive for candidate generation because it keeps more correct answers available. Dense retrieval remains the stronger direct ranking profile by a small margin.
Metric Interpretation for Model Researchers
This split is dense-favorable for direct answer ranking and hybrid-favorable for candidate recall. BM25 is moderately strong because authority terms are repeated, but it cannot fully resolve eligibility scenarios and policy exceptions. Dense retrieval's top-rank advantage shows the value of semantic question-answer matching.
Because each query has one positive, nDCG@10 directly reflects whether the correct answer is placed high. Recall@100 is useful for reranking pipelines, where hybrid search provides the broadest candidate coverage.
Query and Relevance Type Tendencies
Representative queries ask whether the Social Insurance Agency can investigate a work injury for AFA insurance, why a care allowance decision must be followed up, whether a worker in Sweden can receive child benefit when the family lives in another EU/EES country, whether another parent abroad can receive parental benefit from Sweden, and what LSS means.
Relevant answers often begin with direct yes/no or definition-like language, followed by conditions. The model should match the user's concrete situation to the correct policy answer, not merely retrieve any answer about the same benefit.
Representative Failure Modes
BM25 may over-rank answers sharing the same benefit term but addressing a different condition. Dense retrieval may retrieve a semantically related policy answer that is not the exact scenario. Hybrid retrieval can preserve more positives but still rank same-benefit distractors high.
Another failure mode is missing cross-border or exception conditions. Questions involving EU/EES, Switzerland, work status, or which parent is insured require precise interpretation of the administrative situation.
Training Data That May Help
Useful training data includes non-overlapping Swedish FAQ question-answer pairs, public-sector help-center retrieval data, same-benefit hard negatives, and Swedish administrative QA paraphrases. Training should exclude SweFAQ or SuperLim test examples, Nano qrels, and answer documents from this split.
Hard negatives should be answers about the same authority and benefit but different eligibility, timing, or procedure. These are more useful than random negatives because they teach the model to identify exact policy fit.
Model Improvement Notes
Dense models can improve by representing administrative scenarios, benefit names, and exception conditions in Swedish. Sparse systems can improve through domain vocabulary and compound handling, but exact matching alone will confuse same-benefit answers. Hybrid retrieval is useful for first-stage recall, especially when followed by a reranker trained on FAQ answer selection.
For deployment-like evaluation, this task is a practical test of public-service search. The best model should retrieve the answer that a citizen can actually use, not merely a related policy page.
Example Data
| Query | Positive document |
| Kan Försäkringskassan utreda min arbetsskada så att jag kan få ersättning från AFA Försäkring? [94 chars] | Nej. Det beror på att vi bara får utreda om det är en arbetsskada om du uppfyller villkoren för att ha rätt till ersättning för din arbetsskada från Försäkringskassan. Det står i lagen. [185 chars] |
| Varför behöver mitt vårdbidragsbeslut följas upp? [49 chars] | Ditt beslut om vårdbidrag ska följas upp minst vartannat år, om det inte finns skäl för uppföljning med längre mellanrum. Beslutet ska också följas upp om förhållanden som påverkar behovet av vårdbidrag ändras. [210 chars] |
| Jag arbetar i Sverige men min familj bor i ett annat EU/EES-land. Kan jag få barnbidrag från Sverige? [101 chars] | Ja, om du arbetar i Sverige och har barn som bor i ett annat medlemsland kan du ha rätt till barnbidrag från Sverige. När föräldrarna bor eller arbetar i var sitt medlemsland behöver utbetalningen av barnbidrag samordnas mellan länderna. Kontakta oss på 010-115 10 20 för att få reda på vad som gäller i ditt fall. [314 chars] |
Source Reference Table
| Source | What it contributes |
| Scandinavian Embedding Benchmarks | Retrieval benchmark framing. |
| SuperLim paper | Original Swedish benchmark and SweFAQ context. |
| MTEB task card | Retrieval packaging of SweFAQ. |
Dataset Information
| Field | Value |
| Nano set | NanoMTEB-Scandinavian |
| Backing dataset | NanoMTEB-Scandinavian |
| Task / split | swe_faq |
| Hugging Face dataset | hakari-bench/NanoMTEB-Scandinavian |
| Language | sv |
| Category | natural_language |
| Queries | 200 |
| Documents | 511 |
| Positive qrels | 200 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 73.33 |
| Document length avg chars | 319.82 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.5449 | 0.7500 | 0.9050 | top-500 |
| Dense | harrier_oss_v1_270m | 0.6488 | 0.8100 | 0.9400 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.6395 | 0.8000 | 0.9650 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: test
- Train/eval overlap audit: not_audited
- Leakage note: exclude SweFAQ/SuperLim test examples, Nano qrels, and answer documents in this split
- Multi-positive training: single_positive_question_answer_focus
- Useful training data: non-overlapping Swedish FAQ question-answer pairs, Swedish public-sector help-center retrieval, same-benefit hard negatives, Swedish administrative QA paraphrases