NanoMTEB-Scandinavian / twitter_hjerne
Overview
twitter_hjerne is the Danish NanoMTEB-Scandinavian retrieval adaptation of the #Twitterhjerne social-media question-answer dataset. The source data consists of Danish help-seeking question tweets and human reply tweets. In retrieval form, the query is a question tweet and the relevant documents are reply tweets that attempt to answer it. This makes the task informal, multi-positive Danish answer retrieval.
The Nano split contains 77 queries, 262 documents, and 262 positive relevance judgments. Queries average about 166 characters, while reply documents average about 129 characters. Almost every query is multi-positive: 75 of 77 queries have more than one positive, the median is 3 positives, and the maximum is 6. The task covers technology help, shopping, workplace tools, food substitutions, family photo sharing, travel, children's activities, household problems, and recommendations.
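The per-query positive counts above can be recomputed from any relevance mapping. The following is a minimal sketch, assuming the qrels are available as a plain dictionary from query id to a set of positive document ids. The function name `positive_stats` and the toy ids are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median

def positive_stats(qrels):
    """Summarize positives per query for a {query_id: set_of_doc_ids} mapping."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positives": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_share": multi / len(counts),
    }

# Hypothetical toy qrels, not taken from the dataset.
toy_qrels = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d4"},
    "q3": {"d5", "d6"},
}
stats = positive_stats(toy_qrels)
```

Applied to the full split, the same arithmetic gives 262 / 77 ≈ 3.40 positives per query and 75 / 77 ≈ 97.40% multi-positive queries.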
Details
What the Original Data Measures
The source dataset collects Danish questions posted with the #Twitterhjerne hashtag and their human answers. The thesis and dataset card describe the data as small, informal Danish help or input requests, filtered to have clear questions, no required attached image, no personal information, and multiple relevant replies. The retrieval adaptation preserves that structure: multiple replies can be valid positives for the same query.
This differs from factoid QA and duplicate-question retrieval. A relevant answer may be short, subjective, partial, advisory, or recommendation-oriented. The model must retrieve useful replies, not a single canonical answer.
Observed Data Profile
The data is small and informal. Tweets may include hashtags, URLs, abbreviations, platform names, spelling variation, and conversational wording. Queries are often longer than replies because the question includes context, constraints, and social framing. Replies may be terse, such as a product name, a recommendation, or a short suggestion.
The multi-positive structure is central. A question can have several acceptable answers, especially for recommendations or practical advice. Evaluation should therefore reward models that retrieve a set of useful replies rather than only one best answer.
BM25 Evaluation Profile
BM25 is the weakest of the three profiles, with nDCG@10 of 0.2395, hit@10 of 0.6104, and recall@100 of 0.6527. It can retrieve answers when the reply repeats a product name, platform name, or key term from the question. For example, a question about media monitoring may retrieve a reply naming a service if vocabulary overlaps.
However, many useful replies do not repeat the question wording. A question about a PS5 controller can be answered with advice to return it to the store. A question about photo sharing can be answered with OneDrive. A question about heavy cream can be answered by explaining fat percentage. BM25 misses many of these because the answer is semantically useful but shares few or no terms with the question.
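The lexical mismatch can be illustrated with a small Okapi BM25 scorer. This is a sketch under stated assumptions, not the benchmark's BM25 configuration: the tokenizer, `k1=1.2`, and `b=0.75` are assumed defaults, and the second document is a hypothetical distractor written for this example. The query and the first document are the PS5 pair from the Example Data table.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on non-word characters (Unicode-aware, keeps æ, ø, å)."""
    return re.findall(r"\w+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

query = (
    "Er der andre der døjer med samme problem som mig, min controller til ps5 "
    "lader maks 1 streg om natten. Hver gang jeg sidder og spiller disconneter "
    "den hele tiden, jeg har ikke gjort noget ved den. Den havde det allerede "
    "1 uge efter jeg fik konsollen."
)
docs = [
    # Positive reply from the Example Data table.
    "Du skal bare indlevere den, hvor du har købt, så får du en ny",
    # Hypothetical distractor: repeats query terms but gives no advice.
    "Jeg har også en controller til ps5",
]
scores = bm25_scores(query, docs)
```

In this toy setup the distractor outscores the real reply, because the reply shares only function words ("den", "har") with the question while the distractor repeats "controller" and "ps5".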
Dense Evaluation Profile
The dense harrier-oss-270m run (config harrier_oss_v1_270m) is the strongest of the three profiles, with nDCG@10 of 0.6211, hit@10 of 0.9091, and recall@100 of 0.9008. Dense retrieval is well suited to this task because it can map a help-seeking question to an advisory reply even when the words differ. It can represent recommendation, troubleshooting, and substitution relations that are difficult for lexical matching.
This is one of the strongest dense-favorable profiles in the Scandinavian set. The gap from BM25 (nDCG@10 of 0.6211 versus 0.2395) shows that informal social QA depends heavily on semantic answer usefulness rather than term overlap.
Reranking Hybrid Evaluation Profile
reranking_hybrid reports nDCG@10 of 0.4480, hit@10 of 0.8182, and recall@100 of 0.8664. Candidate lists contain 100 to 101 items, and 2 rows use the positive safeguard. Hybrid retrieval improves substantially over BM25 but remains below dense retrieval across the reported metrics.
This suggests that lexical candidates add some value but also introduce conversational or topical distractors. Dense retrieval alone is better at ranking useful replies near the top. Hybrid retrieval can still be useful as a candidate pool, but it is not the strongest final ranking for this task.
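The source does not state how reranking_hybrid builds its candidate lists. One reading that is consistent with lists of 100 to 101 items is a fused top-100 list with one positive appended when none survived truncation. The sketch below illustrates that reading only. Reciprocal rank fusion is a swapped-in fusion method chosen for illustration, and the names `rrf_fuse` and `build_candidates`, the constant `k=60`, and the safeguard rule are all assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def build_candidates(bm25_ranking, dense_ranking, positives, size=100):
    """Fuse two rankings, truncate, and append one positive if none survived."""
    candidates = rrf_fuse([bm25_ranking, dense_ranking])[:size]
    safeguard_used = not any(d in positives for d in candidates)
    if safeguard_used:
        # Assumed safeguard: guarantee at least one positive per list.
        candidates.append(sorted(positives)[0])
    return candidates, safeguard_used

# Hypothetical toy rankings with a small list size.
bm25_ranking = ["a", "b", "c", "d"]
dense_ranking = ["b", "a", "e", "f"]
with_safeguard = build_candidates(bm25_ranking, dense_ranking, {"z"}, size=3)
without_safeguard = build_candidates(bm25_ranking, dense_ranking, {"b"}, size=3)
```

Under this reading, a list grows to `size + 1` only for queries where the safeguard fires, which would match the small number of rows reported as using it.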
Metric Interpretation for Model Researchers
This split is strongly dense-favorable. BM25 underperforms because question tweets and answer tweets often have different vocabulary. reranking_hybrid narrows the gap but does not exceed dense retrieval. A model that performs well here likely captures pragmatic answer relations: recommendation, troubleshooting, substitution, and short advice.
Because almost all queries have multiple positives, recall@100 and nDCG@10 both matter. The task does not require selecting one canonical answer; it rewards retrieving several useful replies. Hit@10 counts a query as a success as soon as one positive appears in the top 10, so on its own it can hide whether the model retrieves a broad set of relevant responses.
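The difference between these metrics on a multi-positive query can be sketched as follows. This assumes binary relevance and per-query scores that are later averaged. The benchmark's exact implementation may differ, and the rankings below are hypothetical.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Share of the query's positives that appear in the top k."""
    found = sum(1 for d in ranked[:k] if d in positives)
    return found / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(ranked[:k])
        if d in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

# Hypothetical query with three positives and two candidate rankings.
positives = {"a", "b", "c"}
fillers = [f"x{i}" for i in range(9)]
one_hit = ["a"] + fillers               # only one positive retrieved
all_hits = ["a", "b", "c"] + fillers[:7]  # all three positives retrieved
```

Both rankings score 1.0 on hit@10, while recall@10 and nDCG@10 separate them: the first ranking recovers one of three positives and the second recovers all three.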
Query and Relevance Type Tendencies
Representative queries ask where to buy non-Danish potatoes, which media-monitoring service people use at work, how to handle a PS5 controller that barely charges, whether English heavy cream corresponds to Danish whipping cream, and how to share family photos without Google Photos or iCloud. Relevant replies are often short suggestions, service names, or practical advice.
The task includes subjective and contextual answers. A recommendation can be valid even if it does not repeat the question's constraints. A troubleshooting reply can be useful with only a few words. Models must infer the usefulness relation.
Representative Failure Modes
BM25 may retrieve replies that repeat a word from the question but do not answer it. Dense retrieval may retrieve plausible advice for a nearby topic but not the specific constraint. Hybrid retrieval can include both lexical false positives and broad semantic neighbors.
Another failure mode is ignoring constraints. A query may exclude Google Photos or iCloud, ask for non-Danish potatoes, or specify a workplace context. A relevant answer must respect those constraints. Short reply tweets can make this difficult.
Training Data That May Help
Useful training data includes Danish forum question-reply pairs, Danish social-media QA pairs, community-support and recommendation threads, and multi-positive answer retrieval examples. Training should preserve multiple human replies as positives rather than collapsing them into one answer.
Hard negatives should be plausible replies to nearby topics but not useful for the specific question. For example, a photo-sharing recommendation that violates the user's constraints is a stronger negative than an unrelated tweet.
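One way to combine both points is to emit a training triplet for every human reply and pair it with highly ranked non-positive candidates. This is a minimal sketch, assuming `qrels` maps query ids to sets of positive ids and `rankings` maps query ids to a ranked candidate list from some first-stage retriever. The function name `build_triplets` and the toy ids are hypothetical.

```python
def build_triplets(qrels, rankings, negatives_per_query=2):
    """Create (query, positive, negative) triplets, one set per positive reply."""
    triplets = []
    for qid, positives in qrels.items():
        # Hard negatives: the best-ranked candidates that are not labeled positive.
        hard = [d for d in rankings.get(qid, []) if d not in positives]
        hard = hard[:negatives_per_query]
        # Keep every human reply as its own positive instead of picking one.
        for pos in sorted(positives):
            for neg in hard:
                triplets.append((qid, pos, neg))
    return triplets

# Hypothetical toy data.
toy_qrels = {"q1": {"p1", "p2"}}
toy_rankings = {"q1": ["n1", "p1", "n2", "n3", "p2"]}
triplets = build_triplets(toy_qrels, toy_rankings)
```

Because short advisory replies can be useful for more than one question, highly ranked unlabeled candidates may be false negatives, so mined negatives are worth spot-checking before training.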
Model Improvement Notes
Dense models can improve by representing informal Danish, hashtags, abbreviations, and pragmatic answer relations. Sparse systems have limited upside unless replies repeat question terms. Hybrid systems may help candidate recall, but final ranking benefits most from semantic answer matching.
For reranking, this task rewards models that can judge whether a short reply actually helps with the query. Multi-positive training and evaluation are important because several different replies can all be relevant.
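A multi-positive objective can be sketched as an InfoNCE loss averaged over all positives of a query. This is one common variant, shown here in plain Python for clarity. The temperature value and the function name `multi_positive_infonce` are assumptions, and other formulations, for example ones that exclude sibling positives from the denominator, are equally plausible.

```python
import math

def multi_positive_infonce(scores, positive_idx, temperature=0.05):
    """Average InfoNCE loss over all positives for one query.

    scores: similarity of the query to each candidate reply.
    positive_idx: indices of the candidates labeled as positives.
    """
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    losses = [log_z - scaled[i] for i in positive_idx]
    return sum(losses) / len(losses)

# Hypothetical similarity scores: positives ranked high versus ranked low.
good = multi_positive_infonce([0.9, 0.8, 0.1, 0.0], positive_idx=[0, 1])
bad = multi_positive_infonce([0.1, 0.0, 0.9, 0.8], positive_idx=[0, 1])
```

The loss is lower when both positives score above the negatives, so the objective rewards ranking the whole set of useful replies rather than a single best one.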
Example Data
| Query | Positive document |
| --- | --- |
| Hej #Twitterhjerne & twitterfolkens (- eller X'ere, whatever 😊) Er der nogen der kan fortælle mig hvor jeg kan købe IKKE-danske kartofler? Hverken Rema1000, Netto, Kvickly ell SuperBrugsen sælger andet end danske. (Jeg ønsker ikke at spise det hjerneskadende pesticid Reglone) [278 chars] | Økologiske er vel ok? [21 chars] |
| Hvis I betaler for medieovervågning på arbejdet - hvem bruger I så, og er I tilfredse? #dkmedier #dkbiz #twitterhjerne [118 chars] | Infomedia - og mnjah [20 chars] |
| Er der andre der døjer med samme problem som mig, min controller til ps5 lader maks 1 streg om natten. Hver gang jeg sidder og spiller disconneter den hele tiden, jeg har ikke gjort noget ved den. Den havde det allerede 1 uge efter jeg fik konsollen. [250 chars] | Du skal bare indlevere den, hvor du har købt, så får du en ny [61 chars] |
Source Reference Table
| Source | What it contributes |
| --- | --- |
| Scandinavian Embedding Benchmarks | Retrieval benchmark adaptation. |
| Danoliterate thesis | Danish dataset context and filtering description. |
| Source dataset card | #Twitterhjerne question and answer tweet data. |
| MTEB task card | Retrieval task packaging. |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoMTEB-Scandinavian |
| Backing dataset | NanoMTEB-Scandinavian |
| Task / split | twitter_hjerne |
| Hugging Face dataset | hakari-bench/NanoMTEB-Scandinavian |
| Language | da |
| Category | natural_language |
| Queries | 77 |
| Documents | 262 |
| Positive qrels | 262 |
| Positives / query avg | 3.40 |
| Positives / query min | 1 |
| Positives / query median | 3.00 |
| Positives / query max | 6 |
| Multi-positive queries | 75 (97.40%) |
| Query length avg chars | 165.75 |
| Document length avg chars | 128.77 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.2395 | 0.6104 | 0.6527 | top-500 |
| Dense | harrier_oss_v1_270m | 0.6211 | 0.9091 | 0.9008 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4480 | 0.8182 | 0.8664 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: train
- Train/eval overlap audit: not_audited
- Leakage note: exclude Nano question tweets, reply tweets, and qrels from training
- Multi-positive training: preserve_multiple_human_replies_as_positives
- Useful training data: Danish forum question-reply pairs, Danish social-media QA pairs, community-support and recommendation threads, multi-positive answer retrieval examples
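Since the evaluation split originates from the train split and the overlap has not been audited, the leakage note can be applied as an exact-match filter before training. This is a minimal sketch, assuming training data arrives as (question, reply) text pairs. The helper names and the toy training pairs are hypothetical. The blocked reply is taken from the Example Data table.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants still match."""
    return re.sub(r"\s+", " ", text.strip().lower())

def drop_leaked_pairs(train_pairs, eval_texts):
    """Keep only pairs where neither the question nor the reply is an eval text."""
    blocked = {normalize(t) for t in eval_texts}
    return [
        (q, r)
        for q, r in train_pairs
        if normalize(q) not in blocked and normalize(r) not in blocked
    ]

# Eval texts would be all Nano question tweets and reply tweets.
eval_texts = ["Infomedia - og mnjah"]
# Hypothetical training pairs.
train_pairs = [
    ("Hvem bruger I til medieovervågning?", "infomedia  - og mnjah"),
    ("Hvor køber man gode løbesko?", "Prøv en specialbutik"),
]
clean_pairs = drop_leaked_pairs(train_pairs, eval_texts)
```

Exact matching after normalization catches verbatim reuse only. Near-duplicates such as retweets with edited hashtags would need fuzzy matching, which is outside this sketch.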