NanoMTEB-v2 / cqadupstack_gaming

Overview

NanoMTEB-v2 / cqadupstack_gaming is the Gaming slice of CQADupStack duplicate-question retrieval. Short Gaming StackExchange question titles are used as queries, and candidate documents are longer posts that may ask the same or near-duplicate question. The original CQADupStack benchmark was built from StackExchange duplicate links for community question-answering research. In this slice, the retrieval problem is player-intent matching: a model must recognize that two questions concern the same game mechanic, platform constraint, quest issue, or resource rule, even when the wording differs. The Nano split contains 200 queries over 10,000 documents and includes many multi-positive queries.

Details

What the Original Data Measures

CQADupStack measures duplicate-question retrieval across StackExchange subforums. The positive document is a question judged to be a duplicate or near duplicate of the query. This is not answer passage retrieval: the model is matching two questions.

The Gaming subset focuses on game-specific problems, including mechanics, co-op setup, buildings, resources, creature forms, platforms, and game versions. Relevance depends on shared intent, not just shared game names.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 415 positive qrel rows. Queries have 2.075 positives on average, with a median of 1 and a maximum of 22. There are 65 multi-positive queries, or 32.5% of the query set. Queries average 47.62 characters, while documents average 481.08 characters.

The examples are short titles paired with longer StackExchange posts. Documents may include duplicate markers, quoted text, game names, tags, and explanatory body content. Some questions are highly lexical because game names repeat; others require recognizing paraphrased gameplay intent.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.5073, hit@10 of 0.6850, and recall@100 of 0.7759. BM25 is a useful baseline because titles and duplicate posts often share game names, item names, or mechanic terms.

Its limitations appear when a duplicate question phrases the same issue differently. BM25 may rank another post from the same game, or a post with the same mechanic word, above the true duplicate. The task therefore tests whether lexical matching can separate same-game topical similarity from actual duplicate intent.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.6375, hit@10 of 0.7900, and recall@100 of 0.8506. Dense retrieval is clearly stronger than BM25 across all reported metrics. This suggests that embedding similarity captures gameplay intent and question paraphrase better than term overlap alone.

Dense retrieval is still challenged by domain-specific names and mechanics. If a model lacks gaming knowledge or treats a rare game term as noise, it may miss a duplicate that BM25 can surface. The best systems need both semantic paraphrase matching and sensitivity to exact game-specific terminology.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 10 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.5970, hit@10 of 0.7800, and recall@100 of 0.8771. Hybrid retrieval provides the best recall@100, while dense retrieval remains strongest in top-rank quality.

This indicates that sparse retrieval adds complementary game-name and keyword coverage, but dense retrieval better orders the most relevant duplicates. A reranker should benefit from the hybrid pool if it can compare detailed question intent across same-game hard negatives.

Metric Interpretation for Model Researchers

The multi-positive structure matters: some query titles have many duplicate posts, while most have only one. nDCG@10 rewards ranking any accepted duplicate highly, but recall@100 shows whether the candidate pool is broad enough for multi-positive queries.

Dense retrieval is the main first-stage baseline to beat. Hybrid retrieval is a stronger reranking pool because it captures both exact game terms and paraphrased duplicates.

Query and Relevance Type Tendencies

Queries are short gaming question titles. Relevant documents are longer duplicate questions from Gaming StackExchange. They may discuss the same mechanic using different examples, versions, or platforms.

The relevance relation is duplicate-question equivalence. A post about the same game is not relevant unless it asks the same or near-same player question.

Representative Failure Modes

Common failures include retrieving another question about the same game but a different mechanic, confusing platform-specific setup with general setup, over-matching popular game titles, and missing paraphrases that describe the same gameplay problem in different words. Dense systems can miss rare item or quest names; sparse systems can over-rank title overlap.

Training Data That May Help

Useful training data includes StackExchange duplicate-question pairs, gaming forum duplicate questions, gaming FAQ pairs, and hard negatives from the same game or tag. Multi-positive training is recommended because duplicate clusters can contain many accepted variants.

Model Improvement Notes

Models should learn duplicate intent at the question level. Hard negatives should share game names and mechanics but ask different questions. Rerankers should compare the full post body, not only the title, because duplicate evidence is often in setup details or constraints.

Example Data

Query	Positive document
How can a monk tank effectively for a group? [44 chars]	Monk skills suited for CC and tanking > Possible Duplicate: > How can a monk tank effectively for a group? When playing with my friends (who play ranged classes), I mostly end up tanking / crowd controlling with my monk. Are there any specific skills (and runes) that could help me do this? [298 chars]
Portal 2 Offline Co-op on Mac [29 chars]	Can we play Portal 2 co-op on one PC or Mac? Is there a way to play Portal 2 co-op on a single PC or Mac? If so, do we need two keyboards, or two mice, or what? Do we need to buy two copies of the game? Note: The _wireless_ XBox 360 controller does not work standalone out of the box on either an iMac or Windows 7 PC. It needs a receiver as well. I thought it would be Bluetooth, but no. [389 chars]
What type of buildings offer what level of jobs? [48 chars]	Who works in medium value commercial properties? I have some §§ (medium-wealth) buildings that are closed or closing due to a lack of workers. Yet, of my 10,652 §§ workers, 3,205 are unemployed and 152 are commuting out. This confuses me because I assumed §§ workers would work in §§ comercial buildings. [305 chars]

Source Reference Table

Title	Year	Type	URL
CQADupStack: A Benchmark Data Set for Community Question-Answering Research	2015	source task paper	https://eltimster.github.io/www/pubs/adcs2015.pdf
MTEB: Massive Text Embedding Benchmark	2023	benchmark paper	https://arxiv.org/abs/2210.07316
mteb/cqadupstack-gaming		dataset card	https://huggingface.co/datasets/mteb/cqadupstack-gaming

Dataset Information

Field	Value
Nano set	NanoMTEB-v2
Backing dataset	NanoMTEB-v2
Task / split	cqadupstack_gaming
Hugging Face dataset	hakari-bench/NanoMTEB-v2
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	415
Positives / query avg	2.08
Positives / query min	1
Positives / query median	1.00
Positives / query max	22
Multi-positive queries	65 (32.50%)
Query length avg chars	47.62
Document length avg chars	481.08

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.5073	0.6850	0.7759	top-500
Dense	`harrier_oss_v1_270m`	0.6375	0.7900	0.8506	top-500
Reranking hybrid	`reranking_hybrid`	0.5970	0.7800	0.8771	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: MTEB CQADupStack Gaming test split
Train/eval overlap audit: not_audited
Leakage note: exclude NanoMTEB-v2 cqadupstack_gaming duplicate-question pairs
Multi-positive training: recommended
Useful training data: StackExchange duplicate-question pairs, gaming forum duplicate questions, same-game hard negatives