NanoMTEB-v2 / cqadupstack_unix
Overview
NanoMTEB-v2 / cqadupstack_unix is the Unix slice of CQADupStack duplicate-question retrieval. Short Unix StackExchange titles serve as queries, and longer technical posts serve as candidate duplicate documents. The original CQADupStack benchmark was built from StackExchange duplicate links, so the task is to retrieve a question that poses the same operational problem, not a document that answers it directly. This split covers Unix and Linux administration, shell usage, boot issues, filesystems, commands, configuration paths, and error scenarios. It is a useful benchmark for retrieval models that must handle technical language, command snippets, and near-duplicate problem descriptions.
Details
What the Original Data Measures
CQADupStack measures duplicate-question retrieval for community question-answering. In the Unix subset, positives are posts judged as duplicates or near duplicates of the query question. The model must connect terse problem titles to longer posts that describe the same command-line or system-administration issue.
Unlike general passage retrieval, relevance is question equivalence. A document about the same command, package, or subsystem may still be wrong if it asks a different operational question.
Observed Data Profile
The Nano split contains 200 queries, 10,000 documents, and 486 positive qrel rows. Queries have 2.43 positives on average, with a median of 1 and a maximum of 22. There are 84 multi-positive queries, or 42.0% of the query set. Queries average about 49.2 characters, while documents average 969.12 characters.
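The aggregates above follow directly from the qrel rows. The sketch below shows one way to derive them, assuming qrels are available as `(query_id, doc_id)` pairs; the function name and the toy ids are hypothetical and are not rows from the dataset. The last two lines only re-check the reported arithmetic (486 rows over 200 queries, 84 multi-positive queries).

```python
from collections import defaultdict
from statistics import mean, median

def positives_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) qrel rows."""
    per_query = defaultdict(set)
    for query_id, doc_id in qrels:
        per_query[query_id].add(doc_id)
    counts = [len(docs) for docs in per_query.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_rows": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_share": multi / len(counts),
    }

# Toy qrels with hypothetical ids, only to exercise the function.
toy_qrels = [
    ("q1", "d1"),
    ("q2", "d2"), ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
    ("q4", "d7"),
]
profile = positives_profile(toy_qrels)

# Arithmetic check of the reported aggregates.
reported_avg = round(486 / 200, 2)   # positives per query
reported_share = 84 / 200            # share of multi-positive queries
```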
Documents often contain command snippets, file paths, error messages, duplicate markers, and explanatory body text. The examples include file-copy workflows, GRUB repair, accidental deletion recovery, /proc update behavior, and .bashrc location changes.
BM25 Evaluation Profile
The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.4001, hit@10 of 0.5550, and recall@100 of 0.4774. BM25 helps when commands, paths, package names, or error strings repeat exactly. Technical text often contains distinctive tokens that sparse retrieval can exploit.
However, duplicate Unix questions frequently describe the same operation using different filenames, flags, distributions, or failure details. BM25 may retrieve a question that mentions grub, sudo, sed, or .bashrc while asking a different question about that tool. This keeps sparse recall and top-rank quality moderate.
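The token-overlap failure described above can be reproduced with a minimal Okapi BM25 scorer. This is a sketch under stated assumptions: whitespace tokenization, `k1=1.2`, and `b=0.75` are common defaults, not the configuration used to build the bm25 subset, and the query and documents are invented for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score whitespace-tokenized docs against a query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: number of docs containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Invented examples, not rows from the dataset.
query = "where is .bashrc loaded from"
docs = [
    "which file does bash read at login when my shell settings are ignored",  # paraphrased duplicate
    "how to edit .bashrc to change the prompt color",                         # same file, different question
    "resize an ext4 partition without losing data",                           # unrelated
]
scores = bm25_scores(query, docs)
```

In this toy case the paraphrased duplicate shares no token with the query and scores zero, while the post that merely mentions `.bashrc` gets a positive score.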
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.5095, hit@10 of 0.6950, and recall@100 of 0.6687. Dense retrieval is substantially stronger than BM25 on every reported metric: about 0.11 higher on nDCG@10, 0.14 higher on hit@10, and 0.19 higher on recall@100. It better connects paraphrased problem descriptions and operationally similar questions even when exact commands or paths differ.
This result suggests that technical duplicate retrieval benefits from semantic modeling of intent. Still, dense models must preserve command-level precision: a small difference in option, path, or subsystem can change the problem entirely.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 14 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.4658, hit@10 of 0.6600, and recall@100 of 0.6687. Hybrid retrieval matches dense recall@100 (0.6687) but trails dense on top-rank quality (nDCG@10 of 0.4658 versus 0.5095, hit@10 of 0.6600 versus 0.6950).
The hybrid profile indicates that BM25 contributes useful exact-token candidates, while dense retrieval supplies the stronger semantic ordering. A reranker should benefit from both, especially when sparse retrieval finds command-specific candidates that dense retrieval might under-rank.
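This card does not state how the reranking_hybrid candidates are merged. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to combine a sparse and a dense ranking; the constant `k=60`, the function name, and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked doc-id lists; each list is ordered best-first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    ordered = sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
    return ordered[:top_n]

# Hypothetical rankings for a single boot-repair query.
bm25_ranking = ["d_grub_install", "d_grub_theme", "d_boot_fail"]
dense_ranking = ["d_boot_fail", "d_grub_install", "d_no_bootloader"]
hybrid = reciprocal_rank_fusion([bm25_ranking, dense_ranking], top_n=3)
# hybrid == ["d_grub_install", "d_boot_fail", "d_grub_theme"]
```

Documents ranked well by both systems rise to the top, while a candidate found by only one system still survives in the fused list for a reranker to judge.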
Metric Interpretation for Model Researchers
Multi-positive queries make up 42% of this task, yet the median query has a single positive, so models must handle both single duplicate links and broader duplicate families. nDCG@10 rewards ranking true duplicates near the top of the list; recall@100 shows whether candidate generation can cover the alternative accepted duplicates.
Dense retrieval is the strongest first-stage baseline. Hybrid retrieval is still valuable for reranking because Unix questions contain exact tokens that can be decisive.
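For reference, the three reported metrics can be computed per query as in the sketch below. It assumes binary relevance (a document is either a labeled duplicate or not) and a ranked list without repeated ids; the function names and the toy ranking are illustrative, and the benchmark's own evaluation code may differ in details.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of the labeled positives found in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: discounted gain over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(position + 2) for position in range(min(k, len(positives))))
    return dcg / ideal

# Toy query with three labeled duplicates, two of which are retrieved.
ranked = ["d9", "d1", "d7", "d2"]
positives = {"d1", "d2", "d3"}
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

Averaging each function over all 200 queries gives the dataset-level figure. For a multi-positive query, hit@10 saturates at the first duplicate found, while recall@100 and nDCG@10 keep rewarding additional duplicates.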
Query and Relevance Type Tendencies
Queries are short Unix problem titles. Relevant documents are longer StackExchange questions asking the same problem, often with commands, paths, logs, or system context. The relevant post may use a different distribution, directory, or example command while preserving the same underlying operation.
The relevance relation is duplicate problem equivalence, not topical relatedness.
Representative Failure Modes
Common failures include retrieving a question about the same command but a different operation, confusing bootloader repair with partition installation, matching a file path token without matching the workflow, and missing paraphrases of error recovery or shell configuration problems. Dense systems may under-weight exact command syntax; sparse systems may over-weight it.
Training Data That May Help
Useful training data includes StackExchange duplicate-question pairs, Unix and shell support questions, command-error paraphrase pairs, and hard negatives from the same command or subsystem. Multi-positive training is recommended because duplicate clusters can contain many variants.
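One possible way to use several positives per query is a contrastive loss that pools the probability mass of all labeled duplicates. The sketch below is a single-query illustration in plain Python, not a prescribed objective; the function name, the temperature of 0.05, and the example similarity scores are assumptions.

```python
import math

def multi_positive_nce(pos_scores, neg_scores, temperature=0.05):
    """Negative log of the softmax mass assigned to all positives together.

    pos_scores and neg_scores are lists of query-document similarities.
    """
    logits = [score / temperature for score in pos_scores + neg_scores]
    max_logit = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - max_logit) for logit in logits]
    positive_mass = sum(weights[: len(pos_scores)])
    return -math.log(positive_mass / sum(weights))

# Positives well separated from negatives: loss close to zero.
good = multi_positive_nce([0.9, 0.8], [0.1, 0.2])
# Negatives scored above the positives: loss is large.
bad = multi_positive_nce([0.3, 0.2], [0.6, 0.7])
```

Because the positives are summed inside the logarithm, the loss is already low when at least one duplicate scores well; objectives that average a per-positive loss push every member of the cluster up instead.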
Model Improvement Notes
Models should preserve both semantic intent and exact technical constraints. Effective training should include same-command hard negatives and paraphrased duplicates with different filenames, flags, or distributions. Rerankers should inspect the full body text because the title alone often omits critical operational details.
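Same-command hard negatives can be selected with a simple lexical filter, as sketched below: keep posts that mention the command but are not labeled duplicates. The function name, the whitespace tokenization, and the toy corpus are hypothetical. Unlabeled true duplicates can slip through such a filter as false negatives, so mined negatives may need a manual or model-based check.

```python
def same_command_hard_negatives(command, corpus, positives, limit=2):
    """Pick non-positive docs that mention the same command token.

    corpus maps doc_id -> text; positives is a set of doc ids.
    """
    negatives = []
    for doc_id, text in corpus.items():
        if doc_id in positives:
            continue
        if command in text.lower().split():
            negatives.append(doc_id)
        if len(negatives) == limit:
            break
    return negatives

# Invented corpus, not rows from the dataset.
corpus = {
    "d1": "grub rescue prompt after install how to reinstall grub",
    "d2": "change grub menu timeout",
    "d3": "sed replace string in file",
    "d4": "grub theme not showing",
}
negatives = same_command_hard_negatives("grub", corpus, positives={"d1"})
# negatives == ["d2", "d4"]
```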
Example Data
| Query | Positive document |
| copy sas file from prior version directory to new version directory [67 chars] | How to copy datasets from prior version directory to latest version directory I've go a number of directories named like: /data/db/OX/8_10 /data/db/OX/9_1 /data/db/OX/9_2 And need to copy some files (all the pt.* files) from the second latest one (above 9_1) to the latest one (above 9_2). I have tried directly like this. cp -p /data/db/OX/9_1/pt.* /data/db/OX/9_2 However, Instead of typing /data/db/OX/9_1/ & /data/db/OX/9_2. I'd like to be able to write: cp -p /data/db/OS/"$prior_version"/pt.* /data/db/OS/"$latest_version"/ And derive $prior_version and $latest_version from the list of directories in /data/db/OX/ in a shell script. [718 chars] |
| Linux Mint Booting Installed Partition [38 chars] | How can I fix/install/reinstall grub? So I started out with a 250GB HDD, the stock drive from an EeePC 1015pem that I am trying to turn into a MintBook. The drive is physically operable, but all data has been nuked, including the old OS. Given this, I attached the HDD to my desktop and installed Linux Mint 16 Xfce from a live USB created through Unetbootin-585. Set aside 10GB for swap and 240GB for Ext4 and /. The drive now refuses to boot for either the desktop or netbook. Both motherboards are sounding the correct sequence of beeps, so they seem healthy, and I can successfully access the BIOS on both systems. However, the only thing that comes up after starting the computer is a nonresponsive command- line. There is no error message, no grub or grub-rescue, nothing. Is there anything I can try besides reformatting and starting over? How would I go about installing a boot loader that can boot my OS? [914 chars] |
| Yanked USB Key During Move [26 chars] | Recovering accidentally deleted files I accidentally deleted a file from my laptop. I'm using Fedora. Is it possible to recover the file? [138 chars] |
Source Reference Table
| Title | Year | Type | URL |
| CQADupStack: A Benchmark Data Set for Community Question-Answering Research | 2015 | source task paper | https://eltimster.github.io/www/pubs/adcs2015.pdf |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| mteb/cqadupstack-unix | | dataset card | https://huggingface.co/datasets/mteb/cqadupstack-unix |
Dataset Information
| Field | Value |
| Nano set | NanoMTEB-v2 |
| Backing dataset | NanoMTEB-v2 |
| Task / split | cqadupstack_unix |
| Hugging Face dataset | hakari-bench/NanoMTEB-v2 |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 486 |
| Positives / query avg | 2.43 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 22 |
| Multi-positive queries | 84 (42.00%) |
| Query length avg chars | 49.20 |
| Document length avg chars | 969.12 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.4001 | 0.5550 | 0.4774 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5095 | 0.6950 | 0.6687 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4658 | 0.6600 | 0.6687 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: MTEB CQADupStack Unix test split
- Train/eval overlap audit: not_audited
- Leakage note: exclude NanoMTEB-v2 cqadupstack_unix duplicate-question pairs
- Multi-positive training: recommended
- Useful training data: StackExchange duplicate-question pairs, Unix and shell support questions, same-command hard negatives
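The leakage note can be applied as a filter over training pairs. The sketch below assumes both training and evaluation duplicates are available as pairs of post ids and treats each pair as unordered; the function name and the ids are hypothetical.

```python
def drop_eval_pairs(train_pairs, eval_pairs):
    """Remove training pairs that also appear in the evaluation set.

    Pairs are treated as unordered, so (a, b) also blocks (b, a).
    """
    blocked = {frozenset(pair) for pair in eval_pairs}
    return [pair for pair in train_pairs if frozenset(pair) not in blocked]

# Hypothetical post ids.
train_pairs = [("a", "b"), ("c", "d"), ("f", "e")]
eval_pairs = [("e", "f")]
clean_pairs = drop_eval_pairs(train_pairs, eval_pairs)
# clean_pairs == [("a", "b"), ("c", "d")]
```

A stricter variant would drop every training pair that touches any post id seen in the evaluation queries or positives. Because the train/eval overlap is listed as not audited, that stricter filter may be the safer default.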