NanoVNMTEB / cqadupstack_android_vn

Overview

cqadupstack_android_vn is a Vietnamese duplicate-question retrieval task from NanoVNMTEB. Each query is a short translated Android support question title, and the relevant documents are translated archived Android StackExchange questions marked as duplicates of it. Many queries have multiple positives, including one very large duplicate cluster with 100 positives. Dense retrieval has the strongest top-rank profile (nDCG@10 of 0.4991), while reranking_hybrid has the best recall@100 (0.5980). BM25 benefits from Android-specific terms but is weaker overall because duplicate questions often phrase the same device or workflow problem differently.

Details

What the Original Data Measures

CQADupStack was built for community question-answering duplicate retrieval. The task reflects a realistic setting: given a new question, retrieve earlier archived questions that ask the same thing.

VN-MTEB translates this Android split into Vietnamese. The task keeps the technical duplicate-detection structure while adding translation artifacts and Vietnamese phrasing around Android terminology.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 811 positive qrel rows. Queries average 55.64 characters, while documents average 604.76 characters. Positives per query average 4.06, with a minimum of 1, a median of 1, and a maximum of 100. There are 95 multi-positive queries, 47.5% of the split.
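The profile figures above can be recomputed from the positive qrel rows. The following is a minimal sketch, assuming the qrels are available as unique (query_id, doc_id) pairs; that layout, the function name `positives_profile`, and the toy rows are illustrative assumptions, not the dataset's actual schema. For reference, 811 rows over 200 queries gives 811 / 200 = 4.055 positives per query.

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query.

    qrels: iterable of (query_id, doc_id) positive pairs (hypothetical layout,
    assumed to contain no duplicate rows).
    """
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for c in per_query if c > 1)
    return {
        "queries": len(per_query),
        "positive_rows": sum(per_query),
        "avg": mean(per_query),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive_queries": multi,
        "multi_positive_share": multi / len(per_query),
    }


# Toy qrels for illustration only; these are not rows from the Nano split.
toy_qrels = [
    ("q1", "d1"),
    ("q2", "d2"), ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
    ("q4", "d7"),
]
profile = positives_profile(toy_qrels)
```

A median of 1 alongside a maximum of 100 is what a heavily skewed distribution looks like in this summary: most queries have a single duplicate, and a few clusters account for a large share of the qrel rows.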

Example queries ask about Markdown notes synchronized with Dropbox, SMS database storage paths, streaming video from PC to Android, deleting preinstalled apps on HTC Desire, and a Galaxy S3 sound triggered by placing a card on the back.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.3774, hit@10 of 0.6100, and recall@100 of 0.4747. BM25 benefits from technical terms such as adb, APK, ROM, SD card, Google Play, model names, and app names.

The main limitation is paraphrase between duplicates. Two duplicate questions can describe different symptoms, device names, or workflow steps, while non-duplicates can share many Android keywords.
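The keyword-overlap behavior described here can be illustrated with a small Okapi BM25 scorer. This is a sketch only: the source does not state which BM25 variant, tokenizer, or parameters produced the reported numbers, so `k1=1.5`, `b=0.75`, whitespace tokenization, and the English toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy example (hypothetical texts, not taken from the dataset).
query = "delete preinstalled apps htc desire".split()
docs = [
    "remove bloatware that shipped with my phone".split(),  # paraphrased duplicate
    "htc desire battery drains fast with apps".split(),     # same device, other issue
    "stream video from pc to android".split(),              # unrelated
]
scores = bm25_scores(query, docs)
```

In this toy case the paraphrased duplicate shares no tokens with the query and scores zero, while the same-device non-duplicate gets a positive score. That is the failure pattern the paragraph above describes.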

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.4991, hit@10 of 0.7550, and recall@100 of 0.5845. Dense retrieval is the strongest early-ranking profile.

This suggests that embedding similarity handles translated troubleshooting paraphrases better than term-frequency matching. It can connect equivalent Android problems even when the exact title words differ.
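As a rough illustration of ranking by embedding similarity, the sketch below orders documents by cosine similarity to a query vector. The three-dimensional vectors and document names are hypothetical hand-picked values, not outputs of harrier_oss_v1_270m, and the source does not confirm that cosine is the similarity used for the reported numbers.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return document ids sorted by descending cosine similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, key=lambda x: x[0], reverse=True)]


# Hypothetical embeddings chosen by hand for illustration.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "paraphrased_duplicate": [0.8, 0.2, 0.1],
    "same_device_other_issue": [0.3, 0.9, 0.1],
    "unrelated": [0.0, 0.1, 0.9],
}
ranking = rank_by_cosine(query_vec, doc_vecs)
```

Here the paraphrased duplicate ranks first because its vector points in nearly the same direction as the query, regardless of whether any surface tokens match.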

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 14 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.4629, hit@10 of 0.7050, and recall@100 of 0.5980. Hybrid retrieval has the best recall@100 (0.5980 versus 0.5845 for dense) but lower nDCG@10 and hit@10 than dense retrieval.

This pattern makes the hybrid pool a useful input for reranking. Sparse matching widens the pool around device and app terms, while dense retrieval better orders the actual duplicate questions.
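The source does not describe how the reranking_hybrid candidate list is constructed. As one common way to combine a sparse and a dense ranking, the sketch below uses reciprocal rank fusion (RRF); RRF, the constant `k=60`, the `top_n` cutoff, and the toy document ids are assumptions for illustration and should not be read as the method behind the reported numbers.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into a single list.

    Each document receives 1 / (k + rank) from every list it appears in,
    with ranks starting at 1. Ties are broken by doc id for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Toy rankings (hypothetical doc ids).
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents found by both retrievers rise to the top, and documents found by only one still enter the pool, which is consistent with a fused list covering more positives at depth 100 than either source alone.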

Metric Interpretation for Model Researchers

Because many queries have multiple positives, nDCG@10 measures whether duplicate cluster members are ranked early, hit@10 measures whether at least one duplicate is found, and recall@100 measures cluster coverage for reranking.

For cqadupstack_android_vn, hit@10 alone can hide weak duplicate-cluster coverage, especially when a query has many positives.
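The three metrics can be sketched per query as follows, assuming binary relevance (a document is either a labeled duplicate or not). The function names and the toy ranking are illustrative; the exact evaluation code behind the reported numbers is not given in the source.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k: rewards placing duplicates early."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def hit_at_k(ranked_ids, positives, k=10):
    """1.0 if at least one duplicate appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked_ids[:k]) else 0.0


def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of the duplicate cluster found in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & positives) / len(positives)


# Toy query with four labeled duplicates, two of which are retrieved.
positives = {"d1", "d2", "d3", "d4"}
ranked = ["d1", "n1", "n2", "d2", "n3"]
ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
```

In this toy case hit@10 is 1.0 even though only half of the cluster is retrieved, which is the gap between hit@10 and cluster coverage noted above.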

Query and Relevance Type Tendencies

Queries are short Vietnamese Android question titles. Relevant documents are longer translated community questions, often with title, body, duplicate markers, and technical details.

Relevance is duplicate-question equivalence. A thread about the same device or app is not relevant if it asks about a different operation or failure mode.

Representative Failure Modes

Common failures include overmatching phone models, retrieving same-app but different-operation questions, missing paraphrased troubleshooting symptoms, and confusing duplicate clusters with broad topic clusters. BM25 overweights technical tokens; dense retrieval can blur similar Android workflows.

Training Data That May Help

Useful training data includes non-overlapping Android duplicate-question pairs, Vietnamese mobile troubleshooting QA, translated CQADupStack training splits with overlap removed, and hard negatives sharing device, app, or feature terms. Evaluation questions, documents, qrels, and duplicate clusters should be excluded.
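A minimal sketch of the exclusion step is shown below, assuming training data arrives as (question, duplicate) text pairs and the evaluation texts are available as plain strings. The helper names `normalize` and `drop_eval_overlap` are hypothetical, and exact matching after normalization only catches verbatim overlap; near-duplicates and differently translated copies would need a fuzzier check.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences still match."""
    return " ".join(text.lower().split())


def drop_eval_overlap(train_pairs, eval_texts):
    """Keep only training pairs whose two sides both differ from every evaluation text."""
    blocked = {normalize(t) for t in eval_texts}
    kept = []
    for question, duplicate in train_pairs:
        if normalize(question) in blocked or normalize(duplicate) in blocked:
            continue
        kept.append((question, duplicate))
    return kept


# Toy data: the first pair reuses an evaluation query with different spacing and case.
eval_texts = ["Dòng video từ PC đến Android?"]
train_pairs = [
    ("dòng video từ  PC đến android?", "some archived question"),
    ("an unrelated training question", "its labeled duplicate"),
]
clean_pairs = drop_eval_overlap(train_pairs, eval_texts)
```

In practice the blocked set would be built from every evaluation query and document, plus any other members of their duplicate clusters.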

Model Improvement Notes

Models should represent troubleshooting intent, device context, Android component, and operation type. Hard negatives should share exact model or feature names but ask different questions. Dense retrieval is the best direct ranker, while hybrid retrieval is useful for higher-recall reranking.
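One way to collect such hard negatives is to walk a retrieved candidate list and keep non-positive documents that still mention the query's device or feature terms. The sketch below assumes candidates come as (doc_id, text) pairs in retrieval order and that the device or feature terms are supplied by the caller; the function name, the whitespace tokenization, and the toy texts are illustrative assumptions.

```python
def mine_hard_negatives(candidates, positives, query_terms, max_negatives=5):
    """Pick non-positive candidates that share device or feature terms with the query.

    candidates: list of (doc_id, text) in retrieval order (hypothetical layout).
    positives: set of doc ids labeled as duplicates of the query.
    query_terms: set of lowercase device/feature tokens to match on.
    """
    negatives = []
    for doc_id, text in candidates:
        if doc_id in positives:
            continue
        tokens = set(text.lower().split())
        if tokens & query_terms:
            negatives.append(doc_id)
        if len(negatives) == max_negatives:
            break
    return negatives


# Toy example (hypothetical texts, not taken from the dataset).
query_terms = {"galaxy", "s3"}
positives = {"d1"}
candidates = [
    ("d1", "Galaxy S3 makes a sound with a card on the back"),
    ("d2", "Galaxy S3 will not charge"),
    ("d3", "HTC Desire remove apps"),
    ("d4", "Replace S3 battery cover"),
]
hard_negatives = mine_hard_negatives(candidates, positives, query_terms)
```

Because duplicate labels may be incomplete, a mined negative can occasionally be an unlabeled duplicate, so a spot check of the mined set is a sensible precaution.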

Example Data

Query	Positive document
Ghi chú Markdown với Dropbox đồng bộ [36 chars]	syncing markup/lời chú thích trên dropbox? Xin chào tôi muốn có chức năng sau: * Đặt một tập tin vào Dropbox (html, wordformat) - mà tôi chỉnh sửa từ máy tính của tôi * ứng dụng Android để truy cập tập tin đó (ngắn đường trên màn hình chính), hiển thị html/word-format + chỉnh sửa bất kỳ ý tưởng nào? Hiện tại tôi đang sử dụng một tập tin văn bản mà tôi có thể chỉnh sửa với trình soạn thảo mặc định của Android, nhưng đánh dấu sẽ tốt hơn :) Cảm ơn [449 chars]
Tin nhắn SMS được lưu trữ ở đâu trên hệ thống file? [51 chars]	Android đường dẫn tin nhắn SMS Tôi không thể tìm đường dẫn đến các tệp cơ sở dữ liệu tin nhắn SMS trên hệ điều hành android. Đường dẫn chính xác cho các tệp cơ sở dữ liệu tin nhắn SMS là gì? Tôi đã thử những giá trị này và chúng không đúng: /data/data/com.jb.gosms/databases/gommssms.db /data/data/com.android.providers.telephony/databases/mmssms.db [368 chars]
Dòng video từ PC đến Android? [29 chars]	Cách phát trực tuyến các video do tôi sở hữu đến một thiết bị Android? > Có thể trùng lặp: > Có một ứng dụng truyền tải đa phương tiện DLNA cho Android không? > Truyền tải video từ PC đến Android? Tôi đã từng là người dùng iPhone/iPad. Tôi đang cân nhắc mua một chiếc Kindle Fire hoặc Nook Color. Tôi thực sự muốn có khả năng truyền tải video từ kho phim khổng lồ mà tôi đã lưu trữ sẵn. Tôi đã từng sử dụng Boxee và Airplayit cho PC/iPad/iPhone.... Có phần mềm nào tương tự như vậy dành cho Android hay tôi sẽ phải mã hóa và chuyển đổi chúng? [551 chars]

Source Reference Table

Title	Year	Type	URL
CQADupStack: A Benchmark Data Set for Community Question-Answering Research	2015	ACM paper	https://doi.org/10.1145/2838931.2838934
VN-MTEB: Vietnamese Massive Text Embedding Benchmark	2026	ACL paper	https://aclanthology.org/2026.findings-eacl.86/
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models	2021	arXiv paper	https://arxiv.org/abs/2104.08663
GreenNode/cqadupstack-android-vn		dataset card	https://huggingface.co/datasets/GreenNode/cqadupstack-android-vn

Dataset Information

Field	Value
Nano set	NanoVNMTEB
Backing dataset	NanoVNMTEB
Task / split	cqadupstack_android_vn
Hugging Face dataset	hakari-bench/NanoVNMTEB
Language	vi
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	811
Positives / query avg	4.06
Positives / query min	1
Positives / query median	1.00
Positives / query max	100
Multi-positive queries	95 (47.50%)
Query length avg chars	55.64
Document length avg chars	604.76

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.3774	0.6100	0.4747	top-500
Dense	`harrier_oss_v1_270m`	0.4991	0.7550	0.5845	top-500
Reranking hybrid	`reranking_hybrid`	0.4629	0.7050	0.5980	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: translated VN-MTEB CQADupStack Android test split from GreenNode/cqadupstack-android-vn
Train/eval overlap audit: not_audited
Leakage note: Exclude translated Android test questions, documents, qrels, and duplicate clusters used by this Nano split.
Multi-positive training: multi_positive_objective
Useful training data: non-overlapping Android duplicate-question pairs, Vietnamese mobile troubleshooting QA pairs, translated CQADupStack training splits with overlap removed, same-device and same-feature hard negatives