NanoVNMTEB / quora_vn
Overview
quora_vn is the Vietnamese NanoVNMTEB version of the Quora duplicate-question retrieval task. The source data comes from the Quora Question Pairs release as used in BEIR, where relevance means that two questions ask the same underlying thing. VN-MTEB translates the task into Vietnamese and filters the result, so this is a translated duplicate-question retrieval benchmark rather than native Vietnamese Quora data.
The Nano split contains 200 queries, 10,000 candidate documents, and 452 positive qrels. Queries average 76.51 characters and documents average 129.22 characters. The task is short-text paraphrase retrieval: both sides are questions, and the model must find duplicate formulations, including noisy translated variants. reranking_hybrid is strongest overall (nDCG@10 of 0.8510), BM25 is close (0.8345), and dense retrieval is slightly weaker (0.8259). This indicates that exact word and entity overlap is highly useful, but that hybrid retrieval adds coverage and ranking stability.
Details
What the Original Data Measures
The Quora Question Pairs dataset was released to support duplicate-question detection. BEIR turns it into a retrieval task by treating one question as a query and candidate questions as documents, with duplicates as positives. Unlike answer retrieval, the target document is not an answer passage; it is another question with equivalent intent.
The Vietnamese version translates the questions. Some examples retain translation-helper wording or extra fragments, so the model must identify the underlying question inside noisy text. Relevance depends on semantic duplication, not broad topic similarity. A question about whether Trump could win an election is relevant to a paraphrase of that same question, but not to every question about Trump.
Observed Data Profile
The task has 452 positives across 200 queries. The average is 2.26 positives per query, the median is 1, and 67 queries have multiple positives. The median of 1 shows that most queries are narrow duplicate searches, while the largest positive cluster, at 57 documents, shows that a few queries have large paraphrase clusters.
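The profile figures above can be recomputed from a qrels list. The following is a minimal sketch, assuming qrels are available as `(query_id, doc_id)` pairs for positives only; the function name `positives_profile` and the toy IDs are hypothetical and not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) positive pairs."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for c in per_query if c > 1)
    return {
        "queries": len(per_query),
        "positives": sum(per_query),
        "avg": mean(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive_queries": multi,
        "multi_positive_rate": multi / len(per_query),
    }


# Toy qrels with hypothetical IDs (assumption: not real dataset rows).
toy_qrels = [
    ("q1", "d1"),
    ("q2", "d2"), ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
    ("q4", "d7"),
]
profile = positives_profile(toy_qrels)
```

Applied to the figures in this card, 452 / 200 gives the 2.26 average and 67 / 200 gives the 33.5% multi-positive rate.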
Documents are short compared with passage tasks. This makes exact wording and small semantic differences important. Many duplicates share key entities or phrases, but some are paraphrased or wrapped in translation artifacts. A good retriever must handle both clean paraphrase and noisy duplicated intent.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.8345337070, hit@10 of 0.9600, and recall@100 of 0.9402654867 with a top-500 candidate set. These high scores show that lexical overlap is very strong for this task. Short duplicate questions often repeat entities, dates, product names, and core predicates.
BM25 can still fail when the duplicate is a true paraphrase with different wording or when the question is embedded inside extra translation text. It can also over-rank same-keyword non-duplicates: questions can share a named entity while asking different relations or opinions. The high baseline means improvements must be judged by ranking quality and recall, not just first-hit success.
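For readers who want to reproduce the three metrics reported in these profiles, here is a minimal sketch under common conventions. It assumes binary relevance (a document is either a positive or not), a ranked list without repeated document IDs, and the standard log2 discount for nDCG; the card does not state which evaluation library produced its numbers, so small differences in tie handling are possible. All function names are illustrative.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG with a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy ranking with hypothetical IDs: positives sit at ranks 2 and 4.
ranked = ["d3", "d1", "d9", "d2"]
positives = {"d1", "d2"}
scores = {
    "hit@10": hit_at_k(ranked, positives),
    "recall@100": recall_at_k(ranked, positives),
    "ndcg@10": ndcg_at_k(ranked, positives),
}
```

In the toy case hit@10 and recall@100 are both perfect while nDCG@10 is not, which mirrors the point above: with a high baseline, ranking position carries most of the remaining signal.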
Dense Evaluation Profile
Dense retrieval with harrier-oss-270m reaches nDCG@10 of 0.8259395743, hit@10 of 0.9350, and recall@100 of 0.9048672566. It trails BM25 on all three reported metrics, most visibly on recall@100 (0.9049 versus 0.9403). This suggests that for translated Quora duplicates, exact tokens and entity overlap are unusually valuable.
Dense retrieval remains useful for paraphrases, but it can over-generalize across questions with similar topics and different intent. Short questions leave little context, so embedding similarity may group related political, technology, or advice questions that are not duplicates. The task needs semantic matching, but not at the cost of exact intent distinctions.
Reranking Hybrid Evaluation Profile
reranking_hybrid is strongest: nDCG@10 is 0.8510204725, hit@10 is 0.9600, and recall@100 is 0.9623893805. The top-100 candidate pool has exactly 100 candidates per query and no safeguard-expanded rows. Hybrid retrieval matches BM25's hit@10 of 0.9600 while improving both recall@100 and nDCG@10.
The improvement reflects complementary evidence. Sparse retrieval preserves shared words and entities, while dense retrieval can rescue paraphrased duplicates or noisy translated variants. Because documents are short, a small amount of wrong semantic smoothing can hurt; hybrid retrieval works best when exact overlap and semantic equivalence agree.
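This card does not say how reranking_hybrid combines sparse and dense evidence. As an illustration of the complementary-evidence idea only, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique swapped in here as an assumption. The function name, the constant `k=60`, and the toy rankings are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, so documents that
    both retrievers rank highly rise to the top of the fused list.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), then doc ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Toy rankings with hypothetical IDs (assumption: not real system output).
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d4", "d1", "d2"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents supported by both lists (`d1`, `d2`) outrank documents found by only one, which is the sense in which fusion rewards agreement between exact overlap and semantic equivalence.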
Metric Interpretation for Model Researchers
This task is already high-performing, so the key differences are subtle. BM25's strength shows that duplicate questions often share surface forms. That dense retrieval underperforms BM25 suggests that generic semantic similarity is not enough; a model must also preserve question intent and entity constraints.
The multi-positive rate of 33.5% means recall@100 matters for duplicate clusters, but the median positive count of 1 keeps nDCG@10 important. Researchers should evaluate whether models retrieve several paraphrases when they exist while avoiding same-topic non-duplicates.
Query and Relevance Type Tendencies
Queries include everyday advice, politics, product-release timing, films, social media problems, weight loss, nightmares, and factual or opinion questions. Relevant documents are alternate phrasings of the same question. Some positives include noisy translated prompts such as requests to translate a sentence, with the actual duplicate embedded inside.
Relevance is intent equivalence. Same entity is not enough, and same broad topic is not enough. The question must ask the same thing from the user's perspective.
Representative Failure Modes
BM25 can miss paraphrases that share few words. Dense retrieval can retrieve semantically related but non-duplicate questions. Hybrid retrieval can still fail when translation noise changes the apparent intent or when a question contains extra irrelevant text before the true duplicate.
Another failure mode is entity drift. Two questions mentioning Donald Trump, Xiaomi Redmi, or Facebook may ask different relations, judgments, or troubleshooting needs. Models must preserve the full predicate, not only the entity.
Training Data That May Help
Useful training data includes non-overlapping Quora duplicate-question pairs, Vietnamese duplicate-question and paraphrase data, translated duplicate pairs with overlap removed, and same-entity hard negatives with different intent. Multi-positive training is useful because some queries have many duplicates.
Synthetic data can generate Vietnamese paraphrases and noisy translated variants. Hard negatives should share keywords or named entities while changing the question relation, opinion, or requested action.
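One simple way to find negatives that share keywords or entities with a query is to rank non-positive candidates by token overlap. The sketch below is a minimal illustration, assuming whitespace tokenization and Jaccard overlap; the threshold, the function names, and the English toy texts are hypothetical, and a real pipeline would still need a check that mined candidates are not unlabeled duplicates.

```python
def tokenize(text):
    """Lowercased whitespace tokens (a simplifying assumption)."""
    return set(text.lower().split())


def mine_hard_negatives(query, candidates, positive_ids, min_overlap=0.3, top_n=5):
    """Return non-positive doc IDs whose token overlap with the query is high."""
    q_tokens = tokenize(query)
    scored = []
    for doc_id, text in candidates.items():
        if doc_id in positive_ids:
            continue  # never use a labeled duplicate as a negative
        d_tokens = tokenize(text)
        union = q_tokens | d_tokens
        overlap = len(q_tokens & d_tokens) / len(union) if union else 0.0
        if overlap >= min_overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


# Toy data (hypothetical, not dataset rows).
query = "when will the phone launch in india"
candidates = {
    "d1": "what date does the phone launch in india",  # labeled duplicate
    "d2": "is the phone worth buying in india",        # same entity, other intent
    "d3": "how do i lose weight quickly",              # unrelated topic
}
hard_negatives = mine_hard_negatives(query, candidates, positive_ids={"d1"})
```

Here `d2` is selected because it shares the entity and location but asks for an opinion rather than a date, while the unrelated `d3` falls below the overlap threshold.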
Model Improvement Notes
The main improvement direction is intent-preserving duplicate retrieval. Sparse matching should preserve entities and core words; dense matching should capture paraphrase. Reranking should compare full question meaning and reject same-topic but different-intent candidates.
Error analysis should group false positives by same-entity drift, same-topic drift, translation-wrapper noise, and paraphrase failure. Because BM25 is strong, dense improvements need carefully mined hard negatives rather than broad paraphrase training alone.
Example Data
| Query | Positive document |
| Xiaomi Redmi note 4 ra mắt ở Ấn Độ vào ngày nào? [48 chars] | Hãy chuyển câu này sang tiếng Việt: Khi nào Xiaomi Redmi note 4 sẽ ra mắt ở Ấn Độ? [83 chars] |
| Có khả năng Trump sẽ thắng cuộc bầu cử không? [45 chars] | Nếu bạn muốn dịch câu này sang tiếng Việt, hãy viết câu đó ở dưới này. Trump có cơ hội thắng cử không? [103 chars] |
| Có nên thực hiện một phim truyền hình dựa trên bộ phim Shiva Trilogy? [69 chars] | Chào các bạn, Nếu bộ ba Shiva được chuyển thể thành một series phim truyền hình thì nó sẽ như thế nào? [103 chars] |
Source Reference Table
| Source | Role |
| Quora Question Pairs release | Original duplicate-question data release |
| BEIR | Retrieval benchmark framing for Quora |
| VN-MTEB | Vietnamese benchmark collection using translated retrieval tasks |
| GreenNode dataset card | Public dataset entry for this Vietnamese split |
Dataset Information
| Field | Value |
| Nano set | NanoVNMTEB |
| Backing dataset | NanoVNMTEB |
| Task / split | quora_vn |
| Hugging Face dataset | hakari-bench/NanoVNMTEB |
| Language | vi |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 452 |
| Positives / query avg | 2.26 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 57 |
| Multi-positive queries | 67 (33.50%) |
| Query length avg chars | 76.51 |
| Document length avg chars | 129.22 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.8345 | 0.9600 | 0.9403 | top-500 |
| Dense | harrier_oss_v1_270m | 0.8259 | 0.9350 | 0.9049 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.8510 | 0.9600 | 0.9624 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: translated VN-MTEB Quora test split from GreenNode/quora-vn
- Train/eval overlap audit: not_audited
- Leakage note: Exclude translated Quora-VN test questions, qrels, and positive duplicate questions used by this Nano split.
- Multi-positive training: multi_positive_objective
- Useful training data: non-overlapping Quora duplicate-question pairs, Vietnamese duplicate-question and paraphrase data, translated duplicate-question pairs with overlap removed, same-entity hard negatives with different intent
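Because the train/eval overlap audit is marked not_audited, anyone training on Quora-derived data may want a basic exclusion pass of their own. The following is a minimal sketch, assuming training data is a list of question pairs and that the evaluation queries and positive documents are available as plain strings. It catches only exact matches after light normalization, so paraphrased or differently translated copies would still need a fuzzier check. All names are hypothetical.

```python
import unicodedata


def normalize(text):
    """NFC-normalize, casefold, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.casefold().split())


def filter_leaky_pairs(train_pairs, eval_texts):
    """Drop training pairs where either side matches an evaluation text."""
    blocked = {normalize(t) for t in eval_texts}
    return [
        (a, b)
        for a, b in train_pairs
        if normalize(a) not in blocked and normalize(b) not in blocked
    ]


# Toy data (hypothetical placeholders, not dataset rows).
train_pairs = [
    ("Question A?", "Question B?"),
    ("Question C?", "  question  d? "),
]
eval_texts = ["Question D?"]
clean_pairs = filter_leaky_pairs(train_pairs, eval_texts)
```

NFC normalization matters for Vietnamese text, where the same accented character can be encoded in composed or decomposed form.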