HAKARI-Bench

NanoVNMTEB

Overview

NanoVNMTEB is the Nano task group for VN-MTEB retrieval. It contains Vietnamese retrieval versions of widely used MTEB and BEIR-style tasks, including duplicate question retrieval, fact-checking evidence retrieval, web search, natural question answering, finance QA, argument retrieval, biomedical retrieval, and scientific-paper retrieval. The group evaluates Vietnamese retrieval quality and robustness to translated benchmark artifacts.

The group contains 4,768 queries, 247,475 task-local documents, and 24,671 positive qrel rows across 26 tasks. Most tasks are Vietnamese, while nfcorpus_vn is marked multilingual because biomedical terminology and translation artifacts cross language boundaries. The group is large enough that aggregate scores can hide very different retrieval relations.

What This Group Measures

VN-MTEB translates and filters English MTEB datasets into Vietnamese while preserving named entities, numbers, links, special characters, fluency, and meaning. NanoVNMTEB focuses on the retrieval family from that benchmark. A high score means that a model can preserve the original retrieval relation after Vietnamese translation, whether the relation is duplicate intent, evidence support, web answerability, scientific relatedness, argument stance, or biomedical relevance.

The group should not be read as one native Vietnamese corpus. Many tasks inherit their semantics from English benchmark sources. This makes it valuable for testing multilingual and translation-robust retrievers, but it also means that source-task shape matters as much as Vietnamese language quality.

Task Families

Dataset Shape

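Most splits have 200 queries and 10,000 candidate documents. The exceptions are argu_ana_vn (199 queries, 8,674 documents), sci_fact_vn (134 queries, 5,183 documents), nfcorpus_vn (166 queries, 3,618 documents), and two tasks that keep 10,000 documents but have few queries: touche2020_vn (25) and treccovid_vn (44). Positive density varies sharply. ArguAna is single-positive. The FEVER, NQ, and MS MARCO-style tasks are close to single-positive, at 1.07 to 1.17 positives per query. DBpedia, NFCorpus, Touché, and TREC-COVID are strongly multi-positive, at roughly 19 to 93 positives per query, and several duplicate-question tasks average three to four.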

The group average is 5.17 positives per query, but that average is driven by large relevance sets in DBpedia, NFCorpus, Touché, and TREC-COVID. The median task has about 2.2 positives per query, and 11 of the 26 tasks have 2.00 or fewer. Query length also varies: argu_ana_vn uses long translated debate arguments, while MS MARCO, NQ, and many CQADupStack queries are short search or question strings.
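
The gap between the pooled average and the typical task can be reproduced from the query and positive counts in the Task Summary table. The sketch below is only an illustration: the dictionary name and layout are assumptions, while the counts are copied from the table.

```python
from statistics import median

# (queries, positives) per task, copied from the Task Summary table.
TASKS = {
    "argu_ana_vn": (199, 199),
    "climate_fever_vn": (200, 635),
    "cqadupstack_android_vn": (200, 811),
    "cqadupstack_gis_vn": (200, 299),
    "cqadupstack_mathematica_vn": (200, 424),
    "cqadupstack_physics_vn": (200, 592),
    "cqadupstack_programmers_vn": (200, 490),
    "cqadupstack_stats_vn": (200, 310),
    "cqadupstack_tex_vn": (200, 743),
    "cqadupstack_unix_vn": (200, 434),
    "cqadupstack_webmasters_vn": (200, 825),
    "cqadupstack_wordpress_vn": (200, 337),
    "dbpedia_vn": (200, 5754),
    "fever_vn": (200, 232),
    "fi_qa2018_vn": (200, 549),
    "hotpot_qa_vn": (200, 400),
    "msmarco_vn": (200, 214),
    "nano_fever": (200, 232),
    "nano_nq": (200, 234),
    "nfcorpus_vn": (166, 4571),
    "nq_vn": (200, 234),
    "quora_vn": (200, 452),
    "sci_fact_vn": (134, 155),
    "scidocs_vn": (200, 988),
    "touche2020_vn": (25, 481),
    "treccovid_vn": (44, 4076),
}

total_queries = sum(q for q, _ in TASKS.values())
total_positives = sum(p for _, p in TASKS.values())

# Pooled average: all positives divided by all queries in the group.
pooled_avg = total_positives / total_queries

# Per-task positives per query, then the median across the 26 tasks.
per_task = {name: p / q for name, (q, p) in TASKS.items()}
median_task = median(per_task.values())

print(f"queries={total_queries} positives={total_positives}")
print(f"pooled average positives/query: {pooled_avg:.2f}")
print(f"median task positives/query:    {median_task:.3f}")
```

The pooled figure matches the 5.17 reported for the group, while the median sits a little above two because four tasks contribute most of the positive qrels.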

Retrieval Behavior

BM25 Profile

BM25 reaches its highest nDCG@10 on quora_vn (0.8345), where it edges out dense retrieval (0.8259) but not the reranking hybrid (0.8510), and it remains strong on entity-heavy and fact-like tasks. BM25 is not the best profile on any task. fever_vn, nano_fever, hotpot_qa_vn, msmarco_vn, and quora_vn all score between roughly 0.76 and 0.83 because translated named entities, titles, and short duplicate phrases often preserve lexical overlap. BM25 also works reasonably on many CQADupStack domains when technical terms, code fragments, product names, or StackExchange terminology survive translation.

BM25 is weak on tasks where the source relevance relation is not lexical. scidocs_vn has the lowest BM25 nDCG@10 in the group (0.1613) because related scientific papers may not share title words. climate_fever_vn, argu_ana_vn, and several CQADupStack domains also require evidence, stance, or duplicate intent beyond topical overlap. treccovid_vn has many positives (92.64 per query), but BM25 nDCG@10 is modest (0.2811) because broad COVID terminology does not rank the best judged literature records early.
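
To make the lexical-overlap argument concrete, here is a minimal Okapi BM25 sketch. The source does not state the benchmark's BM25 configuration, so the whitespace tokenizer, the parameters k1=1.2 and b=0.75, the function name, and the three toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs_tokens for term in set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (illustrative sentences, not benchmark data), whitespace-tokenized.
docs = [
    "Hà Nội là thủ đô của Việt Nam",
    "Paris là thủ đô của Pháp",
    "Phở là món ăn nổi tiếng",
]
query = "thủ đô của Việt Nam"

docs_tokens = [d.lower().split() for d in docs]
scores = bm25_scores(query.lower().split(), docs_tokens)
for doc, score in sorted(zip(docs, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```

The third sentence shares no query token and scores exactly zero. That is the failure mode described above for related-paper, stance, and duplicate-intent tasks, where relevant documents may use different words.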

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest group-level profile. It has the best nDCG@10 on 17 of the 26 tasks, including ArguAna, climate evidence, DBpedia, FEVER, HotpotQA, MS MARCO, NQ, SciFact, TREC-COVID, and six of the ten CQADupStack domains. The large gains over BM25 on nano_nq (+0.24), nq_vn (+0.21), msmarco_vn (+0.17), and fever_vn (+0.15) show that Vietnamese embedding similarity helps connect translated questions and claims to answer-bearing or evidence passages.

Dense is not always best. quora_vn slightly favors hybrid, and some technical duplicate tasks favor hybrid because exact tokens and semantic intent are both important. Still, dense retrieval is the main profile for VN-MTEB-style Vietnamese translation robustness.
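
The dense profile reduces to ranking documents by embedding similarity. The sketch below assumes cosine similarity and uses hand-written three-dimensional vectors as stand-ins for model embeddings. It does not load harrier-oss-270m, and the function names and document ids are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_cosine(query_vec, doc_vecs, top_k=10):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Made-up vectors standing in for embeddings of one query and three documents.
query_vec = [1.0, 0.2, 0.0]
doc_vecs = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.9, 0.0],
    "d3": [0.7, 0.7, 0.1],
}

ranking = rank_by_cosine(query_vec, doc_vecs)
for doc_id, sim in ranking:
    print(f"{doc_id}  {sim:.3f}")
```

Because the score depends only on vector direction, a document can rank first without sharing any surface token with the query, which is the property the translated QA and evidence tasks reward.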

Reranking Hybrid Profile

The reranking hybrid profile is best for several tasks where exact tokens and semantic relatedness both matter: cqadupstack_mathematica_vn, cqadupstack_stats_vn, cqadupstack_tex_vn, cqadupstack_wordpress_vn, fi_qa2018_vn, nfcorpus_vn, quora_vn, scidocs_vn, and touche2020_vn. These tasks often contain technical terms, formulas, biomedical terminology, argument terms, or duplicate clusters where sparse and dense signals recover complementary candidates.
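
The source does not specify how the reranking hybrid combines sparse and dense candidates. As one common option, reciprocal rank fusion (RRF) is sketched below. The function name, the constant k=60, and the document ids are illustrative assumptions, and no reranker stage is modeled.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids; higher fused score ranks first."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse_top = ["d3", "d1", "d7"]
dense_top = ["d1", "d5", "d3"]

fused_ranking = reciprocal_rank_fusion([sparse_top, dense_top])
print(fused_ranking)
```

Documents returned by both retrievers (d1 and d3 here) rise above documents found by only one, while d5 and d7 still stay in the candidate pool. This is one way sparse and dense signals can recover complementary candidates.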

Hybrid has the best group-level recall@100, which is important for multi-positive tasks. It is not the best nDCG@10 profile overall because dense ranking is stronger on many translated QA and evidence tasks. For Vietnamese retrieval systems, this suggests a practical split: dense is a strong default, while hybrid is valuable for technical duplicates, biomedical retrieval, argument retrieval, and candidate generation.
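
The difference between nDCG@10 and recall@100 is easier to see with the two metrics written out. The sketch below assumes binary relevance and a ranking without duplicate ids. The benchmark's own evaluation code may use graded qrels, and the function names and toy ids are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """Binary-relevance nDCG@k: discounted gain over the best achievable gain."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of all relevant documents found in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


# Toy query with three positives, two of which are retrieved.
ranked = ["a", "x", "b", "y"]
relevant = {"a", "b", "c"}

ndcg = ndcg_at_k(ranked, relevant, k=10)
recall = recall_at_k(ranked, relevant, k=100)
print(f"nDCG@10={ndcg:.4f} recall@100={recall:.4f}")
```

nDCG@10 rewards placing positives near the top, while recall@100 only asks whether they appear in the candidate list at all. A profile can therefore lead on one metric and trail on the other, especially on multi-positive tasks.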

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| argu_ana_vn | Argument retrieval | vi | 199 | 8,674 | 199 | 1.00 | 0.2742 | 0.3698 | 0.3372 | Dense |
| climate_fever_vn | Claim-evidence retrieval | vi | 200 | 10,000 | 635 | 3.17 | 0.2447 | 0.3713 | 0.3245 | Dense |
| cqadupstack_android_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 811 | 4.05 | 0.3774 | 0.4991 | 0.4629 | Dense |
| cqadupstack_gis_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 299 | 1.50 | 0.3038 | 0.3481 | 0.3420 | Dense |
| cqadupstack_mathematica_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 424 | 2.12 | 0.2157 | 0.1975 | 0.2367 | Reranking hybrid |
| cqadupstack_physics_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 592 | 2.96 | 0.4127 | 0.4991 | 0.4696 | Dense |
| cqadupstack_programmers_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 490 | 2.45 | 0.3568 | 0.4294 | 0.4229 | Dense |
| cqadupstack_stats_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 310 | 1.55 | 0.3205 | 0.3695 | 0.3796 | Reranking hybrid |
| cqadupstack_tex_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 743 | 3.71 | 0.2843 | 0.2927 | 0.3163 | Reranking hybrid |
| cqadupstack_unix_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 434 | 2.17 | 0.3822 | 0.4486 | 0.4455 | Dense |
| cqadupstack_webmasters_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 825 | 4.12 | 0.2517 | 0.3498 | 0.3236 | Dense |
| cqadupstack_wordpress_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 337 | 1.69 | 0.3214 | 0.3105 | 0.3672 | Reranking hybrid |
| dbpedia_vn | Entity retrieval | vi | 200 | 10,000 | 5,754 | 28.77 | 0.6137 | 0.7640 | 0.7247 | Dense |
| fever_vn | Claim-evidence retrieval | vi | 200 | 10,000 | 232 | 1.16 | 0.8013 | 0.9520 | 0.8904 | Dense |
| fi_qa2018_vn | Finance QA retrieval | vi | 200 | 10,000 | 549 | 2.75 | 0.3388 | 0.4057 | 0.4118 | Reranking hybrid |
| hotpot_qa_vn | Multi-hop evidence retrieval | vi | 200 | 10,000 | 400 | 2.00 | 0.8001 | 0.8773 | 0.8649 | Dense |
| msmarco_vn | Web passage retrieval | vi | 200 | 10,000 | 214 | 1.07 | 0.7579 | 0.9259 | 0.8285 | Dense |
| nano_fever | Claim-evidence retrieval | vi | 200 | 10,000 | 232 | 1.16 | 0.7967 | 0.9409 | 0.8680 | Dense |
| nano_nq | Open-domain QA retrieval | vi | 200 | 10,000 | 234 | 1.17 | 0.6095 | 0.8495 | 0.7234 | Dense |
| nfcorpus_vn | Biomedical retrieval | multilingual | 166 | 3,618 | 4,571 | 27.54 | 0.2552 | 0.2827 | 0.2902 | Reranking hybrid |
| nq_vn | Open-domain QA retrieval | vi | 200 | 10,000 | 234 | 1.17 | 0.5882 | 0.7981 | 0.6826 | Dense |
| quora_vn | Duplicate-question retrieval | vi | 200 | 10,000 | 452 | 2.26 | 0.8345 | 0.8259 | 0.8510 | Reranking hybrid |
| sci_fact_vn | Scientific evidence retrieval | vi | 134 | 5,183 | 155 | 1.16 | 0.6158 | 0.6636 | 0.6485 | Dense |
| scidocs_vn | Scholarly related-paper retrieval | vi | 200 | 10,000 | 988 | 4.94 | 0.1613 | 0.2028 | 0.2039 | Reranking hybrid |
| touche2020_vn | Argument retrieval | vi | 25 | 10,000 | 481 | 19.24 | 0.6841 | 0.6869 | 0.7280 | Reranking hybrid |
| treccovid_vn | COVID literature retrieval | vi | 44 | 10,000 | 4,076 | 92.64 | 0.2811 | 0.3750 | 0.3551 | Dense |

Interpretation Notes for Model Researchers

NanoVNMTEB is a translation-robustness and task-shape benchmark. Dense retrieval dominates many translated QA and evidence tasks, suggesting that Vietnamese semantic matching is critical. Hybrid retrieval is valuable for technical duplicate clusters, biomedical and scholarly retrieval, finance QA, and argument retrieval, where exact terms and semantic relatedness both matter.

The group should be analyzed by source family. Improvements on FEVER, MS MARCO, or NQ do not necessarily imply improvements on CQADupStack, SciDocs, NFCorpus, or Touché. Multi-positive tasks such as DBpedia, NFCorpus, Touché, and TREC-COVID should be inspected with both recall and ranking metrics, not only a single top-of-ranking score.
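
A per-family breakdown can be sketched directly from the nDCG@10 columns of the Task Summary table. The grouping below uses the Family column as the source family, which is an assumption about how families should be defined. The variable names are hypothetical and the scores are copied from the table.

```python
from collections import Counter, defaultdict

PROFILES = ("BM25", "Dense", "Reranking hybrid")

# (task, family, BM25, Dense, Reranking hybrid) nDCG@10 from the Task Summary table.
ROWS = [
    ("argu_ana_vn", "Argument retrieval", 0.2742, 0.3698, 0.3372),
    ("climate_fever_vn", "Claim-evidence retrieval", 0.2447, 0.3713, 0.3245),
    ("cqadupstack_android_vn", "Duplicate-question retrieval", 0.3774, 0.4991, 0.4629),
    ("cqadupstack_gis_vn", "Duplicate-question retrieval", 0.3038, 0.3481, 0.3420),
    ("cqadupstack_mathematica_vn", "Duplicate-question retrieval", 0.2157, 0.1975, 0.2367),
    ("cqadupstack_physics_vn", "Duplicate-question retrieval", 0.4127, 0.4991, 0.4696),
    ("cqadupstack_programmers_vn", "Duplicate-question retrieval", 0.3568, 0.4294, 0.4229),
    ("cqadupstack_stats_vn", "Duplicate-question retrieval", 0.3205, 0.3695, 0.3796),
    ("cqadupstack_tex_vn", "Duplicate-question retrieval", 0.2843, 0.2927, 0.3163),
    ("cqadupstack_unix_vn", "Duplicate-question retrieval", 0.3822, 0.4486, 0.4455),
    ("cqadupstack_webmasters_vn", "Duplicate-question retrieval", 0.2517, 0.3498, 0.3236),
    ("cqadupstack_wordpress_vn", "Duplicate-question retrieval", 0.3214, 0.3105, 0.3672),
    ("dbpedia_vn", "Entity retrieval", 0.6137, 0.7640, 0.7247),
    ("fever_vn", "Claim-evidence retrieval", 0.8013, 0.9520, 0.8904),
    ("fi_qa2018_vn", "Finance QA retrieval", 0.3388, 0.4057, 0.4118),
    ("hotpot_qa_vn", "Multi-hop evidence retrieval", 0.8001, 0.8773, 0.8649),
    ("msmarco_vn", "Web passage retrieval", 0.7579, 0.9259, 0.8285),
    ("nano_fever", "Claim-evidence retrieval", 0.7967, 0.9409, 0.8680),
    ("nano_nq", "Open-domain QA retrieval", 0.6095, 0.8495, 0.7234),
    ("nfcorpus_vn", "Biomedical retrieval", 0.2552, 0.2827, 0.2902),
    ("nq_vn", "Open-domain QA retrieval", 0.5882, 0.7981, 0.6826),
    ("quora_vn", "Duplicate-question retrieval", 0.8345, 0.8259, 0.8510),
    ("sci_fact_vn", "Scientific evidence retrieval", 0.6158, 0.6636, 0.6485),
    ("scidocs_vn", "Scholarly related-paper retrieval", 0.1613, 0.2028, 0.2039),
    ("touche2020_vn", "Argument retrieval", 0.6841, 0.6869, 0.7280),
    ("treccovid_vn", "COVID literature retrieval", 0.2811, 0.3750, 0.3551),
]

by_family = defaultdict(list)
task_wins = Counter()
for task, family, *scores in ROWS:
    by_family[family].append(scores)
    # Count which profile has the highest nDCG@10 on this task.
    task_wins[PROFILES[scores.index(max(scores))]] += 1

# Unweighted mean nDCG@10 per family, one value per profile.
family_means = {
    family: [sum(column) / len(column) for column in zip(*rows)]
    for family, rows in by_family.items()
}

print(dict(task_wins))
for family, means in sorted(family_means.items()):
    cells = "  ".join(f"{name}={value:.4f}" for name, value in zip(PROFILES, means))
    print(f"{family} (n={len(by_family[family])}): {cells}")
```

On these numbers, dense wins 17 tasks and the hybrid wins 9. In the duplicate-question family, however, the hybrid's mean nDCG@10 comes out slightly above dense even though dense wins six of the eleven tasks. A single group-level average hides this kind of family-level difference.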

Training and Leakage Notes

Useful training data includes Vietnamese duplicate-question pairs, Vietnamese Wikipedia QA, translated or native claim-evidence pairs, finance QA, argument retrieval, biomedical retrieval, scientific related-paper retrieval, and Vietnamese web passage retrieval. Technical tasks should preserve code snippets, math or TeX fragments, URLs, product names, and domain-specific terminology.

Leakage control should exclude NanoVNMTEB evaluation queries, qrels, positive documents, duplicate clusters, and translated variants of common benchmark test examples. Overlap audits are especially important for MS MARCO, FEVER, NQ, Quora, CQADupStack, TREC-COVID, and SciDocs because these sources often appear in multilingual or synthetic training mixtures.
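
A first-pass overlap audit can be sketched as an exact-match check on normalized text. This only catches verbatim or near-verbatim copies. It will not detect paraphrases or re-translated variants, which need fuzzier matching. The function names and example strings are hypothetical, and the normalization steps (Unicode NFC, case folding, whitespace collapsing) are assumptions rather than a documented procedure.

```python
import hashlib
import unicodedata


def normalize(text):
    """Canonicalize text so trivially different copies compare equal."""
    # NFC makes composed and decomposed Vietnamese diacritics identical.
    text = unicodedata.normalize("NFC", text).casefold()
    # Collapse runs of whitespace and trim the ends.
    return " ".join(text.split())


def fingerprint(text):
    """Stable hash of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_texts, eval_texts):
    """Return training texts whose fingerprint matches any evaluation text."""
    eval_prints = {fingerprint(t) for t in eval_texts}
    return [t for t in train_texts if fingerprint(t) in eval_prints]


# Illustrative strings, not taken from the benchmark.
eval_queries = ["Thủ đô của Việt Nam là gì?"]
train_queries = [
    "thủ đô của  Việt Nam là gì?",  # same query, different case and spacing
    "Giá vàng hôm nay bao nhiêu?",
]

leaked = find_overlap(train_queries, eval_queries)
print(leaked)
```

The same check can be run over positive documents as well as queries, since the leakage notes above call for excluding both.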

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| VN-MTEB: Vietnamese Massive Text Embedding Benchmark | 2026 | benchmark paper | https://aclanthology.org/2026.findings-eacl.86/ |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| BEIR | 2021 | benchmark paper | https://arxiv.org/abs/2104.08663 |
| CQADupStack | 2015 | source task paper | https://doi.org/10.1145/2838931.2838934 |
| FEVER | 2018 | source task paper | https://arxiv.org/abs/1803.05355 |
| Natural Questions | 2019 | source task paper | https://aclanthology.org/Q19-1026/ |
| MS MARCO | 2016 | source task paper | https://arxiv.org/abs/1611.09268 |
| TREC-COVID | 2020 | source task paper | https://arxiv.org/abs/2005.04474 |
| GreenNode datasets | | dataset organization | https://huggingface.co/GreenNode |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 26 |
| Queries | 4,768 |
| Split-local documents | 247,475 |
| Positive qrels | 24,671 |
| Languages | multilingual, vi |
| Categories | natural_language |
| Positives / query avg | 5.17 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| argu_ana_vn | NanoVNMTEB | vi | natural_language | 199 | 8,674 | 199 | 0.2742 | 0.3698 | 0.3372 | Dense |
| climate_fever_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 635 | 0.2447 | 0.3713 | 0.3245 | Dense |
| cqadupstack_android_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 811 | 0.3774 | 0.4991 | 0.4629 | Dense |
| cqadupstack_gis_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 299 | 0.3038 | 0.3481 | 0.3420 | Dense |
| cqadupstack_mathematica_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 424 | 0.2157 | 0.1975 | 0.2367 | Reranking hybrid |
| cqadupstack_physics_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 592 | 0.4127 | 0.4991 | 0.4696 | Dense |
| cqadupstack_programmers_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 490 | 0.3568 | 0.4294 | 0.4229 | Dense |
| cqadupstack_stats_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 310 | 0.3205 | 0.3695 | 0.3796 | Reranking hybrid |
| cqadupstack_tex_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 743 | 0.2843 | 0.2927 | 0.3163 | Reranking hybrid |
| cqadupstack_unix_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 434 | 0.3822 | 0.4486 | 0.4455 | Dense |
| cqadupstack_webmasters_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 825 | 0.2517 | 0.3498 | 0.3236 | Dense |
| cqadupstack_wordpress_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 337 | 0.3214 | 0.3105 | 0.3672 | Reranking hybrid |
| dbpedia_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 5,754 | 0.6137 | 0.7640 | 0.7247 | Dense |
| fever_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 232 | 0.8013 | 0.9520 | 0.8904 | Dense |
| fi_qa2018_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 549 | 0.3388 | 0.4057 | 0.4118 | Reranking hybrid |
| hotpot_qa_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 400 | 0.8001 | 0.8773 | 0.8649 | Dense |
| msmarco_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 214 | 0.7579 | 0.9259 | 0.8285 | Dense |
| nano_fever | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 232 | 0.7967 | 0.9409 | 0.8680 | Dense |
| nano_nq | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 234 | 0.6095 | 0.8495 | 0.7234 | Dense |
| nfcorpus_vn | NanoVNMTEB | multilingual | natural_language | 166 | 3,618 | 4,571 | 0.2552 | 0.2827 | 0.2902 | Reranking hybrid |
| nq_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 234 | 0.5882 | 0.7981 | 0.6826 | Dense |
| quora_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 452 | 0.8345 | 0.8259 | 0.8510 | Reranking hybrid |
| sci_fact_vn | NanoVNMTEB | vi | natural_language | 134 | 5,183 | 155 | 0.6158 | 0.6636 | 0.6485 | Dense |
| scidocs_vn | NanoVNMTEB | vi | natural_language | 200 | 10,000 | 988 | 0.1613 | 0.2028 | 0.2039 | Reranking hybrid |
| touche2020_vn | NanoVNMTEB | vi | natural_language | 25 | 10,000 | 481 | 0.6841 | 0.6869 | 0.7280 | Reranking hybrid |
| treccovid_vn | NanoVNMTEB | vi | natural_language | 44 | 10,000 | 4,076 | 0.2811 | 0.3750 | 0.3551 | Dense |