NanoVNMTEB / dbpedia_vn

Overview

dbpedia_vn is the Vietnamese NanoVNMTEB version of the DBpedia-Entity retrieval task. DBpedia-Entity v2 was built as a test collection for entity search over DBpedia and Wikipedia-derived entity descriptions; VN-MTEB carries the same retrieval setting over to Vietnamese through translation. A query is usually a short entity-oriented request, and relevant documents are compact entity descriptions that satisfy the request.

The Nano split contains 200 queries, 10,000 candidate documents, and 5,754 positive qrels. Queries average 42.085 characters, and documents average 340.3752 characters. Unlike single-answer passage retrieval, this task is strongly multi-positive: the average query has 28.77 positives, and the median has 19. Dense retrieval is strongest on nDCG@10 and recall@100, while reranking_hybrid has the highest hit@10. The task is best understood as Vietnamese set-valued entity retrieval, where the model must rank many correct entities above same-category distractors.

Details

What the Original Data Measures

DBpedia-Entity v2 evaluates entity search over descriptions derived from DBpedia. Its information needs come from several sources and cover both entity-seeking and list-seeking queries. The retrieval target is not an arbitrary passage but an entity description, often identified by name, category, date, location, role, or membership in a set.

The Vietnamese version preserves this entity-search behavior. Queries may ask for Indian dishes, Star Trek captains, producers with many films, islands, countries, authors, presidents, bridges similar to a named bridge, or albums associated with an artist. Relevant documents are short entity abstracts containing titles and attributes. Retrieval must therefore combine entity-name matching, category matching, and semantic list membership.

Observed Data Profile

The task has 5,754 positive judgments across 200 queries. The average positive count is 28.77, the median is 19, and 194 of 200 queries have more than one positive, giving a multi-positive rate of 97.0%. The maximum positive count is 100. This is one of the clearest multi-positive tasks in the NanoVNMTEB set.
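The figures above follow directly from the qrels. As a minimal sketch, assuming the qrels are available as (query_id, doc_id) pairs with one pair per positive judgment, the profile can be recomputed as follows. The function name and the toy pairs are illustrative and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, median


def positive_count_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) pairs."""
    positives = defaultdict(set)
    for query_id, doc_id in qrels:
        positives[query_id].add(doc_id)
    counts = [len(docs) for docs in positives.values()]
    multi = sum(1 for count in counts if count > 1)
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "min": min(counts),
        "max": max(counts),
        "multi_positive_rate": multi / len(counts),
    }


# Toy qrels for illustration only (not dataset content).
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"), ("q1", "d3"),
    ("q2", "d4"),
    ("q3", "d5"), ("q3", "d6"),
]
profile = positive_count_profile(toy_qrels)

# The reported averages are consistent with the reported totals.
reported_avg = round(5754 / 200, 2)      # positives per query
reported_multi_rate = 194 / 200          # share of multi-positive queries
```

On the real split, the same function should reproduce the reported average of 28.77 and the multi-positive rate of 97.0% if the qrels are loaded in this pair form.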

Documents are short compared with forum or QA tasks, so the challenge is not long-context extraction. Instead, the difficulty lies in ranking many entities that share a category. A query may ask for a list of entities, and a model must retrieve a set rather than stop after the first plausible match. Same-category hard negatives are common because many entities look locally similar.

BM25 Evaluation Profile

With a top-500 candidate set, BM25 reaches nDCG@10 of 0.6137045546, hit@10 of 0.9600, and recall@100 of 0.5935001738. The high hit@10 shows that lexical retrieval places at least one relevant entity in the top 10 for 96% of queries. Entity names, categories, places, dates, and aliases provide strong sparse signals.

However, BM25 is weaker than dense retrieval on nDCG@10 and recall@100. List-style queries require ranking many relevant entities, and exact word overlap can favor a few obvious matches while missing semantically valid members of the set. BM25 may also over-rank documents that share the same category word but do not satisfy the full query constraint.
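The category-word effect can be seen in a small Okapi BM25 scorer. This is a sketch under stated assumptions, not the configuration behind the reported numbers: the values k1=1.5 and b=0.75, the whitespace tokenizer, and the English toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25.

    Tokenization is plain lowercase whitespace splitting (an assumption
    made for brevity; real Vietnamese text needs a proper tokenizer).
    """
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in set(query.lower().split()):
            tf = term_freq[term]
            if tf == 0:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores[doc_id] = score
    return scores


# Toy documents for illustration only (not dataset content).
toy_docs = {
    "on_target": "suspension bridge in lisbon",
    "category_only": "bridge bridge bridge toy model kit",
    "off_topic": "indian rice pancake",
}
scores = bm25_scores("suspension bridge", toy_docs)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Here the `category_only` document fails the "suspension" constraint yet still earns a positive score from repeating the category word, which is the kind of distractor that can crowd a top-10 list when many such documents exist.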

Dense Evaluation Profile

Dense retrieval with harrier-oss-270m (config `harrier_oss_v1_270m`) reaches nDCG@10 of 0.7640113769, hit@10 of 0.9650, and recall@100 of 0.7233229058. It is the strongest profile for nDCG@10 and recall@100. This suggests that embedding similarity captures entity-category and list-membership semantics better than exact lexical overlap alone.

Dense retrieval is particularly useful for queries that describe a class indirectly, such as entities similar to a named bridge, works associated with a person, or members of a geographic or cultural category. It can retrieve entities that do not repeat every query token but match the semantic set. Its risk is retrieving same-topic entities that are close in embedding space but fail a specific constraint such as date, nationality, role, or membership.

Reranking Hybrid Evaluation Profile

reranking_hybrid reaches nDCG@10 of 0.7247250643, hit@10 of 0.9900, and recall@100 of 0.7076816128. Its top-100 candidate pool has exactly 100 candidates per query, with no safeguard-expanded rows. Hybrid retrieval places at least one positive in the top 10 for 99% of queries, but its nDCG@10 and recall@100 remain below those of dense retrieval.

This shows that hybrid search improves first-hit reliability but does not automatically improve list ranking. Sparse evidence can add useful entity-name and category anchors, but it may also pull same-word distractors into the top ranks. For this task, dense retrieval orders the large positive sets better (nDCG@10 of 0.7640 against 0.7247), while hybrid retrieval raises hit@10 only modestly (0.9900 against 0.9650).
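This card does not state how reranking_hybrid combines sparse and dense evidence. Purely as an illustration of how a sparse anchor can lift a distractor, the sketch below uses reciprocal rank fusion, a common fusion scheme that is assumed here and not confirmed as the method behind the reported numbers. The constant k=60 and the document IDs are likewise assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs by summing 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: "bridge_x" is a same-word distractor that only
# the sparse ranker retrieves.
sparse_ranking = ["bridge_a", "bridge_x", "bridge_b"]
dense_ranking = ["bridge_b", "bridge_a", "bridge_c"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

In this toy case the distractor `bridge_x` ends up above `bridge_c`, which only the dense ranker found. Fusion keeps documents both rankers agree on at the top, but it can reorder the tail in favor of lexical matches.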

Metric Interpretation for Model Researchers

Hit@10 is less discriminating here because all three profiles fall between 0.96 and 0.99 and are close to saturation. nDCG@10 and recall@100 are more informative: the task asks for many relevant entities, so ranking quality across the top set matters. Dense retrieval's advantage on both metrics suggests that semantic list membership is central.
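A small sketch makes the difference between the metrics concrete. It assumes binary relevance (a document is either a positive or not), which is a simplification for illustration; the evaluation behind the reported scores may use a different gain definition. The document IDs are toy values.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """nDCG@k with binary gains and a log2 position discount."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc in enumerate(ranked[:k])
        if doc in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


positives = {"p1", "p2", "p3", "p4"}
one_hit = ["p1"] + [f"n{i}" for i in range(9)]
many_hits = ["p1", "p2", "p3"] + [f"n{i}" for i in range(7)]

same_hit = hit_at_k(one_hit, positives) == hit_at_k(many_hits, positives)
ndcg_gap = ndcg_at_k(many_hits, positives) - ndcg_at_k(one_hit, positives)
```

Both rankings score 1.0 on hit@10, yet the second retrieves three of the four positives and earns a clearly higher nDCG@10, which is why hit@10 alone cannot separate systems on this task.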

Researchers should treat this as a multi-positive entity retrieval benchmark, not a single-evidence task. Training and evaluation should reward retrieving multiple correct entities and ranking them above same-category non-relevant entities. Reducing each query to one positive would discard the main structure of the dataset.
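One way to reward several correct entities at once is a multi-positive contrastive loss. The sketch below shows one common formulation, in which the probability mass of all positives is summed inside the logarithm. It is an assumed example rather than a prescribed objective, and the temperature and scores are hypothetical.

```python
import math


def multi_positive_nce(scores, positive_mask, temperature=0.05):
    """Negative log of the softmax mass assigned to all positives.

    scores: similarity of the query to each candidate document.
    positive_mask: True where the candidate is a labeled positive.
    """
    scaled = [score / temperature for score in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in scaled]
    positive_mass = sum(e for e, is_pos in zip(exps, positive_mask) if is_pos)
    if positive_mass == 0.0:
        raise ValueError("at least one positive with finite score is required")
    return -math.log(positive_mass / sum(exps))


# Hypothetical similarities: the first two candidates are both correct.
scores = [0.9, 0.8, 0.2, 0.1]
loss_all_positives = multi_positive_nce(scores, [True, True, False, False])
loss_one_positive = multi_positive_nce(scores, [True, False, False, False])
```

When the second correct entity is labeled as a positive, the loss is lower than when it is treated as a negative. Collapsing each query to a single positive would therefore penalize the model for scoring other correct entities highly.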

Query and Relevance Type Tendencies

Queries include list searches, entity descriptions, and category-based needs. Examples include bridges similar to Manhattan Bridge, Indian dishes, Star Trek captains, film producers, and albums involving John Lennon and Yoko Ono. Relevant documents are short entity abstracts with titles and compact descriptions.

Relevance depends on satisfying all constraints in the query. A document about a bridge may be irrelevant if it is not similar in the intended way. A person may be irrelevant if they share an occupation but not the requested role or period. Entity retrieval therefore requires matching both category and qualifiers.

Representative Failure Modes

BM25 can over-rank entities with exact category words while missing entities that satisfy the list through paraphrase or description. Dense retrieval can over-rank entities in the same semantic neighborhood but with the wrong date, location, role, or relation. Hybrid retrieval can inherit both issues when sparse anchors and semantic similarity point to different parts of the entity space.

Another failure mode is first-hit complacency. A model may achieve high hit@10 while retrieving only one obvious positive and failing to cover the rest of the relevant set. For this benchmark, set coverage is part of the task.
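First-hit complacency can be surfaced during error analysis by flagging queries that register a hit but cover little of what the top k could hold. The sketch below is illustrative: the function names, the 0.5 coverage threshold, and the toy runs are assumptions, and coverage is normalized by min(number of positives, k) so that queries with more than k positives are not penalized unfairly.

```python
def coverage_at_k(ranked, positives, k=10):
    """Return (hit, coverage) for one query.

    coverage = positives found in the top k / min(len(positives), k).
    """
    found = sum(1 for doc in ranked[:k] if doc in positives)
    reachable = min(len(positives), k)
    coverage = found / reachable if reachable else 0.0
    return found > 0, coverage


def complacent_queries(runs, qrels, k=10, min_coverage=0.5):
    """List query IDs that hit at k but cover less than min_coverage."""
    flagged = []
    for query_id, ranked in runs.items():
        hit, coverage = coverage_at_k(ranked, qrels.get(query_id, set()), k)
        if hit and coverage < min_coverage:
            flagged.append(query_id)
    return flagged


# Toy data for illustration only.
toy_qrels = {"q_list": {"p1", "p2", "p3", "p4"}, "q_single": {"p9"}}
toy_runs = {
    "q_list": ["p1"] + [f"n{i}" for i in range(9)],    # one of four positives
    "q_single": ["n0", "p9"] + [f"n{i}" for i in range(1, 9)],
}
flagged = complacent_queries(toy_runs, toy_qrels)
```

Both toy queries count as hits at 10, but only `q_list` is flagged, because it surfaces one of four reachable positives.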

Training Data That May Help

Useful training data includes non-overlapping DBpedia-Entity queries, Vietnamese entity linking and entity retrieval pairs, Wikipedia or DBpedia question-to-entity data, and list-search supervision with overlap removed. Multi-positive training is especially important.

Synthetic data should generate Vietnamese entity-search queries over small pools of entity descriptions, with all matching entities labeled. Hard negatives should come from the same broad category but violate a specific constraint, such as wrong country, wrong date, wrong profession, or wrong fictional universe.
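The labeling scheme described above can be sketched as a function that splits a small entity pool into positives, hard negatives, and ignored easy negatives. The record layout (a `category` field plus an `attributes` dict), the rule that a hard negative violates exactly one constraint, and the toy entities are all assumptions made for illustration.

```python
def label_pool(pool, category, constraints):
    """Split a pool of entity records into positives and hard negatives.

    A positive matches the category and every constraint. A hard negative
    matches the category but violates exactly one constraint. Entities from
    other categories are skipped as easy negatives.
    """
    positives, hard_negatives = [], []
    for entity in pool:
        if entity["category"] != category:
            continue
        violated = [
            key for key, wanted in constraints.items()
            if entity["attributes"].get(key) != wanted
        ]
        if not violated:
            positives.append(entity["id"])
        elif len(violated) == 1:
            hard_negatives.append((entity["id"], violated[0]))
    return positives, hard_negatives


# Hypothetical pool for a query such as "Indian dishes".
toy_pool = [
    {"id": "appam", "category": "dish", "attributes": {"country": "India"}},
    {"id": "dosa", "category": "dish", "attributes": {"country": "India"}},
    {"id": "pho", "category": "dish", "attributes": {"country": "Vietnam"}},
    {"id": "manhattan_bridge", "category": "bridge",
     "attributes": {"country": "United States"}},
]
positives, hard_negatives = label_pool(toy_pool, "dish", {"country": "India"})
```

Every matching entity is labeled as a positive, which preserves the multi-positive structure. Recording which constraint a hard negative violates also makes it easier to balance negatives across wrong country, wrong date, wrong profession, and similar cases.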

Model Improvement Notes

The main improvement target is list-aware dense retrieval. Models should learn category membership, aliases, dates, locations, and role constraints. Sparse evidence remains useful for names and rare entities, but it should not dominate ranking when the query is semantic or list-based.

Error analysis should inspect whether missed positives are due to alias mismatch, category mismatch, qualifier mismatch, or insufficient multi-positive coverage. Reranking should optimize ordering across many relevant entities, not only whether any positive appears near the top.

Example Data

Query	Positive document
Những cây cầu nào giống như Cầu Manhattan? [42 chars]	Cầu 25 de Abril Cầu 25 de Abril (tiếng Bồ Đào Nha: Ponte 25 de Abril, phát âm tiếng Bồ Đào Nha: [ˈpõt(ɨ) ˈvĩt i ˈsĩku ðɨ ɐˈβɾiɫ]) là cây cầu treo nối thành phố Lisbon, thủ đô của Bồ Đào Nha đến thị trấn Almada trên bờ trái (phía nam) Sông Tagus. Nó được khánh thành vào ngày 6 tháng 8 năm 1966 và một đường tàu hỏa đã được bổ sung vào năm 1999. Bởi vì nó là một cây cầu treo có màu sắc tương tự, nên nó thường bị so sánh với Cầu Cổng Vàng ở San Francisco, Hoa Kỳ. [464 chars]
John Lennon Yoko Ono album Starting Over [40 chars]	(Chỉ Như) Bắt Đầu Một Lần Nữa (Just Like) Starting Over là một ca khúc được viết và trình diễn bởi John Lennon cho album của ông, Double Fantasy. Mặt B của đĩa đơn này là "Kiss Kiss Kiss" thuộc về Yoko Ono. Ca khúc được phát hành vào ngày 20 tháng 10 năm 1980 tại Hoa Kỳ và bốn ngày sau đó ở Vương quốc Anh, và đã đứng đầu bảng xếp hạng âm nhạc tại cả hai nước sau khi ông bị ám sát. Năm 2013, tạp chí Billboard đã xếp nó vào vị trí thứ 62 trong số những ca khúc vĩ đại nhất mọi thời đại trên bảng xếp hạng Billboard Hot 100. [526 chars]
Món Ấn Độ [9 chars]	Bánh Appam Appam là một loại bánh kếp được làm từ bột gạo lên men và sữa dừa. Nó là món ăn phổ biến ở bang Kerala miền Nam Ấn Độ. Nó cũng rất phổ biến ở Tamil Nadu và Sri Lanka. Món này thường được ăn vào bữa sáng hoặc bữa tối. Appam được xem như một chế độ ăn chính thức của người Nasrani (còn gọi là Kitô hữu Thánh Thomas hay Kitô hữu Syria) tại Kerala, họ dùng nó như một món đồng thời mang tính văn hóa đặc trưng cho cộng đồng này. [436 chars]

Source Reference Table

Source	Role
DBpedia-Entity v2	Original entity-search benchmark
DBpedia-Entity project page	Collection description and benchmark context
BEIR	Retrieval benchmark framing
VN-MTEB	Vietnamese benchmark collection using translated retrieval tasks
GreenNode dataset card	Public dataset entry for this Vietnamese split

Dataset Information

Field	Value
Nano set	NanoVNMTEB
Backing dataset	NanoVNMTEB
Task / split	dbpedia_vn
Hugging Face dataset	hakari-bench/NanoVNMTEB
Language	vi
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	5,754
Positives / query avg	28.77
Positives / query min	1
Positives / query median	19.00
Positives / query max	100
Multi-positive queries	194 (97.00%)
Query length avg chars	42.09
Document length avg chars	340.38

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.6137	0.9600	0.5935	top-500
Dense	`harrier_oss_v1_270m`	0.7640	0.9650	0.7233	top-500
Reranking hybrid	`reranking_hybrid`	0.7247	0.9900	0.7077	top-100

Training and Leakage Metadata

Original train split: unknown
Evaluation split origin: translated VN-MTEB DBpedia-Entity test split from GreenNode/dbpedia-vn
Train/eval overlap audit: not_audited
Leakage note: Exclude translated DBpedia-VN test queries, qrels, and positive entity descriptions used by this Nano split.
Multi-positive training: multi_positive_objective
Useful training data: non-overlapping DBpedia-Entity and entity-search queries, Vietnamese entity linking and entity retrieval pairs, Wikipedia or DBpedia question-to-entity data, list-search supervision with overlap removed
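
Since the overlap audit is marked not_audited, a basic exclusion pass in line with the leakage note can be sketched as below. It assumes training data arrives as (query, positive_document) text pairs and only catches matches that are identical after Unicode normalization, case folding, and whitespace collapsing. Near-duplicates, paraphrases, and untranslated English originals would need a separate check. The function names and toy strings are illustrative.

```python
import unicodedata


def normalize(text):
    """NFC-normalize, case-fold, and collapse whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).casefold().split())


def drop_eval_overlap(train_pairs, eval_queries, eval_positive_docs):
    """Keep training pairs whose query and document are both unseen in eval."""
    blocked_queries = {normalize(query) for query in eval_queries}
    blocked_docs = {normalize(doc) for doc in eval_positive_docs}
    return [
        (query, doc)
        for query, doc in train_pairs
        if normalize(query) not in blocked_queries
        and normalize(doc) not in blocked_docs
    ]


# Toy data for illustration only.
eval_queries = ["Món Ấn Độ"]
eval_positive_docs = ["Bánh Appam là một loại bánh kếp."]
train_pairs = [
    ("món  ấn độ", "Một mô tả khác."),                   # query overlaps eval
    ("Món Thái", "Bánh Appam là một loại bánh kếp."),    # document overlaps eval
    ("Món Việt Nam", "Phở là một món ăn Việt Nam."),     # clean
]
kept = drop_eval_overlap(train_pairs, eval_queries, eval_positive_docs)
```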