NanoCMTEB / video

Overview

NanoCMTEB video is a Chinese entertainment-video retrieval task from the Multi-CPR and C-MTEB retrieval families. Queries are very short user video-search strings, and documents are compact video title or metadata records. The task measures whether retrieval systems can identify the intended video, episode, clip, performer, title, or metadata item from short and sometimes mixed-script queries.

Details

What the Original Data Measures

Multi-CPR includes an entertainment-video retrieval domain collected from real search systems with human relevance judgments. C-MTEB includes VideoRetrieval in its Chinese retrieval group. The task is industrial search over short titles and metadata rather than long passage QA.

The query may be a title fragment, performer name, episode clue, romanized string, device term, or mixed-script entertainment search. The positive document is the matching video title or metadata record.

Observed Data Profile

The task contains 200 queries, 10,000 documents, and 200 relevance judgments. It is strictly single-positive in the Nano labels: every query has exactly 1 positive, and there are 0 multi-positive queries.

Queries average 7.07 characters, and documents average 30.52 characters. Text is mainly Chinese but may include Japanese, English, Korean names, romanization, abbreviations, and mixed-script titles.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.6897, hit@10 of 0.8050, and recall@100 of 0.8950 using the top-500 BM25 candidate subset. This is a strong lexical baseline because exact title fragments, names, and episode tokens are highly informative.

BM25's limitation is alias and script variation. Short user queries may omit words, use alternate title forms, include romanization, or refer to a performer rather than the full title. Exact matching can also confuse same-series or same-cast items.

Dense Evaluation Profile

The dense harrier-oss-270m run reaches nDCG@10 of 0.8629, hit@10 of 0.9500, and recall@100 of 0.9850. Dense retrieval is the strongest top-ranking profile. It improves substantially over BM25 in nDCG@10 and hit@10.

This suggests that embedding similarity is effective for short entertainment metadata retrieval, especially when aliases, title variants, or mixed scripts separate the query from the exact document title.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate set reaches nDCG@10 of 0.8103, hit@10 of 0.9200, and recall@100 of 0.9950. It uses a top-100 candidate range with an optional rank-101 safeguard; this task has 1 safeguard row, candidate counts from 100 to 101, and a mean of 100.01 candidates.

Hybrid retrieval has the best recall@100 but trails dense retrieval for top-10 ordering. The hybrid pool is useful for broad candidate coverage, while dense retrieval is the best observed ranker for the intended video item.

Metric Interpretation for Model Researchers

This task is dense-favorable with a strong exact-match baseline. BM25 performs well because title terms and names are powerful signals, but dense retrieval better handles aliases and short-query ambiguity. Reranking_hybrid is useful when downstream reranking needs nearly complete top-100 coverage.

Because each query has a single positive, ranking mistakes usually represent selecting the wrong video, episode, cast-related item, or title variant. Precision over near duplicates is critical.

Query and Relevance Type Tendencies

Queries include dance or exam videos, TV drama titles, performer names, animation episodes, educational or medical clips, device troubleshooting videos, and short mixed-script searches. Positive documents are compact titles or metadata records.

The relevance relation is exact media-item identification. Same-series, same-cast, same-topic, or same-device documents are hard negatives if they are not the intended item.

Representative Failure Modes

Likely failures include retrieving the wrong episode from the same series, matching a performer but wrong clip, confusing romanized and translated titles, over-ranking same-topic device videos, and missing short aliases.

BM25 is vulnerable to alias and script mismatch. Dense retrieval can over-generalize within series or performer clusters. Hybrid retrieval improves coverage but still needs metadata-aware reranking for exact item selection.

Training Data That May Help

Useful training data includes video search query-title pairs, entertainment metadata retrieval pairs, multilingual title alias pairs, and hard negatives from the same series, cast, performer, episode number, or device model.

Synthetic data should generate compact Chinese video titles and metadata records, then create short user video-search strings with aliases, abbreviations, romanization, and title variants. Hard negatives should share series names or people but differ in episode, season, device model, or media object.

Model Improvement Notes

Strong systems should handle short queries, aliases, mixed scripts, and near-duplicate media metadata. Dense retrieval is the best observed first-stage method, while sparse matching remains useful for exact title fragments and names.

The task is useful for evaluating entertainment search systems where the user intent is a specific video record rather than a general topic.

Example Data

Query	Positive document
游泳和悦悦 [5 chars]	悦悦游泳20170817 21m [16 chars]
甲状腺的检查 [6 chars]	科普时间专业仪器如何检查甲状腺 [16 chars]
BAMBINo [7 chars]	bambino2016 恩率 oppa [19 chars]

Source Reference Table

Item	Reference
Task paper	Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval
Benchmark paper	C-Pack: Packed Resources For General Chinese Embeddings
Source dataset	mteb/VideoRetrieval
NanoCMTEB dataset	hakari-bench/NanoCMTEB

Representative query and positive source snippets:

Query	Positive document snippet
游泳和悦悦	A compact video title for a Yueyue swimming clip from 2017.
甲状腺的检查	A science or medical video title about checking the thyroid with professional instruments.
BAMBINo	A mixed-script performer or title metadata record containing "bambino".
明天依然爱你泰国电视剧普通话版	A drama title record for episode 15 of the series.
秃鹰档案国语	A video title about the Mandarin version of a crime or action clip.

Dataset Information

Field	Value
Nano set	NanoCMTEB
Backing dataset	NanoCMTEB
Task / split	video
Hugging Face dataset	hakari-bench/NanoCMTEB
Language	zh
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	7.07
Document length avg chars	30.52

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.6897	0.8050	0.8950	top-500
Dense	`harrier_oss_v1_270m`	0.8629	0.9500	0.9850	top-500
Reranking hybrid	`reranking_hybrid`	0.8103	0.9200	0.9950	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: VideoRetrieval dev
Train/eval overlap audit: not_audited
Leakage note: exclude NanoCMTEB video queries, qrels, and video metadata passages
Multi-positive training: single_positive
Useful training data: video search query-title pairs, entertainment metadata retrieval pairs, multilingual title alias pairs, same-series and same-cast hard negatives