HAKARI-Bench

NanoCMTEB

Overview

NanoCMTEB is a compact Chinese retrieval group based on C-MTEB retrieval tasks. It covers medical consultation retrieval, COVID policy and news retrieval, general Chinese web passage retrieval, e-commerce product retrieval, translated MS MARCO-style passage ranking, T2Ranking, and entertainment-video retrieval. Most queries are short Chinese search intents, while documents range from product or video titles to long web and policy passages.

The group is useful because it tests Chinese retrieval across several practical domains rather than one homogeneous web-search setting. Short queries, mixed scripts, product codes, translated MS MARCO artifacts, medical wording, and multi-positive relevance sets all appear. BM25 measures the strength of Chinese term and phrase overlap, dense retrieval tests intent matching across terse queries and domain language, and reranking_hybrid shows where sparse and dense signals recover complementary candidates.

What This Group Measures

NanoCMTEB follows the C-MTEB view of Chinese embedding evaluation as a multi-domain benchmark. Its retrieval tasks draw from Chinese web search, translated MS MARCO, Multi-CPR-style domain retrieval, T2Ranking, and CMedQA-style medical consultation retrieval.

The shared measurement target is Chinese first-stage retrieval under varied query and document formats. A relevant document may be a doctor-style answer, a government COVID passage, a product title, a video title, a translated web answer passage, or a noisy search result. The group therefore rewards models that handle short Chinese intent, mixed scripts, domain terms, and multiple acceptable positives.

Task Families

Dataset Shape

NanoCMTEB contains 8 task pages, 1,600 queries, 80,000 split-local documents, and 3,208 positive qrel rows. Each task has 200 queries and 10,000 documents. The group averages about two positives per query, but du and t2 have much larger relevant sets than e-commerce, medical, and video title retrieval.

Queries are generally very short. ecom, video, du, mmarco, and t2 average around 7 to 11 characters, while cmedqa is longer because patient questions carry more context. Documents vary sharply: product and video records are short titles, medical answers are short passages, and T2 or COVID documents can be much longer and noisier.

Retrieval Behavior

BM25 Profile

BM25 is strong on tasks where Chinese keywords, named entities, policy terms, or search phrases appear directly in the relevant document. covid, du, t2, mmarco, and video all show substantial sparse signal in the current metadata. These tasks often preserve exact query terms, even when the document is noisy or long.

BM25 is weaker on cmedqa and medical, where patient wording and doctor or medical-answer wording can differ. It can also over-rank product or video items that share brand, series, performer, or category words but do not match the actual intent.

Dense Profile

Dense retrieval is the best nDCG@10 profile for most NanoCMTEB tasks. It helps bridge terse Chinese queries to answer passages and can connect lay health language to more clinical wording. It is especially strong for du, ecom, medical, mmarco, t2, and video.

The dense profile should still be read with exact-token caution. Product codes, movie names, person names, policy names, and transliterated or mixed-script tokens can be decisive. Dense improvements are most meaningful when they do not lose these anchors.

Reranking Hybrid Profile

reranking_hybrid is strongest when exact Chinese terms and dense intent matching recover different positives. It is the best profile for covid, where policy and news wording benefits from both term anchors and semantic context. For other tasks it often sits between BM25 and dense, making it useful as a candidate-pool view for reranking even when dense has the best top-rank score.

Because du and t2 have many positives per query, hybrid candidate coverage should be interpreted together with nDCG@10. Finding one positive is not enough; the model must rank several relevant passages well.

Task Summary

TaskRetrieval shapeLanguageQueriesDocsPositivesBM25 nDCG@10Dense nDCG@10Reranking hybrid nDCG@10Best profile
cmedqapatient question to doctor answerzh20010,0003240.16670.33220.2595Dense
covidCOVID query to policy or news passagezh20010,0002040.78880.75180.7834BM25
duChinese web query to passagezh20010,0008890.73370.92860.8224Dense
ecomshopping query to product titlemultilingual20010,0002000.59130.80520.7025Dense
medicalhealth query to medical answerzh20010,0002000.35820.56910.4699Dense
mmarcotranslated MS MARCO query to passagezh20010,0002120.67950.88590.7984Dense
t2Chinese search query to noisy web passagezh20010,0009790.79440.92450.8604Dense
videovideo search query to title or metadatazh20010,0002000.68970.86290.8103Dense

Interpretation Notes for Model Researchers

NanoCMTEB is useful for separating Chinese token and phrase matching from semantic intent matching. BM25-heavy tasks show that exact Chinese terms and named entities remain powerful. Dense-heavy tasks show where an embedding model captures user intent, paraphrase, or domain wording beyond exact overlap.

The group should also be read by document format. Product and video retrieval are title-heavy; medical and CMedQA are answer-heavy; T2 and COVID include longer passages; du and mmarco are closer to web QA retrieval. A model that does well on one format may not transfer to another.

Training and Leakage Notes

Useful training data includes Chinese web-search relevance judgments, multi-positive passage ranking, CMedQA-style consultation pairs, Chinese medical QA, COVID and policy retrieval data, product-search logs, video-search title matching, and translated MS MARCO pairs with overlap removed. For du and t2, training should preserve multi-positive labels.

Exclude NanoCMTEB evaluation queries, qrels, positives, product titles, video titles, consultation threads, and translated MS MARCO evaluation rows. Synthetic data should keep short Chinese query style and include hard negatives that share keywords but fail the true intent.

Source Reference Table

SourceYearTypeURL
C-Pack: Packaged Resources To Advance General Chinese Embedding2023paperhttps://arxiv.org/abs/2309.07597
C-MTEB benchmarkprojecthttps://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB
MTEB benchmarkprojecthttps://github.com/embeddings-benchmark/mteb

Metadata Summary

FieldValue
Task pages8
Queries1,600
Split-local documents80,000
Positive qrels3,208
Languagesmultilingual, zh
Categoriesnatural_language
Positives / query avg2.00

Task Metadata Summary

TaskBacking datasetLangCategoryQueriesDocsPositivesBM25 nDCG@10Dense nDCG@10Reranking hybrid nDCG@10Best profile
cmedqaNanoCMTEBzhnatural_language20010,0003240.16670.33220.2595Dense
covidNanoCMTEBzhnatural_language20010,0002040.78880.75180.7834BM25
duNanoCMTEBzhnatural_language20010,0008890.73370.92860.8224Dense
ecomNanoCMTEBmultilingualnatural_language20010,0002000.59130.80520.7025Dense
medicalNanoCMTEBzhnatural_language20010,0002000.35820.56910.4699Dense
mmarcoNanoCMTEBzhnatural_language20010,0002120.67950.88590.7984Dense
t2NanoCMTEBzhnatural_language20010,0009790.79440.92450.8604Dense
videoNanoCMTEBzhnatural_language20010,0002000.68970.86290.8103Dense