NanoCMTEB
Overview
NanoCMTEB is a compact Chinese retrieval group based on C-MTEB retrieval tasks. It covers medical consultation retrieval, COVID policy and news retrieval, general Chinese web passage retrieval, e-commerce product retrieval, translated MS MARCO-style passage ranking, T2Ranking, and entertainment-video retrieval. Most queries are short Chinese search intents, while documents range from product or video titles to long web and policy passages.
The group is useful because it tests Chinese retrieval across several practical domains rather than one homogeneous web-search setting. Short queries, mixed scripts, product codes, translated MS MARCO artifacts, medical wording, and multi-positive relevance sets all appear. BM25 measures the strength of Chinese term and phrase overlap, dense retrieval tests intent matching across terse queries and domain language, and reranking_hybrid shows where sparse and dense signals recover complementary candidates.
What This Group Measures
NanoCMTEB follows the C-MTEB view of Chinese embedding evaluation as a multi-domain benchmark. Its retrieval tasks draw from Chinese web search, translated MS MARCO, Multi-CPR-style domain retrieval, T2Ranking, and CMedQA-style medical consultation retrieval.
The shared measurement target is Chinese first-stage retrieval under varied query and document formats. A relevant document may be a doctor-style answer, a government COVID passage, a product title, a video title, a translated web answer passage, or a noisy search result. The group therefore rewards models that handle short Chinese intent, mixed scripts, domain terms, and multiple acceptable positives.
Task Families
- Medical retrieval:
cmedqaandmedicalconnect patient or consumer health questions to medical-domain answers. - COVID and policy retrieval:
covidretrieves pandemic-related news, notices, and policy passages. - Chinese web retrieval:
du,mmarco, andt2cover DuReader-style, translated MS MARCO-style, and T2Ranking-style passage ranking. - E-commerce retrieval:
ecommatches short shopping queries to compact product-title-like documents. - Video retrieval:
videomatches short entertainment-video queries to title or metadata-like records.
Dataset Shape
NanoCMTEB contains 8 task pages, 1,600 queries, 80,000 split-local documents, and 3,208 positive qrel rows. Each task has 200 queries and 10,000 documents. The group averages about two positives per query, but du and t2 have much larger relevant sets than e-commerce, medical, and video title retrieval.
Queries are generally very short. ecom, video, du, mmarco, and t2 average around 7 to 11 characters, while cmedqa is longer because patient questions carry more context. Documents vary sharply: product and video records are short titles, medical answers are short passages, and T2 or COVID documents can be much longer and noisier.
Retrieval Behavior
BM25 Profile
BM25 is strong on tasks where Chinese keywords, named entities, policy terms, or search phrases appear directly in the relevant document. covid, du, t2, mmarco, and video all show substantial sparse signal in the current metadata. These tasks often preserve exact query terms, even when the document is noisy or long.
BM25 is weaker on cmedqa and medical, where patient wording and doctor or medical-answer wording can differ. It can also over-rank product or video items that share brand, series, performer, or category words but do not match the actual intent.
Dense Profile
Dense retrieval is the best nDCG@10 profile for most NanoCMTEB tasks. It helps bridge terse Chinese queries to answer passages and can connect lay health language to more clinical wording. It is especially strong for du, ecom, medical, mmarco, t2, and video.
The dense profile should still be read with exact-token caution. Product codes, movie names, person names, policy names, and transliterated or mixed-script tokens can be decisive. Dense improvements are most meaningful when they do not lose these anchors.
Reranking Hybrid Profile
reranking_hybrid is strongest when exact Chinese terms and dense intent matching recover different positives. It is the best profile for covid, where policy and news wording benefits from both term anchors and semantic context. For other tasks it often sits between BM25 and dense, making it useful as a candidate-pool view for reranking even when dense has the best top-rank score.
Because du and t2 have many positives per query, hybrid candidate coverage should be interpreted together with nDCG@10. Finding one positive is not enough; the model must rank several relevant passages well.
Task Summary
| Task | Retrieval shape | Language | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| cmedqa | patient question to doctor answer | zh | 200 | 10,000 | 324 | 0.1667 | 0.3322 | 0.2595 | Dense |
| covid | COVID query to policy or news passage | zh | 200 | 10,000 | 204 | 0.7888 | 0.7518 | 0.7834 | BM25 |
| du | Chinese web query to passage | zh | 200 | 10,000 | 889 | 0.7337 | 0.9286 | 0.8224 | Dense |
| ecom | shopping query to product title | multilingual | 200 | 10,000 | 200 | 0.5913 | 0.8052 | 0.7025 | Dense |
| medical | health query to medical answer | zh | 200 | 10,000 | 200 | 0.3582 | 0.5691 | 0.4699 | Dense |
| mmarco | translated MS MARCO query to passage | zh | 200 | 10,000 | 212 | 0.6795 | 0.8859 | 0.7984 | Dense |
| t2 | Chinese search query to noisy web passage | zh | 200 | 10,000 | 979 | 0.7944 | 0.9245 | 0.8604 | Dense |
| video | video search query to title or metadata | zh | 200 | 10,000 | 200 | 0.6897 | 0.8629 | 0.8103 | Dense |
Interpretation Notes for Model Researchers
NanoCMTEB is useful for separating Chinese token and phrase matching from semantic intent matching. BM25-heavy tasks show that exact Chinese terms and named entities remain powerful. Dense-heavy tasks show where an embedding model captures user intent, paraphrase, or domain wording beyond exact overlap.
The group should also be read by document format. Product and video retrieval are title-heavy; medical and CMedQA are answer-heavy; T2 and COVID include longer passages; du and mmarco are closer to web QA retrieval. A model that does well on one format may not transfer to another.
Training and Leakage Notes
Useful training data includes Chinese web-search relevance judgments, multi-positive passage ranking, CMedQA-style consultation pairs, Chinese medical QA, COVID and policy retrieval data, product-search logs, video-search title matching, and translated MS MARCO pairs with overlap removed. For du and t2, training should preserve multi-positive labels.
Exclude NanoCMTEB evaluation queries, qrels, positives, product titles, video titles, consultation threads, and translated MS MARCO evaluation rows. Synthetic data should keep short Chinese query style and include hard negatives that share keywords but fail the true intent.
Source Reference Table
| Source | Year | Type | URL |
| C-Pack: Packaged Resources To Advance General Chinese Embedding | 2023 | paper | https://arxiv.org/abs/2309.07597 |
| C-MTEB benchmark | project | https://github.com/FlagOpen/FlagEmbedding/tree/master/C_MTEB | |
| MTEB benchmark | project | https://github.com/embeddings-benchmark/mteb |
Metadata Summary
| Field | Value |
| Task pages | 8 |
| Queries | 1,600 |
| Split-local documents | 80,000 |
| Positive qrels | 3,208 |
| Languages | multilingual, zh |
| Categories | natural_language |
| Positives / query avg | 2.00 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| cmedqa | NanoCMTEB | zh | natural_language | 200 | 10,000 | 324 | 0.1667 | 0.3322 | 0.2595 | Dense |
| covid | NanoCMTEB | zh | natural_language | 200 | 10,000 | 204 | 0.7888 | 0.7518 | 0.7834 | BM25 |
| du | NanoCMTEB | zh | natural_language | 200 | 10,000 | 889 | 0.7337 | 0.9286 | 0.8224 | Dense |
| ecom | NanoCMTEB | multilingual | natural_language | 200 | 10,000 | 200 | 0.5913 | 0.8052 | 0.7025 | Dense |
| medical | NanoCMTEB | zh | natural_language | 200 | 10,000 | 200 | 0.3582 | 0.5691 | 0.4699 | Dense |
| mmarco | NanoCMTEB | zh | natural_language | 200 | 10,000 | 212 | 0.6795 | 0.8859 | 0.7984 | Dense |
| t2 | NanoCMTEB | zh | natural_language | 200 | 10,000 | 979 | 0.7944 | 0.9245 | 0.8604 | Dense |
| video | NanoCMTEB | zh | natural_language | 200 | 10,000 | 200 | 0.6897 | 0.8629 | 0.8103 | Dense |