NanoCMTEB / t2

Overview

NanoCMTEB t2 is a Chinese web passage retrieval task based on the T2Ranking benchmark family. Queries are short Chinese search strings, and documents are noisy web passages that may include copied text, HTML fragments, article sections, product text, medical notes, or other web content. The task measures ranked retrieval quality when most queries have several relevant passages.

Details

What the Original Data Measures

T2Ranking is a large-scale Chinese passage ranking benchmark built from real search-engine queries, expert relevance labels, diversified candidates, and multiple relevance levels. C-MTEB includes T2Retrieval as part of its Chinese retrieval group.

The task is closer to web passage ranking than single-answer QA. A query may be short and ambiguous, and several passages can be relevant at different degrees. The retriever must rank passages that satisfy the search intent above noisy or merely term-overlapping candidates.

Observed Data Profile

The task contains 200 queries, 10,000 documents, and 979 relevance judgments. It is strongly multi-positive: queries have about 4.9 positives on average (979 / 200 = 4.895), with a minimum of 1, a median of 4, and a maximum of 23. In total, 175 queries (87.50% of the set) have more than one positive.
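The profile figures above can be recomputed from any qrels mapping. The following is a minimal sketch, assuming qrels are held as a dictionary from query id to a set of positive document ids; the function name `positives_profile` and the toy ids are hypothetical and are not taken from the dataset.

```python
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from a {query_id: set_of_doc_ids} mapping."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)  # queries with more than one positive
    return {
        "queries": len(counts),
        "judgments": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_share": multi / len(counts),
    }


# Toy qrels with hypothetical ids, not real NanoCMTEB rows.
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}, "q3": {"d4", "d5", "d6"}}
profile = positives_profile(toy_qrels)
```

Run on the real qrels, the same function should give 200 queries, 979 judgments, and a multi-positive share of 175 / 200.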

Queries average 10.74 Chinese characters. Documents average 913.50 characters, making them long for this NanoCMTEB group. Documents may contain boilerplate, copied page fragments, HTML artifacts, and mixed-quality web text.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.7944, hit@10 of 0.9600, and recall@100 of 0.8662 using the top-500 BM25 candidate subset. This is a strong lexical profile: short web queries often contain highly salient terms that appear in relevant passages.

BM25 is still weaker than dense retrieval because relevant passages may satisfy the intent without exact phrase overlap, while noisy web passages can repeat query terms without being useful. The high hit@10 hides ordering limitations among multiple positives.

Dense Evaluation Profile

The dense harrier-oss-270m run reaches nDCG@10 of 0.9245, hit@10 of 0.9800, and recall@100 of 0.9683. It is the strongest profile for top-of-ranking quality, improving over BM25 by 0.1301 in nDCG@10 and by 0.1021 in recall@100.

This shows that embedding similarity is highly effective for T2-style Chinese passage ranking. Dense retrieval can identify answer-bearing or intent-satisfying passages despite noisy formatting and query-passage wording differences.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate set reaches nDCG@10 of 0.8604, hit@10 of 0.9800, and recall@100 of 0.9806. Each query gets a top-100 candidate list with an optional safeguard row at rank 101. This task has 1 safeguard row, so candidate counts range from 100 to 101 with a mean of 100.01.

Hybrid retrieval has the best recall@100 (0.9806 versus 0.9683 for dense) and ties dense on hit@10, but dense remains clearly better on nDCG@10 (0.9245 versus 0.8604). The hybrid pool is useful for downstream reranking, while dense retrieval gives the best first-page ordering.
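This document does not describe how the reranking_hybrid pool is fused or what triggers the rank-101 safeguard. The sketch below is therefore only an illustration under two stated assumptions: reciprocal rank fusion stands in for the unknown fusion step, and the safeguard is assumed to append one known positive when none appears in the top-ranked pool. The function names `rrf_fuse` and `build_pool` are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of doc ids.

    Assumption: RRF is a stand-in; the source does not name the fusion method.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_pool(bm25_ranked, dense_ranked, positives, size=100):
    """Build a candidate pool of `size` rows plus an optional safeguard row."""
    pool = rrf_fuse([bm25_ranked, dense_ranked])[:size]
    if positives and not (positives & set(pool)):
        # Assumed safeguard: append one positive as row size + 1.
        pool.append(sorted(positives)[0])
    return pool


# Toy example with size=3 so the safeguard row lands at rank 4.
bm25 = ["a", "b", "c", "d"]
dense = ["b", "a", "e", "f"]
pool_with_safeguard = build_pool(bm25, dense, positives={"z"}, size=3)
pool_without = build_pool(bm25, dense, positives={"b"}, size=3)
```

Under this reading, 199 queries with 100 candidates and one query with 101 give a mean of (199 × 100 + 101) / 200 = 100.005 candidates.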

Metric Interpretation for Model Researchers

This task is dense-favorable with a strong sparse baseline. Because most queries have multiple positives, nDCG@10 is more informative than hit@10. A model should rank several highly relevant passages early, not just find one positive.
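The difference between hit@10 and nDCG@10 on multi-positive queries can be illustrated with a small sketch. It assumes binary relevance (every judged positive has gain 1), which may differ from the graded labels of the original T2Ranking data; the function names and toy document ids are hypothetical.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2(rank + 1) discount, rank from 1."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k (assumption: all positives have gain 1)."""
    gains = [1.0 if d in positives else 0.0 for d in ranked[:k]]
    ideal = [1.0] * min(len(positives), k)
    return dcg(gains) / dcg(ideal) if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return float(any(d in positives for d in ranked[:k]))


def recall_at_k(ranked, positives, k=100):
    """Share of positives found in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives) if positives else 0.0


positives = {"p1", "p2", "p3", "p4"}
early = ["p1", "p2", "p3", "p4"] + [f"n{i}" for i in range(6)]  # all positives first
late = [f"n{i}" for i in range(9)] + ["p1"]  # one positive at rank 10

hit_early, hit_late = hit_at_k(early, positives), hit_at_k(late, positives)
ndcg_early, ndcg_late = ndcg_at_k(early, positives), ndcg_at_k(late, positives)
```

Both rankings score the same on hit@10, while nDCG@10 separates the ranking that places all four positives first from the one that finds a single positive at rank 10.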

BM25 demonstrates that exact Chinese query terms are valuable, but dense retrieval better separates useful passages from noisy term-overlap passages. The reranking_hybrid pool provides the broadest coverage at 100 candidates and is a good candidate pool for rerankers.

Query and Relevance Type Tendencies

Queries include entertainment knowledge, geography, handwriting, traditional medicine, legal or consumer questions, and practical web-search topics. Positive documents are web passages with answer-bearing content, sometimes mixed with formatting artifacts or copied snippets.

The relevance relation is intent satisfaction. A positive passage should answer or strongly satisfy the query, and multiple passages can be valid for the same query.

Representative Failure Modes

Likely failures include retrieving pages that repeat entity names without answering, under-ranking cleaner answer passages behind noisy copied text, missing paraphrased answers, and failing to order multiple positives by usefulness.

BM25 is vulnerable to noisy exact-match passages. Dense retrieval can still over-match on broad topical similarity. Hybrid retrieval improves coverage but may need reranking to recover the top-order quality of dense retrieval.

Training Data That May Help

Useful training data includes non-overlapping T2Ranking relevance data, Chinese web search passage-ranking annotations, multi-positive retrieval training data, and relevance-graded hard negatives.

Synthetic data should start from noisy Chinese web passages and create short search queries with multiple relevant passages. Hard negatives should preserve query terms while differing in answer specificity, relevance grade, or intent.
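One way to find negatives that preserve query terms is to rank non-positive passages by lexical overlap with the query. The sketch below is an illustration only: it assumes character bigrams as a rough proxy for Chinese query terms, the function names are hypothetical, and the toy passages are invented for the example rather than taken from the dataset. It does not check relevance grade or intent, which would need labels or a stronger model.

```python
def char_bigrams(text):
    """Set of overlapping character bigrams (a rough proxy for Chinese terms)."""
    return {text[i:i + 2] for i in range(len(text) - 1)}


def overlap(query, passage):
    """Share of query bigrams that also occur in the passage."""
    q = char_bigrams(query)
    return len(q & char_bigrams(passage)) / len(q) if q else 0.0


def mine_hard_negatives(query, passages, positive_ids, top_n=2):
    """Return up to top_n non-positive passage ids with the highest overlap."""
    scored = [
        (overlap(query, text), pid)
        for pid, text in passages.items()
        if pid not in positive_ids
    ]
    scored.sort(key=lambda s: (-s[0], s[1]))
    # Passages with zero overlap are easy negatives and are skipped.
    return [pid for score, pid in scored[:top_n] if score > 0]


# Hypothetical toy passages, not real NanoCMTEB rows.
passages = {
    "p1": "都江堰把岷江分成内外二江",  # answers the query (positive)
    "n1": "都江堰景区门票价格",  # shares query terms, different intent
    "n2": "今天天气很好",  # unrelated, no overlap
}
hard = mine_hard_negatives("都江堰是哪几条江汇合", passages, positive_ids={"p1"})
```

Here the ticket-price passage is kept as a hard negative because it repeats the entity name without answering the question, while the unrelated passage is dropped.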

Model Improvement Notes

Strong systems should handle short Chinese queries, noisy web passages, and multi-positive ranking. Dense retrieval is the strongest observed first-stage method, while hybrid retrieval can provide higher recall for reranking.

The task is useful for evaluating Chinese web ranking models where top-10 ordering among several relevant passages matters.

Example Data

Query	Positive document
卢升象雪中悍刀行是谁 [10 chars]	<br><img><br><img><br>人猫-韩貂寺<br>大内巨宦。本名韩生宣,统领十万宦官二十馀年,人称人猫,王朝内口碑比起徐骁只差不好。左手缠绕三千红丝。皇帝私生子赵楷之师。与徐骁,黄龙士并称为当世三大魔头。因为喜欢虐杀初入一品的高手而被称为魔头。擅以指玄杀天象,陆地神仙境下几乎无敌,指玄境第二,仅次于桃花剑神邓太阿。列入北莽高手后新武评天下第十。<br><img><br>木剑游侠儿—温华<br>落魄剑士。提一杆木剑,走一场江湖,与徐凤年三千里游历路上邂逅,跟老黄和李姑娘也交情不浅。后于襄樊城外与徐凤年再遇。拒绝徐凤年帮其介绍李淳罡只为一心练就自己的剑道。被黄三甲看重,传授怪人隋斜谷的两剑。入京城比剑,一连三败,得温不胜之名。战平棠溪剑仙卢白颉,一举成名。黄三甲以他钟爱的女子声色双甲李白狮和教剑恩情相胁让他杀徐凤年。温华知晓徐凤年的真实身份后,自断一臂一腿,舍弃心爱女子和有望成就陆地神仙的剑,折剑出江湖。温华是本书中江湖气最重的人物,虽惫懒无赖,然有志向,亦重情义,是真正的江湖儿女。后在家乡与一温婉女子疑似成家。<br><img><br>白衣—洛阳<br>白衣女子,天下第一魔头。曾化名黄宝妆,在北莽与徐凤年结识。敦煌城一战,败于邓太阿,胸前骊珠被击碎。在大秦皇陵外为徐凤年和红袍阴物所伤,在北海毁去拓跋菩萨势在必得的神兵,自北莽一路南下,万骑难逆其锋,助徐凤年击退柳篙师后修为降了两成,与徐凤年约好去洛阳。大秦皇后,八百年前以毒酒鸩杀秦皇心爱女子(姜泥前世),服食长生不老药活到如今。新武评第五<br><img><br>骑牛—洪洗象<br>武当山掌教,王重楼的小师弟,武当道教千年历史上最年轻的祖师爷,五岁被上一代武当掌教带上山,收为闭关弟子,年幼便与这一代掌教王重楼变成了师兄弟。喜欢倒骑青牛,不成为天下第一,就不能下山,修的是无上天道。传闻是真武大帝转世,被寄予厚望,有望武道天道一肩挑。骑牛读书二十载,一朝明悟,扶摇踏鹤,直入天象。曾咫尺一步,直接夺去了道门剑魁齐仙侠的手中拂尘。爱慕徐脂虎,却不敢下山找她。一步入天象,只为骑鹤下江南。武评副评西观音东剑冠南吕祖北真武之北真武。吕祖转世,用兵解转世换得徐脂虎飞升,愿在为人间证道三百年。现转世为男童馀福,被李玉斧带回武当山。<br><img><br>舒羞<br>徐凤年身边扈从,出身南疆,会很多歪门邪道,内力不俗。掌力... [1,000 / 1,272 chars]
都江堰是哪几条江汇合 [10 chars]	都江堰渠首枢纽主要由鱼嘴、飞沙堰、宝瓶口三大主体工程构成。三者有机配合,相互制约,协调运行,引水灌田,分洪减灾,具有“分四六,平潦旱”的功效。岷江鱼嘴分水工程鱼嘴分水堤又称“鱼嘴”,是都江堰的分水工程,因其形如鱼嘴而得名,它昂头于岷江江心,包括百丈堤、杩槎、金刚堤等一整套相互配合的设施。其主要作用是把汹涌的岷江分成内外二江,西边叫外江,俗称“金马河”,是岷江正流,主要用于排洪;东边沿山脚的叫内江,是人工引水渠道,主要用于灌溉。在古代,鱼嘴是以竹笼装卵石垒砌。由于它建筑在岷江冲出山口呈弯道环流的江心,冬春季江水较枯,水流经鱼嘴上面的弯道绕行,主流直冲内江,内江进水量约6成,外江进水量约4成;夏秋季水位升高,水势不再受弯道制约,主流直冲外江,内、外江江水的比例自动颠倒:内江进水量约4成,外江进水量约6成。这就利用地形,完美地解决了内江灌区冬春季枯水期农田用水以及人民生活用水的需要和夏秋季洪水期的防涝问题。 [406 chars]
李字行书怎么写好看 [9 chars]	<br>李字常用签名一笔写法。<br>如图: 一笔草属狂草,是草书最放纵的一种,笔势相连而圆转,字形狂放多变,在今草的基础上将点画连绵书写,形成“一笔书”,在章法上与今草一脉相承。在中国古代书论中,不论是对篆、隶、行、楷,还是对草书的论述,大多是以自然景观或某些现象作比,加以形容和描述,读者要靠一种生活感受、生活经验去领悟,才能欣赏和理解。<br>扩展资料: 狂草的特征: 1、草书中最放纵的一种。笔势连绵回绕,字形变化繁多。<br>相传创自汉张芝 ,至唐张旭、怀素始有流传。清冯班《钝吟书要》:“虽狂如旭素,咸臻神妙。<br>古人醉时作狂草,细看无一失笔,平日工夫细也。” 清高士奇《跋》:“ 唐怀素书,奇纵变化,超迈前古。<br>其自叙一卷,尤为生平狂草。” 2、随意潦草。<br>巴金《家》二五:“从倩如的狂草的字迹看来,可以知道她是多么愤慨。” 3、狂草是在今草的基础上将点画连绵书写,形成“一笔书”,在章法上与今草一脉相承。<br>狂草的成就,是唐代书法高峰的另一方面的表现。代表人物是张旭和怀素.张旭史称“草圣”。 [475 chars]

Source Reference Table

Item	Reference
Task paper	T2Ranking: A large-scale Chinese Benchmark for Passage Ranking
Benchmark paper	C-Pack: Packed Resources For General Chinese Embeddings
Source dataset	mteb/T2Retrieval
NanoCMTEB dataset	hakari-bench/NanoCMTEB

Representative query and positive source snippets:

Query	Positive document snippet
卢升象雪中悍刀行是谁	A web passage about characters in the novel mentions related named figures and roles.
都江堰是哪几条江汇合	A passage explains the Dujiangyan water project and the inner and outer river division.
李字行书怎么写好看	A passage discusses one-stroke or cursive writing styles for the character.
桂苓片的功效与作用	A passage explains a traditional medicine formula and its uses.
在洗浴中心摔倒怎么起诉	A legal-advice passage discusses consumer venue responsibility and evidence.

Dataset Information

Field	Value
Nano set	NanoCMTEB
Backing dataset	NanoCMTEB
Task / split	t2
Hugging Face dataset	hakari-bench/NanoCMTEB
Language	zh
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	979
Positives / query avg	4.89
Positives / query min	1
Positives / query median	4.00
Positives / query max	23
Multi-positive queries	175 (87.50%)
Query length avg chars	10.74
Document length avg chars	913.50

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.7944	0.9600	0.8662	top-500
Dense	`harrier_oss_v1_270m`	0.9245	0.9800	0.9683	top-500
Reranking hybrid	`reranking_hybrid`	0.8604	0.9800	0.9806	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: T2Retrieval dev
Train/eval overlap audit: not_audited
Leakage note: exclude NanoCMTEB t2 queries, qrels, and web passages
Multi-positive training: preserve_multiple_relevant_passages_per_query
Useful training data: non-overlapping T2Ranking relevance data, Chinese web search passage-ranking annotations, multi-positive retrieval training data, relevance-graded hard negatives
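
The leakage note can be applied as a filter over candidate training rows. The following is a minimal sketch, assuming training rows are dictionaries with `query` and `passage` keys and that exact matching after whitespace removal is an acceptable first pass; the row layout and the function names are hypothetical. Because the train/eval overlap audit is marked not_audited, near-duplicate passages would need a stronger check than this.

```python
def normalize(text):
    """Remove all whitespace so trivially reformatted copies still match."""
    return "".join(text.split())


def drop_leaked(train_rows, eval_queries, eval_passages):
    """Keep only rows whose query and passage are absent from the eval set."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_passages}
    return [
        row
        for row in train_rows
        if normalize(row["query"]) not in blocked_queries
        and normalize(row["passage"]) not in blocked_passages
    ]


# Hypothetical rows; the two eval queries are the ones shown in this card.
eval_queries = ["都江堰是哪几条江汇合", "桂苓片的功效与作用"]
eval_passages = ["leaked  passage"]
train_rows = [
    {"query": "都江堰是哪几条江汇合", "passage": "x"},  # eval query
    {"query": "桂苓片 的功效与作用", "passage": "y"},  # eval query with a space
    {"query": "other", "passage": "leaked passage"},  # eval passage
    {"query": "safe", "passage": "clean"},
]
kept = drop_leaked(train_rows, eval_queries, eval_passages)
```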