NanoMTEB-Misc / 2022_zh

Overview

2022_zh is the Chinese news retrieval split of TREC 2022 NeuCLIR. Queries are Chinese topic statements, and documents are Chinese news articles from a NeuCLIR hard-negative retrieval pool. The Nano split contains 47 queries, 10,000 documents, and 1,643 positive qrels. It is strongly multi-positive: queries have 34.96 positives on average, the median is 18, and 97.87% of queries have more than one positive. Queries average 24.00 characters, while documents average 1,107.60 characters. The task evaluates Chinese topical news search over broad event and issue clusters, not single-answer passage retrieval.
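
The multi-positive figures above follow directly from the qrels. The sketch below shows one way to derive them, assuming the qrels are available as a mapping from query ID to a set of positive document IDs; the `positive_stats` helper and the toy IDs are hypothetical and are not part of the dataset. The last two lines only re-derive the reported averages from the counts given in this card.

```python
from statistics import mean, median

def positive_stats(qrels):
    """Summarize positives per query for a mapping {query_id: set(doc_ids)}."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positives": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "multi_positive_share": multi / len(counts),
    }

# Toy qrels with hypothetical IDs, not the real split.
toy_qrels = {"q1": {"d1", "d2", "d3"}, "q2": {"d4"}, "q3": {"d5", "d6"}}
stats = positive_stats(toy_qrels)

# Cross-check of the figures reported in this card.
reported_avg = round(1643 / 47, 2)         # positives per query
reported_multi = round(100 * 46 / 47, 2)   # percent of multi-positive queries
```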

Details

What the Original Data Measures

The Overview of the TREC 2022 NeuCLIR Track paper describes NeuCLIR as a neural cross-language retrieval benchmark over Chinese, Persian, and Russian news drawn from Common Crawl. The original task used English topics to retrieve target-language articles, with translated monolingual topics also available for reference. This Nano split uses Chinese queries and Chinese documents, preserving the TREC-style information needs and pooled relevance judgments.

The retrieval target is any judged relevant article for the topic. A query can have many relevant articles, often covering different facets of an event, public-health topic, technology issue, or political development.

Observed Data Profile

The split has 47 Chinese queries, 10,000 documents, and 1,643 positive judgments. Documents are Chinese news articles with headlines and body text. Queries are short compared with Persian and Russian NeuCLIR topics, which makes query-document matching more dependent on semantic topic alignment and named entity handling.

Examples include Iran's internet shutdown during protests, domestic COVID vaccine production in Iran, AI in agriculture, discrimination against people with AIDS in China, and myopia among Chinese students. These are topical search requests rather than factoid questions.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.2931, hit@10 of 0.7872, and recall@100 of 0.2958. It can often find at least one relevant article when a query contains distinctive entities or terms, but it covers a small share of the positive set by rank 100. Short Chinese queries and broad news topics limit the effectiveness of exact term matching.

The low recall@100 is the key BM25 weakness. With about 35 positives per query on average, the top 100 sparse candidates recover on average under 30% of the relevant articles, because many of them use different wording, locations, or event framing.
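
The wording gap can be illustrated with a small Okapi BM25 sketch. It assumes character-bigram tokenization, which is a common choice for unsegmented Chinese text; the tokenizer, parameters, and example sentences are assumptions for illustration, and the card does not state how the `bm25` profile was actually configured.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Overlapping character bigrams, ignoring whitespace (an assumed tokenizer)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [char_bigrams(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each bigram.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(char_bigrams(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Made-up sentences, not taken from the corpus.
docs = [
    "伊朗当局在抗议期间切断互联网",   # shares wording with the query
    "德黑兰民众示威后网络服务中断",   # same event, different wording
    "花莲推广无人机智慧农业",         # unrelated topic
]
scores = bm25_scores("伊朗互联网被切断", docs)
```

In this toy case the second sentence describes the same event but shares no bigram with the query, so it scores zero, exactly like the unrelated third sentence.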

Dense Evaluation Profile

Dense retrieval is the strongest profile, with nDCG@10 of 0.5101, hit@10 of 0.9149, and recall@100 of 0.6245. The dense model captures Chinese news topic semantics much better than BM25, especially when relevant articles describe the same issue without repeating the exact topic wording.

For researchers, this split is a strong diagnostic of Chinese event-level semantic retrieval. A model must map compact Chinese topic statements to long news articles and recover multiple related articles in a broad cluster.

Reranking Hybrid Evaluation Profile

The reranking_hybrid profile reaches nDCG@10 of 0.4072, hit@10 of 0.8936, and recall@100 of 0.4918. It improves over BM25 but remains below dense retrieval in both early ranking and coverage. Candidate lists contain 100 to 101 entries, with two safeguard-positive rows.

This pattern indicates that lexical anchors help, but the dominant signal is semantic topic matching. Hybrid search is more robust than BM25 alone, yet the dense candidate set is the better starting point for this Chinese NeuCLIR split.
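
The card does not say how the reranking_hybrid candidates are built. As a point of reference only, reciprocal rank fusion (RRF) is one widely used way to combine a lexical and a dense ranking; the sketch below assumes that technique, and the function name, the constant `k=60`, and the document IDs are illustrative rather than a description of the actual pipeline.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked doc-id lists; each list adds 1 / (k + rank) per document."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_n]

# Hypothetical rankings for a single query.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that appear in both lists (`d1`, `d3`) rise above documents found by only one retriever, which is why a fused list tends to be more robust than BM25 alone even when it does not match the dense list.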

Metric Interpretation for Model Researchers

2022_zh is dense-favorable. BM25 retrieves some obvious matches, dense retrieval provides the best ranking and coverage, and reranking_hybrid lands between them. Because nearly all queries have multiple positives (46 of 47, with a median of 18), hit@10 is less informative than nDCG@10 and recall@100: finding one article is useful, but the benchmark also rewards broader coverage of relevant news articles.

Top-10 quality reflects whether the model ranks strongly relevant articles early, while recall@100 reflects whether it can expose enough of the topic cluster for reranking or downstream analysis.
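
A minimal sketch of the three metrics for a single query follows. It assumes binary relevance (a document is either in the positive set or not), which matches the "positive qrels" described here but may differ from graded scoring in the original track; the function names are illustrative.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any of the top-k documents is a positive, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top-k."""
    if not positives:
        return 0.0
    return sum(1 for d in ranked[:k] if d in positives) / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, d in enumerate(ranked[:k]) if d in positives)
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical query with 20 positives; only one is retrieved, at rank 1.
positives = {f"p{i}" for i in range(20)}
ranked = ["p0"] + [f"n{i}" for i in range(9)]

hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

Here hit@10 is already saturated at 1.0, while recall@100 is 0.05 and nDCG@10 is about 0.22, which is why the latter two separate systems better on a multi-positive split.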

Query and Relevance Type Tendencies

Queries are Chinese topical information needs about public events, technology, health, agriculture, politics, or social issues. Positive documents are Chinese news articles relevant to the topic. Relevance is topically broad but based on explicit judgments: an article must satisfy the information need, not simply mention the same entity.

The task rewards retrieval over news clusters, including multiple articles that cover the same issue from different angles.

Representative Failure Modes

BM25 can miss relevant articles that use different wording or focus on an adjacent aspect of the topic. Dense retrieval can retrieve semantically related but unjudged or off-angle articles. Hybrid retrieval can improve lexical precision while still missing much of the positive set.

Short queries can under-specify the desired angle, and long news articles can contain many terms that distract sparse retrieval.

Training Data That May Help

Useful training data includes Chinese news retrieval pairs, Chinese TREC-style topic retrieval, multilingual CLIR data, and hard negatives from same-topic news clusters. Training should exclude NeuCLIR evaluation topics, qrels, and article pools that overlap with this Nano split.

Synthetic data should generate Chinese search topics from related news clusters and assign several relevant articles per topic. Hard negatives should share entities or event categories but differ in relation, impact, or location.
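
One way to pick such hard negatives is to rank non-positive articles by how many entities they share with the topic. The sketch below assumes each article already has an extracted entity set; the `select_hard_negatives` helper, the entity sets, and the document IDs are hypothetical.

```python
def select_hard_negatives(topic_entities, positives, doc_entities, max_negatives=5):
    """Pick non-positive docs that share the most entities with the topic."""
    candidates = []
    for doc_id, entities in doc_entities.items():
        if doc_id in positives:
            continue  # never use a judged positive as a negative
        overlap = len(topic_entities & entities)
        if overlap > 0:
            candidates.append((overlap, doc_id))
    # Highest overlap first, then document ID for a stable order.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in candidates[:max_negatives]]

# Hypothetical entity annotations for four articles.
doc_entities = {
    "d1": {"伊朗", "互联网", "抗议"},   # positive for the topic
    "d2": {"伊朗", "疫苗"},             # same country, different issue
    "d3": {"互联网", "伊朗", "制裁"},   # same entities, different relation
    "d4": {"花莲", "无人机"},           # unrelated, too easy to be useful
}
negatives = select_hard_negatives({"伊朗", "互联网"}, {"d1"}, doc_entities)
```

Entity overlap alone cannot tell a true negative from an unlabeled relevant article, so candidates selected this way would still need a relevance check before training.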

Model Improvement Notes

Models should handle Chinese segmentation, named entities, and event semantics. Dense encoders should be trained for multi-positive topical coverage. Rerankers should distinguish articles that truly answer the information need from articles that merely share high-level news vocabulary.
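
Training for multi-positive coverage is often expressed as a contrastive loss that averages over all positives of a query instead of a single one. The sketch below shows one such formulation (a softmax negative log-likelihood averaged over positives); the function name, temperature, and scores are assumptions, and other multi-positive variants exist.

```python
import math

def multi_positive_infonce(scores, positive_idx, temperature=0.05):
    """Average the softmax negative log-likelihood over all positives of one query."""
    logits = [s / temperature for s in scores]
    max_logit = max(logits)
    # Log-sum-exp with max subtraction for numerical stability.
    log_z = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return -sum(logits[i] - log_z for i in positive_idx) / len(positive_idx)

# Hypothetical query-document similarity scores for four candidates.
scores = [0.9, 0.8, 0.2, 0.1]
loss_good = multi_positive_infonce(scores, [0, 1])  # positives score highest
loss_bad = multi_positive_infonce(scores, [2, 3])   # positives score lowest
```

The loss is lower when every positive scores above the negatives, so the encoder is pushed to surface the whole relevant cluster rather than one best match.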

Example Data

Query	Positive document
我要查找 2019 年 11 月反政府抗议期间伊朗互联网被切断的原因和影响。 [38 chars]	伊朗当局的断网行动可能会成为常态一名伊朗青年显示，她的手机无法连上网络，2019年11月23日。在伊朗，一些城市的网络继续被当局切断。这是伊朗当局面对示威呼吁所做出的回应。这不是第一次伊朗当局切断网络了。去年11月时，德黑兰当局曾经对互联网进行非常极端的控制，瘫痪了伊朗的网络。广告继续浏览后续根据萨瓦大学（University of Savoie）IT教授和网络战略研究员萨拉曼提安（Kave Salamatian）的说法，伊朗当局对通信的这种控制注定会变成常态。他说， “（监控）系统早就检测就绪了，前一次的断网就不是小范围的断网，这是好几年持续不断尝试的结果。” “目前，伊朗当局可以做到想什么时候断网就什么时候断网，监控的目标可以做到非常精确，对准哪个城市哪个APP等等，他们已经达到了目的，能对传播的东西进行精确的监控。” “当然，呼吁示威的方法不只是网络，传统的方法还是存在的，要知道，伊朗革命发生在1978年，当时是没有网络的，传统的方法还是有的，只是，用这些机制，信息传播需要更长的时间，也更艰难。” [471 chars]
伊朗在伊朗生产的本土新冠疫苗CovIran Barekat Covid。 [36 chars]	首批中国新冠疫苗运抵德黑兰伊朗国家通讯社（IRNA）：早些时候，伊朗已开始接种名为“卫星V”的俄罗斯新冠疫苗。在进口疫苗的同时，伊朗也在努力生产本国疫苗，“Covo-iran Barekat”疫苗临床试验的第一阶段已经完成。此外，伊朗“Razi COV-Pars”疫苗的人体试验阶段也已于28日开始。最近伊朗卫生部副部长阿里雷扎·雷西（Alireza Raeisi）说，第一批中国国家药监局已批准国药集团（Sinopharm）将抵达伊朗首都德黑兰国际机场。他说，在不远的将来伊朗还会进口如“新冠肺炎疫苗实施计划”（COVAX）、印度疫苗等国家的新型冠状病毒疫苗。雷西表示，这两天“新冠肺炎疫苗实施计划”（COVAX）也会通知我们要把4百万剂新型冠状病毒疫苗发给伊朗的时间。伊朗卫生部长纳马基最近表示，伊朗将在不久的将来成为世界上最大的新型冠状病毒疫苗生产国之一。纳马基保证伊朗将很快成为世界上最大的疫苗生产国之一，并将把疫苗出口到世界各地，将在9月之前为伊朗所有的弱势群体接种疫苗。截至北京时间2月28日16时23分，伊朗新冠确诊病例累计达1623159例，死亡病例增至59980例。 [510 chars]
我在寻找有关人工智能如何在农业中应用的文章。 [22 chars]	花蓮農業跨入AI智慧新紀元！無人機農業專班開啟農業新商機 ▲花蓮智慧農業專班於新光兆豐休閒農場舉辦飛行演示。（圖／花蓮縣政府提供，下同）記者王兆麟／花蓮報導花蓮縣農業邁入智慧AI新紀元！花蓮縣政府與台灣無人機應用發展協會攜手合作，首次在花蓮開設「無人機與AI的應用花蓮智慧農業專班」，26日於新光兆豐休閒農場舉辦飛行演示記者會。期許透過科技的力量吸引青年農民返鄉加入創新農業發展，翻轉傳統農業型態，進一步提升花蓮農業產值。無人機因成本低、效率高已廣泛應用於農藥噴灑、電力巡檢、森林巡檢、國土監察、石油管線巡檢、環境治理、橋梁公路巡檢等眾多行業。花蓮縣為培育專業無人機人才，不僅享有學費補助，邀請專業培訓機構及知名講師授課，介紹智慧農業應用於農業相關資訊技術、AI無人機用於土壤與田地的分析、種植、噴灑農藥、施肥、作物監控、灌溉與健康評估及AI無人機在智慧農業、精緻農業的國際發展趨勢等專業課程。台灣整體社會結構老化，農業缺工問題嚴重。花蓮縣開設AI無人機智慧農業專班，提升農場工作效率，減輕農民勞動負擔，降低勞力運用。透過精準定位協助農民進行農藥噴灑，可大幅降低時間及成本，過去傳統人力噴藥一分田需要一小時，利用無人機只要三分鐘即可完成，每天作業面積可達3到5甲。農業處長羅文龍表示，台灣正面臨農村老化、缺乏青農返鄉等困境，無人機AI智慧的運用，正好解決了許多人力資源匱乏的問題。政策的推動上，也逐漸將無人機使用在稽查水土保持或人員不易到達的地域，期待能降低成本、精進管理，創造正面雙贏的局面，讓花蓮的無人機AI智慧走在前端，引領全台。 ▲參與培訓的青農種子學員，希望利用科技的力量翻轉花蓮智慧農業開拓農業新商機。我國民航法無人機專章自109年3月31日施行，從今年四月起，從事無人機農噴，除了必須擁有農藥代噴人員證照外，操作者須考取無人機專業高級操作證。相關考試須具備無人機專業操作技能，學員經過專業培訓及練習能大幅提升上榜率。另外參與培訓的青農種子學員，也有更加長遠的打算，他們自發組織成立了花蓮無人機應用發展協會，未來可以充分利用AI無人機的科技優勢及整合農業系統解決方案，即可以取代人力噴藥噴自己的農作物外，也幫其它農友代噴，又可以結合AI 監控作物、灌溉與病蟲害等健康評估。他們希望利用科技的力量翻轉花蓮智慧農業與無人機推廣發展新紀元。 ►偷偷分享少女秘密！ [1,002 chars]

Source Reference Table

Title	Year	Type	URL
Overview of the TREC 2022 NeuCLIR Track	2023	Benchmark paper	https://arxiv.org/abs/2304.12367
NeuCLIR official site	2022	Project page	https://neuclir.github.io/
mteb/NeuCLIR2022RetrievalHardNegatives	2025	Dataset card	https://huggingface.co/datasets/mteb/NeuCLIR2022RetrievalHardNegatives

Dataset Information

Field	Value
Nano set	NanoMTEB-Misc
Backing dataset	NanoMTEB-Misc
Task / split	2022_zh
Hugging Face dataset	hakari-bench/NanoMTEB-Misc
Language	zh
Category	natural_language
Queries	47
Documents	10,000
Positive qrels	1,643
Positives / query avg	34.96
Positives / query min	1
Positives / query median	18.00
Positives / query max	100
Multi-positive queries	46 (97.87%)
Query length avg chars	24.00
Document length avg chars	1,107.60

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.2931	0.7872	0.2958	top-500
Dense	`harrier_oss_v1_270m`	0.5101	0.9149	0.6245	top-500
Reranking hybrid	`reranking_hybrid`	0.4072	0.8936	0.4918	top-100