NanoFaMTEB-v2 / neu_clir2023_fas

Overview

neu_clir2023_fas is a Persian NeuCLIR retrieval task in NanoFaMTEB-v2. The queries are Persian information needs, and the documents are long Persian news or web documents drawn from a hard-negative retrieval pool.

This task is useful as a long-document retrieval benchmark. Its topics combine named events, organizations, countries, and policy issues, and most queries have many relevant documents. A model must therefore retrieve multiple articles that satisfy the same information need rather than identify a single short answer passage.

Details

What the Original Data Measures

FaMTEB includes NeuCLIR retrieval among its Persian retrieval resources. NeuCLIR is a cross-language and multilingual information retrieval benchmark focused on news-like topics and document collections. In this Nano split, the task is represented as Persian retrieval over Persian documents.

The public source is mteb/NeuCLIR2023RetrievalHardNegatives, described as a hard-negative version built from BM25 and multilingual dense retriever pools. This makes the task suitable for comparing lexical retrieval, dense semantic retrieval, and hybrid candidate pools on long Persian documents.

Observed Data Profile

This Nano split contains 74 queries, 10,000 documents, and 3,669 positive qrels. It is strongly multi-positive: queries have 49.58 positives on average, with a minimum of 1, a median of 38.0, and a maximum of 100. There are 73 multi-positive queries, or 98.65% of the split. Queries average 65.82 characters, and documents average 3,121.94 characters.
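The per-query statistics above can be recomputed from the relevance labels. The following is a minimal sketch, assuming the qrels have already been loaded into a dictionary mapping each query id to its set of positive document ids. That layout and the function name `positives_per_query_summary` are illustrative assumptions, not the dataset's actual schema.

```python
from statistics import mean, median


def positives_per_query_summary(qrels):
    """Summarize positives per query.

    qrels: dict mapping query id -> set of positive doc ids
    (a hypothetical in-memory layout, not the dataset's schema).
    """
    counts = [len(doc_ids) for doc_ids in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_share": multi / len(counts),
    }


# Toy qrels for illustration only; these are not rows from the split.
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3", "d4"}, "q3": {"d5", "d6"}}
summary = positives_per_query_summary(toy_qrels)

# Cross-check of the reported figures using only the numbers stated above.
reported_avg = round(3669 / 74, 2)        # positives per query
reported_share = round(100 * 73 / 74, 2)  # percent of multi-positive queries
```

The last two lines reproduce the reported average of 49.58 and the 98.65% multi-positive share directly from the counts of queries and qrels.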

Observed examples ask for information about Chinese companies sanctioned by the United States (excluding Huawei), the "Evergreen" ship stuck in the Suez Canal (named Ever Given in the positive document), tourism potential between Uzbekistan and Iran, ecological effects of the Sanchi tanker collision, and the opportunities and challenges of 5G internet. Positive documents are long news-style passages or articles.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.4336, hit@10 of 0.9054, and recall@100 of 0.5508 with a top-500 candidate pool. The high hit rate shows that lexical retrieval usually finds at least one relevant article, especially when queries contain names such as organizations, ships, countries, or technologies.

The ranking and coverage limits are visible in nDCG and recall. Long documents contain many overlapping terms, and hard negatives may mention the same event without satisfying the exact information need. BM25 can also over-prioritize articles with repeated keywords while missing broader contextual relevance.

Dense Evaluation Profile

The dense harrier-oss-270m profile (config harrier_oss_v1_270m) reaches nDCG@10 of 0.5766, hit@10 of 0.9730, and recall@100 of 0.5890. It has the highest nDCG@10 and hit@10 of the three profiles and improves on BM25 in all three headline metrics.
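To make the size of the dense improvement concrete, the sketch below computes absolute and relative gains from the reported figures. The metric values are copied from this card. The conversion of hit@10 into a query count assumes hit@10 is a per-query 0/1 value averaged over the 74 queries, which is the usual definition but is not stated explicitly here.

```python
# Reported headline metrics, copied from this card.
bm25 = {"ndcg@10": 0.4336, "hit@10": 0.9054, "recall@100": 0.5508}
dense = {"ndcg@10": 0.5766, "hit@10": 0.9730, "recall@100": 0.5890}
NUM_QUERIES = 74


def gains(baseline, candidate):
    """Return {metric: (absolute_gain, relative_gain)} of candidate over baseline."""
    out = {}
    for metric, base in baseline.items():
        delta = candidate[metric] - base
        out[metric] = (delta, delta / base)
    return out


dense_vs_bm25 = gains(bm25, dense)
for metric, (delta, rel) in dense_vs_bm25.items():
    print(f"{metric}: {delta:+.4f} absolute, {rel:+.1%} relative")

# Assumption: hit@10 is the mean of a per-query 0/1 indicator,
# so multiplying by the query count recovers the number of hit queries.
bm25_hits = round(bm25["hit@10"] * NUM_QUERIES)
dense_hits = round(dense["hit@10"] * NUM_QUERIES)
```

Under that assumption, the hit@10 values are consistent with 67 and 72 of the 74 queries, while the nDCG@10 gain is roughly 33% relative to BM25.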

This suggests that embedding similarity is important for NeuCLIR-style topics. Queries often describe an information need rather than a single entity lookup, and relevant articles may express the event, consequence, or policy issue using varied wording. Dense retrieval helps connect the query intent to long documents that do not repeat every phrase.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.5595, hit@10 of 0.9459, and recall@100 of 0.5993. It uses exactly 100 candidates per query and has no safeguard-positive rows.

Hybrid retrieval is strongest on recall@100, which matters in this task because nearly all queries have many positives. Its nDCG@10 is slightly below dense retrieval, indicating that dense-only ranking is a better first-stage top-10 ordering signal, while hybrid retrieval gives rerankers a broader and more complete candidate pool.
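This card does not describe how the reranking_hybrid candidates are assembled. As an illustration of one common way to merge a lexical and a dense ranking into a fixed-size pool, the sketch below uses reciprocal rank fusion (RRF). RRF, the constant `k=60`, and the function name are assumptions chosen for the example and should not be read as the method behind this subset.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Illustrative only: the actual construction of reranking_hybrid
    is not documented in this card.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc id for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Toy rankings, not real retrieval output.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d4", "d1"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], top_n=3)
```

In the toy case, documents ranked by both retrievers rise to the top, and a document found by only one retriever can still enter the pool. That is the behavior that tends to help recall in a many-positive task.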

Metric Interpretation for Model Researchers

neu_clir2023_fas is best read as a long-document, many-positive retrieval task. Hit@10 is the least discriminative metric here because every profile finds at least one positive in the top 10 for more than 90% of queries (0.9054 to 0.9730). nDCG@10 measures how well the first page is ordered, and recall@100 measures whether the model covers the broader relevant article set.
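For reference, the three metrics can be sketched as follows for a single query. This is a minimal version that assumes binary relevance (a document is either in the positive set or not) and the standard log2 discount for nDCG. The evaluation code behind the reported numbers may differ in details such as graded labels or tie handling.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any of the top-k doc ids is a positive, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """nDCG@k with binary gains (an assumption) and a log2 discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(ranked[:k])
        if d in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy query: three positives, two of them retrieved at ranks 1 and 3.
ranked = ["d1", "x1", "d2", "x2"]
positives = {"d1", "d2", "d3"}
hit = hit_at_k(ranked, positives)
rec = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

Averaging each function over all queries gives split-level scores comparable in kind to the figures in this card.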

Dense retrieval is the strongest direct ranking signal. reranking_hybrid is attractive for reranking because it improves relevant coverage despite using only 100 candidates. BM25 remains useful as a lexical component, but exact term matching alone is not enough for high-quality ranking.

Query and Relevance Type Tendencies

Queries are Persian information needs about events, policy issues, international relations, accidents, technologies, and organizations. They are longer than typical keyword queries and often ask to "find information" about a topic.

Relevant documents are long articles or news-style passages. A document may be relevant because it discusses the event background, consequence, stakeholder, or policy angle requested by the query. This creates a broader relevance space than single-answer QA.

Representative Failure Modes

BM25 may retrieve articles that mention the same entity or event but do not cover the requested angle. Dense retrieval may find thematically similar articles while missing a precise named organization, date, or incident. Hybrid retrieval improves coverage but still leaves the reranker to choose among long documents with overlapping event descriptions.

Another failure mode is under-coverage. With dozens of relevant documents per query, a model can look strong on hit@10 while still missing a large fraction of relevant articles in the top 100.
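The under-coverage pattern is easy to see with a constructed example. The sketch below builds a synthetic ranking for a hypothetical query with 50 positives, close to the split's average. All ids and counts are invented for illustration.

```python
# Synthetic query with 50 positives (ids are invented for illustration).
positives = {f"pos{i}" for i in range(50)}

# One positive at rank 1, eleven more between ranks 11 and 21,
# and negatives everywhere else.
ranked = (
    ["pos0"]
    + [f"neg{i}" for i in range(9)]
    + [f"pos{i}" for i in range(1, 12)]
    + [f"neg{i}" for i in range(9, 89)]
)

top10, top100 = ranked[:10], ranked[:100]
hit_at_10 = any(d in positives for d in top10)
found_in_top100 = sum(1 for d in top100 if d in positives)
recall_at_100 = found_in_top100 / len(positives)
missed = len(positives) - found_in_top100
```

This ranking counts as a hit at 10, yet it recovers only 12 of the 50 positives in the top 100, a recall@100 of 0.24 with 38 relevant articles missed.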

Training Data That May Help

Useful training data includes Persian news retrieval, NeuCLIR-style topic-document judgments, long-document retrieval data, multilingual CLIR resources, and hard negatives centered on the same event or organization. Training should include many-positive topics so the model learns coverage rather than only one-answer matching.

Training should exclude this split's topics, documents, and relevance labels.
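One simple way to apply this exclusion is to filter training examples against the split's queries and document ids before training. The sketch below assumes each training example is a dictionary with `query` and `doc_id` fields. Both field names and the function name are hypothetical. Exact matching after whitespace normalization is only a first pass, and near-duplicate detection would need more than this.

```python
def drop_overlapping_examples(train_examples, eval_queries, eval_doc_ids):
    """Remove training examples that overlap with the evaluation split.

    train_examples: list of dicts with hypothetical "query" and "doc_id" keys.
    eval_queries:   iterable of evaluation query strings.
    eval_doc_ids:   set of evaluation document ids.
    """
    def normalize(text):
        # Collapse runs of whitespace so trivial formatting differences match.
        return " ".join(text.split())

    blocked_queries = {normalize(q) for q in eval_queries}
    kept = []
    for example in train_examples:
        if normalize(example["query"]) in blocked_queries:
            continue
        if example["doc_id"] in eval_doc_ids:
            continue
        kept.append(example)
    return kept


# Toy data for illustration only.
train = [
    {"query": "q a", "doc_id": "t1"},
    {"query": "eval  query", "doc_id": "t2"},  # same query, extra space
    {"query": "q b", "doc_id": "e9"},          # document from the eval split
]
clean = drop_overlapping_examples(train, ["eval query"], {"e9"})
```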

Model Improvement Notes

Improving this task requires long-document representations that preserve both entity details and topic-level intent. Models should handle Persian news style, dates, organization names, event descriptions, and consequence-oriented queries.

For reranking, passage or document chunking may matter: a long document can be relevant because of one section, while the rest contains unrelated context. A strong reranker should identify whether the document satisfies the information need, not only whether it discusses the same broad topic.
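The chunking idea can be sketched as max-passage scoring: split the document into overlapping character windows, score each window against the query, and keep the best score. The window size, the overlap, and the toy scorer below are assumptions for illustration. A real setup would pass a reranker's scoring function as `score_fn` and would likely chunk on sentence or token boundaries.

```python
def chunk_text(text, size=1000, overlap=200):
    """Split text into overlapping character windows that cover the whole text."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def max_passage_score(query, document, score_fn, size=1000, overlap=200):
    """Return (best_score, best_chunk_index) over all chunks of the document."""
    chunks = chunk_text(document, size=size, overlap=overlap)
    scores = [score_fn(query, chunk) for chunk in chunks]
    best_index, best_score = max(enumerate(scores), key=lambda pair: pair[1])
    return best_score, best_index


def toy_score(query, chunk):
    """Toy scorer: share of query tokens found in the chunk (not a real reranker)."""
    tokens = query.split()
    return sum(tok in chunk for tok in tokens) / len(tokens)


# Synthetic document: unrelated filler followed by one relevant section.
document = "filler " * 300 + "sanchi tanker collision ecological effects"
best_score, best_index = max_passage_score(
    "sanchi tanker collision", document, toy_score
)
```

Here the relevant section sits at the end of the document, and the best-scoring window is the last of three. Taking the maximum over windows lets one relevant section carry the document, in line with the observation above.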

Example Data

Query	Positive document
اطلاعاتی درباره شرکت های چینی تحریم شده توسط ایالات متحده، به استثنای هواوی، پیدا کنید. [87 chars]	گنجانده شدن چند شرکت چینی در لیست سیاه پنتاگون+ جزئیات به گزارش اسپوتنیک، شرکت China Construction Technology Co. Ltd. (CCTC) ، China International Engineering Consulting Corp. (CIECC) ، شرکت China National Offshore Oil Corp. (CNOOC) و Semiconductor Manufacturing International Corp. (SMIC) به عنوان "شرکتهای کمونیست نظامی چین" که با مجتمع صنایع نظامی چین همکاری می کنند، در این لیست گنجانده شده اند. "لیست سیاه" به رئیس جمهور ایالات متحده اجازه می دهد تا هرگونه فعالیت تجاری شرکت های مربوطه در آمریکا را مورد تحریم قرار دهد. پیشتر مجلس نمایندگان كنگره آمریكا به اتفاق آرا لایحه ای را تصویب كرد كه می تواند شركت های چینی را امکان عرضه سهام خود در بورس های آمریكا محروم کند. در سند مصوبه خاطرنشان می شود این طرح قانونی باید یک مشکل طولانی مدت را حل کند که اختلافات تجاری بین ایالات متحده و چین به آن شدت بخشیده است. به عقیده نویسندگان این لایحه ، چین از واگذاری امکان دسترسی تنظیم کننده های آمریکایی به گزارش های حسابرسی خود امتناع می ورزد. قانون جدید شرکت های چینی فعال در ایالات متحده را موظف می کند... [1,000 / 1,417 chars]
اطلاعاتی در مورد کشتی "اورگرین" که در کانال سوئز گیر کرده است را پیدا کنید [74 chars]	تلاش تازه برای شناور کردن کشتی گرفتار در کانال سوئز تلاش تازه برای شناور کردن کشتی گرفتار در کانال سوئز ۳۳ دقیقه پیش منبع تصویر، EPA تلاشی تازه برای به شناور کردن یک کشتی عظیم باربری که کانال سوئز در مصر را سد کرده است در جریان است. مقام های کانال می گویند که ۱۴ قایق بکسل با استفاده از مد آب در روز شنبه سعی دارند کشتی را آزاد کنند و اگر این تلاش موفق نبود روز یکشنبه قایق های دیگری به کمک آنها خواهند آمد. کشتی "اِوِر گیوِن" روز سه شنبه از پهلو در این کانال که یکی از شلوغ ترین مسیرهای تجاری جهان است گیر کرد. بیش از ۳۰۰ کشتی در دو طرف نقطه ای که مسدود شده منتظر عبور هستند. بعضی کشتی ها مجبور شده اند با دور زدن آفریقا به مقصد برسند. تا اواخر روز جمعه کشتی های لایروبی بیش از ۲۰ هزار تن ماسه را از اطراف دماغه کشتی، که عمیقا در کرانه های کانال فرو روفته، جابجا کردند. اسامه ربیع، رئیس سازمان مدیریت کانال سوئز، در یک کنفرانس خبری گفت که ۹ هزار تن آبی که برای حفظ تعادل کشتی استفاده می شود تخلیه شده است تا کشتی سبک شود. او گفت که عقب کشتی جمعه شب شروع به حرکت کرد، و سکان و پروانه کشتی شروع به کار... [1,000 / 2,891 chars]
پتانسیل گردشگری بين ازبکستان و ایران چقدر است؟ [46 chars]	جاده ابریشم پل ارتباطی گردشگری ایران و ازبکستان غلامحسین ابراهیمی با سابقه فعالیت در بخش های اقتصادی و گردشگری در کشورهای تونس، اردن، لهستان، لیتوانی ، ازبکستان و مالزی به خبرنگار مهر گفت: ازبکستان مانند ایران به دلیل قرار گرفتن در مسیر جاده ابریشم از ظرفیت های بسیار بالایی در حوزه گردشگری برخوردار است به همین دلیل شوکت میرضیایف رئیس جمهوری این کشور در دو سال اخیر برنامه های فشرده و تشویقی زیادی برای توسعه این صنعت در زمینه تسهیل و ارزان کردن سفر به ازبکستان، توسعه زیرساخت های اقامتی، توسعه زبان انگلیسی، آموزش نیروی انسانی ورزیده، تعریف و برند کردن جشنواره های بین المللی با محوریت شهرهایی چون سمرقند، بخارا، خوقند و تاشکند، توسعه ظرفیت فرودگاهی ، ایجاد پلیس گردشگری، توسعه گردشگری فرهنگی و زیارتی، تبلیغات گردشگری در سطح بین المللی ، تسهیل سرمایه گذاری در حوزه گردشگری و مهمتر از همه تسهیل ویزا و استفاده از تجربیات گردشگری کشورهای موفق و حتی جذب مشاوران خارجی را اجرا کرده است به نحوی که آمار گردشگران این کشور از یک میلیون نفر در سال ۲۰۱۶ به بیش از ۵ میلیون نفر در سال ۲۰۱۸ رسید. وی افزود: ت... [1,000 / 4,940 chars]

Source Reference Table

Source	Role
FaMTEB: Massive Text Embedding Benchmark in Persian Language	Persian embedding benchmark paper.
MTEB: Massive Text Embedding Benchmark	General embedding benchmark framework.
NeuCLIR project	Original NeuCLIR benchmark context.
mteb/NeuCLIR2023RetrievalHardNegatives	Public hard-negative source dataset card.
hakari-bench/NanoFaMTEB-v2	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoFaMTEB-v2
Backing dataset	NanoFaMTEB-v2
Task / split	neu_clir2023_fas
Hugging Face dataset	hakari-bench/NanoFaMTEB-v2
Language	fa
Category	natural_language
Queries	74
Documents	10,000
Positive qrels	3,669
Positives / query avg	49.58
Positives / query min	1
Positives / query median	38.00
Positives / query max	100
Multi-positive queries	73 (98.65%)
Query length avg chars	65.82
Document length avg chars	3,121.94

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4336	0.9054	0.5508	top-500
Dense	`harrier_oss_v1_270m`	0.5766	0.9730	0.5890	top-500
Reranking hybrid	`reranking_hybrid`	0.5595	0.9459	0.5993	top-100