NanoFaMTEB-v2 / miracl_fa

Overview

miracl_fa is a Persian MIRACL retrieval task in NanoFaMTEB-v2. The queries are short Persian information-seeking questions, and the documents are Persian Wikipedia-style passages drawn from a hard-negative MIRACL retrieval pool.

This task evaluates whether a model can retrieve fact-bearing Persian passages for natural web-search questions. Many queries contain explicit entities or named concepts, but the hard-negative construction makes the task more demanding than title matching: competing passages may share names, topics, or related relations while missing the requested answer.

Details

What the Original Data Measures

FaMTEB includes MIRACL as a Persian retrieval resource for evaluating multilingual and Persian embedding models. MIRACL itself is designed around ad hoc retrieval over Wikipedia-style passages, with queries that resemble real information needs.

The public source used here is mteb/MIRACLRetrievalHardNegatives. Its dataset card describes hard negatives pooled from BM25, e5-multilingual-large, and e5-mistral-instruct rankings. As a result, the Nano task is useful for comparing lexical matching, dense semantic retrieval, and hybrid candidate construction under a Persian retrieval setting with already challenging negatives.

Observed Data Profile

This Nano split contains 200 queries, 10,000 documents, and 427 positive qrels. Queries have 2.14 positives on average, with a minimum of 1, a median of 2.0, and a maximum of 8. There are 105 multi-positive queries, or 52.5% of the split. Queries average 39.99 characters, and documents average 413.55 characters.

Observed examples ask factual questions about international relations, Iranian ministers, geographic regions, provincial capitals, and plant taxonomy. Positive documents are usually compact encyclopedia passages centered on the requested entity, event, place, or definition.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.4929, hit@10 of 0.8000, and recall@100 of 0.9555 with a top-500 candidate pool. This is a high recall profile: exact Persian terms, entity names, and answer-bearing keywords allow BM25 to include relevant passages in the candidate set for most queries.

The weaker nDCG@10 shows that lexical matching is less reliable at ordering the most useful passages near the top. MIRACL hard negatives often share topic words with positives, so BM25 can retrieve passages about a related country, minister, city, or concept without ranking the precise answer passage first.

Dense Evaluation Profile

The dense harrier-oss-270m profile reaches nDCG@10 of 0.6318, hit@10 of 0.8750, and recall@100 of 0.8993. Dense retrieval is the strongest top-ranking signal in this task, improving nDCG substantially over BM25.

This suggests that semantic similarity helps distinguish the requested fact from nearby lexical matches. Dense retrieval can connect question intent to a passage even when the passage does not repeat every query term exactly. Its recall@100 is lower than BM25, so it is better as a top-ranking signal than as the only broad candidate generator.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.5931, hit@10 of 0.9050, and recall@100 of 0.9906. It uses 100 candidates per query, with one query containing a rank-101 safeguard positive.

This is the best coverage profile. Hybrid retrieval combines BM25's exact-term recall with dense retrieval's ability to prefer semantically matching passages. For MIRACL-Fa, that means the hybrid pool is especially useful for reranking experiments: nearly all relevant passages are present in the top-100 candidate set, even though dense-only ranking has the best nDCG@10 among the three initial profiles.

Metric Interpretation for Model Researchers

miracl_fa separates candidate coverage from first-stage ranking quality. BM25 provides broad lexical recall, dense retrieval provides stronger top-10 ordering, and reranking_hybrid provides the most complete top-100 candidate pool.

Researchers should treat nDCG@10 as a measure of how well a model prioritizes the exact answer passage among hard negatives. Recall@100 is also important because more than half of the queries have multiple positives. A reranker that starts from the hybrid pool can test passage-ordering ability without being heavily limited by first-stage candidate misses.

Query and Relevance Type Tendencies

Queries are short Persian questions, often asking "what", "where", or "who" style factual needs. They frequently include an entity name or a distinctive concept. Relevant passages are encyclopedia-like and usually contain the answer in a concise descriptive paragraph.

The relevance relation is direct: a passage is positive when it supplies the requested fact or definition. The difficulty comes from related passages that share entities or topical vocabulary but do not answer the exact question.

Representative Failure Modes

BM25 may over-rank a passage that repeats query terms but concerns the wrong relation. Dense retrieval may prefer a semantically related passage while missing a rare named entity or exact title. Hybrid retrieval reduces both problems, but reranking still has to choose between very similar Persian encyclopedia passages.

Another common risk is partial relevance for multi-positive queries. A model may retrieve one correct passage while failing to cover alternative valid passages or related supporting descriptions.

Training Data That May Help

Useful training data includes Persian Wikipedia retrieval, MIRACL-style query-passage pairs, multilingual hard-negative retrieval data, Persian QA search logs, and entity-centric contrastive pairs. Hard negatives should mention the same entity, country, location, office, or scientific term while omitting the requested answer.

Training should exclude MIRACL-Fa rows sampled into this Nano split.

Model Improvement Notes

A strong model for this task should preserve exact Persian entity matching while also representing the relation asked by the query. Improvements may come from Persian-aware tokenization, multilingual retrieval pretraining, and hard-negative mining that forces the model to distinguish "same topic" from "answers the question".

For reranking, the most useful behavior is precise discrimination among near-duplicate or topically adjacent passages. The hybrid pool already has very high recall, so the remaining challenge is top-10 ordering.

Example Data

Query	Positive document
اسرائیل با چه کشورهایی روابط دوستانه دارد؟ [42 chars]	وزارت امور خارجه اسرائیل پیش از پیروزی انقلاب ۱۳۵۷ و به قدرت رسیدن نظام جمهوری اسلامی، ایران با کشور اسرائیل روابط دوستانه و حسنه‌ای را داشت و ایران اولین کشور اسلامی در منطقه خاورمیانه بود که کشور اسرائیل را به رسمیت شناخت. در آن زمان دو کشور ایران و اسرائیل سفارتخانه‌هایی را در پایتخت دو کشور جهت تحکیم روابط برقرار کردند و روابط دوستانه ایران و اسرائیل تا به قدرت رسیدن روح الله خمینی در ایران ادامه داشت. [410 chars]
وزیر کنونی فرهنگ و ارشاد اسلامی ایران چه کسی است؟ [49 chars]	محمدمهدی اسماعیلی محمدمهدی اسماعیلی (متولد ۱۳۵۴ در کبودرآهنگ) سیاستمدار ایرانی و وزیر فرهنگ و ارشاد اسلامی است. او دانش‌آموخته دکتری علوم سیاسی از پژوهشگاه علوم انسانی و مطالعات فرهنگی و عضو هیأت علمی دانشگاه تهران است. تحصیلات حوزوی را نیز تا پایان دوره سطح ادامه داده است. وی همچنین در ۲۰ مرداد ۱۴۰۰ به عنوان وزیر فرهنگ و ارشاد اسلامی پیشنهادی دولت سیزدهم توسط سید ابراهیم رئیسی به مجلس معرفی شد. [399 chars]
مثلث برمودا در کجا قرار دارد؟ [29 chars]	مثلث برمودا مثلث برمودا ، همچنین به عنوان مثلث شیطان شناخته می‌شود. منطقه‌ای است در ناحیه غربی اقیانوس اطلس شمالی که گفته می‌شود تعدادی هواپیما و کشتی تحت شرایط مرموز در آن ناپدید شده‌اند. [189 chars]

Source Reference Table

Source	Role
FaMTEB: Massive Text Embedding Benchmark in Persian Language	Persian embedding benchmark paper.
MTEB: Massive Text Embedding Benchmark	General embedding benchmark framework.
MIRACL project	Original MIRACL benchmark context.
mteb/MIRACLRetrievalHardNegatives	Public hard-negative source dataset card.
hakari-bench/NanoFaMTEB-v2	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoFaMTEB-v2
Backing dataset	NanoFaMTEB-v2
Task / split	miracl_fa
Hugging Face dataset	hakari-bench/NanoFaMTEB-v2
Language	fa
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	427
Positives / query avg	2.13
Positives / query min	1
Positives / query median	2.00
Positives / query max	8
Multi-positive queries	105 (52.50%)
Query length avg chars	39.99
Document length avg chars	413.55

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4929	0.8000	0.9555	top-500
Dense	`harrier_oss_v1_270m`	0.6318	0.8750	0.8993	top-500
Reranking hybrid	`reranking_hybrid`	0.5931	0.9050	0.9906	top-100