NanoFaMTEB-v2 / miracl_fa
Overview
miracl_fa is a Persian MIRACL retrieval task in NanoFaMTEB-v2. The queries are short Persian information-seeking questions, and the documents are Persian Wikipedia-style passages drawn from a hard-negative MIRACL retrieval pool.
This task evaluates whether a model can retrieve fact-bearing Persian passages for natural information-seeking questions. Many queries contain explicit entities or named concepts, but the hard-negative construction makes the task more demanding than title matching: competing passages may share names, topics, or related relations while missing the requested answer.
Details
What the Original Data Measures
FaMTEB includes MIRACL as a Persian retrieval resource for evaluating multilingual and Persian embedding models. MIRACL itself is designed around ad hoc retrieval over Wikipedia-style passages, with queries that resemble real information needs.
The public source used here is mteb/MIRACLRetrievalHardNegatives. Its dataset card describes hard negatives pooled from BM25, e5-multilingual-large, and e5-mistral-instruct rankings. As a result, the Nano task is useful for comparing lexical matching, dense semantic retrieval, and hybrid candidate construction under a Persian retrieval setting with already challenging negatives.
Observed Data Profile
This Nano split contains 200 queries, 10,000 documents, and 427 positive qrels. Queries have 2.14 positives on average, with a minimum of 1, a median of 2.0, and a maximum of 8. There are 105 multi-positive queries, or 52.5% of the split. Queries average 39.99 characters, and documents average 413.55 characters.
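The per-query figures above follow directly from the qrels. A minimal sketch, assuming the qrels are available as a mapping from query ID to the set of positive document IDs (the `qrels` layout and the toy IDs are illustrative assumptions, not the dataset's actual schema):

```python
from statistics import mean, median

def positives_profile(qrels):
    """Summarize positives per query for a {query_id: set(doc_ids)} mapping."""
    counts = [len(doc_ids) for doc_ids in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_share": multi / len(counts),
    }

# Toy qrels with hypothetical IDs, for illustration only.
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}, "q3": {"d4", "d5", "d6"}, "q4": {"d7"}}
profile = positives_profile(toy_qrels)
```

Applied to the full split, the same arithmetic gives 427 / 200 = 2.135 positives per query and 105 / 200 = 52.5% multi-positive queries.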
Observed examples ask factual questions about international relations, Iranian ministers, geographic regions, provincial capitals, and plant taxonomy. Positive documents are usually compact encyclopedia passages centered on the requested entity, event, place, or definition.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.4929, hit@10 of 0.8000, and recall@100 of 0.9555 with a top-500 candidate pool. This is a high-recall profile: exact Persian terms, entity names, and answer-bearing keywords let BM25 place relevant passages in the candidate set for most queries.
The weaker nDCG@10 shows that lexical matching is less reliable at ordering the most useful passages near the top. MIRACL hard negatives often share topic words with positives, so BM25 can retrieve passages about a related country, minister, city, or concept without ranking the precise answer passage first.
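For reference, the three reported metrics can be sketched as follows. This assumes binary relevance and the common log2-discounted nDCG formulation; the benchmark's own evaluation code may differ in details such as tie handling. The function names are illustrative.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of the positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG with a log2 rank discount."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in positives)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal if ideal > 0 else 0.0

# Toy ranking with hypothetical IDs: one positive at rank 2, another at rank 11.
ranked = ["n1", "p1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9", "p2"]
positives = {"p1", "p2"}
```

In this toy ranking, hit@10 and recall@100 are both perfect while nDCG@10 stays well below 1, which is the same shape as the BM25 profile: relevant passages are found, but not ordered at the very top.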
Dense Evaluation Profile
The dense harrier-oss-270m profile reaches nDCG@10 of 0.6318, hit@10 of 0.8750, and recall@100 of 0.8993. Dense retrieval gives the highest nDCG@10 of the three profiles, about 0.14 above BM25 (0.6318 vs. 0.4929).
This suggests that semantic similarity helps distinguish the requested fact from nearby lexical matches. Dense retrieval can connect question intent to a passage even when the passage does not repeat every query term exactly. Its recall@100 is lower than BM25's (0.8993 vs. 0.9555), so it is better suited as a top-ranking signal than as the only broad candidate generator.
Reranking Hybrid Evaluation Profile
The reranking_hybrid candidate subset reaches nDCG@10 of 0.5931, hit@10 of 0.9050, and recall@100 of 0.9906. Each query has 100 candidates, except for one query that carries an additional safeguard positive at rank 101.
This is the best coverage profile. Hybrid retrieval combines BM25's exact-term recall with dense retrieval's ability to prefer semantically matching passages. For MIRACL-Fa, that means the hybrid pool is especially useful for reranking experiments: nearly all relevant passages are present in the top-100 candidate set, even though dense-only ranking has the best nDCG@10 among the three initial profiles.
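The source does not spell out how the reranking_hybrid pool is assembled. One common way to combine a lexical and a dense ranking is reciprocal rank fusion (RRF); the sketch below uses RRF purely as an illustrative assumption. The hypothetical `build_pool` helper also appends any known positive missing from the fused top candidates, in the spirit of the rank-101 safeguard mentioned above. The constant `k=60` is the conventional RRF default, not a value taken from this benchmark.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def build_pool(bm25_ranked, dense_ranked, positives, size=100):
    """Top-`size` fused candidates, with missing positives appended (assumed safeguard)."""
    pool = rrf_fuse([bm25_ranked, dense_ranked])[:size]
    missing = sorted(p for p in positives if p not in pool)
    return pool + missing

# Toy example with hypothetical IDs and a pool size of 3.
bm25 = ["a", "b", "c", "d"]
dense = ["b", "e", "a", "f"]
pool = build_pool(bm25, dense, positives={"b", "d"}, size=3)
```

Here the positive `d` falls outside the fused top 3 and is appended as a fourth entry, so a reranker evaluated on the pool is not penalized for a first-stage miss.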
Metric Interpretation for Model Researchers
miracl_fa separates candidate coverage from first-stage ranking quality. BM25 provides broad lexical recall, dense retrieval provides stronger top-10 ordering, and reranking_hybrid provides the most complete top-100 candidate pool.
Researchers should treat nDCG@10 as a measure of how well a model prioritizes the exact answer passage among hard negatives. Recall@100 is also important because more than half of the queries have multiple positives. A reranker that starts from the hybrid pool can test passage-ordering ability without being heavily limited by first-stage candidate misses.
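One way to see how much headroom a candidate pool leaves for a reranker is to score an oracle that moves every positive present in the pool to the top. A minimal sketch, assuming binary relevance and log2-discounted nDCG (the function names are illustrative):

```python
import math

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG with a log2 rank discount."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in positives)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal if ideal > 0 else 0.0

def oracle_ndcg_at_k(pool, positives, k=10):
    """Best nDCG@k any reranker could reach when restricted to `pool`."""
    # Stable sort: positives first, original order preserved otherwise.
    oracle = sorted(pool, key=lambda d: d not in positives)
    return ndcg_at_k(oracle, positives, k)

# Toy pool with hypothetical IDs: it contains only one of the two positives.
pool = ["n1", "n2", "p1", "n3"]
positives = {"p1", "p2"}
first_stage = ndcg_at_k(pool, positives)
ceiling = oracle_ndcg_at_k(pool, positives)
```

The gap between `first_stage` and `ceiling` is what a reranker can recover; the gap between `ceiling` and 1.0 is lost to candidate misses and cannot be fixed by reordering.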
Query and Relevance Type Tendencies
Queries are short Persian questions, often factual "what", "where", or "who" questions. They frequently include an entity name or a distinctive concept. Relevant passages are encyclopedia-like and usually contain the answer in a concise descriptive paragraph.
The relevance relation is direct: a passage is positive when it supplies the requested fact or definition. The difficulty comes from related passages that share entities or topical vocabulary but do not answer the exact question.
Representative Failure Modes
BM25 may over-rank a passage that repeats query terms but concerns the wrong relation. Dense retrieval may prefer a semantically related passage while missing a rare named entity or exact title. Hybrid retrieval reduces both problems, but reranking still has to choose between very similar Persian encyclopedia passages.
Another common risk is partial coverage of multi-positive queries. A model may retrieve one correct passage while failing to cover alternative valid passages or related supporting descriptions.
Training Data That May Help
Useful training data includes Persian Wikipedia retrieval, MIRACL-style query-passage pairs, multilingual hard-negative retrieval data, Persian QA search logs, and entity-centric contrastive pairs. Hard negatives should mention the same entity, country, location, office, or scientific term while omitting the requested answer.
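A simple way to obtain negatives of this kind is to take the highest-ranked non-positive passages from a first-stage retriever. The sketch below assumes a ranked list of document IDs per query is already available; the function name and its parameters are hypothetical. The `skip_top` margin is a heuristic that discards the very top ranks, where unlabeled positives are most likely to sit. It does not guarantee that a mined passage omits the answer.

```python
def mine_hard_negatives(ranked, positives, num_negatives=4, skip_top=0):
    """Pick the highest-ranked non-positive doc IDs after skipping `skip_top` ranks."""
    candidates = [d for d in ranked[skip_top:] if d not in positives]
    return candidates[:num_negatives]

# Toy ranking with hypothetical IDs.
ranked = ["p1", "n1", "n2", "p2", "n3", "n4"]
negatives = mine_hard_negatives(ranked, {"p1", "p2"}, num_negatives=2, skip_top=2)
```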
Training data should exclude any MIRACL-Fa queries and passages sampled into this Nano split, so that evaluation results are not inflated by leakage.
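A minimal sketch of such a filter, assuming training rows are dictionaries with hypothetical `query` and `passage` text fields and that the Nano split's query and document texts have been collected into sets. Matching uses normalized text rather than IDs, on the assumption that IDs may not be shared across dataset versions. Dropping rows whose passage appears anywhere in the Nano corpus is a deliberately conservative choice.

```python
import unicodedata

def normalize(text):
    """NFKC-normalize, drop zero-width non-joiners, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).replace("\u200c", "")
    return " ".join(text.split())

def drop_overlapping_rows(train_rows, nano_queries, nano_passages):
    """Keep only training rows whose query and passage are absent from the Nano split."""
    blocked_q = {normalize(q) for q in nano_queries}
    blocked_p = {normalize(p) for p in nano_passages}
    return [
        row for row in train_rows
        if normalize(row["query"]) not in blocked_q
        and normalize(row["passage"]) not in blocked_p
    ]
```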
Model Improvement Notes
A strong model for this task should preserve exact Persian entity matching while also representing the relation asked by the query. Improvements may come from Persian-aware tokenization, multilingual retrieval pretraining, and hard-negative mining that forces the model to distinguish "same topic" from "answers the question".
For reranking, the most useful behavior is precise discrimination among near-duplicate or topically adjacent passages. The hybrid pool already has very high recall, so the remaining challenge is top-10 ordering.
Example Data
| Query | Positive document |
| --- | --- |
| اسرائیل با چه کشورهایی روابط دوستانه دارد؟ [42 chars] | وزارت امور خارجه اسرائیل پیش از پیروزی انقلاب ۱۳۵۷ و به قدرت رسیدن نظام جمهوری اسلامی، ایران با کشور اسرائیل روابط دوستانه و حسنهای را داشت و ایران اولین کشور اسلامی در منطقه خاورمیانه بود که کشور اسرائیل را به رسمیت شناخت. در آن زمان دو کشور ایران و اسرائیل سفارتخانههایی را در پایتخت دو کشور جهت تحکیم روابط برقرار کردند و روابط دوستانه ایران و اسرائیل تا به قدرت رسیدن روح الله خمینی در ایران ادامه داشت. [410 chars] |
| وزیر کنونی فرهنگ و ارشاد اسلامی ایران چه کسی است؟ [49 chars] | محمدمهدی اسماعیلی محمدمهدی اسماعیلی (متولد ۱۳۵۴ در کبودرآهنگ) سیاستمدار ایرانی و وزیر فرهنگ و ارشاد اسلامی است. او دانشآموخته دکتری علوم سیاسی از پژوهشگاه علوم انسانی و مطالعات فرهنگی و عضو هیأت علمی دانشگاه تهران است. تحصیلات حوزوی را نیز تا پایان دوره سطح ادامه داده است. وی همچنین در ۲۰ مرداد ۱۴۰۰ به عنوان وزیر فرهنگ و ارشاد اسلامی پیشنهادی دولت سیزدهم توسط سید ابراهیم رئیسی به مجلس معرفی شد. [399 chars] |
| مثلث برمودا در کجا قرار دارد؟ [29 chars] | مثلث برمودا مثلث برمودا ، همچنین به عنوان مثلث شیطان شناخته میشود. منطقهای است در ناحیه غربی اقیانوس اطلس شمالی که گفته میشود تعدادی هواپیما و کشتی تحت شرایط مرموز در آن ناپدید شدهاند. [189 chars] |
Source Reference Table
| Source | Role |
| --- | --- |
| FaMTEB: Massive Text Embedding Benchmark in Persian Language | Persian embedding benchmark paper. |
| MTEB: Massive Text Embedding Benchmark | General embedding benchmark framework. |
| MIRACL project | Original MIRACL benchmark context. |
| mteb/MIRACLRetrievalHardNegatives | Public hard-negative source dataset card. |
| hakari-bench/NanoFaMTEB-v2 | Nano benchmark dataset containing this split. |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoFaMTEB-v2 |
| Backing dataset | NanoFaMTEB-v2 |
| Task / split | miracl_fa |
| Hugging Face dataset | hakari-bench/NanoFaMTEB-v2 |
| Language | fa |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 427 |
| Positives / query avg | 2.14 |
| Positives / query min | 1 |
| Positives / query median | 2.00 |
| Positives / query max | 8 |
| Multi-positive queries | 105 (52.50%) |
| Query length avg chars | 39.99 |
| Document length avg chars | 413.55 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.4929 | 0.8000 | 0.9555 | top-500 |
| Dense | harrier_oss_v1_270m | 0.6318 | 0.8750 | 0.8993 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.5931 | 0.9050 | 0.9906 | top-100 |