NanoFaMTEB-v2 / msmarco_fa
Overview
msmarco_fa is a Persian MS MARCO-style passage retrieval task in NanoFaMTEB-v2. The queries are short web-search questions, and the documents are answer-like Persian passages from a hard-negative retrieval corpus.
This task is unusual inside the Nano set because it has a small number of queries but many positives per query. It is therefore useful for studying broad answer coverage: a model must find many acceptable passages for a short query, not just one exact answer page.
Details
What the Original Data Measures
FaMTEB includes translated and Persian-adapted retrieval datasets to evaluate Persian text embedding models. msmarco_fa uses MCINext/MSMARCO_FA_test_top_250_only_w_correct-v2, a Persian MS MARCO-style test split prepared for passage retrieval evaluation.
The original MS MARCO retrieval setting measures whether a system can retrieve answer passages for real web-search queries. In this Persian version, the same style appears as short translated or Persian queries paired with many answer-bearing passages. The MTEB framework supplies the common retrieval evaluation protocol.
Observed Data Profile
This Nano split contains 43 queries, 8,766 documents, and 2,826 positive qrels. Every query has multiple positives. Queries have 65.72 positives on average, with a minimum of 4, a median of 75.0, and a maximum of 100. Queries average 31.49 characters, and documents average 326.20 characters.
Observed examples include Persian web-search questions about suicide causes among military personnel, physical descriptions of pine trees, interior concrete flooring cost, declaratory judgments, and hydrogen liquefaction temperature. Documents are short answer passages, often translated from web-style informational sources.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.4737, hit@10 of 0.9070, and recall@100 of 0.4296 with a top-500 candidate pool. The high hit rate shows that lexical matching often finds at least one relevant passage. Short queries with concrete terms such as "hydrogen", "temperature", "concrete flooring", or "declaratory judgment" give BM25 useful anchors.
The lower nDCG and recall are more informative than hit@10 here. Because each query may have dozens of positives, retrieving one relevant passage is not enough to cover the relevance set. BM25 tends to favor passages that repeat the most obvious query terms, while missing paraphrased or semantically equivalent answers.
Dense Evaluation Profile
The dense harrier-oss-270m profile reaches nDCG@10 of 0.6139, hit@10 of 0.9302, and recall@100 of 0.4922. Dense retrieval is strongest on nDCG@10 and recall@100 among the initial top-500 profiles.
This pattern matches the MS MARCO-style task design. Many positive passages answer the same intent using different wording, so embedding similarity can connect short queries to answer passages that do not share all surface words. Dense retrieval is especially helpful for translated or web-like phrases where exact token overlap is not the only relevance signal.
Reranking Hybrid Evaluation Profile
The reranking_hybrid candidate subset reaches nDCG@10 of 0.6119, hit@10 of 0.9767, and recall@100 of 0.4812. It uses exactly 100 candidates per query and has no safeguard-positive rows.
Hybrid retrieval improves the chance that at least one positive appears near the top, as shown by the strongest hit@10. Its nDCG@10 is almost identical to dense retrieval, while its recall@100 is slightly lower than the dense top-500 profile because the hybrid subset is constrained to 100 candidates. For reranking work, this makes the task a compact but high-signal candidate pool.
Metric Interpretation for Model Researchers
msmarco_fa should not be interpreted only through hit@10. Hit@10 is high for all three profiles because every query has many positives. nDCG@10 and recall@100 better show whether a model can rank several answer-bearing passages highly and cover the broader relevance set.
Dense retrieval is the clearest first-stage winner for ranking quality, while reranking_hybrid is best for ensuring a positive in the top 10. The gap between many positives and moderate recall@100 indicates that this task remains useful for measuring semantic coverage, not just exact answer retrieval.
Query and Relevance Type Tendencies
Queries are short Persian web-search phrases or questions. They often name a concept, cost, definition, condition, or physical property. Relevant documents are answer passages, and many of them can satisfy the same information need.
The relevance relation is broad compared with encyclopedia QA tasks. A passage may be relevant because it gives a useful explanation, definition, cost factor, or practical answer, even if it is not a single canonical source.
Representative Failure Modes
BM25 may retrieve passages with strong term overlap but the wrong intent. For example, a document may mention the same legal or medical phrase while answering a different question. Dense retrieval may cluster semantically nearby answers but still miss rare terms, units, or technical constraints.
Because there are many positives, another failure mode is shallow coverage. A model may retrieve a few obvious positives and still miss many alternate answer passages, which lowers recall even when hit@10 looks strong.
Training Data That May Help
Useful training data includes Persian web-search logs, mMARCO-style translated query-passage pairs, Persian passage ranking data, and hard negatives with high lexical overlap. Query-passage pairs should include multiple valid answers per query so the model learns broad relevance coverage.
Training should exclude the 43 evaluation queries and their positive passages from this Nano split.
Model Improvement Notes
Improving this task requires representing short Persian queries as information needs rather than as bags of keywords. Models should learn to match answer intent, units, definitions, and practical explanations across paraphrases.
For reranking, the best gains will come from distinguishing truly answer-bearing passages from topically related passages. Since many positives are acceptable, rerankers should avoid over-specializing to a single canonical wording.
Example Data
| Query | Positive document |
| علل خودکشی در میان نظامیان [26 chars] | علائم اختلال استرس پس از سانحه میتوانند خیلی زود پس از تجربه یک رویداد آسیبزا ظاهر شوند. مشکلات دیگری نیز معمولاً همراه با اختلال استرس پس از سانحه رخ میدهند، از جمله افسردگی، سایر اختلالات اضطرابی و سوء مصرف الکل و مواد مخدر. در واقع، بیش از نیمی از مردان مبتلا به اختلال استرس پس از سانحه مشکلات الکلی دارند و تقریباً نیمی از زنان مبتلا به این اختلال، دچار افسردگی میشوند. اختلال استرس پس از سانحه همچنین میتواند توانایی فرد را در برقراری روابط، عملکرد در محل کار و مدرسه و انجام فعالیتهای تفریحی کاهش دهد. علاوه بر این، افراد مبتلا به اختلال استرس پس از سانحه ممکن است... [580 chars] |
| توصیف فیزیکی درخت کاج چیست؟ [27 chars] | توضیحات محصول. صنوبر چشم آبی نوزاد، یک گونه مخروطی و پرشاخه از صنوبر کلرادو با رشد یکنواخت و سوزنهای آبی رنگ فشرده است. به طور متوسط، سالانه حدود 20 سانتیمتر رشد عمودی دارد، در حالی که برخی از صنوبرهای کلرادو میتوانند تا 30 تا 45 سانتیمتر در سال رشد کنند. از آنجایی که این گیاه پیوندی است، سرعت رشد میتواند بسته به پایه (ریشه زیرین) که روی آن پیوند شده، متفاوت باشد. [371 chars] |
| هزینه کفپوش بتنی داخلی [22 chars] | برخی از عواملی که ممکن است به این هزینه اضافه کنند عبارتند از: آمادهسازی محل و زیرسازی، دسترسی به محل، کفهای کوچک زیر ۴۶ متر مربع، و بتن ضخیمتر. هزینه کف بتنی رنگی یکپارچه: ۳.۷۵ دلار به ازای هر متر مربع، که شامل بسته پایه کف بتنی و افزودن رنگ به بتن میشود. [260 chars] |
Source Reference Table
| Source | Role |
| FaMTEB: Massive Text Embedding Benchmark in Persian Language | Persian embedding benchmark paper. |
| MTEB: Massive Text Embedding Benchmark | General embedding benchmark framework. |
| MCINext/MSMARCO_FA_test_top_250_only_w_correct-v2 | Public source dataset card. |
| hakari-bench/NanoFaMTEB-v2 | Nano benchmark dataset containing this split. |
Dataset Information
| Field | Value |
| Nano set | NanoFaMTEB-v2 |
| Backing dataset | NanoFaMTEB-v2 |
| Task / split | msmarco_fa |
| Hugging Face dataset | hakari-bench/NanoFaMTEB-v2 |
| Language | fa |
| Category | natural_language |
| Queries | 43 |
| Documents | 8,766 |
| Positive qrels | 2,826 |
| Positives / query avg | 65.72 |
| Positives / query min | 4 |
| Positives / query median | 75.00 |
| Positives / query max | 100 |
| Multi-positive queries | 43 (100.00%) |
| Query length avg chars | 31.49 |
| Document length avg chars | 326.20 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.4737 | 0.9070 | 0.4296 | top-500 |
| Dense | harrier_oss_v1_270m | 0.6139 | 0.9302 | 0.4922 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.6119 | 0.9767 | 0.4812 | top-100 |