NanoFaMTEB-v2 / fever_fa

Overview

fever_fa is a Persian fact-verification evidence retrieval task in NanoFaMTEB-v2. The query is a short factual claim, and the target documents are Persian evidence passages that can support or refute the claim. The task is adapted from FEVER-style retrieval through FaMTEB.

This task evaluates whether a Persian retriever can connect concise claims to encyclopedia-style evidence. Entity names and factual phrases are often strong lexical anchors, but the model still needs to rank the passage that contains the relevant evidence rather than merely a passage about the same entity.

Details

What the Original Data Measures

FaMTEB is a Persian text embedding benchmark that includes retrieval tasks adapted from BEIR-style and MTEB-style sources. fever_fa uses MCINext/FEVER_FA_test_top_250_only_w_correct-v2, a Persian FEVER-style hard-negative dataset.

The original FEVER setting measures evidence retrieval for factual claims. A retrieval system must find passages that provide factual evidence for verification. In this Persian variant, the same evidence-retrieval problem is evaluated over Persian claims and passages.

Observed Data Profile

This Nano split contains 200 queries, 10,000 documents, and 229 positive qrels. Most queries have a single positive; 25 queries (12.5%) have more than one. Positives per query average 1.15, with a minimum of 1, a median of 1, and a maximum of 4. Queries average 47.09 characters, and documents average 523.29 characters.
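The positives-per-query figures above can be recomputed from a list of positive qrels. The following is a minimal sketch, assuming the qrels are available as (query_id, doc_id) pairs; the function name and the toy IDs are illustrative and not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query_stats(qrels):
    """Summarize how many positive documents each query has.

    `qrels` is an iterable of (query_id, doc_id) pairs, one per positive
    judgment. Queries without any positive do not appear in the result.
    """
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    return {
        "queries": len(per_query),
        "positive_qrels": sum(per_query),
        "avg": mean(per_query),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive": sum(1 for c in per_query if c > 1),
    }


# Toy qrels with hypothetical IDs: q2 has two positives, the others one.
toy_qrels = [("q1", "d1"), ("q2", "d2"), ("q2", "d3"), ("q3", "d4")]
stats = positives_per_query_stats(toy_qrels)
```

On the real split, the same function would be expected to report 200 queries, 229 positive qrels, and 25 multi-positive queries if the qrels match the profile described here.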

Observed examples include claims about films, geographic valleys, companies, actors, and city statistics. Documents are mostly Persian encyclopedia-style passages with entity titles and factual descriptions.

BM25 Evaluation Profile

BM25 is very strong, with nDCG@10 of 0.8025, hit@10 of 0.9000, and recall@100 of 0.9432 using a top-500 candidate pool. This reflects the strong entity signal in FEVER-style retrieval. Claims often mention the entity whose page or passage contains the evidence.
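For readers who want to reproduce numbers of this kind, the three metrics can be sketched per query as below. This assumes binary relevance and a log2 rank discount for nDCG; the source does not state how per-query values are aggregated into the reported figures, so treat this as an illustration rather than the benchmark's exact evaluation code. All names and toy IDs are hypothetical.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """Return 1.0 if any positive appears in the top-k, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of this query's positives found in the top-k."""
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG@k with a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc in enumerate(ranked[:k])
        if doc in positives
    )
    # Ideal ranking places all positives (up to k of them) first.
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(positives), k)))
    return dcg / ideal


# Toy query: two positives, only one retrieved, at rank 2.
toy_ranked = ["d9", "d1", "d7"]
toy_positives = {"d1", "d2"}
toy_hit = hit_at_k(toy_ranked, toy_positives)
toy_recall = recall_at_k(toy_ranked, toy_positives)
toy_ndcg = ndcg_at_k(toy_ranked, toy_positives)
```

In the toy query the hit is 1.0 while recall is only 0.5, which is the gap that multi-positive queries can open between hit@10 and recall@100.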

BM25 can still fail when multiple passages share the same entity name or when the evidence requires matching a specific factual property rather than only the entity title. Lexical retrieval narrows the search space well, but final ranking still needs fact-level precision.

Dense Evaluation Profile

The dense harrier-oss-270m profile (config `harrier_oss_v1_270m`) is strongest by top-rank metrics, with nDCG@10 of 0.8972, hit@10 of 0.9450, and recall@100 of 0.9170 using a top-500 candidate pool. Dense retrieval improves top ranking by matching claim meaning to evidence content.

Dense recall@100 (0.9170) is lower than that of both BM25 (0.9432) and hybrid (0.9825). Dense retrieval therefore ranks the evidence it finds very well but leaves more positive passages outside the top 100. This is plausible when an exact entity name is the decisive signal.
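One way to inspect this gap is to list, per query, the positives that one run leaves outside its top-k while another run retrieves them. The sketch below assumes each run is a dict from query ID to a ranked list of document IDs and the qrels are a dict from query ID to a set of positive IDs; these shapes, the function names, and the toy data are assumptions made for illustration.

```python
def missed_positives(run, qrels, k=100):
    """Return {query_id: positives absent from the top-k of `run`}."""
    missed = {}
    for query_id, positives in qrels.items():
        found = set(run.get(query_id, [])[:k])
        gap = positives - found
        if gap:
            missed[query_id] = gap
    return missed


def only_missed_by(run_a, run_b, qrels, k=100):
    """Positives that run_a misses in its top-k but run_b retrieves."""
    missed_a = missed_positives(run_a, qrels, k)
    missed_b = missed_positives(run_b, qrels, k)
    result = {}
    for query_id, docs in missed_a.items():
        diff = docs - missed_b.get(query_id, set())
        if diff:
            result[query_id] = diff
    return result


# Toy runs with hypothetical IDs, evaluated at k=2 to keep the example small.
toy_qrels = {"q1": {"d1"}, "q2": {"d2", "d3"}}
toy_dense = {"q1": ["d1", "d5"], "q2": ["d2", "d9"]}
toy_bm25 = {"q1": ["d7", "d1"], "q2": ["d3", "d2"]}
dense_only_misses = only_missed_by(toy_dense, toy_bm25, toy_qrels, k=2)
```

Reading the claims behind the returned query IDs is a quick check on whether the dense misses really are entity-name cases.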

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.8396, hit@10 of 0.9250, and recall@100 of 0.9825. Its top-rank metrics fall between BM25 and dense, while its top-100 coverage is the strongest of the three profiles. It uses top-100 candidates with optional rank-101 safeguards; one row contains 101 candidates and one safeguard-positive row is recorded.
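The source does not spell out the safeguard rule. One possible reading, sketched below purely as an assumption, is that a candidate list is cut at the top 100 and, when no positive survives the cut, the best-ranked positive from further down is appended as a single extra candidate. The function name and behavior are hypothetical and are not the benchmark's documented procedure.

```python
def build_candidates(ranked, positives, top_k=100):
    """Truncate `ranked` to top_k, with an assumed one-slot safeguard.

    If none of the top_k candidates is a positive, the highest-ranked
    positive found below the cut is appended, giving top_k + 1 candidates.
    This rule is an assumption, not a documented part of the benchmark.
    """
    candidates = list(ranked[:top_k])
    if positives.isdisjoint(candidates):
        for doc in ranked[top_k:]:
            if doc in positives:
                candidates.append(doc)
                break
    return candidates


# Toy example with top_k=3 so the safeguard is easy to see.
toy_ranked = ["a", "b", "c", "d", "e"]
with_safeguard = build_candidates(toy_ranked, {"e"}, top_k=3)
without_safeguard = build_candidates(toy_ranked, {"b"}, top_k=3)
```

Under this reading, a row with 101 candidates would correspond to a query whose positives all fell below rank 100 in the fused ranking.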

This suggests that BM25 and dense signals are complementary: BM25 preserves entity-name coverage, while dense retrieval improves fact-level ranking. A reranker using hybrid candidates starts from a pool with recall@100 of 0.9825, so nearly all labeled evidence is available to be reranked.
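The source does not state how the hybrid candidates are fused. Reciprocal rank fusion (RRF) is one common way to combine a lexical and a dense ranking, and the sketch below uses it only as an illustration of how complementary rankings can be merged. The constant k=60, the function name, and the toy IDs are assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of document IDs with RRF.

    Each document scores sum(1 / (k + rank)) over the lists that contain
    it, with ranks starting at 1. Ties are broken by document ID so the
    output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / (k + rank)
    fused = sorted(scores, key=lambda doc: (-scores[doc], doc))
    return fused[:top_n]


# Toy rankings with hypothetical IDs.
toy_bm25 = ["d1", "d2", "d3"]
toy_dense = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([toy_bm25, toy_dense])
```

Documents that appear in both lists (d1 and d3 here) rise above documents found by only one retriever, which is the behavior that widens candidate coverage.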

Metric Interpretation for Model Researchers

fever_fa is an entity-heavy Persian evidence retrieval task. BM25 is already strong, dense retrieval is best at top ranking, and hybrid retrieval is best at candidate recall. The main research target is fact-sensitive ranking among entity-related passages.

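Because some queries have multiple positives, recall@100 matters for evidence coverage: hit@10 is satisfied by a single positive, whereas recall@100 counts every one. The top-rank metrics (nDCG@10 and hit@10) measure whether the retriever places the most useful evidence early.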

Query and Relevance Type Tendencies

Queries are short Persian factual claims. Documents are encyclopedia-like passages that often begin with an entity title and factual description. Relevance depends on whether the passage contains evidence for the claim.

Representative Failure Modes

BM25 may retrieve the right entity page but the wrong evidence passage. Dense retrieval may retrieve semantically related passages while missing exact entity evidence. Hybrid retrieval can recover many positives but still needs fact-sensitive reranking.

Training Data That May Help

Useful training data includes Persian fact-checking retrieval, translated FEVER claim-evidence pairs, and entity-centric Wikipedia evidence retrieval. Hard negatives should share named entities but not support the claim.
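A very simple way to collect entity-sharing hard negatives is to keep passages that mention an entity from the claim but are not labeled positive. The sketch below assumes the claim's entities have already been extracted as strings and that the corpus is a dict from document ID to passage text; both the function and the toy passages are hypothetical. It only excludes labeled positives, so mined passages would still need a check that they do not in fact support the claim.

```python
def entity_sharing_negatives(claim_entities, positives, corpus, limit=5):
    """Pick up to `limit` non-positive passages that mention a claim entity.

    Matching is plain substring search, which is a simplification: it does
    not handle spelling variants or unlabeled evidence passages.
    """
    negatives = []
    for doc_id, text in corpus.items():
        if doc_id in positives:
            continue
        if any(entity in text for entity in claim_entities):
            negatives.append(doc_id)
        if len(negatives) == limit:
            break
    return negatives


# Toy corpus with hypothetical passages.
toy_corpus = {
    "d1": "Salt River Valley is a valley in central Arizona.",
    "d2": "Dust from the Salt River Valley was used to test air filters.",
    "d3": "Phoenix is a city in Arizona.",
}
toy_negatives = entity_sharing_negatives(["Salt River Valley"], {"d1"}, toy_corpus)
```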

Training data should exclude this split's evaluation queries and positive passages, along with translated duplicates of either.
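A basic exclusion filter can be sketched as follows, assuming training data is a list of (query, passage) text pairs and the evaluation queries and positives are available as strings. The normalization maps Arabic yeh and kaf to their Persian forms, replaces the zero-width non-joiner with a space, and collapses whitespace. It catches only exact matches after normalization; translated duplicates would need a separate step such as matching on source IDs. All names here are hypothetical.

```python
import unicodedata

# Arabic yeh -> Persian yeh, Arabic kaf -> Persian kaf, ZWNJ -> space.
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9", "\u200c": " "})


def normalize(text):
    """Build a comparison key that is robust to common Persian variants."""
    text = unicodedata.normalize("NFKC", text).translate(_CHAR_MAP)
    return " ".join(text.split())


def decontaminate(train_pairs, eval_queries, eval_positives):
    """Drop training pairs whose query or passage matches evaluation text."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_positives}
    return [
        (query, passage)
        for query, passage in train_pairs
        if normalize(query) not in blocked_queries
        and normalize(passage) not in blocked_passages
    ]


# Toy data with hypothetical strings.
toy_train = [
    ("claim A", "passage Y"),
    ("claim B", "passage X"),
    ("claim C", "passage Z"),
]
clean_train = decontaminate(toy_train, ["claim  A"], ["passage X"])
```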

Model Improvement Notes

Improving on this task requires combining entity recall with factual matching. Models should preserve names, dates, titles, and locations, while also matching predicates and claim polarity.

For reranking, evidence sufficiency and claim-passage entailment signals may help.

Example Data

Query	Positive document
یک پرنده بر فراز آشیانه نسترن تنها یک جایزه اسکار را برنده شد. [62 chars]	بر فراز آشیانه فاخته (فیلم) پرواز بر فراز آشیانه فاخته فیلمی آمریکایی محصول سال ۱۹۷۵ به کارگردانی میلوش فورمن، بر اساس رمانی به همین نام از کن کسی است. جک نیکلسون در این فیلم بازی کرده و لوئیز فletcher، ویلیام ردفیلد، ویل سامپسون و برد دوریف در نقش‌های مکمل به ایفای نقش پرداخته‌اند. این فیلم به عنوان یکی از بزرگترین فیلم‌های تاریخ سینما شناخته می‌شود و در فهرست ۱۰۰ سال... ۱۰۰ فیلم موسسه فیلم آمریکا رتبه ۳۳ را به خود اختصاص داده است. این فیلم دومین فیلمی بود که پنج جایزه اسکار اصلی (بهترین فیلم، بهترین بازیگر مرد، بهترین بازیگر زن، بهترین کارگردان و بهترین فیلمنامه) را پس از فیلم «یک شب اتفاق افتاد» در سال ۱۹۳۴ به دست آورد، دستاویزی که تا سال ۱۹۹۱ و با فیلم «سکوت بره‌ها» تکرار نشد. همچنین جوایز متعددی از گلدن گلوب و بفتا را نیز دریافت کرده است. در سال ۱۹۹۳، این فیلم از سوی کتابخانه کنگره ایالات متحده به عنوان «از نظر فرهنگی، تاریخی یا زیبایی شناختی حائز اهمیت» شناخته شد و برای حفظ در فهرست ملی فیلم انتخاب گردید. [925 chars]
دره رود سالت در کنار رودخانه می‌سی‌سی‌پی قرار دارد. [51 chars]	دره رودخانه سالت دره رودخانه نمک در مرکز آریزونا یک دره وسیع در امتداد رودخانه نمک است که منطقه کلان‌شهری فینیکس را در خود جای داده است. اگرچه این اصطلاح جغرافیایی هنوز هم برای شناسایی این منطقه استفاده می‌شود، نام «دره آفتاب» از اوایل دهه ۱۹۳۰ به منظور تبلیغات و رونق اقتصادی، جایگزین استفاده از آن شد. گرد و غبار سطحی خاک حاصل از دره رودخانه نمک، که به عنوان «گرد و غبار آریزونا» شناخته می‌شود، به عنوان یک گرد و غبار رایج برای آزمایش کارایی فیلترهای هوا مورد استفاده قرار می‌گرفت. این گرد و غبار حاوی ذرات ساینده کوچکی بود. [527 chars]
اسکای یوکی یک شرکت مخابراتی بریتانیایی است. [43 chars]	بریتانیا پادشاهی متحد بریتانیای کبیر و ایرلند شمالی، که معمولاً به عنوان پادشاهی متحد (بریتانیا) شناخته می‌شود، کشوری مستقل در اروپای غربی است. این کشور در سواحل شمال غربی سرزمین اصلی اروپا واقع شده و شامل جزیرهٔ بریتانیای کبیر، بخش شمال شرقی جزیرهٔ ایرلند و بسیاری جزایر کوچک‌تر است. ایرلند شمالی تنها بخشی از پادشاهی متحد است که مرز زمینی با کشور دیگری، یعنی جمهوری ایرلند، دارد. اگرچه ایرلند شمالی تنها بخشی از بریتانیا است که مرز زمینی با یک کشور مستقل دیگر دارد، دو منطقهٔ فرادریایی آن نیز مرز زمینی با کشورهای مستقل دیگر دارند. جبل‌الطارق مرز مشترکی با اسپانیا دارد، در حالی که مناطق پایهٔ حاکمیتی آکروتیری و دکلیا مرزهایی با جمهوری قبرس، جمهوری ترک قبرس شمالی و منطقهٔ حائل سازمان ملل متحد که دو نهاد قبرسی را از هم جدا می‌کند، دارند. علاوه بر این مرز زمینی، پادشاهی متحد از اقیانوس اطلس احاطه شده است، با دریای شمال در شرق، کانال مانش در جنوب و دریای سلتیک در جنوب غربی، که این امر آن را به دوازدهمین کشور با طولانی‌ترین خط ساحلی در جهان تبدیل کرده است. دریای ایرلند بین بریتانیای کبیر و ایرل... [1,000 / 4,136 chars]

Source Reference Table

Source	Role
FaMTEB: Massive Text Embedding Benchmark in Persian Language	Persian embedding benchmark paper.
MTEB: Massive Text Embedding Benchmark	General benchmark framework.
MCINext/FEVER_FA_test_top_250_only_w_correct-v2	Public source dataset card.
hakari-bench/NanoFaMTEB-v2	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoFaMTEB-v2
Backing dataset	NanoFaMTEB-v2
Task / split	fever_fa
Hugging Face dataset	hakari-bench/NanoFaMTEB-v2
Language	fa
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	229
Positives / query avg	1.15
Positives / query min	1
Positives / query median	1.00
Positives / query max	4
Multi-positive queries	25 (12.50%)
Query length avg chars	47.09
Document length avg chars	523.29

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.8025	0.9000	0.9432	top-500
Dense	`harrier_oss_v1_270m`	0.8972	0.9450	0.9170	top-500
Reranking hybrid	`reranking_hybrid`	0.8396	0.9250	0.9825	top-100