NanoFaMTEB-v2 / nq_fa

Overview

nq_fa is a Persian Natural Questions-style passage retrieval task in NanoFaMTEB-v2. The queries are short factual questions, and the documents are Persian encyclopedia-style passages.

This task evaluates open-domain factual retrieval in Persian. Compared with many-positive web passage tasks, nq_fa usually has only one positive passage per query, so the model must identify the specific passage that answers the question rather than relying on broad topical coverage.

Details

What the Original Data Measures

FaMTEB includes translated and Persian retrieval datasets to evaluate text embeddings beyond English. nq_fa uses MCINext/NQ_FA_test_top_250_only_w_correct-v2, a Persian Natural Questions-style hard-negative dataset under the MTEB retrieval setup.

Natural Questions-style retrieval measures whether a system can find answer-bearing passages for factual questions derived from real search behavior. In this Persian split, the target documents are compact encyclopedia passages, and hard negatives often share entities or topical vocabulary with the correct answer.

Observed Data Profile

This Nano split contains 200 queries, 10,000 documents, and 251 positive qrels. Queries have 1.25 positives on average, with a minimum of 1, a median of 1.0, and a maximum of 3. There are 48 multi-positive queries, or 24.0% of the split. Queries average 46.72 characters, and documents average 556.82 characters.

Observed examples ask about television show judges, release timing, amusement-park ride closure dates, actors in a sitcom, and the number of national parks in India. Positive documents are concise passages about the relevant program, attraction, actor, list, or entity.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.4470, hit@10 of 0.7000, and recall@100 of 0.9363 with a top-500 candidate pool. This profile shows strong lexical candidate coverage but weaker top-10 ordering.

Named entities, titles, dates, and distinctive nouns help BM25 include positives in the broader candidate set. The challenge is that many hard negatives repeat the same entity name or topic. BM25 can retrieve a passage about the right show, location, or country while missing the exact answer-bearing passage at the top.

Dense Evaluation Profile

The dense harrier-oss-270m profile reaches nDCG@10 of 0.5817, hit@10 of 0.8350, and recall@100 of 0.9163. Dense retrieval is the strongest first-stage top-ranking profile for this task.

The improvement over BM25 suggests that embedding similarity helps connect the factual intent of the question to the answer passage. Dense retrieval can handle paraphrase, translated phrasing, and answer descriptions that do not repeat every query term. Its recall@100 is slightly below BM25, so lexical matching still contributes useful breadth.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.5274, hit@10 of 0.7900, and recall@100 of 0.9841. It uses 100 candidates per query, with four rank-101 safeguard positives.

Hybrid retrieval is best for candidate coverage. It does not outperform dense retrieval on nDCG@10, but it gives a reranker a more complete pool of answer-bearing passages. For a task with mostly one positive per query, this high recall is important because a missing candidate cannot be recovered by downstream reranking.

Metric Interpretation for Model Researchers

nq_fa separates exact-answer ranking from broad candidate recall. Dense retrieval is the best direct ranker, BM25 is a strong lexical recall source, and reranking_hybrid provides the most complete top-100 candidate set.

nDCG@10 and hit@10 are especially important because most queries have only one relevant passage. Recall@100 is still useful for diagnosing whether a candidate generator supplies the answer passage to a reranker. The four safeguard rows indicate that a small number of positives needed the optional rank-101 inclusion in the hybrid pool.

Query and Relevance Type Tendencies

Queries are Persian factual questions, often asking "who", "when", "how many", or "which" questions. They frequently mention a title, person, place, or organization. Relevant documents are encyclopedia-like passages that contain the answer and enough context to identify the entity.

The relevance relation is narrow. A passage about the same subject is not enough if it does not answer the requested fact.

Representative Failure Modes

BM25 may over-rank passages that share the entity name but discuss a different property. Dense retrieval may retrieve a semantically close passage that answers a related question. Hybrid retrieval reduces candidate misses but still needs a reranker to distinguish the exact answer span from nearby entity descriptions.

Date and list questions can be difficult when many passages contain numbers. Actor, episode, or release questions can also confuse models if several names from the same franchise or show appear in the corpus.

Training Data That May Help

Useful training data includes Persian open-domain QA retrieval, translated Natural Questions examples, Persian Wikipedia passage retrieval, and hard negatives that share the same entity but answer a different fact. Training should include narrowly answerable questions with single or few positives.

Training should exclude source test rows included in this Nano split.

Model Improvement Notes

Improving this task requires precise question-passage alignment. Models should preserve entity names and dates while also representing the relation asked by the question. Persian-aware tokenization and training on translated QA retrieval can help with both exact names and paraphrased answer contexts.

For reranking, the most valuable behavior is rejecting topically related passages that do not contain the requested fact. A reranker should focus on answer sufficiency, not only subject similarity.

Example Data

Query	Positive document
داوران برنامه رقص روی یخ در سال ۲۰۱۴ چه کسانی بودند؟ [52 chars]	رقص روی یخ فیلیپ Schofield و Christine Bleakley برای اجرای مشترک برنامه بازگشتند. Dean، Torvill و Karen Barber برای مربیگری افراد مشهور به برنامه برگشتند. Robin Cousins، Jason Gardiner، Barber و Ashley Roberts برای نهمین، هشتمین، هفتمین و دومین فصل حضور خود در هیئت داوران برنامه بازگشتند. Cousins به دلیل گزارشگری المپیک زمستانی 2014 در هفته‌های 6 و 7 غایب بود، بنابراین Nicky Slater، داور سابق، جای او را گرفت و Barber به طور موقت به عنوان رئیس هیئت داوران منصوب شد. [469 chars]
فصل پنجم روبی کی منتشر می‌شود؟ [30 chars]	فهرست قسمت‌های RWBY RWBY یک مجموعهٔ وب انیمه‌ای آمریکایی در حال تولید است که توسط شرکت Rooster Teeth Productions ساخته شده است. این مجموعه ابتدا در ۱۸ ژوئیهٔ ۲۰۱۳ در وب‌سایت Rooster Teeth منتشر شد و بعداً قسمت‌های آن در یوتیوب و وب‌سایت‌های استریمینگ مانند کرانچی‌رول بارگذاری شد. تا کنون چهار فصل که به آن‌ها «بخش» گفته می‌شود منتشر شده است و فصل پنجم از ۱۴ اکتبر ۲۰۱۷ در حال پخش است.[۱] تا اکتبر ۲۰۱۷، ۵۴ قسمت که به آن‌ها «فصل» گفته می‌شود منتشر شده است. [457 chars]
چه زمانی ترن هوایی آب‌نما در آلن تاورز بسته شد؟ [47 chars]	فلوم (التون تاورز) فلوم یک ترن هوایی آبی (Log Flume) در پارک التون تاورز در استافوردشایر بود. این ترن هوایی در سال ۱۹۸۱ افتتاح شد و در سال ۲۰۰۴ همزمان با اسپانسرینگ آن توسط شرکت ایمپریال لدر، دوباره طراحی شد. این ترن هوایی با تم حمام و سه سقوط طراحی شده بود و در زمان افتتاح، طولانی‌ترین ترن هوایی آبی در جهان به شمار می‌رفت. این جاذبه در سال ۲۰۱۵ بسته شد و یک سال بعد برای بازسازی منطقه و ساخت ترن هوایی SW8 از بین برده شد. [425 chars]

Source Reference Table

Source	Role
FaMTEB: Massive Text Embedding Benchmark in Persian Language	Persian embedding benchmark paper.
MTEB: Massive Text Embedding Benchmark	General embedding benchmark framework.
MCINext/NQ_FA_test_top_250_only_w_correct-v2	Public source dataset card.
hakari-bench/NanoFaMTEB-v2	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoFaMTEB-v2
Backing dataset	NanoFaMTEB-v2
Task / split	nq_fa
Hugging Face dataset	hakari-bench/NanoFaMTEB-v2
Language	fa
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	251
Positives / query avg	1.25
Positives / query min	1
Positives / query median	1.00
Positives / query max	3
Multi-positive queries	48 (24.00%)
Query length avg chars	46.72
Document length avg chars	556.82

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4470	0.7000	0.9363	top-500
Dense	`harrier_oss_v1_270m`	0.5817	0.8350	0.9163	top-500
Reranking hybrid	`reranking_hybrid`	0.5274	0.7900	0.9841	top-100