HAKARI-Bench

NanoFaMTEB-v2

Overview

NanoFaMTEB-v2 is the compact Persian retrieval group for FaMTEB. It covers seventeen Persian natural-language retrieval tasks, including argument retrieval, fact-verification evidence, finance QA, multi-hop QA, MIRACL and Natural Questions style passage retrieval, MS MARCO-style search, NeuCLIR news retrieval, Persian web search, duplicate-question retrieval, scientific and biomedical search, synthetic QA, chatbot RAG FAQ retrieval, WebFAQ, and Wikipedia QA.

The group is useful because Persian retrieval is not treated as a single uniform problem. Some tasks have short web-like queries, some have paragraph or dialogue queries, and several have many relevant documents per query. A model must handle Persian script, morphology, translated benchmark artifacts, native web material, synthetic Persian data, and domain vocabulary from finance, medicine, science, news, and Wikipedia. Comparing the BM25, dense retrieval, and reranking_hybrid profiles across this mix separates lexical anchoring, semantic matching, and candidate complementarity.

What This Group Measures

FaMTEB: Massive Text Embedding Benchmark in Persian Language introduces a Persian embedding benchmark in the MTEB style. NanoFaMTEB-v2 is the compact retrieval subset of that broader benchmark. It combines native Persian resources, translated benchmark tasks, synthetic Persian QA, and domain-specific retrieval sources.

The group measures Persian retrieval robustness across task definitions. A relevant item can be a duplicate question, a FAQ answer, a Wikipedia evidence passage, a finance answer, a scientific abstract, a biomedical COVID document, a news article, or a chatbot knowledge-base entry. This diversity makes single-score interpretation risky: researchers should read the group by task family and retrieval profile.

Task Families

Dataset Shape

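NanoFaMTEB-v2 contains 17 task pages, 2,966 queries, 161,314 split-local documents, and 17,925 positive qrel rows. Query counts vary: thirteen tasks have 200 queries, while argu_ana_fa has 199, MS MARCO has 43, NeuCLIR has 74, and TREC-COVID has 50. Five tasks (argu_ana_fa, syn_per_chatbot_ragfaq, syn_per_qa, web_faq_fas, and wikipedia_multilingual_fa) average exactly one positive per query, whereas TREC-COVID (about 92 positives per query), MS MARCO (about 66), NeuCLIR (about 50), Persian web retrieval (about 11), and SCIDOCS (about 5) are strongly multi-positive.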
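The group totals can be re-derived from the per-task counts. The following is a minimal sketch in Python: the counts are copied from the Task Summary table, while the names `TASKS`, `summarize`, and `positives_per_query` are illustrative and not part of any NanoFaMTEB-v2 tooling.

```python
# Per-task (queries, documents, positives), copied from the Task Summary table.
TASKS = {
    "argu_ana_fa": (199, 8669, 199),
    "fever_fa": (200, 10000, 229),
    "fi_qa2018_fa": (200, 10000, 534),
    "hotpot_qa_fa": (200, 10000, 400),
    "miracl_fa": (200, 10000, 427),
    "msmarco_fa": (43, 8766, 2826),
    "neu_clir2023_fas": (74, 10000, 3669),
    "nq_fa": (200, 10000, 251),
    "persian_web_document": (200, 10000, 2186),
    "quora_fa": (200, 10000, 570),
    "sci_fact_fa": (200, 5183, 225),
    "scidocs_fa": (200, 10000, 986),
    "syn_per_chatbot_ragfaq": (200, 8696, 200),
    "syn_per_qa": (200, 10000, 200),
    "treccovid_fa": (50, 10000, 4623),
    "web_faq_fas": (200, 10000, 200),
    "wikipedia_multilingual_fa": (200, 10000, 200),
}


def summarize(tasks):
    """Aggregate group-level counts from per-task counts."""
    queries = sum(q for q, _, _ in tasks.values())
    documents = sum(d for _, d, _ in tasks.values())
    positives = sum(p for _, _, p in tasks.values())
    return {
        "tasks": len(tasks),
        "queries": queries,
        "documents": documents,
        "positives": positives,
        "positives_per_query": round(positives / queries, 2),
    }


def positives_per_query(tasks):
    """Average number of positive qrel rows per query, for each task."""
    return {name: p / q for name, (q, _, p) in tasks.items()}


summary = summarize(TASKS)
ratios = positives_per_query(TASKS)
# Tasks whose positives equal their query count (one positive per query on average).
single_positive = sorted(name for name, ratio in ratios.items() if ratio == 1.0)
print(summary)
print(single_positive)
```

A ratio of exactly 1.0 only shows that positives and queries are equal in number; whether every query has exactly one positive would need the qrels themselves.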

Lengths vary widely. persian_web_document uses very short web queries, whereas argu_ana_fa and syn_per_chatbot_ragfaq use long argument or conversation-like queries. Documents range from short duplicate questions and FAQ answers to long NeuCLIR news documents and scientific or biomedical abstracts. The group therefore tests Persian retrieval across both terse search intent and rich context matching.

Retrieval Behavior

BM25 Profile

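BM25 is strongest when Persian query terms, named entities, or domain words appear directly in the positive document. quora_fa, syn_per_qa, web_faq_fas, wikipedia_multilingual_fa, fever_fa, hotpot_qa_fa, and persian_web_document all have substantial sparse signal, with BM25 nDCG@10 between 0.6990 and 0.8934. These tasks either preserve short search terms, contain repeated question wording, or have clear Wikipedia/FAQ lexical anchors. sci_fact_fa is the only task where BM25 is the best profile in the current metadata (0.6294, ahead of reranking_hybrid at 0.6100 and dense at 0.5610).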
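To make the idea of lexical anchoring concrete, here is a minimal Okapi BM25 sketch in pure Python. It assumes whitespace tokenization and the common defaults k1=1.2 and b=0.75. The benchmark's actual BM25 configuration and Persian tokenizer are not stated here, and the toy documents are invented for illustration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25 (whitespace tokens)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue  # only terms shared with the document contribute
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus (invented): only the first document shares both query terms.
docs = [
    "پایتخت ایران تهران است",
    "شیراز شهری در جنوب ایران است",
    "قیمت طلا امروز افزایش یافت",
]
query = "پایتخت ایران"
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: -scores[i])
print(ranking, scores)
```

The third document shares no term with the query and scores zero, which is the failure mode BM25 shows when the positive is a paraphrase rather than a lexical match.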

BM25 is weaker on stance-sensitive, conversational, and related-paper tasks. argu_ana_fa, syn_per_chatbot_ragfaq, scidocs_fa, and some finance or biomedical tasks require matching the retrieval relation rather than repeated words. Multi-positive tasks can also hide difficulty: BM25 may find one positive while still ranking the relevant set poorly.
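The gap between finding one positive and ranking the whole relevant set can be illustrated with a small nDCG@10 computation. This sketch assumes binary relevance (every positive qrel has gain 1). The document IDs and the 20-positive query are invented for illustration.

```python
import math


def dcg_at_k(ranked_ids, positives, k=10):
    """Binary-gain DCG over the top-k ranked document IDs."""
    return sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )


def ndcg_at_k(ranked_ids, positives, k=10):
    """nDCG@k with the ideal ranking placing min(|positives|, k) hits first."""
    ideal_hits = min(len(positives), k)
    if ideal_hits == 0:
        return 0.0
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg_at_k(ranked_ids, positives, k) / ideal


def hit_at_k(ranked_ids, positives, k=10):
    """True if at least one positive appears in the top k."""
    return any(doc_id in positives for doc_id in ranked_ids[:k])


# Hypothetical run: one positive at rank 1, then nine non-relevant documents.
ranked = ["p0"] + [f"n{i}" for i in range(9)]

single = {"p0"}                      # single-positive query
multi = {f"p{i}" for i in range(20)}  # multi-positive query with 20 positives

print(hit_at_k(ranked, single), ndcg_at_k(ranked, single))  # hit, nDCG@10 = 1.0
print(hit_at_k(ranked, multi), ndcg_at_k(ranked, multi))    # hit, nDCG@10 about 0.22
```

The same ranking is perfect for the single-positive query and scores about 0.22 for the multi-positive one, so hit-style metrics alone can overstate BM25 on tasks such as MS MARCO, NeuCLIR, or TREC-COVID.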

Dense Profile

Dense retrieval is the best profile for 11 of the 17 tasks in the current metadata. It improves Persian QA, MIRACL, MS MARCO, NeuCLIR, web retrieval, chatbot RAG, and many evidence tasks by connecting paraphrase, answerability, and intent across different wording. It is especially important for long conversational or argument-style queries.

Dense retrieval should still be checked for exact Persian anchors. Named entities, transliterations, domain terminology, and short web queries can be lost if embedding similarity smooths them away. The best dense gains are those that improve semantic matching without damaging entity and term recall.
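One simple way to run such a check is to measure how many query anchors survive into the retrieved text. The helper below is a hypothetical diagnostic, not part of the benchmark: `anchor_coverage` and the example strings are assumptions, and the substring test ignores Persian morphology and spelling variants.

```python
def anchor_coverage(anchors, top_docs):
    """Fraction of anchor strings found in at least one retrieved document."""
    if not anchors:
        return 1.0  # nothing to check
    found = sum(1 for anchor in anchors if any(anchor in doc for doc in top_docs))
    return found / len(anchors)


# Hypothetical query anchors: a named entity and a domain term.
anchors = ["تهران", "بورس"]

# Invented top results: the sparse hit repeats both anchors, the dense hit
# is topically related but drops them.
sparse_top = ["بورس تهران امروز مثبت بود"]
dense_top = ["بازار سهام پایتخت رشد کرد"]

sparse_coverage = anchor_coverage(anchors, sparse_top)
dense_coverage = anchor_coverage(anchors, dense_top)
print(sparse_coverage, dense_coverage)
```

Low coverage on a dense run does not prove an error, since a paraphrased document can still be relevant, but it flags queries worth inspecting by hand.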

Reranking Hybrid Profile

reranking_hybrid is strongest where sparse and dense candidates are complementary. It leads on fi_qa2018_fa, hotpot_qa_fa, scidocs_fa, treccovid_fa, and web_faq_fas in the current metadata. These tasks combine domain terms or exact FAQ/QA cues with semantic answerability.

For reranker experiments, hybrid is particularly useful on multi-positive tasks. MS MARCO, NeuCLIR, Persian web retrieval, and TREC-COVID can have many valid positives, so Recall@100 and candidate diversity matter as much as top-rank nDCG.
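Candidate complementarity can be sketched by fusing a sparse and a dense candidate list and measuring recall on the merged pool. How reranking_hybrid builds its candidates is not described here, so reciprocal rank fusion (RRF) with the conventional constant k=60 is used only as an assumed stand-in. The document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists: each document scores sum(1 / (k + rank)) over lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


def recall_at_k(ranked_ids, positives, k=100):
    """Share of the positives that appear in the top-k candidates."""
    if not positives:
        return 0.0
    return len(set(ranked_ids[:k]) & set(positives)) / len(positives)


# Hypothetical candidate lists for one multi-positive query.
sparse = ["d1", "d2", "d3"]
dense = ["d4", "d1", "d5"]
positives = {"d1", "d3", "d5"}

fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
print(recall_at_k(sparse, positives), recall_at_k(dense, positives), recall_at_k(fused, positives))
```

Each retriever alone finds two of the three positives, while the fused pool contains all three. That is the situation in which a reranker over hybrid candidates has more to work with than a reranker over either list.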

Task Summary

| Task | Retrieval focus | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| argu_ana_fa | argument to paired counterargument | 199 | 8,669 | 199 | 0.2860 | 0.3287 | 0.3128 | Dense |
| fever_fa | claim to evidence | 200 | 10,000 | 229 | 0.8025 | 0.8972 | 0.8396 | Dense |
| fi_qa2018_fa | finance question to answer passage | 200 | 10,000 | 534 | 0.2923 | 0.3525 | 0.3722 | Reranking hybrid |
| hotpot_qa_fa | multi-hop QA to evidence | 200 | 10,000 | 400 | 0.7735 | 0.8060 | 0.8366 | Reranking hybrid |
| miracl_fa | MIRACL question to Wikipedia passage | 200 | 10,000 | 427 | 0.4929 | 0.6318 | 0.5931 | Dense |
| msmarco_fa | web query to passage answer | 43 | 8,766 | 2,826 | 0.4737 | 0.6139 | 0.6119 | Dense |
| neu_clir2023_fas | information need to news documents | 74 | 10,000 | 3,669 | 0.4336 | 0.5766 | 0.5595 | Dense |
| nq_fa | natural question to evidence | 200 | 10,000 | 251 | 0.4470 | 0.5817 | 0.5274 | Dense |
| persian_web_document | short web query to document | 200 | 10,000 | 2,186 | 0.6990 | 0.7780 | 0.7703 | Dense |
| quora_fa | duplicate question retrieval | 200 | 10,000 | 570 | 0.8393 | 0.9122 | 0.8861 | Dense |
| sci_fact_fa | scientific claim evidence | 200 | 5,183 | 225 | 0.6294 | 0.5610 | 0.6100 | BM25 |
| scidocs_fa | related scientific documents | 200 | 10,000 | 986 | 0.1745 | 0.1937 | 0.2143 | Reranking hybrid |
| syn_per_chatbot_ragfaq | conversation to FAQ entry | 200 | 8,696 | 200 | 0.2882 | 0.4304 | 0.3826 | Dense |
| syn_per_qa | synthetic Persian QA evidence | 200 | 10,000 | 200 | 0.8609 | 0.9204 | 0.9173 | Dense |
| treccovid_fa | COVID topic to biomedical literature | 50 | 10,000 | 4,623 | 0.3519 | 0.3594 | 0.4161 | Reranking hybrid |
| web_faq_fas | web FAQ query to answer | 200 | 10,000 | 200 | 0.8680 | 0.8756 | 0.9029 | Reranking hybrid |
| wikipedia_multilingual_fa | Wikipedia question to answer passage | 200 | 10,000 | 200 | 0.8934 | 0.9007 | 0.8958 | Dense |

Interpretation Notes for Model Researchers

NanoFaMTEB-v2 is best interpreted as a Persian retrieval coverage benchmark. High performance on short FAQ or Wikipedia-style tasks does not guarantee strength on argument retrieval, chatbot RAG, NeuCLIR, or scientific relatedness. Compare models by task family and by whether the task is native, translated, or synthetic.

Profile changes are especially informative. BM25-competitive tasks expose strong lexical anchors in Persian. Dense-led tasks show paraphrase and answerability gains. Hybrid-led tasks suggest candidate complementarity, usually when domain terms and semantic relevance both matter.
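The profile labels and their margins can be reproduced from the reported scores. The sketch below copies the nDCG@10 values from the task tables; the names `SCORES`, `best_profile`, `margin`, and `group_by_best` are illustrative only.

```python
PROFILES = ("BM25", "Dense", "Reranking hybrid")

# (BM25, Dense, Reranking hybrid) nDCG@10 per task, copied from the task tables.
SCORES = {
    "argu_ana_fa": (0.2860, 0.3287, 0.3128),
    "fever_fa": (0.8025, 0.8972, 0.8396),
    "fi_qa2018_fa": (0.2923, 0.3525, 0.3722),
    "hotpot_qa_fa": (0.7735, 0.8060, 0.8366),
    "miracl_fa": (0.4929, 0.6318, 0.5931),
    "msmarco_fa": (0.4737, 0.6139, 0.6119),
    "neu_clir2023_fas": (0.4336, 0.5766, 0.5595),
    "nq_fa": (0.4470, 0.5817, 0.5274),
    "persian_web_document": (0.6990, 0.7780, 0.7703),
    "quora_fa": (0.8393, 0.9122, 0.8861),
    "sci_fact_fa": (0.6294, 0.5610, 0.6100),
    "scidocs_fa": (0.1745, 0.1937, 0.2143),
    "syn_per_chatbot_ragfaq": (0.2882, 0.4304, 0.3826),
    "syn_per_qa": (0.8609, 0.9204, 0.9173),
    "treccovid_fa": (0.3519, 0.3594, 0.4161),
    "web_faq_fas": (0.8680, 0.8756, 0.9029),
    "wikipedia_multilingual_fa": (0.8934, 0.9007, 0.8958),
}


def best_profile(scores):
    """Name of the profile with the highest nDCG@10."""
    return PROFILES[max(range(len(PROFILES)), key=lambda i: scores[i])]


def margin(scores):
    """Gap between the best and second-best profile, rounded to 4 decimals."""
    top, runner_up = sorted(scores, reverse=True)[:2]
    return round(top - runner_up, 4)


def group_by_best(table):
    """Map each profile to the tasks it leads."""
    groups = {profile: [] for profile in PROFILES}
    for task, scores in table.items():
        groups[best_profile(scores)].append(task)
    return groups


groups = group_by_best(SCORES)
for profile, tasks in groups.items():
    print(profile, len(tasks), tasks)
# Small margins mark labels that a modest score change could flip.
print({task: margin(s) for task, s in SCORES.items() if margin(s) < 0.01})
```

Reading the margin next to the label helps avoid over-interpreting a best-profile entry when two profiles are nearly tied, as on msmarco_fa.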

Training and Leakage Notes

Useful training data includes Persian web search logs, MIRACL-style question-passage pairs, Persian FAQ retrieval, claim-evidence data, conversation-to-knowledge-base pairs, finance and scientific retrieval, and biomedical COVID literature search. Multi-positive tasks should preserve their qrel structure during training.

Exclude NanoFaMTEB-v2 evaluation queries, positives, qrels, translated test items, synthetic seeds, FAQ entries, news documents, and scientific abstracts. Synthetic data should preserve Persian script, morphology, right-to-left punctuation, named entities, and domain terminology, with hard negatives that share surface terms but fail the task relation.
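A minimal sketch of the text-level exclusion step follows. It assumes exact matching after a light normalization that maps Arabic yeh and kaf to their Persian counterparts, removes zero-width non-joiners, and collapses whitespace. The function names and example strings are hypothetical, and near-duplicate or translated leaks would need fuzzier matching than this.

```python
import unicodedata

# Assumed normalization choices for matching, not a full Persian normalizer.
_CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06a9",  # Arabic kaf -> Persian kaf
    "\u200c": None,      # drop zero-width non-joiner
})


def normalize_fa(text):
    """Normalize Persian text so trivially different spellings compare equal."""
    text = unicodedata.normalize("NFKC", text).translate(_CHAR_MAP)
    return " ".join(text.split())  # collapse runs of whitespace


def drop_leaked(train_texts, eval_texts):
    """Remove training texts that match any evaluation text after normalization."""
    blocked = {normalize_fa(text) for text in eval_texts}
    return [text for text in train_texts if normalize_fa(text) not in blocked]


# Invented example: one evaluation query and two training candidates.
eval_queries = ["پایتخت ایران کجاست؟"]
# The same question typed with Arabic yeh and kaf, which a raw string
# comparison would miss.
arabic_variant = eval_queries[0].replace("\u06cc", "\u064a").replace("\u06a9", "\u0643")
train_queries = [arabic_variant, "قیمت طلا امروز چقدر است؟"]

clean = drop_leaked(train_queries, eval_queries)
print(clean)
```

The same filter can be applied to positives, FAQ entries, and passages by passing those texts as `eval_texts`. The normalization is for matching only, so the retained training texts keep their original script and punctuation.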

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| FaMTEB: Massive Text Embedding Benchmark in Persian Language | 2025 | paper | https://arxiv.org/abs/2502.11571 |
| MTEB: Massive Text Embedding Benchmark | 2022 | paper | https://arxiv.org/abs/2210.07316 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 17 |
| Queries | 2,966 |
| Split-local documents | 161,314 |
| Positive qrels | 17,925 |
| Languages | fa |
| Categories | natural_language |
| Positives / query avg | 6.04 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| argu_ana_fa | NanoFaMTEB-v2 | fa | natural_language | 199 | 8,669 | 199 | 0.2860 | 0.3287 | 0.3128 | Dense |
| fever_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 229 | 0.8025 | 0.8972 | 0.8396 | Dense |
| fi_qa2018_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 534 | 0.2923 | 0.3525 | 0.3722 | Reranking hybrid |
| hotpot_qa_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 400 | 0.7735 | 0.8060 | 0.8366 | Reranking hybrid |
| miracl_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 427 | 0.4929 | 0.6318 | 0.5931 | Dense |
| msmarco_fa | NanoFaMTEB-v2 | fa | natural_language | 43 | 8,766 | 2,826 | 0.4737 | 0.6139 | 0.6119 | Dense |
| neu_clir2023_fas | NanoFaMTEB-v2 | fa | natural_language | 74 | 10,000 | 3,669 | 0.4336 | 0.5766 | 0.5595 | Dense |
| nq_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 251 | 0.4470 | 0.5817 | 0.5274 | Dense |
| persian_web_document | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 2,186 | 0.6990 | 0.7780 | 0.7703 | Dense |
| quora_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 570 | 0.8393 | 0.9122 | 0.8861 | Dense |
| sci_fact_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 5,183 | 225 | 0.6294 | 0.5610 | 0.6100 | BM25 |
| scidocs_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 986 | 0.1745 | 0.1937 | 0.2143 | Reranking hybrid |
| syn_per_chatbot_ragfaq | NanoFaMTEB-v2 | fa | natural_language | 200 | 8,696 | 200 | 0.2882 | 0.4304 | 0.3826 | Dense |
| syn_per_qa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 200 | 0.8609 | 0.9204 | 0.9173 | Dense |
| treccovid_fa | NanoFaMTEB-v2 | fa | natural_language | 50 | 10,000 | 4,623 | 0.3519 | 0.3594 | 0.4161 | Reranking hybrid |
| web_faq_fas | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 200 | 0.8680 | 0.8756 | 0.9029 | Reranking hybrid |
| wikipedia_multilingual_fa | NanoFaMTEB-v2 | fa | natural_language | 200 | 10,000 | 200 | 0.8934 | 0.9007 | 0.8958 | Dense |