MNanoBEIR / NanoBEIR-ar / NanoNQ

Overview

NanoBEIR-ar / NanoNQ is the Arabic NanoBEIR version of Natural Questions, the open-domain question answering benchmark introduced in Natural Questions: A Benchmark for Question Answering Research. Each query is an Arabic translated natural search question, and the retrieval target is an Arabic translated Wikipedia passage containing the answer evidence. The Nano task contains 50 queries, 5,035 passages, and 57 positive qrels. Most queries have a single positive, with a small multi-positive tail. The task tests whether a retriever can find relation-specific answer evidence for everyday factual questions. Dense retrieval gives the best nDCG@10, while reranking_hybrid gives the best hit@10 and Recall@100.

Details

What the Original Data Measures

Natural Questions was built from real Google search questions and Wikipedia evidence. Annotators selected long answers, often paragraphs or table regions, and short answers when available. This distinguishes the task from synthetic paragraph-based QA: the query is a naturally occurring information need, and the evidence passage may not be written in the same wording as the question.

The Arabic NanoBEIR version turns that setup into translated passage retrieval. The system does not output the answer string directly. It ranks passages, and a positive passage is one that contains the answer-bearing evidence needed by a downstream reader or QA model.

Observed Data Profile

The metadata records 50 queries, 5,035 documents, and 57 positive qrels. The average is 1.14 positives per query; 7 queries have multiple positives. Query text averages 40.16 characters, and passages average 447.30 characters. Examples ask where the Final Four is held, whether a film was originally a Disney production, why the Angel of the North statue is located where it is, where the Three-Fifths Compromise first appears in the U.S. Constitution, and who sings a song with Michael Jackson.

The task is mostly single-hop factual evidence retrieval. It is less multi-hop than HotpotQA and less noisy than MS MARCO, but it still requires matching a question relation to a passage containing the answer. Entity overlap is helpful but not sufficient.

BM25 Evaluation Profile

The BM25 candidate subset reaches nDCG@10 = 0.3555, hit@10 = 0.5800, and Recall@100 = 0.8772. BM25 can recover passages when questions contain names, titles, places, constitutional phrases, songs, films, or other exact lexical anchors. It is also stronger than dense on Recall@100 in this task, which shows that exact Arabic query terms and translated proper names still preserve important candidate coverage.

BM25's weakness is top-rank relation matching. A passage can share the entity or title but fail to answer the requested relation: location, date, authorship, reason, definition, or participant. This explains why BM25 coverage is useful but top-10 ranking is weaker than dense.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m reaches nDCG@10 = 0.4600, hit@10 = 0.6600, and Recall@100 = 0.8421. Dense retrieval is the best top-rank sorter among the single retrievers. It likely helps map Arabic questions to answer-bearing passages even when the wording differs between the question and the passage.

Dense retrieval's weakness is candidate coverage. It ranks strong positives high when it finds them, but its Recall@100 is lower than BM25 and hybrid. This suggests that some exact names, titles, or Arabic translated surface forms are lost in the embedding neighborhood.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 = 0.4247, hit@10 = 0.7200, and Recall@100 = 0.9123. Hybrid is not the best nDCG@10 signal because dense ranks the strongest positives slightly better, but it is the safest candidate source. It has the best hit@10 and Recall@100, and the metadata records 3 rows with the optional rank-101 safeguard.

For reranker experiments, hybrid is the preferred pool. It combines BM25's proper-name and title coverage with dense retrieval's relation matching. A reranker can then focus on choosing the passage that actually answers the question.

Metric Interpretation for Model Researchers

NanoNQ-ar separates final ranking quality from candidate coverage. Dense retrieval has the best nDCG@10, so it is strong at placing answer-bearing passages near the top. BM25 has better Recall@100 than dense, so lexical anchors remain important for the first-stage pool. Hybrid has the best hit@10 and Recall@100, meaning it is strongest for candidate generation and reranking.

A model that only improves dense-like top ranking but loses rare names or titles may be brittle. A model that only improves BM25-style coverage may still need reranking to select the answer-bearing relation. The ideal retriever combines exact entity preservation with semantic question-to-evidence matching.

Query and Relevance Type Tendencies

Queries are Arabic natural questions about people, places, films, songs, institutions, historical clauses, definitions, dates, and reasons. Relevant documents are Wikipedia-style passages containing the answer evidence. The question often names an entity but asks for a relation, such as who, where, when, why, or what distinction.

Lexical-heavy cases include exact titles and names. Dense-heavy cases include questions where the passage answers in explanatory form or uses different wording. Hybrid retrieval is strongest when the entity anchor and the relation must both be preserved.

Representative Failure Modes

BM25 can retrieve a passage about the named entity but miss the relation that answers the question. Dense retrieval can retrieve a semantically plausible page while losing an exact title, song name, constitutional phrase, or proper noun. Both methods can confuse nearby works, people, or events. Hard negatives should therefore include same-entity wrong-relation passages and same-title neighboring pages.

Arabic-Specific Notes

Arabic NQ retrieval must handle translated Wikipedia style, proper nouns, transliterated titles, named works, constitutional terms, and question words. Sparse retrieval benefits from analyzers that preserve names and titles. Dense retrieval needs Arabic factual QA coverage so it can connect question phrasing to answer-bearing evidence. Small differences in names, articles, and transliterations can determine whether the correct passage appears in the candidate pool.

Training and Leakage Notes

Training should exclude NQ, BEIR, or NanoBEIR records likely to overlap with these evaluation questions or evidence passages. Useful non-overlapping data includes Natural Questions training examples, Arabic or multilingual open-domain QA retrieval pairs, Wikipedia question-to-passage supervision, and KILT-style question-to-Wikipedia evidence pairs.

Model Improvement Hints

The main improvement target is relation-specific evidence selection. First-stage retrievers should preserve entity and title anchors while using dense similarity to match the requested relation. Rerankers should be trained on same-entity wrong-answer passages, because those are the candidates most likely to fool a topical retriever.

Training Data That May Help

Useful training data includes non-overlapping NQ examples, Arabic Wikipedia QA, multilingual open-domain QA retrieval, KILT-style evidence retrieval, and synthetic fact questions over Wikipedia-style passages with hard negatives from the same entity neighborhood.

Synthetic Data Guidance

Generate Arabic natural search-style factual questions from non-evaluation Wikipedia passages. Cover who, where, when, why, what, how long, definition, release-date, location, author, participant, and relation questions. Positives should contain the requested answer evidence, not just the same entity.

Example Data

Query	Positive document
أين تقام أربع النهائي هذا العام؟ [32 chars]	بطولة كرة السلة للرجال في القسم الأول من الاتحاد الوطني للرياضات الجامعية لعام 2018 كانت بطولة إقصاء فردي تضم 68 فريقًا لتحديد بطل كرة السلة الجامعية للرجال في القسم الأول للاتحاد الوطني للرياضات الجامعية للعام الدراسي 2017–18. بدأت النسخة الثمانين من البطولة في 13 مارس 2018 وانتهت بمباراة النهائي في 2 أبريل في ملعب ألامودوم في سان أنطونيو، تكساس. [349 chars]
هل كان فيلم "ليلة مرعبة قبل عيد الميلاد" من إنتاج ديزني في البداية؟ [67 chars]	بدأ فكرة فيلم ليلة مرعبة في عيد الميلاد بقصيدة كتبها تيم بورتون في عام 1982، وهو يعمل كرسام متحرك في استوديوهات والت ديزني للرسوم المتحركة. بعد نجاح فيلم فينسنت في نفس العام، بدأت استوديوهات والت ديزني في التفكير في تطوير فيلم ليلة مرعبة في عيد الميلاد إما كفيلم قصير أو برنامج تلفزيوني خاص لمدة 30 دقيقة. عبر السنوات، استمر بورتون في العودة إلى الفكرة، وفي عام 1990، أبرم صفقة تطوير مع ديزني. بدأت الإنتاج في يوليو 1991 في سان فرانسيسكو؛ أصدرت ديزني الفيلم تحت لواء شركة تاتشستون بيكتشرز لأن الاستوديو اعتقد أن الفيلم سيكون "مظلمة ومخيفة جدًا للأطفال".[4] [556 chars]
لماذا يوجد تمثال الملاك الشمالي هناك؟ [37 chars]	وفقًا لـ غورملي، كان لمعنى التمثال الملاك ثلاثة جوانب: أولاً، لإظهار أن تحت موقع بنائه عمل عمال المناجم على مدى قرنين؛ ثانيًا، لاستيعاب الانتقال من عصر صناعي إلى عصر المعلومات؛ وثالثًا، لأن يكون مركزًا لأملنا وخوفنا المتطور. [224 chars]

Source Reference Table

Title	Year	Type	URL
Natural Questions: A Benchmark for Question Answering Research	2019	task paper	https://aclanthology.org/Q19-1026/
Google Research Natural Questions publication page		project page	https://research.google/pubs/natural-questions-a-benchmark-for-question-answering-research/
BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models	2021	benchmark paper	https://arxiv.org/abs/2104.08663
MMTEB: Massive Multilingual Text Embedding Benchmark	2025	benchmark paper	https://arxiv.org/abs/2502.13595
NanoBEIR: Smaller BEIR dataset subsets	2024	dataset collection	https://huggingface.co/collections/zeta-alpha-ai/nanobeir

Dataset Information

Field	Value
Nano set	MNanoBEIR
Backing dataset	NanoBEIR-ar
Task / split	NanoNQ
Hugging Face dataset	hakari-bench/NanoBEIR-ar
Language	ar
Category	natural_language
Queries	50
Documents	5,035
Positive qrels	57
Positives / query avg	1.14
Positives / query min	1
Positives / query median	1.00
Positives / query max	2
Multi-positive queries	7 (14.00%)
Query length avg chars	40.16
Document length avg chars	447.30

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.3555	0.5800	0.8772	top-500
Dense	`harrier_oss_v1_270m`	0.4600	0.6600	0.8421	top-500
Reranking hybrid	`reranking_hybrid`	0.4247	0.7200	0.9123	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: MNanoBEIR Arabic NanoBEIR task split from hakari-bench/NanoBEIR-ar
Train/eval overlap audit: not_audited
Leakage note: prefer excluding NQ, BEIR, or NanoBEIR records likely to overlap with these evaluation questions or evidence passages
Multi-positive training: single_positive_question_document_focus
Useful training data: non-overlapping Natural Questions train examples, Arabic or multilingual open-domain QA retrieval pairs, Wikipedia question-to-passage evidence supervision, KILT-style question-to-Wikipedia evidence pairs