HAKARI-Bench

NanoMIRACL / bn

Overview

NanoMIRACL / bn is the Bengali split of the MIRACL-style multilingual monolingual retrieval benchmark. Bengali queries retrieve Bengali Wikipedia passages, not translated evidence. The Nano split has 200 queries, 10,000 documents, and 407 positive qrel rows. More than half of the queries have multiple positives, so the task measures retrieval of a small answer-bearing passage set rather than a single canonical passage. Current diagnostics show dense retrieval as the strongest top-rank profile, reranking_hybrid as the strongest recall profile, and BM25 as a useful but more brittle Bengali lexical baseline.

Details

What the Original Data Measures

MIRACL was introduced as a multilingual ad hoc retrieval benchmark across many languages. Its design is monolingual: Bengali queries retrieve Bengali passages from Bengali Wikipedia. The benchmark emphasizes native-language queries, Wikipedia passage evidence, and manual relevance judgments.

For Bengali, this means the model must handle native Bengali question wording, named entities, inflected forms, compound expressions, and encyclopedic passage structure. The retrieval unit is a passage rather than a full article, so a system must choose the passage that contains the requested definition, count, date, location, person relation, country fact, or other answer evidence.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 407 positive qrel rows. Positives per query average 2.04, with a minimum of 1, a median of 2, and a maximum of 8. There are 110 multi-positive queries, representing 55.0 percent of the split. Queries average 47.23 characters, while documents average 446.23 characters.

The examples are Bengali fact questions, often entity-first rather than question-word-first. Many queries introduce the topic or entity and then ask for কী, কোথায়, কত, কোন, or কবে. Observed topics include religious concepts, Bangladeshi and Indian public figures, rivers, prisons, country statistics, software support, literary works, films, sports organizations, and historical entities.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.5033, hit@10 = 0.7800, and recall@100 = 0.9582. BM25 is useful because many Bengali queries contain distinctive names, titles, locations, or rare terms that also appear in relevant Wikipedia passages. Examples include entity names, river names, book or film titles, and software or organization names.

The sparse profile is substantially weaker than dense retrieval at the top of the ranking. A common difficulty is entity-neighbor confusion: a query may contain a person, place, or concept name that appears in many related passages, but only one passage states the requested attribute. BM25 can also over-rank a specific article about a related subtopic when the correct answer requires a broader passage, such as a general article passage containing a count or definition.

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.7661, hit@10 = 0.9450, and recall@100 = 0.9484. Dense retrieval is the strongest observed profile by nDCG@10 and hit@10. It appears to handle Bengali question intent more effectively than BM25, especially when the wording of the passage does not exactly repeat the query.

This is an important Bengali MIRACL pattern. Many questions are short but relation-heavy: they ask for an attribute of an entity, not just any passage that mentions the entity. Dense retrieval helps connect the semantic intent of the question to answer-bearing passages about locations, dates, counts, roles, definitions, or biographical facts. Its recall@100 is slightly lower than BM25 and hybrid retrieval, so the main strength is top-rank ordering rather than complete positive-set coverage.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains exactly 100 candidates per query, with no safeguard rows. It achieves nDCG@10 = 0.6537, hit@10 = 0.9350, and recall@100 = 0.9975. Hybrid retrieval is not the best top-10 ranking profile, but it is the strongest candidate-generation profile by recall@100.

This profile emulates the benefit of combining lexical and dense retrieval. BM25 contributes exact Bengali names and rare surface terms, while dense retrieval contributes semantic matching for answer relations. The combined candidate set preserves almost all judged positives for downstream reranking, but dense retrieval alone ranks the best evidence passages higher in the observed top-10 metrics.

Metric Interpretation for Model Researchers

This task is multi-positive for more than half of the queries. Hit@10 measures whether at least one relevant passage appears near the top. nDCG@10 rewards placing relevant passages high, and recall@100 measures how much of the judged positive set remains available for reranking.

The metric pattern is therefore diagnostic. Dense retrieval is currently best for top-rank quality, reranking_hybrid is best for positive coverage, and BM25 remains useful for exact Bengali lexical anchors. A strong Bengali retriever should ideally combine dense semantic intent matching with preservation of rare names, titles, and numeric or location-bearing passages.

Query and Relevance Type Tendencies

Queries tend to be compact Bengali information needs about people, places, religion, history, geography, organizations, software, and cultural works. They often place the entity before the interrogative expression, so models must parse the relation requested by the later part of the question.

Relevant documents are Bengali Wikipedia passages with article-title context and answer-bearing prose. The task rewards systems that can distinguish a passage that merely mentions the entity from a passage that states the queried attribute, such as a birthplace, father name, location, number, definition, religious classification, or historical date.

Representative Failure Modes

BM25 can retrieve the wrong member of an entity family when the surface form is shared by several passages. For example, a question about the father of a person can retrieve passages about other people with overlapping names. A question about the location of Lalon's akhra can retrieve passages about related singers, films, or cultural works before the location-bearing passage. A question about the number of suras in the Qur'an can retrieve individual sura pages above a general Qur'an passage.

Dense retrieval can fail by choosing a semantically related passage that lacks the exact requested attribute. Hybrid retrieval reduces missing positives but still depends on a reranker to choose the passage with the best answer evidence.

Training Data That May Help

Useful training data includes non-overlapping MIRACL Bengali training data, native Bengali Wikipedia question-to-passage pairs, Bengali entity-attribute QA evidence retrieval, and hard negatives from related Bengali Wikipedia pages. Synthetic data can also help when it produces Bengali Wikipedia-style passages and fact questions with both topic-first and question-word-first forms.

Comparable evaluation should avoid upstream development or test data, or other MIRACL-derived data likely to overlap with the NanoMIRACL evaluation questions and passages.

Model Improvement Notes

Dense retrievers should improve Bengali relation matching while preserving exact entity, title, and numeric signals. Sparse systems benefit from better Bengali tokenization and normalization, especially for compound forms and variant surface expressions. Rerankers should explicitly combine exact-entity evidence with semantic answer relation matching.

For hybrid systems, NanoMIRACL / bn supports using reranking_hybrid as a high-recall candidate stage followed by a stronger reranker. The dense baseline shows that Bengali semantic matching can be very effective at top ranks, while the hybrid profile shows that lexical evidence remains important for complete candidate coverage.

Example Data

QueryPositive document
শ্রীনিবাস রামানুজনের বাবার নাম কি ছিল ? [39 chars]শ্রীনিবাস রামানুজন রামানুজন ১৮৮৭ খ্রিস্টাব্দের ২২ ডিসেম্বর প্রাচীন ভারতের মাদ্রাজ প্রদেশের তাঞ্জোর জেলার ইরেভদ শহরের এক দরিদ্র ব্রাহ্মণ পরিবারে জন্মগ্রহণ করেন। তাঁর পিতা "কে শ্রীনিবাস ইয়েঙ্গার" ছিলেন শহরের একটি কাপড়ের দোকানের হিসাব রক্ষক। তাঁর মা "কোমালাটাম্মাল" একজন গৃহিনী ছিলেন এবং একটি স্থানীয় মন্দিরে গান গাইতেন। তিনি ছিলেন তীক্ষ্ম বুদ্ধিসম্পন্ন মহিলা। প্রচলিত আছে যে, রামানুজনের মায়ের বিয়ের পর বেশ কয়েকবছর কোন সন্তান না হওয়ায়, রামানুজনের মাতামহ "নামাক্কল" শহরের বিখ্যাত নামগিরি দেবীর নিকট নিজ কন্যা সন্তানের জন্য প্রার্থনা করেন। এরপরই জ্যেষ্ঠ সন্তান রামানুজন জন্মগ্রহণ করেন। [589 chars]
জে কে রাউলিং রচিত হ্যারি পটার উপন্যাসের প্রকাশক কে ? [52 chars]হ্যারি পটার এই বইয়ের সাফল্য রাউলিংকে ইতিহাসে সবচেয়ে বেশী উপার্জন করা লেখকের তালিকায় শীর্ষস্থান দিয়েছে। বইগুলোর ইংরেজি সংস্করণণ প্রকাশ করে ব্লুমসবারি যুক্তরাজ্যে, স্কলাস্টিক প্রেস যুক্তরাষ্ট্রে, অ্যালেন ও আনউইন অস্ট্রেলিয়ায় ও রেইনকোস্ট বুকস কানাডায়। [256 chars]
খেজুর গাছে খেজুর ফল আসতে কতদিন সময় লাগে ? [41 chars]খেজুর গাছে ফল উৎপাদনের জন্য সচরাচর ৪ থেকে ৮ বছর পর্যন্ত অপেক্ষা করতে হয়। তবে বাণিজ্যিকভাবে ফসল উৎপাদন উপযোগী খেজুর গাছে ফল আসতে ৭ থেকে ১০ বছর সময় লেগে যায়। পূর্ণাঙ্গ খেজুর গাছে প্রতি মৌসুমে গড়ে ৮০-১২০ কিলোগ্রাম (১৭৬-২৬৪ পাউন্ড) ফল পাওয়া যায়। বাজারজাতকরণের উপযোগী ফল আহরণের জন্য ফলের শাখাকে পাতলা হতে হয়। এর ফলে ফলগুচ্ছ নুইয়ে পড়ে, নীচের ফলগুলো বড় আকারের হয় এবং বাজে আবহাওয়া ও পাখির আক্রমণ থেকে রক্ষা পায়। [417 chars]

Source Reference Table

TitleYearTypeURL
Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages2022paperhttps://arxiv.org/abs/2210.09984
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages2023paperhttps://aclanthology.org/2023.tacl-1.63/
MIRACL GitHub repositoryproject repositoryhttps://github.com/project-miracl/miracl
miracl/miracl-corpusdataset cardhttps://huggingface.co/datasets/miracl/miracl-corpus

Dataset Information

FieldValue
Nano setNanoMIRACL
Backing datasetNanoMIRACL
Task / splitbn
Hugging Face datasethakari-bench/NanoMIRACL
Languagebn
Categorynatural_language
Queries200
Documents10,000
Positive qrels407
Positives / query avg2.04
Positives / query min1
Positives / query median2.00
Positives / query max8
Multi-positive queries110 (55.00%)
Query length avg chars47.23
Document length avg chars446.23

Candidate Subsets

ProfileConfignDCG@10Hit@10Recall@100Candidates
BM25bm250.50330.78000.9582top-500
Denseharrier_oss_v1_270m0.76610.94500.9484top-500
Reranking hybridreranking_hybrid0.65370.93500.9975top-100

Training and Leakage Metadata