HAKARI-Bench

NanoIndicQA

Overview

NanoIndicQA is a language-specific Nano benchmark for IndicQA retrieval. It covers eleven Indic language splits: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Each split turns an IndicQA reading-comprehension example into retrieval: the query is a question in the target language, and the positive document is the context paragraph containing the evidence needed to answer it.

The group is useful as a controlled multilingual passage-selection benchmark. All languages share the same retrieval shape, so differences mainly reflect script, morphology, paragraph length, named entities, and model coverage for Indic languages. BM25 shows how far exact same-language term matching goes, dense retrieval tests semantic passage matching within each language and script, and reranking_hybrid shows whether sparse and dense candidates complement each other in small paragraph pools.

What This Group Measures

The paper "Towards Leaving No Indic Language Behind" introduces IndicXTREME and includes IndicQA as a manually curated reading-comprehension benchmark for Indic languages. The retrieval version uses each question as a query and the original context paragraph as the relevant document. NanoIndicQA keeps this setup in compact per-language corpora.

The group measures same-language evidence paragraph retrieval. It is not answer-string extraction and not cross-lingual retrieval. A model must retrieve the supporting paragraph in the same Indic language as the query.
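The conversion from reading comprehension to retrieval can be illustrated with a minimal sketch. The field names `question` and `context`, the `q0`/`d0` identifier scheme, and the toy Hindi examples are assumptions made for illustration; they are not taken from the IndicQA or NanoIndicQA release.

```python
from collections import defaultdict

def to_retrieval(examples):
    """Turn QA examples into queries, a deduplicated corpus, and qrels."""
    queries, corpus, doc_ids = {}, {}, {}
    qrels = defaultdict(dict)
    for i, ex in enumerate(examples):
        qid = f"q{i}"
        context = ex["context"]  # hypothetical field name
        # Identical paragraphs share one document id, so several
        # questions can point at the same positive paragraph.
        if context not in doc_ids:
            doc_ids[context] = f"d{len(doc_ids)}"
            corpus[doc_ids[context]] = context
        queries[qid] = ex["question"]  # hypothetical field name
        qrels[qid][doc_ids[context]] = 1
    return queries, corpus, dict(qrels)

# Toy examples, not real benchmark data.
examples = [
    {"question": "भारत की राजधानी क्या है?", "context": "नई दिल्ली भारत की राजधानी है।"},
    {"question": "नई दिल्ली किस देश में है?", "context": "नई दिल्ली भारत की राजधानी है।"},
    {"question": "गंगा कहाँ से निकलती है?", "context": "गंगा नदी गंगोत्री हिमनद से निकलती है।"},
]
queries, corpus, qrels = to_retrieval(examples)
print(len(queries), len(corpus))  # 3 2
```

The target stays the whole paragraph: the answer span is never used, which matches the evidence-paragraph framing described above.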

Task Families

Dataset Shape

NanoIndicQA contains 11 task pages, 2,200 queries, 2,759 split-local documents, and 2,205 positive qrel rows. Every language has exactly 200 queries. The document pools are small, from 241 to 261 context paragraphs per language. The group is nearly single-positive: most queries have exactly one positive paragraph, and five splits (Bengali, Gujarati, Hindi, Odia, and Tamil) each include one query with two positives.
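The group totals follow directly from the per-language counts in the Language Summary table. The short check below copies those counts and recomputes the totals; the variable names are illustrative only.

```python
# (queries, docs, positives) per language, copied from the Language Summary table.
counts = {
    "as": (200, 250, 200), "bn": (200, 250, 201), "gu": (200, 248, 201),
    "hi": (200, 261, 201), "kn": (200, 257, 200), "ml": (200, 247, 200),
    "mr": (200, 250, 200), "or": (200, 252, 201), "pa": (200, 241, 200),
    "ta": (200, 253, 201), "te": (200, 250, 200),
}

total_queries = sum(q for q, _, _ in counts.values())
total_docs = sum(d for _, d, _ in counts.values())
total_positives = sum(p for _, _, p in counts.values())
print(total_queries, total_docs, total_positives)  # 2200 2759 2205

# Splits with more positive qrel rows than queries.
multi_positive = [lang for lang, (q, _, p) in counts.items() if p > q]
print(multi_positive)  # ['bn', 'gu', 'hi', 'or', 'ta']

# Average positives per query, rounded as in the metadata summary.
print(round(total_positives / total_queries, 2))  # 1.0
```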

Query and document length vary by language. Malayalam has the longest average query length, while Telugu and Hindi have especially long context paragraphs. Odia and Kannada have shorter average documents. Because the document pools are small, top-rank ordering is often more informative than broad candidate recall.
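To see why top-rank ordering dominates in a near single-positive setting, here is a minimal sketch of nDCG@10 with binary relevance. It assumes the standard log2 discount; the exact evaluation code behind the reported scores is not shown in this page, so treat this as an illustration rather than the benchmark's implementation.

```python
import math

def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k for one query."""
    # Discounted gain of every positive found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    # Ideal ordering places all positives first.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# With a single positive, the score depends only on its rank.
print(ndcg_at_k(["p", "x", "y"], {"p"}))  # 1.0
print(ndcg_at_k(["x", "y", "p"], {"p"}))  # 0.5
```

With one positive per query, moving the evidence paragraph from rank 1 to rank 3 halves the score, and a positive outside the top 10 scores zero even though it sits in a pool of only about 250 paragraphs.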

Retrieval Behavior

BM25 Profile

BM25 is strong when the question repeats distinctive names, places, dates, titles, or entity phrases from the evidence paragraph. Telugu, Bengali, Malayalam, Assamese, Gujarati, Odia, and Punjabi all show useful sparse signal in the current metadata. This reflects the same-language design: there is no translation step, and exact terms can point directly to the paragraph.

BM25 is weaker for Tamil, Hindi, Kannada, and Marathi in the current metadata. These failures often arise when the question wording differs from the paragraph or when a short question does not provide enough exact anchors. Sparse retrieval can also be affected by tokenizer quality for each script.
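The dependence on exact anchors can be made concrete with a small BM25 sketch. This is a toy under stated assumptions: whitespace tokenization, the common defaults `k1=1.2` and `b=0.75`, and a non-negative IDF variant. The tokenizer and parameters used for the reported BM25 scores are not given in this page, and the Hindi sentences are invented examples.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with a basic BM25."""
    tokenized = [d.split() for d in docs]  # assumption: whitespace tokens
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.split():
            if term not in tf:
                continue  # only exact token matches contribute
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "ताजमहल आगरा में स्थित है",
    "कुतुब मीनार दिल्ली में स्थित है",
    "गंगा नदी गंगोत्री से निकलती है",
]
scores = bm25_scores("ताजमहल कहाँ स्थित है", docs)
print(scores.index(max(scores)))  # 0
```

A query that shares no token with any paragraph scores zero everywhere, which is the failure mode described above for short or reworded questions. Punctuation attached to a token, or a different inflected form, has the same effect under this naive tokenizer.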

Dense Profile

Dense retrieval is the best profile for nine of the eleven NanoIndicQA languages. It improves paragraph matching when the question and evidence express the same fact with different wording. This is especially visible for Tamil, Kannada, Marathi, and Hindi, where dense nDCG@10 exceeds BM25 by roughly 0.20 to 0.35. Malayalam, Odia, Gujarati, Assamese, and Bengali show smaller gains of about 0.08 to 0.17.

Dense retrieval should still be evaluated per language. A model may have strong general Indic representation for one script but weaker coverage for another. Dense gains are most meaningful when they improve semantic matching without losing named entities and local script forms.
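As a rough illustration of the dense profile, the sketch below ranks paragraphs by cosine similarity between a query vector and paragraph vectors. The three-dimensional vectors are made-up placeholders standing in for the output of an embedding model; no specific model or library is implied.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical embeddings; a real run would encode text with a model.
doc_vecs = {
    "d0": [0.9, 0.1, 0.0],
    "d1": [0.1, 0.8, 0.3],
    "d2": [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]
ranking = rank_by_cosine(query_vec, doc_vecs)
print([doc_id for doc_id, _ in ranking])  # ['d0', 'd1', 'd2']
```

Running the same ranking separately for each language split, rather than pooling all splits, keeps per-script weaknesses visible.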

Reranking Hybrid Profile

reranking_hybrid is the best nDCG@10 profile for only one split in this group, but it is often close to dense. Punjabi is the only hybrid-led split in the current metadata; in Telugu the hybrid profile beats dense but still trails BM25. The hybrid view is useful when exact entity anchors and semantic paragraph matching recover different candidates, but the small document pools mean dense retrieval often has enough coverage by itself.

For reranking, this group is a clean same-language passage benchmark: the key question is whether first-stage retrieval places the evidence paragraph near the top, not whether it searches a huge web-scale corpus.
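This page does not describe how reranking_hybrid merges sparse and dense candidates. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common way to combine two ranked lists; the constant `k=60` is a conventional choice, and the document ids are invented. It should be read as an illustration of candidate complementarity, not as the benchmark's method.

```python
def rrf(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_top = ["d3", "d1", "d7"]   # hypothetical sparse candidates
dense_top = ["d1", "d9", "d3"]  # hypothetical dense candidates
fused = rrf([bm25_top, dense_top])
print(fused)  # ['d1', 'd3', 'd9', 'd7']
```

Documents found by both retrievers rise to the top, while `d7` and `d9` show the candidates that only one retriever recovered. In a pool of about 250 paragraphs, that second group tends to be small, which is consistent with hybrid staying close to dense.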

Language Summary

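| Language | Task | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Assamese | as | 200 | 250 | 200 | 0.6111 | 0.7416 | 0.7283 | Dense |
| Bengali | bn | 200 | 250 | 201 | 0.6971 | 0.7773 | 0.7460 | Dense |
| Gujarati | gu | 200 | 248 | 201 | 0.6060 | 0.7487 | 0.7207 | Dense |
| Hindi | hi | 200 | 261 | 201 | 0.4545 | 0.6511 | 0.5738 | Dense |
| Kannada | kn | 200 | 257 | 200 | 0.4730 | 0.7037 | 0.6111 | Dense |
| Malayalam | ml | 200 | 247 | 200 | 0.6528 | 0.8214 | 0.7807 | Dense |
| Marathi | mr | 200 | 250 | 200 | 0.4612 | 0.6720 | 0.5916 | Dense |
| Odia | or | 200 | 252 | 201 | 0.6041 | 0.7605 | 0.7033 | Dense |
| Punjabi | pa | 200 | 241 | 200 | 0.5983 | 0.6445 | 0.6885 | Reranking hybrid |
| Tamil | ta | 200 | 253 | 201 | 0.2932 | 0.6415 | 0.4551 | Dense |
| Telugu | te | 200 | 250 | 200 | 0.7674 | 0.7186 | 0.7582 | BM25 |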

Interpretation Notes for Model Researchers

NanoIndicQA is a controlled way to compare Indic-language passage retrieval because all tasks share the same basic structure. Language-level differences should therefore be interpreted through script coverage, tokenizer behavior, paragraph length, and training data availability rather than task-family differences.

The dense-versus-BM25 profile is especially important. Dense-led splits show where semantic passage matching helps beyond repeated terms. BM25-led or BM25-competitive splits, such as Telugu, show where exact names and local orthography remain central. Tamil is a useful stress case because dense retrieval more than doubles BM25 in the current metadata (0.6415 versus 0.2932 nDCG@10).
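The profile comparison can be reproduced from the Language Summary scores. The sketch below copies the nDCG@10 values from that table and derives the best profile and the dense-minus-BM25 gap per language; the variable names are illustrative.

```python
# (BM25, Dense, Reranking hybrid) nDCG@10, copied from the Language Summary table.
ndcg = {
    "as": (0.6111, 0.7416, 0.7283), "bn": (0.6971, 0.7773, 0.7460),
    "gu": (0.6060, 0.7487, 0.7207), "hi": (0.4545, 0.6511, 0.5738),
    "kn": (0.4730, 0.7037, 0.6111), "ml": (0.6528, 0.8214, 0.7807),
    "mr": (0.4612, 0.6720, 0.5916), "or": (0.6041, 0.7605, 0.7033),
    "pa": (0.5983, 0.6445, 0.6885), "ta": (0.2932, 0.6415, 0.4551),
    "te": (0.7674, 0.7186, 0.7582),
}
profiles = ("BM25", "Dense", "Reranking hybrid")

# Best profile per language: the column with the highest score.
best_profile = {
    lang: profiles[max(range(3), key=lambda i: row[i])]
    for lang, row in ndcg.items()
}
# Dense minus BM25; negative values mark BM25-led splits.
dense_gain = {lang: round(row[1] - row[0], 4) for lang, row in ndcg.items()}

for lang in sorted(dense_gain, key=dense_gain.get, reverse=True):
    print(lang, dense_gain[lang], best_profile[lang])
```

Sorting by the gap puts Tamil first and Telugu last, which matches the stress case and the BM25-led split discussed above.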

Training and Leakage Notes

Useful training data includes non-overlapping IndicQA-style question-context pairs, same-language Wikipedia passage retrieval, extractive QA in each language, and hard negatives from related biographies, places, events, or cultural topics. Training should keep the target as the full evidence paragraph, not only the answer span.

Exclude NanoIndicQA evaluation questions, positive paragraphs, qrels, and direct translations or paraphrases of them. Upstream IndicQA and MTEB retrieval splits should be audited for overlap before training.
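A first-pass overlap audit can be sketched as an exact-match check after Unicode and whitespace normalization. The function names and toy strings are hypothetical. This only catches identical or near-identical text; translations and paraphrases need a separate check, for example similarity search followed by manual review.

```python
import unicodedata

def normalize(text):
    """Canonicalize Unicode form and collapse whitespace before comparing."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.split()).casefold()

def find_overlap(train_texts, eval_texts):
    """Return training texts that also appear in the evaluation set."""
    eval_keys = {normalize(t) for t in eval_texts}
    return [t for t in train_texts if normalize(t) in eval_keys]

# Toy strings, not real benchmark data.
eval_questions = ["ताजमहल कहाँ है?"]
train_questions = ["  ताजमहल  कहाँ है? ", "कुतुब मीनार कहाँ है?"]
leaked = find_overlap(train_questions, eval_questions)
print(len(leaked))  # 1
```

The same check applies to positive paragraphs: run it once on questions and once on contexts, and drop any training pair where either side matches.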

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| Towards Leaving No Indic Language Behind | 2022 | paper | https://arxiv.org/abs/2212.05409 |
| MTEB: Massive Text Embedding Benchmark | 2022 | paper | https://arxiv.org/abs/2210.07316 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 11 |
| Queries | 2,200 |
| Split-local documents | 2,759 |
| Positive qrels | 2,205 |
| Languages | as, bn, gu, hi, kn, ml, mr, or, pa, ta, te |
| Categories | natural_language |
| Positives / query avg | 1.00 |

Task Metadata Summary

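| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| as | NanoIndicQA | as | natural_language | 200 | 250 | 200 | 0.6111 | 0.7416 | 0.7283 | Dense |
| bn | NanoIndicQA | bn | natural_language | 200 | 250 | 201 | 0.6971 | 0.7773 | 0.7460 | Dense |
| gu | NanoIndicQA | gu | natural_language | 200 | 248 | 201 | 0.6060 | 0.7487 | 0.7207 | Dense |
| hi | NanoIndicQA | hi | natural_language | 200 | 261 | 201 | 0.4545 | 0.6511 | 0.5738 | Dense |
| kn | NanoIndicQA | kn | natural_language | 200 | 257 | 200 | 0.4730 | 0.7037 | 0.6111 | Dense |
| ml | NanoIndicQA | ml | natural_language | 200 | 247 | 200 | 0.6528 | 0.8214 | 0.7807 | Dense |
| mr | NanoIndicQA | mr | natural_language | 200 | 250 | 200 | 0.4612 | 0.6720 | 0.5916 | Dense |
| or | NanoIndicQA | or | natural_language | 200 | 252 | 201 | 0.6041 | 0.7605 | 0.7033 | Dense |
| pa | NanoIndicQA | pa | natural_language | 200 | 241 | 200 | 0.5983 | 0.6445 | 0.6885 | Reranking hybrid |
| ta | NanoIndicQA | ta | natural_language | 200 | 253 | 201 | 0.2932 | 0.6415 | 0.4551 | Dense |
| te | NanoIndicQA | te | natural_language | 200 | 250 | 200 | 0.7674 | 0.7186 | 0.7582 | BM25 |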