NanoIndicQA

Overview

NanoIndicQA is a language-specific Nano benchmark for IndicQA retrieval. It covers eleven Indic language splits: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Each split turns an IndicQA reading-comprehension example into retrieval: the query is a question in the target language, and the positive document is the context paragraph containing the evidence needed to answer it.

The group is useful as a controlled multilingual passage-selection benchmark. All languages share the same retrieval shape, so differences mainly reflect script, morphology, paragraph length, named entities, and model coverage for Indic languages. BM25 shows how far exact same-language term matching goes, dense retrieval tests cross-script semantic passage matching, and reranking_hybrid shows whether sparse and dense candidates complement each other in small paragraph pools.

What This Group Measures

Towards Leaving No Indic Language Behind introduces IndicXTREME and includes IndicQA as a manually curated reading-comprehension benchmark for Indic languages. The retrieval version uses each question as a query and the original context paragraph as the relevant document. NanoIndicQA keeps this setup in compact per-language corpora.

The group measures same-language evidence paragraph retrieval. It is not answer-string extraction and not cross-lingual retrieval. A model must retrieve the supporting paragraph in the same Indic language as the query.

Task Families

Same-language QA evidence retrieval: all 11 tasks use question-to-context paragraph retrieval.
Eastern Indo-Aryan scripts: Assamese, Bengali, and Odia test related but distinct scripts and orthographic conventions.
Western and northern Indo-Aryan scripts: Gujarati, Hindi, Marathi, and Punjabi test different scripts, morphology, and named-entity patterns.
Dravidian scripts: Kannada, Malayalam, Tamil, and Telugu test non-Indo- Aryan languages with longer paragraph evidence in several splits.

Dataset Shape

NanoIndicQA contains 11 task pages, 2,200 queries, 2,759 split-local documents, and 2,205 positive qrel rows. Every language has exactly 200 queries. The document pools are small, roughly 241 to 261 context paragraphs per language. The group is nearly single-positive: most queries have exactly one positive paragraph, and only a few splits include one query with two positives.

Query and document length vary by language. Malayalam has the longest average query length, while Telugu and Hindi have especially long context paragraphs. Odia and Kannada have shorter average documents. Because the document pools are small, top-rank ordering is often more informative than broad candidate recall.

Retrieval Behavior

BM25 Profile

BM25 is strong when the question repeats distinctive names, places, dates, titles, or entity phrases from the evidence paragraph. Telugu, Bengali, Malayalam, Assamese, Gujarati, Odia, and Punjabi all show useful sparse signal in the current metadata. This reflects the same-language design: there is no translation step, and exact terms can point directly to the paragraph.

BM25 is weaker for Tamil, Hindi, Kannada, and Marathi in the current metadata. These failures often arise when the question wording differs from the paragraph or when a short question does not provide enough exact anchors. Sparse retrieval can also be affected by tokenizer quality for each script.

Dense Profile

Dense retrieval is the best profile for most NanoIndicQA languages. It improves paragraph matching when the question and evidence express the same fact with different wording. This is especially visible for Tamil, Kannada, Hindi, Marathi, Bengali, Gujarati, Odia, and Malayalam.

Dense retrieval should still be evaluated per language. A model may have strong general Indic representation for one script but weaker coverage for another. Dense gains are most meaningful when they improve semantic matching without losing named entities and local script forms.

Reranking Hybrid Profile

reranking_hybrid is rarely the best nDCG@10 profile in this group, but it is often close to dense. Punjabi is the main hybrid-led split in the current metadata. The hybrid view is useful when exact entity anchors and semantic paragraph matching recover different candidates, but the small document pools mean dense retrieval often has enough coverage by itself.

For reranking, this group is a clean same-language passage benchmark: the key question is whether first-stage retrieval places the evidence paragraph near the top, not whether it searches a huge web-scale corpus.

Language Summary

Language	Task	Queries	Docs	Positives	BM25 nDCG@10	Dense nDCG@10	Reranking hybrid nDCG@10	Best profile
Assamese	as	200	250	200	0.6111	0.7416	0.7283	Dense
Bengali	bn	200	250	201	0.6971	0.7773	0.7460	Dense
Gujarati	gu	200	248	201	0.6060	0.7487	0.7207	Dense
Hindi	hi	200	261	201	0.4545	0.6511	0.5738	Dense
Kannada	kn	200	257	200	0.4730	0.7037	0.6111	Dense
Malayalam	ml	200	247	200	0.6528	0.8214	0.7807	Dense
Marathi	mr	200	250	200	0.4612	0.6720	0.5916	Dense
Odia	or	200	252	201	0.6041	0.7605	0.7033	Dense
Punjabi	pa	200	241	200	0.5983	0.6445	0.6885	Reranking hybrid
Tamil	ta	200	253	201	0.2932	0.6415	0.4551	Dense
Telugu	te	200	250	200	0.7674	0.7186	0.7582	BM25

Interpretation Notes for Model Researchers

NanoIndicQA is a controlled way to compare Indic-language passage retrieval because all tasks share the same basic structure. Language-level differences should therefore be interpreted through script coverage, tokenizer behavior, paragraph length, and training data availability rather than task-family differences.

The dense-versus-BM25 profile is especially important. Dense-led splits show where semantic passage matching helps beyond repeated terms. BM25-led or BM25-competitive splits show where exact names and local orthography remain central. Tamil is a useful stress case because dense retrieval greatly improves over BM25 in the current metadata.

Training and Leakage Notes

Useful training data includes non-overlapping IndicQA-style question-context pairs, same-language Wikipedia passage retrieval, extractive QA in each language, and hard negatives from related biographies, places, events, or cultural topics. Training should keep the target as the full evidence paragraph, not only the answer span.

Exclude NanoIndicQA evaluation questions, positive paragraphs, qrels, and direct translations or paraphrases of them. Upstream IndicQA and MTEB retrieval splits should be audited for overlap before training.

Source Reference Table

Source	Year	Type	URL
Towards Leaving No Indic Language Behind	2022	paper	https://arxiv.org/abs/2212.05409
MTEB: Massive Text Embedding Benchmark	2022	paper	https://arxiv.org/abs/2210.07316

Metadata Summary

Field	Value
Task pages	11
Queries	2,200
Split-local documents	2,759
Positive qrels	2,205
Languages	as, bn, gu, hi, kn, ml, mr, or, pa, ta, te
Categories	natural_language
Positives / query avg	1.00

Task Metadata Summary

Task	Backing dataset	Lang	Category	Queries	Docs	Positives	BM25 nDCG@10	Dense nDCG@10	Reranking hybrid nDCG@10	Best profile
as	NanoIndicQA	as	natural_language	200	250	200	0.6111	0.7416	0.7283	Dense
bn	NanoIndicQA	bn	natural_language	200	250	201	0.6971	0.7773	0.7460	Dense
gu	NanoIndicQA	gu	natural_language	200	248	201	0.6060	0.7487	0.7207	Dense
hi	NanoIndicQA	hi	natural_language	200	261	201	0.4545	0.6511	0.5738	Dense
kn	NanoIndicQA	kn	natural_language	200	257	200	0.4730	0.7037	0.6111	Dense
ml	NanoIndicQA	ml	natural_language	200	247	200	0.6528	0.8214	0.7807	Dense
mr	NanoIndicQA	mr	natural_language	200	250	200	0.4612	0.6720	0.5916	Dense
or	NanoIndicQA	or	natural_language	200	252	201	0.6041	0.7605	0.7033	Dense
pa	NanoIndicQA	pa	natural_language	200	241	200	0.5983	0.6445	0.6885	Reranking hybrid
ta	NanoIndicQA	ta	natural_language	200	253	201	0.2932	0.6415	0.4551	Dense
te	NanoIndicQA	te	natural_language	200	250	200	0.7674	0.7186	0.7582	BM25