NanoMIRACL / ko

Overview

NanoMIRACL / ko is the Korean split of a MIRACL-style multilingual benchmark in which retrieval is monolingual: Korean queries retrieve Korean Wikipedia passages, not translated evidence. The Nano split has 200 queries, 10,000 documents, and 508 positive qrel rows. Queries are short, often entity-first, and frequently express intent through Korean endings rather than an initial question word. Current diagnostics show reranking_hybrid as the strongest profile across nDCG@10, hit@10, and recall@100. Dense retrieval improves top-rank quality over BM25, while BM25 retains higher recall@100 than dense retrieval.

Details

What the Original Data Measures

MIRACL was introduced as a multilingual ad hoc retrieval benchmark over Wikipedia passages. Its design is monolingual: Korean queries retrieve Korean passages from Korean Wikipedia. The benchmark emphasizes native-language questions, passage-level evidence, and human relevance judgments.

Korean is one of the MIRACL languages connected to the TyDi/Mr. TyDi lineage. The MIRACL framing adds passage-level relevance judgments over a segmented Wikipedia corpus. For this split, the relevant item is a Korean passage that supports the question, not a translated English passage or a short answer.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 508 positive qrel rows. Positives per query average 2.54, with a minimum of 1, a median of 2, and a maximum of 12. There are 103 multi-positive queries, representing 51.5 percent of the split. Queries average 21.71 characters, while documents average 205.28 characters.
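
The profile statistics above can be recomputed from the positive qrel rows. The following is a minimal sketch, assuming the qrels are available as (query_id, doc_id) pairs; the function name `positives_per_query_stats` and the toy rows are hypothetical and are not taken from the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query_stats(qrels):
    """Summarize positives per query from (query_id, doc_id) positive pairs."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for c in per_query if c > 1)
    return {
        "queries": len(per_query),
        "positive_rows": sum(per_query),
        "avg": mean(per_query),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive": multi,
        "multi_positive_share": multi / len(per_query),
    }


# Toy qrels for illustration only; these are not rows from NanoMIRACL / ko.
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"),
    ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
]
stats = positives_per_query_stats(toy_qrels)
```

On the real split the same arithmetic gives 508 / 200 = 2.54 positives per query and 103 / 200 = 51.5 percent multi-positive queries.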

The examples include short Korean questions and yes/no checks. Many begin with the entity or topic, such as 세상에서, 일본의, 임진왜란이, 중국의, 태양은, 대한민국의, or 히틀러가, while intent appears through forms such as 무엇인가, 언제, 어디, 누구, 몇, 인가요, and 있나요. Topics include science, history, geography, politics, entertainment, religion, technology, definitions, universities, and public figures.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.4994, hit@10 = 0.8000, and recall@100 = 0.9606. BM25 is useful when there is near-verbatim overlap or a distinctive Korean title, entity, date, or technical term. Its recall@100 of 0.9606 also shows that most judged positives remain somewhere in the top 100.

The weak point is top-rank ordering. Short Korean questions can share generic endings and forms such as 어디인가요, 무엇인가, and 몇 년도 with unrelated passages. BM25 can also retrieve a passage about the right entity family while missing the passage that states the requested capital, date, definition, or yes/no fact.
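
One plausible mechanism for this weakness, stated here as an assumption and not as a finding from the diagnostics, is that question forms are rare in encyclopedic prose and therefore receive a high inverse document frequency, so a passage that happens to contain one can outscore a passage that only matches the entity. The sketch below uses the common BM25 IDF variant log(1 + (N - df + 0.5) / (df + 0.5)) on an invented, pre-segmented toy corpus; the tokenization and documents are hypothetical.

```python
import math


def bm25_idf(term, docs):
    """BM25 IDF (non-negative variant) of a term over tokenized documents."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log(1 + (n - df + 0.5) / (df + 0.5))


# Invented toy corpus, already split into tokens; not taken from the dataset.
docs = [
    "레이캬비크 는 아이슬란드 의 수도 이다".split(),
    "아이슬란드 는 북대서양 의 섬나라 이다".split(),
    "아이슬란드 의 화산 은 활동 이 활발 하다".split(),
    "그 장소 가 어디 인지 는 알려지지 않았다".split(),
]

idf_entity = bm25_idf("아이슬란드", docs)    # entity appears in three documents
idf_question = bm25_idf("어디", docs)        # question word appears in one document
```

In this toy corpus the question word gets the larger weight, which is the kind of imbalance that morphology-aware tokenization or down-weighting of question forms would aim to reduce.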

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.6910, hit@10 = 0.9100, and recall@100 = 0.9213. Dense retrieval substantially improves top-rank quality over BM25 by matching the semantic relation expressed in the Korean question.

The tradeoff is coverage. Dense retrieval has better nDCG@10 and hit@10 than BM25, but lower recall@100. It is stronger at placing direct evidence near the top, while lexical retrieval still helps preserve additional judged positives for reranking.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains 100 candidates for most queries; three queries also include a rank-101 safeguard row. It achieves nDCG@10 = 0.7026, hit@10 = 0.9400, and recall@100 = 0.9882. Hybrid retrieval is the best observed profile on all three metrics.

This makes Korean a strong hybrid-search case. BM25 contributes exact Korean surface forms, names, and title matches, while dense retrieval contributes relation-sensitive evidence matching. The combined candidate set ranks better than dense alone and preserves more positives than either single-source profile.
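
This card does not state how the reranking_hybrid candidates are merged. As an illustration of how a lexical and a dense ranking can be combined, the sketch below uses reciprocal rank fusion (RRF), a common technique that is swapped in here as an assumption. The function name, the constant k = 60, and the toy rankings are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids by summing 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Toy rankings for illustration only.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents found by both sources (d1 and d3 here) rise above documents found by only one, while single-source documents stay in the pool, which mirrors the complementary behavior described above.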

Metric Interpretation for Model Researchers

This task is multi-positive for 51.5 percent of queries. Hit@10 measures whether at least one relevant passage appears near the top. nDCG@10 rewards ranking relevant passages high, and recall@100 measures how much of the judged positive set remains available for reranking.
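
The three metrics can be written down compactly for a single query. This is a minimal sketch assuming binary relevance and the standard log2 discount; whether the reported numbers use exactly this formulation is an assumption, and the function names and toy ranking are hypothetical.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of the judged positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG with a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy query with three judged positives, two of which are retrieved.
ranked = ["d9", "d1", "d7", "d2"]
positives = {"d1", "d2", "d3"}
hit = hit_at_k(ranked, positives)        # 1.0: d1 is inside the top 10
recall = recall_at_k(ranked, positives)  # 2 of 3 positives retrieved
ndcg = ndcg_at_k(ranked, positives)      # below 1.0: positives are not at the top
```

For a multi-positive query, hit@10 saturates as soon as one positive is found, so nDCG@10 and recall@100 are the metrics that still distinguish systems on the remaining positives.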

The Korean pattern is complementary: BM25 supplies lexical recall, dense retrieval supplies semantic ordering, and reranking_hybrid combines both into the best observed candidate ranking. Researchers should evaluate whether models handle Korean endings and entity-first phrasing without losing exact title and name signals.

Query and Relevance Type Tendencies

Queries are short Korean information needs about capitals, dates, historical events, scientific definitions, yes/no claims, countries, organizations, entertainment, and technology. Many are topic-led rather than wh-word-led, so a model must infer the requested relation from the whole sentence.

Relevant documents are Korean Wikipedia passages with title context and answer-bearing prose. The task rewards entity recognition, morphology-aware matching, question-ending interpretation, and disambiguation among related title pages or topic-near passages.

Representative Failure Modes

BM25 can over-rank passages that share generic question endings. A question about Iceland's capital can retrieve unrelated pages containing 어딘가 before the Reykjavík passage. Similar issues occur for questions about the first university or a Tang dynasty location, where 어디-like overlap is strong. Questions about snow, Hitler, or historical dates can retrieve plausible but non-labeled passages around the same event or entity.

Dense retrieval can fail by selecting a semantically related Korean passage that lacks the exact requested evidence. Hybrid retrieval reduces both missing-positive and top-rank failures, but reranking remains useful when the candidate set contains several near-answer passages.

Training Data That May Help

Useful training data includes non-overlapping MIRACL Korean training data, Korean Wikipedia question-to-passage retrieval pairs, Korean open-domain QA evidence retrieval datasets, and entity-attribute supervision for dates, locations, historical roles, definitions, and yes/no factual checks. Hard negatives should include near-title passages and generic-question distractors.
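
Hard negatives of the kind described above are often mined from a first-stage ranking. The sketch below is one simple way to do this, assuming a ranked candidate list and a set of judged positives per query; the function name and parameters are hypothetical.

```python
def mine_hard_negatives(ranked_candidates, positives, max_negatives=5):
    """Return the highest-ranked candidates that are not judged positive."""
    negatives = []
    for doc_id in ranked_candidates:
        if doc_id in positives:
            continue
        negatives.append(doc_id)
        if len(negatives) == max_negatives:
            break
    return negatives


# Toy example: d2 is the only judged positive.
hard_negatives = mine_hard_negatives(["d1", "d2", "d3", "d4"], {"d2"}, max_negatives=2)
```

Because relevance judgments are incomplete, some top-ranked unlabeled passages may in fact answer the question, so mined negatives are worth filtering or spot-checking before training.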

Synthetic data can help when it creates Korean Wikipedia-style passages with titles, aliases, dates, places, organizations, definitions, and factual evidence. Generated questions should include entity-first wording and forms such as 무엇인가, 언제, 어디, 누구, 몇, 인가요, and 있나요. Comparable evaluation should exclude upstream development/test data or other MIRACL-derived examples likely to overlap with this Nano split.
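
A basic overlap filter can support the exclusion recommended above. This is a minimal sketch that only catches exact query matches after Unicode normalization and removal of punctuation and whitespace; it is not a full leakage audit, and the function names and the training record layout (a dict with a "query" key) are assumptions.

```python
import unicodedata


def normalize_query(text):
    """Normalize to NFKC, casefold, and keep only letters and digits."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return "".join(ch for ch in text if ch.isalnum())


def drop_overlapping(train_examples, eval_queries):
    """Remove training examples whose normalized query matches an eval query."""
    blocked = {normalize_query(q) for q in eval_queries}
    return [ex for ex in train_examples if normalize_query(ex["query"]) not in blocked]


# The eval query is from the example data; the training records are invented.
eval_queries = ["숙종은 몇 번째 왕인가?"]
train_examples = [
    {"query": "숙종은  몇 번째 왕인가"},   # same question, different spacing and punctuation
    {"query": "태양은 별인가요?"},
]
kept = drop_overlapping(train_examples, eval_queries)
```

Paraphrased questions and overlapping passages would pass this filter, so a stricter audit would also compare passage ids or use fuzzy matching.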

Model Improvement Notes

Dense retrievers should improve Korean semantic relation matching while preserving exact names, titles, and dates. Sparse systems benefit from Korean morphological handling and better weighting of generic question endings. Rerankers should choose the passage that directly states the fact, not merely a passage with the right entity or a matching ending.

For hybrid systems, NanoMIRACL / ko is a benchmark where combining sources clearly helps: reranking_hybrid improves nDCG@10, hit@10, and recall@100 over both individual profiles. The main improvement target is robust reranking among multiple short Korean evidence passages.

Example Data

Query	Positive document
헤라클레스는 그리스 신들 중 한 명인가? [22 chars]	그리스 신화 헤라클레스는 에트루리아와 로마의 신화 및 숭배에도 등장하며, 로마인이 쓰던 라틴어 감탄사 "mehercule"은 그리스어인 "Herakleis"에서 유래한 것이었다. 이탈리아에서는 헤라클레스를 상인의 신으로 숭배하였는데, 다른 나라에서는 그의 특징적인 재능인 행운이나 위험에서의 구조를 염원하기도 하였다. [178 chars]
숙종은 몇 번째 왕인가? [13 chars]	조선 숙종 숙종(肅宗, 1661년 10월 7일(음력 8월 15일) ~ 1720년 7월 12일(음력 6월 8일))은 조선의 제19대 왕이다. 성은 이(李), 휘는 돈(焞), 본관은 전주(全州)., 초명은 용상(龍祥), 광(爌), 자는 명보(明譜), 사후 시호는 숙종현의광륜예성영렬장문헌무경명원효대왕(肅宗顯義光倫睿聖英烈章文憲武敬明元孝大王)이며 이후 존호가 더해져 정식 시호는 숙종현의광륜예성영렬유모영운홍인준덕배천합도계휴독경정중협극신의대훈장문헌무경명원효대왕(肅宗顯義光倫睿聖英烈裕謨永運洪仁峻德配天合道啓休篤慶正中恊極神毅大勳章文憲武敬明元孝大王)이다. 현종과 명성왕후의 외아들로 비는 김만기의 딸 인경왕후, 계비는 민유중의 딸 인현왕후, 제2계비는 김주신의 딸 인원왕후이다. [371 chars]
가톨릭교회의 교회법(CIC)은 교회의 고유한 조직과 운영, 그리고 신자들이 교회의 목적을 좇아 이루도록 합법적인 교회의 권위로 제정한 법을 말하나요? [83 chars]	로마 가톨릭교회 가톨릭교회의 교회법(CIC)은 교회의 고유한 조직과 운영, 그리고 신자들이 교회의 목적을 좇아 이루도록 합법적인 교회의 권위로 제정한 법을 말한다. 가톨릭교회는 영신적이면서도 가시적인 형태로 존재하며, 신적인 것과 인간적인 것이 함께 존재한다. 그러므로 교회법도 자연히 신약성경과 성전 안에 나오는 신법과, 교회와 인간이 제정한 실정법으로 이루어진다. 이러한 법의 제정 및 공표는 교황만이 할 수 있다. 교황은 보편 교회의 최고 목자로서 자기 임무에 의하여 교회에서 최고의 완전하고 직접적이며 보편적인 직권을 가지며 이를 언제나 자유로이 행사할 수 있다. [320 chars]

Source Reference Table

Title	Year	Type	URL
Making a MIRACL: Multilingual Information Retrieval Across a Continuum of Languages	2022	paper	https://arxiv.org/abs/2210.09984
MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages	2023	paper	https://aclanthology.org/2023.tacl-1.63/
MIRACL GitHub repository		project repository	https://github.com/project-miracl/miracl
miracl/miracl-corpus		dataset card	https://huggingface.co/datasets/miracl/miracl-corpus

Dataset Information

Field	Value
Nano set	NanoMIRACL
Backing dataset	NanoMIRACL
Task / split	ko
Hugging Face dataset	hakari-bench/NanoMIRACL
Language	ko
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	508
Positives / query avg	2.54
Positives / query min	1
Positives / query median	2.00
Positives / query max	12
Multi-positive queries	103 (51.50%)
Query length avg chars	21.71
Document length avg chars	205.28

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4994	0.8000	0.9606	top-500
Dense	`harrier_oss_v1_270m`	0.6910	0.9100	0.9213	top-500
Reranking hybrid	`reranking_hybrid`	0.7026	0.9400	0.9882	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: unknown
Train/eval overlap audit: not_audited
Leakage note: prefer excluding upstream development/test data or other MIRACL-derived data likely to overlap with the NanoMIRACL evaluation questions and passages
Multi-positive training: single_positive_question_document_focus
Useful training data: non-overlapping MIRACL Korean train split data, Korean Wikipedia question-to-passage retrieval pairs, Korean open-domain QA evidence retrieval datasets