NanoMTEB-Korean
Overview
NanoMTEB-Korean is a compact Korean retrieval group with five tasks spanning RAG evidence retrieval, implicit-reasoning evidence retrieval, legal article lookup, MIRACL-style Wikipedia retrieval, and KorQuAD/SQuAD-style context retrieval. It is a useful group for model researchers because Korean retrieval quality is shaped by morphology, spacing variation, domain-specific terms, and the difference between literal evidence matching and semantic answerability.
The group contains 914 queries, 24,493 task-local documents, and 1,400 positive qrel rows. autorag, lawir_ko, and squad_kor_v1 are single-positive in the Nano splits. ko_strategy_qa and miracl_ko are multi-positive, so a model can receive credit for retrieving several acceptable evidence passages. This means the group combines exact target retrieval with broader evidence-list ranking.
What This Group Measures
The group measures whether Korean retrieval systems can handle five related but different relevance relations. autorag retrieves public-document chunks needed for Korean RAG questions. ko_strategy_qa retrieves Korean evidence for StrategyQA-style questions where the supporting fact may be implicit. lawir_ko retrieves statute articles from law and provision queries. miracl_ko retrieves Korean Wikipedia passages for information-seeking questions. squad_kor_v1 retrieves answer-bearing Korean Wikipedia contexts.
The benchmark is not just a Korean QA suite. It includes legal lookup and public document retrieval, and it separates tasks where exact terminology is highly predictive from tasks where dense semantic matching has a clearer advantage. This makes it useful for diagnosing whether a model is strong because it understands Korean semantic relations, because it matches surface terms well, or because it balances both signals in a hybrid setting.
Task Families
- Korean RAG retrieval: autorag retrieves answer-supporting chunks from public and domain documents.
- Implicit-reasoning retrieval: ko_strategy_qa retrieves evidence passages for Korean StrategyQA-style questions.
- Legal retrieval: lawir_ko retrieves Korean statutory article text from law/provision queries.
- Korean Wikipedia retrieval: miracl_ko retrieves MIRACL Korean Wikipedia passages.
- Reading-comprehension retrieval: squad_kor_v1 retrieves KorQuAD/SQuAD answer contexts.
Dataset Shape
The group has five Korean-language tasks. The largest document pools are miracl_ko with 10,000 documents and ko_strategy_qa with 9,251 documents. autorag is the smallest pool with 720 documents, while squad_kor_v1 has 960 documents and very direct QA-context matching. The group-level document count is the sum of task-local candidate pools, not a deduplicated Korean corpus.
Positive density varies by task. Three tasks have exactly one positive per query. ko_strategy_qa averages 1.89 positives per query, and miracl_ko averages 2.54 positives per query. Those two tasks should be read with recall and listwise ranking behavior in mind, because one question may have multiple valid evidence passages.
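To make the effect of positive density concrete, here is a minimal sketch of binary-relevance nDCG@10 for a single query. It assumes every positive has gain 1 and that the ranked list contains no duplicate document IDs; the function name `ndcg_at_k` and the document IDs are hypothetical, and the benchmark's own evaluation code may differ in detail.

```python
import math


def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """Binary-relevance nDCG@k for one query (assumes unique doc IDs)."""
    positives = set(positive_ids)
    # Discounted gain of every positive found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ordering: all positives (up to k of them) ranked first.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical single-positive query: the only positive sits at rank 2.
single = ndcg_at_k(["d7", "d1", "d3"], ["d1"])

# Hypothetical multi-positive query: two of three positives are retrieved.
multi = ndcg_at_k(["p1", "x", "p2", "y"], ["p1", "p2", "p3"])

print(f"single-positive: {single:.4f}, multi-positive: {multi:.4f}")
```

In the multi-positive case the ideal DCG grows with the number of positives, so a query only reaches 1.0 when all of its positives are ranked ahead of every non-positive. This is why ko_strategy_qa and miracl_ko scores reflect list quality rather than a single hit.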
Retrieval Behavior
BM25 Profile
BM25 is strongest on tasks with direct lexical anchors. It is the best nDCG@10 profile for autorag and squad_kor_v1. autorag reaches 0.9053 nDCG@10, suggesting that many RAG queries retain domain terms, numbers, named entities, or report language that appear in the positive chunk. squad_kor_v1 reaches 0.9618 nDCG@10, which indicates that the question and target context often share enough Korean surface evidence for sparse retrieval to rank the answer context very highly.
BM25 is less dominant on ko_strategy_qa, lawir_ko, and miracl_ko. StrategyQA-style evidence may not repeat the exact question wording, legal article lookup can require mapping a provision description to formal statutory language, and MIRACL-style Wikipedia retrieval often benefits from semantic matching beyond exact token overlap. At group level BM25 is still strong (0.6525 query-weighted nDCG@10), but it is not the leading profile overall.
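The group-level BM25 figure can be reproduced from the per-task values in the Task Summary table. The sketch below assumes that "query-weighted" means a mean weighted by each task's query count; the helper name `query_weighted_mean` is hypothetical.

```python
def query_weighted_mean(scores, query_counts):
    """Average per-task scores, weighting each task by its query count."""
    total_queries = sum(query_counts[task] for task in scores)
    weighted_sum = sum(scores[task] * query_counts[task] for task in scores)
    return weighted_sum / total_queries


# Per-task query counts and BM25 nDCG@10, copied from the Task Summary table.
QUERY_COUNTS = {
    "autorag": 114,
    "ko_strategy_qa": 200,
    "lawir_ko": 200,
    "miracl_ko": 200,
    "squad_kor_v1": 200,
}
BM25_NDCG10 = {
    "autorag": 0.9053,
    "ko_strategy_qa": 0.4740,
    "lawir_ko": 0.5232,
    "miracl_ko": 0.5069,
    "squad_kor_v1": 0.9618,
}

bm25_group = query_weighted_mean(BM25_NDCG10, QUERY_COUNTS)
bm25_macro = sum(BM25_NDCG10.values()) / len(BM25_NDCG10)

print(f"query-weighted: {bm25_group:.4f}")  # 0.6525, the reported group figure
print(f"unweighted task mean: {bm25_macro:.4f}")
```

The unweighted task mean comes out higher than the query-weighted value because autorag, the task where BM25 scores well with the fewest queries, counts as a full fifth of the average. It is worth stating which aggregation is used when comparing against other reports.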
Dense Profile
Dense retrieval with harrier-oss-270m is the best profile for ko_strategy_qa and lawir_ko. The gain on ko_strategy_qa is large: 0.7084 nDCG@10 for dense against 0.4740 for BM25. This is the expected pattern for implicit-reasoning evidence retrieval, where the useful passage can express the supporting fact rather than repeat the question. lawir_ko also favors dense at 0.6534 nDCG@10, showing that semantic matching helps bridge query phrasing and formal legal article text.
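Dense is not the best profile on autorag or squad_kor_v1. It stays close on squad_kor_v1 (0.9158 nDCG@10 against 0.9618 for BM25) but trails by a wider margin on autorag (0.7745 against 0.9053). It is also slightly behind hybrid on miracl_ko (0.6997 against 0.7121). This profile suggests that Korean dense retrieval is valuable when evidence is paraphrastic or domain phrasing diverges, but exact surface forms remain important in direct QA and RAG chunk retrieval.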
Reranking Hybrid Profile
The reranking hybrid profile is the strongest query-weighted profile for the group: 0.7557 nDCG@10, 0.9114 hit@10, and 0.9640 recall@100. It is the best individual nDCG@10 profile for miracl_ko, where dense and sparse signals complement each other in a multi-positive Wikipedia retrieval setting. It is also close to the best profile on squad_kor_v1 and lawir_ko.
Hybrid is not uniformly superior. It trails BM25 on autorag and squad_kor_v1, and it trails dense on ko_strategy_qa and lawir_ko. However, its recall@100 is the best or tied-best on most tasks. This makes NanoMTEB-Korean a clear example where hybrid retrieval can be a strong candidate generation strategy even when the top-10 ordering still depends on task family.
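The distinction between candidate generation and top-10 ordering can be illustrated with hit@k and recall@k. This is a minimal sketch with hypothetical function names and a made-up ranking, not the benchmark's evaluation code.

```python
def hit_at_k(ranked_doc_ids, positive_ids, k):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    positives = set(positive_ids)
    return 1.0 if any(doc_id in positives for doc_id in ranked_doc_ids[:k]) else 0.0


def recall_at_k(ranked_doc_ids, positive_ids, k):
    """Fraction of a query's positives that appear in the top k."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    found = positives.intersection(ranked_doc_ids[:k])
    return len(found) / len(positives)


# Hypothetical ranking of 100 candidates with positives at ranks 3 and 40.
ranking = [f"doc{i}" for i in range(1, 101)]
positives = ["doc3", "doc40"]

hit10 = hit_at_k(ranking, positives, 10)         # 1.0: doc3 is in the top 10
recall10 = recall_at_k(ranking, positives, 10)   # 0.5: doc40 is missed
recall100 = recall_at_k(ranking, positives, 100) # 1.0: both are candidates
```

A profile with high recall@100 but weaker nDCG@10 places the right passages in the candidate pool without ordering them well, which is the situation where a stronger second-stage ranker has the most to gain.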
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| autorag | RAG evidence retrieval | ko | 114 | 720 | 114 | 1.00 | 0.9053 | 0.7745 | 0.8530 | BM25 |
| ko_strategy_qa | Implicit-reasoning evidence retrieval | ko | 200 | 9,251 | 378 | 1.89 | 0.4740 | 0.7084 | 0.6476 | Dense |
| lawir_ko | Legal retrieval | ko | 200 | 3,562 | 200 | 1.00 | 0.5232 | 0.6534 | 0.6491 | Dense |
| miracl_ko | Wikipedia retrieval | ko | 200 | 10,000 | 508 | 2.54 | 0.5069 | 0.6997 | 0.7121 | Reranking hybrid |
| squad_kor_v1 | Reading-comprehension retrieval | ko | 200 | 960 | 200 | 1.00 | 0.9618 | 0.9158 | 0.9585 | BM25 |
Interpretation Notes for Model Researchers
The group separates three patterns. First, autorag and squad_kor_v1 reward models that preserve Korean lexical evidence and rank direct answer chunks early. Second, ko_strategy_qa and lawir_ko reward semantic matching, because the relevant passage can differ from the query in wording or legal formulation. Third, miracl_ko benefits most from a hybrid candidate set, suggesting that MIRACL-style Korean Wikipedia retrieval needs both named-entity overlap and semantic relatedness.
Because best profiles are split across BM25, dense, and hybrid, a single group average can hide important behavior. A model that improves ko_strategy_qa may be learning semantic evidence retrieval, while a model that improves squad_kor_v1 may simply be better at precise context matching. Per-task inspection is required before drawing conclusions about Korean retrieval quality.
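A per-task inspection of this kind can be scripted directly from the Task Summary table. The sketch below only restates the published nDCG@10 values; the variable names are illustrative.

```python
from collections import Counter

# nDCG@10 per task and profile, copied from the Task Summary table.
NDCG10 = {
    "autorag": {"BM25": 0.9053, "Dense": 0.7745, "Reranking hybrid": 0.8530},
    "ko_strategy_qa": {"BM25": 0.4740, "Dense": 0.7084, "Reranking hybrid": 0.6476},
    "lawir_ko": {"BM25": 0.5232, "Dense": 0.6534, "Reranking hybrid": 0.6491},
    "miracl_ko": {"BM25": 0.5069, "Dense": 0.6997, "Reranking hybrid": 0.7121},
    "squad_kor_v1": {"BM25": 0.9618, "Dense": 0.9158, "Reranking hybrid": 0.9585},
}

# Best profile per task, and how many tasks each profile wins.
best_profile = {task: max(scores, key=scores.get) for task, scores in NDCG10.items()}
wins = Counter(best_profile.values())

# Gap between the best profile and the hybrid profile on each task.
hybrid_gap = {
    task: max(scores.values()) - scores["Reranking hybrid"]
    for task, scores in NDCG10.items()
}

for task, profile in best_profile.items():
    print(f"{task}: best={profile}, hybrid gap={hybrid_gap[task]:.4f}")
```

Reporting the per-task winner next to the gap to hybrid keeps the split across BM25, dense, and hybrid visible instead of folding it into one average.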
Training and Leakage Notes
Useful training data includes non-overlapping Korean RAG question-to-chunk pairs, Korean Wikipedia QA retrieval pairs, MIRACL Korean training examples, Ko-StrategyQA evidence data, and Korean law/provision retrieval pairs. For multi-positive tasks, listwise or multi-positive contrastive objectives match the evaluation better than forcing each query to a single positive.
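As an illustration of a multi-positive contrastive objective, the sketch below computes one common InfoNCE variant for a single query, in which the probability mass of all positives is summed inside the logarithm. The function name, the temperature value, and the example scores are assumptions made for illustration; the source does not prescribe a specific loss.

```python
import math


def _log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def multi_positive_infonce(scores, positive_indices, temperature=0.05):
    """Loss for one query: -log(sum over positives / sum over all candidates).

    `scores` holds query-document similarities for every candidate, and
    `positive_indices` marks which candidates are labeled positive.
    """
    scaled = [score / temperature for score in scores]
    positive_scaled = [scaled[i] for i in positive_indices]
    return _log_sum_exp(scaled) - _log_sum_exp(positive_scaled)


# Hypothetical similarities for four candidates; candidates 0 and 1 are positive.
both_high = multi_positive_infonce([0.9, 0.8, 0.1, 0.2], [0, 1])
one_low = multi_positive_infonce([0.9, 0.1, 0.8, 0.2], [0, 1])

print(f"both positives scored high: {both_high:.4f}")
print(f"one positive scored low:    {one_low:.4f}")
```

This variant is satisfied once any one positive dominates the negatives, so it tolerates incomplete labels. An alternative that averages the per-positive log-probabilities pushes every labeled positive upward, which may suit listwise evaluation on ko_strategy_qa and miracl_ko better; which one works better here is an empirical question.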
Leakage control should exclude Nano evaluation queries, positive passages, qrels, and overlapping upstream test/dev examples from AutoRAG, Ko-StrategyQA, LawIRKo, MIRACL Korean, and KorQuAD/SQuADKor sources. Synthetic data should preserve Korean spacing and morphology, named entities, legal article numbers, domain terms, dates, and quantities. Strong hard negatives should come from the same report, same statute, same Wikipedia page family, or related evidence chain.
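A minimal sketch of the exclusion step, assuming training data is a list of (query, passage) pairs and that exact matching after light normalization is an acceptable first pass. The helper names and the example strings are hypothetical. The normalization is used only to build comparison keys, so the retained training text keeps its original Korean spacing.

```python
import unicodedata


def comparison_key(text):
    """Key for leakage matching: NFKC-normalize, drop whitespace, casefold."""
    normalized = unicodedata.normalize("NFKC", text)
    return "".join(normalized.split()).casefold()


def filter_training_pairs(train_pairs, eval_queries, eval_positive_passages):
    """Drop pairs whose query or passage matches evaluation data."""
    blocked_queries = {comparison_key(q) for q in eval_queries}
    blocked_passages = {comparison_key(p) for p in eval_positive_passages}
    return [
        (query, passage)
        for query, passage in train_pairs
        if comparison_key(query) not in blocked_queries
        and comparison_key(passage) not in blocked_passages
    ]


# Hypothetical data: the first training query is a spacing variant of an
# evaluation query and should be removed.
eval_queries = ["대한민국의 수도는 어디인가?"]
eval_passages = ["서울은 대한민국의 수도이다."]
train_pairs = [
    ("대한민국의  수도는 어디 인가?", "서울특별시는 한반도 중서부에 있다."),
    ("민법 제1조의 내용은?", "민법 제1조(법원) 민사에 관하여 법률에 규정이 없으면 관습법에 의한다."),
]

clean_pairs = filter_training_pairs(train_pairs, eval_queries, eval_passages)
print(len(clean_pairs))  # 1
```

Exact-key matching catches spacing and Unicode-form variants but not paraphrases or partially overlapping chunks, so near-duplicate detection would still be needed for upstream test and dev examples.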
Source Reference Table
| Source | Year | Type | URL |
| AutoRAG: Automated Framework for optimization of Retrieval Augmented Generation Pipeline | 2024 | paper | https://arxiv.org/abs/2410.20878 |
| Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies | 2021 | paper | https://arxiv.org/abs/2101.02235 |
| KorQuAD1.0: Korean QA Dataset for Machine Reading Comprehension | 2019 | paper | https://arxiv.org/abs/1909.07005 |
| MIRACL | | benchmark page | http://miracl.ai/ |
| Massive Text Embedding Benchmark (MTEB) | | benchmark repository | https://github.com/embeddings-benchmark/mteb |
| yjoonjang/markers_bm | | dataset card | https://huggingface.co/datasets/yjoonjang/markers_bm |
| taeminlee/Ko-StrategyQA | | dataset card | https://huggingface.co/datasets/taeminlee/Ko-StrategyQA |
| on-and-on/lawgov_ir-ko | | dataset card | https://huggingface.co/datasets/on-and-on/lawgov_ir-ko |
| mteb/MIRACLRetrieval | | dataset card | https://huggingface.co/datasets/mteb/MIRACLRetrieval |
| yjoonjang/squad_kor_v1 | | dataset card | https://huggingface.co/datasets/yjoonjang/squad_kor_v1 |
Metadata Summary
| Field | Value |
| Task pages | 5 |
| Queries | 914 |
| Split-local documents | 24,493 |
| Positive qrels | 1,400 |
| Languages | ko |
| Categories | natural_language |
| Positives / query avg | 1.53 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| autorag | NanoMTEB-Korean | ko | natural_language | 114 | 720 | 114 | 0.9053 | 0.7745 | 0.8530 | BM25 |
| ko_strategy_qa | NanoMTEB-Korean | ko | natural_language | 200 | 9,251 | 378 | 0.4740 | 0.7084 | 0.6476 | Dense |
| lawir_ko | NanoMTEB-Korean | ko | natural_language | 200 | 3,562 | 200 | 0.5232 | 0.6534 | 0.6491 | Dense |
| miracl_ko | NanoMTEB-Korean | ko | natural_language | 200 | 10,000 | 508 | 0.5069 | 0.6997 | 0.7121 | Reranking hybrid |
| squad_kor_v1 | NanoMTEB-Korean | ko | natural_language | 200 | 960 | 200 | 0.9618 | 0.9158 | 0.9585 | BM25 |