HAKARI-Bench

NanoJMTEB-v2

Overview

NanoJMTEB-v2 is a compact Japanese retrieval group derived from JMTEB, MTEB, and related Japanese datasets. It covers Japanese casual web search, government FAQ matching, quiz-to-entity retrieval, answer-label retrieval, MIRACL and Mr. TyDi passage retrieval, long-document retrieval, and four Japanese NLP Journal paper-component matching tasks.

The group is useful because it is not simply Japanese passage retrieval. Some tasks retrieve short answer labels, some retrieve noisy web snippets or FAQ answers, some retrieve Wikipedia-like passages or full entity pages, and some match titles or abstracts to academic paper sections. BM25 exposes Japanese term and entity anchoring, dense retrieval tests semantic passage and label matching, and reranking_hybrid indicates whether sparse and dense retrieval recover complementary candidates.

What This Group Measures

JMTEB and MTEB define Japanese retrieval tasks for embedding evaluation. This Nano group collects several Japanese retrieval sources: JaCWIR, JaGovFaqs, JAQKET, Mintaka Japanese, MIRACL, Mr. TyDi, MultiLongDocRetrieval, and Japanese NLP Journal paper-component matching tasks.

The group measures Japanese retrieval robustness across target types. A relevant document may be a title/description snippet, an official FAQ answer, a Wikipedia entity page, a short answer label, an answer-bearing passage, a full long article, or an academic paper section. A model that works well on one of these surfaces may not work on the others.

Dataset Shape

NanoJMTEB-v2 contains 11 task pages, 2,200 queries, 64,140 split-local documents, and 2,432 positive qrel rows. Each split has 200 queries. Most tasks are single-positive; miracl_ja and mr_tidy_japanese contain multi-positive queries.
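The totals above can be re-derived from the per-task counts listed in the task summary table. The following is a small consistency check, not part of any official tooling; the dictionary name `TASKS` is chosen here for illustration.

```python
# Per-task (queries, documents, positive qrels), copied from the task summary table.
TASKS = {
    "ja_cwir": (200, 10_000, 200),
    "ja_gov_faqs": (200, 10_000, 200),
    "jaqket": (200, 10_000, 200),
    "mintaka_ja": (200, 1_592, 200),
    "miracl_ja": (200, 10_000, 373),
    "mr_tidy_japanese": (200, 10_000, 259),
    "multi_long_doc_ja": (200, 10_000, 200),
    "nlpjournal_abs_article": (200, 637, 200),
    "nlpjournal_abs_intro": (200, 637, 200),
    "nlpjournal_title_abs": (200, 637, 200),
    "nlpjournal_title_intro": (200, 637, 200),
}

total_queries = sum(q for q, _, _ in TASKS.values())
total_docs = sum(d for _, d, _ in TASKS.values())
total_positives = sum(p for _, _, p in TASKS.values())

# A task is multi-positive when it has more positive qrel rows than queries.
multi_positive = sorted(name for name, (q, _, p) in TASKS.items() if p > q)

print(len(TASKS), total_queries, total_docs, total_positives)
print(round(total_positives / total_queries, 2))  # positives per query
print(multi_positive)
```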

Text length varies widely. mintaka_ja documents are short answer labels, while multi_long_doc_ja and nlpjournal_abs_article contain very long documents. The NLP Journal abstract queries are much longer than web, FAQ, entity, or passage queries. This makes the group a mix of short Japanese search, entity inference, evidence passage retrieval, and long academic-document matching.

Retrieval Behavior

BM25 Profile

BM25 is very strong on the four Japanese NLP Journal component tasks (nDCG@10 from 0.91 to 1.00) and on ja_cwir (0.92), where exact technical terms, titles, web keywords, and shared paper vocabulary are highly informative. It also scores well on jaqket (0.78) and ja_gov_faqs (0.72), and it is the best profile on multi_long_doc_ja (0.59, against 0.40 for dense). These tasks often preserve Japanese surface anchors that sparse retrieval can exploit.

BM25 is weaker on mintaka_ja, where the target is a short answer label, and on MIRACL/Mr. TyDi passage retrieval, where short factual questions may not repeat the evidence passage wording. It remains useful, but semantic inference and answerability become more important.
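To make the idea of surface anchors concrete, here is a minimal sketch of Okapi BM25 scoring over Japanese text. It rests on several assumptions that do not come from the benchmark: the tokenizer is plain character bigrams (the benchmark's actual tokenizer is not stated here), the IDF uses the non-negative Lucene-style variant, and `k1=1.2`, `b=0.75` are conventional defaults. The example documents are invented.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Assumed tokenizer: overlapping character bigrams, whitespace removed."""
    text = "".join(text.split())
    return [text[i:i + 2] for i in range(len(text) - 1)] or [text]


class BM25:
    def __init__(self, docs, k1=1.2, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tfs = [Counter(char_bigrams(d)) for d in docs]
        self.doc_lens = [sum(tf.values()) for tf in self.doc_tfs]
        self.avg_len = sum(self.doc_lens) / len(docs)
        # Document frequency: number of documents containing each bigram.
        df = Counter(term for tf in self.doc_tfs for term in tf)
        n = len(docs)
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query, index):
        tf, length = self.doc_tfs[index], self.doc_lens[index]
        norm = self.k1 * (1 - self.b + self.b * length / self.avg_len)
        # Each distinct query bigram contributes once (query term frequency is ignored).
        return sum(
            self.idf[t] * tf[t] * (self.k1 + 1) / (tf[t] + norm)
            for t in set(char_bigrams(query)) if t in tf
        )

    def rank(self, query):
        """Document indices ordered from highest to lowest score."""
        return sorted(range(len(self.doc_tfs)), key=lambda i: -self.score(query, i))


docs = ["自然言語処理の論文要旨", "住民票の写しの請求方法", "富士山は日本で最も高い山"]
index = BM25(docs)
print(index.rank("住民票の請求"))
```

The query shares every one of its bigrams with the second document and none with the others, which is the kind of exact overlap that favors sparse retrieval. A short answer label, as in mintaka_ja, offers far fewer such anchors.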

Dense Profile

Dense retrieval beats BM25 on the passage and answer-oriented tasks: ja_gov_faqs, mintaka_ja, miracl_ja, and mr_tidy_japanese. It is the best profile on the last three, with gains of 0.11 to 0.19 nDCG@10 over BM25. It connects short Japanese questions to passages or labels that express the requested answer without exact overlap.

Dense retrieval is not always best. BM25 outperforms dense on all four NLP Journal tasks, on ja_cwir, and most clearly on multi_long_doc_ja (0.59 versus 0.40), because titles, abstracts, web snippets, and long articles share distinctive terms with their queries. That makes this group useful for testing whether dense models preserve exact Japanese terminology.

Reranking Hybrid Profile

reranking_hybrid is the best profile on ja_gov_faqs and jaqket, although on jaqket all three profiles fall within 0.005 nDCG@10 of each other. On every other task except nlpjournal_abs_intro it lands between BM25 and dense. The hybrid-led tasks are cases where exact Japanese terms and semantic matching both matter: an FAQ answer or entity page may contain the right keywords, but ranking still needs intent or entity disambiguation.
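This page does not specify how the reranking_hybrid candidates are merged. One common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF), sketched below as an illustration only. The function name, the constant `k=60` (a conventional default), and the document ids are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


bm25_top = ["d3", "d1", "d7"]   # hypothetical BM25 candidates
dense_top = ["d1", "d9", "d3"]  # hypothetical dense candidates
fused = reciprocal_rank_fusion([bm25_top, dense_top])
print(fused)
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever still stay in the pool, which is the complementary-candidate behavior described above.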

For reranker experiments, the hybrid profile is most useful on tasks where BM25 and dense disagree. It provides a safer candidate pool for FAQ, entity, and passage retrieval than either signal alone.
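One simple way to quantify how much BM25 and dense disagree on a query is the overlap between their top-k candidate sets. The sketch below uses Jaccard overlap; the metric choice, the function name, and the example ids are assumptions made for illustration, not something this benchmark reports.

```python
def topk_overlap(bm25_ranking, dense_ranking, k=10):
    """Jaccard overlap of the two top-k candidate sets (1.0 means identical sets)."""
    a, b = set(bm25_ranking[:k]), set(dense_ranking[:k])
    union = a | b
    return len(a & b) / len(union) if union else 1.0


overlap = topk_overlap(["d3", "d1", "d7"], ["d1", "d9", "d3"], k=3)
print(overlap)
```

Queries with low overlap are the ones where a hybrid pool adds the most new candidates for a reranker to consider.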

Task Summary

| Task | Retrieval shape | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ja_cwir | generated web question to page snippet | 200 | 10,000 | 200 | 0.9181 | 0.8367 | 0.8810 | BM25 |
| ja_gov_faqs | government FAQ question to answer | 200 | 10,000 | 200 | 0.7196 | 0.7487 | 0.7614 | Reranking hybrid |
| jaqket | quiz clue to entity page | 200 | 10,000 | 200 | 0.7837 | 0.7830 | 0.7876 | Reranking hybrid |
| mintaka_ja | complex question to answer label | 200 | 1,592 | 200 | 0.2561 | 0.3687 | 0.3354 | Dense |
| miracl_ja | Japanese question to Wikipedia passage | 200 | 10,000 | 373 | 0.5361 | 0.6923 | 0.6252 | Dense |
| mr_tidy_japanese | Japanese question to Mr. TyDi passage | 200 | 10,000 | 259 | 0.5518 | 0.7399 | 0.6633 | Dense |
| multi_long_doc_ja | generated question to long article | 200 | 10,000 | 200 | 0.5929 | 0.3956 | 0.5008 | BM25 |
| nlpjournal_abs_article | abstract to full article | 200 | 637 | 200 | 0.9982 | 0.9763 | 0.9863 | BM25 |
| nlpjournal_abs_intro | abstract to introduction | 200 | 637 | 200 | 0.9896 | 0.9553 | 0.9545 | BM25 |
| nlpjournal_title_abs | title to abstract | 200 | 637 | 200 | 0.9526 | 0.9290 | 0.9428 | BM25 |
| nlpjournal_title_intro | title to introduction | 200 | 637 | 200 | 0.9132 | 0.8632 | 0.8704 | BM25 |
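All scores in the table are nDCG@10. As a reminder of what that metric rewards, here is a minimal sketch assuming binary relevance (a document is either a positive qrel or not) and the common log2 discount. The benchmark's own evaluator may differ in details such as tie handling, and the function name and example ids are illustrative.

```python
import math


def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """nDCG@k with binary gains: 1 for a positive document, 0 otherwise.

    Assumes ranked_doc_ids contains no duplicate ids.
    """
    positives = set(positive_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ranking places all positives (at most k of them) at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Single-positive query: the positive sits at rank 3.
print(ndcg_at_k(["x", "y", "a"], ["a"]))
# Multi-positive query (as in miracl_ja or mr_tidy_japanese): positives at ranks 1 and 3.
print(ndcg_at_k(["a", "x", "b"], ["a", "b"]))
```

For single-positive tasks the score depends only on the rank of the one positive, so a drop from rank 1 to rank 3 already halves it. Multi-positive tasks are normalized against the best achievable ordering of all positives.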

Interpretation Notes for Model Researchers

NanoJMTEB-v2 is best read by target format. Academic component matching favors lexical retrieval because titles, abstracts, and papers share technical terms. FAQ, entity, Mintaka, MIRACL, and Mr. TyDi tasks require more semantic matching or entity inference. A single average over all eleven tasks will hide those differences.
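The effect of averaging can be checked directly from the task summary scores. The sketch below compares the overall mean per profile with means per target format. The family grouping in `FAMILIES` is a hypothetical one chosen for this illustration, not an official taxonomy of the benchmark.

```python
from statistics import mean

PROFILES = ("bm25", "dense", "reranking_hybrid")

# (BM25, dense, reranking hybrid) nDCG@10 per task, copied from the task summary table.
SCORES = {
    "ja_cwir": (0.9181, 0.8367, 0.8810),
    "ja_gov_faqs": (0.7196, 0.7487, 0.7614),
    "jaqket": (0.7837, 0.7830, 0.7876),
    "mintaka_ja": (0.2561, 0.3687, 0.3354),
    "miracl_ja": (0.5361, 0.6923, 0.6252),
    "mr_tidy_japanese": (0.5518, 0.7399, 0.6633),
    "multi_long_doc_ja": (0.5929, 0.3956, 0.5008),
    "nlpjournal_abs_article": (0.9982, 0.9763, 0.9863),
    "nlpjournal_abs_intro": (0.9896, 0.9553, 0.9545),
    "nlpjournal_title_abs": (0.9526, 0.9290, 0.9428),
    "nlpjournal_title_intro": (0.9132, 0.8632, 0.8704),
}

# Hypothetical grouping by target format (an assumption, for illustration only).
FAMILIES = {
    "web_faq": ["ja_cwir", "ja_gov_faqs"],
    "entity_label": ["jaqket", "mintaka_ja"],
    "passage": ["miracl_ja", "mr_tidy_japanese"],
    "long_doc": ["multi_long_doc_ja"],
    "paper_component": [t for t in SCORES if t.startswith("nlpjournal_")],
}


def profile_means(task_names):
    """Mean nDCG@10 per retrieval profile over the given tasks."""
    names = list(task_names)
    return {p: mean(SCORES[t][i] for t in names) for i, p in enumerate(PROFILES)}


overall = profile_means(SCORES)
by_family = {name: profile_means(tasks) for name, tasks in FAMILIES.items()}

print("overall", {p: round(v, 4) for p, v in overall.items()})
for name, means in by_family.items():
    print(name, {p: round(v, 4) for p, v in means.items()})
```

With the numbers from the table, the three overall means sit within about 0.01 of each other (all close to 0.75), while the per-family means separate the profiles far more clearly.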

The BM25-heavy tasks are not trivial; they show that exact Japanese terminology and paper vocabulary matter. Dense-led tasks show where answerability, short-label matching, or passage semantics matter more. Hybrid-led tasks show where exact entity anchors and semantic disambiguation both contribute.

Training and Leakage Notes

Useful training data includes Japanese search logs, government FAQ pairs, Wikipedia passage retrieval, JAQKET-style quiz/entity data, Mintaka-style complex QA, long-document question generation, and Japanese academic paper component matching. Hard negatives should come from same entities, same government programs, same article families, or same research subfields.
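The hard-negative advice above can be sketched as sampling negatives from documents that share a group with the positive. The `doc_groups` mapping (document id to entity, government program, article family, or subfield) is a hypothetical structure that a training pipeline would have to build itself; the ids and group names below are invented.

```python
import random


def sample_hard_negatives(positive_ids, doc_groups, n=2, seed=0):
    """Sample up to n non-positive documents from the groups of the query's positives."""
    positives = set(positive_ids)
    groups = {doc_groups[d] for d in positives}
    # Sorted pool plus a seeded RNG keeps the sampling reproducible.
    pool = sorted(d for d, g in doc_groups.items() if g in groups and d not in positives)
    return random.Random(seed).sample(pool, min(n, len(pool)))


# Hypothetical FAQ documents grouped by government program.
doc_groups = {"faq1": "pension", "faq2": "pension", "faq3": "pension", "faq4": "tax"}
negatives = sample_hard_negatives(["faq1"], doc_groups)
print(sorted(negatives))
```

Negatives drawn this way share vocabulary with the positive, so the model has to learn intent or entity distinctions rather than topic alone. In practice the sampled negatives should also be checked for unlabeled positives.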

Exclude NanoJMTEB-v2 evaluation queries, positives, qrels, paper sections, answer labels, and direct synthetic variants from training data. Upstream JMTEB, MTEB, JAQKET, Mintaka, MIRACL, Mr. TyDi, MultiLongDocRetrieval (MLDR), and NLP Journal evaluation examples should be audited before use in training.
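A first-pass leakage filter can be sketched as exact matching after text normalization. Japanese text often differs only in full-width versus half-width characters or spacing, so the sketch applies NFKC normalization from Python's standard `unicodedata` module. The function names and example strings are illustrative, and this is an assumed procedure rather than one prescribed by the benchmark.

```python
import unicodedata


def normalize(text):
    """Fold width variants with NFKC, drop whitespace, and casefold Latin characters."""
    folded = unicodedata.normalize("NFKC", text)
    return "".join(folded.split()).casefold()


def drop_leaked(train_texts, eval_texts):
    """Keep only training texts whose normalized form is not an evaluation text."""
    blocked = {normalize(t) for t in eval_texts}
    return [t for t in train_texts if normalize(t) not in blocked]


eval_queries = ["ＪＡＱＫＥＴとは何ですか？"]                     # full-width letters and question mark
train_queries = ["JAQKET とは何ですか?", "富士山の高さは？"]  # first one is a width/spacing variant
clean = drop_leaked(train_queries, eval_queries)
print(clean)
```

This catches only copies that are identical after normalization. Paraphrases and synthetic variants would need fuzzy or embedding-based matching on top of it.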

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| MTEB: Massive Text Embedding Benchmark | 2022 | paper | https://arxiv.org/abs/2210.07316 |
| MIRACL: A Multilingual Retrieval Dataset Covering 18 Diverse Languages | 2023 | paper | https://aclanthology.org/2023.tacl-1.63/ |
| Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval | 2021 | paper | https://arxiv.org/abs/2108.08787 |
| Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering | 2022 | paper | https://aclanthology.org/2022.coling-1.138/ |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 11 |
| Queries | 2,200 |
| Split-local documents | 64,140 |
| Positive qrels | 2,432 |
| Languages | ja |
| Categories | natural_language |
| Positives / query avg | 1.11 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ja_cwir | NanoJMTEB-v2 | ja | natural_language | 200 | 10,000 | 200 | 0.9181 | 0.8367 | 0.8810 | BM25 |
| ja_gov_faqs | NanoJMTEB-v2 | ja | natural_language | 200 | 10,000 | 200 | 0.7196 | 0.7487 | 0.7614 | Reranking hybrid |
| jaqket | NanoJMTEB-v2 | ja | natural_language | 200 | 10,000 | 200 | 0.7837 | 0.7830 | 0.7876 | Reranking hybrid |
| mintaka_ja | NanoJMTEB-v2 | ja | natural_language | 200 | 1,592 | 200 | 0.2561 | 0.3687 | 0.3354 | Dense |
| miracl_ja | NanoJMTEB-v2 | ja | natural_language | 200 | 10,000 | 373 | 0.5361 | 0.6923 | 0.6252 | Dense |
| mr_tidy_japanese | NanoJMTEB-v2 | ja | natural_language | 200 | 10,000 | 259 | 0.5518 | 0.7399 | 0.6633 | Dense |
| multi_long_doc_ja | NanoJMTEB-v2 | ja | natural_language | 200 | 10,000 | 200 | 0.5929 | 0.3956 | 0.5008 | BM25 |
| nlpjournal_abs_article | NanoJMTEB-v2 | ja | natural_language | 200 | 637 | 200 | 0.9982 | 0.9763 | 0.9863 | BM25 |
| nlpjournal_abs_intro | NanoJMTEB-v2 | ja | natural_language | 200 | 637 | 200 | 0.9896 | 0.9553 | 0.9545 | BM25 |
| nlpjournal_title_abs | NanoJMTEB-v2 | ja | natural_language | 200 | 637 | 200 | 0.9526 | 0.9290 | 0.9428 | BM25 |
| nlpjournal_title_intro | NanoJMTEB-v2 | ja | natural_language | 200 | 637 | 200 | 0.9132 | 0.8632 | 0.8704 | BM25 |