HAKARI-Bench

NanoMTEB-Thai

Overview

NanoMTEB-Thai is a compact Thai and Thai-English retrieval group aligned with MTEB-style task families. It includes Belebele reading-comprehension retrieval in cross-lingual and monolingual directions, Thai MIRACL and Mr. TyDi Wikipedia retrieval, Thai MKQA answer-label retrieval, Thai long-document retrieval, WebFAQ question-answer retrieval, and Thai XQuAD context retrieval. The group tests Thai retrieval across script handling, word segmentation, answer granularity, cross-lingual alignment, and document length.

The group contains 1,800 queries, 48,356 task-local documents, and 2,077 positive qrel rows. Most tasks are single-positive or near single-positive, but MIRACL, MKQA, and Mr. TyDi include multiple positives for some queries. It is a useful diagnostic because Thai retrieval quality changes sharply depending on whether the target is a passage, a short answer label, a long document, or a document in another language.

What This Group Measures

The group measures several Thai retrieval relations. Belebele tests whether reading-comprehension relevance survives Thai-English direction changes. miracl_th and mr_tidy_thai retrieve Thai Wikipedia-style evidence passages. mkqa_th retrieves short accepted answers such as names, dates, numbers, or locations. multi_long_doc_th retrieves full long Thai documents from generated questions. web_faq_tha retrieves FAQ answers, and xqu_ad_th retrieves the Thai context paragraph that answers a translated QA question.

This mix separates language segmentation from retrieval semantics. BM25 can be excellent when Thai query and passage wording overlaps, but it collapses on cross-lingual Belebele directions and MKQA answer labels. Dense retrieval handles cross-lingual and semantic matching much better, while hybrid retrieval helps with recall and with tasks where exact Thai terms and semantic relatedness both matter.

Task Families

Dataset Shape

The group has nine task pages with 200 queries each. Candidate pools range from 240 documents for xqu_ad_th to 10,000 documents for MIRACL, Mr. TyDi, WebFAQ, and the long-document task. mkqa_th has short answer labels with an average of 1.5 positives per query. miracl_th and mr_tidy_thai are also multi-positive, while the Belebele, WebFAQ, XQuAD, and long-document tasks are single-positive in the Nano splits.
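The counts above can be cross-checked with a few lines of arithmetic. The sketch below hard-codes the per-task query, document, and positive counts reported for this group. The name `TASKS` and the tuple layout are illustrative assumptions, not part of any benchmark API.

```python
# Per-task (queries, docs, positives), copied from the counts reported for this group.
# The dictionary name and layout are illustrative only.
TASKS = {
    "belebele_eng_latn_tha_thai": (200, 488, 200),
    "belebele_tha_thai_eng_latn": (200, 488, 200),
    "belebele_tha_thai_tha_thai": (200, 488, 200),
    "miracl_th": (200, 10_000, 343),
    "mkqa_th": (200, 6_652, 300),
    "mr_tidy_thai": (200, 10_000, 234),
    "multi_long_doc_th": (200, 10_000, 200),
    "web_faq_tha": (200, 10_000, 200),
    "xqu_ad_th": (200, 240, 200),
}

total_queries = sum(q for q, _, _ in TASKS.values())
total_docs = sum(d for _, d, _ in TASKS.values())
total_positives = sum(p for _, _, p in TASKS.values())

# Positives per query for each task; values above 1.0 mark multi-positive tasks.
positives_per_query = {name: p / q for name, (q, _, p) in TASKS.items()}
multi_positive = sorted(n for n, r in positives_per_query.items() if r > 1.0)

print(total_queries, total_docs, total_positives)    # 1800 48356 2077
print(round(total_positives / total_queries, 2))     # 1.15
print(multi_positive)  # ['miracl_th', 'mkqa_th', 'mr_tidy_thai']
```

The totals agree with the group-level figures of 1,800 queries, 48,356 task-local documents, and 2,077 positive qrel rows.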

Document length is uneven. mkqa_th documents are short labels. Belebele, MIRACL, Mr. TyDi, WebFAQ, and XQuAD use passage-style documents. multi_long_doc_th is the outlier, with very long Thai documents. This makes the group sensitive to tokenization, truncation, and whether a model can rank long noisy pages without losing the relevant evidence.

Retrieval Behavior

BM25 Profile

BM25 is best for multi_long_doc_th and xqu_ad_th, and on Thai-to-Thai Belebele it is nearly tied with dense (0.9297 versus 0.9287 nDCG@10), although the reranking hybrid profile leads there. It performs very well when the query and document share Thai lexical evidence: xqu_ad_th reaches 0.9835 nDCG@10, mr_tidy_thai reaches 0.8502, web_faq_tha reaches 0.7607, and Thai-to-Thai Belebele reaches 0.9297. These scores show that sparse retrieval can be strong when segmentation and exact terms line up.

The failures are equally clear. BM25 scores 0.0891 and 0.0944 on the two cross-lingual Belebele directions, and only 0.0182 on mkqa_th. Thai-English matching has little direct lexical overlap, and short answer labels often do not repeat the question wording. BM25 is therefore a strong monolingual passage baseline, but not a robust solution for the whole Thai group.
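For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@10 for a single query. It assumes binary relevance (a document is either a positive qrel or not), the common log2 rank discount, and a ranked list without duplicate document IDs. The text does not specify the evaluator actually used for these scores, so the function name `ndcg_at_k` and its details are illustrative.

```python
import math

def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """Binary-relevance nDCG@k for one query (assumes unique doc IDs in the ranking)."""
    positives = set(positive_ids)
    # Discounted gain of every positive found in the top k (rank is 0-based here).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked_doc_ids[:k])
        if doc_id in positives
    )
    # Ideal ranking places all positives first, capped at k.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

# Single positive ranked second: 1 / log2(3), roughly 0.63.
score = ndcg_at_k(["d3", "d1", "d7"], {"d1"})
```

Under this definition, a single-positive task scores 1.0 for a query only when the positive is ranked first, which is why scores near 0.98 on xqu_ad_th imply almost every positive is at the top of the list.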

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest query-weighted profile at 0.6978 nDCG@10. It is best for both cross-lingual Belebele directions, miracl_th, mkqa_th, and mr_tidy_thai. The cross-lingual gain is dramatic: Thai-to-English Belebele rises from 0.0891 BM25 nDCG@10 to 0.8483 dense nDCG@10, and English-to-Thai rises from 0.0944 to 0.8046. This is the main semantic alignment signal in the group.

Dense is also the best profile for mkqa_th, although the absolute score is low at 0.0359, which indicates that answer-label retrieval remains difficult even with embeddings. Dense is weaker than BM25 on multi_long_doc_th (0.2125 versus 0.3684) and xqu_ad_th (0.9459 versus 0.9835). Exact Thai terms favor sparse matching on both tasks, and xqu_ad_th additionally has a small candidate pool of 240 documents.
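The query-weighted figure quoted for the dense profile can be reproduced from the per-task nDCG@10 scores. The sketch below assumes "query-weighted" means each task's score is weighted by its query count; because every task here has 200 queries, this coincides with the plain mean over tasks. The names `SCORES` and `query_weighted_mean` are illustrative.

```python
# task -> (queries, BM25, dense, reranking hybrid) nDCG@10, as reported for this group.
SCORES = {
    "belebele_eng_latn_tha_thai": (200, 0.0891, 0.8483, 0.2919),
    "belebele_tha_thai_eng_latn": (200, 0.0944, 0.8046, 0.2741),
    "belebele_tha_thai_tha_thai": (200, 0.9297, 0.9287, 0.9615),
    "miracl_th": (200, 0.5999, 0.8076, 0.7250),
    "mkqa_th": (200, 0.0182, 0.0359, 0.0272),
    "mr_tidy_thai": (200, 0.8502, 0.9147, 0.8914),
    "multi_long_doc_th": (200, 0.3684, 0.2125, 0.3672),
    "web_faq_tha": (200, 0.7607, 0.7822, 0.7866),
    "xqu_ad_th": (200, 0.9835, 0.9459, 0.9674),
}

def query_weighted_mean(column):
    """Mean of one score column (1=BM25, 2=dense, 3=hybrid), weighted by query count."""
    total_queries = sum(row[0] for row in SCORES.values())
    return sum(row[0] * row[column] for row in SCORES.values()) / total_queries

bm25_mean = query_weighted_mean(1)
dense_mean = query_weighted_mean(2)
hybrid_mean = query_weighted_mean(3)
print(round(dense_mean, 4))  # 0.6978, matching the reported dense figure
```

The BM25 and hybrid means computed the same way are derived values, not figures reported in the text; both come out below the dense mean.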

Reranking Hybrid Profile

The reranking hybrid profile is best for Thai-to-Thai Belebele and web_faq_tha, and it has the best query-weighted recall@100 at 0.8556. It works well when exact Thai terms and semantic similarity are both useful: Thai-to-Thai Belebele reaches 0.9615 nDCG@10 and WebFAQ reaches 0.7866. Hybrid also improves recall for MIRACL, Mr. TyDi, XQuAD, and the long-document task.

Hybrid is much weaker than dense on the cross-lingual Belebele tasks: it reaches 0.2919 and 0.2741 nDCG@10 where dense reaches 0.8483 and 0.8046. Sparse evidence adds little when query and document are in different languages, so the hybrid profile falls far below dense in top-10 ranking. This group therefore supports a task-aware interpretation: hybrid is useful for Thai monolingual candidate coverage, but dense retrieval is essential for Thai-English semantic alignment.
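The recall@100 figure quoted for the hybrid profile measures candidate coverage rather than top-of-list ranking. A minimal sketch of that metric follows, assuming recall is computed per query as the fraction of positives found in the top k and then averaged over queries. The text does not state the exact averaging used, and the function names and the `runs`/`qrels` layout are hypothetical.

```python
def recall_at_k(ranked_doc_ids, positive_ids, k=100):
    """Fraction of a query's positives that appear in the top k results."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    retrieved = set(ranked_doc_ids[:k])
    return len(positives & retrieved) / len(positives)

def mean_recall_at_k(runs, qrels, k=100):
    """runs: query_id -> ranked doc IDs; qrels: query_id -> positive doc IDs."""
    per_query = [recall_at_k(runs.get(qid, []), pos, k) for qid, pos in qrels.items()]
    return sum(per_query) / len(per_query) if per_query else 0.0

# Toy data: q1 finds one of two positives, q2 finds its only positive.
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
qrels = {"q1": {"b", "z"}, "q2": {"y"}}
toy_recall = mean_recall_at_k(runs, qrels, k=100)  # (0.5 + 1.0) / 2 = 0.75
```

A profile can have high recall@100 and low nDCG@10 at the same time, which is consistent with hybrid leading on coverage while trailing dense on the cross-lingual top-10 rankings.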

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| belebele_eng_latn_tha_thai | Cross-lingual reading retrieval | multilingual | 200 | 488 | 200 | 1.00 | 0.0891 | 0.8483 | 0.2919 | Dense |
| belebele_tha_thai_eng_latn | Cross-lingual reading retrieval | multilingual | 200 | 488 | 200 | 1.00 | 0.0944 | 0.8046 | 0.2741 | Dense |
| belebele_tha_thai_tha_thai | Monolingual reading retrieval | th | 200 | 488 | 200 | 1.00 | 0.9297 | 0.9287 | 0.9615 | Reranking hybrid |
| miracl_th | Wikipedia retrieval | th | 200 | 10,000 | 343 | 1.72 | 0.5999 | 0.8076 | 0.7250 | Dense |
| mkqa_th | Answer-label retrieval | multilingual | 200 | 6,652 | 300 | 1.50 | 0.0182 | 0.0359 | 0.0272 | Dense |
| mr_tidy_thai | Wikipedia retrieval | th | 200 | 10,000 | 234 | 1.17 | 0.8502 | 0.9147 | 0.8914 | Dense |
| multi_long_doc_th | Long-document retrieval | th | 200 | 10,000 | 200 | 1.00 | 0.3684 | 0.2125 | 0.3672 | BM25 |
| web_faq_tha | FAQ retrieval | th | 200 | 10,000 | 200 | 1.00 | 0.7607 | 0.7822 | 0.7866 | Reranking hybrid |
| xqu_ad_th | QA context retrieval | th | 200 | 240 | 200 | 1.00 | 0.9835 | 0.9459 | 0.9674 | BM25 |
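The best-profile label for each task is simply the profile with the highest nDCG@10. The sketch below re-derives it from the reported scores and also computes the margin over the runner-up, which shows how narrow some wins are. The names `NDCG10`, `best_profile`, and `margins` are illustrative.

```python
# task -> profile -> nDCG@10, as reported in the task summary.
NDCG10 = {
    "belebele_eng_latn_tha_thai": {"BM25": 0.0891, "Dense": 0.8483, "Reranking hybrid": 0.2919},
    "belebele_tha_thai_eng_latn": {"BM25": 0.0944, "Dense": 0.8046, "Reranking hybrid": 0.2741},
    "belebele_tha_thai_tha_thai": {"BM25": 0.9297, "Dense": 0.9287, "Reranking hybrid": 0.9615},
    "miracl_th": {"BM25": 0.5999, "Dense": 0.8076, "Reranking hybrid": 0.7250},
    "mkqa_th": {"BM25": 0.0182, "Dense": 0.0359, "Reranking hybrid": 0.0272},
    "mr_tidy_thai": {"BM25": 0.8502, "Dense": 0.9147, "Reranking hybrid": 0.8914},
    "multi_long_doc_th": {"BM25": 0.3684, "Dense": 0.2125, "Reranking hybrid": 0.3672},
    "web_faq_tha": {"BM25": 0.7607, "Dense": 0.7822, "Reranking hybrid": 0.7866},
    "xqu_ad_th": {"BM25": 0.9835, "Dense": 0.9459, "Reranking hybrid": 0.9674},
}

# Profile with the highest score per task.
best_profile = {task: max(scores, key=scores.get) for task, scores in NDCG10.items()}

# Gap between the winner and the runner-up per task.
margins = {}
for task, scores in NDCG10.items():
    ordered = sorted(scores.values(), reverse=True)
    margins[task] = ordered[0] - ordered[1]

for task in NDCG10:
    print(f"{task}: {best_profile[task]} (+{margins[task]:.4f})")
```

On multi_long_doc_th the BM25 lead over hybrid is about 0.0012, so that label should be read as a near tie, whereas the dense lead on the cross-lingual Belebele tasks exceeds 0.5.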

Interpretation Notes for Model Researchers

NanoMTEB-Thai has one of the clearest divisions between retrieval profiles. Dense retrieval is crucial for Thai-English tasks and semantic passage retrieval. BM25 remains very strong for monolingual Thai contexts with lexical overlap. Hybrid helps with monolingual passage and FAQ coverage but does not solve cross-lingual retrieval when sparse evidence is absent.

mkqa_th should be read separately from the passage tasks. All profiles score low because the target is a short answer label, not an explanatory passage. A model can improve MIRACL or Belebele substantially while still failing to rank short Thai answer labels. Long-document retrieval is also separate: success there depends on handling long noisy documents and exact evidence anchors.

Training and Leakage Notes

Useful training data includes Thai Wikipedia QA, MIRACL Thai, Mr. TyDi Thai, XQuAD-style question-context pairs, Thai-English parallel reading-comprehension data, MKQA-like answer-label supervision, Thai long-document question-to-article pairs, and Thai FAQ question-answer pairs. Cross-lingual directions should be kept explicit rather than mixed into one undifferentiated retrieval objective.

Leakage control should exclude Nano evaluation queries, qrels, positive documents, answer labels, generated long-document questions, and upstream evaluation rows. Synthetic examples should preserve Thai script, segmentation, named entities, dates, numbers, answer types, FAQ wording, and long-document evidence locations. Hard negatives should be drawn from related entities, adjacent FAQ entries, same-topic Wikipedia pages, or nearby sections of long documents.
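One way to apply the exclusion rule is an exact-match filter over normalized text. The sketch below is a minimal version under stated assumptions: training data is a list of (query, document) string pairs, and leakage is detected only by exact match after Unicode NFC normalization, zero-width character removal, lowercasing, and whitespace collapsing. The function names are hypothetical, and near-duplicate or paraphrase leakage would need a stronger check than this.

```python
import unicodedata

def normalize(text):
    """NFC-normalize, drop zero-width space and BOM, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\u200b", "").replace("\ufeff", "")
    return " ".join(text.lower().split())

def filter_training_pairs(pairs, eval_queries, eval_positive_docs):
    """Drop any (query, doc) pair whose query or doc exactly matches evaluation data."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    kept = []
    for query, doc in pairs:
        if normalize(query) in blocked_queries or normalize(doc) in blocked_docs:
            continue  # leaked query or leaked positive document
        kept.append((query, doc))
    return kept

# Toy example with made-up strings (not taken from any evaluation split).
eval_queries = ["เมืองหลวงของประเทศไทยคืออะไร"]
eval_positive_docs = ["กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย"]
pairs = [
    ("เมืองหลวง\u200bของประเทศไทยคืออะไร", "เอกสารอื่น"),           # leaked query (zero-width space)
    ("คำถามอื่น", "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย"),         # leaked positive document
    ("ภูเขาที่สูงที่สุดในประเทศไทยคืออะไร", "ดอยอินทนนท์เป็นภูเขาที่สูงที่สุดในประเทศไทย"),
]
clean = filter_training_pairs(pairs, eval_queries, eval_positive_docs)
```

Stripping zero-width spaces matters for Thai because they are sometimes inserted as invisible word-break hints, so two visually identical strings can differ at the byte level.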

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| The Belebele Benchmark | 2023 | source task paper | https://arxiv.org/abs/2308.16884 |
| Making a MIRACL | 2022 | source task paper | https://arxiv.org/abs/2210.09984 |
| MKQA | 2020 | source task paper | https://arxiv.org/abs/2007.15207 |
| Mr. TyDi | 2021 | source task paper | https://arxiv.org/abs/2108.08787 |
| M3-Embedding / MLDR | 2024 | source task paper | https://arxiv.org/abs/2402.03216 |
| WebFAQ | 2025 | source task paper | https://arxiv.org/abs/2502.20936 |
| On the Cross-lingual Transferability of Monolingual Representations | 2019 | source task paper | https://arxiv.org/abs/1910.11856 |
| mteb/belebele | | dataset card | https://huggingface.co/datasets/mteb/belebele |
| mteb/MIRACLRetrievalHardNegatives | | dataset card | https://huggingface.co/datasets/mteb/MIRACLRetrievalHardNegatives |
| mteb/MKQARetrieval | | dataset card | https://huggingface.co/datasets/mteb/MKQARetrieval |
| mteb/mrtidy | | dataset card | https://huggingface.co/datasets/mteb/mrtidy |
| mteb/MultiLongDocRetrieval | | dataset card | https://huggingface.co/datasets/mteb/MultiLongDocRetrieval |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 9 |
| Queries | 1,800 |
| Split-local documents | 48,356 |
| Positive qrels | 2,077 |
| Languages | multilingual, th |
| Categories | natural_language |
| Positives / query avg | 1.15 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| belebele_eng_latn_tha_thai | NanoMTEB-Thai | multilingual | natural_language | 200 | 488 | 200 | 0.0891 | 0.8483 | 0.2919 | Dense |
| belebele_tha_thai_eng_latn | NanoMTEB-Thai | multilingual | natural_language | 200 | 488 | 200 | 0.0944 | 0.8046 | 0.2741 | Dense |
| belebele_tha_thai_tha_thai | NanoMTEB-Thai | th | natural_language | 200 | 488 | 200 | 0.9297 | 0.9287 | 0.9615 | Reranking hybrid |
| miracl_th | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 343 | 0.5999 | 0.8076 | 0.7250 | Dense |
| mkqa_th | NanoMTEB-Thai | multilingual | natural_language | 200 | 6,652 | 300 | 0.0182 | 0.0359 | 0.0272 | Dense |
| mr_tidy_thai | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 234 | 0.8502 | 0.9147 | 0.8914 | Dense |
| multi_long_doc_th | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 200 | 0.3684 | 0.2125 | 0.3672 | BM25 |
| web_faq_tha | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 200 | 0.7607 | 0.7822 | 0.7866 | Reranking hybrid |
| xqu_ad_th | NanoMTEB-Thai | th | natural_language | 200 | 240 | 200 | 0.9835 | 0.9459 | 0.9674 | BM25 |