NanoMTEB-Thai
Overview
NanoMTEB-Thai is a compact Thai and Thai-English retrieval group aligned with MTEB-style task families. It includes Belebele reading-comprehension retrieval in cross-lingual and monolingual directions, Thai MIRACL and Mr. TyDi Wikipedia retrieval, Thai MKQA answer-label retrieval, Thai long-document retrieval, WebFAQ question-answer retrieval, and Thai XQuAD context retrieval. The group tests Thai retrieval across script handling, word segmentation, answer granularity, cross-lingual alignment, and document length.
The group contains 1,800 queries, 48,356 task-local documents, and 2,077 positive qrel rows. Most tasks are single-positive or near single-positive, but MIRACL, MKQA, and Mr. TyDi include multiple positives for some queries. It is a useful diagnostic because Thai retrieval quality changes sharply depending on whether the target is a passage, a short answer label, a long document, or a document in another language.
What This Group Measures
The group measures several Thai retrieval relations. Belebele tests whether reading-comprehension relevance survives Thai-English direction changes. miracl_th and mr_tidy_thai retrieve Thai Wikipedia-style evidence passages. mkqa_th retrieves short accepted answers such as names, dates, numbers, or locations. multi_long_doc_th retrieves full long Thai documents from generated questions. web_faq_tha retrieves FAQ answers, and xqu_ad_th retrieves the Thai context paragraph that answers a translated QA question.
This mix separates language segmentation from retrieval semantics. BM25 can be excellent when the Thai query and passage share wording, but it collapses on the cross-lingual Belebele directions and on MKQA answer labels. Dense retrieval handles cross-lingual and semantic matching much better, while hybrid retrieval helps with recall and with tasks where exact Thai terms and semantic relatedness both matter.
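The lexical-overlap argument can be made concrete with a small sketch. It measures how many character bigrams of a Thai query also occur in a Thai passage versus an English passage. The strings are hypothetical examples written for illustration, not benchmark rows, and character bigrams are an assumption used here to avoid choosing a Thai word segmenter.

```python
def char_bigrams(text: str) -> set[str]:
    """Character bigrams over the text with whitespace removed.

    Character n-grams are an assumption made here to sidestep Thai word
    segmentation; a real BM25 setup would use a proper Thai tokenizer.
    """
    compact = "".join(text.split())
    return {compact[i:i + 2] for i in range(len(compact) - 1)}


def overlap(query: str, document: str) -> float:
    """Fraction of query bigrams that also occur in the document."""
    q = char_bigrams(query)
    if not q:
        return 0.0
    return len(q & char_bigrams(document)) / len(q)


# Hypothetical strings written for illustration, not rows from the benchmark.
query_th = "เมืองหลวงของประเทศไทยคืออะไร"
passage_th = "กรุงเทพมหานครเป็นเมืองหลวงของประเทศไทย"
passage_en = "Bangkok is the capital of Thailand."

print(f"Thai query vs Thai passage:    {overlap(query_th, passage_th):.2f}")
print(f"Thai query vs English passage: {overlap(query_th, passage_en):.2f}")
```

Because the two scripts share no characters, the cross-lingual overlap is exactly zero, which leaves a term-matching scorer with nothing to rank on.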
Task Families
- Belebele reading-comprehension retrieval: three tasks cover Thai-to-English, English-to-Thai, and Thai-to-Thai retrieval.
- Thai Wikipedia retrieval: miracl_th and mr_tidy_thai retrieve evidence passages for Thai information needs.
- Short answer-label retrieval: mkqa_th retrieves canonical answer labels for Thai questions.
- Long-document retrieval: multi_long_doc_th retrieves full long Thai documents.
- FAQ retrieval: web_faq_tha retrieves Thai web FAQ answer snippets.
- Translated QA context retrieval: xqu_ad_th retrieves answer-bearing Thai XQuAD contexts.
Dataset Shape
The group has nine task pages with 200 queries each. Candidate pools range from 240 documents for xqu_ad_th to 10,000 documents for MIRACL, Mr. TyDi, WebFAQ, and the long-document task. mkqa_th has short answer labels with an average of 1.50 positives per query. miracl_th and mr_tidy_thai are also multi-positive, at 1.72 and 1.17 positives per query respectively, while the Belebele, WebFAQ, XQuAD, and long-document tasks are single-positive in the Nano splits.
Document length is uneven. mkqa_th documents are short labels. Belebele, MIRACL, Mr. TyDi, WebFAQ, and XQuAD use passage-style documents. multi_long_doc_th is the outlier, with very long Thai documents. This makes the group sensitive to tokenization, truncation, and whether a model can rank long noisy pages without losing the relevant evidence.
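The truncation risk can be illustrated with a minimal sketch that compares plain truncation against overlapping windows. All numbers are hypothetical: token counts depend on the model's tokenizer, and the window and stride values are assumptions, not settings used to produce any score in this document.

```python
def window_spans(n_tokens: int, window: int, stride: int) -> list[tuple[int, int]]:
    """Return half-open (start, end) token spans that together cover the document."""
    if window <= 0 or not 0 < stride <= window:
        raise ValueError("need window > 0 and 0 < stride <= window")
    spans = []
    start = 0
    while True:
        end = min(start + window, n_tokens)
        spans.append((start, end))
        if end >= n_tokens:
            break
        start += stride
    return spans


def covers(span: tuple[int, int], evidence: tuple[int, int]) -> bool:
    """True if the evidence span lies entirely inside the given span."""
    return span[0] <= evidence[0] and evidence[1] <= span[1]


# Hypothetical numbers: a 1,000-token document whose evidence sits at
# tokens 900-950, encoded by a model limited to 512 tokens.
n_tokens, max_len = 1000, 512
evidence = (900, 950)

truncated = (0, min(max_len, n_tokens))
spans = window_spans(n_tokens, window=max_len, stride=384)

print("truncation keeps evidence:", covers(truncated, evidence))
print("windows:", spans)
print("some window keeps evidence:", any(covers(s, evidence) for s in spans))
```

Truncation discards the evidence in this setup, while the last window retains it. Scoring a document by its best window is one possible design, at the cost of encoding each long document several times.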
Retrieval Behavior
BM25 Profile
BM25 is best for multi_long_doc_th and xqu_ad_th, and it is nearly tied with dense on Thai-to-Thai Belebele (0.9297 vs. 0.9287 nDCG@10), although the reranking hybrid leads that task. It performs very well when the query and document share Thai lexical evidence: xqu_ad_th reaches 0.9835 nDCG@10, mr_tidy_thai reaches 0.8502, web_faq_tha reaches 0.7607, and Thai-to-Thai Belebele reaches 0.9297. These scores show that sparse retrieval can be strong when segmentation and exact terms line up.
The failures are equally clear. BM25 scores 0.0891 and 0.0944 on the two cross-lingual Belebele directions, and only 0.0182 on mkqa_th. Thai-English matching has little direct lexical overlap, and short answer labels often do not repeat the question wording. BM25 is therefore a strong monolingual passage baseline, but not a robust solution for the whole Thai group.
Dense Profile
Dense retrieval with harrier-oss-270m is the strongest query-weighted profile at 0.6978 nDCG@10. It is best for both cross-lingual Belebele directions, miracl_th, mkqa_th, and mr_tidy_thai. The cross-lingual gain is dramatic: Thai-to-English Belebele rises from 0.0891 BM25 nDCG@10 to 0.8483 dense nDCG@10, and English-to-Thai rises from 0.0944 to 0.8046. This is the main semantic alignment signal in the group.
Dense is also the best profile for mkqa_th, although the absolute score is low at 0.0359, which indicates that answer-label retrieval remains difficult even with embeddings. Dense is weaker than BM25 on multi_long_doc_th (0.2125 vs. 0.3684) and xqu_ad_th (0.9459 vs. 0.9835), where exact Thai terms and, for xqu_ad_th, a small 240-document pool favor sparse matching.
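The query-weighted nDCG@10 quoted for the dense profile can be rechecked from the per-task scores in the Task Summary table. The sketch below weights each task by its query count; the dictionary layout and function name are illustrative choices, not part of any benchmark tooling.

```python
# (queries, BM25, dense, reranking hybrid) nDCG@10 per task,
# copied from the Task Summary table.
TASKS = {
    "belebele_eng_latn_tha_thai": (200, 0.0891, 0.8483, 0.2919),
    "belebele_tha_thai_eng_latn": (200, 0.0944, 0.8046, 0.2741),
    "belebele_tha_thai_tha_thai": (200, 0.9297, 0.9287, 0.9615),
    "miracl_th": (200, 0.5999, 0.8076, 0.7250),
    "mkqa_th": (200, 0.0182, 0.0359, 0.0272),
    "mr_tidy_thai": (200, 0.8502, 0.9147, 0.8914),
    "multi_long_doc_th": (200, 0.3684, 0.2125, 0.3672),
    "web_faq_tha": (200, 0.7607, 0.7822, 0.7866),
    "xqu_ad_th": (200, 0.9835, 0.9459, 0.9674),
}


def query_weighted(column: int) -> float:
    """Mean of one score column, weighted by each task's query count."""
    total_queries = sum(row[0] for row in TASKS.values())
    return sum(row[0] * row[column] for row in TASKS.values()) / total_queries


profiles = {"bm25": 1, "dense": 2, "hybrid": 3}
scores = {name: query_weighted(col) for name, col in profiles.items()}
for name, value in scores.items():
    print(f"{name:>6}: {value:.4f}")
```

Since every task has 200 queries, the weighted mean equals the plain mean over tasks. With the table values it comes to 0.6978 for dense, about 0.5216 for BM25, and about 0.5880 for the reranking hybrid.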
Reranking Hybrid Profile
The reranking hybrid profile is best for Thai-to-Thai Belebele and web_faq_tha, and it has the best query-weighted recall@100 at 0.8556. It works well when exact Thai terms and semantic similarity are both useful: Thai-to-Thai Belebele reaches 0.9615 nDCG@10 and WebFAQ reaches 0.7866. Hybrid also improves recall for MIRACL, Mr. TyDi, XQuAD, and the long-document task.
Hybrid is much weaker than dense on the cross-lingual Belebele tasks. Sparse evidence adds little when query and document are in different languages, so the hybrid profile falls far below dense in top-10 ranking: 0.2919 vs. 0.8483 nDCG@10 on one direction and 0.2741 vs. 0.8046 on the other. This group therefore supports a task-aware interpretation: hybrid is useful for Thai monolingual candidate coverage, but dense retrieval is essential for Thai-English semantic alignment.
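The source does not say how the reranking hybrid profile combines sparse and dense evidence. As a rough illustration of why a fused ranking can fall below dense alone when the sparse side contributes no relevant hits, the sketch below uses reciprocal rank fusion (RRF). That choice is an assumption made for simplicity and is not necessarily the method behind the reported scores; the document IDs and the constant k=60 are hypothetical.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked ID lists with reciprocal rank fusion: score = sum of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Cross-lingual case (hypothetical IDs): dense ranks the relevant document
# first, but BM25 never retrieves it because query and document share no terms.
dense_xl = ["rel", "a", "b", "c"]
bm25_xl = ["a", "b", "c"]
fused_xl = rrf([dense_xl, bm25_xl])

# Monolingual case: both retrievers place the relevant document first.
dense_mono = ["rel", "a", "b"]
bm25_mono = ["rel", "b", "a"]
fused_mono = rrf([dense_mono, bm25_mono])

print("cross-lingual:", fused_xl)
print("monolingual:  ", fused_mono)
```

In the cross-lingual case the relevant document slips from first to last because the other three documents collect credit from both lists. In the monolingual case the two retrievers agree, so the relevant document stays first.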
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| belebele_eng_latn_tha_thai | Cross-lingual reading retrieval | multilingual | 200 | 488 | 200 | 1.00 | 0.0891 | 0.8483 | 0.2919 | Dense |
| belebele_tha_thai_eng_latn | Cross-lingual reading retrieval | multilingual | 200 | 488 | 200 | 1.00 | 0.0944 | 0.8046 | 0.2741 | Dense |
| belebele_tha_thai_tha_thai | Monolingual reading retrieval | th | 200 | 488 | 200 | 1.00 | 0.9297 | 0.9287 | 0.9615 | Reranking hybrid |
| miracl_th | Wikipedia retrieval | th | 200 | 10,000 | 343 | 1.72 | 0.5999 | 0.8076 | 0.7250 | Dense |
| mkqa_th | Answer-label retrieval | multilingual | 200 | 6,652 | 300 | 1.50 | 0.0182 | 0.0359 | 0.0272 | Dense |
| mr_tidy_thai | Wikipedia retrieval | th | 200 | 10,000 | 234 | 1.17 | 0.8502 | 0.9147 | 0.8914 | Dense |
| multi_long_doc_th | Long-document retrieval | th | 200 | 10,000 | 200 | 1.00 | 0.3684 | 0.2125 | 0.3672 | BM25 |
| web_faq_tha | FAQ retrieval | th | 200 | 10,000 | 200 | 1.00 | 0.7607 | 0.7822 | 0.7866 | Reranking hybrid |
| xqu_ad_th | QA context retrieval | th | 200 | 240 | 200 | 1.00 | 0.9835 | 0.9459 | 0.9674 | BM25 |
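For readers less familiar with the metric in the score columns, here is a minimal sketch of nDCG@10 for a single query under binary relevance. It assumes every positive qrel has gain 1 and that the ranked list contains no duplicate IDs; the evaluation code behind the reported numbers may differ in such details.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k for one query.

    ranked_ids: document IDs in ranked order (assumed free of duplicates).
    positives:  set of IDs judged relevant for the query.
    """
    if not positives:
        raise ValueError("query needs at least one positive")
    # Discounted gain of each relevant document found in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ranking places all positives (up to k) at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg


# Single-positive query: the score depends only on the positive's rank.
print(ndcg_at_k(["p", "x", "y"], {"p"}))
print(ndcg_at_k(["x", "p", "y"], {"p"}))

# Multi-positive query: missing or delaying any positive lowers the score.
print(ndcg_at_k(["p1", "x", "p2"], {"p1", "p2"}))
```

For single-positive tasks the per-query score is a pure function of the rank of the one relevant document, whereas multi-positive tasks such as miracl_th also reward placing the additional positives high.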
Interpretation Notes for Model Researchers
NanoMTEB-Thai has one of the clearest divisions between retrieval profiles. Dense retrieval is crucial for Thai-English tasks and semantic passage retrieval. BM25 remains very strong for monolingual Thai contexts with lexical overlap. Hybrid helps with monolingual passage and FAQ coverage but does not solve cross-lingual retrieval when sparse evidence is absent.
mkqa_th should be read separately from the passage tasks. All three profiles score below 0.04 nDCG@10 because the target is a short answer label, not an explanatory passage. A model can improve MIRACL or Belebele substantially while still failing to rank short Thai answer labels. Long-document retrieval is also separate: success there depends on handling long noisy documents and exact evidence anchors.
Training and Leakage Notes
Useful training data includes Thai Wikipedia QA, MIRACL Thai, Mr. TyDi Thai, XQuAD-style question-context pairs, Thai-English parallel reading-comprehension data, MKQA-like answer-label supervision, Thai long-document question-to-article pairs, and Thai FAQ question-answer pairs. Cross-lingual directions should be kept explicit rather than mixed into one undifferentiated retrieval objective.
Leakage control should exclude Nano evaluation queries, qrels, positive documents, answer labels, generated long-document questions, and upstream evaluation rows. Synthetic examples should preserve Thai script, segmentation, named entities, dates, numbers, answer types, FAQ wording, and long-document evidence locations. Hard negatives should be drawn from related entities, adjacent FAQ entries, same-topic Wikipedia pages, or nearby sections of long documents.
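A minimal sketch of the exclusion step is shown below, assuming training rows are dictionaries with hypothetical "query" and "positive" fields and that the evaluation texts are available as plain strings. It catches only exact matches after Unicode and whitespace normalization; near-duplicates and paraphrases would need fuzzy matching on top.

```python
import hashlib
import unicodedata


def fingerprint(text: str) -> str:
    """Hash of the text after NFC normalization, whitespace collapsing, and case folding."""
    normalized = unicodedata.normalize("NFC", text)
    normalized = " ".join(normalized.split()).casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_leaked(training_rows, eval_texts):
    """Keep only rows whose query and positive text both differ from every evaluation text."""
    blocked = {fingerprint(text) for text in eval_texts}
    return [
        row
        for row in training_rows
        if fingerprint(row["query"]) not in blocked
        and fingerprint(row["positive"]) not in blocked
    ]


# Hypothetical data: one evaluation query and two candidate training rows.
eval_texts = ["เมืองหลวงของประเทศไทยคืออะไร"]
training_rows = [
    {"query": "  เมืองหลวงของประเทศไทยคืออะไร ", "positive": "กรุงเทพมหานคร"},
    {"query": "แม่น้ำที่ยาวที่สุดในประเทศไทยคืออะไร", "positive": "แม่น้ำชี"},
]

kept = drop_leaked(training_rows, eval_texts)
print(len(kept), "of", len(training_rows), "rows kept")
```

The first row is dropped even though its query has extra whitespace, because normalization happens before hashing. In practice the blocked set would hold every evaluation query, positive document, and answer label.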
Source Reference Table
| Source | Year | Type | URL |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| The Belebele Benchmark | 2023 | source task paper | https://arxiv.org/abs/2308.16884 |
| Making a MIRACL | 2022 | source task paper | https://arxiv.org/abs/2210.09984 |
| MKQA | 2020 | source task paper | https://arxiv.org/abs/2007.15207 |
| Mr. TyDi | 2021 | source task paper | https://arxiv.org/abs/2108.08787 |
| M3-Embedding / MLDR | 2024 | source task paper | https://arxiv.org/abs/2402.03216 |
| WebFAQ | 2025 | source task paper | https://arxiv.org/abs/2502.20936 |
| On the Cross-lingual Transferability of Monolingual Representations | 2019 | source task paper | https://arxiv.org/abs/1910.11856 |
| mteb/belebele | | dataset card | https://huggingface.co/datasets/mteb/belebele |
| mteb/MIRACLRetrievalHardNegatives | | dataset card | https://huggingface.co/datasets/mteb/MIRACLRetrievalHardNegatives |
| mteb/MKQARetrieval | | dataset card | https://huggingface.co/datasets/mteb/MKQARetrieval |
| mteb/mrtidy | | dataset card | https://huggingface.co/datasets/mteb/mrtidy |
| mteb/MultiLongDocRetrieval | | dataset card | https://huggingface.co/datasets/mteb/MultiLongDocRetrieval |
Metadata Summary
| Field | Value |
| Task pages | 9 |
| Queries | 1,800 |
| Split-local documents | 48,356 |
| Positive qrels | 2,077 |
| Languages | multilingual, th |
| Categories | natural_language |
| Positives / query avg | 1.15 |
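The group totals above follow from the per-task counts in the Task Summary table, which the short check below recomputes. The dictionary layout is an illustrative choice.

```python
# (queries, documents, positive qrels) per task, copied from the Task Summary table.
COUNTS = {
    "belebele_eng_latn_tha_thai": (200, 488, 200),
    "belebele_tha_thai_eng_latn": (200, 488, 200),
    "belebele_tha_thai_tha_thai": (200, 488, 200),
    "miracl_th": (200, 10_000, 343),
    "mkqa_th": (200, 6_652, 300),
    "mr_tidy_thai": (200, 10_000, 234),
    "multi_long_doc_th": (200, 10_000, 200),
    "web_faq_tha": (200, 10_000, 200),
    "xqu_ad_th": (200, 240, 200),
}

# Sum each column across tasks.
queries, documents, positives = (sum(column) for column in zip(*COUNTS.values()))
positives_per_query = positives / queries

print(f"tasks={len(COUNTS)} queries={queries} documents={documents} positives={positives}")
print(f"positives per query: {positives_per_query:.2f}")
```

The document total counts each task's pool separately, so a document that appears in more than one task pool is counted once per task.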
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| belebele_eng_latn_tha_thai | NanoMTEB-Thai | multilingual | natural_language | 200 | 488 | 200 | 0.0891 | 0.8483 | 0.2919 | Dense |
| belebele_tha_thai_eng_latn | NanoMTEB-Thai | multilingual | natural_language | 200 | 488 | 200 | 0.0944 | 0.8046 | 0.2741 | Dense |
| belebele_tha_thai_tha_thai | NanoMTEB-Thai | th | natural_language | 200 | 488 | 200 | 0.9297 | 0.9287 | 0.9615 | Reranking hybrid |
| miracl_th | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 343 | 0.5999 | 0.8076 | 0.7250 | Dense |
| mkqa_th | NanoMTEB-Thai | multilingual | natural_language | 200 | 6,652 | 300 | 0.0182 | 0.0359 | 0.0272 | Dense |
| mr_tidy_thai | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 234 | 0.8502 | 0.9147 | 0.8914 | Dense |
| multi_long_doc_th | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 200 | 0.3684 | 0.2125 | 0.3672 | BM25 |
| web_faq_tha | NanoMTEB-Thai | th | natural_language | 200 | 10,000 | 200 | 0.7607 | 0.7822 | 0.7866 | Reranking hybrid |
| xqu_ad_th | NanoMTEB-Thai | th | natural_language | 200 | 240 | 200 | 0.9835 | 0.9459 | 0.9674 | BM25 |