HAKARI-Bench

NanoMTEB-Polish

Overview

NanoMTEB-Polish is a Polish retrieval group dominated by translated community-question duplicate retrieval. Ten of its fourteen tasks are Polish CQADupStack domains, covering Android, English usage, GIS, Mathematica, Physics, Programmers, Statistics, TeX, Webmasters, and WordPress. The remaining tasks cover Polish financial QA retrieval, Natural Questions-style fact retrieval, PUGG Polish Wikipedia QA retrieval, and Quora duplicate-question retrieval.

The group contains 2,800 queries, 140,000 task-local documents, and 8,151 positive qrel rows. Every task has 200 queries and a 10,000-document candidate pool, so scores can be compared across tasks without correcting for corpus size. The group tests whether a model can retrieve Polish paraphrases and duplicates across technical domains while also handling finance, Wikipedia QA, and short duplicate-question retrieval.

What This Group Measures

Most tasks measure duplicate intent rather than simple topical relatedness. In the CQADupStack and Quora tasks, the model must retrieve another question that asks the same thing, often with different wording, examples, software versions, or technical details. This is harder than matching the same domain: two LaTeX, WordPress, or Mathematica posts can share many tokens while solving different problems.

The non-duplicate tasks broaden the group. fiqa retrieves finance answers, nq retrieves answer-bearing passages for Polish fact questions, and pugg retrieves Polish Wikipedia-style passages. These tasks make the group a diagnostic for Polish semantic retrieval, not only translated Stack Exchange duplicate detection.

Task Families

Dataset Shape

All fourteen tasks are Polish (pl) and use equal-sized Nano splits: 200 queries and 10,000 candidate documents per task. Because corpus size is fixed, the tasks differ mainly in positive density. pugg has exactly one positive per query, while cqadupstack_english averages 6.78 positives per query and several CQADupStack domains contain large duplicate clusters. Across the group, the average is 2.91 positives per query.
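The density figures above follow directly from the positive counts in the Task Summary table. The sketch below recomputes them; the dictionary and function names are illustrative and not part of any benchmark API.

```python
# Positive qrel counts per task, copied from the Task Summary table.
POSITIVES = {
    "cqadupstack_android": 809,
    "cqadupstack_english": 1356,
    "cqadupstack_gis": 313,
    "cqadupstack_mathematica": 506,
    "cqadupstack_physics": 621,
    "cqadupstack_programmers": 634,
    "cqadupstack_stats": 373,
    "cqadupstack_tex": 843,
    "cqadupstack_webmasters": 882,
    "cqadupstack_wordpress": 344,
    "fiqa": 534,
    "nq": 251,
    "pugg": 200,
    "quora": 485,
}
QUERIES_PER_TASK = 200  # every Nano split has 200 queries


def positives_per_query(positives, queries=QUERIES_PER_TASK):
    """Average number of positive documents per query for one task."""
    return positives / queries


total_positives = sum(POSITIVES.values())
total_queries = QUERIES_PER_TASK * len(POSITIVES)
group_average = total_positives / total_queries

print(total_positives, total_queries, round(group_average, 2))  # 8151 2800 2.91
```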

The text style is heterogeneous. CQADupStack documents often contain translated technical posts, product names, code-like tokens, formulas, and Stack Exchange formatting. Quora documents are short questions. FiQA answers are explanatory finance passages. PUGG and NQ are closer to factoid QA retrieval. This makes the group sensitive to both Polish language modeling and preservation of technical surface forms.

Retrieval Behavior

BM25 Profile

BM25 does not achieve the best nDCG@10 on any of the fourteen tasks in the current Nano data, but it remains an important baseline. It is strongest on quora (0.7704 nDCG@10) and pugg (0.6390), where short questions or factoid prompts often contain distinctive entities and overlapping terms. BM25 is also competitive in some CQADupStack domains when duplicates share product names, commands, packages, or terminology; on cqadupstack_wordpress it scores above dense retrieval (0.3139 versus 0.2951).

Its weakness is duplicate-intent matching. Many CQADupStack positives express the same underlying problem with different Polish wording or different examples. The hardest BM25 tasks are cqadupstack_mathematica, fiqa, cqadupstack_gis, and cqadupstack_webmasters, all below 0.25 nDCG@10. These results show that exact word frequency alone is not enough for Polish technical duplicate retrieval.
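The lexical limitation described above can be seen in a minimal BM25 sketch. The whitespace tokenizer, the parameter values k1=1.5 and b=0.75, the smoothed IDF variant, and the Polish example strings are all assumptions chosen for illustration; the source does not state how its BM25 profile is configured.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25.

    Assumes a non-empty document list and naive whitespace tokenization.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # terms absent from the document contribute nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Made-up example: the second document has the same intent as the query
# but shares no surface tokens with it, so BM25 gives it a score of zero.
docs = [
    "jak zainstalować pakiet w latex",
    "dodanie biblioteki do dokumentu tex",
]
scores = bm25_scores("jak zainstalować pakiet latex", docs)
```

A paraphrased duplicate with no shared tokens is invisible to this scorer, which is the failure mode the hardest BM25 tasks expose.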

Dense Profile

Dense retrieval with harrier-oss-270m is the best profile for eight tasks: cqadupstack_android, cqadupstack_english, cqadupstack_physics, cqadupstack_stats, fiqa, nq, pugg, and quora. The largest gains appear on tasks where relevance depends on paraphrase or answerability. nq rises from 0.3026 BM25 nDCG@10 to 0.6154 dense nDCG@10, and fiqa rises from 0.2353 to 0.3890. Quora also benefits from dense paraphrase matching, reaching 0.9073.
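For readers who want to reproduce figures like these, the following is a minimal sketch of nDCG@10. It assumes binary relevance (a document is either a positive in the qrels or not) and a ranked list without repeated document IDs; the benchmark's own evaluation code may differ in details such as graded gains.

```python
import math


def ndcg_at_k(ranked_ids, positive_ids, k=10):
    """nDCG@k with binary gains: 1 for a positive document, 0 otherwise."""
    positives = set(positive_ids)
    # Discounted gain of the positives actually retrieved in the top k.
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ranking: all positives (up to k) placed at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query with one positive that is retrieved at rank 2.
score = ndcg_at_k(["d3", "d1", "d7"], {"d1"})
```

With a single positive, the score depends only on its rank, which is why one-positive tasks such as pugg reward placing the right passage first.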

Dense is not uniformly best across the technical duplicate tasks. Some domains with specialized terminology, code-like names, or narrow technical phrasing are better handled by the reranking hybrid profile. Still, the query-weighted dense nDCG@10 of 0.4271 is the highest group-level nDCG@10 of the three profiles, which makes NanoMTEB-Polish a strong test of Polish embedding similarity.
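The group-level figure can be recomputed from the per-task scores in the Task Summary table. The sketch below assumes that "query-weighted" means each task is weighted by its query count; since every task has 200 queries, this reduces to a plain mean. Function and variable names are illustrative.

```python
QUERIES_PER_TASK = 200

# Per-task nDCG@10 in Task Summary order (android ... quora).
DENSE = [0.4139, 0.3926, 0.2861, 0.2171, 0.4306, 0.3275, 0.3375,
         0.2805, 0.3045, 0.2951, 0.3890, 0.6154, 0.7817, 0.9073]
HYBRID = [0.4121, 0.3725, 0.3143, 0.2411, 0.4024, 0.3607, 0.3314,
          0.3147, 0.3162, 0.3289, 0.3574, 0.4363, 0.7146, 0.8207]


def query_weighted_mean(scores, query_counts):
    """Average per-task scores, weighting each task by its number of queries."""
    total_queries = sum(query_counts)
    weighted = sum(score * n for score, n in zip(scores, query_counts))
    return weighted / total_queries


counts = [QUERIES_PER_TASK] * len(DENSE)
dense_group = query_weighted_mean(DENSE, counts)
hybrid_group = query_weighted_mean(HYBRID, counts)

print(round(dense_group, 4))   # 0.4271
print(round(hybrid_group, 4))  # 0.4088
```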

Reranking Hybrid Profile

The reranking hybrid profile is best for six tasks: cqadupstack_gis, cqadupstack_mathematica, cqadupstack_programmers, cqadupstack_tex, cqadupstack_webmasters, and cqadupstack_wordpress. These are mostly technical CQADupStack domains where exact technical strings and semantic duplicate intent both matter. Hybrid retrieval can recover candidates that dense misses because they share command names, package names, programming terms, or product-specific vocabulary.
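The source does not describe how the reranking hybrid profile combines its signals. As one common way to merge a lexical and a dense ranking, the sketch below uses reciprocal rank fusion (RRF) with the conventional constant k=60; the technique, the constant, and the document IDs are assumptions for illustration, not the benchmark's actual method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 lists for one TeX query.
bm25_ranking = ["tex_12", "tex_07", "tex_31"]
dense_ranking = ["tex_07", "tex_44", "tex_12"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['tex_07', 'tex_12', 'tex_44', 'tex_31']
```

Documents that only one retriever finds (here tex_31 and tex_44) still enter the fused list, which is how a hybrid pipeline can surface candidates that either signal alone would miss.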

At group level, hybrid has lower nDCG@10 than dense (0.4088 versus 0.4271), but it has the best recall@100 at 0.6087. This suggests that hybrid search is useful for candidate generation in Polish technical duplicate retrieval, even when a dense model gives a better final top-10 ordering for the broader group.
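Recall@100 measures how many of a query's positives appear anywhere in the top 100 candidates, regardless of their order. The following is a minimal sketch, assuming binary positives and a denominator equal to the full positive count; the toy IDs and the pooling step are illustrative only.

```python
def recall_at_k(ranked_ids, positive_ids, k=100):
    """Fraction of a query's positives that appear in the top-k results."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    hits = len(positives.intersection(ranked_ids[:k]))
    return hits / len(positives)


# Toy example: each retriever finds a different positive.
positives = {"b", "c"}
bm25_top = ["a", "b"]
dense_top = ["c", "a"]
# Pool both candidate lists, keeping first-seen order and dropping repeats.
pooled = list(dict.fromkeys(bm25_top + dense_top))

bm25_recall = recall_at_k(bm25_top, positives)
dense_recall = recall_at_k(dense_top, positives)
pooled_recall = recall_at_k(pooled, positives)
```

In this toy case each retriever alone recovers half of the positives and the pooled list recovers both, which mirrors why a hybrid candidate stage can lead on recall even when its top-10 ordering is weaker.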

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cqadupstack_android | Technical duplicate retrieval | pl | 200 | 10,000 | 809 | 4.04 | 0.3379 | 0.4139 | 0.4121 | Dense |
| cqadupstack_english | Technical duplicate retrieval | pl | 200 | 10,000 | 1,356 | 6.78 | 0.3188 | 0.3926 | 0.3725 | Dense |
| cqadupstack_gis | Technical duplicate retrieval | pl | 200 | 10,000 | 313 | 1.56 | 0.2423 | 0.2861 | 0.3143 | Reranking hybrid |
| cqadupstack_mathematica | Technical duplicate retrieval | pl | 200 | 10,000 | 506 | 2.53 | 0.2129 | 0.2171 | 0.2411 | Reranking hybrid |
| cqadupstack_physics | Technical duplicate retrieval | pl | 200 | 10,000 | 621 | 3.10 | 0.3359 | 0.4306 | 0.4024 | Dense |
| cqadupstack_programmers | Technical duplicate retrieval | pl | 200 | 10,000 | 634 | 3.17 | 0.3191 | 0.3275 | 0.3607 | Reranking hybrid |
| cqadupstack_stats | Technical duplicate retrieval | pl | 200 | 10,000 | 373 | 1.86 | 0.2662 | 0.3375 | 0.3314 | Dense |
| cqadupstack_tex | Technical duplicate retrieval | pl | 200 | 10,000 | 843 | 4.21 | 0.2555 | 0.2805 | 0.3147 | Reranking hybrid |
| cqadupstack_webmasters | Technical duplicate retrieval | pl | 200 | 10,000 | 882 | 4.41 | 0.2440 | 0.3045 | 0.3162 | Reranking hybrid |
| cqadupstack_wordpress | Technical duplicate retrieval | pl | 200 | 10,000 | 344 | 1.72 | 0.3139 | 0.2951 | 0.3289 | Reranking hybrid |
| fiqa | Financial QA retrieval | pl | 200 | 10,000 | 534 | 2.67 | 0.2353 | 0.3890 | 0.3574 | Dense |
| nq | Open-domain fact retrieval | pl | 200 | 10,000 | 251 | 1.25 | 0.3026 | 0.6154 | 0.4363 | Dense |
| pugg | Native Polish QA retrieval | pl | 200 | 10,000 | 200 | 1.00 | 0.6390 | 0.7817 | 0.7146 | Dense |
| quora | Short duplicate-question retrieval | pl | 200 | 10,000 | 485 | 2.42 | 0.7704 | 0.9073 | 0.8207 | Dense |
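The "Best profile" column is the profile with the highest nDCG@10 for each task. The sketch below recomputes it from the three score columns and tallies the wins; the data layout and names are illustrative.

```python
from collections import Counter

# task: (BM25, Dense, Reranking hybrid) nDCG@10, copied from the table above.
NDCG_AT_10 = {
    "cqadupstack_android": (0.3379, 0.4139, 0.4121),
    "cqadupstack_english": (0.3188, 0.3926, 0.3725),
    "cqadupstack_gis": (0.2423, 0.2861, 0.3143),
    "cqadupstack_mathematica": (0.2129, 0.2171, 0.2411),
    "cqadupstack_physics": (0.3359, 0.4306, 0.4024),
    "cqadupstack_programmers": (0.3191, 0.3275, 0.3607),
    "cqadupstack_stats": (0.2662, 0.3375, 0.3314),
    "cqadupstack_tex": (0.2555, 0.2805, 0.3147),
    "cqadupstack_webmasters": (0.2440, 0.3045, 0.3162),
    "cqadupstack_wordpress": (0.3139, 0.2951, 0.3289),
    "fiqa": (0.2353, 0.3890, 0.3574),
    "nq": (0.3026, 0.6154, 0.4363),
    "pugg": (0.6390, 0.7817, 0.7146),
    "quora": (0.7704, 0.9073, 0.8207),
}
PROFILES = ("BM25", "Dense", "Reranking hybrid")


def best_profile(scores):
    """Return the name of the profile with the highest score for one task."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return PROFILES[best_index]


best_by_task = {task: best_profile(s) for task, s in NDCG_AT_10.items()}
wins = Counter(best_by_task.values())
print(wins["Dense"], wins["Reranking hybrid"], wins["BM25"])  # 8 6 0
```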

Interpretation Notes for Model Researchers

NanoMTEB-Polish is best interpreted by separating technical duplicate retrieval from QA retrieval. Dense retrieval leads the group overall and is especially important for Quora, NQ, PUGG, and FiQA. Hybrid retrieval is more valuable in technical CQADupStack domains where exact software or mathematical terms should not be lost. BM25 is a useful sanity baseline, but it does not win any task in this slice.

The group is also sensitive to Polish translation quality and domain terms. Strong scores may reflect the ability to preserve technical tokens and code-like strings, not only general Polish semantic understanding. For model comparison, inspect the CQADupStack block separately from NQ/PUGG/FiQA/Quora before making claims about Polish retrieval quality.
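A minimal sketch of the recommended split follows: it averages each profile separately over the ten CQADupStack tasks and over the remaining four, using the nDCG@10 values from the Task Summary table. The block names and helper function are illustrative.

```python
from statistics import mean

# task: (BM25, Dense, Reranking hybrid) nDCG@10 from the Task Summary table.
SCORES = {
    "cqadupstack_android": (0.3379, 0.4139, 0.4121),
    "cqadupstack_english": (0.3188, 0.3926, 0.3725),
    "cqadupstack_gis": (0.2423, 0.2861, 0.3143),
    "cqadupstack_mathematica": (0.2129, 0.2171, 0.2411),
    "cqadupstack_physics": (0.3359, 0.4306, 0.4024),
    "cqadupstack_programmers": (0.3191, 0.3275, 0.3607),
    "cqadupstack_stats": (0.2662, 0.3375, 0.3314),
    "cqadupstack_tex": (0.2555, 0.2805, 0.3147),
    "cqadupstack_webmasters": (0.2440, 0.3045, 0.3162),
    "cqadupstack_wordpress": (0.3139, 0.2951, 0.3289),
    "fiqa": (0.2353, 0.3890, 0.3574),
    "nq": (0.3026, 0.6154, 0.4363),
    "pugg": (0.6390, 0.7817, 0.7146),
    "quora": (0.7704, 0.9073, 0.8207),
}
PROFILES = ("bm25", "dense", "hybrid")


def block_means(scores):
    """Mean nDCG@10 per profile, split into CQADupStack and all other tasks."""
    blocks = {"cqadupstack": [], "other": []}
    for task, values in scores.items():
        key = "cqadupstack" if task.startswith("cqadupstack_") else "other"
        blocks[key].append(values)
    # zip(*rows) turns the per-task rows into one column per profile.
    return {
        name: {p: mean(col) for p, col in zip(PROFILES, zip(*rows))}
        for name, rows in blocks.items()
    }


means = block_means(SCORES)
for block, by_profile in means.items():
    print(block, {p: round(v, 4) for p, v in by_profile.items()})
```

On the table's numbers, the reranking hybrid profile has the higher mean on the CQADupStack block and dense has the higher mean on the other four tasks, so a single group-level score hides a reversal between the two blocks.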

Training and Leakage Notes

Useful training data includes non-overlapping Polish duplicate-question pairs, translated Stack Exchange duplicates, native Polish paraphrase data, Polish technical QA, Polish Wikipedia QA retrieval, FiQA-style finance QA, and PUGG training records. For CQADupStack, hard negatives should come from the same technical site and share product names, function names, formulas, packages, or domain terminology while asking a different question.
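The hard-negative guidance can be sketched as a simple token-overlap miner restricted to one site. Everything here is hypothetical: the function name, the data layout, the whitespace tokenizer, and the Polish example posts. A real pipeline would also need to check that mined candidates are not unlabeled duplicates.

```python
def mine_hard_negatives(query, positive_ids, site_docs, top_n=2):
    """Pick same-site documents that share tokens with the query but are not positives.

    site_docs maps doc_id -> text and is assumed to hold posts from a single
    technical site, so every candidate already matches the query's domain.
    """
    query_tokens = set(query.lower().split())
    candidates = []
    for doc_id, text in site_docs.items():
        if doc_id in positive_ids:
            continue  # never use a labeled duplicate as a negative
        overlap = len(query_tokens & set(text.lower().split()))
        if overlap > 0:
            candidates.append((overlap, doc_id))
    # Highest overlap first; ties broken by doc_id for determinism.
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in candidates[:top_n]]


# Made-up TeX posts: t1 is the labeled duplicate, t2 and t3 share vocabulary
# with the query but ask different questions, t4 shares nothing.
site_docs = {
    "t1": "zmiana marginesu w pakiecie geometry",
    "t2": "jak załadować pakiet geometry w beamer",
    "t3": "jak zmienić czcionkę w tabeli",
    "t4": "instalacja texlive na linux",
}
negatives = mine_hard_negatives(
    "jak zmienić margines w pakiecie geometry", {"t1"}, site_docs
)
print(negatives)  # ['t2', 't3']
```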

Leakage control is important because duplicate-question datasets are highly clustered. Training should exclude Nano evaluation queries, qrels, positive documents, and overlapping upstream test records from CQADupStack-PL, Quora-PL, FiQA-PL, NQ-PL, and PUGG. Synthetic examples should preserve Polish wording, technical tokens, code snippets, mathematical notation, financial terms, and named entities.
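One simple form of leakage control is to drop any training pair whose query or positive text matches an evaluation text after normalization. The sketch below assumes a hypothetical record layout with "query" and "positive" fields and only catches exact matches after case and whitespace normalization; near-duplicates in clustered datasets would need fuzzy or ID-based matching on top.

```python
import unicodedata


def normalize(text):
    """Unicode-normalize, casefold, and collapse whitespace for comparison."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def filter_leaks(train_records, eval_texts):
    """Keep only training records that do not reuse any evaluation text."""
    blocked = {normalize(text) for text in eval_texts}
    kept = []
    for record in train_records:
        if normalize(record["query"]) in blocked:
            continue
        if normalize(record["positive"]) in blocked:
            continue
        kept.append(record)
    return kept


# Hypothetical data: the first training query is an evaluation query
# written with different casing and spacing.
eval_texts = ["jak zainstalować wordpress?"]
train_records = [
    {"query": "Jak  zainstalować WordPress?", "positive": "Pobierz paczkę instalacyjną."},
    {"query": "Jak zmienić motyw?", "positive": "Otwórz panel administracyjny."},
]
clean = filter_leaks(train_records, eval_texts)
```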

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| CQADupStack: A Benchmark Data Set for Community Question-Answering Research | 2015 | paper | https://ir.webis.de/anthology/2015.adcs_conference-2015.3/ |
| BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language | 2024 | paper | https://aclanthology.org/2024.lrec-main.194/ |
| Developing PUGG for Polish: A Modern Approach to KBQA, MRC, and IR Dataset Construction | 2024 | paper | https://aclanthology.org/2024.findings-acl.652/ |
| FiQA challenge site | | project page | https://sites.google.com/view/fiqa/ |
| Natural Questions | | project page | https://ai.google.com/research/NaturalQuestions/ |
| First Quora Dataset Release: Question Pairs | | dataset page | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs |
| Massive Text Embedding Benchmark (MTEB) | | benchmark repository | https://github.com/embeddings-benchmark/mteb |
| mteb/FiQA-PL | | dataset card | https://huggingface.co/datasets/mteb/FiQA-PL |
| mteb/NQ-PLHardNegatives | | dataset card | https://huggingface.co/datasets/mteb/NQ-PLHardNegatives |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 14 |
| Queries | 2,800 |
| Split-local documents | 140,000 |
| Positive qrels | 8,151 |
| Languages | pl |
| Categories | natural_language |
| Positives / query avg | 2.91 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| cqadupstack_android | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 809 | 0.3379 | 0.4139 | 0.4121 | Dense |
| cqadupstack_english | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 1,356 | 0.3188 | 0.3926 | 0.3725 | Dense |
| cqadupstack_gis | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 313 | 0.2423 | 0.2861 | 0.3143 | Reranking hybrid |
| cqadupstack_mathematica | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 506 | 0.2129 | 0.2171 | 0.2411 | Reranking hybrid |
| cqadupstack_physics | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 621 | 0.3359 | 0.4306 | 0.4024 | Dense |
| cqadupstack_programmers | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 634 | 0.3191 | 0.3275 | 0.3607 | Reranking hybrid |
| cqadupstack_stats | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 373 | 0.2662 | 0.3375 | 0.3314 | Dense |
| cqadupstack_tex | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 843 | 0.2555 | 0.2805 | 0.3147 | Reranking hybrid |
| cqadupstack_webmasters | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 882 | 0.2440 | 0.3045 | 0.3162 | Reranking hybrid |
| cqadupstack_wordpress | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 344 | 0.3139 | 0.2951 | 0.3289 | Reranking hybrid |
| fiqa | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 534 | 0.2353 | 0.3890 | 0.3574 | Dense |
| nq | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 251 | 0.3026 | 0.6154 | 0.4363 | Dense |
| pugg | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 200 | 0.6390 | 0.7817 | 0.7146 | Dense |
| quora | NanoMTEB-Polish | pl | natural_language | 200 | 10,000 | 485 | 0.7704 | 0.9073 | 0.8207 | Dense |