HAKARI-Bench

NanoMTEB-v2

Overview

NanoMTEB-v2 is the English retrieval group derived from MTEB/BEIR-style retrieval tasks. It combines ten compact splits covering counterargument retrieval, claim evidence retrieval, StackExchange duplicate-question retrieval, financial QA, multi-hop Wikipedia QA, scientific paper relatedness, controversial-question argument retrieval, and biomedical literature search. The group is useful because it is not a single English passage-retrieval task: the relevant item may be a counterargument, evidence passage, duplicate question, answer passage, related paper, argument, or literature record.

The group contains 1,698 queries, 98,626 task-local documents, and 10,158 positive qrel rows. Most tasks are multi-positive, but the number of positives varies sharply. argu_ana is single-positive, while treccovid has 4,584 positives for only 50 queries. Aggregate scores therefore mix exact target retrieval, many-relevant-document ranking, and relation types that are not simple semantic similarity.

What This Group Measures

The benchmark measures whether an English retrieval model can preserve source task semantics across heterogeneous BEIR-style tasks. argu_ana retrieves an opposing argument, not a supporting or near-duplicate passage. climate_fever and fever retrieve Wikipedia evidence for claims. cqadupstack_gaming and cqadupstack_unix retrieve duplicate community questions. fi_qa2018 retrieves finance answers, hotpot_qa retrieves supporting Wikipedia passages, scidocs retrieves related scientific papers, touche2020_v3 retrieves arguments for controversial questions, and treccovid retrieves COVID-19 literature records for broad information needs.

This group is therefore an English heterogeneity check. It can reveal whether a model is strong because it matches entities and terms, because it understands paraphrase and answerability, because it can model scientific relatedness, or because it retrieves broad biomedical evidence sets.

Task Families

Dataset Shape

The group has ten task pages. Most splits have 200 queries; argu_ana has 199, touche2020_v3 has 49, and treccovid has 50. Candidate pools are usually 10,000 documents, with argu_ana using 8,626 documents. The document count is a sum over task-local pools rather than a deduplicated shared English corpus.

Positive density is central to interpretation. argu_ana has exactly one positive per query. hotpot_qa has two positives per query. touche2020_v3 averages 34.78 positives per query, and treccovid averages 91.68. These broad relevance sets make recall and early ranking behavior important in ways that differ from single-positive duplicate or evidence retrieval.
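The density figures above follow directly from the query and positive counts in the task summary table. A minimal sketch, with the counts copied from that table and the names `TASK_COUNTS` and `positives_per_query` chosen here purely for illustration:

```python
# (queries, positive qrel rows) per task, copied from the task summary table.
TASK_COUNTS = {
    "argu_ana": (199, 199),
    "climate_fever": (200, 621),
    "cqadupstack_gaming": (200, 415),
    "cqadupstack_unix": (200, 486),
    "fever": (200, 229),
    "fi_qa2018": (200, 534),
    "hotpot_qa": (200, 400),
    "scidocs": (200, 986),
    "touche2020_v3": (49, 1704),
    "treccovid": (50, 4584),
}


def positives_per_query(counts):
    """Return the mean number of positives per query for each task."""
    return {task: pos / queries for task, (queries, pos) in counts.items()}


density = positives_per_query(TASK_COUNTS)
total_queries = sum(queries for queries, _ in TASK_COUNTS.values())
total_positives = sum(pos for _, pos in TASK_COUNTS.values())
group_density = total_positives / total_queries

# Print tasks from sparsest to densest relevance sets, then the group average.
for task, value in sorted(density.items(), key=lambda item: item[1]):
    print(f"{task:20s} {value:6.2f}")
print(f"{'group':20s} {group_density:6.2f}")
```

Sorting by density makes the gap visible: eight tasks sit below five positives per query, while touche2020_v3 and treccovid are an order of magnitude denser and pull the group average up to about 5.98.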

Retrieval Behavior

BM25 Profile

BM25 is not the best profile on any of the ten tasks in the current Nano data, but it remains strong where exact entities, claims, or technical terms dominate. hotpot_qa (0.8950) and fever (0.8893) are both near 0.89 nDCG@10 with BM25, and touche2020_v3 reaches 0.8424. Queries in these tasks often contain names, dates, entities, or argument terms that also appear in the relevant documents.

BM25 is weakest on climate_fever, scidocs, and treccovid. Climate evidence can sit under broader Wikipedia topics that do not repeat the claim wording. Scientific relatedness can be citation- or topic-based rather than title-token based. TREC-COVID has many relevant documents per query, and exact term overlap does not reliably rank the best judged literature records early. The query-weighted BM25 nDCG@10 is 0.4827.
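As a rough check on the aggregate, the sketch below assumes that "query-weighted" means a mean over tasks weighted by each task's query count; that reading is an assumption, not something the page states. The inputs are the rounded BM25 values from the task summary table, and the function names are illustrative.

```python
# (queries, BM25 nDCG@10) per task, copied from the task summary table.
BM25_NDCG10 = {
    "argu_ana": (199, 0.3464),
    "climate_fever": (200, 0.1719),
    "cqadupstack_gaming": (200, 0.5073),
    "cqadupstack_unix": (200, 0.4001),
    "fever": (200, 0.8893),
    "fi_qa2018": (200, 0.3799),
    "hotpot_qa": (200, 0.8950),
    "scidocs": (200, 0.2067),
    "touche2020_v3": (49, 0.8424),
    "treccovid": (50, 0.3893),
}


def query_weighted_mean(rows):
    """Average task scores, weighting each task by its number of queries."""
    total_queries = sum(queries for queries, _ in rows.values())
    return sum(queries * score for queries, score in rows.values()) / total_queries


def macro_mean(rows):
    """Average task scores with every task counted equally."""
    return sum(score for _, score in rows.values()) / len(rows)


weighted = query_weighted_mean(BM25_NDCG10)
macro = macro_mean(BM25_NDCG10)
print(f"query-weighted: {weighted:.4f}  macro: {macro:.4f}")
```

Under that assumption the weighted mean agrees with the reported 0.4827 to within the rounding of the table inputs. The unweighted macro mean comes out higher, at about 0.503, because touche2020_v3 contributes a high score from only 49 queries.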

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest query-weighted profile at 0.5751 nDCG@10. It is best for seven tasks: argu_ana, climate_fever, cqadupstack_gaming, cqadupstack_unix, fever, fi_qa2018, and scidocs. This pattern is meaningful. Dense retrieval helps when the relevance relation depends on paraphrase, evidence semantics, duplicate intent, finance answerability, or scientific relatedness rather than pure term frequency.

Dense is not best for hotpot_qa, touche2020_v3, or treccovid, where the reranking hybrid profile performs better; on hotpot_qa it also trails BM25 slightly (0.8904 vs. 0.8950). Its lead over BM25 on argu_ana is modest (0.4092 vs. 0.3464), and both scores are low, indicating that counterargument retrieval remains difficult for sparse and dense methods alike. Overall, dense retrieval gives the best single-profile view of this English heterogeneous group.
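The per-task winners can be recomputed from the nDCG@10 columns of the task summary table. The sketch below is illustrative only: the scores are copied from that table, and the names `SCORES` and `best_profile` are introduced here.

```python
from collections import Counter

# nDCG@10 per profile, copied from the task summary table.
SCORES = {
    "argu_ana": {"BM25": 0.3464, "Dense": 0.4092, "Reranking hybrid": 0.3775},
    "climate_fever": {"BM25": 0.1719, "Dense": 0.3276, "Reranking hybrid": 0.2794},
    "cqadupstack_gaming": {"BM25": 0.5073, "Dense": 0.6375, "Reranking hybrid": 0.5970},
    "cqadupstack_unix": {"BM25": 0.4001, "Dense": 0.5095, "Reranking hybrid": 0.4658},
    "fever": {"BM25": 0.8893, "Dense": 0.9652, "Reranking hybrid": 0.9450},
    "fi_qa2018": {"BM25": 0.3799, "Dense": 0.5494, "Reranking hybrid": 0.5258},
    "hotpot_qa": {"BM25": 0.8950, "Dense": 0.8904, "Reranking hybrid": 0.9156},
    "scidocs": {"BM25": 0.2067, "Dense": 0.2757, "Reranking hybrid": 0.2565},
    "touche2020_v3": {"BM25": 0.8424, "Dense": 0.8810, "Reranking hybrid": 0.8835},
    "treccovid": {"BM25": 0.3893, "Dense": 0.4177, "Reranking hybrid": 0.4521},
}


def best_profile(profile_scores):
    """Return the name of the profile with the highest score for one task."""
    return max(profile_scores, key=profile_scores.get)


winners = {task: best_profile(scores) for task, scores in SCORES.items()}
tally = Counter(winners.values())

# Margin of dense over BM25; a negative value means BM25 is ahead of dense.
dense_minus_bm25 = {task: s["Dense"] - s["BM25"] for task, s in SCORES.items()}

print(dict(tally))
print(min(dense_minus_bm25, key=dense_minus_bm25.get))
```

The tally gives seven dense wins and three hybrid wins. hotpot_qa is the only task where the dense margin over BM25 is negative.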

Reranking Hybrid Profile

The reranking hybrid profile is best for hotpot_qa, touche2020_v3, and treccovid. These are tasks where sparse and dense signals are complementary: multi-hop QA needs entity anchors and semantic support, controversial-question argument retrieval benefits from both topic terms and argument meaning, and COVID literature search needs biomedical terminology plus broader semantic matching.
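The page does not describe how the reranking hybrid profile combines its signals. As a generic illustration of how sparse and dense rankings can complement each other, the sketch below uses reciprocal rank fusion (RRF). This technique is swapped in for illustration and is not claimed to be the benchmark's method; the function name, the constant `k=60`, and the toy document ids are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Toy example: "a" and "c" appear in both lists, so they rise to the top.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents found by both retrievers outrank documents found by only one, which is the complementarity the paragraph above describes for entity anchors plus semantic support.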

Hybrid has the best query-weighted recall@100 at 0.8087, even though its nDCG@10 is below dense. This suggests that hybrid search is a strong candidate generation strategy for English BEIR-style tasks, while final top-10 ranking may still favor a dense profile on duplicate, evidence, finance, and scientific relatedness tasks.
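To make the difference between the two metrics concrete, here is a minimal per-query sketch. It assumes binary relevance (a document is either in the positive set or not) and an uncapped recall denominator; the page does not state which conventions its evaluation uses, so treat both as assumptions.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    # The ideal ranking places min(|positives|, k) positives at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of a query's positives found in the top k results."""
    if not positives:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in positives)
    return hits / len(positives)


ranked = ["d3", "d1", "d7", "d2"]
positives = {"d1", "d2"}
print(round(ndcg_at_k(ranked, positives, k=10), 4))
print(recall_at_k(ranked, positives, k=100))
```

In the toy ranking both positives are retrieved, so recall@100 is 1.0, but they sit at ranks 2 and 4, so nDCG@10 is only about 0.65. A profile can therefore lead on recall@100 while trailing on nDCG@10, as the hybrid profile does here.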

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| argu_ana | Counterargument retrieval | en | 199 | 8,626 | 199 | 1.00 | 0.3464 | 0.4092 | 0.3775 | Dense |
| climate_fever | Claim-evidence retrieval | en | 200 | 10,000 | 621 | 3.10 | 0.1719 | 0.3276 | 0.2794 | Dense |
| cqadupstack_gaming | Duplicate-question retrieval | en | 200 | 10,000 | 415 | 2.08 | 0.5073 | 0.6375 | 0.5970 | Dense |
| cqadupstack_unix | Duplicate-question retrieval | en | 200 | 10,000 | 486 | 2.43 | 0.4001 | 0.5095 | 0.4658 | Dense |
| fever | Claim-evidence retrieval | en | 200 | 10,000 | 229 | 1.15 | 0.8893 | 0.9652 | 0.9450 | Dense |
| fi_qa2018 | Financial QA retrieval | en | 200 | 10,000 | 534 | 2.67 | 0.3799 | 0.5494 | 0.5258 | Dense |
| hotpot_qa | Multi-hop evidence retrieval | en | 200 | 10,000 | 400 | 2.00 | 0.8950 | 0.8904 | 0.9156 | Reranking hybrid |
| scidocs | Scientific related-paper retrieval | en | 200 | 10,000 | 986 | 4.93 | 0.2067 | 0.2757 | 0.2565 | Dense |
| touche2020_v3 | Argument retrieval | en | 49 | 10,000 | 1,704 | 34.78 | 0.8424 | 0.8810 | 0.8835 | Reranking hybrid |
| treccovid | Biomedical literature retrieval | en | 50 | 10,000 | 4,584 | 91.68 | 0.3893 | 0.4177 | 0.4521 | Reranking hybrid |

Interpretation Notes for Model Researchers

NanoMTEB-v2 should not be treated as one plain English retrieval score. The same score gain can mean different things across tasks: improving FEVER may reflect entity-evidence matching, improving CQADupStack may reflect duplicate intent modeling, improving SCIDOCS may reflect scientific representation quality, and improving TREC-COVID may reflect broad biomedical coverage.

Dense retrieval leads most tasks, but hybrid retrieval is important for multi-hop, argument, and biomedical settings. BM25 remains a strong sanity baseline for entity-heavy evidence tasks, yet it does not win any task in this Nano slice. Per-family analysis is required before using the group score to make claims about English retrieval quality.
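A per-family breakdown can be sketched from the task summary table. The example below groups the dense nDCG@10 values by the table's Family column and weights by query count within each family. The weighting choice and the names `DENSE_ROWS` and `family_scores` are assumptions made for illustration.

```python
from collections import defaultdict

# (family, queries, dense nDCG@10) per task, copied from the task summary table.
DENSE_ROWS = {
    "argu_ana": ("Counterargument retrieval", 199, 0.4092),
    "climate_fever": ("Claim-evidence retrieval", 200, 0.3276),
    "cqadupstack_gaming": ("Duplicate-question retrieval", 200, 0.6375),
    "cqadupstack_unix": ("Duplicate-question retrieval", 200, 0.5095),
    "fever": ("Claim-evidence retrieval", 200, 0.9652),
    "fi_qa2018": ("Financial QA retrieval", 200, 0.5494),
    "hotpot_qa": ("Multi-hop evidence retrieval", 200, 0.8904),
    "scidocs": ("Scientific related-paper retrieval", 200, 0.2757),
    "touche2020_v3": ("Argument retrieval", 49, 0.8810),
    "treccovid": ("Biomedical literature retrieval", 50, 0.4177),
}


def family_scores(rows):
    """Query-weighted mean score per task family."""
    weighted_sum = defaultdict(float)
    query_count = defaultdict(int)
    for family, queries, score in rows.values():
        weighted_sum[family] += queries * score
        query_count[family] += queries
    return {family: weighted_sum[family] / query_count[family] for family in weighted_sum}


by_family = family_scores(DENSE_ROWS)
for family, score in sorted(by_family.items(), key=lambda item: item[1]):
    print(f"{family:36s} {score:.4f}")
```

Even a family mean can hide spread: the claim-evidence family averages about 0.646, yet it combines fever at 0.9652 with climate_fever at 0.3276, so per-task numbers are still worth reporting alongside it.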

Training and Leakage Notes

Useful training data should be source-family specific: counterargument pairs, FEVER-style claim-evidence data, StackExchange duplicate questions, finance QA pairs, HotpotQA-style supporting evidence, citation-linked scientific papers, argument retrieval data, and PubMed/TREC-style biomedical search data. For multi-positive tasks, training should preserve multiple positives instead of collapsing the relevance set.
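One way to keep the full relevance set is to group qrel rows by query before building training examples, rather than sampling a single positive per query. The sketch below assumes qrels arrive as `(query_id, doc_id, relevance)` tuples; that row layout and the function name are hypothetical.

```python
from collections import defaultdict


def group_positives(qrel_rows):
    """Map each query id to the full set of its positive document ids."""
    positives = defaultdict(set)
    for query_id, doc_id, relevance in qrel_rows:
        # Keep every judged-relevant document, not just the first one seen.
        if relevance > 0:
            positives[query_id].add(doc_id)
    return dict(positives)


toy_qrels = [
    ("q1", "d1", 1),
    ("q1", "d4", 1),
    ("q2", "d2", 1),
    ("q2", "d9", 0),  # judged non-relevant, so it is not a positive
]
grouped = group_positives(toy_qrels)
print(grouped)
```

With the positives grouped this way, a training loop can treat all of a query's positives as targets, or at least avoid drawing them as in-batch negatives.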

Leakage control should exclude Nano evaluation queries, qrels, positive documents, upstream test examples, and common benchmark package duplicates from ArguAna, CLIMATE-FEVER, CQADupStack, FEVER, FiQA, HotpotQA, SCIDOCS, Touché, and TREC-COVID. Synthetic data should preserve relation type: counterargument, evidence, duplicate question, answer passage, related paper, argument passage, or biomedical relevance record.
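A basic exclusion pass can be sketched as an exact-match filter on normalized text. The example assumes training examples are dicts with `query` and `positive` text fields and that the evaluation queries and positive documents are available as plain strings; these field names and the normalization rule are hypothetical.

```python
import re


def normalize(text):
    """Lowercase and collapse non-alphanumeric runs so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def filter_leakage(train_examples, eval_queries, eval_positive_docs):
    """Drop training examples whose query or positive text appears in the eval set."""
    blocked_queries = {normalize(query) for query in eval_queries}
    blocked_docs = {normalize(doc) for doc in eval_positive_docs}
    kept = []
    for example in train_examples:
        if normalize(example["query"]) in blocked_queries:
            continue
        if normalize(example["positive"]) in blocked_docs:
            continue
        kept.append(example)
    return kept


train = [
    {"query": "Is CO2 a pollutant?", "positive": "doc A"},
    {"query": "How to mount a USB drive", "positive": "Use the mount command."},
    {"query": "unrelated", "positive": "Evidence passage."},
]
kept = filter_leakage(train, ["is co2 a pollutant"], ["Evidence  passage"])
print(kept)
```

Exact matching after normalization only catches verbatim and near-verbatim copies. Paraphrased or truncated duplicates from upstream benchmark packages would need fuzzy or n-gram overlap checks on top of this.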

Source Reference Table

| Source | Year | Type | URL |
|---|---|---|---|
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| Retrieval of the Best Counterargument without Prior Topic Knowledge | 2018 | source task paper | https://aclanthology.org/P18-1023/ |
| CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims | 2020 | source task paper | https://arxiv.org/abs/2012.00614 |
| CQADupStack: A Benchmark Data Set for Community Question-Answering Research | 2015 | source task paper | https://eltimster.github.io/www/pubs/adcs2015.pdf |
| FEVER: a Large-scale Dataset for Fact Extraction and VERification | 2018 | source task paper | https://arxiv.org/abs/1803.05355 |
| Financial Opinion Mining and Question Answering | 2018 | source task paper | https://doi.org/10.1145/3184558.3192301 |
| HotpotQA | 2018 | source task paper | https://arxiv.org/abs/1809.09600 |
| SPECTER | 2020 | source task paper | https://arxiv.org/abs/2004.07180 |
| Overview of Touché 2020: Argument Retrieval | 2020 | source task paper | https://downloads.webis.de/touche/publications/papers/bondarenko_2020d.pdf |
| TREC-COVID | 2020 | source task paper | https://arxiv.org/abs/2005.04474 |

Metadata Summary

| Field | Value |
|---|---|
| Task pages | 10 |
| Queries | 1,698 |
| Split-local documents | 98,626 |
| Positive qrels | 10,158 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 5.98 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| argu_ana | NanoMTEB-v2 | en | natural_language | 199 | 8,626 | 199 | 0.3464 | 0.4092 | 0.3775 | Dense |
| climate_fever | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 621 | 0.1719 | 0.3276 | 0.2794 | Dense |
| cqadupstack_gaming | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 415 | 0.5073 | 0.6375 | 0.5970 | Dense |
| cqadupstack_unix | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 486 | 0.4001 | 0.5095 | 0.4658 | Dense |
| fever | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 229 | 0.8893 | 0.9652 | 0.9450 | Dense |
| fi_qa2018 | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 534 | 0.3799 | 0.5494 | 0.5258 | Dense |
| hotpot_qa | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 400 | 0.8950 | 0.8904 | 0.9156 | Reranking hybrid |
| scidocs | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 986 | 0.2067 | 0.2757 | 0.2565 | Dense |
| touche2020_v3 | NanoMTEB-v2 | en | natural_language | 49 | 10,000 | 1,704 | 0.8424 | 0.8810 | 0.8835 | Reranking hybrid |
| treccovid | NanoMTEB-v2 | en | natural_language | 50 | 10,000 | 4,584 | 0.3893 | 0.4177 | 0.4521 | Reranking hybrid |