NanoMTEB-v2
Overview
NanoMTEB-v2 is the English retrieval group derived from MTEB/BEIR-style retrieval tasks. It combines ten compact splits covering counterargument retrieval, claim-evidence retrieval, StackExchange duplicate-question retrieval, financial QA, multi-hop Wikipedia QA, scientific paper relatedness, controversial-question argument retrieval, and biomedical literature search. The group is useful because it is not a single English passage-retrieval task: the relevant item may be a counterargument, evidence passage, duplicate question, answer passage, related paper, argument, or literature record.
The group contains 1,698 queries, 98,626 task-local documents, and 10,158 positive qrel rows. Most tasks are multi-positive, but the number of positives varies sharply. argu_ana is single-positive, while treccovid has 4,584 positives for only 50 queries. Aggregate scores therefore mix exact target retrieval, many-relevant-document ranking, and relation types that are not simple semantic similarity.
What This Group Measures
The benchmark measures whether an English retrieval model can preserve source task semantics across heterogeneous BEIR-style tasks. argu_ana retrieves an opposing argument, not a supporting or near-duplicate passage. climate_fever and fever retrieve Wikipedia evidence for claims. cqadupstack_gaming and cqadupstack_unix retrieve duplicate community questions. fi_qa2018 retrieves finance answers, hotpot_qa retrieves supporting Wikipedia passages, scidocs retrieves related scientific papers, touche2020_v3 retrieves arguments for controversial questions, and treccovid retrieves COVID-19 literature records for broad information needs.
This group is therefore an English heterogeneity check. It can reveal whether a model is strong because it matches entities and terms, because it understands paraphrase and answerability, because it can model scientific relatedness, or because it retrieves broad biomedical evidence sets.
Task Families
- Argument retrieval: argu_ana retrieves counterarguments and touche2020_v3 retrieves argument passages for controversial questions.
- Claim-evidence retrieval: climate_fever and fever retrieve evidence passages for factual claims.
- Community duplicate retrieval: cqadupstack_gaming and cqadupstack_unix retrieve duplicate StackExchange questions.
- Question-answer retrieval: fi_qa2018 and hotpot_qa retrieve answer or supporting evidence passages.
- Scientific and biomedical retrieval: scidocs retrieves related papers, and treccovid retrieves COVID-19 article records.
Dataset Shape
The group has ten task pages. Most splits have 200 queries; argu_ana has 199, touche2020_v3 has 49, and treccovid has 50. Candidate pools are usually 10,000 documents, with argu_ana using 8,626 documents. The document count is a sum over task-local pools rather than a deduplicated shared English corpus.
Positive density is central to interpretation. argu_ana has exactly one positive per query. hotpot_qa has two positives per query. touche2020_v3 averages 34.78 positives per query, and treccovid averages 91.68. These broad relevance sets make recall and early ranking behavior important in ways that differ from single-positive duplicate or evidence retrieval.
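The densities quoted above follow directly from the per-task counts. As a minimal sketch (the `TASKS` dictionary and the `positives_per_query` helper are illustrative names, not part of any NanoMTEB-v2 tooling), they can be re-derived from the query and positive counts in the Task Summary table:

```python
# (queries, positive qrel rows) per task, copied from the Task Summary table.
TASKS = {
    "argu_ana": (199, 199),
    "climate_fever": (200, 621),
    "cqadupstack_gaming": (200, 415),
    "cqadupstack_unix": (200, 486),
    "fever": (200, 229),
    "fi_qa2018": (200, 534),
    "hotpot_qa": (200, 400),
    "scidocs": (200, 986),
    "touche2020_v3": (49, 1704),
    "treccovid": (50, 4584),
}


def positives_per_query(tasks):
    """Return {task: positives / queries}, rounded to two decimals."""
    return {name: round(pos / q, 2) for name, (q, pos) in tasks.items()}


density = positives_per_query(TASKS)
total_queries = sum(q for q, _ in TASKS.values())
total_positives = sum(p for _, p in TASKS.values())
group_density = round(total_positives / total_queries, 2)
```

With these counts, treccovid comes out at 91.68 positives per query and the group as a whole at 5.98, matching the metadata summary.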
Retrieval Behavior
BM25 Profile
BM25 is not the best profile on any of the ten tasks in the current Nano data, but it remains strong where exact entities, claims, or technical terms dominate. hotpot_qa and fever are both near 0.89 nDCG@10 with BM25, and touche2020_v3 reaches 0.8424. These tasks often expose names, dates, entities, or argument terms that also appear in the relevant documents.
BM25 scores lowest on climate_fever (0.1719) and scidocs (0.2067), and it stays below 0.40 on argu_ana, fi_qa2018, and treccovid. Climate evidence can sit under broader Wikipedia topics that do not repeat the claim wording. Scientific relatedness can be citation- or topic-based rather than title-token based. TREC-COVID has many relevant documents per query, and exact term overlap does not reliably rank the best judged literature records early. The query-weighted BM25 nDCG@10 is 0.4827.
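A query-weighted score weights each task's nDCG@10 by its query count, so the two small tasks (touche2020_v3 and treccovid) contribute less than the 200-query tasks. The sketch below assumes that this is how the aggregate is formed; `ROWS` and `query_weighted_mean` are illustrative names, and the values are copied from the Task Summary table.

```python
# task: (queries, BM25, dense, reranking hybrid) nDCG@10 from the Task Summary table.
ROWS = {
    "argu_ana": (199, 0.3464, 0.4092, 0.3775),
    "climate_fever": (200, 0.1719, 0.3276, 0.2794),
    "cqadupstack_gaming": (200, 0.5073, 0.6375, 0.5970),
    "cqadupstack_unix": (200, 0.4001, 0.5095, 0.4658),
    "fever": (200, 0.8893, 0.9652, 0.9450),
    "fi_qa2018": (200, 0.3799, 0.5494, 0.5258),
    "hotpot_qa": (200, 0.8950, 0.8904, 0.9156),
    "scidocs": (200, 0.2067, 0.2757, 0.2565),
    "touche2020_v3": (49, 0.8424, 0.8810, 0.8835),
    "treccovid": (50, 0.3893, 0.4177, 0.4521),
}


def query_weighted_mean(rows, column):
    """Average one score column, weighting each task by its query count."""
    total_queries = sum(row[0] for row in rows.values())
    weighted_sum = sum(row[0] * row[column] for row in rows.values())
    return weighted_sum / total_queries


bm25_mean = query_weighted_mean(ROWS, 1)
dense_mean = query_weighted_mean(ROWS, 2)
hybrid_mean = query_weighted_mean(ROWS, 3)
```

Under that assumption, the BM25 column reproduces the 0.4827 figure to within rounding. The hybrid mean lands below the dense mean.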
Dense Profile
Dense retrieval with harrier-oss-270m is the strongest query-weighted profile at 0.5751 nDCG@10. It is best for seven tasks: argu_ana, climate_fever, cqadupstack_gaming, cqadupstack_unix, fever, fi_qa2018, and scidocs. The pattern suggests that dense retrieval helps most when the relevance relation depends on paraphrase, evidence semantics, duplicate intent, finance answerability, or scientific relatedness rather than pure term frequency.
Dense is not best for hotpot_qa, touche2020_v3, or treccovid, where the reranking hybrid profile performs better; on hotpot_qa it also trails BM25 slightly (0.8904 vs 0.8950). On argu_ana, dense leads BM25 by about 0.06 nDCG@10 (0.4092 vs 0.3464), but both scores stay low, indicating that counterargument retrieval remains difficult for both sparse and dense methods. Overall, dense retrieval gives the best single-profile view of this English heterogeneous group.
Reranking Hybrid Profile
The reranking hybrid profile is best for hotpot_qa, touche2020_v3, and treccovid. These are tasks where sparse and dense signals are complementary: multi-hop QA needs entity anchors and semantic support, controversial-question argument retrieval benefits from both topic terms and argument meaning, and COVID literature search needs biomedical terminology plus broader semantic matching.
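The source does not describe how the hybrid profile combines sparse and dense signals. Purely as an illustration of how two complementary rankings can be merged, the sketch below uses reciprocal rank fusion (RRF), a common technique swapped in here as an assumption. The function name, the `k=60` constant, and the toy document IDs are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. This is NOT confirmed to be the method behind
    the reranking hybrid profile; it is a generic fusion sketch.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


# Toy example: a sparse (BM25-like) list and a dense list that partly overlap.
sparse_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
```

Documents that both lists rank highly ("a" and "c" in the toy example) rise to the top of the fused list, which is the complementarity the paragraph above describes.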
Hybrid has the best query-weighted recall@100 at 0.8087, even though its nDCG@10 is below dense. This suggests that hybrid search is a strong candidate generation strategy for English BEIR-style tasks, while final top-10 ranking may still favor a dense profile on duplicate, evidence, finance, and scientific relatedness tasks.
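The gap between recall@100 and nDCG@10 is easier to see with the two metrics side by side. The sketch below assumes binary relevance (a document is either a positive or not) and the common log2 discount for nDCG. The source does not state which nDCG variant or relevance grading was used, so the functions are illustrative only.

```python
import math


def recall_at_k(ranked_ids, positives, k):
    """Fraction of all positives that appear in the top-k results."""
    if not positives:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in positives)
    return hits / len(positives)


def ndcg_at_k(ranked_ids, positives, k):
    """Binary-relevance nDCG@k with a log2 rank discount."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, doc_id in enumerate(ranked_ids[:k])
        if doc_id in positives
    )
    # Ideal ranking places every positive (up to k of them) at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / idcg if idcg else 0.0


# Toy query with two positives found at ranks 2 and 4.
ranked = ["d3", "d1", "d9", "d2"]
positives = {"d1", "d2"}
recall = recall_at_k(ranked, positives, k=4)
ndcg = ndcg_at_k(ranked, positives, k=4)
```

In the toy query both positives are retrieved, so recall is perfect, but nDCG is penalized because neither sits at rank 1. The same effect lets a profile lead on recall@100 while trailing on nDCG@10.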
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| argu_ana | Counterargument retrieval | en | 199 | 8,626 | 199 | 1.00 | 0.3464 | 0.4092 | 0.3775 | Dense |
| climate_fever | Claim-evidence retrieval | en | 200 | 10,000 | 621 | 3.10 | 0.1719 | 0.3276 | 0.2794 | Dense |
| cqadupstack_gaming | Duplicate-question retrieval | en | 200 | 10,000 | 415 | 2.08 | 0.5073 | 0.6375 | 0.5970 | Dense |
| cqadupstack_unix | Duplicate-question retrieval | en | 200 | 10,000 | 486 | 2.43 | 0.4001 | 0.5095 | 0.4658 | Dense |
| fever | Claim-evidence retrieval | en | 200 | 10,000 | 229 | 1.15 | 0.8893 | 0.9652 | 0.9450 | Dense |
| fi_qa2018 | Financial QA retrieval | en | 200 | 10,000 | 534 | 2.67 | 0.3799 | 0.5494 | 0.5258 | Dense |
| hotpot_qa | Multi-hop evidence retrieval | en | 200 | 10,000 | 400 | 2.00 | 0.8950 | 0.8904 | 0.9156 | Reranking hybrid |
| scidocs | Scientific related-paper retrieval | en | 200 | 10,000 | 986 | 4.93 | 0.2067 | 0.2757 | 0.2565 | Dense |
| touche2020_v3 | Argument retrieval | en | 49 | 10,000 | 1,704 | 34.78 | 0.8424 | 0.8810 | 0.8835 | Reranking hybrid |
| treccovid | Biomedical literature retrieval | en | 50 | 10,000 | 4,584 | 91.68 | 0.3893 | 0.4177 | 0.4521 | Reranking hybrid |
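The "Best profile" column can be re-derived from the three nDCG@10 columns. A minimal sketch, with `SCORES` as an illustrative name and values copied from the table above:

```python
from collections import Counter

# task: nDCG@10 per profile, copied from the Task Summary table.
SCORES = {
    "argu_ana": {"BM25": 0.3464, "Dense": 0.4092, "Reranking hybrid": 0.3775},
    "climate_fever": {"BM25": 0.1719, "Dense": 0.3276, "Reranking hybrid": 0.2794},
    "cqadupstack_gaming": {"BM25": 0.5073, "Dense": 0.6375, "Reranking hybrid": 0.5970},
    "cqadupstack_unix": {"BM25": 0.4001, "Dense": 0.5095, "Reranking hybrid": 0.4658},
    "fever": {"BM25": 0.8893, "Dense": 0.9652, "Reranking hybrid": 0.9450},
    "fi_qa2018": {"BM25": 0.3799, "Dense": 0.5494, "Reranking hybrid": 0.5258},
    "hotpot_qa": {"BM25": 0.8950, "Dense": 0.8904, "Reranking hybrid": 0.9156},
    "scidocs": {"BM25": 0.2067, "Dense": 0.2757, "Reranking hybrid": 0.2565},
    "touche2020_v3": {"BM25": 0.8424, "Dense": 0.8810, "Reranking hybrid": 0.8835},
    "treccovid": {"BM25": 0.3893, "Dense": 0.4177, "Reranking hybrid": 0.4521},
}

# Pick the highest-scoring profile per task, then count wins per profile.
best = {task: max(profiles, key=profiles.get) for task, profiles in SCORES.items()}
wins = Counter(best.values())
```

The counts come out as seven tasks for Dense, three for Reranking hybrid, and none for BM25.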
Interpretation Notes for Model Researchers
NanoMTEB-v2 should not be treated as one plain English retrieval score. The same model behavior can mean different things across tasks: improving FEVER may reflect entity-evidence matching, improving CQADupStack may reflect duplicate intent modeling, improving SCIDOCS may reflect scientific representation quality, and improving TREC-COVID may reflect broad biomedical coverage.
Dense retrieval leads most tasks, but hybrid retrieval is important for multi-hop, argument, and biomedical settings. BM25 remains a strong sanity baseline for entity-heavy evidence tasks, yet it does not win any task in this Nano slice. Per-family analysis is required before using the group score to make claims about English retrieval quality.
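One way to run a per-family analysis is to average each profile's nDCG@10 within the five families listed under Task Families. The sketch below uses an unweighted mean over the two tasks in each family, which is an assumption made for illustration. `FAMILIES`, `NDCG`, and `family_means` are hypothetical names.

```python
# Family membership as listed in the Task Families section.
FAMILIES = {
    "Argument retrieval": ["argu_ana", "touche2020_v3"],
    "Claim-evidence retrieval": ["climate_fever", "fever"],
    "Community duplicate retrieval": ["cqadupstack_gaming", "cqadupstack_unix"],
    "Question-answer retrieval": ["fi_qa2018", "hotpot_qa"],
    "Scientific and biomedical retrieval": ["scidocs", "treccovid"],
}

PROFILES = ("BM25", "Dense", "Reranking hybrid")

# task: (BM25, dense, reranking hybrid) nDCG@10 from the Task Summary table.
NDCG = {
    "argu_ana": (0.3464, 0.4092, 0.3775),
    "climate_fever": (0.1719, 0.3276, 0.2794),
    "cqadupstack_gaming": (0.5073, 0.6375, 0.5970),
    "cqadupstack_unix": (0.4001, 0.5095, 0.4658),
    "fever": (0.8893, 0.9652, 0.9450),
    "fi_qa2018": (0.3799, 0.5494, 0.5258),
    "hotpot_qa": (0.8950, 0.8904, 0.9156),
    "scidocs": (0.2067, 0.2757, 0.2565),
    "touche2020_v3": (0.8424, 0.8810, 0.8835),
    "treccovid": (0.3893, 0.4177, 0.4521),
}


def family_means(families, ndcg):
    """Unweighted mean nDCG@10 per family and profile."""
    return {
        family: {
            profile: sum(ndcg[task][i] for task in tasks) / len(tasks)
            for i, profile in enumerate(PROFILES)
        }
        for family, tasks in families.items()
    }


means = family_means(FAMILIES, NDCG)
best_by_family = {family: max(m, key=m.get) for family, m in means.items()}
```

On these numbers, dense has the higher family mean for argument, claim-evidence, and duplicate retrieval. The reranking hybrid is marginally ahead for question-answer retrieval and for scientific and biomedical retrieval. A single group score hides this split.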
Training and Leakage Notes
Useful training data should be source-family specific: counterargument pairs, FEVER-style claim-evidence data, StackExchange duplicate questions, finance QA pairs, HotpotQA-style supporting evidence, citation-linked scientific papers, argument retrieval data, and PubMed/TREC-style biomedical search data. For multi-positive tasks, training should preserve multiple positives instead of collapsing the relevance set.
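Preserving multiple positives mostly means grouping qrel rows by query instead of keeping one row per query. A minimal sketch, assuming qrels arrive as `(query_id, doc_id, relevance)` tuples; that row layout and the `group_positives` helper are hypothetical, not a documented NanoMTEB-v2 format.

```python
from collections import defaultdict


def group_positives(qrels):
    """Map each query ID to the list of ALL its positive doc IDs.

    Rows with relevance <= 0 are skipped, and duplicate (query, doc) rows
    are kept only once. Order of first appearance is preserved.
    """
    grouped = defaultdict(list)
    for query_id, doc_id, relevance in qrels:
        if relevance > 0 and doc_id not in grouped[query_id]:
            grouped[query_id].append(doc_id)
    return dict(grouped)


# Toy qrels: q1 has two positives, q2 has one positive and one non-relevant row.
toy_qrels = [
    ("q1", "d1", 1),
    ("q1", "d2", 1),
    ("q2", "d3", 0),
    ("q2", "d4", 2),
    ("q1", "d1", 1),  # duplicate row, ignored
]
positives_by_query = group_positives(toy_qrels)
```

A training loop can then sample from, or jointly score, the full list for each query, so that a treccovid-style query keeps its broad relevance set.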
Leakage control should exclude Nano evaluation queries, qrels, positive documents, upstream test examples, and common benchmark package duplicates from ArguAna, CLIMATE-FEVER, CQADupStack, FEVER, FiQA, HotpotQA, SCIDOCS, Touché, and TREC-COVID. Synthetic data should preserve relation type: counterargument, evidence, duplicate question, answer passage, related paper, argument passage, or biomedical relevance record.
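A first-pass leakage filter can drop any training pair whose query or document text matches an evaluation query or positive document. The sketch below only catches exact matches after lowercasing and whitespace normalization, so near-duplicates and upstream test examples would still need separate handling. The `normalize` and `filter_leaks` helpers and the pair format are assumptions for illustration.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def filter_leaks(train_pairs, eval_queries, eval_positive_docs):
    """Keep only (query, document) training pairs that share no normalized
    text with the evaluation queries or positive documents."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = {normalize(d) for d in eval_positive_docs}
    return [
        (query, doc)
        for query, doc in train_pairs
        if normalize(query) not in blocked_queries
        and normalize(doc) not in blocked_docs
    ]


# Toy data: the first pair reuses an eval query, the second an eval positive.
train = [
    ("How do I  mount a USB drive?", "Use the mount command."),
    ("What is a bond yield?", "Yield is the return on a bond."),
    ("Is nuclear power safe?", "Modern reactors have passive safety systems."),
]
clean = filter_leaks(
    train,
    eval_queries=["how do i mount a usb drive?"],
    eval_positive_docs=["Yield is the return on a bond."],
)
```

Only the third pair survives in the toy example. In practice the same check would be run per source family against the Nano queries, qrels, and positive documents listed above.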
Source Reference Table
| Source | Year | Type | URL |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| Retrieval of the Best Counterargument without Prior Topic Knowledge | 2018 | source task paper | https://aclanthology.org/P18-1023/ |
| CLIMATE-FEVER: A Dataset for Verification of Real-World Climate Claims | 2020 | source task paper | https://arxiv.org/abs/2012.00614 |
| CQADupStack: A Benchmark Data Set for Community Question-Answering Research | 2015 | source task paper | https://eltimster.github.io/www/pubs/adcs2015.pdf |
| FEVER: a Large-scale Dataset for Fact Extraction and VERification | 2018 | source task paper | https://arxiv.org/abs/1803.05355 |
| Financial Opinion Mining and Question Answering | 2018 | source task paper | https://doi.org/10.1145/3184558.3192301 |
| HotpotQA | 2018 | source task paper | https://arxiv.org/abs/1809.09600 |
| SPECTER | 2020 | source task paper | https://arxiv.org/abs/2004.07180 |
| Overview of Touché 2020: Argument Retrieval | 2020 | source task paper | https://downloads.webis.de/touche/publications/papers/bondarenko_2020d.pdf |
| TREC-COVID | 2020 | source task paper | https://arxiv.org/abs/2005.04474 |
Metadata Summary
| Field | Value |
| Task pages | 10 |
| Queries | 1,698 |
| Split-local documents | 98,626 |
| Positive qrels | 10,158 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 5.98 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| argu_ana | NanoMTEB-v2 | en | natural_language | 199 | 8,626 | 199 | 0.3464 | 0.4092 | 0.3775 | Dense |
| climate_fever | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 621 | 0.1719 | 0.3276 | 0.2794 | Dense |
| cqadupstack_gaming | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 415 | 0.5073 | 0.6375 | 0.5970 | Dense |
| cqadupstack_unix | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 486 | 0.4001 | 0.5095 | 0.4658 | Dense |
| fever | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 229 | 0.8893 | 0.9652 | 0.9450 | Dense |
| fi_qa2018 | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 534 | 0.3799 | 0.5494 | 0.5258 | Dense |
| hotpot_qa | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 400 | 0.8950 | 0.8904 | 0.9156 | Reranking hybrid |
| scidocs | NanoMTEB-v2 | en | natural_language | 200 | 10,000 | 986 | 0.2067 | 0.2757 | 0.2565 | Dense |
| touche2020_v3 | NanoMTEB-v2 | en | natural_language | 49 | 10,000 | 1,704 | 0.8424 | 0.8810 | 0.8835 | Reranking hybrid |
| treccovid | NanoMTEB-v2 | en | natural_language | 50 | 10,000 | 4,584 | 0.3893 | 0.4177 | 0.4521 | Reranking hybrid |