NanoMTEB-v2 / touche2020_v3
Overview
NanoMTEB-v2 / touche2020_v3 is an English argument retrieval task from Touché 2020. Queries are short controversial questions, and relevant documents are argumentative passages or debate-style texts that address those questions. Touché 2020 Task 1 focused on argument retrieval: systems should retrieve passages containing relevant arguments, often with pro or con positions. This Nano split is small in query count but dense in relevance judgments, with 49 questions and 1,704 positive qrels. It is useful for studying retrieval when each information need has many relevant passages and ranking quality depends on argument usefulness, not just topical match.
Details
What the Original Data Measures
Touché 2020 measures argument retrieval for controversial questions. A relevant document should contain an argument that helps address the question, whether it supports, opposes, or otherwise contributes to the debate. This differs from fact retrieval because relevance can include multiple positions and many acceptable passages.
The MTEB version exposes the task as English retrieval over a web/debate-style document collection. Documents are often longer than the passages in typical passage-retrieval datasets and may contain lists of points, claims, examples, and argumentative framing.
Observed Data Profile
The Nano split contains 49 queries, 10,000 documents, and 1,704 positive qrel rows. Queries have 34.78 positives on average (1,704 / 49), with a minimum of 6, a median of 33, and a maximum of 65. Every query is multi-positive. Queries average 43.43 characters, while documents average 2,386.21 characters.
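The profile figures above can be recomputed from a qrels table. The following is a minimal sketch, assuming qrels are available as `(query_id, doc_id, relevance)` rows and that a positive means `relevance > 0`; the row layout and the function name `positives_per_query` are illustrative assumptions, not the dataset's actual schema.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query(qrels):
    """Summarize positive counts per query.

    qrels: iterable of (query_id, doc_id, relevance) rows (assumed layout).
    Only rows with relevance > 0 are counted, so a query with no
    positives does not appear in the summary at all.
    """
    counts = Counter(qid for qid, _doc_id, rel in qrels if rel > 0)
    values = list(counts.values())
    return {
        "queries": len(values),
        "positives": sum(values),
        "avg": mean(values),
        "median": median(values),
        "min": min(values),
        "max": max(values),
        "multi_positive": sum(1 for v in values if v > 1),
    }


# Toy rows for illustration only; these are not taken from the dataset.
toy_qrels = [
    ("q1", "d1", 1), ("q1", "d2", 1), ("q1", "d3", 0),
    ("q2", "d4", 1), ("q2", "d5", 1), ("q2", "d6", 1),
]
stats = positives_per_query(toy_qrels)

# Sanity check on the reported average: 1,704 positives over 49 queries.
reported_avg = round(1704 / 49, 2)
```

On the real split, `queries`, `positives`, and `multi_positive` would be expected to match the 49 / 1,704 / 49 figures reported here.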
The examples include homework, direct-to-consumer prescription drug advertising, child vaccination requirements, abortion legality, and standardized testing. Relevant documents may present pro arguments, con arguments, or structured debate material.
BM25 Evaluation Profile
The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.8424, hit@10 of 1.0000, and recall@100 of 0.9243. BM25 is very strong, most likely because controversial questions contain topic terms that repeat throughout relevant debate passages.
However, the high score should not be read as solving argument quality. BM25 can find the controversy topic, but it may not rank the most useful, well-supported, or directly responsive arguments first. The task's many positives make hit@10 easy; fine ranking among relevant and partially relevant passages is the harder issue.
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.8810, hit@10 of 1.0000, and recall@100 of 0.9343. Relative to BM25, that is a gain of 0.0386 in nDCG@10 and 0.0100 in recall@100. This suggests that semantic matching helps identify argumentative relevance beyond repeated topic words.
Dense retrieval is useful when a passage answers the controversy through paraphrase, policy framing, ethical reasoning, or examples that do not repeat the query exactly. It still needs to distinguish usable arguments from broad commentary.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates with no safeguard positives. It reaches nDCG@10 of 0.8835, hit@10 of 1.0000, and recall@100 of 0.9495. These are the highest scores of the three profiles, though the margin over dense retrieval is small: 0.0025 in nDCG@10 and 0.0152 in recall@100.
The hybrid result shows that sparse topic matching and dense argument-intent matching are complementary. Because every query has many positives, a hybrid pool can expose more diverse arguments to a reranker, including both exact-topic passages and semantically responsive passages.
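This card does not state how the reranking_hybrid pool is constructed. As an illustration of how a sparse and a dense ranking can be merged into one candidate pool, here is a sketch using reciprocal rank fusion (RRF), a common fusion technique chosen for this example only. The constant `k=60` is the conventional default from the RRF literature, and all document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists that
    contain it, with rank starting at 1. Ties are broken by doc ID
    so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical rankings for one query.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], top_n=3)
```

Documents ranked well by both retrievers (`d1`, `d3`) rise to the top, while documents found by only one retriever still enter the pool, which is the complementarity described above.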
Metric Interpretation for Model Researchers
Hit@10 is saturated and therefore not very informative. nDCG@10 and recall@100 are more useful. nDCG@10 captures whether high-quality relevant arguments appear early; recall@100 shows whether the candidate pool covers enough of the many relevant passages for downstream reranking or diversification.
This task should be analyzed as many-positive argument retrieval rather than single-answer search. A model can look strong on hit rate while still failing to rank the best arguments.
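To make the contrast between the three metrics concrete, the sketch below computes them for a single query. It assumes qrels are given as a `doc_id -> relevance` mapping and uses linear gain for nDCG; the evaluation code behind the reported numbers may use a different gain function or tie handling, so treat this as an illustration rather than a reference implementation.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with linear gain and a log2 discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(ranked, qrels, k=10):
    """nDCG@k for one query; qrels maps doc_id -> relevance grade."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranked[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def hit_at_k(ranked, qrels, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return float(any(qrels.get(doc_id, 0) > 0 for doc_id in ranked[:k]))


def recall_at_k(ranked, qrels, k=100):
    """Fraction of all positives that appear in the top k."""
    positives = {doc_id for doc_id, rel in qrels.items() if rel > 0}
    if not positives:
        return 0.0
    return len(positives & set(ranked[:k])) / len(positives)


# Toy query with three positives; the ranking puts a non-relevant doc first.
toy_qrels = {"a": 1, "b": 1, "c": 1}
toy_ranking = ["x", "a", "b"]
toy_hit = hit_at_k(toy_ranking, toy_qrels, k=3)
toy_ndcg = ndcg_at_k(toy_ranking, toy_qrels, k=3)
toy_recall = recall_at_k(toy_ranking, toy_qrels, k=3)
```

In the toy case the hit metric is already saturated at 1.0, while nDCG and recall both stay well below 1.0, which is the pattern to watch for on this task.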
Query and Relevance Type Tendencies
Queries are short controversial questions, usually phrased as "Should..." or "Is..." questions. Relevant documents are longer argumentative passages, debate posts, or structured lists of points. They may support either side of the controversy.
The relevance relation is argumentative usefulness for the question. A passage must address the issue with a usable argument, not just mention the topic.
Representative Failure Modes
Common failures include retrieving generic commentary instead of an argument, ranking off-aspect passages that discuss the same controversy, failing to distinguish pro and con argumentative roles, and over-ranking long documents with many repeated topic terms. Dense systems may retrieve semantically broad passages; sparse systems may retrieve keyword-heavy but weak arguments.
Training Data That May Help
Useful training data includes argument retrieval collections, debate passages aligned to controversial questions, stance corpora, claim-evidence corpora, and hard negatives from the same topic with weak or off-target argumentation. Multi-positive training is required because each query has many relevant arguments.
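One way to train with several positives per query is a contrastive loss that averages the log-likelihood over all positives in the candidate list. The sketch below is one common formulation, written in plain Python for readability; the temperature value and the function name `multi_positive_loss` are assumptions for illustration, and this card does not prescribe a specific loss.

```python
import math


def multi_positive_loss(scores, positive_mask, temperature=0.05):
    """Mean negative log-softmax over the positives of one query.

    scores: similarity of the query to each candidate.
    positive_mask: True where the candidate is a positive.
    The softmax is taken over all candidates, positives and negatives.
    """
    scaled = [s / temperature for s in scores]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    positives = [s for s, is_pos in zip(scaled, positive_mask) if is_pos]
    if not positives:
        raise ValueError("at least one positive is required")
    return -sum(s - log_z for s in positives) / len(positives)


# Two positives and two same-topic hard negatives (hypothetical scores).
mask = [True, True, False, False]
good = multi_positive_loss([0.9, 0.8, 0.2, 0.1], mask)
bad = multi_positive_loss([0.2, 0.1, 0.9, 0.8], mask)
```

With this formulation the loss falls when all positives outscore the negatives, rather than rewarding a single best positive. The negatives would ideally be the weak or off-target same-topic passages mentioned above.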
Model Improvement Notes
Models should learn argument responsiveness and not only topicality. Rerankers should be trained to identify whether a passage gives a clear reason, evidence, or counterpoint for the question. Diversification may also matter because the relevant set often spans multiple argumentative stances and aspects.
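The diversification point can be illustrated with maximal marginal relevance (MMR), a standard re-ranking technique that trades relevance against redundancy with already selected passages. MMR is used here as an example only; the card does not name a method. The token-level Jaccard similarity, the `lam` weight, and all passages and scores are simplifying assumptions, and a real system would more likely compare embeddings.

```python
def jaccard(text_a, text_b):
    """Token-set Jaccard similarity between two texts."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mmr(candidates, relevance, texts, lam=0.7, top_n=10):
    """Greedy MMR selection.

    Each step picks the candidate maximizing
    lam * relevance - (1 - lam) * max similarity to selected docs.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(doc_id):
            redundancy = max(
                (jaccard(texts[doc_id], texts[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance[doc_id] - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical passages: "a" and "b" are duplicates, "c" argues the other side.
texts = {
    "a": "homework helps students learn",
    "b": "homework helps students learn",
    "c": "homework causes stress and fatigue",
}
relevance = {"a": 0.9, "b": 0.85, "c": 0.6}
diverse = mmr(["a", "b", "c"], relevance, texts, lam=0.5, top_n=3)
```

Here the duplicate of the top passage is pushed below the lower-scored passage that argues a different position.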
Example Data
| Query | Positive document |
| Is homework beneficial? [23 chars] | First, there are three arguments for why homework is excellent and ought to continue in modern schools. 1. Homework aids doer-learners. It is generally accepted that there are three types of learners: those who learn by hearing, those who learn by seeing, and those who learn by doing. While many are content to hear or see instruction of a given subject, some need to actually do it. Thus, homework is beneficial for this latter group because the instruction is learn through action. 2. Homework reinforces instruction. Although many would probably be thrilled to not have homework, the quality of the education received would certainly suffer if it was removed. Whether the homework is assigned reading, term papers, etc. , all of it is designed to reinforce the instruction in the students' minds. After all, those who do their homework are more academically successful than those who do not. I feel that this is a self-evident truth, but I'll leave it Pro to dissuade you. 3. Homework mirrors rea... [1,000 / 3,553 chars] |
| Should prescription drugs be advertised directly to consumers? [62 chars] | Many ads don't include enough information on how well drugs work. For example, Lunesta is advertised by a moth floating through a bedroom window, above a peacefully sleeping person. Actually, Lunesta helps patients sleep 15 minutes faster after six months of treatment and gives 37 minutes more sleep per night. The Majority of ads are based on emotional appeals, but few include causes of the condition, risk factors, or important lifestyle changes. In a study of 38 pharmaceutical advertisements researchers found that 82 percent made a factual claim and 86 percent made rational arguments for product use. Only 26 percent described condition causes, risk factors, or prevalence.[1] Thus not giving the patients balanced information that would make them aware, that taking one of the pills is not a magic solution to their problem. Actually, according to a study conducted in the US and New Zealand, patients requested prescriptions in 12% of surveyed visits. Of these requests, 42% were for produc... [1,000 / 1,682 chars] |
| Should any vaccines be required for children? [45 chars] | Not a full case yet.. Just some little points I put together... Governments should not have the right to intervene in the health decisions parents make for their children. 31% of parents believe they should have the right to refuse mandated school entry vaccinations for their children, according to a 2010 survey by the University of Michigan. Many parents hold religious beliefs against vaccination. Forcing such parents to vaccinate their children would violate the 1st Amendment which guarantees citizens the right to the free exercise of their religion. Vaccines are often unnecessary in many cases where the threat of death from disease is small. During the early nineteenth century, mortality for the childhood diseases whooping cough, measles, and scarlet fever fell drastically before immunization became available. This decreased mortality has been attributed to improved personal hygiene, water purification, effective sewage disposal, and better food hygiene and nutrition. Vaccines inter... [1,000 / 4,497 chars] |
Source Reference Table
| Title | Year | Type | URL |
| Overview of Touché 2020: Argument Retrieval | 2020 | source task paper | https://downloads.webis.de/touche/publications/papers/bondarenko_2020d.pdf |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| mteb/webis-touche2020-v3 | n/a | dataset card | https://huggingface.co/datasets/mteb/webis-touche2020-v3 |
Dataset Information
| Field | Value |
| Nano set | NanoMTEB-v2 |
| Backing dataset | NanoMTEB-v2 |
| Task / split | touche2020_v3 |
| Hugging Face dataset | hakari-bench/NanoMTEB-v2 |
| Language | en |
| Category | natural_language |
| Queries | 49 |
| Documents | 10,000 |
| Positive qrels | 1,704 |
| Positives / query avg | 34.78 |
| Positives / query min | 6 |
| Positives / query median | 33.00 |
| Positives / query max | 65 |
| Multi-positive queries | 49 (100.00%) |
| Query length avg chars | 43.43 |
| Document length avg chars | 2,386.21 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.8424 | 1.0000 | 0.9243 | top-500 |
| Dense | harrier_oss_v1_270m | 0.8810 | 1.0000 | 0.9343 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.8835 | 1.0000 | 0.9495 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: MTEB Webis Touché 2020 v3 test split
- Train/eval overlap audit: not_audited
- Leakage note: exclude NanoMTEB-v2 touche2020_v3 controversial questions and passages
- Multi-positive training: required
- Useful training data: argument retrieval data, debate passage collections, stance and claim-evidence corpora
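The leakage note above can be applied as a filtering step when assembling training pairs. The following is a minimal sketch, assuming training data is a list of `(query, passage)` string pairs and that the evaluation queries and passages are available as plain strings; the helper names `normalize` and `drop_leaked` are hypothetical. It only catches matches that are identical after normalization, so near-duplicates would need fuzzy or embedding-based matching on top.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()


def drop_leaked(train_pairs, eval_queries, eval_passages=()):
    """Remove training pairs whose query or passage matches evaluation data."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_passages = {normalize(p) for p in eval_passages}
    return [
        (query, passage)
        for query, passage in train_pairs
        if normalize(query) not in blocked_queries
        and normalize(passage) not in blocked_passages
    ]


# The first eval query is taken from the example table; the rest is made up.
eval_queries = ["Is homework beneficial?"]
train_pairs = [
    ("is homework  beneficial", "Some passage about homework."),
    ("Should schools ban phones?", "Some passage about phones."),
]
clean_pairs = drop_leaked(train_pairs, eval_queries)
```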