HAKARI-Bench

NanoMMTEB-v2 / argu_ana

Overview

NanoMMTEB-v2 / argu_ana is the ArguAna counterargument retrieval task. Each query is a long debate argument, and the relevant document is the paired counterargument. The Nano split has 199 queries, 8,626 documents, and 199 positive qrel rows, with exactly one positive document per query. Current diagnostics show dense retrieval as the strongest top-rank profile, reranking_hybrid as the strongest recall@100 profile, and BM25 as useful but weaker because same-topic supporting arguments are strong lexical distractors.

Details

What the Original Data Measures

ArguAna was introduced for retrieval of the best counterargument without prior topic knowledge. The task asks a system to take one argument and retrieve the paired argument that rebuts it. MTEB includes ArguAna in its English retrieval suite, using ranking metrics such as nDCG@10.

The important distinction is that relevance is not simple topical similarity. The positive must address the same debate issue and aspect while taking an opposing stance. A same-topic argument that supports the query can be a hard negative.

Observed Data Profile

The split contains 199 queries, 8,626 documents, and 199 positive qrel rows. Every query has exactly one positive document. Queries average 1,199.80 characters, while documents average 1,029.60 characters.

The text consists of long debate prose with topic labels, claims, warrants, examples, and citations. Examples involve abortion policy, technological development, vegetarianism and food safety, baseball collisions, community radio, climate change, animal ethics, sports policy, media, and good government.

BM25 Evaluation Profile

The dataset-provided BM25 candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.3464, hit@10 = 0.7387, and recall@100 = 0.9548. BM25 can find the debate topic because the query and counterargument often share issue terms, policy names, and domain vocabulary.

BM25's weakness is that it cannot model stance or argumentative relation. Lexical overlap can retrieve same-topic arguments that agree with the query or discuss a different aspect. Term frequency is therefore useful for candidate generation but insufficient for identifying the best counterargument.
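The lexical-distractor effect can be illustrated with a minimal sketch. The scorer below is a plain Okapi BM25 with common default parameters (k1 = 1.5, b = 0.75), not the benchmark's actual BM25 configuration, and the three short documents are invented toy texts rather than rows from the dataset.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a plain Okapi BM25.

    Assumption: whitespace tokenization and default k1/b; the benchmark's
    real BM25 settings are not stated in this card.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy texts (hypothetical, not taken from the dataset).
query = "vegetarian diets reduce food poisoning risk from meat"
docs = [
    "vegetarian diets reduce food poisoning risk because meat carries bacteria",  # same stance
    "raw vegetables can also cause food poisoning so hygiene matters",  # counterargument
    "community radio stations serve local listeners",  # off-topic
]
supporting, counter, off_topic = bm25_scores(query, docs)
print(supporting > counter > off_topic)
```

In this toy case the same-stance document outranks the counterargument simply because it repeats more of the query's vocabulary, which is the pattern described above.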

Dense Evaluation Profile

The dense harrier_oss_v1_270m candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.3998, hit@10 = 0.8141, and recall@100 = 0.9497. Dense retrieval is the strongest observed top-rank profile.

This suggests that embedding similarity helps capture argument-level relationships beyond exact word overlap. A dense model can connect related claims, consequences, examples, and policy frames even when the counterargument uses different wording. However, the single-positive setup still punishes models that retrieve a plausible same-topic response that is not the annotated pair.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset contains 100 candidates for most queries; two queries carry an additional rank-101 safeguard row. It achieves nDCG@10 = 0.3716, hit@10 = 0.7638, and recall@100 = 0.9899. Hybrid retrieval has the best recall@100 but trails dense retrieval on top-rank quality.

This profile indicates that combining lexical and dense evidence is valuable for keeping the positive in the candidate pool. The ranking itself still needs argument-relation modeling: a reranker must distinguish counterarguments from supporting or merely adjacent debate texts.
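One common way to combine lexical and dense evidence is reciprocal rank fusion (RRF). The sketch below is an illustration only: this card does not state how the reranking_hybrid subset was built, so RRF, the constant k = 60, and the function name `reciprocal_rank_fusion` are assumptions, and the document ids are invented.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into one list.

    Assumption: RRF stands in for whatever fusion the benchmark uses.
    Each list contributes 1 / (k + rank) to a document's fused score.
    """
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by doc id for determinism.
    ordered = sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))
    return ordered[:top_n]


# Hypothetical ranked lists from a lexical and a dense retriever.
bm25_ranking = ["d_support", "d_counter", "d_other"]
dense_ranking = ["d_counter", "d_adjacent", "d_support"]
candidates = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(candidates)
```

Fusion of this kind widens the candidate pool: a document found by either retriever survives. It does not by itself separate counterarguments from supporting texts, which is why the ordering step still matters.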

Metric Interpretation for Model Researchers

This task is single-positive: each query has one annotated paired counterargument. Hit@10 measures whether that paired document appears near the top. nDCG@10 is sensitive to its exact rank, and recall@100 measures whether it is available to a downstream reranker.
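With exactly one positive per query, the three metrics reduce to simple functions of the positive's rank. The sketch below assumes binary relevance, so the ideal DCG is 1 and nDCG@10 becomes 1 / log2(rank + 1) when the positive is in the top 10. The function name `single_positive_metrics` is hypothetical.

```python
import math


def single_positive_metrics(ranked_ids, positive_id):
    """Per-query hit@10, nDCG@10, and recall@100 for one annotated positive.

    Assumption: binary relevance with a single positive, so IDCG = 1.
    """
    rank = ranked_ids.index(positive_id) + 1 if positive_id in ranked_ids else None
    in_top_10 = rank is not None and rank <= 10
    in_top_100 = rank is not None and rank <= 100
    return {
        "hit@10": 1.0 if in_top_10 else 0.0,
        "ndcg@10": 1.0 / math.log2(rank + 1) if in_top_10 else 0.0,
        "recall@100": 1.0 if in_top_100 else 0.0,
    }


# Toy ranking with the positive at rank 3.
ranking = [f"d{i}" for i in range(1, 201)]
print(single_positive_metrics(ranking, "d3"))
```

Averaging these per-query values over all queries gives the dataset-level figures. The discount explains why nDCG@10 sits well below hit@10: a positive at rank 3 already contributes only 0.5.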

Because many documents can be topically plausible, false positives may look reasonable to a generic semantic retriever. The key metric signal is whether the model retrieves the specific opposing response, not merely an argument on the same topic.

Query and Relevance Type Tendencies

Queries are long English arguments with explicit claims, supporting reasons, examples, and policy framing. They may include stance markers, topic labels, and citations. Relevant documents are long counterarguments that share the same issue while challenging a premise, consequence, analogy, factual assumption, or policy recommendation.

The task rewards aspect-level matching and stance awareness. It penalizes models that collapse all debate texts about the same topic into a single semantic neighborhood.

Representative Failure Modes

BM25 can retrieve a same-topic same-stance argument because it shares vocabulary with the query. Dense retrieval can retrieve a semantically related but non-countering document, especially when several arguments discuss the same policy frame. Hybrid retrieval can increase candidate coverage while preserving these same-topic distractors.

Rerankers can fail when they do not model opposition explicitly. A useful reranker should identify what claim is being answered and whether the candidate attacks, qualifies, or supports that claim.

Training Data That May Help

Useful training data includes argument-counterargument pairs outside the evaluation split, stance-labeled debate data, argument mining datasets, and same-topic same-stance hard negatives. Training on duplicate or paraphrase retrieval alone is not enough because the positive often disagrees with the query.

Synthetic data can generate long argument and counterargument pairs where the positive challenges a premise, consequence, analogy, or policy framing. Negatives should include same-topic supporting arguments so the model learns opposition and aspect matching. Evaluation arguments and paired positives from this Nano split should be excluded from training.
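The negative-sampling and leakage rules above can be sketched as a small triplet builder. The record layout (`Argument` with `topic` and `stance` fields), the function name `build_triplets`, and the toy ids are all assumptions for illustration; the actual training corpus would need its own topic and stance labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Argument:
    arg_id: str
    topic: str
    stance: str  # hypothetical label, e.g. "pro" or "con"
    text: str


def build_triplets(arguments, pairs, eval_ids, max_negatives=2):
    """Build (query, positive, hard negative) id triplets.

    Hard negatives share the query's topic AND stance, so the model must
    learn opposition rather than topical similarity. Any pair or negative
    that touches the evaluation split (eval_ids) is skipped.
    """
    by_id = {arg.arg_id: arg for arg in arguments}
    triplets = []
    for query_id, positive_id in pairs:
        if query_id in eval_ids or positive_id in eval_ids:
            continue  # leakage guard: never train on evaluation pairs
        query = by_id[query_id]
        negatives = [
            arg.arg_id
            for arg in arguments
            if arg.topic == query.topic
            and arg.stance == query.stance
            and arg.arg_id not in (query_id, positive_id)
            and arg.arg_id not in eval_ids
        ]
        for negative_id in negatives[:max_negatives]:
            triplets.append((query_id, positive_id, negative_id))
    return triplets


# Toy corpus (hypothetical ids and labels).
corpus = [
    Argument("a1", "vegetarianism", "pro", "..."),
    Argument("a2", "vegetarianism", "con", "..."),
    Argument("a3", "vegetarianism", "pro", "..."),
    Argument("a4", "climate", "pro", "..."),
    Argument("a5", "climate", "con", "..."),
    Argument("a6", "climate", "pro", "..."),
]
print(build_triplets(corpus, [("a1", "a2"), ("a4", "a5")], eval_ids={"a4"}))
```

Here the second pair is dropped because its query belongs to the held-out set, and the surviving triplet uses a same-topic, same-stance argument as the negative.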

Model Improvement Notes

Dense retrievers should encode stance and argumentative role, not only topic. Sparse systems can help preserve issue vocabulary but need reranking to avoid same-stance distractors. Cross-encoders or instruction-tuned rerankers should compare the query argument and candidate response at the claim and warrant level.

For hybrid systems, NanoMMTEB-v2 / argu_ana is a good candidate-generation test: reranking_hybrid gives the best recall@100, while dense retrieval gives the best nDCG@10. The next improvement is a reranker that converts hybrid coverage into correct counterargument ordering.

Example Data

Query | Positive document
--- | ---
Opposition to partial birth abortion is part of a strategy intended to ban abortion in general Partial-birth abortions form a tiny proportion of all abortions, but from a medical and psychological point of view they ought to be the least controversial. The reason for this focus is that late-term abortions are the most obviously distasteful, because late-term foetuses look more like babies than embryos or foetuses at an earlier developmental stage. Late-term abortions therefore make for the best... [500 / 704 chars] | pregnancy philosophy ethics life family house would ban partial birth abortions Although many people who are against partial-birth abortion are against abortion in general, there is no necessary link, as partial-birth abortion is a particularly horrifying form of abortion. This is for the reasons already explained: it involves a deliberate, murderous physical assault on a half-born baby, whom we know for certain will feel pain and suffer as a result. We accept that there is some legitimate medical debate about whether embryos and earlier foetuses feel pain; there is no such debate in this case, and this is why partial-birth abortion is uniquely horrific, and uniquely unjustifiable. [691 chars]
New Technology Humanity has revolutionized the world repeatedly through such monumental inventions as agriculture, steel, anti-biotics, and microchips. And as technology has improved, so too has the rate at which technology improves. It is predicted that there will be 32 times more change between 2000 and 2050 than there was between 1950 and 2000. In the midst of this, many great minds will be focussed on emissions abatement and climate control technologies. So, even if the most severe climate p... [500 / 1,013 chars] | climate house believes were too late global climate change Technological improvements will almost certainly be developed for those who can afford them (as most technology is). However, climate change will have the greatest effect on poor countries that cannot afford mitigation. Potentially, being able to protect the wealthy does not mean that we are not too late on global climate change. [391 chars]
Being vegetarian reduces risks of food poisoning Almost all dangerous types of food poisoning are passed on through meat or eggs. So Campylobacter bacteria, the most common cause of food poisoning in England, are usually found in raw meat and poultry, unpasteurised milk and untreated water. Salmonella come from raw meat, poultry and dairy products and most cases of escherichia coli (E-Coli) food poisoning occur after eating undercooked beef or drinking unpasteurised milk. [1] Close contact betwe... [500 / 810 chars] | animals environment general health health general weight philosophy ethics Food safety and hygiene are very important for everyone, and governments should act to ensure that high standards are in place particularly in restaurants and other places where people get their food from. But food poisoning can occur anywhere “People don't like to admit that the germs might have come from their own home” [1] and while meat is particularly vulnerable to contamination there are bacteria that can be transmitted on vegetables, for example Listeria monocytogenes can be transmitted raw vegetables. [2] Almost three-quarters of zoonotic transmissions are caused by pathogens of wildlife origin; even some that could have been caused by livestock such as avian flu could equally have come from wild animals. There is little we can do about the transmission of such diseases except by reducing close contact. Thus changing to vegetarianism may reduce such diseases by reducing contact but would not eliminate th... [1,000 / 1,580 chars]

Source Reference Table

Title | Year | Type | URL
--- | --- | --- | ---
Retrieval of the Best Counterargument without Prior Topic Knowledge | 2018 | task paper | https://aclanthology.org/P18-1023/
MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316
mteb/arguana | 2024 | dataset card | https://huggingface.co/datasets/mteb/arguana

Dataset Information

Field | Value
--- | ---
Nano set | NanoMMTEB-v2
Backing dataset | NanoMMTEB-v2
Task / split | argu_ana
Hugging Face dataset | hakari-bench/NanoMMTEB-v2
Language | en
Category | natural_language
Queries | 199
Documents | 8,626
Positive qrels | 199
Positives / query avg | 1.00
Positives / query min | 1
Positives / query median | 1.00
Positives / query max | 1
Multi-positive queries | 0 (0.00%)
Query length avg chars | 1,199.80
Document length avg chars | 1,029.60

Candidate Subsets

Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates
--- | --- | --- | --- | --- | ---
BM25 | bm25 | 0.3464 | 0.7387 | 0.9548 | top-500
Dense | harrier_oss_v1_270m | 0.3998 | 0.8141 | 0.9497 | top-500
Reranking hybrid | reranking_hybrid | 0.3716 | 0.7638 | 0.9899 | top-100
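The comparison across profiles can be made concrete with a few lines of arithmetic. The sketch below uses only the figures reported above and the 199-query count. The missed-positive counts are derived by rounding (1 - recall@100) × 199, so they are estimates from rounded metrics rather than values read from the qrels.

```python
# Reported metrics per candidate profile (copied from the table above).
profiles = {
    "bm25": {"ndcg@10": 0.3464, "hit@10": 0.7387, "recall@100": 0.9548},
    "harrier_oss_v1_270m": {"ndcg@10": 0.3998, "hit@10": 0.8141, "recall@100": 0.9497},
    "reranking_hybrid": {"ndcg@10": 0.3716, "hit@10": 0.7638, "recall@100": 0.9899},
}
NUM_QUERIES = 199  # one positive per query, so recall is a per-query hit rate

# Which profile leads on each metric.
best_by_metric = {
    metric: max(profiles, key=lambda name: profiles[name][metric])
    for metric in ("ndcg@10", "hit@10", "recall@100")
}

# Estimated number of queries whose positive is outside the top 100.
missed_at_100 = {
    name: round((1 - metrics["recall@100"]) * NUM_QUERIES)
    for name, metrics in profiles.items()
}

print(best_by_metric)
print(missed_at_100)
```

Read this way, the hybrid profile loses the positive for only a handful of queries, while the dense profile orders the top 10 best. That is the gap a stance-aware reranker over hybrid candidates would need to close.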

Training and Leakage Metadata