MNanoBEIR / NanoBEIR-es / NanoArguAna

Overview

This task is the Spanish NanoBEIR version of ArguAna, an argument retrieval benchmark where the query is an argument and the relevant document is its counterargument. The original ArguAna task studies retrieval of the best counterargument without assuming prior topic knowledge, using debate-portal argument pairs where relevant documents often discuss the same issue while taking the opposite stance. In this NanoBEIR slice, long arguments translated into Spanish serve as queries and must retrieve their translated counterarguments from a pool of 3,635 candidate documents. There are 50 queries and 50 positive relevance judgments, with exactly one positive per query. The task is a compact diagnostic for stance-aware retrieval: models must recognize topic continuity, shared argumentative aspects, and the rebuttal relation, not just ordinary semantic similarity.

Details

What the Original Data Measures

ArguAna measures counterargument retrieval. A good result should answer an argument with an opposing argument that targets the same issue or premise. This differs from standard topical retrieval because a document that agrees with the query can be lexically and semantically close while still being irrelevant. The task therefore tests whether a retriever can separate same-topic support from same-topic rebuttal, using long argumentative passages with premises, conclusions, examples, and cited evidence.

Observed Data Profile

The Spanish Nano task has 50 queries, 3,635 documents, and 50 positives. Every query has one positive counterargument. Queries are very long, averaging about 1,220 characters, and documents average about 1,111 characters. Example topics include reform of the House of Lords, Heathrow expansion, excessive consumer choice, cyberattacks by non-state actors, and limits on religiously motivated speech. Both queries and positive passages are translated debate arguments rather than short questions or factual statements.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.413, Hit@10 of 0.700, and Recall@100 of 0.940. This shows that sparse retrieval is effective at candidate discovery because counterarguments usually reuse the same topic vocabulary, named institutions, policy terms, and debate-specific language. However, BM25 is weaker at top ranking because lexical overlap does not distinguish rebuttal from support. A same-topic pro argument can look highly relevant to a sparse model even when the true positive is the opposing stance.
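The ranking weakness described above can be illustrated with a toy overlap score. The sketch below uses plain Jaccard token overlap as a stand-in for a sparse scorer (it is not BM25), and the three short English strings are invented for illustration; they are not dataset records.

```python
def jaccard_overlap(query: str, document: str) -> float:
    """Share of distinct lowercase tokens that query and document have in common."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    if not query_tokens and not doc_tokens:
        return 0.0
    return len(query_tokens & doc_tokens) / len(query_tokens | doc_tokens)


# Invented toy strings (assumption: not taken from NanoBEIR-es).
query = "heathrow expansion creates jobs"
same_side = "heathrow expansion creates many jobs"
rebuttal = "businesses do not back heathrow expansion"

support_score = jaccard_overlap(query, same_side)
rebuttal_score = jaccard_overlap(query, rebuttal)
print(f"same-side: {support_score:.2f}  rebuttal: {rebuttal_score:.2f}")
```

In this toy case the same-side string scores higher than the rebuttal, which is the pattern the paragraph describes: word overlap rewards agreement on vocabulary, not opposition in stance.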

Dense Evaluation Profile

The dense harrier-oss-270m baseline (config `harrier_oss_v1_270m`) gives the best top-10 ranking of the three profiles, with nDCG@10 of 0.481, Hit@10 of 0.860, and Recall@100 of 0.900. Dense retrieval can capture broader argumentative fit and paraphrased premise relations, which helps it rank counterarguments above simple lexical matches. Its Recall@100 is lower than BM25's (0.900 versus 0.940), suggesting that exact topic terms still matter for broad candidate coverage. The dense profile is nevertheless the strongest first-stage ranking signal for this Spanish slice, plausibly because it handles semantic relations between long passages better.

Reranking Hybrid Evaluation Profile

The reranking_hybrid profile reaches nDCG@10 of 0.436, Hit@10 of 0.740, and Recall@100 of 0.980, with one safeguard row at 101 candidates. It has the best Recall@100 but does not beat dense retrieval in top-10 ranking. This is a typical pattern for stance-aware retrieval: hybrid search is valuable for ensuring the counterargument appears in the candidate set, but dense ranking may be better at ordering the correct rebuttal above other same-topic passages. The hybrid result is most useful as a reranking pool.
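The card does not state how the reranking_hybrid candidate pool is constructed. As one common way to merge a sparse and a dense ranking into a single pool, the sketch below uses reciprocal rank fusion (RRF); the function name, the constant `k=60`, and the `top_n` cutoff are assumptions for illustration, not the benchmark's actual procedure.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(
    rankings: List[List[str]], k: int = 60, top_n: int = 100
) -> List[str]:
    """Fuse several ranked lists of document ids into one candidate pool.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank is 1-based); documents are then sorted by total score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ordered[:top_n]]


# Hypothetical ranked lists for a single query.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

A document found by only one retriever still enters the pool, which is why fusion of this kind tends to help coverage more than it helps ordering.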

Metric Interpretation for Model Researchers

Because each query has exactly one positive, Recall@100 directly measures whether the correct counterargument is available for reranking. nDCG@10 and Hit@10 measure whether the model can place that counterargument where a user would see it. The gap between hybrid recall and dense nDCG suggests a two-stage opportunity: use hybrid retrieval for coverage, then apply a stance-aware reranker to separate rebuttal from topical similarity.
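With a single positive per query the three metrics reduce to simple functions of the positive's rank: the ideal DCG is 1, so nDCG@10 is 1 / log2(rank + 1) when the positive is in the top 10 and 0 otherwise. The sketch below shows that reduction; the function name and the dictionary-based input layout are assumptions for illustration, not the benchmark's evaluation code.

```python
import math
from typing import Dict, List


def single_positive_metrics(
    rankings: Dict[str, List[str]], positives: Dict[str, str]
) -> Dict[str, float]:
    """Average nDCG@10, Hit@10, and Recall@100 when each query has one positive."""
    ndcg10 = hit10 = recall100 = 0.0
    for query_id, ranked_ids in rankings.items():
        positive_id = positives[query_id]
        # 1-based rank of the positive, or None if it was not retrieved.
        rank = ranked_ids.index(positive_id) + 1 if positive_id in ranked_ids else None
        if rank is not None and rank <= 10:
            hit10 += 1.0
            ndcg10 += 1.0 / math.log2(rank + 1)  # ideal DCG is 1 with one positive
        if rank is not None and rank <= 100:
            recall100 += 1.0
    n = len(rankings)
    return {"ndcg@10": ndcg10 / n, "hit@10": hit10 / n, "recall@100": recall100 / n}


# Hypothetical rankings: positive at rank 1, at rank 3, and missing.
toy_rankings = {"q1": ["d1", "d2"], "q2": ["d9", "d3", "d4"], "q3": ["d5"]}
toy_positives = {"q1": "d1", "q2": "d4", "q3": "d6"}
toy_metrics = single_positive_metrics(toy_rankings, toy_positives)
```

With 50 queries, the reported Recall@100 values of 0.940, 0.900, and 0.980 correspond to 47, 45, and 49 positives inside the top 100, and the Hit@10 values of 0.700, 0.860, and 0.740 correspond to 35, 43, and 37 positives inside the top 10.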

Query and Relevance Type Tendencies

Queries are long arguments that include claims, supporting reasons, and examples. Relevant documents are counterarguments that share the same controversial issue but reverse the stance or challenge the premise. Many hard negatives are likely to discuss the same topic and even use similar evidence, so relevance depends on argumentative role and opposition, not merely topical overlap.

Representative Failure Modes

BM25 can retrieve supporting arguments because they repeat many of the same words. Dense retrieval can retrieve conceptually close arguments that do not actually rebut the query. Hybrid retrieval can include the positive but rank a same-side argument above it. Failure analysis should inspect the stance relation: does the retrieved document challenge the query's core claim, or does it simply discuss the issue?

Training and Leakage Considerations

The train/eval overlap for this task has not been audited, so training should exclude ArguAna, BEIR, NanoBEIR, and translated debate records likely to overlap with the evaluation arguments. Useful non-overlapping data includes argument-counterargument pairs, stance-aware retrieval datasets, debate-portal argument pairs, claim-rebuttal data, and Spanish or multilingual argument mining corpora. Synthetic data should pair pro and con arguments on the same issue with explicit stance reversal, and should include same-topic hard negatives.
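One way to turn paired pro and con arguments into training examples with same-topic hard negatives is sketched below. The `Argument` record, its field names, and the `build_triplets` helper are hypothetical; the sketch assumes each counterargument stores the id of the argument it rebuts.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Argument:
    arg_id: str
    topic: str
    stance: str                   # "pro" or "con"
    rebuts: Optional[str] = None  # id of the argument this one directly rebuts


def build_triplets(arguments: List[Argument]) -> List[Tuple[str, str, List[str]]]:
    """Return (query_id, positive_id, hard_negative_ids) for each rebutted argument.

    The positive is the argument that directly rebuts the query. Hard negatives
    are all other arguments on the same topic, whatever their stance.
    """
    triplets = []
    for query in arguments:
        rebuttals = [a for a in arguments if a.rebuts == query.arg_id]
        if not rebuttals:
            continue  # no known counterargument, so no training example
        positive = rebuttals[0]
        negatives = [
            a.arg_id
            for a in arguments
            if a.topic == query.topic and a.arg_id not in (query.arg_id, positive.arg_id)
        ]
        triplets.append((query.arg_id, positive.arg_id, negatives))
    return triplets


# Hypothetical records: a2 rebuts a1, a3 shares the topic, a4 does not.
toy_arguments = [
    Argument("a1", "heathrow", "pro"),
    Argument("a2", "heathrow", "con", rebuts="a1"),
    Argument("a3", "heathrow", "pro"),
    Argument("a4", "lords", "pro"),
]
toy_triplets = build_triplets(toy_arguments)
```

Keeping off-topic arguments out of the negative list is deliberate: the hard part of this task is separating rebuttal from same-topic discussion, so same-topic negatives carry most of the training signal.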

Model Improvement Signals

Strong models should improve stance-sensitive ranking while preserving topic coverage. Useful training signals include long-passage contrastive examples, premise-targeted rebuttals, pro/con hard negatives, and multilingual argument mining data. Hybrid systems should use BM25 for reliable issue matching and dense or cross-encoder scoring for the counterargument relation.
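The second stage of such a system can be sketched as a reranker over a candidate pool. In the sketch below, `rerank_pool` and `score_fn` are hypothetical names: `score_fn` stands in for whatever dense or cross-encoder model scores the counterargument relation, and no particular model or library is assumed.

```python
from typing import Callable, List, Sequence


def rerank_pool(
    query: str,
    pool: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 10,
) -> List[str]:
    """Reorder a first-stage candidate pool by a query-document scoring function.

    Higher scores rank first; ties keep the first-stage order.
    """
    scored = [(score_fn(query, doc), position, doc) for position, doc in enumerate(pool)]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, _, doc in scored[:top_k]]


# Placeholder scorer for demonstration only: it simply prefers longer documents.
def toy_length_score(query: str, document: str) -> float:
    return float(len(document))


reranked = rerank_pool("query text", ["aa", "b", "cccc"], toy_length_score, top_k=2)
```

Because the reranker only reorders what the first stage supplies, its ceiling is the pool's recall, which is why a high-recall pool matters for this design.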

Example Data

Query	Positive document
El público es apático ante la reforma. Es discutible si la reforma de la Cámara de los Lores debería ser una prioridad en el actual clima económico, ya no digamos si un gobierno de coalición podría iniciar e implementar tales medidas. Los intentos de reformar la Cámara de los Lores han sido pospuestos una y otra vez, demostrando las reticencias de la Cámara de los Comunes al cambio. Un sentimiento que sin duda se refleja en la opinión pública británica, como se demostró con el reciente resultado... [500 / 572 chars]	La campaña de voto alternativo no puede compararse con una reforma del sistema político. Además, no se debe confundir a un público mal informado debido a la manipulación política con apatía. A menudo, los votantes expresan que son apáticos porque sienten que no pueden cambiar nada, que su voto no tendrá valor. Una reforma que garantice que las personas que gobiernan el país sean directamente elegidas por el pueblo ayudaría a contrarrestar estos sentimientos. [462 chars]
La expansión de Heathrow es vital para la economía. La expansión de Heathrow garantizaría muchos de los empleos actuales y crearía nuevos. Actualmente, Heathrow sostiene aproximadamente 250,000 empleos. Además, cientos de miles más dependen del comercio turístico en Londres, que se basa en buenas conexiones de transporte como Heathrow. Perder competitividad frente a otros aeropuertos europeos no solo podría implicar desperdiciar la oportunidad de crear nuevos empleos, sino también perder algunos... [500 / 1,285 chars]	La comunidad empresarial está lejos de estar unida en su supuesto apoyo a una tercera pista. Las encuestas sugieren que muchos negocios influyentes, en realidad, no apoyan la expansión. Una carta expresando preocupación fue firmada por Justin King, el Director Ejecutivo de J Sainsbury, y James Murdoch de BskyB. [1] Por lo tanto, considerar a la comunidad empresarial como una sola voz que pide la expansión es un error. También debemos recordar, al considerar las alternativas a la nueva pista de Heathrow, como una nueva pista en otro aeropuerto de Londres o un aeropuerto completamente nuevo, que estas probablemente tendrían un impacto económico similar al de la expansión de Heathrow. Si lo que importa son las conexiones para atraer negocios y turistas, siempre y cuando la conexión sea con Londres, no importa desde qué aeropuerto provenga. Incluso podría haber menos necesidad de que el aeropuerto sea un centro de conexiones si nos enfocamos en los beneficios para Londres. Como dijo Bob Ay... [1,000 / 1,438 chars]
Las personas tienen demasiadas opciones, lo que las hace menos felices. La publicidad lleva a muchas personas a sentirse abrumadas por la necesidad interminable de decidir entre demandas competitivas de su atención – esto se conoce como la tiranía de la elección o sobrecarga de opciones. Investigaciones recientes sugieren que, en promedio, las personas son menos felices que hace 30 años, a pesar de estar mejor y tener muchas más opciones de cosas en las que gastar su dinero. Las afirmaciones de... [500 / 989 chars]	Las personas están descontentas porque no pueden tenerlo todo, no porque se les ofrezca demasiadas opciones y eso les resulte estresante. De hecho, los anuncios juegan un papel crucial al asegurar que el dinero que las personas tienen, lo gasten en el producto más adecuado para ellas. Si no se permitieran los anuncios, la gente desperdiciaría dinero en un producto inicial cuando, de tener la opción, claramente elegirían otro. Un meta-análisis que incorporó investigaciones de 50 estudios independientes no encontró ninguna conexión significativa entre la elección y la ansiedad, pero especuló que la varianza en los estudios dejaba abierta la posibilidad de que la sobrecarga de opciones podría estar relacionada con ciertas condiciones muy específicas y aún poco comprendidas. 1 ^ Scheibehenne, Benjamin; Greifeneder, R. & Todd, P. M. (2010). "¿Puede haber demasiadas opciones? Una revisión meta-analítica de la sobrecarga de elección". Journal of Consumer Research 37: 409-425. [983 chars]

Source Reference Table

Label	URL
ArguAna paper	https://aclanthology.org/P18-1023/
BEIR benchmark	https://github.com/beir-cellar/beir
MMTEB benchmark	https://arxiv.org/abs/2502.13595
NanoBEIR dataset	https://huggingface.co/collections/zeta-alpha-ai/nanobeir

Dataset Information

Field	Value
Nano set	MNanoBEIR
Backing dataset	NanoBEIR-es
Task / split	NanoArguAna
Hugging Face dataset	hakari-bench/NanoBEIR-es
Language	es
Category	natural_language
Queries	50
Documents	3,635
Positive qrels	50
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	1,219.96
Document length avg chars	1,110.85

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4133	0.7000	0.9400	top-500
Dense	`harrier_oss_v1_270m`	0.4808	0.8600	0.9000	top-500
Reranking hybrid	`reranking_hybrid`	0.4365	0.7400	0.9800	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: MNanoBEIR Spanish NanoBEIR task split from hakari-bench/NanoBEIR-es
Train/eval overlap audit: not_audited
Leakage note: prefer excluding ArguAna, BEIR, or NanoBEIR records likely to overlap with these evaluation arguments
Multi-positive training: not_required_for_this_sample
Useful training data: non-overlapping argument-counterargument pairs, stance-aware retrieval datasets, debate portal argument pairs, Spanish or multilingual argument mining corpora