NanoRuMTEB / ria_news

Overview

ria_news is a Russian headline-to-article retrieval task from NanoRuMTEB. The query is a compact Russian news headline, and the relevant document is the corresponding RIA news article body. Each of the 200 queries has exactly one positive article among 10,000 documents. All three retrieval profiles are strong because headlines and articles often share named entities, locations, and event phrases. Dense retrieval has the best nDCG@10 (0.9478) and hit@10 (0.9700), while reranking_hybrid has the best recall@100 (0.9900).

Details

What the Original Data Measures

ruMTEB includes RiaNewsRetrieval as a Russian retrieval task built from RIA news data. The underlying RIA corpus was originally used for headline generation, with Russian news articles and their titles from the Rossiya Segodnya news collection.

In retrieval form, the task reverses headline generation: given a headline, the system must retrieve the article body it describes. This tests asymmetric news retrieval, where a short summary-like query must match a longer event report.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 61.99 characters, while article bodies average 1,145.34 characters.

Example headlines mention temporary accommodation points in the Russian Far East, Taliban leaders receiving Afghan passports, Tatneft work outside Tatarstan, RTS and MICEX index changes, and a football transfer involving Zenit.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.9135, hit@10 of 0.9500, and recall@100 of 0.9750. BM25 is very strong because the headline usually repeats the central entities, places, and event words that appear in the article.

The remaining difficulty is event disambiguation. News corpora contain many near-duplicate stories about the same officials, locations, market indicators, disasters, or sports teams, so exact overlap can retrieve a related but different event.
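The lexical matching described above can be illustrated with a minimal Okapi BM25 sketch. The source does not state which tokenizer or BM25 parameters were used, so the whitespace tokenization and the values k1=1.5 and b=0.75 are assumptions, and the function name bm25_rank is hypothetical. The query and first document are simplified from the Tatneft example; the other two documents are invented toy strings.

```python
import math
from collections import Counter


def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank document indices by Okapi BM25 score (assumed parameters)."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    scored = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = 0.0
        for term in set(query.lower().split()):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scored.append((score, idx))

    # Highest score first; ties are broken by document index.
    return [idx for _, idx in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy data: only the first document shares tokens with the headline.
docs = [
    "татнефть продолжит работу над проектами за пределами татарстана",
    "индекс ртс вырос на открытии торгов",
    "зенит объявил о трансфере нового игрока",
]
query = "минниханов не против работы татнефти за пределами татарстана"
ranking = bm25_rank(query, docs)
print(ranking)
```

In this toy case the match comes from "за", "пределами", and "татарстана". The inflected forms "татнефти" and "татнефть" do not match under whitespace tokenization, which is one reason a production setup for Russian might add stemming or lemmatization.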

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.9478, hit@10 of 0.9700, and recall@100 of 0.9750. Dense retrieval is the strongest direct ranker on nDCG@10 and hit@10, and it ties BM25 on recall@100.

This shows that semantic similarity helps align compact headlines with longer article bodies, especially when the article elaborates the event using quotes, paraphrases, or additional context.
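As a rough illustration of dense ranking, the sketch below orders documents by cosine similarity between a query vector and document vectors. The source does not show how harrier_oss_v1_270m is loaded or how it encodes text, so the three-dimensional vectors here are hand-written placeholders rather than real embeddings, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def dense_rank(query_vec, doc_vecs):
    """Return document ids sorted by descending cosine similarity."""
    return sorted(doc_vecs, key=lambda doc_id: -cosine(query_vec, doc_vecs[doc_id]))


# Placeholder vectors (assumption): a real encoder would produce these.
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "article_a": [0.8, 0.2, 0.1],
    "article_b": [0.1, 0.9, 0.0],
    "article_c": [0.0, 0.1, 0.9],
}
dense_order = dense_rank(query_vec, doc_vecs)
print(dense_order)
```

Unlike the lexical case, nothing here depends on shared surface tokens, which is why a dense ranker can match a headline to an article body that paraphrases it.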

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 2 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.9272, hit@10 of 0.9450, and recall@100 of 0.9900. Hybrid retrieval has the best top-100 coverage, but its early ranking trails dense retrieval on both nDCG@10 and hit@10, and its hit@10 is also slightly below BM25's 0.9500.

This suggests that sparse event anchors improve candidate coverage, while dense retrieval orders the exact article better among same-topic news items.
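The source does not define the rank-101 safeguard. One plausible reading, sketched below purely as an assumption, is that when the positive article falls outside the top-100 candidates it is appended as a 101st candidate so that the row still contains its positive. The function name build_candidates is hypothetical.

```python
def build_candidates(ranked_ids, positive_id, k=100):
    """Keep the top-k ids; append the positive at rank k+1 if it is missing.

    Assumed behaviour only: the dataset's actual safeguard logic is not
    documented in the source.
    """
    candidates = list(ranked_ids[:k])
    safeguard_used = positive_id not in candidates
    if safeguard_used:
        candidates.append(positive_id)
    return candidates, safeguard_used


# Toy ranking of 150 document ids.
ranked = [f"d{i}" for i in range(150)]

inside, used_inside = build_candidates(ranked, "d5")
outside, used_outside = build_candidates(ranked, "d120")
print(len(inside), used_inside)    # positive already in the top 100
print(len(outside), used_outside)  # positive appended at rank 101

# If two of 200 rows need the safeguard, the other 198 positives are in the top 100.
implied_recall_at_100 = (200 - 2) / 200
print(implied_recall_at_100)
```

Under this reading, two safeguard rows among 200 queries imply a recall@100 of 198/200 = 0.99, which agrees with the reported 0.9900.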

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 measures how early the corresponding article appears, hit@10 measures whether it appears in the first ten candidates, and recall@100 measures whether it is present in the top-100 candidate pool that a downstream reranker would see.
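With a single positive, the ideal DCG is 1, so nDCG@10 reduces to 1 / log2(1 + rank) when the positive is ranked in the top ten and 0 otherwise. The sketch below computes the three metrics for one query under that simplification; the function name and the toy document ids are illustrative and not taken from the benchmark tooling.

```python
import math


def single_positive_metrics(ranked_ids, positive_id):
    """nDCG@10, hit@10, and recall@100 for a query with exactly one positive."""
    try:
        rank = ranked_ids.index(positive_id) + 1  # 1-based rank
    except ValueError:
        rank = None  # positive was not retrieved at all

    in_top_10 = rank is not None and rank <= 10
    in_top_100 = rank is not None and rank <= 100
    return {
        "ndcg@10": 1.0 / math.log2(rank + 1) if in_top_10 else 0.0,
        "hit@10": 1.0 if in_top_10 else 0.0,
        "recall@100": 1.0 if in_top_100 else 0.0,
    }


ranked = [f"d{i}" for i in range(1, 201)]  # d1 is rank 1, d2 is rank 2, ...
print(single_positive_metrics(ranked, "d1"))
print(single_positive_metrics(ranked, "d3"))
print(single_positive_metrics(ranked, "d11"))
```

Averaging these per-query values over all 200 queries gives the dataset-level figures. A positive at rank 3 contributes 0.5 to nDCG@10, so the metric penalizes even small slips behind same-topic articles.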

For ria_news, the metrics are high enough that residual errors likely reflect near-duplicate event confusion, short or noisy headlines, and article-body variants rather than general topical failure.

Query and Relevance Type Tendencies

Queries are short lowercase Russian headlines. Documents are longer Russian news bodies with agency style, quotes, dates, named entities, and event details.

Relevance is article-level identity. A topically similar article about the same actor or event category is wrong if it is not the article paired with the headline.

Representative Failure Modes

Common failures include retrieving another article about the same location, official, team, market index, or disaster; overmatching repeated news vocabulary; and failing to match terse headlines that give little context. BM25 is strong but can be distracted by shared event terms; dense retrieval can still confuse related articles.

Training Data That May Help

Useful training data includes non-overlapping Russian headline-to-article pairs, Russian news summarization pairs converted to retrieval, same-topic hard-negative clusters, and Russian news search or click data with overlap removed. Evaluation headlines, articles, and qrels should be excluded.
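A simple way to apply the exclusion rule is an exact-match filter on normalized text, sketched below. The normalization (lowercasing and whitespace collapsing) and the function names are assumptions. Exact matching will not catch near-duplicate or lightly edited articles, so a real audit would likely need fuzzy matching as well.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def drop_eval_overlap(train_pairs, eval_headlines, eval_articles):
    """Remove training pairs whose headline or article appears in the eval set."""
    blocked_headlines = {normalize(h) for h in eval_headlines}
    blocked_articles = {normalize(a) for a in eval_articles}
    return [
        (headline, article)
        for headline, article in train_pairs
        if normalize(headline) not in blocked_headlines
        and normalize(article) not in blocked_articles
    ]


# Toy data: the second pair reuses an eval headline, the third an eval article.
train_pairs = [
    ("Headline A", "body a"),
    ("headline  b", "body b"),
    ("headline c", "Body  X"),
]
clean_pairs = drop_eval_overlap(
    train_pairs, eval_headlines=["headline b"], eval_articles=["body x"]
)
print(clean_pairs)
```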

Model Improvement Notes

Models should preserve named entities and dates while learning event-level paraphrase between headline and body. Hard negatives should be same-day or same-topic articles with overlapping actors and locations. Dense retrieval is the best direct ranker, while hybrid retrieval is useful for recall-oriented reranking.
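One way to approximate hard negatives with overlapping actors and locations is to pick the non-positive articles that share the most tokens with the positive article. The sketch below uses Jaccard overlap on whitespace tokens as a stand-in for entity overlap. This is an assumption for illustration, not the method used to build the benchmark, and the toy articles are invented.

```python
def mine_hard_negatives(positive_id, articles, n=2):
    """Return the n article ids most similar to the positive by token Jaccard."""
    pos_tokens = set(articles[positive_id].lower().split())
    scored = []
    for doc_id, text in articles.items():
        if doc_id == positive_id:
            continue  # never return the positive as its own negative
        tokens = set(text.lower().split())
        union = pos_tokens | tokens
        jaccard = len(pos_tokens & tokens) / len(union) if union else 0.0
        scored.append((jaccard, doc_id))
    # Most similar first; ties are broken by document id.
    scored.sort(key=lambda p: (-p[0], p[1]))
    return [doc_id for _, doc_id in scored[:n]]


# Toy articles: "b" describes a different transfer by the same club.
articles = {
    "a": "zenit signs striker from spartak",
    "b": "zenit signs defender from cska",
    "c": "rts index rises at open",
    "d": "zenit coach comments on match",
}
hard_negatives = mine_hard_negatives("a", articles)
print(hard_negatives)
```

Article "b" is the kind of negative the notes above call for: same club and same event type, but a different event. A fuller pipeline might restrict candidates to the same publication day or use named-entity tags instead of raw tokens.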

Example Data

Query	Positive document
около 1 тыс человек остаются в пунктах временного содержания в дфо [66 chars]	глава мчс россии владимир пучков заявил, что около тысячи человек, пострадавших от наводнения на дальнем востоке, смогут находиться в пунктах временного размещения, пока не будут решены их жилищные проблемы. ранее президент рф владимир путин потребовал, чтобы все пострадавшие от наводнения были как можно быстрее обеспечены комфортным жильем вне зависимости от выбранной формы государственной поддержки. "до конца всего этапа строительства все, кто проживает (в пунктах временного размещения), будут обеспечены питанием и проживанием", — заявил министр на совещании о ходе ликвидации последствий наводнения на дальнем востоке. он добавил, что в самих пунктах для оказания помощи гражданам работают медики и сотрудники прокуратуры. по словам министра, в соответствии с решением президента россии, расходы на содержание и питание граждан, находящихся в таких пунктах, а это около одной тысячи человек, были увеличены в 2,5 раза. пучков добавил, что мчс совместно с минфином уже резервирует дополни... [1,000 / 1,563 chars]
афганистан выдал паспорта освобожденным в пакистане главам "талибан" [68 chars]	консульство афганистана в пакистанском городе пешавар выдало афганские паспорта деятелям радикального движения "талибан", которые были на днях освобождены пакистанскими властями из тюрем по просьбе афганского правительства и высшего совета мира (всм) афганистана, сообщило в понедельник информационное агентство пажвак. по мнению председателя всм салахуддина раббани, освобождение влиятельных талибов может сыграть решающую роль в начале переговоров о национальном примирении между представителями афганской вооруженной оппозиции и нынешнего кабульского режима. по данным пакистанских сми, на свободу вышли талибский губернатор провинции кабул дауд джилани, сын известнейшего моджахеда юнуса халеса, воевавшего с советскими войсками в 80-х годах прошлого столетия анварульхак моджахед, полевые командиры талибов саид салахуддин ага и маулави матиулла. из пенитенциарных учреждений были также выпущены талибский губернатор провинции нангархар маулави мохаммад, полевой командир амир ахмадголь,... [1,000 / 1,537 chars]
минниханов не против работы "татнефти" за пределами татарстана [62 chars]	оао "татнефть" продолжит работу над существующими проектами за пределами татарстана и будет принимать участие в новых интересных проектах, если такие появятся, заявил агентству "прайм" президент татарстана, председатель совета директоров "татнефти" рустам минниханов. в конце декабря минниханов высказался за то, чтобы "татнефть", крупнейшая нефтяная компания татарстана, развивалась в основном в республике. по его словам, от реализации проектов за пределами региона выигрывает только компания, тогда как от проектов на территории татарстана — и "татнефть", и республика. "я как руководитель субъекта заинтересован (в работе компании в татарстане — ред.), мне же рабочие места нужны. но это ни в коей мере не значит, что "татнефть" не будет работать за пределами республики", — сказал минниханов агентству в четверг в кулуарах гайдаровского форума. глава татарстана подчеркнул, что компания не прекратит работу по существующим проектам за пределами республики, и не исключил ее участия в новых.... [1,000 / 1,607 chars]

Source Reference Table

Title	Year	Type	URL
The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design	2025	arXiv paper	https://arxiv.org/abs/2408.12503
Self-Attentive Model for Headline Generation	2019	arXiv paper	https://arxiv.org/abs/1901.07786
mteb/RiaNewsRetrieval_test_top_250_only_w_correct-v2	2025	dataset card	https://huggingface.co/datasets/mteb/RiaNewsRetrieval_test_top_250_only_w_correct-v2

Dataset Information

Field	Value
Nano set	NanoRuMTEB
Backing dataset	NanoRuMTEB
Task / split	ria_news
Hugging Face dataset	hakari-bench/NanoRuMTEB
Language	ru
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	61.99
Document length avg chars	1,145.34

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.9135	0.9500	0.9750	top-500
Dense	`harrier_oss_v1_270m`	0.9478	0.9700	0.9750	top-500
Reranking hybrid	`reranking_hybrid`	0.9272	0.9450	0.9900	top-100
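Because every query has exactly one positive, the hit@10 and recall@100 rates in the table above convert directly into query counts out of 200. The sketch below does that arithmetic; nDCG@10 is omitted because it is a graded average and does not map to a count.

```python
N_QUERIES = 200  # from the Dataset Information table

# hit@10 and recall@100 copied from the Candidate Subsets table.
profiles = {
    "bm25": {"hit@10": 0.9500, "recall@100": 0.9750},
    "harrier_oss_v1_270m": {"hit@10": 0.9700, "recall@100": 0.9750},
    "reranking_hybrid": {"hit@10": 0.9450, "recall@100": 0.9900},
}


def miss_counts(rates, n_queries=N_QUERIES):
    """Number of queries whose positive falls outside each cutoff."""
    return {
        metric: n_queries - round(rate * n_queries)
        for metric, rate in rates.items()
    }


misses = {name: miss_counts(rates) for name, rates in profiles.items()}
for name, counts in misses.items():
    print(name, counts)
```

Read this way, dense retrieval misses the top ten on 6 queries, BM25 on 10, and the hybrid profile on 11, while the hybrid profile leaves only 2 positives outside the top 100 compared with 5 for the other two.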

Training and Leakage Metadata

Original train split: not_found
Evaluation split origin: RiaNewsRetrieval test split through RiaNewsRetrievalHardNegatives.v2
Train/eval overlap audit: not_audited
Leakage note: exclude RiaNewsRetrieval test headlines, qrels, and overlapping article texts
Multi-positive training: single_positive_question_document_focus
Useful training data: non-overlapping Russian headline-to-article retrieval pairs, Russian news summarization corpora converted to asymmetric retrieval pairs, same-topic Russian news hard-negative clusters, Russian news search and click data with overlap removed