NanoRuMTEB / ru_bq
Overview
ru_bq is a Russian KBQA evidence retrieval task from NanoRuMTEB. The queries are concise Russian open-domain questions derived from RuBQ 2.0, and the documents are Russian Wikipedia paragraphs. The retriever must find paragraphs that support the answer relation, not just paragraphs about the same entity. Dense retrieval is the strongest top-rank profile (nDCG@10 of 0.8739), reranking_hybrid has the best recall@100 (0.9790), and BM25 is a solid lexical baseline that trails both on every reported metric (nDCG@10 of 0.6979).
Details
What the Original Data Measures
RuBQ 2.0 is a Russian question answering dataset over knowledge-base relations, with questions, answers, SPARQL queries, and verified Wikipedia evidence for many questions.
ruMTEB includes RuBQRetrieval as a Russian retrieval benchmark. In the Nano task, the query is a KBQA-style question and the relevant documents are answer-bearing paragraphs from Russian Wikipedia.
Observed Data Profile
The Nano split contains 200 queries, 10,000 documents, and 334 positive qrel rows. Queries average 52.19 characters, while documents average 484.49 characters. Positives per query average 1.67, with a minimum of 1, a median of 1, and a maximum of 4. There are 89 multi-positive queries, 44.5% of the split.
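The profile figures above can be recomputed from qrel rows. The following is a minimal sketch that assumes qrels are available as unique `(query_id, doc_id)` pairs; the function name, the tuple layout, and the toy ids are illustrative assumptions, not the dataset's actual schema.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query_profile(qrels):
    """Summarize positives per query from unique (query_id, doc_id) rows."""
    counts = Counter(query_id for query_id, _doc_id in qrels)
    per_query = list(counts.values())
    multi = sum(1 for count in per_query if count > 1)
    return {
        "queries": len(per_query),
        "positive_qrels": len(qrels),
        "avg": mean(per_query),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive_queries": multi,
        "multi_positive_share": multi / len(per_query),
    }


# Toy qrels with hypothetical ids; these are not rows from ru_bq.
toy_qrels = [("q1", "d1"), ("q1", "d2"), ("q2", "d3"), ("q3", "d4")]
profile = positives_per_query_profile(toy_qrels)
print(profile)
```

The reported values are internally consistent: 334 / 200 = 1.67 positives per query, and 89 / 200 = 44.5% multi-positive queries.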
Example questions ask what Christmas Eve is otherwise called, which river Baghdad stands on, which theater Vladimir Vysotsky performed in, who created Alisa Selezneva, and which city is the capital of the Swiss Confederation.
BM25 Evaluation Profile
The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.6979, hit@10 of 0.8400, and recall@100 of 0.9042. BM25 often succeeds when the subject entity, answer entity, or relation words appear directly in the paragraph.
Its limitation is relation matching. A paragraph about the same person, book, city, or country is not enough unless it contains the requested relation. Lexical overlap can retrieve same-entity distractors that do not answer the question.
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.8739, hit@10 of 0.9400, and recall@100 of 0.9341. Dense retrieval is the strongest top-rank profile.
The 0.176 gain in nDCG@10 over BM25 (0.8739 vs. 0.6979) suggests that embedding similarity handles Russian relation questions better than lexical term matching. It can connect question forms such as capital, author, location, alias, and family relation to evidence paragraphs even when the wording differs.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 2 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.7767, hit@10 of 0.8950, and recall@100 of 0.9790. Hybrid retrieval has the best recall@100 of the three profiles, but its nDCG@10 is lower than that of dense retrieval (0.7767 vs. 0.8739).
This pattern suggests that sparse entity matching widens the candidate pool, while dense retrieval orders the answer-bearing paragraphs better. The hybrid candidates are a useful input for reranking when relation-aware second-stage scoring is available.
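This card does not state how the hybrid candidate list is fused. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common way to merge a sparse and a dense ranking; the function name, the constant `k=60`, and the toy document ids are assumptions and may differ from how reranking_hybrid was actually built.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per document, with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Toy rankings with hypothetical ids.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

The fused list covers the union of both inputs, which is why a hybrid pool can raise recall@100 even when its top-10 ordering is weaker than the dense ranking alone.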
Metric Interpretation for Model Researchers
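nDCG@10 measures whether answer-bearing evidence is ranked early, hit@10 measures whether at least one positive appears in the first ten, and recall@100 measures how much of the evidence is available for reranking. Because 44.5% of queries have multiple positives, nDCG@10 and recall@100 reward retrieving every positive, whereas hit@10 is satisfied by a single one.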
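The three metrics can be sketched for a single query as follows, assuming binary relevance and a ranked list of unique document ids. The function names and toy ids are illustrative; the benchmark's own implementation is not shown in this card and may differ in details such as tie handling.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def hit_at_k(ranked_ids, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked_ids[:k]) else 0.0


def recall_at_k(ranked_ids, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    found = sum(1 for doc_id in ranked_ids[:k] if doc_id in positives)
    return found / len(positives)


# Toy query with two positives ranked at positions 2 and 4 (hypothetical ids).
ranked = ["d9", "d1", "d7", "d2"]
positives = {"d1", "d2"}
print(ndcg_at_k(ranked, positives), hit_at_k(ranked, positives), recall_at_k(ranked, positives))
```

In the toy query, hit@10 is already 1.0 once `d1` is found, while nDCG@10 stays below 1.0 because neither positive is ranked first.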
For ru_bq, dense nDCG@10 is the strongest first-stage signal. Hybrid recall@100 is important for systems that can re-check the requested relation inside candidate paragraphs.
Query and Relevance Type Tendencies
Queries are short Russian questions about entities and relations. Relevant documents are Russian Wikipedia paragraphs that explicitly support the answer. They may include broader article context rather than answer-only snippets.
Relevance is defined by relation evidence: a paragraph that names the entity but omits the requested answer relation is not relevant.
Representative Failure Modes
Common failures include retrieving a same-entity paragraph without the relation, confusing capitals or locations, overmatching author or work names, and missing paraphrased relation expressions. BM25 is sensitive to shared entity terms; dense retrieval can still confuse closely related relations.
Training Data That May Help
Useful training data includes non-overlapping RuBQ questions and supporting paragraphs, Russian KBQA evidence retrieval, Wikidata relation questions paired with Russian Wikipedia evidence, and Russian open-domain QA with entity hard negatives. Evaluation questions, positive paragraphs, and qrels should be excluded.
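One way to apply the exclusion rule is an exact-match filter over normalized text. The sketch below is an assumption about how such a filter could look, not a procedure taken from the benchmark; `normalize` and `drop_leaked_pairs` are hypothetical helper names, and the filter only catches matches that are identical after normalization.

```python
import re
import unicodedata


def normalize(text):
    """Lowercase, fold 'ё' to 'е', replace punctuation with spaces, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).lower().replace("ё", "е")
    text = re.sub(r"[^\w\s]", " ", text)
    return " ".join(text.split())


def drop_leaked_pairs(train_pairs, eval_queries, eval_positive_docs):
    """Keep only (query, paragraph) pairs that match no evaluation query or positive."""
    blocked_queries = {normalize(query) for query in eval_queries}
    blocked_docs = {normalize(doc) for doc in eval_positive_docs}
    return [
        (query, doc)
        for query, doc in train_pairs
        if normalize(query) not in blocked_queries
        and normalize(doc) not in blocked_docs
    ]


# Toy example: the first training pair repeats an evaluation question.
eval_queries = ["На какой реке стоит город Багдад?"]
train_pairs = [
    ("на какой реке стоит город Багдад", "toy paragraph A"),
    ("toy question", "toy paragraph B"),
]
print(drop_leaked_pairs(train_pairs, eval_queries, eval_positive_docs=[]))
```

Paraphrased questions and lightly edited paragraphs pass this filter, so a near-duplicate check would still be needed for a full overlap audit.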
Model Improvement Notes
Models should encode entity relation semantics and Russian morphology. Hard negatives should come from the same article, same entity type, or same relation family but omit the answer. Dense retrieval is the best direct ranker, while hybrid retrieval is useful for high-recall candidate generation.
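The same-article hard negative idea can be sketched as below. The schema (`paragraph_id -> (article_id, text)`), the function name, and the toy paragraphs are hypothetical; the answer check is a plain case-insensitive substring test, which is a simplification given Russian inflection.

```python
def same_article_hard_negatives(positive_ids, paragraphs, answer, max_negatives=4):
    """Pick paragraphs from the positives' articles that do not mention the answer.

    paragraphs maps paragraph_id -> (article_id, text); this layout is assumed.
    """
    positive_articles = {paragraphs[pid][0] for pid in positive_ids}
    answer_lc = answer.lower()
    negatives = []
    for pid in sorted(paragraphs):
        article_id, text = paragraphs[pid]
        if pid in positive_ids or article_id not in positive_articles:
            continue  # skip labeled positives and unrelated articles
        if answer_lc in text.lower():
            continue  # likely an unlabeled positive, so not a safe negative
        negatives.append(pid)
    return negatives[:max_negatives]


# Toy paragraphs written for illustration; they are not dataset rows.
toy_paragraphs = {
    "p1": ("baghdad", "Багдад расположен на берегу реки Тигр."),
    "p2": ("baghdad", "В городе жаркое лето."),
    "p3": ("baghdad", "Через город протекает Тигр."),
    "p4": ("bern", "Берн находится в Швейцарии."),
}
negatives = same_article_hard_negatives({"p1"}, toy_paragraphs, answer="Тигр")
print(negatives)
```

Here `p3` is skipped because it mentions the answer and `p4` because it comes from another article. A lemmatizer would be a more reliable answer check than substring matching for inflected forms.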
Example Data
| Query | Positive document |
| --- | --- |
| Как иначе называется канун Рождества Христова? [46 chars] | В списке представлены страны, в которых выходными днями (государственными праздниками), являются Рождественский сочельник (день перед Рождеством), Рождество Христово, Второй день Рождества и День подарков (26 декабря). [218 chars] |
| На какой реке стоит город Багдад? [33 chars] | Багдад расположен почти в центре Ирака, на берегу реки Тигр, неподалёку от устья реки Дияла. Погодные условия в черте города и его окрестностях складываются под влиянием субтропического и средиземноморского климата. В январе средняя температура воздуха составляет около +10 °C, в июле — около +34 °C. Среднегодовой уровень осадков — от 160 до 180 мм. Наибольшее количество осадков выпадает в декабре — январе. Лето длится с мая по октябрь: в это время в Багдаде отмечается очень жаркая, знойная погода (в июле днём температура воздуха в среднем составляет около +43 градусов), дожди крайне редки. Зима длится с декабря по март; максимальная температура воздуха зимой не превышает +18 градусов. Бывали случаи выпадения снега (последний раз такое было в январе 2008 года). 21 января 2011 года зафиксированы заморозки: от −1 до −3 °C, что близко к абсолютным минимальным значениям. [878 chars] |
| В каком театре выступал Владимир Высоцкий? [42 chars] | После окончания Школы-студии МХАТ в жизни Высоцкого наступил четырёхлетний период, связанный с поиском «своего театра». Молодой актёр успел поработать — с перерывами — в Театре имени Пушкина и других коллективах. Весной 1964 года он пришёл на показ в Театр на Таганке. Как вспоминал позже Юрий Любимов, перед ним предстал молодой человек в кепке и сером пиджаке, «сигареточку, конечно, погасил». Прочитанные им стихи Маяковского не произвели на режиссёра большого впечатления («что-то маловразумительное, бравадное»), зато пение под гитару заставило отложить все дела и слушать артиста в течение сорока пяти минут. Перед принятием решения Любимову довелось услышать разного рода предостережения: «Мне говорят: „Знаете, лучше не брать. Он пьющий человек“. Ну подумаешь, говорю, ещё один в России пьющий, тоже невидаль». [818 chars] |
Source Reference Table
| Title | Year | Type | URL |
| --- | --- | --- | --- |
| RuBQ 2.0: An Innovated Russian Question Answering Dataset | 2021 | OpenReview paper | https://openreview.net/forum?id=P5UQFFoQ4PJ |
| The Russian-focused embedders' exploration: ruMTEB benchmark and Russian embedding model design | 2024 | arXiv paper | https://arxiv.org/abs/2408.12503 |
| RuBQ project repository | 2021 | source repository | https://github.com/vladislavneon/RuBQ |
| RuBQ Zenodo record | 2020 | dataset record | https://doi.org/10.5281/zenodo.4345696 |
| ai-forever/rubq-retrieval | 2025 | dataset card | https://huggingface.co/datasets/ai-forever/rubq-retrieval |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoRuMTEB |
| Backing dataset | NanoRuMTEB |
| Task / split | ru_bq |
| Hugging Face dataset | hakari-bench/NanoRuMTEB |
| Language | ru |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 334 |
| Positives / query avg | 1.67 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 4 |
| Multi-positive queries | 89 (44.50%) |
| Query length avg chars | 52.19 |
| Document length avg chars | 484.49 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.6979 | 0.8400 | 0.9042 | top-500 |
| Dense | harrier_oss_v1_270m | 0.8739 | 0.9400 | 0.9341 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.7767 | 0.8950 | 0.9790 | top-100 |
Training and Leakage Metadata
- Original train split: not_found
- Evaluation split origin: RuBQRetrieval test split
- Train/eval overlap audit: not_audited
- Leakage note: exclude RuBQRetrieval test questions, qrels, and answer-bearing positive paragraphs
- Multi-positive training: multi_positive_objective
- Useful training data: non-overlapping RuBQ development questions and supporting paragraphs when allowed, Russian KBQA question-to-Wikipedia evidence pairs, Wikidata relation questions paired with Russian Wikipedia evidence, Russian open-domain QA data with paragraph-level positives and entity hard negatives
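
The metadata names a multi_positive_objective without defining it. One common reading, shown here purely as an assumption, is a contrastive loss whose numerator sums over all positives of a query. The function name, the temperature value, and the toy scores are illustrative.

```python
import math


def multi_positive_infonce(scores, positive_indices, temperature=0.05):
    """Negative log of the softmax mass assigned to all positives of one query.

    scores holds query-document similarities; positive_indices marks the positives.
    """
    logits = [score / temperature for score in scores]

    def log_sum_exp(values):
        # Subtract the maximum before exponentiating for numerical stability.
        peak = max(values)
        return peak + math.log(sum(math.exp(value - peak) for value in values))

    log_all = log_sum_exp(logits)
    log_positives = log_sum_exp([logits[i] for i in positive_indices])
    return log_all - log_positives


# Toy similarities: two positives (indices 0 and 1) and one negative.
well_ranked = multi_positive_infonce([0.9, 0.8, 0.1], positive_indices=[0, 1])
badly_ranked = multi_positive_infonce([0.1, 0.2, 0.9], positive_indices=[0, 1])
print(well_ranked, badly_ranked)
```

The loss approaches zero when the positives together dominate the softmax and grows when a negative outscores them, which suits queries with up to 4 positives.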