MNanoBEIR / NanoBEIR-pt / NanoHotpotQA
Overview
NanoBEIR-pt NanoHotpotQA is a Portuguese multi-hop question answering retrieval task derived from HotpotQA. Queries are translated questions, and documents are translated Wikipedia passages that provide supporting evidence. Every query in this Nano subset has exactly two positive passages, so the task is not finished when a model finds one obvious entity page. A strong retriever must recover both pieces of evidence needed for a multi-hop answer. This makes the task a compact benchmark for bridge-entity retrieval, comparison, and multi-positive evidence coverage in Portuguese.
Details
What the Original Data Measures
HotpotQA was designed for explainable multi-hop question answering with supporting facts. In BEIR, the retrieval task measures whether systems can find the passages required to answer a question. The MNanoBEIR Portuguese version keeps this evidence-retrieval objective after translation. It measures whether models can follow question constraints across entities and retrieve multiple supporting passages, including the bridge passage and the answer-bearing passage.
Observed Data Profile
This Nano subset contains 50 queries, 5,090 documents, and 100 positive qrels. Every query has exactly two positives, so the average, median, minimum, and maximum positives per query are all 2.00 and every query is multi-positive. Queries average 91.12 characters, and documents are short Wikipedia passages averaging 377.51 characters. This fixed two-positive structure makes recall and evidence-set retrieval especially important: a model may look successful with one hit while still missing the second support passage.
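The per-query positive statistics above can be recomputed from any qrels mapping. The following is a minimal sketch, assuming qrels are held as a dictionary from query ID to a dictionary of document ID and relevance label; the function name `positives_profile` and the toy IDs are illustrative and not part of the dataset.

```python
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query.

    qrels: dict mapping query_id -> dict mapping doc_id -> relevance (int).
    A document counts as positive when its relevance is greater than 0.
    """
    counts = [
        sum(1 for rel in docs.values() if rel > 0)
        for docs in qrels.values()
    ]
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "min": min(counts),
        "max": max(counts),
        "multi_positive_queries": sum(1 for c in counts if c > 1),
    }


# Toy qrels (hypothetical IDs) with the same two-positive shape as this subset.
toy_qrels = {
    "q1": {"d1": 1, "d2": 1},
    "q2": {"d3": 1, "d4": 1, "d9": 0},  # a zero label is not a positive
}
profile = positives_profile(toy_qrels)
print(profile)
```

On the real subset, the same function would be expected to report 50 queries, 100 positive qrels, and a minimum and maximum of 2, matching the figures stated in this card.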
BM25 Evaluation Profile
The BM25 profile uses the bm25 config with a top-500 candidate subset. It reaches nDCG@10 0.7604, hit@10 0.9600, and recall@100 0.9300. This strong lexical baseline reflects the entity-heavy nature of HotpotQA questions. Names, titles, dates, and places often appear in at least one support passage, letting BM25 retrieve an obvious hop. The limitation is complete multi-hop coverage: lexical overlap can favor the most explicit entity while under-ranking the second passage needed to answer the question.
Dense Evaluation Profile
Dense retrieval uses the harrier_oss_v1_270m top-500 candidate subset. It scores nDCG@10 0.7948, hit@10 0.9800, and recall@100 0.9600, improving over BM25 across the reported metrics. Dense retrieval helps connect question semantics to supporting passages even when the bridge relation is not a simple word match. It can better represent paraphrased titles, roles, and relations across the two-hop chain. The remaining errors likely come from partial-match distractors, where a passage mentions one entity from the question but does not provide the needed bridge or final answer evidence.
Reranking Hybrid Evaluation Profile
The reranking hybrid subset uses the reranking_hybrid config with exactly 100 candidates per query and no safeguard rows. It reaches nDCG@10 0.8145, hit@10 1.0000, and recall@100 0.9600, the best nDCG@10 and hit@10 of the three profiles. Recall@100 matches the dense profile, so the gain is in early ordering rather than in additional evidence coverage; because the pool holds exactly 100 candidates, recall@100 here also equals the share of positives present in the pool. The hybrid result suggests that Portuguese HotpotQA benefits from combining lexical entity anchors with dense semantic matching: BM25 helps preserve exact names and titles, while dense retrieval captures the relationship implied by the question. A hit@10 of 1.0000 means every query has at least one support passage in the top 10.
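This card does not state how the hybrid candidate pool is constructed. As one common way to combine a lexical ranking with a dense ranking, the sketch below uses reciprocal rank fusion (RRF), which is a swapped-in technique for illustration and not necessarily the method behind reranking_hybrid. The function name, the constant `k=60` (a conventional RRF default), and the toy document IDs are all assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, pool_size=100):
    """Fuse several ranked lists of doc IDs into one candidate pool.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1). Ties are broken by doc ID for determinism.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:pool_size]


# Toy rankings (hypothetical doc IDs), best first.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], pool_size=3)
print(fused)
```

Documents ranked well by both systems rise to the top, while documents found by only one system still enter the pool, which is the behavior a two-positive task needs when each retriever tends to find a different hop.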
Metric Interpretation for Model Researchers
Because every query has two positives, hit@10 only confirms that at least one support passage was found. Recall@100 is the better indicator of whether a reranker or QA system can access both pieces of evidence. nDCG@10 shows whether support passages appear early enough to be useful. The observed scores show that dense and hybrid retrieval improve on BM25, with reranking hybrid giving the best early ordering. Researchers should evaluate evidence-set recovery rather than single-passage success.
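The difference between single-hit success and evidence-set recovery can be made concrete with a small per-query evaluator. This is a minimal sketch, assuming binary relevance and a ranked list without duplicate document IDs; `evaluate_query` and the `full_coverage` metric name are illustrative additions, not metrics reported in this card.

```python
import math


def evaluate_query(ranked_doc_ids, positives, k_top=10, k_recall=100):
    """Compute hit@k, recall@k, nDCG@k, and full evidence coverage.

    ranked_doc_ids: doc IDs ordered best first (assumed unique).
    positives: set of relevant doc IDs for the query (binary relevance).
    """
    top = ranked_doc_ids[:k_top]
    hit = 1.0 if any(d in positives for d in top) else 0.0

    found = sum(1 for d in ranked_doc_ids[:k_recall] if d in positives)
    recall = found / len(positives)
    # 1.0 only when every positive is retrieved, not just one of them.
    full_coverage = 1.0 if found == len(positives) else 0.0

    # Binary-gain DCG: a positive at 0-based index i contributes 1 / log2(i + 2).
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(top) if d in positives)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(positives), k_top)))
    ndcg = dcg / ideal if ideal else 0.0

    return {"hit": hit, "recall": recall, "ndcg": ndcg, "full_coverage": full_coverage}


# Toy case (hypothetical IDs): only one of the two supports is retrieved.
partial = evaluate_query(["d5", "d1", "d7"], {"d1", "d2"})
# Toy case: both supports are ranked first and second.
complete = evaluate_query(["d1", "d2", "d7"], {"d1", "d2"})
print(partial)
print(complete)
```

In the first toy case the query counts as a hit even though half of the evidence is missing, which is exactly what hit@10 hides. Under the usual convention that recall@100 is averaged per query, two positives per query also mean that a reported recall@100 of 0.9300 corresponds to 93 of the 100 positive qrels, and 0.9600 to 96.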
Query and Relevance Type Tendencies
Queries are natural-language questions that often require following a bridge entity, identifying a work or person, then retrieving another fact. Examples include a sitcom actor connection, a sword made by the founder of a school, a film written and directed by a specific person, a college football game date, and a music collection tied to a band's alternate performance name. Relevant documents are short Wikipedia passages that each contain part of the evidence chain.
Representative Failure Modes
BM25 may retrieve the passage with the clearest entity overlap but miss the second support. Dense models may retrieve semantically related passages around the same entity cluster while overlooking a specific bridge constraint. Hybrid retrieval reduces these failures but still needs reranking that values complementary evidence rather than repeated variants of one hop. Translation can also affect titles, roles, and relation wording in Portuguese.
Training Data That May Help
Helpful training data includes non-overlapping multi-hop QA retrieval, Portuguese Wikipedia question generation, bridge-entity retrieval, comparison questions, and multi-positive passage ranking. Hard negatives should mention one entity from the question but omit the bridge fact or final answer. Training should exclude HotpotQA, BEIR, NanoBEIR, and translated evaluation questions or supporting passages.
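The hard-negative criterion above can be approximated with a simple filter. The sketch below assumes query entities are already extracted as strings and treats "omits the bridge fact or final answer" as "is not a labeled positive", which is a simplification; `select_hard_negatives` and the toy passages are illustrative and not taken from the dataset.

```python
def select_hard_negatives(query_entities, passages, positive_ids, max_negatives=5):
    """Pick passages that mention a query entity but are not labeled positives.

    query_entities: list of entity strings from the question.
    passages: dict mapping doc_id -> passage text.
    positive_ids: set of doc IDs labeled as supporting evidence.
    """
    entities = [e.casefold() for e in query_entities]
    negatives = []
    for doc_id, text in passages.items():
        if doc_id in positive_ids:
            continue  # never use a supporting passage as a negative
        lowered = text.casefold()
        if any(entity in lowered for entity in entities):
            negatives.append(doc_id)
        if len(negatives) == max_negatives:
            break
    return negatives


# Toy passages (illustrative strings, hypothetical IDs).
toy_passages = {
    "p1": "Samuel Sim é um compositor de cinema e televisão.",
    "p2": "Awake é um filme escrito e dirigido por Joby Harold.",
    "p3": "Joby Harold é um roteirista.",
    "p4": "Lisboa é a capital de Portugal.",
}
hard_negatives = select_hard_negatives(
    ["Joby Harold", "Samuel Sim"], toy_passages, positive_ids={"p1", "p2"}
)
print(hard_negatives)
```

A production pipeline would likely add entity linking and a check that the candidate does not actually answer the question, since unlabeled passages can still contain the bridge fact.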
Model Improvement Notes
NanoHotpotQA-pt is a strong diagnostic for multi-hop retrieval systems. Reranking hybrid is the best profile on nDCG@10 and hit@10 and ties the dense profile on recall@100, which is consistent with the benefit of combining exact entity matching with semantic relation matching. Improvements should focus on preserving question constraints, retrieving complementary support passages, and reranking for evidence-set completeness. For downstream QA, the most important behavior is not just first-hit success but whether both supporting passages are present and highly ranked.
Example Data
| Query | Positive document |
| --- | --- |
| Em qual sitcom de televisão Penny Rae Bridges participou com qual outro ator? [77 chars] | Penny Rae Bridges (nascida em 29 de julho de 1990) é uma atriz americana. Seu trabalho na televisão inclui papéis em "For Your Love", "Family Law", "Boy Meets World" e "The Parent 'Hood". Ela é mais conhecida por seu papel em "Half & Half", como a jovem Mona. [259 chars] |
| Quem entregou a Kaganoi Shigemochi uma espada feita pelo fundador da escola Muramasa? [85 chars] | Kaganoi Shigemochi (加賀井 重望, 1561 – 27 de agosto de 1600) foi um samurai japonês do período Azuchi-Momoyama, que serviu ao clã Oda. Ele governou o Castelo Kaganoi. Durante a Batalha de Komaki e Nagakute, Shigemochi lutou sob o comando de seu pai, Shigemune, que estava alinhado com as forças de Oda Nobukatsu. Pouco depois, o Castelo Kaganoi foi cercado pelas forças de Toyotomi Hideyoshi; Shigemune se rendeu, e Shigemochi foi empregado por Hideyoshi como mensageiro, recebendo um estipêndio de 10.000 "koku". Ele também possuía uma lâmina feita por Muramasa, que Hideyoshi lhe concedeu em 1598. [595 chars] |
| Qual é o filme escrito e dirigido por Joby Harold com música de Samuel Sim? [75 chars] | Samuel Sim é um compositor de cinema e televisão. Ganhou reconhecimento com a trilha sonora premiada da série de drama da BBC "Dunkirk". Desde então, compôs música para uma ampla variedade de produções cinematográficas e televisivas, tendo mais recentemente trabalhado na trilha sonora do filme "Awake" para a The Weinstein Company e na série de drama da BBC/HBO "House of Saddam". Sua música mais recente e aclamada é a trilha sonora de "Home Fires". "Home Fires (Música da Série de Televisão)" foi lançada em 6 de maio de 2016 pela Sony Classical Records. [557 chars] |
Source Reference Table
| Title | Year | Type | URL |
| --- | --- | --- | --- |
| HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering | 2018 | task paper | https://arxiv.org/abs/1809.09600 |
| BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models | 2021 | benchmark paper | https://arxiv.org/abs/2104.08663 |
| MMTEB: Massive Multilingual Text Embedding Benchmark | 2025 | benchmark paper | https://arxiv.org/abs/2502.13595 |
| NanoBEIR: Smaller BEIR dataset subsets | 2024 | dataset collection | https://huggingface.co/collections/zeta-alpha-ai/nanobeir |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | MNanoBEIR |
| Backing dataset | NanoBEIR-pt |
| Task / split | NanoHotpotQA |
| Hugging Face dataset | hakari-bench/NanoBEIR-pt |
| Language | pt |
| Category | natural_language |
| Queries | 50 |
| Documents | 5,090 |
| Positive qrels | 100 |
| Positives / query avg | 2.00 |
| Positives / query min | 2 |
| Positives / query median | 2.00 |
| Positives / query max | 2 |
| Multi-positive queries | 50 (100.00%) |
| Query length avg chars | 91.12 |
| Document length avg chars | 377.51 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.7604 | 0.9600 | 0.9300 | top-500 |
| Dense | harrier_oss_v1_270m | 0.7948 | 0.9800 | 0.9600 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.8145 | 1.0000 | 0.9600 | top-100 |