HAKARI-Bench

NanoRARb / NanoQuail

Overview

NanoQuail is an English reading-comprehension answer retrieval task from NanoRARb. It recasts QuAIL as retrieval: the query contains a long narrative passage plus a question, and the relevant document is the correct short answer option. Each query has one positive answer. The task is difficult because the query is long and full of distractor context, while the answer document is often only a short phrase. Dense retrieval improves over BM25, but both remain low; the hybrid pool gives similar recall to dense but weaker top-rank ordering.

Details

What the Original Data Measures

RAR-b converts QuAIL into a reasoning-as-retrieval task by pooling answer options and asking a retriever to locate the gold answer. The original QuAIL benchmark targets reading comprehension over passages that require prerequisite reasoning, narrative tracking, and question-specific interpretation.

In this Nano task, the retriever is not finding a supporting passage. It must retrieve the answer option that correctly responds to the question given the long context.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 1,813.76 characters, while answer documents average only 25.02 characters.

Examples are long story or scene passages followed by questions about what happens next, when an event occurred, or which inference is supported. Some correct answers are specific event descriptions, while others are generic options such as insufficient information.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.0522, hit@10 of 0.1250, and recall@100 of 0.2850. BM25 is weak because the long query contains many words unrelated to the answer, and the answer option may be short or paraphrased.

Sparse retrieval is especially brittle when the correct answer is a generic phrase or when many answer candidates reuse common narrative terms. Exact word matching does not determine which answer is licensed by the passage and question.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.1174, hit@10 of 0.2150, and recall@100 of 0.4600. Dense retrieval improves over BM25 across all metrics. It can connect the question to a plausible answer phrase better than term frequency alone.

The absolute score remains low because the retriever must compress a long context and question into the specific answer relation. Generic or very short answer options provide little semantic signal.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 107 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.0982, hit@10 of 0.1900, and recall@100 of 0.4650. Hybrid retrieval slightly improves recall@100 over dense retrieval but has weaker top-ten ranking.

This pattern suggests that sparse overlap adds some answer coverage but can pull in many distractors from the long context. Dense retrieval is the better first-stage ranker, while hybrid retrieval is a useful coverage-oriented reranking pool.

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 is mostly the rank quality of the correct answer, hit@10 measures whether it appears in the first ten candidates, and recall@100 measures whether a reranker can see it.

For QuAIL, the challenge is long-context answer selection. Improvements should come from question-focused reading comprehension rather than generic passage-answer similarity.

Query and Relevance Type Tendencies

Queries include long narrative contexts plus a question. Relevant documents are short answer options. The answer may refer to an event, character decision, temporal relation, or inference from the passage.

Relevance is conditional on the question. A phrase that appears plausible for the passage can be wrong if it does not answer the specific question.

Representative Failure Modes

Common failures include matching an answer phrase to an incidental passage word, ranking generic options incorrectly, losing the question focus inside the long context, and confusing chronological relations. BM25 overweights context words; dense retrieval can prefer a plausible narrative phrase that is not the gold answer.

Training Data That May Help

Useful training data includes reading-comprehension answer selection, long-passage QA, narrative QA, and retrieval tasks where short answer options are grounded in long contexts. Evaluation passages and answers should be excluded.

Model Improvement Notes

Models should learn question-aware compression of long contexts into answer candidates. Hard negatives should be plausible answer options that share passage words but fail the specific question. Rerankers may need explicit cross-attention over context, question, and answer.

Example Data

QueryPositive document
Context: It took Erin an hour and forty-five minutes to drive from their half-million dollar home in Plano to the small rented cabin at Lake Texoma, near the Oklahoma state line. It was Thursday night, and she could have been in their backyard, sitting by the pool in an ultra-skimpy bikini, drinking and laughing with her friends. Like every other night. She walked in and slammed the door. "Okay, I'm here. Now, will you please tell me why it was so important for me to drive all the way up here to... [500 / 1,562 chars]will have an argument with Erin [31 chars]
Context: I actually managed a kind of sleep there, kneeling with the circulation cut off to my legs, my head in canvas twilight. My body had squirted a year's supply of adrenalin into my bloodstream in the space of 30 minutes, and while that stuff can give you the strength to lift cars off your loved ones and leap over tall buildings, the payback's always a bitch. I woke up to someone pulling the hood off my head. They were neither rough nor careful -- just...impersonal. Like someone at McDonald... [500 / 1,943 chars]After someone pulled the hood off their head. [45 chars]
Context: I'm a senior at Cesar Chavez high in San Francisco's sunny Mission district, and that makes me one of the most surveilled people in the world. My name is Marcus Yallow, but back when this story starts, I was going by w1n5t0n. Pronounced "Winston." Not pronounced "Double-you-one-enn-five-tee-zero-enn" -- unless you're a clueless disciplinary officer who's far enough behind the curve that you still call the Internet "the information superhighway." I know just such a clueless person, and h... [500 / 1,746 chars]After I was told to report to the administration office immediately. [68 chars]

Source Reference Table

TitleYearTypeURL
RAR-b: Reasoning as Retrieval Benchmark2024arXiv paperhttps://arxiv.org/abs/2404.06347
Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks2020proceedings paperhttps://ojs.aaai.org/index.php/AAAI/article/view/6398

Dataset Information

FieldValue
Nano setNanoRARb
Backing datasetNanoRARb
Task / splitNanoQuail
Hugging Face datasethakari-bench/NanoRARb
Languageen
Categorynatural_language
Queries200
Documents10,000
Positive qrels200
Positives / query avg1.00
Positives / query min1
Positives / query median1.00
Positives / query max1
Multi-positive queries0 (0.00%)
Query length avg chars1,813.76
Document length avg chars25.02

Candidate Subsets

ProfileConfignDCG@10Hit@10Recall@100Candidates
BM25bm250.05220.12500.2850top-500
Denseharrier_oss_v1_270m0.11740.21500.4600top-500
Reranking hybridreranking_hybrid0.09820.19000.4650top-100