NanoRARb / NanoHellaSwag
Overview
NanoHellaSwag is an English commonsense continuation retrieval task from NanoRARb. It recasts HellaSwag as retrieval: the query is an unfinished grounded activity or video-caption context, and the relevant document is the plausible continuation. Each query has one positive continuation among 10,000 short candidate endings. The task requires event coherence and commonsense plausibility rather than fact lookup. BM25 and dense retrieval are both weak, and reranking_hybrid is the strongest observed profile, suggesting that both lexical continuity and semantic event matching are useful.
Details
What the Original Data Measures
RAR-b includes HellaSwag as a commonsense reasoning task converted into full answer retrieval. The original HellaSwag benchmark asks whether a system can choose the plausible ending for a grounded situation after adversarial filtering creates fluent but wrong distractors.
In this retrieval version, the system must find the correct ending from a large answer pool. The document is not evidence; it is the continuation itself.
Observed Data Profile
The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Each query has exactly one positive. Queries average 114.68 characters, and candidate endings average 62.15 characters.
Examples include ice fishing, mopping a floor, weightlifting, crushing a small stone, and flying a kite. Candidate continuations are short action descriptions and often share objects or verbs with the query.
BM25 Evaluation Profile
The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.1393, hit@10 of 0.2300, and recall@100 of 0.5250. BM25 is limited because the correct continuation may not repeat enough distinctive query words. Many incorrect endings can mention the same objects or activities but violate temporal or physical plausibility.
Sparse retrieval is useful when the continuation repeats an object or action from the context, but HellaSwag is designed around plausible-looking distractors, so term overlap is only a weak signal.
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.1253, hit@10 of 0.2500, and recall@100 of 0.5250. Dense retrieval slightly improves hit@10 but has lower nDCG@10 than BM25 and equal recall@100.
This indicates that semantic similarity can identify plausible continuations, but it may not order the exact gold ending above other coherent-looking events. Dense embeddings can overvalue general scene compatibility without capturing the specific next action.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 81 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.1551, hit@10 of 0.2900, and recall@100 of 0.5950. Hybrid retrieval is the strongest profile for this split.
The result suggests that event-continuation retrieval benefits from combining sparse overlap with dense plausibility. BM25 helps preserve local objects and actions, while dense retrieval brings broader event semantics. The hybrid pool is the best candidate source for reranking.
Metric Interpretation for Model Researchers
With one positive per query, nDCG@10 measures how early the correct continuation is ranked, hit@10 measures whether it appears in the first ten candidates, and recall@100 measures whether a reranker can access it.
For HellaSwag, absolute retrieval scores are low. A strong model must reason about temporal order, physical plausibility, and activity continuity among short candidate endings.
Query and Relevance Type Tendencies
Queries are unfinished descriptions of everyday activities or video-like scenes. Relevant documents are plausible next events. Candidate endings are fluent and may share the same objects, people, or actions.
Relevance is continuation correctness. A candidate can be semantically related and still wrong if it does not continue the scene naturally.
Representative Failure Modes
Common failures include selecting an ending with matching objects but impossible timing, ranking a generic action above the specific next step, confusing repeated activity with conclusion, and missing physical consequences such as a kite falling when wind lessens. Sparse retrieval overweights object overlap; dense retrieval can over-rank broadly plausible scenes.
Training Data That May Help
Useful training data includes story and activity continuation, grounded commonsense QA, HellaSwag-style adversarial endings, and hard negatives that share objects or actions but break physical or temporal plausibility. Evaluation examples and answer pool entries should be excluded.
Model Improvement Notes
Models should learn event sequence compatibility over short text. Hard negatives should be fluent and lexically similar while violating the next-event relation. Hybrid retrieval is useful because both object continuity and semantic plausibility matter.
Example Data
| Query | Positive document |
| A man dressed in yellow and black winter clothes ice fishes on a a frozen lake. The man [87 chars] | is reeling in a fish for a long time. [37 chars] |
| A group of people are in a house. A man is mopping the floor with a mop. Another boy [84 chars] | attempts to walk through where he is mopping. [45 chars] |
| A man is in the gym in tight he bends over picks up a weight over his head and drops it back down. He walks back and loosens up before walking back up and doing it again adding more weight. He [192 chars] | does this multiple times adding more and more weight to the rack. [65 chars] |
Source Reference Table
| Title | Year | Type | URL |
| RAR-b: Reasoning as Retrieval Benchmark | 2024 | arXiv paper | https://arxiv.org/abs/2404.06347 |
| HellaSwag: Can a Machine Really Finish Your Sentence? | 2019 | arXiv paper | https://arxiv.org/abs/1905.07830 |
Dataset Information
| Field | Value |
| Nano set | NanoRARb |
| Backing dataset | NanoRARb |
| Task / split | NanoHellaSwag |
| Hugging Face dataset | hakari-bench/NanoRARb |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 200 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 114.68 |
| Document length avg chars | 62.15 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.1393 | 0.2300 | 0.5250 | top-500 |
| Dense | harrier_oss_v1_270m | 0.1253 | 0.2500 | 0.5250 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.1551 | 0.2900 | 0.5950 | top-100 |