NanoRARb / NanoAlphaNLI

Overview

NanoAlphaNLI is an English reasoning-as-retrieval task from NanoRARb. It recasts AlphaNLI abductive commonsense reasoning as retrieval: the query contains the beginning and ending observations of a short story, and the retriever must find the missing explanatory event from a large answer pool. Each query has one positive answer. Dense retrieval is much stronger than BM25 because the correct hypothesis is often causally plausible rather than lexically obvious, while the hybrid pool improves coverage but trails dense top-rank quality.

Details

What the Original Data Measures

RAR-b converts reasoning tasks into full answer retrieval. AlphaNLI originates from the Abductive Commonsense Reasoning task, where a system selects the most plausible hypothesis connecting two observations. The retrieval version tests whether that missing event can be found among many short candidate explanations.

The task measures narrative and causal plausibility. A relevant document is not a source passage; it is the best explanatory bridge between a story's start and end.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Each query has exactly one positive. Queries average 103.79 characters, and candidate explanations average 43.84 characters.

Queries are formatted with Start: and End: fields. Examples include moving away from New York, tripping over untied shoelaces, wanting a game at Target, disliking karate class, and winning a spelling bee after practice. Candidate documents are short story middle events.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.3288, hit@10 of 0.4650, and recall@100 of 0.6750. BM25 is moderately useful because the correct hypothesis may repeat characters, places, or objects from the observations.

However, lexical overlap is not enough. Many wrong candidates mention the same people or objects but do not explain the ending. The task requires identifying the event that makes the transition plausible.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.5898, hit@10 of 0.7900, and recall@100 of 0.9150. Dense retrieval is much stronger than BM25 across all metrics. Embedding similarity captures narrative continuity and causal relationships better than exact word matching.

This is the strongest standalone profile. It suggests that AlphaNLI is well suited to semantic retrieval, although ranking the single best explanation among many plausible short events remains nontrivial.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 21 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.4777, hit@10 of 0.6500, and recall@100 of 0.8950. Hybrid retrieval improves over BM25 and nearly matches dense recall@100, but it is weaker than dense retrieval at top-rank ordering.

The pattern indicates that lexical overlap adds coverage but can pull in distractors that mention the same story entities. Dense retrieval remains the better first-stage ranker, while hybrid retrieval can be useful for reranking coverage.

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 reflects how early the correct missing event appears, hit@10 measures whether it is in the first ten candidates, and recall@100 measures whether a reranker can access it.

For AlphaNLI, dense retrieval is the baseline to beat. Improvements should focus on abductive plausibility and causal story coherence, not only entity overlap.

Query and Relevance Type Tendencies

Queries provide two observations: a starting situation and an ending outcome. Relevant documents are short hypotheses that explain the transition. They often involve a decision, accident, emotional response, preparation, or discovery.

Relevance is explanatory. A candidate can share a character or object with the query and still be wrong if it does not make the ending plausible.

Representative Failure Modes

Common failures include choosing an event that repeats query entities but does not explain the end, selecting a plausible event with the wrong emotional consequence, failing to infer a causal accident, and confusing preparation with outcome. BM25 overweights repeated names; dense retrieval can rank generic plausible events above the exact story bridge.

Training Data That May Help

Useful training data includes abductive commonsense QA, story cloze tasks, narrative continuation, retrieval-formatted hypothesis selection, and hard negatives that mention the same people or objects but fail the causal bridge. Evaluation questions and candidate answers should be excluded.

Model Improvement Notes

Models should learn query-to-hypothesis coherence over short text. Hard negatives should be fluent and lexically similar but causally wrong. Because the documents are short hypotheses, training should emphasize narrative entailment and abductive reasoning rather than passage-level semantic similarity.

Example Data

Query	Positive document
Start: Scott has felt increasingly unhappy in his last few Year's in New York. End: Driving out of New York, Scott feels both relieved and nostalgic. [149 chars]	The daily grind, extreme traffic and rude city dwellers left Scott longing for small town living. [97 chars]
Start: Joe's mother bugged him constantly to tie his shoelaces. End: As he lay at the bottom of the stairs he wished he'd listened. [131 chars]	Joe tripped down the stairs with his shoes untied. [50 chars]
Start: Alex was at target with his mom. End: He begged his mother to buy it until she gave in. [94 chars]	Alex saw a game he really wanted. [33 chars]

Source Reference Table

Title	Year	Type	URL
RAR-b: Reasoning as Retrieval Benchmark	2024	arXiv paper	https://arxiv.org/abs/2404.06347
Abductive Commonsense Reasoning	2019	arXiv paper	https://arxiv.org/abs/1908.05739

Dataset Information

Field	Value
Nano set	NanoRARb
Backing dataset	NanoRARb
Task / split	NanoAlphaNLI
Hugging Face dataset	hakari-bench/NanoRARb
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	103.79
Document length avg chars	43.84

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.3288	0.4650	0.6750	top-500
Dense	`harrier_oss_v1_270m`	0.5898	0.7900	0.9150	top-500
Reranking hybrid	`reranking_hybrid`	0.4777	0.6500	0.8950	top-100