NanoRARb / NanoSIQA

Overview

NanoSIQA is an English social commonsense retrieval task from NanoRARb. It recasts SocialIQA as retrieval: the query describes a social situation and asks about a likely intention, reaction, motivation, or next event, while the relevant document is the correct short answer phrase. Each query has one positive answer among 10,000 candidates. This is one of the harder NanoRARb tasks because answers are extremely short and often do not repeat the situation wording. Dense retrieval improves over BM25 (nDCG@10 of 0.0618 versus 0.0239), but absolute scores remain low.

Details

What the Original Data Measures

RAR-b converts SocialIQA into full answer retrieval. The original SocialIQA benchmark targets commonsense reasoning about social interactions, including what a person intended, how someone would feel, what might happen next, or why an action occurred.

In this retrieval version, the model must retrieve the answer phrase itself. The task tests social plausibility and inference over interpersonal context rather than topical document retrieval.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 126.94 characters, while answer documents average only 21.51 characters.
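The one-positive-per-query property can be checked directly from qrel rows. The sketch below assumes each row is a (query_id, doc_id) pair; that row layout and the IDs are hypothetical placeholders, not the dataset's actual schema or values.

```python
from collections import Counter
from statistics import median

def positives_per_query(qrels):
    """Count positive documents per query from (query_id, doc_id) rows."""
    return Counter(query_id for query_id, _doc_id in qrels)

def profile(qrels):
    """Summarize the positives-per-query distribution."""
    counts = list(positives_per_query(qrels).values())
    return {
        "queries": len(counts),
        "avg": sum(counts) / len(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": sum(1 for c in counts if c > 1),
    }

# Hypothetical rows for illustration only.
toy_qrels = [("q1", "d17"), ("q2", "d42"), ("q3", "d08")]
summary = profile(toy_qrels)
print(summary)
```

On the real split, the same summary should report 200 queries with min, median, and max all equal to 1 and no multi-positive queries, matching the figures above.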

Examples include a student taking parental warnings seriously, a person with many friends, reading a biography of Hillary Clinton, intimacy between two people, and feeling ashamed after stealing from a locker. Positive documents are short phrases such as "study very hard", "know more about Hillary Clinton", or "ashamed".

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.0239, hit@10 of 0.0400, and recall@100 of 0.1850. BM25 is extremely weak because the correct answer often contains few or none of the query's distinctive words.

Sparse retrieval may work when an answer repeats a named entity, but most social inferences are expressed as short generic phrases. Term frequency is not enough to infer intent, emotion, or likely consequence.
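The lexical gap can be illustrated by counting the content words a query shares with its gold answer. This is a minimal sketch, not the BM25 pipeline used for the reported scores: the tokenizer and the stopword list are simplifying assumptions, and the three pairs are taken from the example data in this card.

```python
import re

# Small illustrative stopword list; an assumption, not the benchmark's analyzer.
STOPWORDS = {"a", "the", "to", "of", "will", "what", "why", "did", "do", "this", "they", "it"}

def tokens(text: str) -> set[str]:
    """Lowercase word tokens; a crude stand-in for a BM25 analyzer."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))

def shared_terms(query: str, answer: str) -> set[str]:
    """Content words that appear in both the query and the answer."""
    return (tokens(query) & tokens(answer)) - STOPWORDS

EXAMPLES = [
    ("Context: Cameron's parents told them to do well at school or they would be "
     "grounded. Cameron took their words seriously. Question: What will happen to Cameron?",
     "study very hard"),
    ("Context: Riley had a lot of friends. Question: What will happen to Riley?",
     "they will play with Riley"),
    ("Context: Sydney is a fan of Hillary Clinton. One day she found a biography of "
     "Hillary Clinton. Sydney wanted to read it. Question: Why did Sydney do this?",
     "know more about Hillary Clinton"),
]

overlaps = [shared_terms(q, a) for q, a in EXAMPLES]
for (_, answer), shared in zip(EXAMPLES, overlaps):
    print(f"{answer!r}: shared content terms = {sorted(shared)}")
```

The first answer shares no content words with its query, while the other two overlap only through a repeated name, which is the case where sparse retrieval has a chance.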

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.0618, hit@10 of 0.1250, and recall@100 of 0.3850. Dense retrieval improves over BM25 across all metrics. It captures some semantic relationship between a situation and a likely answer phrase.

The low absolute score shows that social commonsense answer retrieval remains difficult. Many candidate answers are short and broadly plausible, and embeddings may rank a reasonable but incorrect phrase above the gold answer.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 133 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.0405, hit@10 of 0.0700, and recall@100 of 0.3350. Hybrid retrieval improves over BM25 but trails dense retrieval.

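The safeguard count is large: 133 of the 200 queries (66.5%) receive it, which is consistent with recall@100 of 0.3350 leaving 133 positives outside the top-100 pool. Most positives therefore fall outside the compact candidate pool, not just near its edge. Sparse overlap adds limited value because the answer phrases are too short and generic. Dense retrieval is the stronger first-stage profile.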

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 reflects the rank of the correct answer phrase, hit@10 measures whether it appears in the first ten candidates, and recall@100 measures candidate availability for reranking.
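With a single positive, all three metrics reduce to simple functions of the positive's rank. The sketch below assumes binary relevance and the standard log2 discount, so the ideal DCG is 1 and nDCG@10 equals 1 / log2(rank + 1) when the positive is in the top ten. The function names and the toy ranks are illustrative, not taken from the benchmark's evaluation code.

```python
import math
from typing import Optional

def ndcg_at_k(rank: Optional[int], k: int = 10) -> float:
    """nDCG@k for a query with exactly one relevant document.

    rank is the 1-based position of the positive, or None if it was not retrieved.
    """
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def hit_at_k(rank: Optional[int], k: int) -> float:
    """1.0 if the positive is in the top k, else 0.0.

    With one positive per query, recall@k equals hit@k.
    """
    return 1.0 if rank is not None and rank <= k else 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Toy ranks for four queries: top hit, rank 3, rank 12, and not retrieved.
ranks = [1, 3, 12, None]
mean_ndcg10 = mean(ndcg_at_k(r, 10) for r in ranks)
mean_hit10 = mean(hit_at_k(r, 10) for r in ranks)
mean_recall100 = mean(hit_at_k(r, 100) for r in ranks)
print(mean_ndcg10, mean_hit10, mean_recall100)
```

Because the discount is steep, a positive at rank 3 already contributes only 0.5, which is part of why nDCG@10 sits well below hit@10 in the reported profiles.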

For NanoSIQA, dense retrieval is the baseline to beat, but better models likely need social reasoning and answer-selection supervision rather than only semantic embeddings.

Query and Relevance Type Tendencies

Queries describe social contexts and ask why someone acted, how they might feel, or what will happen. Relevant documents are concise answer phrases. The answer may be an emotion, intention, activity, or consequence.

Relevance tracks social plausibility rather than surface overlap. A candidate that names the same person or setting is wrong if it implies the wrong intention, reaction, or outcome.

Representative Failure Modes

Common failures include selecting a generic feeling instead of the specific expected reaction, matching a person or object without the right intent, confusing motivation with consequence, and ranking plausible but non-gold actions. BM25 almost entirely misses implicit answers; dense retrieval can over-rank common social phrases.

Training Data That May Help

Useful training data includes social commonsense QA, intent and effect prediction, dialogue commonsense, story-based social inference, and retrieval-formatted answer selection. Evaluation examples and answer documents should be excluded.
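One way to apply the exclusion is an exact-match filter after light normalization. The sketch below is an assumption about how such a filter could look, not a procedure described by the benchmark: it drops any training pair whose query matches an evaluation query or whose answer matches an evaluation document. The second and third training pairs are hypothetical.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    no_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", no_punct).strip()

def drop_eval_overlap(train_pairs, eval_queries, eval_answers):
    """Keep only (query, answer) pairs that match no evaluation query or document."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_answers = {normalize(a) for a in eval_answers}
    return [
        (q, a)
        for q, a in train_pairs
        if normalize(q) not in blocked_queries and normalize(a) not in blocked_answers
    ]

eval_queries = ["Context: Riley had a lot of friends. Question: What will happen to Riley?"]
eval_answers = ["study very hard"]

train_pairs = [
    # Same query as an evaluation example: dropped.
    ("Context: Riley had a lot of friends. Question: What will happen to Riley?",
     "they will play with Riley"),
    # Hypothetical pair with no overlap: kept.
    ("Context: Alex missed the bus. Question: How would Alex feel?", "annoyed"),
    # Hypothetical pair whose answer matches an evaluation document: dropped.
    ("Context: Jo has an exam tomorrow. Question: What will Jo do?", "Study very hard!"),
]

clean_pairs = drop_eval_overlap(train_pairs, eval_queries, eval_answers)
print(clean_pairs)
```

Exact matching on answers is strict, since many answer documents are generic single words, and it will not catch paraphrased leakage; a fuzzier match would need a separate design decision.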

Model Improvement Notes

Models should learn situation-to-answer social inference over very short documents. Hard negatives should mention the same person, setting, or social relationship but imply a wrong emotion, intention, or consequence. Cross-encoder reranking may be especially important because short answer phrases lack standalone context.
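A simple starting point for such hard negatives is to pick candidates that mention the same person as the query but are not the gold answer. The sketch below covers only that surface condition; it treats every non-gold candidate as wrong, which is an assumption that follows from the single-positive setup. The function name and all candidates except the gold answer are hypothetical.

```python
import re

def words(text: str) -> set[str]:
    """Lowercase word tokens of a candidate phrase."""
    return set(re.findall(r"[a-z']+", text.lower()))

def same_person_negatives(person, positive, candidates, limit=5):
    """Return up to `limit` non-gold candidates that mention `person`."""
    gold = positive.strip().lower()
    picked = []
    for cand in candidates:
        if cand.strip().lower() == gold:
            continue  # never use the gold answer as a negative
        if person.lower() in words(cand):
            picked.append(cand)
    return picked[:limit]

candidates = [
    "they will play with Riley",  # gold answer from the example data
    "they will ignore Riley",     # hypothetical
    "Riley will feel lonely",     # hypothetical
    "study very hard",            # no mention of Riley
]

negatives = same_person_negatives("Riley", "they will play with Riley", candidates)
print(negatives)
```

Negatives chosen this way share the person but not necessarily the setting or relationship, so they are a lower bound on difficulty; mining top-ranked non-gold candidates from a dense retriever is a common complement.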

Example Data

Query	Positive document
Context: Cameron's parents told them to do well at school or they would be grounded. Cameron took their words seriously. Question: What will happen to Cameron? [160 chars]	study very hard [15 chars]
Context: Riley had a lot of friends. Question: What will happen to Riley? [73 chars]	they will play with Riley [25 chars]
Context: Sydney is a fan of Hillary Clinton. One day she found a biography of Hillary Clinton. Sydney wanted to read it. Question: Why did Sydney do this? [154 chars]	know more about Hillary Clinton [31 chars]

Source Reference Table

Title	Year	Type	URL
RAR-b: Reasoning as Retrieval Benchmark	2024	arXiv paper	https://arxiv.org/abs/2404.06347
SocialIQA: Commonsense Reasoning about Social Interactions	2019	arXiv paper	https://arxiv.org/abs/1904.09728

Dataset Information

Field	Value
Nano set	NanoRARb
Backing dataset	NanoRARb
Task / split	NanoSIQA
Hugging Face dataset	hakari-bench/NanoRARb
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	126.94
Document length avg chars	21.51

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.0239	0.0400	0.1850	top-500
Dense	`harrier_oss_v1_270m`	0.0618	0.1250	0.3850	top-500
Reranking hybrid	`reranking_hybrid`	0.0405	0.0700	0.3350	top-100
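
Because every query has exactly one positive, hit@10 and recall@100 are per-query values of 0 or 1, so the averaged rates convert directly into query counts. The sketch below does that conversion for the figures in the table above; the variable and function names are illustrative.

```python
NUM_QUERIES = 200

# (profile, hit@10, recall@100) copied from the Candidate Subsets table.
REPORTED = [
    ("BM25", 0.0400, 0.1850),
    ("Dense", 0.1250, 0.3850),
    ("Reranking hybrid", 0.0700, 0.3350),
]

def to_count(rate: float, n: int = NUM_QUERIES) -> int:
    """Convert an averaged 0/1 metric into a number of queries."""
    return round(rate * n)

counts = {
    name: {"hit@10": to_count(hit), "recall@100": to_count(recall)}
    for name, hit, recall in REPORTED
}

for name, c in counts.items():
    print(f"{name}: {c['hit@10']} queries hit in top 10, {c['recall@100']} in top 100")
```

At this sample size, a difference of 0.005 in either metric corresponds to a single query, which is worth keeping in mind when comparing models on the Nano split.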