HAKARI-Bench

NanoRARb

Overview

NanoRARb is the Nano task group for RAR-b, the Reasoning as Retrieval Benchmark. It converts reasoning problems into retrieval tasks: the query is a question, story context, code prompt, math problem, spatial scene, or temporal reasoning prompt, and the relevant document is the correct answer, continuation, entity, solution, or implementation from a large answer pool. The group tests whether a retriever can rank logically correct answers, not just topically related text.

The group contains 3,400 queries, 156,037 task-local documents, and 3,584 positive qrel rows. Every task is single-positive except NanoSpartQA, which averages 1.92 positives per query. Documents are often short answer strings, while some queries are very long, especially in the TempReason context tasks. This makes the group a compact stress test for reasoning-level semantic retrieval.

What This Group Measures

RAR-b asks whether retrievers can solve reasoning problems after they are recast as information retrieval. Instead of retrieving ordinary topical documents, the model must retrieve the correct answer from a pool of plausible candidates. The Nano group includes science QA, abductive story reasoning, event continuation, physical and social commonsense, reading comprehension, spatial reasoning, temporal reasoning, code generation, math problem solving, and Winograd-style referent resolution.
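The recasting described above can be sketched as follows. This is a minimal illustration, assuming multiple-choice source items; the `ChoiceItem` type, the `to_retrieval_task` helper, the `d0`-style document ids, and the two toy questions are all hypothetical and are not taken from the benchmark's own tooling.

```python
from dataclasses import dataclass


@dataclass
class ChoiceItem:
    """Hypothetical multiple-choice reasoning item."""
    item_id: str
    question: str
    choices: list[str]
    answer_index: int


def to_retrieval_task(items):
    """Pool every answer choice into one corpus and mark the correct one per query."""
    queries = {}         # query_id -> question text
    corpus = {}          # doc_id -> answer text
    text_to_doc_id = {}  # identical answer strings share one document
    qrels = {}           # query_id -> set of relevant doc_ids
    for item in items:
        queries[item.item_id] = item.question
        for choice in item.choices:
            if choice not in text_to_doc_id:
                doc_id = f"d{len(corpus)}"
                text_to_doc_id[choice] = doc_id
                corpus[doc_id] = choice
        correct_text = item.choices[item.answer_index]
        qrels[item.item_id] = {text_to_doc_id[correct_text]}
    return queries, corpus, qrels


# Toy items (invented for illustration).
items = [
    ChoiceItem("q1", "Which gas do plants absorb for photosynthesis?",
               ["oxygen", "carbon dioxide", "nitrogen"], 1),
    ChoiceItem("q2", "Which gas do humans need to breathe in?",
               ["oxygen", "helium"], 0),
]
queries, corpus, qrels = to_retrieval_task(items)
print(len(corpus), qrels)
```

Because the choices of every item land in one shared pool, each query competes against answers written for other questions, which is what makes the candidates plausible rather than trivially off-topic.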

This benchmark is intentionally hard for lexical retrieval. The positive answer may be a short phrase, a date, an entity, a referent, a code snippet, or a worked solution. The query may imply the answer through causal, temporal, mathematical, spatial, or program-behavior constraints rather than through shared vocabulary.

Task Families

Dataset Shape

All tasks are in English, and each task has 200 queries. Candidate pools usually hold 10,000 documents; the exceptions are NanoARCChallenge (9,350), NanoWinoGrande (5,095), and NanoSpartQA (1,592), which use smaller pools. Every task except NanoSpartQA has exactly one positive per query. NanoSpartQA averages 1.92 positives per query, for 384 qrel rows.

The query/document length contrast is important. TempReason context queries can contain tens of thousands of characters of facts, while the target answer is a short entity string. Math and code documents are longer because they contain solutions or implementations. Many other tasks retrieve very short answers, so retrieval success depends on reasoning, not document length.

Retrieval Behavior

BM25 Profile

BM25 is weak overall, with query-weighted nDCG@10 of 0.1536. This is expected: the answer is often logically entailed by the query rather than lexically similar to it. BM25 is strongest on NanoRARbMath, NanoWinoGrande, and NanoAlphaNLI, where equations, quantities, story entities, or candidate referents can overlap with the query. NanoRARbMath reaches 0.6147 nDCG@10, and NanoWinoGrande reaches 0.5067.
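For readers who want to reproduce the aggregate, the sketch below shows one common binary-relevance definition of nDCG@10 and a query-weighted mean over tasks. It assumes binary relevance and a log2 discount; the function names are illustrative and the benchmark's own evaluation code may differ in detail. The per-task BM25 values are copied from the Task Summary table.

```python
import math


def ndcg_at_k(ranked_ids, positive_ids, k=10):
    """Binary-relevance nDCG@k for a single query."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positive_ids
    )
    ideal_hits = min(k, len(positive_ids))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def query_weighted_mean(task_scores, task_query_counts):
    """Average task scores, weighting each task by its number of queries."""
    total_queries = sum(task_query_counts)
    weighted = sum(s * n for s, n in zip(task_scores, task_query_counts))
    return weighted / total_queries


# A single positive at rank 3 scores 1 / log2(4).
print(ndcg_at_k(["a", "b", "c"], {"c"}))

# BM25 nDCG@10 per task, in Task Summary order.
bm25_scores = [0.0386, 0.3288, 0.1393, 0.2443, 0.0522, 0.1318, 0.6147, 0.0239,
               0.1888, 0.0125, 0.1114, 0.0615, 0.0000, 0.0945, 0.0547, 0.0074,
               0.5067]
bm25_group = query_weighted_mean(bm25_scores, [200] * len(bm25_scores))
print(round(bm25_group, 4))
```

Since every task has 200 queries, the query-weighted mean coincides with the plain mean of the 17 task scores.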

BM25 nearly fails on pure temporal and social reasoning tasks. NanoTempReasonL2Pure has 0.0000 nDCG@10, and NanoSIQA has 0.0239. These tasks require choosing an answer that follows from world, social, or temporal structure rather than from repeated words. BM25 is therefore a useful lower bound for the group, but not a reasonable proxy for reasoning retrieval.
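A toy example makes the failure mode concrete. The sketch below uses plain token overlap as a crude stand-in for BM25 (it is not BM25 itself), and the query and candidate answers are invented for illustration.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query, doc):
    """Number of distinct tokens shared by the query and the document."""
    return len(tokens(query) & tokens(doc))


query = ("Jordan forgot Riley's birthday and Riley stopped answering calls. "
         "How does Riley feel?")
candidates = {
    "a1": "hurt and upset",                                       # correct answer
    "a2": "Jordan stopped answering calls on Riley's birthday.",  # distractor
}

ranked = sorted(
    candidates,
    key=lambda doc_id: overlap_score(query, candidates[doc_id]),
    reverse=True,
)
print(ranked)
```

The distractor repeats most of the query's words and is ranked first, while the correct answer shares only a function word with the query.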

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest group-level profile, with 0.2469 nDCG@10 and 0.6563 recall@100. It is best on 13 of the 17 tasks: ARC, AlphaNLI, PIQA, QuAIL, Math, SIQA, and all seven TempReason tasks. The relative gains are particularly large for temporal tasks (NanoTempReasonL2Fact, for example, rises from 0.0615 with BM25 to 0.3005), where dense retrieval ranks entities or dates better than sparse overlap even when absolute nDCG@10 remains low.
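The two quantities reported for the dense profile can be illustrated with a small sketch: candidates are ranked by cosine similarity between embeddings, and recall@k measures the share of positives found in the top k. The two-dimensional vectors below are made up, and the way embeddings are produced by the actual model is not shown or assumed here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_by_cosine(query_vec, doc_vecs):
    """Return doc ids sorted from most to least similar to the query."""
    return sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]),
                  reverse=True)


def recall_at_k(ranked_ids, positive_ids, k=100):
    """Fraction of the positives that appear in the top k results."""
    if not positive_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in positive_ids)
    return hits / len(positive_ids)


# Toy embeddings (hypothetical values).
query_vec = [1.0, 0.0]
doc_vecs = {"d1": [0.9, 0.1], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}

ranking = rank_by_cosine(query_vec, doc_vecs)
print(ranking)
print(recall_at_k(ranking, {"d3"}, k=1), recall_at_k(ranking, {"d3"}, k=2))
```

A task can have high recall@100 and low nDCG@10 at the same time: the positive is retrieved, but not near the top of the list.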

Dense is not uniformly best. NanoHellaSwag, NanoRARbCode, NanoSpartQA, and NanoWinoGrande favor the reranking hybrid profile, and on NanoHellaSwag, NanoRARbCode, and NanoWinoGrande dense nDCG@10 even falls slightly below BM25. These tasks still benefit from exact entities, identifiers, object labels, or code tokens. Dense reasoning retrieval therefore helps substantially, but surface anchors remain useful in several answer-pool settings.

Reranking Hybrid Profile

The reranking hybrid profile is best for NanoHellaSwag, NanoRARbCode, NanoSpartQA, and NanoWinoGrande. These tasks combine semantic plausibility with surface cues: story continuations share entities and activities, code answers share identifiers or API names, spatial answers share object labels, and WinoGrande answers are often explicit referents in the sentence.

Hybrid trails dense on group-level nDCG@10 but remains close: averaging the per-task scores in the Task Summary gives about 0.2310 for hybrid against 0.2469 for dense. It also improves some recall-heavy or candidate-sensitive tasks. The practical interpretation is that hybrid search can help when reasoning answers still contain lexical anchors, while dense retrieval is the stronger default for pure temporal, social, reading, and abductive reasoning.
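This page does not describe how the reranking hybrid profile is constructed. Purely as a generic illustration of how a sparse and a dense ranking can be combined, the sketch below uses reciprocal rank fusion (RRF), a substituted technique that is not claimed to match the benchmark's hybrid profile. The constant `k=60` is a conventional default, and the two input rankings are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    fused_scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused_scores[doc_id] = fused_scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    return sorted(fused_scores, key=lambda d: (-fused_scores[d], d))


# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents that both retrievers place near the top (here `d1` and `d3`) rise above documents that only one retriever finds, which is the intuition behind hybrid gains on tasks with lexical anchors.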

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| NanoARCChallenge | Science QA retrieval | en | 200 | 9,350 | 200 | 1.00 | 0.0386 | 0.1113 | 0.0642 | Dense |
| NanoAlphaNLI | Abductive reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.3288 | 0.5898 | 0.4777 | Dense |
| NanoHellaSwag | Event continuation | en | 200 | 10,000 | 200 | 1.00 | 0.1393 | 0.1253 | 0.1551 | Reranking hybrid |
| NanoPIQA | Physical commonsense | en | 200 | 10,000 | 200 | 1.00 | 0.2443 | 0.4017 | 0.3741 | Dense |
| NanoQuail | Reading comprehension | en | 200 | 10,000 | 200 | 1.00 | 0.0522 | 0.1174 | 0.0982 | Dense |
| NanoRARbCode | Code reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.1318 | 0.1173 | 0.1773 | Reranking hybrid |
| NanoRARbMath | Math reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.6147 | 0.7818 | 0.7350 | Dense |
| NanoSIQA | Social commonsense | en | 200 | 10,000 | 200 | 1.00 | 0.0239 | 0.0618 | 0.0405 | Dense |
| NanoSpartQA | Spatial reasoning | en | 200 | 1,592 | 384 | 1.92 | 0.1888 | 0.2634 | 0.3419 | Reranking hybrid |
| NanoTempReasonL1 | Temporal date reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.0125 | 0.0488 | 0.0129 | Dense |
| NanoTempReasonL2Context | Temporal entity reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.1114 | 0.2171 | 0.2049 | Dense |
| NanoTempReasonL2Fact | Temporal entity reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.0615 | 0.3005 | 0.2513 | Dense |
| NanoTempReasonL2Pure | Temporal entity reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.0000 | 0.0483 | 0.0033 | Dense |
| NanoTempReasonL3Context | Temporal relation reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.0945 | 0.1926 | 0.1668 | Dense |
| NanoTempReasonL3Fact | Temporal relation reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.0547 | 0.2549 | 0.1981 | Dense |
| NanoTempReasonL3Pure | Temporal relation reasoning | en | 200 | 10,000 | 200 | 1.00 | 0.0074 | 0.0707 | 0.0238 | Dense |
| NanoWinoGrande | Coreference reasoning | en | 200 | 5,095 | 200 | 1.00 | 0.5067 | 0.4946 | 0.6020 | Reranking hybrid |

Interpretation Notes for Model Researchers

NanoRARb should be read as a reasoning-as-retrieval diagnostic. A strong score on ordinary semantic retrieval does not guarantee strong performance here, because the candidate document can be a short answer whose relevance is only visible after reasoning. Dense retrieval improves many tasks, but the absolute scores show that this remains difficult for embedding-only retrieval.

Task-family analysis is essential. Math and WinoGrande are much easier than social, temporal, and reading-comprehension answer retrieval. Hybrid wins where surface anchors remain important, while dense retrieval is better for most pure reasoning tasks. The aggregate group score hides these differences.
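The per-task breakdown behind this point can be recomputed directly from the Task Summary scores. The sketch below is only an illustration of that bookkeeping; the variable and function names are not from the benchmark, and the scores are copied from the table.

```python
from collections import Counter

PROFILES = ("BM25", "Dense", "Reranking hybrid")

# (BM25, dense, reranking hybrid) nDCG@10 per task, from the Task Summary table.
SCORES = {
    "NanoARCChallenge": (0.0386, 0.1113, 0.0642),
    "NanoAlphaNLI": (0.3288, 0.5898, 0.4777),
    "NanoHellaSwag": (0.1393, 0.1253, 0.1551),
    "NanoPIQA": (0.2443, 0.4017, 0.3741),
    "NanoQuail": (0.0522, 0.1174, 0.0982),
    "NanoRARbCode": (0.1318, 0.1173, 0.1773),
    "NanoRARbMath": (0.6147, 0.7818, 0.7350),
    "NanoSIQA": (0.0239, 0.0618, 0.0405),
    "NanoSpartQA": (0.1888, 0.2634, 0.3419),
    "NanoTempReasonL1": (0.0125, 0.0488, 0.0129),
    "NanoTempReasonL2Context": (0.1114, 0.2171, 0.2049),
    "NanoTempReasonL2Fact": (0.0615, 0.3005, 0.2513),
    "NanoTempReasonL2Pure": (0.0000, 0.0483, 0.0033),
    "NanoTempReasonL3Context": (0.0945, 0.1926, 0.1668),
    "NanoTempReasonL3Fact": (0.0547, 0.2549, 0.1981),
    "NanoTempReasonL3Pure": (0.0074, 0.0707, 0.0238),
    "NanoWinoGrande": (0.5067, 0.4946, 0.6020),
}


def best_profile(scores):
    """Name of the profile with the highest nDCG@10 for one task."""
    best_index = max(range(len(PROFILES)), key=lambda i: scores[i])
    return PROFILES[best_index]


wins = Counter(best_profile(s) for s in SCORES.values())
bm25_beats_dense = [task for task, (bm25, dense, _) in SCORES.items()
                    if bm25 > dense]
dense_scores = [dense for _, dense, _ in SCORES.values()]

print(dict(wins))
print(bm25_beats_dense)
print(min(dense_scores), max(dense_scores))
```

The dense scores alone span from 0.0483 to 0.7818, so a single group average says little about any individual task family.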

Training and Leakage Notes

Useful training data includes abductive story reasoning, event continuation, physical and social commonsense QA, Winograd/coreference examples, ARC-style science QA, passage QA answer retrieval, textual spatial reasoning, temporal interval QA, docstring-to-code retrieval, and math problem-solution pairs.

Leakage control should exclude NanoRARb evaluation queries, qrels, candidate answers, worked solutions, code snippets, and upstream reasoning benchmark test examples. Synthetic data should preserve the reasoning relation and include hard negatives that share vocabulary but fail the causal, temporal, spatial, mathematical, social, or program-behavior constraint.
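A first-pass leakage filter might look like the sketch below. It assumes training data arrives as (query, answer) pairs and only catches exact matches after light normalization; the helper names and toy strings are hypothetical, and near-duplicate detection would need fuzzier matching than shown here.

```python
import re


def normalize(text):
    """Lowercase and collapse text to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def filter_leaked(train_pairs, eval_texts):
    """Drop training pairs whose query or answer matches any evaluation text.

    Note: very short answers (for example "yes") may block many otherwise
    unrelated pairs, so short evaluation strings may need separate handling.
    """
    blocked = {normalize(text) for text in eval_texts}
    return [
        (query, answer)
        for query, answer in train_pairs
        if normalize(query) not in blocked and normalize(answer) not in blocked
    ]


# Toy data (invented for illustration).
train_pairs = [
    ("What happened to the ice cream?", "It melted."),
    ("Why did Ana bring an umbrella?", "Rain was forecast."),
]
eval_texts = ["what happened to the ice cream"]

kept = filter_leaked(train_pairs, eval_texts)
print(kept)
```

In practice `eval_texts` would cover evaluation queries, candidate answers, worked solutions, and code snippets, as listed above.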

Source Reference Table

| Source | Year | Type | URL |
|---|---|---|---|
| RAR-b: Reasoning as Retrieval Benchmark | 2024 | benchmark paper | https://arxiv.org/abs/2404.06347 |
| ARC, the AI2 Reasoning Challenge | 2018 | source task paper | https://arxiv.org/abs/1803.05457 |
| Abductive Commonsense Reasoning | 2019 | source task paper | https://arxiv.org/abs/1908.05739 |
| HellaSwag | 2019 | source task paper | https://arxiv.org/abs/1905.07830 |
| PIQA | 2020 | source task paper | https://arxiv.org/abs/1911.11641 |
| QuAIL | 2020 | source task paper | https://ojs.aaai.org/index.php/AAAI/article/view/6398 |

Metadata Summary

| Field | Value |
|---|---|
| Task pages | 17 |
| Queries | 3,400 |
| Split-local documents | 156,037 |
| Positive qrels | 3,584 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 1.05 |
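The group totals follow from the per-task counts, which gives a quick consistency check. The sketch below is plain arithmetic over figures already stated on this page; the variable names are illustrative.

```python
N_TASKS = 17
QUERIES_PER_TASK = 200
DEFAULT_POOL_SIZE = 10_000

# Tasks whose candidate pool is smaller than the default.
smaller_pools = {
    "NanoARCChallenge": 9_350,
    "NanoSpartQA": 1_592,
    "NanoWinoGrande": 5_095,
}
SPARTQA_POSITIVES = 384  # the only multi-positive task

total_queries = N_TASKS * QUERIES_PER_TASK
total_docs = (sum(smaller_pools.values())
              + (N_TASKS - len(smaller_pools)) * DEFAULT_POOL_SIZE)
total_positives = (N_TASKS - 1) * QUERIES_PER_TASK + SPARTQA_POSITIVES
positives_per_query = total_positives / total_queries

print(total_queries, total_docs, total_positives, round(positives_per_query, 2))
```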

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| NanoAlphaNLI | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.3288 | 0.5898 | 0.4777 | Dense |
| NanoARCChallenge | NanoRARb | en | natural_language | 200 | 9,350 | 200 | 0.0386 | 0.1113 | 0.0642 | Dense |
| NanoHellaSwag | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.1393 | 0.1253 | 0.1551 | Reranking hybrid |
| NanoPIQA | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.2443 | 0.4017 | 0.3741 | Dense |
| NanoQuail | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0522 | 0.1174 | 0.0982 | Dense |
| NanoRARbCode | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.1318 | 0.1173 | 0.1773 | Reranking hybrid |
| NanoRARbMath | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.6147 | 0.7818 | 0.7350 | Dense |
| NanoSIQA | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0239 | 0.0618 | 0.0405 | Dense |
| NanoSpartQA | NanoRARb | en | natural_language | 200 | 1,592 | 384 | 0.1888 | 0.2634 | 0.3419 | Reranking hybrid |
| NanoTempReasonL1 | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0125 | 0.0488 | 0.0129 | Dense |
| NanoTempReasonL2Context | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.1114 | 0.2171 | 0.2049 | Dense |
| NanoTempReasonL2Fact | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0615 | 0.3005 | 0.2513 | Dense |
| NanoTempReasonL2Pure | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0000 | 0.0483 | 0.0033 | Dense |
| NanoTempReasonL3Context | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0945 | 0.1926 | 0.1668 | Dense |
| NanoTempReasonL3Fact | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0547 | 0.2549 | 0.1981 | Dense |
| NanoTempReasonL3Pure | NanoRARb | en | natural_language | 200 | 10,000 | 200 | 0.0074 | 0.0707 | 0.0238 | Dense |
| NanoWinoGrande | NanoRARb | en | natural_language | 200 | 5,095 | 200 | 0.5067 | 0.4946 | 0.6020 | Reranking hybrid |