HAKARI-Bench

NanoRARb / NanoSpartQA

Overview

NanoSpartQA is an English spatial reasoning retrieval task from NanoRARb. It recasts SpartQA as answer retrieval: each query describes blocks, objects, colors, sizes, and spatial relations, and the relevant documents are the short answer phrases that correctly answer the question. Unlike most NanoRARb tasks, some queries have multiple positives. The task tests whether retrievers can compose spatial relations rather than only match object names. reranking_hybrid is the strongest profile on all three reported metrics. Dense retrieval has better top-rank quality than BM25 (nDCG@10 and hit@10) but lower recall@100, and BM25 provides useful lexical anchors.

Details

What the Original Data Measures

RAR-b uses SpartQA as a spatial reasoning task converted into retrieval. The original SpartQA benchmark focuses on textual question answering where a model must track spatial relations among entities described in language.

In this Nano task, the answer pool contains short phrases: object descriptions and aggregate answers such as "both of them" or "none of them". A retriever must encode the described scene and select the answer consistent with the spatial constraints.

Observed Data Profile

The Nano split contains 200 queries, 1,592 documents, and 384 positive qrel rows. Queries average 654.85 characters, while answer documents average 49.80 characters.

Each query has 1.92 positives on average, with a median of 1 and a maximum of 3. Ninety-two of 200 queries, or 46.0%, have multiple positives. Examples describe three blocks containing colored shapes and ask which object satisfies a relation, whether both objects satisfy it, or whether no object does.
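
The per-query positive statistics above can be recomputed from any qrels mapping. The following is a minimal sketch, assuming qrels are held as a dictionary from query ID to a set of positive document IDs; the function name `positive_stats` and the toy IDs are hypothetical and are not taken from the dataset.

```python
from statistics import median


def positive_stats(qrels):
    """Summarize positives per query from a {query_id: set(doc_ids)} mapping."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)  # queries with more than one positive
    return {
        "queries": len(counts),
        "positives": sum(counts),
        "avg": sum(counts) / len(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_share": multi / len(counts),
    }


# Toy qrels with hypothetical IDs; this is NOT the real NanoSpartQA data.
toy_qrels = {
    "q1": {"d1"},
    "q2": {"d2", "d3", "d4"},
    "q3": {"d5"},
    "q4": {"d6", "d7", "d8"},
}
stats = positive_stats(toy_qrels)

# Cross-check of the figures reported in the text, using only those figures.
reported_avg = 384 / 200    # positive qrel rows / queries
reported_multi = 92 / 200   # multi-positive queries / queries
```

With the reported counts, 384 / 200 gives the 1.92 average and 92 / 200 gives the 46.0% multi-positive share.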

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.1888, hit@10 of 0.3750, and recall@100 of 0.5260. BM25 benefits from repeated color, size, shape, and block labels. Object names in the answer can overlap directly with the query.

The limitation is spatial composition. A phrase such as "medium blue square" is not enough on its own; the answer must also satisfy the above, below, left, right, touching, or containment relations stated in the scene. BM25 can retrieve the right object words while missing the relational constraint.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.2634, hit@10 of 0.4950, and recall@100 of 0.4870. Dense retrieval improves top-rank quality and hit@10 over BM25 but has lower recall@100.

This suggests that embeddings capture some spatial-scene semantics and answer plausibility but may miss part of the positive set when several answer phrases are valid. Dense retrieval is better at ranking an answer early when it finds it, while BM25 has broader lexical coverage.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 37 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.3419, hit@10 of 0.5600, and recall@100 of 0.5443. Hybrid retrieval is strongest across the reported metrics.

The improvement reflects complementary signals. Sparse matching preserves exact object descriptors, while dense retrieval contributes relational and semantic compatibility. The combined pool is the best setting for downstream reranking.
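
One way to picture how sparse and dense signals can be combined into a single candidate pool is reciprocal rank fusion (RRF). The source does not state how the reranking_hybrid pool is actually built, so the sketch below is an illustrative stand-in and not the benchmark's method; the function name `rrf_fuse`, the constant `k=60`, and the document IDs are assumptions.

```python
def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Assumption: RRF is used here only as a common, generic fusion rule.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by doc ID for determinism.
    fused = sorted(scores, key=lambda d: (-scores[d], d))
    return fused[:top_n]


# Hypothetical rankings for a single query.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d2", "d4", "d1"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank highly rise to the top, while documents found by only one retriever still remain in the pool, which matches the complementary behavior described above.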

Metric Interpretation for Model Researchers

Because some queries have multiple positives, nDCG@10 rewards ranking as many valid answer phrases as possible near the top, not just the first one. Hit@10 measures whether at least one positive answer appears in the first ten results, and recall@100 measures the fraction of a query's positive answers found in the first 100 results.
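
The three metrics can be sketched for a single query as follows. This is a minimal illustration, assuming binary relevance, a log2 rank discount for nDCG, and a ranked list with no duplicate document IDs; the benchmark's exact evaluation code is not given in the source, so details such as tie handling may differ.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of the positive set found in the top k."""
    if not positives:
        return 0.0
    found = sum(1 for d in ranked[:k] if d in positives)
    return found / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(ranked[:k])
        if d in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical query with three positives, two of which are retrieved.
ranked = ["d9", "d2", "d7", "d3", "d1"]
positives = {"d2", "d3", "d4"}

hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

In this toy case hit@10 is satisfied by a single positive, while recall and nDCG are both penalized for the missing third answer. That is why multi-positive queries separate the metrics more than single-positive ones.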

For NanoSpartQA, the reranking_hybrid profile is the main candidate-generation baseline. A strong model should improve spatial relation composition, not just object-name matching.

Query and Relevance Type Tendencies

Queries are long textual scene descriptions with blocks, shapes, colors, sizes, and relative positions. Relevant documents are short answer phrases that refer to specific objects or give aggregate answers such as "both of them".

Relevance depends on satisfying the spatial question. A candidate can mention the right object type but be wrong if the object is in the wrong block or relation.

Representative Failure Modes

Common failures include matching color and shape but ignoring position, confusing block-level relations with object-level relations, selecting one object when multiple are valid, and misranking generic answers such as "both of them" or "none of them". BM25 overweights object descriptors, while dense retrieval can blur exact relation chains.

Training Data That May Help

Useful training data includes textual spatial QA, scene-graph QA, relation-composition tasks, and retrieval pairs where answers refer to objects selected by spatial constraints. Evaluation queries, answers, and qrels should be excluded from training data.

Model Improvement Notes

Models should learn structured spatial representations from text. Hard negatives should share colors, shapes, sizes, and block labels but violate one relation. Multi-positive handling is important because some questions allow several correct answer phrases.
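
The hard-negative idea can be sketched by picking non-positive answer phrases that share the most tokens with a positive. This is a rough illustration under stated assumptions: the helper names (`tokenize`, `jaccard`, `mine_hard_negatives`) and the small answer pool are hypothetical, and lexical overlap alone does not guarantee that a negative violates exactly one relation, so a real pipeline would still need to verify the negatives against the scene.

```python
def tokenize(text):
    """Lowercase whitespace tokenization into a set of tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set Jaccard similarity between two phrases."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mine_hard_negatives(positives, pool, n=2):
    """Return the n non-positive phrases most lexically similar to any positive."""
    if not positives:
        return []
    positive_set = set(positives)
    candidates = [doc for doc in pool if doc not in positive_set]
    scored = [
        (max(jaccard(doc, pos) for pos in positives), doc)
        for doc in candidates
    ]
    # Highest overlap first; break ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc for _, doc in scored[:n]]


# Illustrative answer pool; the phrases are examples, not actual corpus rows.
pool = [
    "the medium blue square that is in block C",
    "the medium blue square that is in block B",
    "the small yellow circle",
    "both of them",
    "none of them",
]
positives = ["the medium blue square that is in block C"]
hard = mine_hard_negatives(positives, pool, n=2)
```

Here the closest negative differs from the positive only in its block label, which is the kind of near-miss that forces a model to use the spatial relation rather than the object descriptor.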

Example Data

| Query | Positive document |
| --- | --- |
| There are three blocks. Lets call them A, B and C. Block A is below B and block B is below C. Block A has one small yellow circle. Block B has a big black square and a big blue square. To the left of and above a medium blue circle is the big black square. Above and to the left of the medium blue circle there is the big blue square. Block C has a big black square and a medium blue square. The big black square is above a big blue circle. The medium blue square is to the left of the big blue circle... [500 / 797 chars] | both of them [12 chars] |
| We have three blocks, A, B and C. Blocks B and C are above A. Block A contains one medium black square and a medium blue square. Below the medium blue square there is the medium black square. Block B contains one medium yellow square and a medium blue square. The medium yellow square is below the medium blue square. And block C has one medium blue square. Which object is above a medium square? the medium blue square that is in block C or the medium blue square that is in block B? [484 chars] | both of them [12 chars] |
| We have three blocks, A, B and C. Block B is below block C and it is to the left of block A. Block A has a small black triangle. Block B has a medium black triangle, one big blue circle and one small blue triangle. The big blue circle is to the right of and below the small blue triangle. Far from and to the right of the small blue triangle there is the medium black triangle. Block C contains one small black triangle and a small black circle. Above the small black circle there is the small black... [500 / 669 chars] | both of them [12 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| RAR-b: Reasoning as Retrieval Benchmark | 2024 | arXiv paper | https://arxiv.org/abs/2404.06347 |
| SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning | 2021 | arXiv paper | https://arxiv.org/abs/2104.05832 |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoRARb |
| Backing dataset | NanoRARb |
| Task / split | NanoSpartQA |
| Hugging Face dataset | hakari-bench/NanoRARb |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 1,592 |
| Positive qrels | 384 |
| Positives / query avg | 1.92 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 3 |
| Multi-positive queries | 92 (46.00%) |
| Query length avg chars | 654.85 |
| Document length avg chars | 49.80 |

Candidate Subsets

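| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.1888 | 0.3750 | 0.5260 | top-500 |
| Dense | harrier_oss_v1_270m | 0.2634 | 0.4950 | 0.4870 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.3419 | 0.5600 | 0.5443 | top-100 |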