NanoMMTEB-v2 / spart_qa
Overview
NanoMMTEB-v2 / spart_qa is an English spatial-reasoning retrieval task derived from SpartQA. Queries describe blocks, objects, colors, sizes, and spatial relations; documents are short candidate answer phrases. The Nano split has 200 queries, 1,592 documents, and 384 positive qrel rows, an average of 1.92 positives per query, and 46% of queries have more than one valid answer. In current diagnostics, reranking_hybrid is the strongest profile on all three reported metrics. Dense retrieval beats BM25 at the top ranks (nDCG@10 and hit@10) but trails it on recall@100. BM25 is limited because the correct answer depends on composing spatial constraints rather than matching object words.
Details
What the Original Data Measures
SpartQA is a textual question-answering benchmark for spatial reasoning. It uses natural-language scene descriptions and asks questions that require tracking spatial relations across objects, blocks, and reference frames. The MTEB retrieval framing turns correct answer phrases into retrievable candidate documents.
The task measures whether a retrieval model can connect a detailed scene description to the answer phrase that satisfies all spatial constraints. It is closer to reasoning over text than to ordinary topical retrieval.
Observed Data Profile
The Nano split contains 200 queries, 1,592 documents, and 384 positive qrel rows. Positives per query average 1.92, with a minimum of 1, a median of 1, and a maximum of 3. Ninety-two queries (46%) have multiple positives. Queries average 654.85 characters, while candidate answer documents average 49.80 characters.
Queries are long scene descriptions followed by a spatial question. Candidate answers are short phrases such as "both of them", "none of them", or a description of an object satisfying the relation.
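The positives-per-query figures above can be recomputed from the qrel rows. The following is a minimal sketch, assuming the positive qrels are available as `(query_id, doc_id)` pairs; the function name and the toy ids are hypothetical and are not part of the dataset.

```python
from collections import Counter
from statistics import mean, median


def positives_per_query_stats(qrels):
    """Summarise positives per query from (query_id, doc_id) positive rows."""
    # Deduplicate rows first so a repeated qrel is not counted twice.
    counts = Counter(query_id for query_id, _doc_id in set(qrels))
    values = sorted(counts.values())
    multi = sum(1 for value in values if value > 1)
    return {
        "queries": len(values),
        "positive_rows": sum(values),
        "avg": mean(values),
        "min": values[0],
        "median": median(values),
        "max": values[-1],
        "multi_positive_share": multi / len(values),
    }


# Toy qrels with hypothetical ids: one query has three positives.
toy_qrels = [("q1", "d1"), ("q2", "d2"), ("q2", "d3"), ("q2", "d4"), ("q3", "d5")]
stats = positives_per_query_stats(toy_qrels)
```

Applied to the real qrels, the same summary should reproduce the values reported in this section if the rows are loaded as positive pairs.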
BM25 Evaluation Profile
The dataset-provided BM25 candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.1848, hit@10 = 0.3800, and recall@100 = 0.5260. BM25 has the lowest nDCG@10 and hit@10 of the three profiles, although its recall@100 is higher than the dense profile's 0.4870.
Lexical overlap helps when candidate answers repeat color, size, shape, or object terms from the query. However, many wrong candidates share almost the same vocabulary. The correct answer depends on composing spatial relations such as left of, above, touching, containment, edge contact, and block-to-block position.
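A toy illustration of why shared vocabulary is not enough: the sketch below uses a plain token-overlap score as a simplified stand-in for lexical matching (it is not BM25), and the scene, candidates, and function names are invented for illustration.

```python
import re


def tokens(text):
    """Lowercased set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(query, candidate):
    """Fraction of candidate tokens that also appear in the query."""
    candidate_tokens = tokens(candidate)
    return len(candidate_tokens & tokens(query)) / len(candidate_tokens)


# Hypothetical scene: only the square in block B is below the black circle.
query = (
    "Block A has a medium blue square. "
    "Block B has a medium blue square and a big black circle. "
    "The big black circle is above the medium blue square in block B. "
    "Which medium blue square is below a big black circle?"
)
correct = "the medium blue square that is in block B"
wrong = "the medium blue square that is in block A"

score_correct = overlap_score(query, correct)
score_wrong = overlap_score(query, wrong)
```

Both candidates receive the same overlap score because every distinguishing token ("A" or "B") also appears in the scene. Separating them requires resolving which block the relation holds in.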
Dense Evaluation Profile
The dense harrier_oss_v1_270m candidate subset contains 500 candidates per query and achieves nDCG@10 = 0.2591, hit@10 = 0.4900, and recall@100 = 0.4870. Dense retrieval improves top-rank quality over BM25 (nDCG@10 +0.0743, hit@10 +0.1100), suggesting that embedding similarity captures some relation between scene description and answer phrase. Its recall@100 is lower than BM25's 0.5260, so the gain is concentrated at the top of the ranking.
Dense retrieval still struggles because the answer is often a short phrase with little semantic content by itself. A phrase like "both of them" or "none of them" cannot be ranked correctly without resolving the full scene.
Reranking Hybrid Evaluation Profile
The reranking_hybrid candidate subset contains 100 candidates for most queries; 37 queries also carry a rank-101 safeguard row. It achieves nDCG@10 = 0.3382, hit@10 = 0.5700, and recall@100 = 0.5469. Hybrid retrieval is the best observed profile on nDCG@10, hit@10, and recall@100.
Combining lexical object evidence with dense scene-answer similarity helps here. The hybrid pool retains more positives (recall@100 of 0.5469 versus 0.5260 for BM25 and 0.4870 for dense) and ranks them better than either profile alone, although absolute scores remain modest because spatial reasoning is the central difficulty.
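This document does not state how the reranking_hybrid pool is constructed. As a generic illustration of combining a lexical and a dense ranking, the sketch below uses reciprocal rank fusion (RRF), a common fusion technique chosen here as an assumption; the document ids and the constant `k=60` are illustrative only.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a lexical and a dense retriever.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank reasonably well rise to the top of the fused list, which is one way a hybrid pool can retain more positives than either source alone.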
Metric Interpretation for Model Researchers
This is a multi-positive task for nearly half the queries. nDCG@10 rewards ranking all valid answer phrases high, while hit@10 only checks whether at least one positive appears near the top. Recall@100 measures whether valid answers remain available for reranking.
Because candidate answers are short, retrieval metrics reflect both semantic matching and reasoning limitations. High performance likely requires a model or reranker that can parse the scene and compose relations, not only embed the query and answer phrase.
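The difference between the three metrics on a multi-positive query can be sketched as follows. This assumes binary relevance and the standard log2 discount for nDCG; whether the benchmark's evaluator uses exactly this formulation is an assumption, and the function names are illustrative.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-gain nDCG: discounted hits divided by the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal if ideal else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def recall_at_k(ranked, positives, k=100):
    """Share of positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


# One query with three valid answers; only two are retrieved.
positives = {"a", "b", "c"}
ranked = ["a", "x", "y", "b", "z"]
ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
```

In this toy case hit@10 is already saturated by the first positive, while nDCG@10 and recall@100 are both penalised for the positive ranked fourth and the positive that is missing.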
Query and Relevance Type Tendencies
Queries describe block worlds with colors, shapes, sizes, containment, relative positions, and edge contacts. Relevant documents are short answer candidates that satisfy all spatial constraints. Some questions allow multiple valid answers.
The task rewards relation composition, object binding, and spatial constraint checking. It penalizes systems that only match color or shape words.
Representative Failure Modes
BM25 can retrieve a candidate with the right object vocabulary but the wrong relation. Dense retrieval can prefer semantically generic answer phrases that fit many questions. Hybrid retrieval can still rank a near-miss object above a valid one when both share the same colors and shapes.
Rerankers should build or approximate a scene graph from the query and evaluate candidate answers against the graph.
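A minimal sketch of the scene-graph idea, assuming the relations have already been extracted from the query as (subject, predicate, object) triples. Parsing the natural-language scene is the hard part and is not shown; all names below are hypothetical.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    subject: str
    predicate: str
    obj: str


INVERSE = {"above": "below", "below": "above", "left_of": "right_of", "right_of": "left_of"}


def close_scene(relations):
    """Add inverse relations, then the transitive closure of each predicate."""
    facts = set(relations)
    facts |= {
        Relation(r.obj, INVERSE[r.predicate], r.subject)
        for r in relations
        if r.predicate in INVERSE
    }
    changed = True
    while changed:
        changed = False
        for a in list(facts):
            for b in list(facts):
                if a.predicate == b.predicate and a.obj == b.subject:
                    new = Relation(a.subject, a.predicate, b.obj)
                    if new.subject != new.obj and new not in facts:
                        facts.add(new)
                        changed = True
    return facts


def satisfies(candidate, constraints, facts):
    """True if the candidate meets every (predicate, object) constraint."""
    return all(Relation(candidate, pred, obj) in facts for pred, obj in constraints)


# "Block A is below B and block B is below C."
facts = close_scene([Relation("A", "below", "B"), Relation("B", "below", "C")])
# Question: which block is above A?
answers = [c for c in ["A", "B", "C"] if satisfies(c, [("above", "A")], facts)]
```

Here both B and C satisfy the constraint, which corresponds to a multi-positive query. Block C is only recovered by composing the two stated relations.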
Training Data That May Help
Useful training data includes textual spatial QA, scene-graph question answering, synthetic block-world relation questions, and relation-composition hard negatives. Training should preserve multiple positives where several answer phrases are valid. The Nano split's scene queries, qrels, and answer candidates should be excluded from training.
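One way to preserve multiple positives during training is a contrastive loss that credits the probability mass assigned to every valid answer. The sketch below is one common formulation and is offered as an assumption, not as the objective behind the `multi_positive_objective` label; the scores and temperature are illustrative.

```python
import math


def multi_positive_nll(scores, positive_indices, temperature=1.0):
    """Negative log of the softmax mass assigned to all positive candidates."""
    scaled = [score / temperature for score in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - shift) for value in scaled]
    positive_mass = sum(exps[i] for i in positive_indices)
    return -math.log(positive_mass / sum(exps))


# Hypothetical similarity scores for three candidates; the first two are valid.
scores = [2.0, 2.0, 0.0]
loss_multi = multi_positive_nll(scores, [0, 1])
# Treating only the first answer as positive penalises the second valid one.
loss_single = multi_positive_nll(scores, [0])
```

With a single-positive objective, the second valid answer acts as a negative and is pushed away from the query, which is the failure this kind of loss is meant to avoid.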
Synthetic data can generate text scenes with blocks, objects, colors, sizes, and relative positions. Questions should require relation composition rather than matching a single object mention. Negatives should reuse the same object vocabulary while violating the queried spatial relation.
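The generation recipe above can be sketched with a small stacked-object scene. This is an illustrative generator under simplifying assumptions (one block, a single vertical ordering, hypothetical vocabulary and field names), not the SpartQA generation code.

```python
import itertools
import random

SIZES = ["small", "medium", "big"]
COLORS = ["black", "blue", "yellow"]
SHAPES = ["circle", "square", "triangle"]


def make_example(rng):
    """Build one query with positives and same-vocabulary hard negatives."""
    vocab = [" ".join(parts) for parts in itertools.product(SIZES, COLORS, SHAPES)]
    stack = rng.sample(vocab, 4)  # four distinct objects, listed top to bottom
    scene = " ".join(
        f"The {upper} is above the {lower}." for upper, lower in zip(stack, stack[1:])
    )
    idx = rng.choice([1, 2])  # reference object is never the top or the bottom
    relation = rng.choice(["above", "below"])
    above, below = stack[:idx], stack[idx + 1:]
    positives, negatives = (above, below) if relation == "above" else (below, above)
    query = f"Block A has four objects. {scene} Which object is {relation} the {stack[idx]}?"
    return {"query": query, "positives": positives, "hard_negatives": negatives}


example = make_example(random.Random(0))
```

Only adjacent pairs are stated in the scene, so any positive that is two positions away from the reference object requires composing relations. The hard negatives reuse objects from the same scene but lie on the wrong side of the reference.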
Model Improvement Notes
Dense retrievers should encode object binding and relation composition, not only scene topic. Sparse systems can preserve object labels but need reasoning rerankers. Cross-encoders or structured rerankers should parse the query into a scene graph and test candidate answers against all constraints.
For hybrid systems, NanoMMTEB-v2 / spart_qa is a positive hybrid case: reranking_hybrid wins all three provided metrics. The remaining opportunity is not broader retrieval alone but spatially faithful reranking.
Example Data
| Query | Positive document |
| There are three blocks. Lets call them A, B and C. Block A is below B and block B is below C. Block A has one small yellow circle. Block B has a big black square and a big blue square. To the left of and above a medium blue circle is the big black square. Above and to the left of the medium blue circle there is the big blue square. Block C has a big black square and a medium blue square. The big black square is above a big blue circle. The medium blue square is to the left of the big blue circle... [500 / 797 chars] | both of them [12 chars] |
| We have three blocks, A, B and C. Blocks B and C are above A. Block A contains one medium black square and a medium blue square. Below the medium blue square there is the medium black square. Block B contains one medium yellow square and a medium blue square. The medium yellow square is below the medium blue square. And block C has one medium blue square. Which object is above a medium square? the medium blue square that is in block C or the medium blue square that is in block B? [484 chars] | both of them [12 chars] |
| We have three blocks, A, B and C. Block B is below block C and it is to the left of block A. Block A has a small black triangle. Block B has a medium black triangle, one big blue circle and one small blue triangle. The big blue circle is to the right of and below the small blue triangle. Far from and to the right of the small blue triangle there is the medium black triangle. Block C contains one small black triangle and a small black circle. Above the small black circle there is the small black... [500 / 669 chars] | both of them [12 chars] |
Source Reference Table
| Title | Year | Type | URL |
| SpartQA: A Textual Question Answering Benchmark for Spatial Reasoning | 2021 | task paper | https://arxiv.org/abs/2104.05832 |
| SpartQA generation repository | 2021 | repository | https://github.com/HLR/SpartQA_generation |
| mteb/SpartQA | 2024 | dataset card | https://huggingface.co/datasets/mteb/SpartQA |
Dataset Information
| Field | Value |
| Nano set | NanoMMTEB-v2 |
| Backing dataset | NanoMMTEB-v2 |
| Task / split | spart_qa |
| Hugging Face dataset | hakari-bench/NanoMMTEB-v2 |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 1,592 |
| Positive qrels | 384 |
| Positives / query avg | 1.92 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 3 |
| Multi-positive queries | 92 (46.00%) |
| Query length avg chars | 654.85 |
| Document length avg chars | 49.80 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.1848 | 0.3800 | 0.5260 | top-500 |
| Dense | harrier_oss_v1_270m | 0.2591 | 0.4900 | 0.4870 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.3382 | 0.5700 | 0.5469 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: test
- Train/eval overlap audit: not_audited
- Leakage note: do not train on this Nano split's scene queries, qrels, or answer candidates
- Multi-positive training: multi_positive_objective
- Useful training data: textual spatial QA, scene-graph question answering, synthetic block-world relation questions, relation-composition hard negatives