NanoRARb / NanoTempReasonL3Pure

Overview

NanoTempReasonL3Pure is an English temporal reasoning retrieval task from NanoRARb. The query is a compact before/after temporal question without supporting facts, and the relevant document is the correct short answer string. Each query has one positive. This split combines the sparse evidence problem of pure retrieval with the harder ordering requirement of L3 TempReason. Dense retrieval is the strongest profile, reranking_hybrid improves over BM25, and BM25 remains near zero because the answer is usually not lexically present in the query.

Details

What the Original Data Measures

RAR-b reformulates TempReason reasoning tasks as retrieval over answer candidates. TempReason evaluates whether systems can reason about dates, intervals, and temporal order.

The L3 pure variant removes the supporting facts and asks a harder relative temporal question. The model must retrieve the correct predecessor, successor, or ordered answer using knowledge encoded in its representation rather than evidence included in the query.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 65.13 characters, while answer documents average 19.88 characters.

Examples ask who chaired Technical University of Munich before Wolfgang A. Herrmann, who led Romania before Alexandru G. Golescu, which parliamentary position Lord Douglas Gordon-Hallyburton held before a later Parliament, which employer Eduard Winkelmann joined after the Imperial University of Dorpat, and who led Romania after Adrian Nastase.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.0074, hit@10 of 0.0150, and recall@100 of 0.0950. This is a very weak lexical baseline.

The query contains an anchor entity and a temporal relation, while the correct answer is a different short string. BM25 cannot infer a missing timeline or identify the neighboring state, so it only succeeds when the answer has accidental overlap with the query.

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.0707, hit@10 of 0.1550, and recall@100 of 0.5400. Dense retrieval is clearly the strongest method among the three reported profiles.

The score remains modest because the task asks for implicit temporal knowledge. Embedding similarity can connect some anchors and likely answer strings, but it often confuses adjacent timeline entities or more prominent alternatives.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 117 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.0238, hit@10 of 0.0600, and recall@100 of 0.4150. Hybrid retrieval improves over BM25 but is weaker than dense retrieval.

This shows that sparse matching contributes little when supporting facts are absent. Hybrid retrieval may still increase candidate diversity, but the main useful signal comes from dense association and temporal knowledge.

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 measures early placement of the exact answer, hit@10 measures whether the answer appears in the first ten candidates, and recall@100 measures reranker availability.

For NanoTempReasonL3Pure, recall@100 is a proxy for whether a retriever can find plausible timeline answers at all. High-quality ranking also requires distinguishing the correct ordered answer from related but temporally invalid candidates.

Query and Relevance Type Tendencies

Queries are short before/after questions. Relevant documents are short names, offices, institutions, or role strings. The target answer is usually not named in the query and must be inferred from an external temporal sequence.

Relevance is exact ordered validity. The answer must satisfy the specified before or after relation, not merely belong to the same entity timeline.

Representative Failure Modes

Common failures include retrieving the anchor entity, choosing a more famous office holder, selecting an adjacent but wrong timeline entry, ignoring the before/after direction, and overranking semantically related answer strings. BM25 mostly lacks usable evidence; dense retrieval can recover candidates but may not encode exact ordering.

Training Data That May Help

Useful training data includes temporal knowledge-base QA, before/after entity retrieval, succession timelines, date-conditioned contrastive examples, and hard negatives from neighboring timeline entries. Evaluation queries, answers, and qrels should be excluded.

Model Improvement Notes

Models need stronger temporal knowledge and direction-sensitive retrieval. Hard negatives should share the same anchor timeline but differ in before/after relation or adjacency. This split is useful for measuring whether a retriever can use implicit temporal memory rather than copied context.

Example Data

Query	Positive document
Who was the chair of Technical University of Munich before Wolfgang A. Herrmann? [80 chars]	Otto Meitinger [14 chars]
Who was the head of Romania before Alexandru G. Golescu? [56 chars]	Dimitrie Ghica [14 chars]
Which position did Lord Douglas Gordon-Hallyburton hold before Member of the 13th Parliament of the United Kingdom? [115 chars]	Member of the 12th Parliament of the United Kingdom [51 chars]

Source Reference Table

Title	Year	Type	URL
RAR-b: Reasoning as Retrieval Benchmark	2024	arXiv paper	https://arxiv.org/abs/2404.06347
Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models	2023	arXiv paper	https://arxiv.org/abs/2306.08952

Dataset Information

Field	Value
Nano set	NanoRARb
Backing dataset	NanoRARb
Task / split	NanoTempReasonL3Pure
Hugging Face dataset	hakari-bench/NanoRARb
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	65.13
Document length avg chars	19.88

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.0074	0.0150	0.0950	top-500
Dense	`harrier_oss_v1_270m`	0.0707	0.1550	0.5400	top-500
Reranking hybrid	`reranking_hybrid`	0.0238	0.0600	0.4150	top-100