NanoRARb / NanoTempReasonL2Pure
Overview
NanoTempReasonL2Pure is an English temporal reasoning retrieval task from NanoRARb. The query is a temporal question without supporting facts, and the relevant document is the correct short answer string. Each query has one positive answer. This is the hardest L2 formulation because retrieval must rely on temporal knowledge encoded in the model rather than on facts included in the query. BM25 is effectively unable to solve the top ranks, dense retrieval is the strongest profile, and reranking_hybrid improves candidate availability over BM25 but remains weaker than dense retrieval.
Details
What the Original Data Measures
RAR-b converts TempReason tasks into retrieval tasks where the target output is an answer document. TempReason is designed to test temporal reasoning over facts such as offices, heads of government, team membership, and dated roles.
The pure variant removes the supporting fact list from the query. The retriever sees only the question and date, so success depends on whether the representation can associate the question with the correct temporal answer without explicit evidence in the query text.
Observed Data Profile
The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 52.96 characters, while answer documents average 19.91 characters.
Examples ask which office Patricia de Lille held in September 2015, which parliamentary position Lord Douglas Gordon-Hallyburton held in October 1833, who led Russia in July 1999, which team Glynn Snodin played for in January 1992, and who led Romania in May 1935.
BM25 Evaluation Profile
The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.0000, hit@10 of 0.0000, and recall@100 of 0.0050. This is a near-complete failure mode for lexical retrieval.
The reason is structural. The query usually contains an entity, relation, and date, while the relevant document is only the answer string. Unless the answer words overlap with the query by accident, BM25 has little evidence to rank the correct document. Term frequency cannot supply missing temporal knowledge.
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.0483, hit@10 of 0.1050, and recall@100 of 0.4850. Dense retrieval is clearly stronger than BM25, but the absolute score is still low.
This profile shows that embedding similarity can recover some associations between question wording, entities, relations, dates, and plausible answer strings. However, the model must retrieve a short answer without seeing supporting evidence, so many correct answers are semantically distant from the query surface.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 134 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.0033, hit@10 of 0.0100, and recall@100 of 0.3300. Hybrid retrieval improves substantially over BM25 in recall@100 but is much weaker than dense retrieval at the top ranks.
This indicates that the sparse side contributes little for this pure setting. The hybrid pool can still expose additional candidates for a reranker, but the task mainly rewards dense semantic and memorized temporal associations rather than lexical matching.
Metric Interpretation for Model Researchers
With one positive per query, nDCG@10 reflects how early the exact answer string is ranked, hit@10 reflects whether it is available in the first ten candidates, and recall@100 reflects whether a second-stage reranker can access it.
For NanoTempReasonL2Pure, recall@100 is useful but should not be overread as reasoning success. A candidate can be retrieved from broad association, while a final model still needs to decide whether the answer is temporally valid for the target date.
Query and Relevance Type Tendencies
Queries are short temporal questions. Relevant documents are short entity, role, team, or office strings. The answer is determined by a temporal fact that is not included in the query.
Relevance is exact answer validity for the specified time. A historically related answer is wrong if it held before or after the target date.
Representative Failure Modes
Common failures include retrieving a more famous role for the same entity, choosing an adjacent office holder, overvaluing entity popularity, selecting an answer from the wrong time interval, and ranking semantically related but temporally invalid names. BM25 fails because the answer string is usually absent from the query; dense retrieval can still confuse nearby temporal states.
Training Data That May Help
Useful training data includes temporal QA without explicit facts, entity-time relation triples, historical office-holder timelines, sports roster timelines, and contrastive pairs with nearby dates. Evaluation queries, answers, and qrels should be excluded.
Model Improvement Notes
Models need stronger temporal knowledge and better date-conditioned retrieval behavior. Hard negatives should share the same entity and relation but differ by interval. This split is useful for identifying whether a retriever has learned temporal associations beyond lexical overlap.
Example Data
| Query | Positive document |
| Which position did Patricia de Lille hold in Sep, 2015? [55 chars] | mayor of Cape Town [18 chars] |
| Which position did Lord Douglas Gordon-Hallyburton hold in Oct, 1833? [69 chars] | Member of the 11th Parliament of the United Kingdom [51 chars] |
| Who was the head of Russia in Jul, 1999? [40 chars] | Sergei Stepashin [16 chars] |
Source Reference Table
| Title | Year | Type | URL |
| RAR-b: Reasoning as Retrieval Benchmark | 2024 | arXiv paper | https://arxiv.org/abs/2404.06347 |
| Towards Benchmarking and Improving the Temporal Reasoning Capability of Large Language Models | 2023 | arXiv paper | https://arxiv.org/abs/2306.08952 |
Dataset Information
| Field | Value |
| Nano set | NanoRARb |
| Backing dataset | NanoRARb |
| Task / split | NanoTempReasonL2Pure |
| Hugging Face dataset | hakari-bench/NanoRARb |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 200 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 52.95 |
| Document length avg chars | 19.91 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.0000 | 0.0000 | 0.0050 | top-500 |
| Dense | harrier_oss_v1_270m | 0.0483 | 0.1050 | 0.4850 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.0033 | 0.0100 | 0.3300 | top-100 |