NanoRARb / NanoRARbMath

Overview

NanoRARbMath is an English mathematical reasoning retrieval task from NanoRARb. It recasts math problem solving as retrieval: the query is a math word problem or formal problem, and the relevant document is the corresponding worked solution or answer text. Each query has exactly one positive solution. Unlike in many short-answer NanoRARb tasks, BM25 is already strong here (nDCG@10 of 0.6147) because problems and solutions often share equations, symbols, variables, and named quantities. Dense retrieval gives the best top-rank quality (nDCG@10 of 0.7818), while reranking_hybrid gives the best recall@100 (0.9750).

Details

What the Original Data Measures

RAR-b builds a pooled numerical reasoning retrieval task from MATH and GSM8K-style evaluation questions, with MetaMathQA-style answer material used to enlarge the corpus. Related source tasks include grade-school word problems, competition-style mathematical problem solving, and synthetic or bootstrapped math reasoning data.

In this retrieval setting, the target is not a general textbook passage. It is the solution text that corresponds to the query problem. The model must connect problem statements to the correct derivation among many math solutions.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 201.32 characters, while solution documents average 481.33 characters.

Examples include triangle geometry, rotation matrices, inverse trigonometric simplification, angle-addition identities, and law-of-cosines transformations. Positive documents often contain formulas, intermediate derivations, and final answers.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.6147, hit@10 of 0.7900, and recall@100 of 0.9450. This is a strong sparse profile. Mathematical notation, variables, expressions, and named operations are frequently repeated from problem to solution.

BM25 is therefore far more effective here than in many commonsense NanoRARb tasks. Its limitation is that shared symbols can also appear in near-miss solutions that solve a different quantity or apply a different transformation.
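The near-miss effect can be illustrated with a minimal Okapi BM25 sketch. The tokenizer, the `k1` and `b` values, and the function names are assumptions chosen for illustration; they are not the configuration used to build the `bm25` subset.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in docs against query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores
```

On a toy corpus with a clockwise-rotation problem, a matching solution, a counterclockwise near-miss, and an unrelated word problem, the near-miss would be expected to score close to the matching solution, since the two differ in a single token.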

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.7818, hit@10 of 0.8850, and recall@100 of 0.9400. Dense retrieval is the strongest top-rank profile. It improves ranking quality even though its recall@100 is slightly below that of BM25 (0.9400 versus 0.9450, a difference of one query out of 200).

This suggests that embeddings capture some mathematical problem-solution relation beyond symbol overlap. Dense retrieval can connect a problem to the kind of derivation needed, not only to repeated notation.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with five rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.7350, hit@10 of 0.8700, and recall@100 of 0.9750. Hybrid retrieval has the best candidate coverage but lower top-rank quality than dense retrieval (nDCG@10 of 0.7350 versus 0.7818).

The contrast between the dense and hybrid profiles helps separate first-stage ranking quality from reranking pool quality. Dense retrieval is better as a ranked list, while hybrid retrieval gives a reranker the broadest access to the gold solution.
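The idea of a reranking pool with a rank-101 safeguard can be sketched as follows. Two assumptions are made loudly here: the source does not say how reranking_hybrid merges its lists, so reciprocal rank fusion is swapped in as a stand-in, and the safeguard is assumed to append the gold document at rank 101 when it is missing from the top 100. That reading would be consistent with recall@100 of 0.9750 (195 of 200 queries) and five safeguarded rows, but it is not confirmed by the source. The function names are hypothetical.

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion over several ranked lists of document ids."""
    scores = {}
    for ranking in rankings:
        for pos, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + pos)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

def build_pool(bm25_ranking, dense_ranking, gold_id, top_n=100, safeguard=True):
    """Fuse two rankings, keep top_n, and optionally append a missing gold id."""
    pool = rrf_fuse([bm25_ranking, dense_ranking])[:top_n]
    if safeguard and gold_id not in pool:
        # Assumed safeguard behavior: the gold document lands at rank top_n + 1.
        pool.append(gold_id)
    return pool
```

When evaluating a first-stage retriever, metrics should be computed on the first `top_n` entries only, so that the safeguard does not inflate recall@100.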

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 measures how early the correct solution appears, hit@10 measures whether it is in the first ten results, and recall@100 measures whether a later reranker can see it.
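Because there is a single positive with binary relevance, the ideal DCG is 1, so nDCG@10 reduces to 1 / log2(1 + rank) when the positive is within the top ten and 0 otherwise. A minimal sketch of the three per-query metrics, with a hypothetical function name and dictionary keys:

```python
import math

def single_positive_metrics(rank, k_top=10, k_recall=100):
    """Per-query metrics when exactly one document is relevant.

    rank is the 1-based position of the positive, or None if it was not retrieved.
    """
    found = rank is not None
    in_top = found and rank <= k_top
    return {
        # Ideal DCG is 1 with a single positive, so nDCG equals the discounted gain.
        "ndcg": 1.0 / math.log2(rank + 1) if in_top else 0.0,
        "hit": 1.0 if in_top else 0.0,
        "recall": 1.0 if found and rank <= k_recall else 0.0,
    }
```

For example, a positive at rank 3 yields nDCG@10 of 0.5, while a positive at rank 11 scores 0 on nDCG@10 and hit@10 but still counts toward recall@100. The reported figures are averages of these per-query values over the 200 queries.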

For NanoRARbMath, BM25 is a serious baseline because of notation overlap, dense retrieval is the best top-rank baseline, and hybrid retrieval is the best coverage baseline. Improvements should demonstrate mathematical reasoning alignment, not just equation matching.

Query and Relevance Type Tendencies

Queries are math problems with variables, diagrams encoded as text, trigonometric expressions, geometry conditions, or algebraic constraints. Relevant documents are worked solutions with derivation steps and final values.

Relevance is exact problem-solution correspondence. A solution with similar symbols or topic is wrong if it solves a different problem or reaches a different result.

Representative Failure Modes

Common failures include retrieving a solution with similar formulas but a different target quantity, confusing angle identities, matching geometry diagrams by variable names rather than constraints, and selecting a generic worked solution that shares notation but not the problem. BM25 overweights symbol overlap; dense retrieval can still blur similar solution templates.

Training Data That May Help

Useful training data includes math problem-to-solution retrieval, GSM8K and MATH-style reasoning pairs outside the evaluation examples, verifier data, and synthetic worked solutions with near-miss distractors. Evaluation queries, solutions, and answer-pool entries should be excluded.
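One simple way to apply the exclusion is an exact-match filter on normalized text. The sketch below is an assumption about how such a filter might look, not a procedure described by the source; the helper names are hypothetical, and it only catches duplicates that differ in case or whitespace.

```python
import hashlib
import re

def normalize(text):
    # Lowercase and collapse whitespace so trivial formatting differences do not hide duplicates.
    return re.sub(r"\s+", " ", text.strip().lower())

def fingerprint(text):
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def drop_eval_overlap(train_pairs, eval_texts):
    """Keep only (problem, solution) pairs whose texts do not appear in eval_texts."""
    blocked = {fingerprint(t) for t in eval_texts}
    return [
        (problem, solution)
        for problem, solution in train_pairs
        if fingerprint(problem) not in blocked and fingerprint(solution) not in blocked
    ]
```

Here `eval_texts` would hold every evaluation query, positive solution, and corpus document. Paraphrased or re-typeset copies would need a fuzzier check, such as n-gram overlap, which this sketch does not attempt.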

Model Improvement Notes

Models should learn to align a mathematical problem with the correct derivation path. Hard negatives should share symbols, equation forms, or topic tags but solve for a different value or apply an invalid step. Hybrid candidate generation is useful for reranking, but top-rank models should reason over mathematical structure.
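A hedged sketch of how symbol-sharing hard negatives might be mined: rank solutions to other problems by token overlap with the gold solution and keep the closest ones. Jaccard similarity over alphanumeric tokens is an assumption chosen for simplicity, and the function names are hypothetical; a real pipeline might use BM25 or embedding similarity instead.

```python
import re

def tokens(text):
    # Hypothetical tokenizer: lowercase alphanumeric runs as a set.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def mine_hard_negatives(positive, candidates, n=2):
    """Return the n candidate solutions that overlap most with the positive.

    candidates are assumed to be solutions to other problems.
    """
    scored = sorted(
        ((jaccard(positive, c), c) for c in candidates if c != positive),
        key=lambda pair: (-pair[0], pair[1]),
    )
    return [c for _, c in scored[:n]]
```

For a law-of-cosines solution, this would tend to surface another law-of-cosines derivation that solves for a different side before an unrelated word problem, which is the kind of negative the paragraph above describes.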

Example Data

Query	Positive document
Problem: Let $ABC$ be a triangle with $\angle A = 45^\circ$. Let $P$ be a point on side $\overline{BC}$ with $PB = 3$ and $PC = 5$. Let $O$ be the circumcenter of triangle $ABC$. Determine the length $OP$. [205 chars]	Using the extended Sine law, we find the circumradius of $ABC$ to be $R = \frac{BC}{2\sin A} = 4\sqrt 2$. [asy] unitsize(0.8 cm); pair A, B, C, O, P; A = (0,0); B = (2,2); C = (5,0); P = interp(B,C,3/8); O = circumcenter(A,B,C); draw(A--B--C--cycle); draw(circumcircle(A,B,C)); draw(O--P); label("$A$", A, W); label("$B$", B, N); label("$C$", C, E); dot("$O$", O, S); dot("$P$", P, NE); [/asy] By considering the power of point $P$, we find that $R^2 - OP^2 = PB \cdot PC = 15$. So $OP = \sqrt{R^2 - 15} = \sqrt{ 16 \cdot 2 - 15} = \boxed{\sqrt{17}}$. [557 chars]
Problem: Find the matrix that corresponds to rotating about the origin by an angle of $45^\circ$ clockwise. [107 chars]	The transformation that rotates about the origin by an angle of $45^\circ$ clockwise takes $\begin{pmatrix} 1 \\ 0 \end{pmatrix}$ to $\begin{pmatrix} 1/\sqrt{2} \\ -1/\sqrt{2} \end{pmatrix}$ and $\begin{pmatrix} 0 \\ 1 \end{pmatrix}$ to $\begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix},$ so the matrix is \[\boxed{\begin{pmatrix} 1/\sqrt{2} & 1/\sqrt{2} \\ -1/\sqrt{2} & 1/\sqrt{2} \end{pmatrix}}.\] [406 chars]
Problem: Compute $\sin^{-1} (\sin 3) + \sin^{-1} (\sin 4) + \sin^{-1} (\sin 5).$ All functions are in radians. [111 chars]	Since $\sin (\pi - 3) = \sin 3$ and $-\frac{\pi}{2} \le \pi - 3 \le \frac{\pi}{2},$ \[\sin^{-1} (\sin 3) = \pi - 3.\]Since $\sin (\pi - 4) = \sin 4$ and $-\frac{\pi}{2} \le \pi - 4 \le \frac{\pi}{2},$ \[\sin^{-1} (\sin 4) = \pi - 4.\]Since $\sin (5 - 2 \pi) = \sin 5$ and $-\frac{\pi}{2} \le 5 - 2 \pi \le \frac{\pi}{2},$ \[\sin^{-1} (\sin 5) = 5 - 2 \pi.\]Therefore, \[\sin^{-1} (\sin 3) + \sin^{-1} (\sin 4) + \sin^{-1} (\sin 5) = (\pi - 3) + (\pi - 4) + (5 - 2 \pi) = \boxed{-2}.\] [484 chars]

Source Reference Table

Title	Year	Type	URL
RAR-b: Reasoning as Retrieval Benchmark	2024	arXiv paper	https://arxiv.org/abs/2404.06347
Training Verifiers to Solve Math Word Problems	2021	arXiv paper	https://arxiv.org/abs/2110.14168
Measuring Mathematical Problem Solving With the MATH Dataset	2021	arXiv paper	https://arxiv.org/abs/2103.03874
MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models	2023	arXiv paper	https://arxiv.org/abs/2309.12284

Dataset Information

Field	Value
Nano set	NanoRARb
Backing dataset	NanoRARb
Task / split	NanoRARbMath
Hugging Face dataset	hakari-bench/NanoRARb
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	201.32
Document length avg chars	481.33

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.6147	0.7900	0.9450	top-500
Dense	`harrier_oss_v1_270m`	0.7818	0.8850	0.9400	top-500
Reranking hybrid	`reranking_hybrid`	0.7350	0.8700	0.9750	top-100
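
Since every query has exactly one positive, Hit@10 and Recall@100 are fractions of the 200 queries and can be read as query counts. The short sketch below does that conversion using the values from the table above; the constant and function names are hypothetical.

```python
N_QUERIES = 200  # from the Dataset Information table

# (Hit@10, Recall@100) copied from the Candidate Subsets table.
PROFILES = {
    "bm25": (0.7900, 0.9450),
    "harrier_oss_v1_270m": (0.8850, 0.9400),
    "reranking_hybrid": (0.8700, 0.9750),
}

def to_count(rate, n_queries=N_QUERIES):
    # With one positive per query, each rate is a fraction of queries.
    return round(rate * n_queries)

def summarize(profiles=PROFILES):
    """Convert each profile's rates into numbers of queries."""
    return {
        name: {"hit@10": to_count(hit), "recall@100": to_count(recall)}
        for name, (hit, recall) in profiles.items()
    }
```

Read this way, BM25 places the gold solution in the top 100 for 189 queries, the dense profile for 188, and the hybrid profile for 195.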