NanoRARb / NanoRARbCode

Overview

NanoRARbCode is an English code reasoning retrieval task from NanoRARb. Queries are programming prompts, function signatures, and docstrings, while documents are candidate code implementations. Each query has one positive implementation. The task measures whether a retriever can map behavioral specifications to executable code, not merely match identifiers. BM25 and dense retrieval are both weak, with BM25 slightly ahead (nDCG@10 of 0.1318 versus 0.1173). The reranking_hybrid profile is the strongest (nDCG@10 of 0.1773), plausibly because it combines identifier overlap with semantic specification matching.

Details

What the Original Data Measures

RAR-b includes a code reasoning retrieval task built from program-synthesis and code-search style data. It uses HumanEvalPack and MBPP-style evaluation prompts, with answer corpora enlarged using code-search and instruction-tuned code data. Source references include CodeSearchNet and OctoPack.

In this task, the document is the code answer itself. A correct retrieval model must connect a natural-language specification and function signature to a code snippet that implements the required behavior.
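As a minimal sketch of this query/document relationship, the snippet below joins the signature from one query with the body of its positive document (both taken from the Example Data table) and executes the result. The helper name `assemble` is hypothetical, and the line breaks and indentation of the body are assumptions, since the flattened table text does not preserve the original whitespace.

```python
import textwrap

# Signature from the query side and body from the positive document.
# The layout of the body is reconstructed by hand (an assumption).
SIGNATURE = "def digitSum(s):"
BODY = '''
if s == "":
    return 0
return sum(ord(char) if char.isupper() else 0 for char in s)
'''

def assemble(signature: str, body: str) -> str:
    """Join a signature line and an unindented body into a full function definition."""
    return signature + "\n" + textwrap.indent(body.strip("\n"), "    ") + "\n"

namespace = {}
exec(assemble(SIGNATURE, BODY), namespace)  # only execute trusted code this way
digitSum = namespace["digitSum"]

print(digitSum("abAB"))  # sum of ord("A") and ord("B")
```

The assembled function reproduces the docstring examples, which is the sense in which a retrieved document "answers" the query.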

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 470.08 characters, and code documents average 256.00 characters.

Examples include summing ASCII codes for uppercase characters, collecting odd values from a Collatz sequence, returning a rounded average as binary, producing rolling maxima, and swapping case or reversing strings under conditions. Queries often include docstrings and examples; positives are compact Python implementations.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.1318, hit@10 of 0.2150, and recall@100 of 0.4450. BM25 is slightly stronger than dense retrieval on all three metrics (for example, nDCG@10 of 0.1318 versus 0.1173). It benefits from shared function names, identifiers, literals, and API words between prompt and implementation.

However, lexical retrieval remains weak. Many correct implementations use different variable names or concise logic that does not repeat the full behavioral specification. Matching words is not the same as matching program semantics.
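The gap between word overlap and program semantics can be illustrated with a small Okapi BM25 scorer. This is a sketch under stated assumptions: the tokenizer, the parameters `k1=1.2` and `b=0.75`, and the two distractor documents are illustrative choices, not the configuration used to build the `bm25` subset, which the source does not describe.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercased identifiers and integer literals; this tokenizer is an assumption."""
    return [t.lower() for t in re.findall(r"[A-Za-z_]\w*|\d+", text)]

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    doc_tokens = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in doc_tokens) / len(doc_tokens)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in doc_tokens for term in set(tokens))
    n = len(docs)
    query_terms = set(tokenize(query))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

query = ("def rounded_avg(n, m): compute the average of the integers "
         "from n through m and convert to binary")
docs = [
    # Correct implementation: concise, repeats almost none of the specification.
    "if m < n: return -1\nsummation = 0\nfor i in range(n, m+1): summation += i\n"
    "return bin(round(summation/(m - n + 1)))",
    # Hypothetical distractor: wrong behavior, but echoes the prompt's wording.
    "def average(n, m): return (n + m) / 2  # average of the integers n and m",
    # Hypothetical unrelated snippet.
    "return sorted(set(words), key=len)",
]
scores = bm25_scores(query, docs)
```

In this toy corpus the distractor outscores the correct implementation, because the correct code shares only the parameter names `n` and `m` with the prompt.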

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.1173, hit@10 of 0.2000, and recall@100 of 0.4350. Dense retrieval is close to BM25 but slightly weaker. This suggests that general-purpose embedding similarity has limited ability to represent executable behavior from code snippets.

Dense retrieval may capture broad task type, such as string processing or list iteration, but still fail to distinguish exact conditions, return formats, and edge cases.

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 85 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.1773, hit@10 of 0.3000, and recall@100 of 0.5750. Hybrid retrieval is the strongest of the three profiles, ahead of BM25 by 0.0455 nDCG@10, 0.0850 hit@10, and 0.1300 recall@100.
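The reported counts can be cross-checked with a little arithmetic. The sketch below assumes that the rank-101 safeguard applies exactly to queries whose positive is missing from the top 100; the source does not define the safeguard, so that reading is an assumption, though the numbers are consistent with it.

```python
QUERIES = 200
RECALL_AT_100 = 0.5750
HIT_AT_10 = 0.3000

# With one positive per query, recall@100 is the fraction of queries
# whose positive appears in the top 100.
found_in_top_100 = round(QUERIES * RECALL_AT_100)
missing_from_top_100 = QUERIES - found_in_top_100

# Likewise, hit@10 is the fraction of queries with the positive in the top 10.
found_in_top_10 = round(QUERIES * HIT_AT_10)

print(found_in_top_100, missing_from_top_100, found_in_top_10)
```

Under that assumption, 115 queries have their positive in the top 100 and the remaining 85 match the number of safeguard rows.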

The result shows that code retrieval benefits from combining sparse identifiers and literals with semantic task similarity. BM25 contributes exact names and syntax; dense retrieval contributes broader functional similarity. A reranker can exploit the combined pool to judge behavior more precisely.
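One common way to build a combined candidate pool from a sparse and a dense ranking is reciprocal rank fusion (RRF). The source does not say how the `reranking_hybrid` pool is constructed, so the sketch below is an illustrative stand-in rather than the actual procedure; the document ids and the constant `k=60` are assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into one pool of at most top_n ids."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc id for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_n]

bm25_ranking = ["d3", "d1", "d7", "d2"]    # hypothetical sparse ranking
dense_ranking = ["d5", "d2", "d1", "d9"]   # hypothetical dense ranking
pool = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Documents that both retrievers rank moderately well (`d1`, `d2`) move ahead of documents that only one retriever ranks first, which is the behavior a downstream reranker can exploit.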

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 measures how early the correct implementation appears within the top ten, hit@10 measures whether it appears in the top ten at all, and recall@100 measures whether it falls inside the 100-candidate pool that a reranker would see.
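For a single positive with binary relevance, the three metrics reduce to simple functions of the positive's rank. The following sketch assumes a gain of 1 for the positive and the standard log2 discount; the function name `single_positive_metrics` is hypothetical.

```python
import math

def single_positive_metrics(ranked_ids, positive_id):
    """Return (nDCG@10, hit@10, recall@100) for a query with exactly one positive."""
    try:
        rank = ranked_ids.index(positive_id) + 1  # 1-based rank
    except ValueError:
        rank = None  # positive was not retrieved at all
    in_top_10 = rank is not None and rank <= 10
    in_top_100 = rank is not None and rank <= 100
    # With one positive the ideal DCG is 1, so nDCG@10 is 1 / log2(rank + 1).
    ndcg10 = 1.0 / math.log2(rank + 1) if in_top_10 else 0.0
    return ndcg10, float(in_top_10), float(in_top_100)

ranked = [f"d{i}" for i in range(200)]          # hypothetical ranking
print(single_positive_metrics(ranked, "d2"))    # positive at rank 3
```

A positive at rank 1 scores nDCG@10 of 1.0, at rank 3 it scores 0.5, and anywhere below rank 10 it scores 0, so the dataset-level averages are dominated by how often the positive reaches the first few positions.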

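For NanoRARbCode, top-rank scores remain low: even the strongest profile places the correct implementation in the top ten for only 30% of queries (hit@10 of 0.3000). A strong system needs program-semantic retrieval or reranking that can reason about examples, edge cases, and code behavior.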

Query and Relevance Type Tendencies

Queries are code prompts with signatures, docstrings, and sometimes examples. Relevant documents are code snippets, often function bodies. The target relation is implementation correctness.

Relevance is behavioral. A candidate can share identifiers or APIs and still be wrong if it fails the specified edge case or output format.
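A short execution check makes the point concrete. The sketch below runs the `rounded_avg` positive from the Example Data table against the docstring examples, alongside a hypothetical near-miss that shares every identifier but omits the binary conversion. The helper `passes_examples` and the near-miss are illustrative assumptions, and the code layout is reconstructed by hand.

```python
def passes_examples(source, func_name, examples):
    """Execute candidate source and check it against (args, expected) pairs."""
    namespace = {}
    try:
        exec(source, namespace)  # unsafe for untrusted code; sandbox in practice
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in examples)
    except Exception:
        return False

# Input/output pairs copied from the query docstring.
EXAMPLES = [((1, 5), "0b11"), ((7, 5), -1), ((10, 20), "0b1111"), ((20, 33), "0b11010")]

POSITIVE = '''
def rounded_avg(n, m):
    if m < n:
        return -1
    summation = 0
    for i in range(n, m + 1):
        summation += i
    return bin(round(summation / (m - n + 1)))
'''

# Hypothetical near-miss: same identifiers and control flow, no binary conversion.
NEAR_MISS = POSITIVE.replace("bin(round(summation / (m - n + 1)))",
                             "round(summation / (m - n + 1))")

print(passes_examples(POSITIVE, "rounded_avg", EXAMPLES))
print(passes_examples(NEAR_MISS, "rounded_avg", EXAMPLES))
```

The near-miss would look almost identical to a lexical or embedding-based scorer, yet it fails the first example because it returns an integer instead of a binary string.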

Representative Failure Modes

Common failures include matching function names without behavior, confusing similar string or list transformations, missing output-format requirements such as binary conversion, and selecting code that handles common cases but fails edge cases. Dense retrieval can blur tasks with similar structure; BM25 can overvalue identifiers and comments.

Training Data That May Help

Useful training data includes code search, docstring-to-code retrieval, unit-test-backed program synthesis pairs, and HumanEval or MBPP-style tasks outside the evaluation queries. Evaluation prompts and solutions should be excluded.
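Excluding evaluation prompts can start with a normalized exact-match filter, sketched below. The function names and the `(prompt, code)` pair layout are assumptions for illustration, and this check catches only prompts that are identical after case and whitespace normalization; paraphrased or partially overlapping prompts would need a fuzzier method.

```python
import re

def normalize(prompt):
    """Lowercase and collapse whitespace so formatting differences do not hide overlap."""
    return re.sub(r"\s+", " ", prompt).strip().lower()

def drop_eval_overlap(train_pairs, eval_queries):
    """Keep only (prompt, code) training pairs whose prompt is not an evaluation query."""
    blocked = {normalize(q) for q in eval_queries}
    return [(p, c) for p, c in train_pairs if normalize(p) not in blocked]

eval_queries = ["def digitSum(s):\n    \"\"\"Sum the ASCII codes of uppercase characters.\"\"\""]
train_pairs = [
    # Same prompt as the evaluation query, differing only in whitespace and case.
    ("def digitSum(s):  \"\"\"sum the ascii codes of uppercase characters.\"\"\"", "return 0"),
    # Hypothetical unrelated training prompt.
    ("def reverse_words(s): \"\"\"Reverse the word order.\"\"\"", "return ' '.join(s.split()[::-1])"),
]
clean_pairs = drop_eval_overlap(train_pairs, eval_queries)
```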

Model Improvement Notes

Models should learn specification-to-implementation semantics. Hard negatives should share identifiers, APIs, or control-flow shape but fail tests or edge cases. Hybrid candidate generation is valuable, but final ranking likely needs code-aware reranking or execution-informed supervision.
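The hard-negative criterion can be sketched as a two-step filter: discard candidates that are actually correct, then rank the rest by identifier overlap with the positive. Everything here is illustrative: the names `identifiers`, `jaccard`, and `mine_hard_negatives` are hypothetical, Jaccard overlap is one possible similarity among many, and `is_correct` stands in for a real test harness.

```python
import re

def identifiers(code):
    """Set of identifier-like tokens in a snippet."""
    return set(re.findall(r"[A-Za-z_]\w*", code))

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def mine_hard_negatives(positive, candidates, is_correct, top_n=2):
    """Return incorrect candidates ranked by identifier overlap with the positive."""
    pos_ids = identifiers(positive)
    wrong = [c for c in candidates if c != positive and not is_correct(c)]
    wrong.sort(key=lambda c: jaccard(pos_ids, identifiers(c)), reverse=True)
    return wrong[:top_n]

positive = "return bin(round(total / count))"
candidates = [
    positive,
    "return round(total / count)",   # missing the binary conversion
    "return bin(total // count)",    # floors instead of rounding
    "return sorted(words)",          # unrelated, so an easy negative
]
# Stand-in for running unit tests: only the positive counts as correct here.
hard_negatives = mine_hard_negatives(positive, candidates, lambda c: c == positive)
```

The two near-miss snippets are selected while the unrelated one is dropped, which matches the stated goal of negatives that look right but behave wrongly.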

Example Data

Query	Positive document
Finish the following code based on the docstring: def digitSum(s): """Task Write a function that takes a string as input and returns the sum of the upper characters only' ASCII codes. Examples: digitSum("") => 0 digitSum("abAB") => 131 digitSum("abcCd") => 67 digitSum("helloE") => 69 digitSum("woArBld") => 131 digitSum("aAaaaXa") => 153 """ [412 chars]	if s == "": return 0 return sum(ord(char) if char.isupper() else 0 for char in s) [85 chars]
Finish the following code based on the docstring: def get_odd_collatz(n): """ Given a positive integer n, return a sorted list that has the odd numbers in collatz sequence. The Collatz conjecture is a conjecture in mathematics that concerns a sequence defined as follows: start with any positive integer n. Then each term is obtained from the previous term as follows: if the previous term is even, the next term is one half of the previous term. If the previous term is odd, the next term is 3 times... [500 / 892 chars]	if n%2==0: odd_collatz = [] else: odd_collatz = [n] while n > 1: if n % 2 == 0: n = n/2 else: n = n*3 + 1 if n%2 == 1: odd_collatz.append(int(n)) return sorted(odd_collatz) [275 chars]
Finish the following code based on the docstring: def rounded_avg(n, m): """You are given two positive integers n and m, and your task is to compute the average of the integers from n through m (including n and m). Round the answer to the nearest integer and convert that to binary. If n is greater than m, return -1. Example: rounded_avg(1, 5) => "0b11" rounded_avg(7, 5) => -1 rounded_avg(10, 20) => "0b1111" rounded_avg(20, 33) => "0b11010" """ [489 chars]	if m < n: return -1 summation = 0 for i in range(n, m+1): summation += i return bin(round(summation/(m - n + 1))) [141 chars]

Source Reference Table

Title	Year	Type	URL
RAR-b: Reasoning as Retrieval Benchmark	2024	arXiv paper	https://arxiv.org/abs/2404.06347
CodeSearchNet Challenge: Evaluating the State of Semantic Code Search	2019	arXiv paper	https://arxiv.org/abs/1909.09436
OctoPack: Instruction Tuning Code Large Language Models	2023	arXiv paper	https://arxiv.org/abs/2308.07124

Dataset Information

Field	Value
Nano set	NanoRARb
Backing dataset	NanoRARb
Task / split	NanoRARbCode
Hugging Face dataset	hakari-bench/NanoRARb
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	470.08
Document length avg chars	256.00

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.1318	0.2150	0.4450	top-500
Dense	`harrier_oss_v1_270m`	0.1173	0.2000	0.4350	top-500
Reranking hybrid	`reranking_hybrid`	0.1773	0.3000	0.5750	top-100