HAKARI-Bench

NanoCoIR / NanoApps

Overview

NanoApps is an English code-retrieval task in NanoCoIR, adapted from the APPS programming-challenge benchmark through CoIR. The query is a full competitive-programming problem statement, often including constraints, examples, and detailed input/output rules. The target document is a Python solution program that solves the stated problem.

This task is useful because it stresses retrieval across a large modality gap. The query is long natural-language algorithmic prose, while the relevant document is compact executable code. A model must infer algorithmic intent from the statement, connect it to implementation structure, and avoid being distracted by solutions for problems that have similar input formats or contest-style wording. It is therefore closer to problem-to-solution retrieval than API search or docstring-code matching.

Details

What the Original Data Measures

CoIR converts APPS into a text-to-code retrieval task. APPS itself contains programming problems collected from open-access coding challenge sites, with candidate solutions validated by test cases. In the CoIR formulation, the natural-language problem description becomes the query and code solutions form the searchable corpus.

The original data measures whether a retrieval system can recognize the implementation that satisfies a full algorithmic specification. Relevance depends on behavior, constraints, and input/output semantics, not on the presence of shared identifiers. For example, two programs may both parse integers and loop over test cases, but only one implements the correct greedy, dynamic-programming, graph, arithmetic, or simulation logic for the query.

Observed Data Profile

This Nano split contains 200 queries, 8,754 documents, and 200 positive qrels. Each query has exactly one positive solution. Queries are long, averaging 1,675.42 characters, while documents average 573.12 characters. This imbalance is central to the task: the problem statement may be several paragraphs, but the correct solution may be a short Python program with sparse natural-language overlap.

Observed queries include competitive-programming stories and mathematical constraints about movement on strings, contest archive validation, repeated attacks on a monster, stick-length adjustment, and digit deletion. Positive documents are concise Python solutions using loops, sorting, arithmetic checks, counters, and standard input parsing. The retrieval problem is to connect those implementation patterns back to the full problem behavior.

BM25 Evaluation Profile

BM25 performs very poorly on NanoApps. Its nDCG@10 is 0.0084, hit@10 is 0.0150, and recall@100 is 0.0750 with a top-500 candidate pool. Only a small fraction of positives are recovered near the top, which shows that word overlap between problem statements and solution code is usually weak.

This is the expected failure mode for lexical retrieval on APPS-style tasks. Problem statements contain story text, constraints, mathematical notation, and examples. Solution programs contain variable names, control flow, library imports, and arithmetic operations. BM25 can match tokens such as input, print, numbers, or occasional problem-specific words, but these are rarely sufficient to identify the correct algorithm. In this split, term frequency mostly captures superficial programming boilerplate rather than relevance.

Dense Evaluation Profile

The dense harrier-oss-270m candidate subset is the strongest among the available profiles, with nDCG@10 of 0.2528, hit@10 of 0.3500, and recall@100 of 0.6700. The dense result is far above BM25, indicating that embedding similarity captures some relationship between algorithmic descriptions and implementation patterns.

The score is still far from saturated. Dense retrieval can often group a statement with solutions that use compatible algorithm families, but it may confuse problems with similar contest structure, similar input signatures, or similar mathematical constraints. Since there is only one positive per query, rank precision is sensitive to near-miss programs that solve a related but different problem. This makes NanoApps a strong diagnostic for whether code retrievers understand full task semantics rather than short textual hints.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.1655, hit@10 of 0.2750, and recall@100 of 0.5400. It uses top-100 candidates with optional rank-101 safeguard rows; 92 rows contain 101 candidates, and 92 safeguard positives were needed. Hybrid retrieval is much better than BM25 but weaker than dense retrieval on this split.

This pattern is informative. Hybrid search usually helps when lexical and dense signals are complementary, but NanoApps has a lexical signal that is extremely noisy. Adding BM25-style evidence can pull in code that shares surface programming tokens without solving the problem. The hybrid set still benefits from dense candidates, yet the dense-only profile is the clearer retrieval signal for this particular task.

Metric Interpretation for Model Researchers

NanoApps is a dense-favored code retrieval task. The main difficulty is cross-representational matching from long natural-language specifications to short implementations. BM25 failure should not be interpreted as a broken candidate source; it reveals that the relevant evidence is behavioral rather than lexical.

Recall@100 is especially important. Dense retrieval recovers 67.00% of positives by rank 100, while BM25 recovers only 7.50%. This large gap means that a reranker built on BM25-only candidates would miss most correct solutions before reranking begins. For model researchers, NanoApps is a good place to test models that encode algorithms, constraints, and input-output behavior into a shared text-code representation.

Query and Relevance Type Tendencies

Queries are long contest problem statements. They may contain narrative framing, formal constraints, examples, edge cases, and required output formats. The relevant document is a working Python solution that implements the required algorithm, often with little or no explanatory text.

Relevance is behavioral. A correct solution may use arbitrary variable names and minimal comments, so matching must rely on inferred computation. The model needs to recognize patterns such as scanning a string in fixed jumps, validating identifiers, minimizing adjustment cost, computing repeated attack counts, or counting trailing zeros after digit deletion.

Representative Failure Modes

A frequent failure mode is retrieving a solution with the same input shape but a different algorithm. Many competitive-programming programs read n, arrays, strings, or multiple test cases, so shallow code structure is not enough. Another failure mode is over-weighting story words from the query, which often never appear in the code.

Dense systems may retrieve solutions from the right broad family but miss a constraint-specific detail, such as whether the optimization is over all target lengths, whether obstacles block movement, or whether a monster can be defeated in one step. Hybrid systems may inherit these semantic errors while also adding boilerplate-heavy lexical distractors.

Training Data That May Help

Useful training data includes APPS-style problem-to-solution pairs, competitive-programming solutions with hard negatives, and long specification-to-code retrieval data. Hard negatives should be selected from programs with similar input formats or similar algorithm families but different required behavior.

Leakage control matters. The Nano split is derived from CoIR APPS test-side retrieval data. Training should exclude NanoApps queries, qrels, and positive solution documents, and should not train on APPS test-derived rows. A safer filter removes rows by normalized query text, normalized solution text, and token-fingerprint overlap.

Model Improvement Notes

Improving NanoApps requires better alignment between natural-language problem semantics and code behavior. Stronger models should represent constraints, examples, and algorithmic objectives, not just topic words or code tokens. Pretraining or contrastive training over full problem statements and verified solutions is likely more useful than short docstring-code pairs alone.

Evaluation should also inspect candidate recall before reranking. A reranker cannot recover positives that are absent from the candidate pool, and BM25 candidates are especially weak here. Dense candidate generation is the more appropriate base for this task, with reranking focused on distinguishing near-miss solutions that share broad structure.

Example Data

QueryPositive document
On the way to Rio de Janeiro Ostap kills time playing with a grasshopper he took with him in a special box. Ostap builds a line of length n such that some cells of this line are empty and some contain obstacles. Then, he places his grasshopper to one of the empty cells and a small insect in another empty cell. The grasshopper wants to eat the insect. Ostap knows that grasshopper is able to jump to any empty cell that is exactly k cells away from the current (to the left or to the right). Note th... [500 / 2,427 chars]from math import * from sys import * from queue import * from decimal import * n,k=(int(z) for z in input().split()) s=input() i=0 while i<len(s) and s[i] not in ["G","T"]: i+=1 i+=k while i<len(s) and s[i] not in ["G","T","#"]: i+=k if i>=len(s) or s[i]=="#": print("NO") else: print("YES") [300 chars]
The All-Berland National Olympiad in Informatics has just ended! Now Vladimir wants to upload the contest from the Olympiad as a gym to a popular Codehorses website. Unfortunately, the archive with Olympiad's data is a mess. For example, the files with tests are named arbitrary without any logic. Vladimir wants to rename the files with tests so that their names are distinct integers starting from 1 without any gaps, namely, "1", "2", ..., "n', where n is the total number of tests. Some of the fi... [500 / 2,762 chars]n = int(input()) t = [1] + [0] * n b, a = d = [], [] h, s = [], [] for i in range(n): f, k = input().split() d[int(k)].append(f) m = len(a) for i in a: if i.isdigit() and i[0] != '0': j = int(i) if 0 < j <= m: t[j] = 1 elif m < j <= n: t[j] = -1 else: s.append(i) else: s.append(i) for i in b: if i.isdigit() and i[0] != '0': j = int(i) if m < j <= n: t[j] = 1 elif 0 < j <= m: t[j] = -1 else: s.append(i) else: s.append(i) x = [j for j in range(1, m + 1) if t[j] < 0] y = [j for j in range(m + 1, n + 1) if t[j] < 0] u = [j for j in range(1, m + 1) if not t[j]] v = [j for j in range(m + 1, n + 1) if not t[j]] if not s and (x or y): s = ['0'] if y: i = y.pop() v.append(i) else: i = x.pop() u.append(i) h.append(str(i) + ' 0') t[i] = 0 while x or y: if v and x: i = x.pop() j = v.pop() t[j] = 1 h.append(str(i) + ' ' + str(j)) u.append(i) else: u, v, x, y = v, u, y, x k = 1 for j in s: while t[k] == 1: k += 1 h.append(j + ' ' + str(k)) k += 1 d = '\nmove ' print(str(len(h)) + d + d.join(h) if h... [1,000 / 1,311 chars]
You are fighting with Zmei Gorynich — a ferocious monster from Slavic myths, a huge dragon-like reptile with multiple heads! $m$ Initially Zmei Gorynich has $x$ heads. You can deal $n$ types of blows. If you deal a blow of the $i$-th type, you decrease the number of Gorynich's heads by $min(d_i, curX)$, there $curX$ is the current number of heads. But if after this blow Zmei Gorynich has at least one head, he grows $h_i$ new heads. If $curX = 0$ then Gorynich is defeated. You can deal each blow... [500 / 2,047 chars]for _ in range(int(input())): n, x = list(map(int, input().split())) A = [] for _1 in range(n): d, h = list(map(int, input().split())) A.append([d, h]) A.sort(reverse=True) if A[0][0] >= x: print(1) else: x -= A[0][0] mz = 0 for d, h in A: mz = max(mz, d - h) if mz: print((x + mz - 1) // mz + 1) else: print(-1) [434 chars]

Source Reference Table

SourceRole
CoIR: A Comprehensive Benchmark for Code Information Retrieval ModelsBenchmark paper that adapts APPS and other code resources into retrieval tasks.
Measuring Coding Challenge Competence With APPSSource task paper for the programming-problem benchmark.
codeparrot/appsPublic Hugging Face dataset card for APPS.
hakari-bench/NanoCoIRNano benchmark dataset containing this split.

Dataset Information

FieldValue
Nano setNanoCoIR
Backing datasetNanoCoIR
Task / splitNanoApps
Hugging Face datasethakari-bench/NanoCoIR
Languageen
Categorycode
Queries200
Documents8,754
Positive qrels200
Positives / query avg1.00
Positives / query min1
Positives / query median1.00
Positives / query max1
Multi-positive queries0 (0.00%)
Query length avg chars1,675.41
Document length avg chars573.12

Candidate Subsets

ProfileConfignDCG@10Hit@10Recall@100Candidates
BM25bm250.00840.01500.0750top-500
Denseharrier_oss_v1_270m0.25280.35000.6700top-500
Reranking hybridreranking_hybrid0.16550.27500.5400top-100

Training and Leakage Metadata