HAKARI-Bench

NanoRTEB / NanoHumanEval

Overview

NanoHumanEval is an English docstring-to-code retrieval task from NanoRTEB. The query is a compact Python programming task description, and the relevant document is the corresponding implementation body from HumanEval. Each query has one positive among 158 code documents. Dense retrieval is strong because it captures specification-to-implementation semantics, and it has the best hit@10 (0.7975). The reranking_hybrid profile has the best nDCG@10 (0.5770) and matches the dense recall@100 (0.9937). BM25 is useful but weaker (nDCG@10 of 0.3405) because short code bodies often do not share enough surface text with the natural-language description.

Details

What the Original Data Measures

HumanEval was introduced to evaluate large language models trained on code. The original benchmark contains hand-written Python programming problems, each with a function signature, a docstring, a reference solution, and unit tests used for functional-correctness evaluation.

RTEB converts the generation setting into retrieval. The model receives the function description as a query and must retrieve the matching implementation body. This makes the task closer to semantic code search than competitive-programming retrieval.

Observed Data Profile

The Nano split contains 158 queries, 158 documents, and 158 positive qrel rows. Every query has exactly one positive. Queries average 291.16 characters, while code documents average 176.99 characters.
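The profile figures above are simple counts and averages over queries, documents, and qrels. The following is a minimal sketch of how such a profile could be computed, assuming queries and documents are available as id-to-text dictionaries and qrels as (query id, document id) pairs. The function name `profile` and the toy records are illustrative and are not part of the benchmark tooling.

```python
from collections import Counter
from statistics import mean, median


def profile(queries, docs, qrels):
    """Summarize a retrieval split.

    queries, docs: dict mapping id -> text (assumed layout).
    qrels: list of (query_id, doc_id) positive pairs (assumed layout).
    """
    per_query = Counter(qid for qid, _ in qrels)
    counts = [per_query.get(qid, 0) for qid in queries]
    return {
        "queries": len(queries),
        "documents": len(docs),
        "positive_qrels": len(qrels),
        "positives_min": min(counts),
        "positives_median": median(counts),
        "positives_max": max(counts),
        "multi_positive_queries": sum(c > 1 for c in counts),
        "query_len_avg": mean(len(t) for t in queries.values()),
        "doc_len_avg": mean(len(t) for t in docs.values()),
    }


# Toy records, not the real split.
queries = {
    "q1": "Filter an input list of strings only for ones that start with a given prefix.",
    "q2": "Return the mean of a list of numbers.",
}
docs = {
    "d1": "return [x for x in strings if x.startswith(prefix)]",
    "d2": "return sum(numbers) / len(numbers)",
}
qrels = [("q1", "d1"), ("q2", "d2")]

stats = profile(queries, docs, qrels)
print(stats)
```

On the real split, the same summary should report 158 queries, 158 documents, 158 positive qrels, and no multi-positive queries if the data matches the profile described here.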

Example tasks include filtering strings by prefix, computing maximum nested-parentheses depth, checking whether a number is a product of three primes, counting fruit items from a string description, and validating balanced brackets.

BM25 Evaluation Profile

The BM25 candidate subset uses the full 158-document pool and reaches nDCG@10 of 0.3405, hit@10 of 0.5443, and recall@100 of 0.9051. BM25 works when descriptions and code share variable names, operators, or literal strings.

The weakness is semantic compression. A short implementation can satisfy a long docstring without repeating its words, and many Python functions share common tokens such as loops, returns, list comprehensions, or helper names.
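To make the lexical-overlap point concrete, here is a minimal from-scratch sketch of Okapi BM25 scoring. The tokenizer, the parameter values `k1=1.2` and `b=0.75`, and the three-document toy corpus are assumptions for illustration; the benchmark's actual BM25 configuration is not described in this card.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric/underscore tokens; an assumed, simple tokenizer.
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score every document in `docs` against `query` with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


query = "Filter an input list of strings only for ones that start with a given prefix."
docs = [
    "return [x for x in strings if x.startswith(prefix)]",
    "return [x for x in strings if substring in x]",
    "return sum(numbers) / len(numbers)",
]
scores = bm25_scores(query, docs)
print(scores)
```

In this toy case the correct body wins only because it repeats the literal `prefix`. The query word "start" never matches the token `startswith`, and the third body shares no query term at all, which is the kind of gap that lowers BM25 recall on short implementations.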

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses the full 158-document pool and reaches nDCG@10 of 0.5666, hit@10 of 0.7975, and recall@100 of 0.9937. Dense retrieval is the best profile for top-ten hit rate.

This shows that embedding similarity captures function-level intent better than lexical matching. It can connect a description such as checking balanced brackets to a stack or depth-counting implementation even when wording differs.
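As a rough illustration of embedding-based ranking, the sketch below orders documents by cosine similarity to a query vector. The three-dimensional vectors and document ids are made up for the example and are not outputs of harrier_oss_v1_270m; a real run would substitute embeddings from the encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings: a "balanced brackets" query and three code bodies.
query_vec = [0.9, 0.1, 0.2]
doc_vecs = {
    "stack_brackets": [0.8, 0.2, 0.1],
    "depth_counter": [0.7, 0.1, 0.3],
    "sum_list": [0.1, 0.9, 0.2],
}
ranking = rank_by_cosine(query_vec, doc_vecs)
print(ranking)
```

Both bracket-handling bodies score close to the query even though they would share few tokens with it, while the unrelated body falls far behind. The small gap between the two near neighbors also hints at why dense retrieval can confuse similar algorithmic intents.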

Reranking Hybrid Evaluation Profile

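The reranking_hybrid subset uses top-100 candidates, with 1 row receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.5770, hit@10 of 0.7405, and recall@100 of 0.9937. Hybrid retrieval slightly improves nDCG@10 over dense retrieval (0.5770 vs. 0.5666) and shares the same recall@100, but its hit@10 is lower (0.7405 vs. 0.7975). The nDCG gain therefore comes from placing the positives it does retrieve nearer the top, not from bringing more positives into the first ten.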

The result suggests that sparse code tokens and dense function semantics are complementary for early ordering. Exact symbols help when the query names operations or literals, while dense similarity handles the core specification.
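This card does not state how the sparse and dense rankings are combined. One common way to fuse two ranked lists is reciprocal rank fusion (RRF), sketched below as an assumption and not as the benchmark's method. The constant `k=60` and the document ids are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into one list.

    Each document earns 1 / (k + rank) from every list it appears in.
    Ties are broken by doc id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


bm25_ranking = ["d3", "d1", "d2"]   # hypothetical sparse order
dense_ranking = ["d1", "d2", "d3"]  # hypothetical dense order
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Here `d1` ends up first because it is near the top of both lists, which mirrors the idea that agreement between exact tokens and dense similarity is a useful signal for early ordering.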

Metric Interpretation for Model Researchers

With one positive per query, nDCG@10 measures how early the exact implementation appears, hit@10 measures whether it is in the first ten candidates, and recall@100 measures whether a reranker can access it.
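With a single positive, the ideal DCG is 1, so nDCG@10 reduces to 1 / log2(1 + rank) when the positive is at rank 10 or better and 0 otherwise. The sketch below computes the three per-query metrics from the rank of the positive; the helper names are illustrative, and the binary-gain, log2-discount form of nDCG is assumed.

```python
import math


def rank_of_positive(ranked_ids, positive_id):
    """1-based rank of the positive in a ranked list, or None if absent."""
    try:
        return ranked_ids.index(positive_id) + 1
    except ValueError:
        return None


def single_positive_metrics(rank):
    """Per-query nDCG@10, hit@10, and recall@100 for exactly one positive."""
    in_top_10 = rank is not None and rank <= 10
    in_top_100 = rank is not None and rank <= 100
    return {
        "ndcg@10": 1.0 / math.log2(rank + 1) if in_top_10 else 0.0,
        "hit@10": 1.0 if in_top_10 else 0.0,
        "recall@100": 1.0 if in_top_100 else 0.0,
    }


ranked = ["d7", "d2", "d5", "d1"]
rank = rank_of_positive(ranked, "d5")  # the positive sits at rank 3
metrics = single_positive_metrics(rank)
print(rank, metrics)
```

Averaging these per-query values over all queries gives the benchmark-level figures. A positive at rank 3 contributes 0.5 to nDCG@10, so moving positives from rank 3 to rank 1 changes nDCG@10 without changing hit@10.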

For NanoHumanEval, the candidate pool is small, so recall is high for dense and hybrid retrieval. The main research signal is early ranking among compact implementations with similar Python structure.

Query and Relevance Type Tendencies

Queries are concise function specifications, sometimes with examples or constraints. Relevant documents are short Python implementation bodies. The task often requires mapping natural language to control flow, string operations, arithmetic, or list processing.

Relevance is exact specification-to-implementation correspondence. Code with similar structure is wrong if it handles an edge case differently or returns a different value.

Representative Failure Modes

Common failures include retrieving a function with similar loop structure, confusing string and list tasks, overmatching shared variable names, and ranking a plausible but semantically different implementation. BM25 overweights shared surface tokens, while dense retrieval can blur neighboring algorithmic intents.

Training Data That May Help

Useful training data includes docstring-to-code retrieval, Python function search, unit-test-linked code examples, and hard negatives from functions with similar signatures or control flow. Evaluation prompts, implementations, and qrels should be excluded.
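One simple way to exclude evaluation material from training data is exact matching on normalized text. The sketch below is a minimal version of that idea, assuming training data is a list of (description, code) pairs. The normalization (lowercasing and whitespace collapsing) is an assumption, and it will not catch paraphrased or lightly edited copies.

```python
import hashlib
import re


def fingerprint(text):
    """Hash of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_eval_overlap(train_pairs, eval_texts):
    """Keep only training pairs whose description and code are both unseen in eval."""
    blocked = {fingerprint(t) for t in eval_texts}
    return [
        (desc, code)
        for desc, code in train_pairs
        if fingerprint(desc) not in blocked and fingerprint(code) not in blocked
    ]


# Toy data: the first training pair reuses an evaluation implementation.
eval_texts = ["return [x for x in strings if x.startswith(prefix)]"]
train_pairs = [
    ("Keep strings with a prefix.", "return [x for x in strings\n  if x.startswith(prefix)]"),
    ("Average a list.", "return sum(numbers) / len(numbers)"),
]
clean_pairs = drop_eval_overlap(train_pairs, eval_texts)
print(clean_pairs)
```

A stricter filter could add token-level or n-gram overlap checks, since reformatted or renamed copies of an evaluation implementation pass an exact-hash test.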

Model Improvement Notes

Models should represent function intent, input-output behavior, and edge cases. Hard negatives should share variable names, operations, or control flow while differing in specification. Hybrid retrieval is a strong first-stage choice because exact code tokens and semantic function matching both matter.
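Hard negatives of the kind described above can be approximated by picking non-matching code bodies that share many identifiers with the positive. The following sketch uses Jaccard overlap of identifier sets as the similarity; the function names, the similarity choice, and the toy corpus are assumptions for illustration, not a prescribed mining recipe.

```python
import re


def identifiers(code):
    """Set of identifier-like tokens (names and keywords) in a code string."""
    return set(re.findall(r"[A-Za-z_]\w*", code))


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_hard_negatives(positive_id, corpus, n=2):
    """Return up to n non-positive doc ids ranked by token overlap with the positive."""
    positive_tokens = identifiers(corpus[positive_id])
    scored = [
        (jaccard(positive_tokens, identifiers(code)), doc_id)
        for doc_id, code in corpus.items()
        if doc_id != positive_id
    ]
    # Highest overlap first; ties broken by doc id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:n]]


corpus = {
    "filter_by_prefix": "return [x for x in strings if x.startswith(prefix)]",
    "filter_by_substring": "return [x for x in strings if substring in x]",
    "mean": "return sum(numbers) / len(numbers)",
}
negatives = mine_hard_negatives("filter_by_prefix", corpus, n=1)
print(negatives)
```

The substring filter is selected because it shares the comprehension structure and variable names with the prefix filter while implementing a different specification, which is the contrast a model needs to learn.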

Example Data

| Query | Positive document |
| --- | --- |
| Filter an input list of strings only for ones that start with a given prefix. [77 chars] | return [x for x in strings if x.startswith(prefix)] [51 chars] |
| Input to this function is a string represented multiple groups for nested parentheses separated by spaces. For each of the group, output the deepest level of nesting of parentheses. E.g. (()()) has maximum two levels of nesting while ((())) has three. [251 chars] | def parse_paren_group(s): depth = 0 max_depth = 0 for c in s: if c == '(': depth += 1 max_depth = max(depth, max_depth) else: depth -= 1 return max_depth return [parse_paren_group(x) for x in paren_string.split(' ') if x] [331 chars] |
| Write a function that returns true if the given number is the multiplication of 3 prime numbers and false otherwise. Knowing that (a) is less then 100. Example: is_multiply_prime(30) == True [191 chars] | def is_prime(n): for j in range(2,n): if n%j == 0: return False return True for i in range(2,101): if not is_prime(i): continue for j in range(2,101): if not is_prime(j): continue for k in range(2,101): if not is_prime(k): continue if i*j*k == a: return True return False [396 chars] |

Source Reference Table

| Title | Year | Type | URL |
| --- | --- | --- | --- |
| Evaluating Large Language Models Trained on Code | 2021 | task paper | https://arxiv.org/abs/2107.03374 |
| openai/openai_humaneval | | dataset card | https://huggingface.co/datasets/openai/openai_humaneval |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark article | https://huggingface.co/blog/rteb |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoRTEB |
| Backing dataset | NanoRTEB |
| Task / split | NanoHumanEval |
| Hugging Face dataset | hakari-bench/NanoRTEB |
| Language | en |
| Category | code |
| Queries | 158 |
| Documents | 158 |
| Positive qrels | 158 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 291.16 |
| Document length avg chars | 176.99 |

Candidate Subsets

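| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.3405 | 0.5443 | 0.9051 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5666 | 0.7975 | 0.9937 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.5770 | 0.7405 | 0.9937 | top-100 |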
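Because every query has one positive, the reported hit@10 and recall@100 rates can be read as query counts. The sketch below converts them, assuming each metric is a plain mean over the 158 queries; the counts are derived from the rounded rates and are not figures published by the benchmark.

```python
N_QUERIES = 158

# Rates copied from the candidate subset figures reported in this card.
reported = {
    "bm25": {"hit@10": 0.5443, "recall@100": 0.9051},
    "dense": {"hit@10": 0.7975, "recall@100": 0.9937},
    "reranking_hybrid": {"hit@10": 0.7405, "recall@100": 0.9937},
}


def implied_count(rate, n=N_QUERIES):
    """Number of queries implied by a rate averaged over n queries."""
    return round(rate * n)


counts = {
    profile: {metric: implied_count(rate) for metric, rate in metrics.items()}
    for profile, metrics in reported.items()
}
for profile, metrics in counts.items():
    print(profile, metrics)
```

Under that assumption, dense retrieval and the hybrid each miss one positive within the top 100, while the hybrid places fewer positives in the top ten than dense retrieval despite its higher nDCG@10.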