NanoRTEB / NanoHumanEval
Overview
NanoHumanEval is an English docstring-to-code retrieval task from NanoRTEB. The query is a compact Python programming task description, and the relevant document is the corresponding implementation body from HumanEval. Each query has exactly one positive among 158 code documents. Dense retrieval is strong because it captures specification-to-implementation semantics, and it has the best hit@10 (0.7975). The reranking_hybrid profile has the best nDCG@10 (0.5770) and matches dense retrieval on recall@100 (0.9937). BM25 is useful but weaker because short code bodies often share little surface text with the natural-language description.
Details
What the Original Data Measures
HumanEval was introduced to evaluate large language models trained on code. The original benchmark contains hand-written Python programming problems, each with a function signature, a docstring, a reference solution, and unit tests, and it scores generated code on functional correctness.
RTEB converts the generation setting into retrieval. The model receives the function description as a query and must retrieve the matching implementation body. This makes the task closer to semantic code search than competitive-programming retrieval.
Observed Data Profile
The Nano split contains 158 queries, 158 documents, and 158 positive qrel rows. Every query has exactly one positive. Queries average 291.16 characters, while code documents average 176.99 characters.
Example tasks include filtering strings by prefix, computing maximum nested-parentheses depth, checking whether a number is a product of three primes, counting fruit items from a string description, and validating balanced brackets.
BM25 Evaluation Profile
The BM25 candidate subset uses the full 158-document pool and reaches nDCG@10 of 0.3405, hit@10 of 0.5443, and recall@100 of 0.9051. BM25 works when descriptions and code share variable names, operators, or literal strings.
The weakness is semantic compression. A short implementation can satisfy a long docstring without repeating its words, and many Python functions share common tokens such as loops, returns, list comprehensions, or helper names.
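A minimal sketch of this lexical gap, using the first query and positive document from the Example Data table. The `tokens` helper is a hypothetical regex tokenizer chosen for illustration; it is not the tokenizer used to build the BM25 profile.

```python
import re


def tokens(text: str) -> set:
    """Lowercased alphabetic/underscore tokens; a crude stand-in for a BM25 tokenizer."""
    return set(re.findall(r"[a-z_]+", text.lower()))


query = "Filter an input list of strings only for ones that start with a given prefix."
document = "return [x for x in strings if x.startswith(prefix)]"

# Terms a purely lexical scorer could match on.
shared = tokens(query) & tokens(document)
print(sorted(shared))  # -> ['for', 'prefix', 'strings']
```

Only three terms overlap, and one of them (`for`) is a keyword that appears in many other implementations. Note that `start` and `with` in the query do not match `startswith` in the code under this tokenization.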
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses the full 158-document pool and reaches nDCG@10 of 0.5666, hit@10 of 0.7975, and recall@100 of 0.9937. Dense retrieval is the best profile for top-ten hit rate.
This shows that embedding similarity captures function-level intent better than lexical matching. It can connect a description such as checking balanced brackets to a stack or depth-counting implementation even when wording differs.
Reranking Hybrid Evaluation Profile
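The reranking_hybrid subset uses top-100 candidates, with 1 row receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.5770, hit@10 of 0.7405, and recall@100 of 0.9937. Hybrid retrieval slightly improves nDCG@10 over dense retrieval (0.5770 vs 0.5666) and shares the same recall@100, but its hit@10 is lower (0.7405 vs 0.7975). With one positive per query, this combination means the hybrid places fewer positives in the top ten but, on average, ranks the ones it does place closer to the top.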
The result suggests that sparse code tokens and dense function semantics are complementary for early ordering. Exact symbols help when the query names operations or literals, while dense similarity handles the core specification.
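The source does not describe how reranking_hybrid combines sparse and dense signals. As an illustration only, reciprocal rank fusion (RRF) is one common way to merge two rankings; the function name, the constant `k=60`, and the document ids below are assumptions for this sketch, not the benchmark's method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; each list adds 1 / (k + rank) per document."""
    scores = defaultdict(float)
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    # Sort by descending fused score, breaking ties by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings from a sparse and a dense retriever.
bm25_ranking = ["d2", "d1", "d3"]
dense_ranking = ["d1", "d3", "d2"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # -> ['d1', 'd2', 'd3']
```

A document ranked well by both retrievers (`d1`) rises above one that only a single retriever favors, which is the kind of complementary early ordering described above.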
Metric Interpretation for Model Researchers
With one positive per query, nDCG@10 measures how early the exact implementation appears, hit@10 measures whether it is in the first ten candidates, and recall@100 measures whether a reranker can access it.
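With binary relevance and a single positive, the ideal DCG is 1, so per-query nDCG@10 reduces to 1 / log2(rank + 1) when the positive is in the top ten and 0 otherwise. The sketch below assumes this standard formulation; the function names and the sample ranks are illustrative.

```python
import math
from typing import Optional


def single_positive_metrics(rank: Optional[int]) -> dict:
    """Metrics for one query; rank is the 1-based position of the sole positive, or None if absent."""
    found = rank is not None
    in_top10 = found and rank <= 10
    return {
        "ndcg@10": 1.0 / math.log2(rank + 1) if in_top10 else 0.0,
        "hit@10": 1.0 if in_top10 else 0.0,
        "recall@100": 1.0 if found and rank <= 100 else 0.0,
    }


def mean_metrics(ranks):
    """Average the per-query metrics over a list of ranks."""
    per_query = [single_positive_metrics(r) for r in ranks]
    return {name: sum(m[name] for m in per_query) / len(per_query) for name in per_query[0]}


# Hypothetical ranks for four queries: first, third, eleventh, and not retrieved.
summary = mean_metrics([1, 3, 11, None])
print(summary)  # -> {'ndcg@10': 0.375, 'hit@10': 0.5, 'recall@100': 0.75}
```

The rank-11 query shows why the three metrics can diverge: it contributes nothing to nDCG@10 or hit@10 but still counts toward recall@100.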
For NanoHumanEval, the candidate pool is small, so recall is high for dense and hybrid retrieval. The main research signal is early ranking among compact implementations with similar Python structure.
Query and Relevance Type Tendencies
Queries are concise function specifications, sometimes with examples or constraints. Relevant documents are short Python implementation bodies. The task often requires mapping natural language to control flow, string operations, arithmetic, or list processing.
Relevance is exact specification-to-implementation correspondence. Code with similar structure is wrong if it implements a different edge case or returns a different value.
Representative Failure Modes
Common failures include retrieving a function with similar loop structure, confusing string and list tasks, overmatching shared variable names, and ranking a plausible but semantically different implementation. BM25 overweights shared surface tokens, while dense retrieval can blur neighboring algorithmic intentions.
Training Data That May Help
Useful training data includes docstring-to-code retrieval, Python function search, unit-test-linked code examples, and hard negatives from functions with similar signatures or control flow. Evaluation prompts, implementations, and qrels should be excluded.
Model Improvement Notes
Models should represent function intent, input-output behavior, and edge cases. Hard negatives should share variable names, operations, or control flow while differing in specification. Hybrid retrieval is a strong first-stage choice because exact code tokens and semantic function matching both matter.
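One way to surface such hard negatives is to rank candidate code bodies by token overlap with the positive. The sketch below uses Jaccard similarity over identifier tokens; the helper names and the candidate snippets are illustrative assumptions, and any mined negatives would still need to come from non-evaluation data.

```python
import re


def code_tokens(code: str) -> set:
    """Identifier-like tokens (letters and underscores) found in a code string."""
    return set(re.findall(r"[A-Za-z_]+", code))


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_hard_negatives(positive: str, candidates: list) -> list:
    """Order candidates from most to least lexically similar to the positive."""
    pos = code_tokens(positive)
    scored = [(jaccard(pos, code_tokens(c)), c) for c in candidates if c != positive]
    return [c for _, c in sorted(scored, key=lambda pair: -pair[0])]


positive = "return [x for x in strings if x.startswith(prefix)]"
candidates = [
    "return sum(numbers) / len(numbers)",             # different structure and names
    "return [x for x in strings if substring in x]",  # same shape, different specification
]

ranked = rank_hard_negatives(positive, candidates)
print(ranked[0])  # -> return [x for x in strings if substring in x]
```

The top-ranked candidate shares the list comprehension and variable names with the positive but filters by substring rather than prefix, which is the kind of near-miss that makes a negative hard.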
Example Data
| Query | Positive document |
| Filter an input list of strings only for ones that start with a given prefix. [77 chars] | return [x for x in strings if x.startswith(prefix)] [51 chars] |
| Input to this function is a string represented multiple groups for nested parentheses separated by spaces. For each of the group, output the deepest level of nesting of parentheses. E.g. (()()) has maximum two levels of nesting while ((())) has three. [251 chars] | def parse_paren_group(s): depth = 0 max_depth = 0 for c in s: if c == '(': depth += 1 max_depth = max(depth, max_depth) else: depth -= 1 return max_depth return [parse_paren_group(x) for x in paren_string.split(' ') if x] [331 chars] |
| Write a function that returns true if the given number is the multiplication of 3 prime numbers and false otherwise. Knowing that (a) is less then 100. Example: is_multiply_prime(30) == True [191 chars] | def is_prime(n): for j in range(2,n): if n%j == 0: return False return True for i in range(2,101): if not is_prime(i): continue for j in range(2,101): if not is_prime(j): continue for k in range(2,101): if not is_prime(k): continue if i*j*k == a: return True return False [396 chars] |
Source Reference Table
| Title | Year | Type | URL |
| Evaluating Large Language Models Trained on Code | 2021 | task paper | https://arxiv.org/abs/2107.03374 |
| openai/openai_humaneval | | dataset card | https://huggingface.co/datasets/openai/openai_humaneval |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark article | https://huggingface.co/blog/rteb |
Dataset Information
| Field | Value |
| Nano set | NanoRTEB |
| Backing dataset | NanoRTEB |
| Task / split | NanoHumanEval |
| Hugging Face dataset | hakari-bench/NanoRTEB |
| Language | en |
| Category | code |
| Queries | 158 |
| Documents | 158 |
| Positive qrels | 158 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 291.16 |
| Document length avg chars | 176.99 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.3405 | 0.5443 | 0.9051 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5666 | 0.7975 | 0.9937 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.5770 | 0.7405 | 0.9937 | top-100 |