HAKARI-Bench

NanoCodeRAG / NanoCodeRAGLibraryDocumentationSolutions

Overview

NanoCodeRAGLibraryDocumentationSolutions is an English code-retrieval task in NanoCodeRAG, sampled from the library-documentation retrieval source of CodeRAG-Bench. The query is an API name, usage intent, or short reference-style description, and the target document is the documentation entry that contains the needed API behavior, signature, arguments, aliases, examples, or migration notes.

The task is useful for studying retrieval-augmented code generation because library documentation is often the missing context for correct API use. A model must preserve exact identifiers such as dotted module paths and function names, while also matching semantic clues about the API's purpose. The documents are long enough to include generic boilerplate, so the retriever must find the exact reference page rather than any page from the same library.

Details

What the Original Data Measures

CodeRAG-Bench evaluates whether retrieval can support code generation. It defines several retrieval sources, including programming solutions, online tutorials, Python library documentation, Stack Overflow posts, and GitHub files. The library-documentation source is built from official Python library references, including documentation collected through devdocs.io.

This Nano task isolates the documentation source. The relevant document should contain the API behavior, signature, argument details, aliases, examples, or version-specific notes needed by the query. The task therefore measures API-reference retrieval, not general web search.

Observed Data Profile

This Nano split contains 200 queries, 8,683 documents, and 200 positive qrels. Each query has exactly one positive documentation entry. Queries average 397.43 characters, and documents average 2,045.70 characters. The long document length reflects full reference entries rather than short summaries.

Observed examples are dominated by TensorFlow-style API documentation, including entries such as forward-mode autodiff accumulators, random datasets, confusion matrices, batch-to-space operations, and distribution strategies. The relevant documents contain signatures, aliases, parameter descriptions, examples, deprecation notices, and migration guidance.

BM25 Evaluation Profile

BM25 is strong on this task, with nDCG@10 of 0.6867, hit@10 of 0.8150, and recall@100 of 0.9200 using a top-500 candidate pool. Exact API paths, function names, class names, and argument names give lexical retrieval meaningful anchors. When the query contains a dotted TensorFlow path, BM25 can often find the matching documentation entry.
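Because every query has exactly one positive, the three reported metrics reduce to simple functions of the positive's rank. The following is a minimal sketch of that reduction, not the benchmark's evaluation code; the function names and the toy run data are hypothetical.

```python
import math

def rank_of_positive(ranked_doc_ids, positive_id):
    """Return the 1-based rank of the positive document, or None if absent."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id == positive_id:
            return rank
    return None

def ndcg_at_k(rank, k):
    # With a single positive the ideal DCG is 1, so nDCG@k is
    # 1 / log2(rank + 1) when the positive is in the top k, else 0.
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(rank + 1)

def hit_at_k(rank, k):
    # With a single positive, recall@k and hit@k are the same quantity.
    return 1.0 if rank is not None and rank <= k else 0.0

def evaluate(runs, qrels):
    """runs: query id -> ranked doc ids; qrels: query id -> positive doc id."""
    ranks = [rank_of_positive(runs.get(qid, []), pos) for qid, pos in qrels.items()]
    n = len(ranks)
    return {
        "ndcg@10": sum(ndcg_at_k(r, 10) for r in ranks) / n,
        "hit@10": sum(hit_at_k(r, 10) for r in ranks) / n,
        "recall@100": sum(hit_at_k(r, 100) for r in ranks) / n,
    }

# Hypothetical toy data: positives at rank 1, rank 3, and not retrieved.
toy_runs = {"q1": ["d1", "d2"], "q2": ["d9", "d3", "d4"], "q3": ["d7"]}
toy_qrels = {"q1": "d1", "q2": "d4", "q3": "d5"}
metrics = evaluate(toy_runs, toy_qrels)
print(metrics)
```

In the toy data the per-query nDCG@10 values are 1, 0.5, and 0, so the mean is 0.5, while hit@10 and recall@100 are both 2/3.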

The difficulty is that documentation pages share large amounts of boilerplate: alias sections, migration guide text, parameter templates, and generic phrases appear across many entries. BM25 can over-rank nearby APIs in the same namespace when they share common text but describe a different function. Exact identifier preservation matters, but it is not sufficient by itself.
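The effect of shared boilerplate can be illustrated with a small Okapi BM25 scorer. This is a sketch under stated assumptions: whitespace tokenization, the common defaults k1 = 1.2 and b = 0.75, and three invented one-line documents. It is not the configuration used for the reported BM25 profile.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every document in `docs` (doc id -> token list) against the query."""
    n_docs = len(docs)
    avg_len = sum(len(tokens) for tokens in docs.values()) / n_docs
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # document frequency per term
    scores = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        norm = k1 * (1 - b + b * len(tokens) / avg_len)  # length normalization
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + norm)
        scores[doc_id] = score
    return scores

# Invented documents: only "a" documents the queried API, but all three
# share the boilerplate tokens "see migration guide".
docs = {
    "a": "tf.math.confusion_matrix computes confusion matrix see migration guide".split(),
    "b": "tf.math.count_nonzero computes nonzero count see migration guide".split(),
    "c": "tf.math.cumsum computes cumulative sum see migration guide".split(),
}
query = "tf.math.confusion_matrix see migration guide".split()
scores = bm25_scores(query, docs)
print(sorted(scores.items(), key=lambda item: -item[1]))
```

The identifier token occurs in one document and carries a high IDF, so "a" ranks first. Documents "b" and "c" still receive a nonzero score from boilerplate alone, which is the margin that can let a neighboring API overtake the positive when real entries repeat boilerplate many times.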

Dense Evaluation Profile

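The dense harrier-oss-270m profile (config harrier_oss_v1_270m) gives the best top-rank result, with nDCG@10 of 0.7645, hit@10 of 0.8900, and recall@100 of 0.9250. Relative to BM25, that is a gain of 0.0778 in nDCG@10 and 0.0750 in hit@10, but only 0.0050 in recall@100. Dense retrieval improves over BM25 by using semantic clues in the API summary and query text, not just exact identifier overlap.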

Dense similarity can connect a query about Jacobian-vector products to forward-mode autodiff documentation, or a query about reshaping batch dimensions to a batch-to-space operation. It still faces ambiguity among neighboring API pages, especially when many documents share the same library namespace and boilerplate structure.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.7544, hit@10 of 0.8850, and recall@100 of 0.9450. It uses top-100 candidates with optional rank-101 safeguards; 11 rows contain 101 candidates and 11 safeguard-positive rows are recorded. A recall@100 of 0.9450 corresponds to 189 of the 200 positives appearing in the top 100, which is consistent with those 11 safeguard rows. Hybrid retrieval has the best top-100 coverage but is slightly below dense retrieval on nDCG@10 (0.7544 versus 0.7645).
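The exact safeguard procedure is not spelled out here. The sketch below assumes the simplest reading: when the positive is missing from the top 100, it is appended as a 101st candidate. The function name and the synthetic rankings are hypothetical.

```python
def build_candidates(ranked_doc_ids, positive_id, top_k=100):
    """Return (candidates, safeguard_used) for one query.

    Assumption: the safeguard appends the positive at rank top_k + 1
    only when it is not already among the first top_k candidates.
    """
    candidates = list(ranked_doc_ids[:top_k])
    safeguard_used = positive_id not in candidates
    if safeguard_used:
        candidates.append(positive_id)
    return candidates, safeguard_used

# Synthetic check of the arithmetic: 200 queries, 11 of which have their
# positive outside the top 100 (here placed at rank 121 of 150).
ranking = [f"d{j}" for j in range(150)]
rows = [build_candidates(ranking, "d120" if i < 11 else "d5") for i in range(200)]

safeguard_rows = sum(1 for _, used in rows if used)
rows_with_101 = sum(1 for cands, _ in rows if len(cands) == 101)
recall_at_100 = sum(1 for _, used in rows if not used) / len(rows)
print(safeguard_rows, rows_with_101, recall_at_100)
```

Under this assumption the counts line up with the reported profile: 11 safeguard rows, 11 rows with 101 candidates, and 189/200 positives inside the top 100.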

This pattern shows that lexical and dense signals are complementary. BM25 helps with exact API paths and argument names, while dense retrieval helps with purpose and usage descriptions. The hybrid pool is attractive for downstream reranking because it recovers more positives by rank 100, but final ranking still needs to choose the exact documentation entry among similar API pages.
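How the hybrid pool is fused is not stated in this card. One common way to combine a lexical and a dense ranking is reciprocal rank fusion (RRF), sketched below purely as an illustration of complementary signals; RRF, the constant k = 60, and the toy rankings are assumptions, not the reranking_hybrid procedure.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids into one list.

    Each document receives 1 / (k + rank) from every list it appears in;
    ties are broken by doc id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]

# Toy rankings: "b" is found only by the lexical list, "d" only by the dense one.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

The fused list contains every document that either retriever found, which is why a fused pool tends to help coverage at rank 100, while documents ranked well by both lists ("a" and "c") rise to the top.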

Metric Interpretation for Model Researchers

NanoCodeRAGLibraryDocumentationSolutions is an API-documentation retrieval task where both exact terms and semantic similarity matter. Dense retrieval has the highest nDCG@10 and hit@10, reranking_hybrid has the highest recall@100, and BM25 remains competitive because API identifiers are decisive.

Researchers should inspect whether a model preserves namespace structure and identifier tokens. A model that smooths all TensorFlow documentation into a broad semantic cluster may retrieve the right topic but the wrong page. Conversely, a purely lexical model may overvalue boilerplate and alias text.

Query and Relevance Type Tendencies

Queries are API-name or usage-intent strings. They may contain a dotted path, class or function name, short description, inheritance note, or alias cue. Documents are reference entries with signatures, attributes, examples, arguments, returns, warnings, and version notes.

Relevance is exact documentation grounding. The positive document should be the page or entry that contains the requested API's behavior or signature. A page from the same namespace is non-relevant if it documents a different operation.

Representative Failure Modes

BM25 may confuse entries that share the tf.compat.v1 prefix, alias boilerplate, or migration guide text. Dense retrieval may confuse APIs with related purposes, such as neighboring tensor-shape operations or similar dataset classes.

Hybrid retrieval can recover the positive while still ranking a nearby API page above it. These errors are especially likely when the query and candidate documents share broad TensorFlow documentation language but differ in the final namespace component or argument behavior.

Training Data That May Help

Useful training data includes non-overlapping Python API documentation retrieval pairs, DocPrompting-style natural-language intent to documentation pairs, docstring and example-code to reference-page retrieval, and library search logs with overlap removed. Training should preserve dotted paths, argument names, aliases, and version notes.

Leakage filtering is required. CodeRAG-Bench reports a library-documentation source corpus of about 34,000 entries, and this Nano split is sampled from that source. Training should exclude NanoCodeRAG library-documentation queries, qrels, positive documentation entries, matching API paths, section text, and token fingerprints.
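A leakage filter along these lines could be sketched as follows. The record fields (api_path, query, document) and the sample data are hypothetical, and the fingerprint is an exact hash of normalized tokens, so it only catches copies that are identical after normalization; near-duplicate detection would need something stronger, such as shingling.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the lowercased identifier-like tokens in `text`."""
    tokens = re.findall(r"[a-z0-9_.]+", text.lower())
    return hashlib.sha1(" ".join(tokens).encode("utf-8")).hexdigest()

def filter_training_pairs(train_pairs, eval_api_paths, eval_texts):
    """Drop training pairs that overlap the evaluation split.

    train_pairs: list of dicts with hypothetical keys
                 "api_path", "query", and "document".
    eval_api_paths: API paths covered by evaluation queries or positives.
    eval_texts: evaluation query and positive-document texts.
    """
    blocked_paths = {path.lower() for path in eval_api_paths}
    blocked_prints = {fingerprint(text) for text in eval_texts}
    kept = []
    for pair in train_pairs:
        if pair["api_path"].lower() in blocked_paths:
            continue  # same API as an evaluation item
        if fingerprint(pair["query"]) in blocked_prints:
            continue  # query text duplicates evaluation text
        if fingerprint(pair["document"]) in blocked_prints:
            continue  # document text duplicates evaluation text
        kept.append(pair)
    return kept

# Invented sample: the first pair matches by API path, the second by text.
train_pairs = [
    {"api_path": "tf.compat.v1.confusion_matrix", "query": "confusion matrix", "document": "doc A"},
    {"api_path": "tf.math.other_op", "query": "other op", "document": "Computes  the CONFUSION matrix."},
    {"api_path": "numpy.linalg.norm", "query": "vector norm", "document": "Matrix or vector norm."},
]
kept = filter_training_pairs(
    train_pairs,
    eval_api_paths=["tf.compat.v1.confusion_matrix"],
    eval_texts=["Computes the confusion matrix."],
)
print([pair["api_path"] for pair in kept])
```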

Model Improvement Notes

Improving this task requires combining identifier-sensitive retrieval with semantic API understanding. Tokenization should preserve module paths, class names, function names, and argument names. At the same time, the model should understand short descriptions such as "computes a confusion matrix" or "distribution strategy for a single device."
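One way to keep identifiers intact during tokenization is to emit each dotted path as a single token and also emit its components, so that both exact-path and partial matches remain possible. The sketch below is an illustrative regex tokenizer, not the tokenizer of any evaluated model; it lowercases everything and skips bare numbers.

```python
import re

# An identifier, optionally followed by further dot-separated identifiers.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*(?:\.[A-Za-z_][A-Za-z0-9_]*)*")

def tokenize(text):
    """Tokenize text while preserving dotted API paths as whole tokens."""
    tokens = []
    for match in _IDENT.finditer(text):
        ident = match.group(0).lower()
        tokens.append(ident)  # full path or plain word
        if "." in ident:
            tokens.extend(ident.split("."))  # path components as extra tokens
    return tokens

tokens = tokenize("tf.compat.v1.confusion_matrix Computes the confusion matrix.")
print(tokens)
```

Note that confusion_matrix stays a single token and is distinct from the prose words confusion and matrix, so a match on the identifier can be weighted separately from a match on the description.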

For reranking, useful features include namespace match, signature match, argument compatibility, and whether the document contains the exact behavior requested by the query. Boilerplate sections should be down-weighted.
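Some of these features can be computed directly from the text. The sketch below assumes the query contains a dotted API path and that a signature in the document looks like a dotted path immediately followed by an opening parenthesis; the feature names are hypothetical, and calls inside example code would also be picked up as signatures.

```python
import re

_PATH = r"[A-Za-z_]\w*(?:\.[A-Za-z_]\w*)+"

def query_path(query):
    """First dotted path in the query, or None."""
    match = re.search(_PATH, query)
    return match.group(0) if match else None

def signature_paths(document):
    """Dotted paths that are immediately followed by '(' in the document."""
    return re.findall(rf"({_PATH})\(", document)

def shared_prefix(path_a, path_b):
    """Number of leading namespace components the two paths share."""
    count = 0
    for part_a, part_b in zip(path_a.split("."), path_b.split(".")):
        if part_a != part_b:
            break
        count += 1
    return count

def rerank_features(query, document):
    path = query_path(query)
    signatures = signature_paths(document)
    if path is None or not signatures:
        return {"exact_signature_match": 0.0, "namespace_overlap": 0.0,
                "final_component_match": 0.0}
    parts = path.split(".")
    return {
        "exact_signature_match": float(path in signatures),
        "namespace_overlap": max(shared_prefix(path, s) for s in signatures) / len(parts),
        "final_component_match": float(any(s.split(".")[-1] == parts[-1] for s in signatures)),
    }

query = "tf.compat.v1.confusion_matrix Computes the confusion matrix from predictions and labels."
positive = ("See Migration guide for more details. tf.compat.v1.math.confusion_matrix "
            "tf.compat.v1.confusion_matrix( labels, predictions )")
neighbor = "tf.compat.v1.count_nonzero( input_tensor, axis=None )"
pos_features = rerank_features(query, positive)
neg_features = rerank_features(query, neighbor)
print(pos_features, neg_features)
```

The neighbor shares three of four namespace components with the query but fails the exact-signature and final-component checks, which is the distinction a reranker needs to separate same-namespace pages.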

Example Data

| Query | Positive document |
| --- | --- |
| tf.autodiff.ForwardAccumulator Computes Jacobian-vector products ("JVP"s) using forward-mode autodiff. [102 chars] | tf.autodiff.ForwardAccumulator( primals, tangents ) Compare to tf.GradientTape which computes vector-Jacobian products ("VJP"s) using reverse-mode autodiff (backprop). Reverse mode is more attractive when computing gradients of a scalar-valued function with respect to many inputs (e.g. a neural network with many parameters and a scalar loss). Forward mode works best on functions with many outputs and few inputs. Since it does not hold on to intermediate activations, it is much more memory efficient than backprop where it is applicable. Consider a simple linear regression: x = tf.constant([[2.0, 3.0], [1.0, 4.0]]) dense = tf.keras.layers.Dense(1) dense.build([None, 2]) with tf.autodiff.ForwardAccumulator( primals=dense.kernel, tangents=tf.constant([[1.], [0.]])) as acc: loss = tf.reduce_sum((dense(x) - tf.constant([1., -1.])) ** 2.) acc.jvp(loss) <tf.Tensor: shape=(), dtype=float32, numpy=...> The example has two variables containing parameters, dense.kernel (2 parameters) and dense.bia... [1,000 / 6,087 chars] |
| tf.compat.v1.data.experimental.RandomDataset A Dataset of pseudorandom values. Inherits From: Dataset, Dataset [110 chars] | tf.compat.v1.data.experimental.RandomDataset( seed=None ) Attributes element_spec The type specification of an element of this dataset. dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3]) dataset.element_spec TensorSpec(shape=(), dtype=tf.int32, name=None) output_classes Returns the class of each component of an element of this dataset. (deprecated) Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use tf.compat.v1.data.get_output_classes(dataset). output_shapes Returns the shape of each component of an element of this dataset. (deprecated) Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use tf.compat.v1.data.get_output_shapes(dataset). output_types Returns the type of each component of an element of this dataset. (deprecated) Warning: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use tf.compat.v1.data.get_output_types(datas... [1,000 / 55,309 chars] |
| tf.compat.v1.confusion_matrix Computes the confusion matrix from predictions and labels. View aliases Compat aliases for migration [132 chars] | See Migration guide for more details. tf.compat.v1.math.confusion_matrix tf.compat.v1.confusion_matrix( labels, predictions, num_classes=None, dtype=tf.dtypes.int32, name=None, weights=None ) The matrix columns represent the prediction labels and the rows represent the real labels. The confusion matrix is always a 2-D array of shape [n, n], where n is the number of valid labels for a given classification task. Both prediction and labels must be 1-D arrays of the same shape in order for this function to work. If num_classes is None, then num_classes will be set to one plus the maximum value in either predictions or labels. Class labels are expected to start at 0. For example, if num_classes is 3, then the possible labels would be [0, 1, 2]. If weights is not None, then each prediction contributes its corresponding weight to the total value of the confusion matrix cell. For example: tf.math.confusion_matrix([1, 2, 4], [2, 2, 4]) ==> [[0 0 0 0 0] [0 0 1 0 0] [0 0 1 0 0] [0 0 0 0 0] [0 0 0... [1,000 / 1,943 chars] |

Source Reference Table

| Source | Role |
| --- | --- |
| CodeRAG-Bench: Can Retrieval Augment Code Generation? | Benchmark paper describing the retrieval sources and code-generation setting. |
| CodeRAG-Bench project page | Project page for the benchmark. |
| CodeRAG-Bench GitHub | Repository for benchmark resources. |
| code-rag-bench/library-documentation | Public source dataset card. |
| hakari-bench/NanoCodeRAG | Nano benchmark dataset containing this split. |

Dataset Information

| Field | Value |
| --- | --- |
| Nano set | NanoCodeRAG |
| Backing dataset | NanoCodeRAG |
| Task / split | NanoCodeRAGLibraryDocumentationSolutions |
| Hugging Face dataset | hakari-bench/NanoCodeRAG |
| Language | en |
| Category | code |
| Queries | 200 |
| Documents | 8,683 |
| Positive qrels | 200 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 397.43 |
| Document length avg chars | 2,045.70 |

Candidate Subsets

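| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.6867 | 0.8150 | 0.9200 | top-500 |
| Dense | harrier_oss_v1_270m | 0.7645 | 0.8900 | 0.9250 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.7544 | 0.8850 | 0.9450 | top-100 |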

Training and Leakage Metadata