HAKARI-Bench

NanoCoIR

Overview

NanoCoIR is the compact Nano set for CoIR, a code information retrieval benchmark. It covers ten English code-oriented retrieval settings: natural-language developer requests retrieving code, code retrieving text, code retrieving code, programming dialogue retrieving assistant responses, StackOverflow-style QA, and Text-to-SQL retrieval. The group is useful because it does not reduce code retrieval to one query shape.

The CoIR setting treats code retrieval as a family of format mismatches. Developer intent, program behavior, identifiers, API usage, SQL schemas, dialogue history, and code summaries can all be the relevant signal. BM25 shows where identifiers and repeated technical terms dominate, dense retrieval tests whether program semantics and developer intent align, and reranking_hybrid shows whether exact code tokens and semantic similarity recover different candidate sets.

What This Group Measures

The paper "CoIR: A Comprehensive Benchmark for Code Information Retrieval Models" introduces CoIR as a benchmark for code retrieval across diverse query and document formats. NanoCoIR keeps ten compact splits derived from APPS, CoSQA, CodeSearchNet, CodeTransOcean, code feedback data, StackOverflow QA, and synthetic Text-to-SQL.

The shared measurement target is code/prose alignment. A relevant document may be runnable code, a docstring, a code continuation, an equivalent program, a SQL statement, or a mixed prose-and-code answer. A model that only matches keywords will miss behavior; a model that only matches broad semantics may miss exact APIs, schemas, identifiers, and error messages.

Task Families

Dataset Shape

NanoCoIR contains 10 task pages, 1,850 queries, 76,295 split-local documents, and 1,850 positive qrel rows. All tasks are single-positive in the current metadata. Nine tasks have 200 queries; NanoCodeTransOceanDL has 50.
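
The totals above can be re-derived from the per-task counts. The snippet below is a small sanity check, not part of any official tooling; the dictionary name `TASK_SHAPE` is a hypothetical helper, and the numbers are copied from the Task Summary table in this document.

```python
# (queries, documents) per task, copied from the Task Summary table.
TASK_SHAPE = {
    "NanoApps": (200, 8_754),
    "NanoCodeFeedbackMT": (200, 10_000),
    "NanoCodeFeedbackST": (200, 10_000),
    "NanoCodeSearchNet": (200, 10_000),
    "NanoCodeSearchNetCCR": (200, 10_000),
    "NanoCodeTransOceanContest": (200, 1_008),
    "NanoCodeTransOceanDL": (50, 266),
    "NanoCosQA": (200, 6_267),
    "NanoStackOverflowQA": (200, 10_000),
    "NanoSyntheticText2SQL": (200, 10_000),
}

total_queries = sum(q for q, _ in TASK_SHAPE.values())
total_docs = sum(d for _, d in TASK_SHAPE.values())
# Every task is single-positive, so there is one positive qrel row per query.
total_qrels = total_queries

print(len(TASK_SHAPE), total_queries, total_docs, total_qrels)
```

Because the corpora range from 266 to 10,000 documents, a query-weighted or document-weighted average would be dominated by the large splits, which is one more reason to read results per task.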

Text length varies sharply. NanoCosQA has very short search-style queries, while NanoCodeFeedbackMT has multi-turn dialogue histories averaging more than 4,000 characters. Documents range from short CodeSearchNet summaries and SQL statements to long code-feedback answers and cross-framework code examples. This makes global averages less informative than per-task interpretation.

Retrieval Behavior

BM25 Profile

BM25 is strongest when identifiers, function names, comments, error messages, or dialogue terms repeat across query and document. It reaches 0.74 to 0.88 nDCG@10 on code feedback (NanoCodeFeedbackMT, NanoCodeFeedbackST), code continuation (NanoCodeSearchNetCCR), and NanoStackOverflowQA. It nearly fails on NanoApps (0.0084), where a long programming problem must retrieve a solution with little direct lexical overlap.

Sparse retrieval is not a naive baseline for code tasks; it captures exact tokens that often matter. But it can fail when relevance depends on algorithmic behavior, schema semantics, or cross-language equivalence.
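
A minimal sketch of why this happens, assuming a textbook Okapi BM25 with a non-negative IDF variant. The tokenizer, the parameters `k1=1.5` and `b=0.75`, and the two toy documents are illustrative assumptions; the source does not describe the BM25 configuration used for the reported scores.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption): lowercase runs of letters, digits, underscores.
    return re.findall(r"[a-z0-9_]+", text.lower())

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

docs = [
    "def read_json_file(path): return json.load(open(path))",
    "def quicksort(items): return sorted(items)",
]
# Shares the identifier read_json_file with the first document.
id_scores = bm25_scores("how to call read_json_file", docs)
# Describes what the second document does, but shares no token with it.
beh_scores = bm25_scores("order numbers ascending", docs)
print(id_scores, beh_scores)
```

The identifier query scores only the first document, while the behavioral query scores nothing at all: the same pattern that separates the continuation and feedback tasks from NanoApps.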

Dense Profile

Dense retrieval is the best profile for nine of the ten NanoCoIR tasks. It substantially improves on BM25 for APPS, CodeSearchNet, CoSQA, Text-to-SQL, CodeTransOcean, and code feedback retrieval by connecting intent and program behavior beyond exact token overlap. It is especially useful when natural-language problem statements need to retrieve compact code or SQL: NanoSyntheticText2SQL rises from 0.2240 (BM25) to 0.9567 (dense) nDCG@10.

Dense retrieval still has to preserve code-specific details. A semantically near answer with the wrong library, schema column, edge case, framework, or language behavior is not relevant. Strong dense performance here should be read as code-aware semantic matching, not generic sentence similarity.
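
The near-miss problem can be illustrated with a toy cosine ranking. This is a sketch under stated assumptions: the three-dimensional vectors and document ids are hypothetical stand-ins for real encoder output, and no particular embedding model is implied.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_by_cosine(query_vec, doc_vecs, top_k=10):
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, d)), doc_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]

# Hypothetical embeddings; a real encoder would emit hundreds of dimensions.
doc_vecs = {
    "sql_orders_by_customer": [0.9, 0.1, 0.2],  # the intended positive
    "sql_orders_by_date": [0.7, 0.6, 0.1],      # same table, wrong grouping column
    "python_sort_list": [0.1, 0.2, 0.9],        # unrelated code
}
query_vec = [0.8, 0.2, 0.1]  # stands in for "total orders per customer"

ranking = rank_by_cosine(query_vec, doc_vecs)
print(ranking)
```

In this toy setup the wrong-column SQL statement ranks second, well ahead of the unrelated code. With single-positive qrels it still counts as a miss, so an encoder has to separate such neighbors rather than merely place them nearby.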

Reranking Hybrid Profile

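reranking_hybrid is strongest where exact code tokens and semantic intent are both needed. It leads only on NanoCodeSearchNetCCR (0.9073 nDCG@10); on the other nine tasks it falls between BM25 and dense. It stays within about 0.05 of dense on NanoCodeTransOceanDL, NanoCodeFeedbackST, and NanoStackOverflowQA, and trails dense most on NanoSyntheticText2SQL (0.5577 vs 0.9567). In the dense-led tasks, hybrid can still provide a useful candidate pool because BM25 may recover exact identifiers that dense retrieval misses.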

For reranker experiments, NanoCoIR is a candidate-generation stress test. If the first stage drops the exact API, schema, or equivalent implementation, a reranker cannot recover the answer.
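
The ceiling that candidate generation places on a reranker can be measured directly. The sketch below fuses two first-stage rankings with reciprocal rank fusion (RRF) and checks how often the positive survives. RRF, the constant `k=60`, and the toy runs are assumptions for illustration; the source does not say how reranking_hybrid builds its candidate pool.

```python
def rrf_fuse(rankings, k=60, top_n=10):
    # Reciprocal rank fusion: each list contributes 1 / (k + rank) per document.
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_n]

def pool_recall(candidates_by_query, positive_by_query):
    # Fraction of queries whose single positive is present in the candidate pool.
    hits = sum(
        1
        for qid, positive in positive_by_query.items()
        if positive in candidates_by_query.get(qid, [])
    )
    return hits / len(positive_by_query)

# Hypothetical first-stage outputs for two queries.
bm25_runs = {"q1": ["d3", "d7", "d1"], "q2": ["d4", "d5", "d6"]}
dense_runs = {"q1": ["d1", "d2", "d3"], "q2": ["d8", "d5", "d4"]}
positives = {"q1": "d7", "q2": "d9"}

hybrid_runs = {
    qid: rrf_fuse([bm25_runs[qid], dense_runs[qid]], top_n=4) for qid in positives
}
print(pool_recall(dense_runs, positives), pool_recall(hybrid_runs, positives))
```

For q1 only BM25 retrieves the positive, so the fused pool recovers what dense alone drops. For q2 neither stage retrieves it, and no reranker placed after this pool can score above zero on that query.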

Task Summary

| Task | Retrieval shape | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- |
| NanoApps | programming problem to solution code | 200 | 8,754 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoCodeFeedbackMT | multi-turn code dialogue to assistant answer | 200 | 10,000 | 0.7403 | 0.9177 | 0.8035 | Dense |
| NanoCodeFeedbackST | single-turn code prompt to assistant answer | 200 | 10,000 | 0.8722 | 0.9532 | 0.9115 | Dense |
| NanoCodeSearchNet | code to documentation summary | 200 | 10,000 | 0.6099 | 0.9687 | 0.8678 | Dense |
| NanoCodeSearchNetCCR | code prefix to continuation | 200 | 10,000 | 0.8834 | 0.8519 | 0.9073 | Reranking hybrid |
| NanoCodeTransOceanContest | contest code to equivalent code | 200 | 1,008 | 0.4869 | 0.8231 | 0.7157 | Dense |
| NanoCodeTransOceanDL | deep-learning code to equivalent framework code | 50 | 266 | 0.5581 | 0.6327 | 0.5956 | Dense |
| NanoCosQA | developer search query to code | 200 | 6,267 | 0.3049 | 0.6733 | 0.4792 | Dense |
| NanoStackOverflowQA | StackOverflow question to answer | 200 | 10,000 | 0.7482 | 0.8836 | 0.8328 | Dense |
| NanoSyntheticText2SQL | natural-language database question to SQL | 200 | 10,000 | 0.2240 | 0.9567 | 0.5577 | Dense |
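
Because every task here is single-positive, nDCG@10 has a simple reading. The sketch below assumes the standard binary-gain definition of nDCG; the source does not show the exact evaluation code, and the function name is hypothetical.

```python
import math

def ndcg_at_10_single_positive(ranked_doc_ids, positive_id):
    # With one relevant document the ideal DCG is 1, so nDCG@10 equals the
    # discount at the positive's rank, or 0 if it is outside the top 10.
    for rank, doc_id in enumerate(ranked_doc_ids[:10], start=1):
        if doc_id == positive_id:
            return 1.0 / math.log2(rank + 1)
    return 0.0

ranking = ["d5", "d2", "d9", "d1"]
print(ndcg_at_10_single_positive(ranking, "d5"))  # positive at rank 1
print(ndcg_at_10_single_positive(ranking, "d9"))  # positive at rank 3
print(ndcg_at_10_single_positive(ranking, "d7"))  # positive not retrieved
```

Under this definition a positive at rank 1 contributes 1.0, rank 3 contributes 0.5, and rank 10 contributes about 0.29. A task average near 0.95 therefore means the positive is almost always at rank 1, while an average near 0.01 means it is almost never in the top 10.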

Interpretation Notes for Model Researchers

NanoCoIR should be read by query-document format. Text-to-code, code-to-text, code-to-code, dialogue, and SQL retrieval stress different capabilities. A model that is excellent on CodeSearchNet summaries may still be weak on APPS problem solving or CodeTransOcean cross-framework equivalence.

The BM25/dense contrast is central. No row is BM25-led, but BM25-competitive rows such as NanoCodeSearchNetCCR, where BM25 (0.8834) beats dense (0.8519), show the importance of exact identifiers and code tokens. Dense-led rows show intent and behavior matching. Hybrid-led rows show candidate complementarity, especially when exact tokens and semantic program structure both matter.
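
One way to read the contrast row by row is to compute the dense-minus-BM25 gap per task. The scores below are copied from the Task Summary table; the names `SCORES`, `PROFILES`, and `gaps` are hypothetical helpers for this sketch.

```python
# (BM25, dense, reranking_hybrid) nDCG@10, copied from the Task Summary table.
SCORES = {
    "NanoApps": (0.0084, 0.2528, 0.1655),
    "NanoCodeFeedbackMT": (0.7403, 0.9177, 0.8035),
    "NanoCodeFeedbackST": (0.8722, 0.9532, 0.9115),
    "NanoCodeSearchNet": (0.6099, 0.9687, 0.8678),
    "NanoCodeSearchNetCCR": (0.8834, 0.8519, 0.9073),
    "NanoCodeTransOceanContest": (0.4869, 0.8231, 0.7157),
    "NanoCodeTransOceanDL": (0.5581, 0.6327, 0.5956),
    "NanoCosQA": (0.3049, 0.6733, 0.4792),
    "NanoStackOverflowQA": (0.7482, 0.8836, 0.8328),
    "NanoSyntheticText2SQL": (0.2240, 0.9567, 0.5577),
}
PROFILES = ("BM25", "Dense", "Reranking hybrid")

# Positive gap: dense beats BM25. Negative gap: BM25 beats dense.
gaps = {task: round(dense - bm25, 4) for task, (bm25, dense, _) in SCORES.items()}
by_gap = sorted(gaps, key=gaps.get, reverse=True)
bm25_beats_dense = [task for task, gap in gaps.items() if gap < 0]

best_profile = {
    task: PROFILES[max(range(3), key=lambda i: scores[i])]
    for task, scores in SCORES.items()
}

for task in by_gap:
    print(f"{task:28s} gap={gaps[task]:+.4f} best={best_profile[task]}")
```

The largest gap is on NanoSyntheticText2SQL and the only negative gap is on NanoCodeSearchNetCCR, which is also the only hybrid-led row.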

Training and Leakage Notes

Useful training data includes APPS-style problem-solution pairs, CoSQA query-code data, CodeSearchNet functions and summaries, Text-to-SQL pairs, StackOverflow QA, code-assistant dialogue, code translation data, framework equivalence examples, and hard negatives from nearby APIs or algorithms.

Exclude NanoCoIR evaluation queries, positives, qrels, SQL statements, code snippets, docstrings, and direct synthetic variants. Public code datasets often have duplicated examples across splits and repositories, so overlap audits are important before training.
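
A minimal overlap audit can be sketched as exact matching after light normalization. The normalization choices here (collapsing whitespace, lowercasing) and the function names are assumptions for illustration; lowercasing is lossy for case-sensitive code, and near-duplicates such as renamed variables need fuzzier methods than this.

```python
import hashlib
import re

def fingerprint(text):
    # Collapse whitespace and lowercase so trivial reformatting does not hide a duplicate.
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_overlapping(train_texts, eval_texts):
    # Keep only training texts whose fingerprint does not appear in the evaluation set.
    eval_prints = {fingerprint(t) for t in eval_texts}
    return [t for t in train_texts if fingerprint(t) not in eval_prints]

# Hypothetical examples: one evaluation positive and two training candidates.
eval_texts = ["SELECT name FROM users WHERE id = 1;"]
train_texts = [
    "select name\n  from users where id = 1;",  # same statement, reformatted
    "SELECT email FROM users WHERE id = 1;",     # different column, kept
]

clean_train = drop_overlapping(train_texts, eval_texts)
print(clean_train)
```

In practice the evaluation side of the check would include queries, positives, and corpus documents, since leakage through any of them can inflate scores.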

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | 2024 | paper | https://arxiv.org/abs/2407.02883 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 10 |
| Queries | 1,850 |
| Split-local documents | 76,295 |
| Positive qrels | 1,850 |
| Languages | en |
| Categories | code |
| Positives / query avg | 1.00 |

Task Metadata Summary

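| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoApps | NanoCoIR | en | code | 200 | 8,754 | 200 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoCodeFeedbackMT | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.7403 | 0.9177 | 0.8035 | Dense |
| NanoCodeFeedbackST | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.8722 | 0.9532 | 0.9115 | Dense |
| NanoCodeSearchNet | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.6099 | 0.9687 | 0.8678 | Dense |
| NanoCodeSearchNetCCR | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.8834 | 0.8519 | 0.9073 | Reranking hybrid |
| NanoCodeTransOceanContest | NanoCoIR | en | code | 200 | 1,008 | 200 | 0.4869 | 0.8231 | 0.7157 | Dense |
| NanoCodeTransOceanDL | NanoCoIR | en | code | 50 | 266 | 50 | 0.5581 | 0.6327 | 0.5956 | Dense |
| NanoCosQA | NanoCoIR | en | code | 200 | 6,267 | 200 | 0.3049 | 0.6733 | 0.4792 | Dense |
| NanoStackOverflowQA | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.7482 | 0.8836 | 0.8328 | Dense |
| NanoSyntheticText2SQL | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.2240 | 0.9567 | 0.5577 | Dense |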