NanoCoIR
Overview
NanoCoIR is the compact Nano set for CoIR, a code information retrieval benchmark. It covers ten English code-oriented retrieval settings: natural language developer requests retrieving code, code retrieving text, code retrieving code, programming dialogue retrieving assistant responses, StackOverflow-style QA, and Text-to-SQL retrieval. The group is useful because it does not reduce code retrieval to one query shape.
The CoIR setting treats code retrieval as a family of format mismatches. Developer intent, program behavior, identifiers, API usage, SQL schemas, dialogue history, and code summaries can all be the relevant signal. BM25 shows where identifiers and repeated technical terms dominate, dense retrieval tests whether program semantics and developer intent align, and reranking_hybrid shows whether exact code tokens and semantic similarity recover different candidate sets.
What This Group Measures
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models introduces CoIR as a benchmark for code retrieval across diverse query and document formats. NanoCoIR keeps ten compact splits derived from APPS, CoSQA, CodeSearchNet, CodeTransOcean, code feedback data, StackOverflow QA, and synthetic Text-to-SQL.
The shared measurement target is code/prose alignment. A relevant document may be runnable code, a docstring, a code continuation, an equivalent program, a SQL statement, or a mixed prose-and-code answer. A model that only matches keywords will miss behavior; a model that only matches broad semantics may miss exact APIs, schemas, identifiers, and error messages.
Task Families
- Text-to-code retrieval:
NanoApps,NanoCosQA, andNanoSyntheticText2SQLmap specifications, search phrases, or database questions to code or SQL. - Code-to-text retrieval:
NanoCodeSearchNetmaps code snippets to documentation-like summaries. - Code-to-code retrieval:
NanoCodeSearchNetCCR,NanoCodeTransOceanContest, andNanoCodeTransOceanDLretrieve code continuations or semantically equivalent programs across languages or frameworks. - Dialogue and QA retrieval:
NanoCodeFeedbackST,NanoCodeFeedbackMT, andNanoStackOverflowQAretrieve assistant or community answers that mix explanation and code.
Dataset Shape
NanoCoIR contains 10 task pages, 1,850 queries, 76,295 split-local documents, and 1,850 positive qrel rows. All tasks are single-positive in the current metadata. Nine tasks have 200 queries; NanoCodeTransOceanDL has 50.
Text length varies sharply. NanoCosQA has very short search-style queries, while NanoCodeFeedbackMT has multi-turn dialogue histories averaging more than 4,000 characters. Documents range from short CodeSearchNet summaries and SQL statements to long code-feedback answers and cross-framework code examples. This makes global averages less informative than per-task interpretation.
Retrieval Behavior
BM25 Profile
BM25 is strongest when identifiers, function names, comments, error messages, or dialogue terms repeat across query and document. It performs very well on code feedback, code continuation, StackOverflow QA, and some CodeSearchNet formats. It is extremely weak on NanoApps, where a long programming problem must retrieve a solution with little direct lexical overlap.
Sparse retrieval is not a naive baseline for code tasks; it captures exact tokens that often matter. But it can fail when relevance depends on algorithmic behavior, schema semantics, or cross-language equivalence.
Dense Profile
Dense retrieval is the best profile for most NanoCoIR tasks. It substantially improves APPS, CodeSearchNet, CoSQA, Text-to-SQL, CodeTransOcean, and code feedback retrieval by connecting intent and program behavior beyond exact token overlap. It is especially useful when natural-language problem statements need to retrieve compact code or SQL.
Dense retrieval still has to preserve code-specific details. A semantically near answer with the wrong library, schema column, edge case, framework, or language behavior is not relevant. Strong dense performance here should be read as code-aware semantic matching, not generic sentence similarity.
Reranking Hybrid Profile
reranking_hybrid is strongest where exact code tokens and semantic intent are both needed. It leads on NanoCodeSearchNetCCR and is competitive on feedback, StackOverflow, and Text-to-SQL tasks. In several dense-led tasks, hybrid still provides a useful candidate pool because BM25 may recover exact identifiers that dense retrieval misses.
For reranker experiments, NanoCoIR is a candidate-generation stress test. If the first stage drops the exact API, schema, or equivalent implementation, a reranker cannot recover the answer.
Task Summary
| Task | Retrieval shape | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| NanoApps | programming problem to solution code | 200 | 8,754 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoCodeFeedbackMT | multi-turn code dialogue to assistant answer | 200 | 10,000 | 0.7403 | 0.9177 | 0.8035 | Dense |
| NanoCodeFeedbackST | single-turn code prompt to assistant answer | 200 | 10,000 | 0.8722 | 0.9532 | 0.9115 | Dense |
| NanoCodeSearchNet | code to documentation summary | 200 | 10,000 | 0.6099 | 0.9687 | 0.8678 | Dense |
| NanoCodeSearchNetCCR | code prefix to continuation | 200 | 10,000 | 0.8834 | 0.8519 | 0.9073 | Reranking hybrid |
| NanoCodeTransOceanContest | contest code to equivalent code | 200 | 1,008 | 0.4869 | 0.8231 | 0.7157 | Dense |
| NanoCodeTransOceanDL | deep-learning code to equivalent framework code | 50 | 266 | 0.5581 | 0.6327 | 0.5956 | Dense |
| NanoCosQA | developer search query to code | 200 | 6,267 | 0.3049 | 0.6733 | 0.4792 | Dense |
| NanoStackOverflowQA | StackOverflow question to answer | 200 | 10,000 | 0.7482 | 0.8836 | 0.8328 | Dense |
| NanoSyntheticText2SQL | natural-language database question to SQL | 200 | 10,000 | 0.2240 | 0.9567 | 0.5577 | Dense |
Interpretation Notes for Model Researchers
NanoCoIR should be read by query-document format. Text-to-code, code-to-text, code-to-code, dialogue, and SQL retrieval stress different capabilities. A model that is excellent on CodeSearchNet summaries may still be weak on APPS problem solving or CodeTransOcean cross-framework equivalence.
The BM25/dense contrast is central. BM25-led or BM25-competitive rows show the importance of exact identifiers and code tokens. Dense-led rows show intent and behavior matching. Hybrid-led rows show candidate complementarity, especially when exact tokens and semantic program structure both matter.
Training and Leakage Notes
Useful training data includes APPS-style problem-solution pairs, CoSQA query-code data, CodeSearchNet functions and summaries, Text-to-SQL pairs, StackOverflow QA, code-assistant dialogue, code translation data, framework equivalence examples, and hard negatives from nearby APIs or algorithms.
Exclude NanoCoIR evaluation queries, positives, qrels, SQL statements, code snippets, docstrings, and direct synthetic variants. Public code datasets often have duplicated examples across splits and repositories, so overlap audits are important before training.
Source Reference Table
| Source | Year | Type | URL |
| CoIR: A Comprehensive Benchmark for Code Information Retrieval Models | 2024 | paper | https://arxiv.org/abs/2407.02883 |
Metadata Summary
| Field | Value |
| Task pages | 10 |
| Queries | 1,850 |
| Split-local documents | 76,295 |
| Positive qrels | 1,850 |
| Languages | en |
| Categories | code |
| Positives / query avg | 1.00 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| NanoApps | NanoCoIR | en | code | 200 | 8,754 | 200 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoCodeFeedbackMT | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.7403 | 0.9177 | 0.8035 | Dense |
| NanoCodeFeedbackST | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.8722 | 0.9532 | 0.9115 | Dense |
| NanoCodeSearchNet | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.6099 | 0.9687 | 0.8678 | Dense |
| NanoCodeSearchNetCCR | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.8834 | 0.8519 | 0.9073 | Reranking hybrid |
| NanoCodeTransOceanContest | NanoCoIR | en | code | 200 | 1,008 | 200 | 0.4869 | 0.8231 | 0.7157 | Dense |
| NanoCodeTransOceanDL | NanoCoIR | en | code | 50 | 266 | 50 | 0.5581 | 0.6327 | 0.5956 | Dense |
| NanoCosQA | NanoCoIR | en | code | 200 | 6,267 | 200 | 0.3049 | 0.6733 | 0.4792 | Dense |
| NanoStackOverflowQA | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.7482 | 0.8836 | 0.8328 | Dense |
| NanoSyntheticText2SQL | NanoCoIR | en | code | 200 | 10,000 | 200 | 0.2240 | 0.9567 | 0.5577 | Dense |