NanoCodeRAG / NanoCodeRAGOnlineTutorials

Overview

NanoCodeRAGOnlineTutorials is an English code-retrieval task in NanoCodeRAG, sampled from the online-tutorial retrieval source of CodeRAG-Bench. The query is usually a short tutorial title, how-to phrase, or programming-problem title. The target document is a long tutorial page containing prose, code snippets, examples, steps, and explanations.

This task tests whether a retrieval model can connect a concise developer need to the article that explains it. The documents are much longer than the queries and often contain boilerplate, multiple examples, timestamps, headings, and unrelated tokens. A good model must identify the central tutorial topic rather than match incidental code words inside a long page.

Details

What the Original Data Measures

CodeRAG-Bench includes online tutorials as one of its retrieval sources for retrieval-augmented code generation. The benchmark paper describes tutorial pages collected from programming tutorial sites such as GeeksforGeeks, W3Schools, Tutorialspoint, and Towards Data Science through ClueWeb22. These pages contain code snippets and explanatory text for programming concepts, APIs, algorithms, and practical tasks.

This Nano task isolates tutorial retrieval. The correct document should be the tutorial that explains the requested API, language feature, algorithm, or programming procedure. It measures title-to-article and short-query-to-long-document retrieval.

Observed Data Profile

This Nano split contains 200 queries, 9,997 documents, and 200 positive qrels. Each query has exactly one positive tutorial page. Queries average 51.91 characters, while documents average 5,722.55 characters, so documents are roughly 110 times longer than queries on average. This length gap is central to the task.

Observed queries include Android screen control, secure file deletion on Linux, C++ access modifiers, Python superscript and subscript printing, and GeeksforGeeks practice problems. Documents are article-like pages with dates, explanations, examples, and sometimes long code-heavy sections.

BM25 Evaluation Profile

BM25 is very strong, with nDCG@10 of 0.8175, hit@10 of 0.9200, and recall@100 of 0.9700 using a top-500 candidate pool. Tutorial titles and article bodies often share exact phrases, language names, method names, and problem titles, so term frequency is a powerful signal.

BM25 still has weaknesses. Generic titles, repeated site boilerplate, and common programming terms can distract lexical ranking. For example, a query about a standard-library method or a broad STL concept may retrieve a closely related page instead of the intended one. Long documents also contain many incidental words that can create false lexical matches.
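To make the lexical signal concrete, the following is a minimal Okapi BM25 sketch over toy documents. It is not the benchmark's actual BM25 setup: the whitespace tokenizer, the parameters k1=1.5 and b=0.75, and the toy document strings are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the tokenizer used
    # for the reported BM25 numbers is not specified in this card.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    query_terms = set(tokenize(query))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes long documents for the same tf.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (hypothetical, not taken from the dataset).
toy_docs = [
    "securely delete files from linux with shred and wipe",
    "delete a file in python using os remove",
    "c++ access modifiers private and protected explained",
]
toy_scores = bm25_scores("securely delete files linux", toy_docs)
print(toy_scores)
```

The second toy document shares only the common word "delete" with the query, which illustrates how a generic term can still earn a nonzero score for a page about a different task.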

Dense Evaluation Profile

The dense harrier-oss-270m profile (config harrier_oss_v1_270m) has the highest nDCG@10 of the three profiles, reaching 0.9027, with hit@10 of 0.9400 and recall@100 of 0.9550. Dense retrieval improves top-rank ordering by matching the central tutorial meaning rather than only exact title words.

Dense similarity helps when a short query describes a task and the tutorial explains it with different wording. It can connect "turn Android device screen on and off programmatically" to a page with setup steps and code, or a practice problem title to the article that states the problem and solution. Its recall@100 is slightly lower than BM25's (0.9550 versus 0.9700), suggesting that exact title matching still recovers some positives that dense retrieval misses.
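As a rough illustration of dense ranking, the sketch below orders documents by cosine similarity between a query vector and document vectors. The three-dimensional vectors and document ids are invented for the example; a real system would obtain embeddings from an encoder such as the dense model profiled here, whose API is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_id, similarity) pairs sorted from most to least similar."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical embeddings; real embeddings have hundreds of dimensions.
toy_query_vec = [0.9, 0.1, 0.0]
toy_doc_vecs = {
    "android_screen_on_off": [0.8, 0.2, 0.1],
    "android_layout_basics": [0.4, 0.6, 0.2],
    "linux_secure_delete": [0.0, 0.1, 0.9],
}
for doc_id, sim in rank_by_cosine(toy_query_vec, toy_doc_vecs):
    print(doc_id, round(sim, 3))
```

The neighboring Android page still scores well above the unrelated Linux page, which mirrors the failure mode where a semantically nearby tutorial competes with the exact one.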

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.8673, hit@10 of 0.9550, and recall@100 of 1.0000. It uses exactly 100 candidates per query, with no safeguard-positive rows. Hybrid retrieval gives the best hit@10 and perfect top-100 coverage, although dense retrieval has the best nDCG@10.

This is a classic hybrid-friendly pattern. BM25 contributes exact title and phrase matching, while dense retrieval captures article-topic similarity. The combined pool ensures that every positive is available for downstream reranking. The final ranker still needs to decide which long article is most directly about the query.
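This card does not describe how the reranking_hybrid candidates are actually assembled. As one common way to merge a lexical and a dense ranking into a fixed-size pool, the sketch below uses reciprocal rank fusion (RRF). The function name, the constant k=60, and the toy document ids are assumptions, not the benchmark's procedure.

```python
def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    Ties are broken by doc id so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return ordered[:top_n]


# Toy rankings (hypothetical ids): d1 appears in both lists.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d4", "d1", "d5"]
print(rrf_fuse([bm25_ranking, dense_ranking]))
# → ['d1', 'd4', 'd2', 'd3', 'd5']
```

A document found by both retrievers rises to the top, while documents found by only one retriever still stay in the pool, which is the property that matters for top-100 coverage.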

Metric Interpretation for Model Researchers

NanoCodeRAGOnlineTutorials is a useful task for studying short-query to long-document matching. BM25, dense, and hybrid all perform well, but for different reasons. BM25 benefits from exact title and phrase overlap, dense retrieval gives the best top-rank quality (nDCG@10 of 0.9027), and reranking_hybrid gives full top-100 coverage (recall@100 of 1.0000).

The metric pattern suggests that candidate generation should not discard lexical signals. A dense-only system may rank positives best on average, but BM25 recovers some title-driven pages that dense retrieval misses within the top 100. A hybrid candidate pool is a strong base for reranking tutorial search.
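Because every query in this split has exactly one positive, the three reported metrics reduce to simple functions of the positive's rank. The sketch below assumes binary relevance and the standard log2 discount, so the ideal DCG is 1 and nDCG@10 equals 1 / log2(rank + 1) when the positive is in the top 10. The function name and document ids are illustrative.

```python
import math


def single_positive_metrics(ranked_ids, positive_id):
    """Compute nDCG@10, hit@10, and recall@100 for one query with one positive."""
    if positive_id in ranked_ids:
        rank = ranked_ids.index(positive_id) + 1  # 1-based rank
    else:
        rank = None

    in_top10 = rank is not None and rank <= 10
    in_top100 = rank is not None and rank <= 100
    return {
        # With one positive the ideal DCG is 1 / log2(2) = 1.
        "ndcg@10": 1.0 / math.log2(rank + 1) if in_top10 else 0.0,
        "hit@10": 1.0 if in_top10 else 0.0,
        "recall@100": 1.0 if in_top100 else 0.0,
    }


# Positive at rank 3: counted as a hit, but nDCG@10 drops to 0.5.
ranking = ["doc_a", "doc_b", "doc_pos", "doc_c"]
print(single_positive_metrics(ranking, "doc_pos"))
# → {'ndcg@10': 0.5, 'hit@10': 1.0, 'recall@100': 1.0}
```

This is why a profile can lead on hit@10 yet trail on nDCG@10: hit@10 only asks whether the positive is in the top 10, while nDCG@10 rewards placing it near rank 1.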

Query and Relevance Type Tendencies

Queries are concise title-like strings. They may name a platform, language, API, algorithm, or practice problem. Documents are long tutorials with prose explanations, code snippets, examples, and headings.

Relevance is a page-level topical match. The positive document should be the tutorial that explains the requested task. A page from the same language or site is non-relevant if it covers a neighboring method, operator, or problem.

Representative Failure Modes

BM25 may over-rank pages with similar titles or repeated site boilerplate. Dense retrieval may choose a semantically nearby tutorial that explains a related concept but not the exact requested one. Hybrid retrieval can recover the positive but still needs reranking to distinguish overview pages from method-specific pages.

Long documents introduce another failure mode: incidental code snippets or footer text can match query terms even when the main article topic differs. Models should learn to focus on title, headings, and central explanation.

Training Data That May Help

Useful training data includes non-overlapping programming tutorial title-to-page pairs, developer search logs over tutorials and documentation, Stack Overflow question-to-tutorial citation pairs, and code-example retrieval with long tutorial hard negatives. Hard negatives should come from the same language or topic but solve a different task.
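One simple way to pick same-language hard negatives is sketched below: keep documents that share the query's language tag, drop the positive, and prefer titles with high token overlap with the query. The row fields (doc_id, language, title, text, positive_id) are hypothetical and are not the dataset's schema.

```python
def jaccard(a, b):
    """Jaccard overlap between the lowercase token sets of two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pick_hard_negatives(query_row, corpus, max_negatives=2):
    """Return ids of same-language, non-positive docs, most title-similar first."""
    candidates = [
        doc for doc in corpus
        if doc["language"] == query_row["language"]
        and doc["doc_id"] != query_row["positive_id"]
    ]
    candidates.sort(key=lambda doc: jaccard(query_row["text"], doc["title"]), reverse=True)
    return [doc["doc_id"] for doc in candidates[:max_negatives]]


# Toy corpus with hypothetical ids and titles.
toy_corpus = [
    {"doc_id": "cpp_private_protected", "language": "cpp",
     "title": "difference between private and protected in c++"},
    {"doc_id": "cpp_public_private", "language": "cpp",
     "title": "difference between public and private in c++"},
    {"doc_id": "cpp_vector_sort", "language": "cpp",
     "title": "sort a vector in c++"},
    {"doc_id": "py_superscript", "language": "python",
     "title": "print superscript in python"},
]
toy_query = {
    "text": "difference between private and protected in c++",
    "language": "cpp",
    "positive_id": "cpp_private_protected",
}
print(pick_hard_negatives(toy_query, toy_corpus))
# → ['cpp_public_private', 'cpp_vector_sort']
```

In practice the candidates would more likely come from BM25 or dense retrieval over the training corpus; title overlap is used here only to keep the sketch self-contained.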

Leakage filtering is required. CodeRAG-Bench reports about 79,400 online-tutorial documents, and this Nano split is sampled from that source. Training should exclude NanoCodeRAG tutorial queries, qrels, positive tutorial pages, matching titles, URLs, article bodies, code snippets, and token fingerprints.
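A minimal sketch of such a filter is shown below. It drops training rows whose normalized title, URL, or normalized-body hash matches any evaluation page. The row fields (title, url, body) and the example.com URLs are hypothetical, and the SHA-1 fingerprint only catches exact matches after whitespace and case normalization; near-duplicate pages would need a fuzzier method such as shingling.

```python
import hashlib


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(text.lower().split())


def fingerprint(text):
    """Exact-match fingerprint of the normalized text."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def filter_leaks(train_rows, eval_rows):
    """Keep only training rows that do not match any evaluation page."""
    blocked_titles = {normalize(row["title"]) for row in eval_rows}
    blocked_urls = {row["url"] for row in eval_rows}
    blocked_bodies = {fingerprint(row["body"]) for row in eval_rows}

    kept = []
    for row in train_rows:
        if normalize(row["title"]) in blocked_titles:
            continue
        if row["url"] in blocked_urls:
            continue
        if fingerprint(row["body"]) in blocked_bodies:
            continue
        kept.append(row)
    return kept


# Hypothetical rows for illustration only.
eval_pages = [{
    "title": "Tools to Securely Delete Files from Linux",
    "url": "https://example.com/secure-delete",
    "body": "Shred will help you  overwrite a file.",
}]
train_pages = [
    {"title": "tools to securely delete files from linux",
     "url": "https://example.com/mirror", "body": "different text"},
    {"title": "Sort a vector in C++",
     "url": "https://example.com/sort", "body": "Use std::sort."},
]
print([row["title"] for row in filter_leaks(train_pages, eval_pages)])
# → ['Sort a vector in C++']
```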

Model Improvement Notes

Improving this task requires robust long-document retrieval. Models should recognize the page's main tutorial topic and avoid being misled by boilerplate or incidental examples. Title and heading signals are important, but semantic alignment to the requested procedure also matters.

For reranking, useful features include title match, heading match, language or framework match, code-example relevance, and whether the article explains the requested procedure directly. A good ranker should prefer the tutorial that solves the exact task over a broader related article.
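A few of these features can be sketched with plain token overlap, as below. The document fields (title, headings, language) and the feature names are assumptions for illustration. A learned reranker would typically replace or complement these hand-built signals, and judging whether the article explains the procedure directly is not captured here.

```python
def token_overlap(query, text):
    """Fraction of distinct query tokens that also appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & text_tokens) / len(query_tokens)


def rerank_features(query, doc, query_language=None):
    """Build simple title, heading, and language features for one candidate."""
    heading_scores = [token_overlap(query, heading) for heading in doc["headings"]]
    same_language = query_language is not None and query_language == doc.get("language")
    return {
        "title_match": token_overlap(query, doc["title"]),
        "heading_match": max(heading_scores, default=0.0),
        "language_match": 1.0 if same_language else 0.0,
    }


# Hypothetical candidate document.
toy_doc = {
    "title": "Print superscript and subscript in Python",
    "headings": ["Using unicode characters", "Print subscript"],
    "language": "python",
}
print(rerank_features("python print superscript", toy_doc, query_language="python"))
```

Keeping title and heading overlap as separate features lets a downstream ranker weight the page's main topic more heavily than incidental matches deep in the body.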

Example Data

Query	Positive document
How to turn Android device screen on and off programmatically? [62 chars]	This example demonstrate about How to turn Android device screen on and off programmatically. Step 1 − Create a new project in Android Studio, go to File ⇒ New Project and fill all required details to create a new project. Step 2 − Add the following code to res/layout/activity_main.xml <? xml version= "1.0" encoding= "utf-8" ?> <RelativeLayout xmlns: android = "http://schemas.android.com/apk/res/android" xmlns: tools = "http://schemas.android.com/tools" android :layout_width= "match_parent" android :layout_height= "match_parent" android :layout_margin= "16dp" tools :context= ".MainActivity" > <LinearLayout android :layout_width= "match_parent" android :layout_height= "wrap_content" android :layout_centerInParent= "true" android :orientation= "horizontal" > <Button android :id= "@+id/btnEnable" android :layout_width= "0dp" android :layout_height= "wrap_content" android :layout_weight= "1" android :onClick= "enablePhone" android :text= "Enable" /> <Button android :id= "@+id/btnLock" andr... [1,000 / 6,654 chars]
Tools to Securely Delete Files from Linux - GeeksforGeeks [57 chars]	16 Feb, 2021 Every time you delete a file from your Linux system using the shift + delete or rm command, it doesn’t actually permanently and securely delete the file from the hard disk. When you delete a file with the rm command, the file system just frees up the appropriate inode but the contents of the old file are still in that space until it is overwritten which pave a way to recover the files. The space that was used by the file that you deleted is now free to be used by other new files. But the contents of the old files are still in the hard disk, Until and unless that space is overwritten by something else, so there is a good chance that the file can be recovered by anyone (maybe by some data thieves) with good knowledge of recovering data. It is like removing the index page of a book, where the chapters are still there, it just becomes much hard to find, but we can find it. Shred will help you to overwrite a deleted file, so it becomes difficult to recover it. It is like tearin... [1,000 / 3,940 chars]
Difference between Private and Protected in C++ with Example - GeeksforGeeks [76 chars]	03 Jan, 2022 Protected Protected access modifier is similar to that of private access modifiers, the difference is that the class member declared as Protected are inaccessible outside the class but they can be accessed by any subclass(derived class) of that class.Example: CPP // C++ program to demonstrate// protected access modifier #include <bits/stdc++.h>using namespace std; // base classclass Parent { // protected data membersprotected: int id_protected;}; // sub class or derived classclass Child : public Parent { public: void setId(int id) { // Child class is able to access the inherited // protected data members of the base class id_protected = id; } void displayId() { cout << "id_protected is: " << id_protected << endl; }}; // main functionint main(){ Child obj1; // member function of the derived class can // access the protected data members of the base class obj1.setId(81); obj1.displayId(); return 0;} id_protected is: 81 Private The class members declared as private can be acc... [1,000 / 2,559 chars]

Source Reference Table

Source	Role
CodeRAG-Bench: Can Retrieval Augment Code Generation?	Benchmark paper describing the retrieval sources and code-generation setting.
CodeRAG-Bench project page	Project page for the benchmark.
CodeRAG-Bench GitHub	Repository for benchmark resources.
code-rag-bench/online-tutorials	Public source dataset card.
hakari-bench/NanoCodeRAG	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoCodeRAG
Backing dataset	NanoCodeRAG
Task / split	NanoCodeRAGOnlineTutorials
Hugging Face dataset	hakari-bench/NanoCodeRAG
Language	en
Category	code
Queries	200
Documents	9,997
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	51.91
Document length avg chars	5,722.55

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.8175	0.9200	0.9700	top-500
Dense	`harrier_oss_v1_270m`	0.9027	0.9400	0.9550	top-500
Reranking hybrid	`reranking_hybrid`	0.8673	0.9550	1.0000	top-100

Training and Leakage Metadata

Original train split: unknown
Evaluation split origin: CodeRAG-Bench online tutorials retrieval source sampled into NanoCodeRAG
Train/eval overlap audit: not_audited_source_datastore_filtering_required
Leakage note: exclude NanoCodeRAG tutorial queries, qrels, and positive tutorial pages; do not train on unfiltered code-rag-bench/online-tutorials rows
Multi-positive training: single_positive_question_document_focus
Useful training data: non-overlapping programming tutorial title-to-page pairs, developer search logs over tutorials and documentation, Stack Overflow question-to-tutorial citation pairs, code example retrieval with long tutorial hard negatives