NanoCoIR / NanoStackOverflowQA

Overview

NanoStackOverflowQA is an English code and developer-help retrieval task in NanoCoIR. It is derived from StackOverflow question-answer pairs through CoIR. The query is a developer question that may include a title, body text, code snippets, error messages, framework context, and attempted solutions. The target document is the relevant answer.

This task represents practical troubleshooting retrieval. A good model must identify the actual programming problem and retrieve an answer that diagnoses or solves it. Matching only language names, library names, or stack-trace tokens is not enough, because many questions in the same framework share similar vocabulary while requiring different fixes.

Details

What the Original Data Measures

CoIR constructs a StackOverflow QA retrieval task from Stack Overflow data. Developer questions are paired with their answers, and the retrieval model must find the answer that corresponds to the question. The metadata lists no standalone task paper; the only recorded references are the CoIR construction details and the source dataset card.

The original task measures question-to-answer retrieval for programming help. Relevance depends on whether the answer solves or explains the user's issue. Queries may include noisy prose, partial code, error traces, and environment-specific details.

Observed Data Profile

This Nano split contains 200 queries, 10,000 documents, and 200 positive qrels. Each query has exactly one positive answer. Queries average 1,361.81 characters, and documents average 1,218.06 characters. Both sides are long enough to contain a mix of natural language and code.
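The profile figures above can be recomputed from the raw split. The following is a minimal sketch, assuming the queries and documents are held as `id -> text` dictionaries and the qrels as `query_id -> set of positive document ids`; these in-memory shapes and the function name `profile_split` are illustrative assumptions, not the dataset's published schema.

```python
from statistics import mean, median


def profile_split(queries, documents, qrels):
    """Summarize a retrieval split.

    queries:   dict mapping query id -> query text (assumed shape)
    documents: dict mapping document id -> document text (assumed shape)
    qrels:     dict mapping query id -> set of positive document ids
    """
    # Number of positives per query; queries missing from qrels count as 0.
    positives = [len(qrels.get(qid, ())) for qid in queries]
    return {
        "queries": len(queries),
        "documents": len(documents),
        "positive_qrels": sum(positives),
        "positives_per_query_avg": mean(positives),
        "positives_per_query_min": min(positives),
        "positives_per_query_median": median(positives),
        "positives_per_query_max": max(positives),
        "multi_positive_queries": sum(1 for p in positives if p > 1),
        "query_len_avg_chars": mean(len(t) for t in queries.values()),
        "doc_len_avg_chars": mean(len(t) for t in documents.values()),
    }
```

On this split, a correct load should report one positive per query for all 200 queries and no multi-positive queries.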

Observed examples involve WinForms click blocking, Angular $resource parameters, source-map configuration, MongoDB aggregation, and Inno Setup message behavior. Positive answers often include concrete fixes, corrected code, configuration details, or explanations of why the observed behavior occurs.

BM25 Evaluation Profile

BM25 is strong, with nDCG@10 of 0.7482, hit@10 of 0.8300, and recall@100 of 0.9250 using a top-500 candidate pool. Exact overlap from framework names, API names, error messages, configuration keys, and code snippets gives lexical retrieval useful anchors.

However, BM25 is not the strongest profile. StackOverflow answers often paraphrase the diagnosis, omit parts of the question text, or focus on the decisive fix rather than every error token. BM25 can also retrieve answers from the same tag or framework that share many words but solve a different issue.

Dense Evaluation Profile

The dense harrier-oss-270m profile is strongest by top-rank metrics, with nDCG@10 of 0.8836, hit@10 of 0.9300, and recall@100 of 0.9400. Dense retrieval improves over BM25 by connecting the developer problem to the answer's explanatory content and fix.

Dense similarity is useful when the question contains a long narrative but the relevant answer focuses on a small conceptual correction. It can recognize that a source-map problem is about path fields, or that an Angular resource issue is about parameter construction, even when the exact wording differs. The remaining errors likely involve several plausible answers from the same technology area.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.8328, hit@10 of 0.8850, and recall@100 of 0.9900. Each query keeps its top-100 candidates, with an optional safeguard entry at rank 101; two rows contain 101 candidates, and two safeguard-positive rows are recorded. Hybrid retrieval has the best top-100 coverage but is below dense retrieval for top-10 ranking.

This pattern is useful for reranking pipelines. BM25 contributes exact diagnostic tokens and code identifiers, while dense retrieval contributes semantic diagnosis. Together they nearly always include the positive in the top 100. A final ranker then needs to choose the answer that actually resolves the user's problem.
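The candidate-pool idea can be sketched as follows. This section does not state how the hybrid ranking is actually fused, so the sketch swaps in reciprocal rank fusion (RRF) as a common stand-in, and it assumes the rank-101 safeguard means appending the positive when it falls outside the top 100. The function names and the `k=60` constant are illustrative assumptions.

```python
def rrf_fuse(rankings, k=60):
    """Fuse several ranked id lists with reciprocal rank fusion.

    RRF is an assumed stand-in here, not the documented fusion method.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_candidates(bm25_ranking, dense_ranking, positive_id, top_k=100):
    """Return (candidates, safeguard_used) for one query.

    Assumption: the safeguard appends the positive after the top_k
    candidates when fusion did not place it inside the top_k.
    """
    fused = rrf_fuse([bm25_ranking, dense_ranking])
    candidates = fused[:top_k]
    safeguard_used = positive_id not in candidates
    if safeguard_used:
        candidates = candidates + [positive_id]
    return candidates, safeguard_used
```

Under this reading, a row with `top_k + 1` candidates is one where the safeguard fired, and the reranker still sees the positive even though it does not count toward recall@100.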

Metric Interpretation for Model Researchers

NanoStackOverflowQA rewards both lexical and semantic signals. BM25 is strong because programming questions include exact terms, but dense retrieval is stronger because answers are explanatory and often paraphrase the solution. The reranking_hybrid subset gives the best candidate coverage, with recall@100 of 0.9900.

For researchers, the task distinguishes three capabilities: lexical matching of error and API tokens, semantic matching of problem and fix, and final ranking among same-tag answers. nDCG@10 is the main top-rank diagnostic, while recall@100 indicates whether downstream reranking has access to the correct answer.
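Because every query here has exactly one positive, the three reported metrics reduce to simple functions of the positive's rank: the ideal DCG is 1, so nDCG@10 becomes 1 / log2(rank + 1) when the positive is in the top 10 and 0 otherwise. The sketch below illustrates that reduction; the function names and the `runs` / `qrels` dictionary shapes are assumptions for illustration, not the benchmark's evaluation code.

```python
import math


def single_positive_metrics(ranked_ids, positive_id):
    """Return (ndcg@10, hit@10, recall@100) for a one-positive query."""
    try:
        rank = ranked_ids.index(positive_id) + 1  # 1-based rank
    except ValueError:
        rank = None  # positive was not retrieved at all

    in_top10 = rank is not None and rank <= 10
    in_top100 = rank is not None and rank <= 100
    ndcg10 = 1.0 / math.log2(rank + 1) if in_top10 else 0.0
    return ndcg10, float(in_top10), float(in_top100)


def mean_metrics(runs, qrels):
    """Average the per-query metrics.

    runs:  dict mapping query id -> ranked list of document ids
    qrels: dict mapping query id -> the single positive document id
    """
    per_query = [
        single_positive_metrics(runs.get(qid, []), pos)
        for qid, pos in qrels.items()
    ]
    n = len(per_query)
    return {
        "ndcg@10": sum(m[0] for m in per_query) / n,
        "hit@10": sum(m[1] for m in per_query) / n,
        "recall@100": sum(m[2] for m in per_query) / n,
    }
```

A positive at rank 1 scores nDCG@10 of 1.0, while a positive at rank 11 scores 0 on nDCG@10 and hit@10 but still counts toward recall@100, which is why the two metrics can rank the profiles differently.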

Query and Relevance Type Tendencies

Queries are real-world programming questions. They may contain titles, code, error output, framework versions, configuration snippets, or partial attempts. Documents are developer answers that mix prose, code, and corrective guidance.

Relevance is defined by answer usefulness. The positive answer should solve, explain, or correctly diagnose the specific issue in the question. A candidate from the same framework is not enough if it addresses a different error or a different API behavior.

Representative Failure Modes

BM25 may retrieve answers from the same framework tag because of shared API names or stack-trace tokens. Dense retrieval may retrieve an answer that is conceptually close but not the accepted or intended fix. Hybrid retrieval can include the correct answer in the pool but rank a more lexically similar answer above it.

Another failure mode is over-focusing on code snippets while missing the user's actual question. The decisive evidence may be in prose around the code, such as a version mismatch, a configuration path, or a misunderstanding of an API parameter.

Training Data That May Help

Useful training data includes StackOverflow question-answer retrieval, code troubleshooting pairs, error-message matching, and framework-tag hard negatives. Strong negatives should come from the same language or framework and share visible tokens while solving a different problem.
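One simple way to mine such negatives is to rank same-tag answers by token overlap with the question and keep the closest non-positive ones. The sketch below is illustrative only: the `id`, `tag`, and `text` fields are hypothetical (this section does not confirm that tags are available in the data), and Jaccard overlap is an assumed similarity measure rather than a documented procedure.

```python
import re


def tokens(text):
    """Lowercased identifier-like tokens, keeping '$', '_' and inner dots."""
    found = re.findall(r"[a-z_$][a-z0-9_$.]*", text.lower())
    return {t.rstrip(".") for t in found}


def jaccard(a, b):
    """Jaccard overlap of two token sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mine_hard_negatives(question, positive_id, answers, tag, n=2):
    """Pick up to n same-tag answers that share the most tokens with the
    question but are not the positive.

    answers: list of dicts with hypothetical keys 'id', 'tag', 'text'.
    """
    q_tokens = tokens(question)
    scored = [
        (jaccard(q_tokens, tokens(a["text"])), a["id"])
        for a in answers
        if a["tag"] == tag and a["id"] != positive_id
    ]
    # Highest overlap first; ties broken by id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:n]]
```

Negatives chosen this way look lexically right but answer a different question, which is the confusion described above. They should still be checked against the leakage rules before use.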

Leakage filtering is required. The Nano split is derived from CoIR StackOverflow QA test-side data. Training should exclude NanoStackOverflowQA question-answer pairs and avoid StackOverflow QA test-derived rows. Filters should cover normalized title, body, answer text, code snippets, URL or ID, and token fingerprints.
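A minimal version of such a filter can hash normalized text fields and drop any training row that collides with an evaluation row. This is a sketch under stated assumptions: the field names `title`, `body`, and `answer` are hypothetical, and exact hashing after whitespace and case normalization catches only verbatim duplicates, so near-duplicate detection (for example token-shingle fingerprints) would need to be layered on top.

```python
import hashlib
import re


def fingerprint(text):
    """SHA-256 of the text after lowercasing and collapsing whitespace."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def filter_training_rows(train_rows, eval_rows,
                         fields=("title", "body", "answer")):
    """Drop training rows sharing any normalized field with an eval row.

    Rows are dicts; the field names are hypothetical placeholders.
    """
    blocked = {
        fingerprint(row[f])
        for row in eval_rows
        for f in fields
        if row.get(f)
    }
    return [
        row for row in train_rows
        if not any(row.get(f) and fingerprint(row[f]) in blocked
                   for f in fields)
    ]
```

Matching on any single field is deliberately conservative: a training row that reuses only the answer text of an evaluation pair is still removed.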

Model Improvement Notes

Improving this task requires models that can represent both code tokens and explanatory diagnosis. Error messages, API names, and configuration keys should remain visible, but the model also needs to understand why a specific answer resolves the question.

For reranking, the most useful signals are answer specificity, correspondence to the reported failure, and handling of the exact framework or configuration context. Same-tag hard negatives are essential for measuring real progress.

Example Data

Query	Positive document
How to block mouse click events from another form I have a winforms single form application that uses a "Thickbox" I've created whenever the loads a new view into the application form. The "Thickbox" shows another form in front of the application's form that is semi-transparent and has a user control that is the box itself. This thickbox can be shown a modal dialog and in that case I have no problems at all, but it can also be shown as a non modal, for instance, when the user switches views in t... [500 / 1,644 chars]	I'm glad to announce that the problem is finally solved. After spending a few days attempting to recreate this bug in a new application, re-constructing the main form in the application, comment out parts of the code in the main application, and generally just shooting all over to try and find a lead, It finally hit me. The application behaved as if the clicks on the thickbox was queued somehow and only activated when the thickbox was closed. This morning, after fixing some other bugs, The penny finally dropped - all I was missing was a single line of code right before closing the thickbox's form: Application.DoEvents(); The annoying thing is that it's not something that's new to me, I've used it many times before including in the main application and in the thickbox code itself... I guess I just had to let if go for a while to enable my mind to understand what was so painfully obvious in hindsight... [916 chars]
Passing a parameter to a $resource? I have a controller that that looks like this: (function() { angular .module("main") .controller("HomeCtrl", ["branchResource", "adalAuthenticationService", HomeCtrl]); function HomeCtrl(branchResource, adalService){ var vm = this; vm.copyrightDate = new Date(); vm.user = adalService.userInfo.userName; // right here, can I insert the vm.user from above // as a parameter to the resource's query? branchResource.query(function (data) { vm.branches = data; }); }}(... [500 / 1,406 chars]	Create the $resource object with: function branchResource($resource){ ̶r̶e̶t̶u̶r̶n̶ ̶$̶r̶e̶s̶o̶u̶r̶c̶e̶(̶"̶/̶a̶p̶i̶/̶u̶s̶e̶r̶/̶G̶e̶t̶A̶l̶l̶U̶s̶e̶r̶B̶r̶a̶n̶c̶h̶e̶s̶?̶f̶e̶d̶e̶r̶a̶t̶e̶d̶U̶s̶e̶r̶N̶a̶m̶e̶=̶:̶u̶s̶e̶r̶"̶)̶ ̶ return $resource("/api/user/GetAllUserBranches") }} Call the $resource object with: branchResource.query({"federatedUserName": vm.user}, function (data) { vm.branches = data; }); //OR vm.branches = branchResource.query({"federatedUserName": vm.user}); It is important to realize that invoking a $resource object method immediately returns an empty reference (object or array depending on isArray). Once the data is returned from the server the existing reference is populated with the actual data. Each key value in the parameter object is first bound to url template if present and then any excess keys are appended to the url search query after the ?. For more information, see AngularJS ngResource $resource API Reference. [991 chars]
Chrome doesn’t show un-minified code in spite of source map present I’m using Grunt and UglifyJS to generate source maps for my AngularJS app. It produces a file customDomain.js and customDomain.js.map. JS file Last line of customDomain.js looks like this: //# sourceMappingURL=customDomain.js.map Map file I find two references to customDomain.js inside of customDomain.js.map, one at the beginning: "sources":["../../../.tmp/concat/scripts/customDomain.js"] I think this looks weird so I trim it to... [500 / 910 chars]	"sources":["customDomain.js"] should be relative to the customDomain.map.js file. Make sure they are in the same directory on your server if this is the case for you. "file":"customDomain.js" should be changed to the name of the map file, in your case this would be "file":"customDomain.map.js". Here's a map file example taken from treehouse (sourceRoot may be unnecessary in your case): { version: 3, file: "script.js.map", sources: [ "app.js", "content.js", "widget.js" ], sourceRoot: "/", names: ["slideUp", "slideDown", "save"], mappings: "AAA0B,kBAAhBA,QAAOC,SACjBD,OAAOC,OAAO..." } [641 chars]

Source Reference Table

Source	Role
CoIR: A Comprehensive Benchmark for Code Information Retrieval Models	Benchmark paper defining the retrieval adaptation.
CoIR-Retrieval/stackoverflow-qa	Public source dataset card for the retrieval task.
Stack Overflow Data on Kaggle	Public source data page.
hakari-bench/NanoCoIR	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoCoIR
Backing dataset	NanoCoIR
Task / split	NanoStackOverflowQA
Hugging Face dataset	hakari-bench/NanoCoIR
Language	en
Category	code
Queries	200
Documents	10,000
Positive qrels	200
Positives / query avg	1.00
Positives / query min	1
Positives / query median	1.00
Positives / query max	1
Multi-positive queries	0 (0.00%)
Query length avg chars	1,361.81
Document length avg chars	1,218.06

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.7482	0.8300	0.9250	top-500
Dense	`harrier_oss_v1_270m`	0.8836	0.9300	0.9400	top-500
Reranking hybrid	`reranking_hybrid`	0.8328	0.8850	0.9900	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: CoIR StackOverflow QA test-derived retrieval split
Train/eval overlap audit: not_audited_split_filtering_required
Leakage note: exclude NanoStackOverflowQA question-answer pairs; do not train on StackOverflow QA test-derived rows
Multi-positive training: single_positive
Useful training data: StackOverflow question-answer retrieval, code troubleshooting pairs, framework-tag hard negatives