NanoBRIGHT / NanoBrightStackoverflow
Overview
NanoBrightStackoverflow is the Stack Overflow slice of NanoBRIGHT. Queries are developer questions with code snippets, configuration context, platform details, and symptoms, while relevant documents are cited passages such as API documentation, technical references, blog posts, or framework pages. The task evaluates whether retrieval systems can find the source that explains the API behavior, language feature, command, or configuration needed to resolve a programming issue.
Details
What the Original Data Measures
BRIGHT's StackExchange construction uses real posts as queries and cited or validated sources as positives. For Stack Overflow, this creates a practical documentation retrieval task: the answer may depend on a JavaScript comparator rule, a WinSCP command, a DAX function, a C++ constexpr limitation, or a framework-specific configuration behavior.
The task measures more than code-token overlap. A query can contain long code blocks, failed attempts, environment descriptions, and user interpretation. The relevant passage may explain a general API rule or platform behavior that solves the issue without repeating the exact code.
Observed Data Profile
The task contains 117 queries, 10,000 documents, and 478 relevance judgments. It has 4.09 positives per query on average, a minimum of 1, a median of 2.0, a maximum of 59, and 81 multi-positive queries, or 69.23% of the set.
Queries average 1,292.97 characters and documents average 1,120.63 characters. Both sides are relatively long for passage retrieval, often including snippets, command examples, API signatures, or documentation excerpts. This makes the task sensitive to both exact identifiers and semantic interpretation.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.3685, hit@10 of 0.5897, and recall@100 of 0.6506 using the top-500 BM25 candidate subset. Sparse retrieval is useful because Stack Overflow questions frequently contain exact API names, language constructs, command flags, function names, product names, and error fragments.
The limitation is that a query can contain many incidental identifiers. BM25 may match the wrong part of a code snippet or retrieve a same-library page that does not explain the behavior. It is strong, but not enough to capture symptom-to-documentation relations by itself.
Dense Evaluation Profile
The dense harrier-oss-270m run reaches nDCG@10 of 0.4033, hit@10 of 0.5726, and recall@100 of 0.7469. Dense retrieval improves nDCG@10 and recall@100 over BM25, though BM25 has a slightly higher hit@10. This indicates that semantic matching helps find supporting documentation beyond exact token overlap.
Dense retrieval is useful when the question describes a desired behavior or failure symptom and the source explains the underlying API concept. It can connect code examples to documentation about comparator contracts, random functions, launch options, policy filters, or compile-time constructs.
Reranking Hybrid Evaluation Profile
The reranking_hybrid candidate set reaches nDCG@10 of 0.4686, hit@10 of 0.7009, and recall@100 of 0.8096. It uses a top-100 candidate range with an optional rank-101 safeguard; this task has 14 safeguard rows, candidate counts from 100 to 101, and a mean of 100.12 candidates.
The hybrid profile is strongest across all reported metrics. It benefits from exact API names and error tokens while also capturing semantic links between symptoms and documentation. For Stack Overflow-style retrieval, the observed data strongly supports using sparse and dense signals together.
Metric Interpretation for Model Researchers
This task is a practical hybrid-search benchmark. BM25 is strong because exact identifiers matter. Dense retrieval is strong because many questions describe behavior rather than naming the exact concept. Reranking_hybrid combines both and gives the best top-10 ranking and candidate coverage.
Researchers should evaluate whether models distinguish the document that solves the issue from merely related documentation. Hit@10 is important for whether a developer sees at least one useful source, while recall@100 matters for downstream reranking, answer generation, or tool-assisted documentation lookup.
Query and Relevance Type Tendencies
Queries include JavaScript table sorting, SFTP file operations, DAX row-level security, JSON-only LLM responses, C++ constexpr initialization, framework configuration, SQL behavior, and scripting issues. Positive documents include MDN-style references, Microsoft Learn pages, WinSCP documentation, cppreference pages, OpenAI documentation, and technical blog passages.
The relevance relation is operational support. A positive passage explains the API rule, command option, platform behavior, or language feature needed to fix or implement the user's task.
Representative Failure Modes
Likely failures include matching the same library but the wrong API, over-ranking code-token overlap that does not solve the problem, missing a relevant documentation page because the query describes a symptom, and confusing user attempts with the actual needed concept.
BM25 is vulnerable to noisy code snippets and incidental identifiers. Dense retrieval can overlook exact function names or version-specific details. Hybrid retrieval reduces both risks, but final reranking still needs to judge whether the source resolves the issue.
Training Data That May Help
Useful training data includes non-overlapping Stack Overflow questions with cited links, documentation retrieval and API usage examples, issue-to-document troubleshooting pairs, and hard negatives from the same library or framework but a different failure mode.
Synthetic data should generate developer questions with realistic code snippets, environment details, and symptoms. Positives should explain the API behavior or configuration needed to solve the issue. Hard negatives should share names and syntax but not address the actual problem.
Model Improvement Notes
Strong systems should combine exact matching for identifiers with semantic matching for behavior. Query decomposition can help identify the language, framework, error, attempted code, and desired outcome. Rerankers should be trained to prefer authoritative documentation or precise explanatory passages over generally related pages.
The observed metrics make reranking_hybrid the best candidate source for this task. Improvements should focus on grounding the match in the specific API behavior and filtering out same-library distractors.
Example Data
| Query | Positive document |
| Sort tbody list which is populated with Javascript getList? I have a router with a DHCP page which is not sorted by the internal IP number, instead it is fully random. I have full access to the html and javascript, and I can modify this without any issues. However I cannot figure out how to make the list sorted by IP-address by default. I have no need to be able to manually click and sort, I just want the list to always be sorted by IP-address. I cannot figure out if it is possible to sort a get... [500 / 12,432 chars] | 0 or have opposite signs. * _Transitive_ : If compareFn(a, b) and compareFn(b, c) are both positive, zero, or negative, then compareFn(a, c) has the same positivity as the previous two. A comparator conforming to the constraints above will always be able to return all of 1 , 0 , and -1 , or consistently return 0 . For example, if a comparator only returns 1 and 0 , or only returns 0 and -1 , it will not be able to sort reliably because _anti-symmetry_ is broken. A comparator that always returns 0 will cause the array to not be changed at all, but is reliable nonetheless. The default lexicographic comparator satisfies all constraints above. To compare numbers instead of strings, the compare function can subtract b from a . The following function will sort the array in ascending order (if it doesn't contain NaN ): js function compareNumbers(a, b) { return a - b; } The sort() method is [ generic ](/en- US/docs/Web/JavaScr... [1,000 / 3,998 chars] |
Copy and delete files from SFTP folder I have to pick (remove) the files with file mask FileName_A_* and FileName_B_* from SFTP location and place them in an sharedrive. I tried using WinSCP. I have created an HourlyFile.txt file with below code and placed it under C:\Program Files (x86)\WinSCP. Another batch file HourlyFile.bat to execute the script from HourlyFile.txt HourlyFile.txt: ``` option batch abort option confirm off open sftp.......... get -filemask="FileName_A_*" /outbound/... [500 / 1,125 chars] | Menu Toggle search  * Home * News * Introduction * Download * Install * Documentation * Forum Close Search Close Documentation » Features » Scripting » Script Commands » # rm command Removes one or more remote files. * Syntax * Remarks * Examples * Converting to .NET Assembly _Advertisement_ ## Syntax rm <file> [ <file2> ... ] ## Remarks If remote recycle bin is configured, moves file to the bin instead of deleting it. Filename can be replaced with wildcard to... [1,000 / 4,000 chars] |
DAX RLS Function using LOOKUPVALUE Parsing but not working I have a table that I'm trying to implement RLS on using a secondary table with a structure below: EmployeeTable `` EmployeeID EmployeeEmail 1 1234@email.com 2 4567@email.com ` FilterTable ` EmployeeID ManagerHierarchy 1 3&4&5 2 6&7&4&5 `` The ManagerHierarchy column is a string that shows all managers of an employee concatenated together and separated by "&". The goal of the RLS is to create a filter that allows any manager to vie... [500 / 1,525 chars] | Skip to main content This browser is no longer supported. Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support. Download Microsoft Edge More info about Internet Explorer and Microsoft Edge Table of contents Exit focus mode Read in English Save Table of contents Read in English Save Add to Plan Edit Print Twitter LinkedIn Facebook Email Table of contents # LOOKUPVALUE * Article * 10/20/2023 * 4 contributors Feedback ## In this article Returns the value for the row that meets all criteria specified by one or more search conditions. ## Synta... [1,000 / 3,997 chars] |
Source Reference Table
| Item | Reference |
| Original benchmark paper | BRIGHT |
| Project page | BRIGHT project page |
| Source dataset | xlangai/BRIGHT |
| NanoBRIGHT dataset | hakari-bench/NanoBRIGHT |
Representative query and positive source snippets:
| Query | Positive document snippet |
| Sort a JavaScript-populated table body by internal IP number. | A JavaScript reference passage explains comparator consistency and sort behavior. |
| Copy and delete masked files from an SFTP folder using WinSCP. | A WinSCP documentation passage describes client commands and scripting behavior. |
| Implement DAX row-level security using LOOKUPVALUE. | A Microsoft documentation page explains DAX function behavior and evaluation context. |
| Force an LLM agent to respond only with JSON strings. | A model documentation passage discusses generation behavior and implementation guidance. |
| Initialize a constexpr C++ array with generated values. | A cppreference-style page explains compile-time utility constructs. |
Dataset Information
| Field | Value |
| Nano set | NanoBRIGHT |
| Backing dataset | NanoBRIGHT |
| Task / split | NanoBrightStackoverflow |
| Hugging Face dataset | hakari-bench/NanoBRIGHT |
| Language | en |
| Category | natural_language |
| Queries | 117 |
| Documents | 10,000 |
| Positive qrels | 478 |
| Positives / query avg | 4.09 |
| Positives / query min | 1 |
| Positives / query median | 2.00 |
| Positives / query max | 59 |
| Multi-positive queries | 81 (69.23%) |
| Query length avg chars | 1,292.97 |
| Document length avg chars | 1,120.63 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.3685 | 0.5897 | 0.6506 | top-500 |
| Dense | harrier_oss_v1_270m | 0.4033 | 0.5726 | 0.7469 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4686 | 0.7009 | 0.8096 | top-100 |
Training and Leakage Metadata
- Original train split: unknown
- Evaluation split origin: BRIGHT Stack Overflow evaluation split
- Train/eval overlap audit: not_audited
- Leakage note: exclude NanoBRIGHT Stackoverflow queries, cited positives, and linked answer pages
- Multi-positive training: multi_positive_objective
- Useful training data: non-overlapping Stack Overflow questions with cited links, documentation retrieval and API usage examples, issue-to-doc troubleshooting pairs