NanoBRIGHT / NanoBrightStackoverflowLong

Overview

NanoBrightStackoverflowLong is the long-document Stack Overflow slice of NanoBRIGHT. Queries are developer questions with code, configuration, environment context, and symptoms, while relevant documents are full cited source pages such as MDN, Microsoft Learn, cppreference, API manuals, specifications, or technical documentation pages. The task measures whether a retriever can identify the source page containing the section that explains the needed API behavior or platform feature.

Details

What the Original Data Measures

BRIGHT's long-document StackExchange variants retrieve full source pages instead of passage chunks. For Stack Overflow, this means the positive document may be a very large reference page, documentation article, command manual, or specification. The useful explanation may be a small section inside a long page with navigation, examples, related APIs, and boilerplate.

The task is a source-page retrieval problem for programming help. Relevance depends on whether the full page contains the authoritative behavior, option, function, or language feature needed to solve the user's issue.

Observed Data Profile

The task contains 117 queries, 1,846 documents, and 129 relevance judgments. It is mostly single-positive: queries have 1.10 positives on average, with a minimum of 1, a median of 1, and a maximum of 2. Twelve of the 117 queries (10.26%) have two positives.
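The profile figures above can be reproduced from a list of relevance judgments. The following is a minimal sketch, assuming qrels are available as (query_id, doc_id) pairs; the function name and the synthetic IDs are hypothetical and only mirror the reported counts.

```python
from collections import Counter
from statistics import mean, median


def positives_profile(qrels):
    """Summarize positives per query from (query_id, doc_id) pairs."""
    counts = Counter(query_id for query_id, _ in qrels)
    per_query = list(counts.values())
    multi = sum(1 for count in per_query if count > 1)
    return {
        "queries": len(per_query),
        "qrels": len(qrels),
        "avg": round(mean(per_query), 2),
        "min": min(per_query),
        "median": median(per_query),
        "max": max(per_query),
        "multi_positive": multi,
        "multi_positive_pct": round(100 * multi / len(per_query), 2),
    }


# Synthetic qrels shaped like the reported profile: 117 queries with one
# positive each, 12 of which receive a second positive (IDs are made up).
qrels = [(f"q{i}", f"d{i}") for i in range(117)]
qrels += [(f"q{i}", f"d{i}_b") for i in range(12)]
profile = positives_profile(qrels)
print(profile)
```

With 105 single-positive and 12 two-positive queries, the totals come to 129 judgments, which matches the reported average of 1.10 and the 10.26% multi-positive share.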

Queries average 1,292.97 characters, while documents average 77,578.44 characters. Some source pages are extremely large, so the main challenge is not just finding a same-topic page but ranking the full page that contains the decisive API rule or reference section.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.4440, hit@10 of 0.7009, and recall@100 of 0.9225 using the top-500 BM25 candidate subset. This is a strong sparse baseline. Exact API names, command names, function names, and specification terms are highly informative in programming documentation retrieval.

The weakness is that long reference pages contain many related identifiers. BM25 can rank a broad reference page highly because it shares terms with the query, even if a different page contains the exact behavior or feature needed. BM25 therefore gives strong coverage (recall@100 of 0.9225) but weaker ordering (nDCG@10 of 0.4440).
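The gap between coverage and ordering is easier to see with the metrics written out. This is a sketch using the standard binary-relevance definitions of nDCG@k, hit@k, and recall@k; the benchmark's actual evaluation code and averaging scheme are not shown in this card, so treat the functions and the document IDs as assumptions.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: discounted gain of hits divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(len(positives), k)))
    return dcg / ideal if ideal > 0 else 0.0


def hit_at_k(ranked, positives, k=10):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & set(positives)) / len(positives)


# Hypothetical ranking: two broad same-family pages outrank the target page.
ranked = ["broad_reference_page", "same_family_page", "target_page"]
ranked += [f"filler_{i}" for i in range(97)]
positives = {"target_page"}

ndcg = ndcg_at_k(ranked, positives)
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
print(ndcg, hit, recall)
```

In this toy case the target is retrieved and counts fully toward hit@10 and recall@100, but being outranked by two broad pages halves its nDCG@10 contribution.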

Dense Evaluation Profile

The dense harrier-oss-270m run reaches nDCG@10 of 0.3894, hit@10 of 0.6581, and recall@100 of 0.9070. Dense retrieval is also strong, but it trails BM25 on all reported metrics in this long-document slice.

This pattern suggests that exact identifiers are especially valuable when the relevant document is a full programming reference page. Dense retrieval can capture broad semantic similarity, but it may underweight exact function names, flags, class names, or documentation terms that uniquely identify the correct source.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate set reaches nDCG@10 of 0.4744, hit@10 of 0.8376, and recall@100 of 0.9767. It uses a top-100 candidate range with an optional rank-101 safeguard; this task has 3 safeguard rows, candidate counts from 100 to 101, and a mean of 100.03 candidates.
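The card does not define the rank-101 safeguard. One plausible reading, which is an assumption here and not confirmed by the source, is that a positive missing from the top 100 is appended as a 101st candidate. The sketch below follows that reading with hypothetical names and synthetic IDs, and checks that three safeguard rows over 117 queries give the reported mean of 100.03.

```python
def build_candidates(ranked, positives, top_k=100):
    """Keep the top_k documents and append at most one missing positive.

    The append step is an ASSUMED interpretation of the "rank-101 safeguard".
    """
    pool = list(ranked[:top_k])
    missing = [doc_id for doc_id in sorted(positives) if doc_id not in pool]
    if missing:
        pool.append(missing[0])  # safeguard row at rank 101
    return pool


pools = []
for i in range(117):
    ranked = [f"q{i}_doc{j}" for j in range(500)]
    # For three hypothetical queries the positive falls outside the ranked list.
    positive = f"q{i}_outside" if i < 3 else f"q{i}_doc5"
    pools.append(build_candidates(ranked, {positive}))

sizes = [len(pool) for pool in pools]
safeguard_rows = sum(1 for size in sizes if size == 101)
mean_size = round(sum(sizes) / len(sizes), 2)
print(min(sizes), max(sizes), safeguard_rows, mean_size)
```

Under the same reading, the reported recall@100 of 0.9767 is numerically consistent with 126 of the 129 positives sitting inside the top 100, although the source does not state how the three safeguard rows arise.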

The hybrid profile is strongest across all reported metrics. It improves over BM25's already strong lexical signal by adding semantic evidence about the user's intended behavior or failure mode. For long Stack Overflow source retrieval, hybrid search is the best observed candidate strategy.
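The card does not say how the lexical and dense rankings are combined. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common fusion method that is swapped in here as an assumption; the constant k=60 and the page names are likewise illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked lists by summing 1 / (k + rank) for each document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))[:top_n]


# Hypothetical rankings: BM25 prefers a broad reference page, while the dense
# run prefers the page that actually covers the behavior in question.
bm25_ranking = ["array_reference", "array_sort", "string_reference"]
dense_ranking = ["array_sort", "table_dom_guide", "array_reference"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

A page that both retrievers rank near the top rises above a page that only one of them favors, which is the kind of complementary evidence the hybrid profile relies on.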

Metric Interpretation for Model Researchers

This task is a high-recall hybrid-search benchmark with unusually long documents. BM25 is strong because programming references are identifier-rich. Dense retrieval is useful but weaker than BM25 in this setting. The reranking_hybrid pool combines exact names and semantic intent, producing the best top-10 ranking and nearly complete recall@100.

Researchers should treat document length as a serious factor. A correct full page can run to hundreds of thousands of characters while only one section matters. Systems that retrieve full pages should ideally add section-level reranking or evidence extraction after source-page retrieval.
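A section-level pass after page retrieval can be sketched as follows. This is a deliberately simple illustration that splits a page on blank lines and scores each section by token overlap with the query; a real system would more likely use a trained reranker, and the function names and sample page are hypothetical.

```python
import re


def tokenize(text):
    """Lowercase identifier-like tokens, so Array.prototype.sort() gives array, prototype, sort."""
    return set(re.findall(r"[a-z_][a-z0-9_]*", text.lower()))


def best_section(query, page):
    """Return (index, overlap, section) for the section sharing the most query tokens."""
    query_tokens = tokenize(query)
    sections = [part.strip() for part in page.split("\n\n") if part.strip()]
    if not sections:
        return None
    # Negative index makes max() prefer the earliest section on ties.
    scored = [
        (len(query_tokens & tokenize(section)), -index, section)
        for index, section in enumerate(sections)
    ]
    overlap, negative_index, section = max(scored)
    return -negative_index, overlap, section


page = (
    "Skip to main content\nSkip to search\nReferences Guides\n\n"
    "Array.prototype.sort() sorts the elements of an array in place. "
    "A compare function defines the sort order.\n\n"
    "Array.prototype.map() creates a new array populated with the results of a callback."
)
index, overlap, section = best_section("how to sort array with compare function", page)
print(index, overlap)
```

The navigation block and the unrelated API section score lower than the section that describes the sorting behavior, so the answer-bearing part can be surfaced even when most of the page is irrelevant.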

Query and Relevance Type Tendencies

Queries include JavaScript sorting, SFTP scripting, DAX row-level security, LLM JSON output, C++ constexpr initialization, iCalendar or WebView behavior, .NET APIs, and platform-specific configuration. Positive documents are often long documentation pages from major technical reference sites.

The relevance relation is full-source support. A page is positive if it contains the behavior, option, function, or rule needed to answer the Stack Overflow question. Much of the page can be unrelated.

Representative Failure Modes

Likely failures include retrieving the wrong page from the same API family, over-ranking huge reference pages that mention many query terms, missing a source because the user describes symptoms rather than the canonical API name, and confusing examples with the authoritative rule.

BM25 may be distracted by long-page identifier density. Dense retrieval may lose exact names and version-specific details. Hybrid retrieval reduces both risks, but downstream reranking should still inspect the relevant section.

Training Data That May Help

Useful training data includes document-level API documentation retrieval, Stack Overflow questions with cited links, long technical-reference QA, and issue-to-document pairs where the positive is a full source page.

Synthetic data should create developer questions with concrete code and environment context, then pair them with full documentation pages where one section explains the behavior. Hard negatives should be long pages from the same platform or API family that do not resolve the issue.
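Hard-negative selection along these lines might look like the sketch below. The Page record, its site field, and the length threshold are hypothetical, and the filter only enforces "long page from the same platform"; confirming that a negative does not resolve the issue would still need a separate check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    doc_id: str
    site: str  # hypothetical platform label, e.g. "mdn" or "winscp"
    text: str


def pick_hard_negatives(positive, corpus, min_chars=1000, limit=3):
    """Pick long pages from the positive's site, excluding the positive itself."""
    candidates = [
        page
        for page in corpus
        if page.site == positive.site
        and page.doc_id != positive.doc_id
        and len(page.text) >= min_chars
    ]
    # Longest pages first, with the document ID as a stable tie-breaker.
    candidates.sort(key=lambda page: (-len(page.text), page.doc_id))
    return candidates[:limit]


positive = Page("mdn_sort", "mdn", "x" * 5000)
corpus = [
    positive,
    Page("mdn_array", "mdn", "x" * 9000),
    Page("mdn_map", "mdn", "x" * 3000),
    Page("mdn_stub", "mdn", "x" * 200),
    Page("winscp_call", "winscp", "x" * 8000),
]
negatives = pick_hard_negatives(positive, corpus)
print([page.doc_id for page in negatives])
```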

Model Improvement Notes

Strong systems should combine exact identifier preservation with semantic understanding of developer intent. Candidate retrieval should use hybrid search, followed by reranking or section extraction that can locate the answer-bearing part of a large source page.

The observed metrics show that reranking_hybrid is the best pool for both top-10 visibility and downstream coverage. Further improvements should focus on distinguishing authoritative source sections from merely related reference pages.

Example Data

Query	Positive document
Sort tbody list which is populated with Javascript getList? I have a router with a DHCP page which is not sorted by the internal IP number, instead it is fully random. I have full access to the html and javascript, and I can modify this without any issues. However I cannot figure out how to make the list sorted by IP-address by default. I have no need to be able to manually click and sort, I just want the list to always be sorted by IP-address. I cannot figure out if it is possible to sort a get... [500 / 12,432 chars]	* Skip to main content * Skip to search * Skip to select language MDN Web Docs Open main menu * References References * Overview / Web Technology Web technology reference for developers * HTML Structure of content on the web * CSS Code used to describe document style * JavaScript General-purpose scripting language * HTTP Protocol for transmitting web resources * Web APIs Interfaces for building web applications * Web Extensions Developing extensions for web browsers * Web Technology Web technology reference for developers * Guides Guides * Overview / MDN Learning Area Learn web development * MDN Learning Area Learn web development * [ HTML Learn to structure web con... [1,000 / 951,658 chars]
Copy and delete files from SFTP folder I have to pick (remove) the files with file mask `FileName_A_` and `FileName_B_` from SFTP location and place them in an sharedrive. I tried using WinSCP. I have created an `HourlyFile.txt` file with below code and placed it under `C:\Program Files (x86)\WinSCP`. Another batch file `HourlyFile.bat` to execute the script from HourlyFile.txt HourlyFile.txt: ``` option batch abort option confirm off open sftp.......... get -filemask="FileName_A_*" /outbound/... [500 / 1,125 chars]	Menu Toggle search ![ WinSCP Free SFTP, SCP, S3 and FTP client for Windows ](https://winscp.net/) * Home * News * Introduction * Download * Install * Documentation * Forum Close Search Close Documentation » Features » Scripting » Script Commands » # call command With SFTP and SCP protocols , executes arbitrary remote shell command . With FTP protocol, executes a protocol command. Not supported with WebDAV and S3 protocols. * Syntax * Remarks * Examples * Converting to .NET Assembly _Advertisement_ ## Syntax call <command> ## Remarks With SFTP... [1,000 / 190,917 chars]
DAX RLS Function using LOOKUPVALUE Parsing but not working I have a table that I'm trying to implement RLS on using a secondary table with a structure below: EmployeeTable `` `EmployeeID EmployeeEmail 1 1234@email.com 2 4567@email.com` ` `FilterTable` ` `EmployeeID ManagerHierarchy 1 3&4&5 2 6&7&4&5` `` The ManagerHierarchy column is a string that shows all managers of an employee concatenated together and separated by "&". The goal of the RLS is to create a filter that allows any manager to vie... [500 / 1,525 chars]	Skip to main content This browser is no longer supported. Upgrade to Microsoft Edge to take advantage of the latest features, security updates, and technical support. Download Microsoft Edge More info about Internet Explorer and Microsoft Edge Table of contents Exit focus mode Read in English Save Table of contents Read in English Save Add to Plan Edit Print Twitter LinkedIn Facebook Email Table of contents # LOOKUPVALUE * Article * 10/20/2023 * 4 contributors Feedback ## In this article Returns the value for the row that meets all criteria specified by one or more search conditions. ## Synta... [1,000 / 42,606 chars]

Source Reference Table

Item	Reference
Original benchmark paper	BRIGHT
Project page	BRIGHT project page
Source dataset	xlangai/BRIGHT
NanoBRIGHT dataset	hakari-bench/NanoBRIGHT

Representative query and positive source snippets:

Query	Positive document snippet
Sort a JavaScript-populated table body by internal IP number.	A long MDN-style JavaScript reference page contains the sort and comparator behavior needed.
Copy and delete masked files from an SFTP folder using WinSCP.	A long WinSCP documentation page explains scripting commands and file operations.
Implement DAX row-level security using LOOKUPVALUE.	A Microsoft Learn page contains the relevant DAX function and security semantics.
Force an LLM executor to respond only with JSON strings.	A long text-generation documentation page discusses model output behavior and implementation guidance.
Initialize a constexpr C++ array with generated values.	A cppreference-style page contains the compile-time utility or template behavior involved.

Dataset Information

Field	Value
Nano set	NanoBRIGHT
Backing dataset	NanoBRIGHT
Task / split	NanoBrightStackoverflowLong
Hugging Face dataset	hakari-bench/NanoBRIGHT
Language	en
Category	natural_language
Queries	117
Documents	1,846
Positive qrels	129
Positives / query avg	1.10
Positives / query min	1
Positives / query median	1.00
Positives / query max	2
Multi-positive queries	12 (10.26%)
Query length avg chars	1,292.97
Document length avg chars	77,578.44

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.4440	0.7009	0.9225	top-500
Dense	`harrier_oss_v1_270m`	0.3894	0.6581	0.9070	top-500
Reranking hybrid	`reranking_hybrid`	0.4744	0.8376	0.9767	top-100

Training and Leakage Metadata

Original train split: unknown
Evaluation split origin: BRIGHT Stack Overflow long-document evaluation split
Train/eval overlap audit: not_audited
Leakage note: exclude NanoBRIGHT StackoverflowLong queries and full cited source pages
Multi-positive training: multi_positive_objective
Useful training data: document-level API documentation retrieval, Stack Overflow questions with cited links, long technical-reference QA
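Because the train/eval overlap has not been audited, a simple exclusion filter is one way to act on the leakage note. This is a minimal sketch, assuming training data is a list of (query_text, doc_id) pairs and that evaluation queries and document IDs are available; all names and sample values are hypothetical, and exact-match filtering will not catch paraphrased duplicates.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())


def drop_leaky_pairs(train_pairs, eval_queries, eval_doc_ids):
    """Remove training pairs whose query or document also appears in evaluation."""
    blocked_queries = {normalize(query) for query in eval_queries}
    blocked_docs = set(eval_doc_ids)
    return [
        (query, doc_id)
        for query, doc_id in train_pairs
        if normalize(query) not in blocked_queries and doc_id not in blocked_docs
    ]


train_pairs = [
    ("How do I sort a table by IP?", "doc_a"),
    ("how do i  sort a TABLE by ip?", "doc_b"),
    ("Why does my WinSCP script fail?", "eval_doc_1"),
    ("How to parse JSON in Go?", "doc_c"),
]
eval_queries = ["How do I sort a table by IP?"]
eval_doc_ids = {"eval_doc_1"}

clean_pairs = drop_leaky_pairs(train_pairs, eval_queries, eval_doc_ids)
print(clean_pairs)
```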