NanoBRIGHT / NanoBrightEconomicsLong
Overview
NanoBrightEconomicsLong is the long-document NanoBRIGHT slice for Economics StackExchange retrieval. It uses the same style of long economics queries as the compact passage task, but candidate documents are full cited source pages or long documents. The task measures whether a retriever can find the source document that contains the relevant economic model, empirical evidence, institutional explanation, or policy argument when the answer-bearing material may be buried inside a much larger page.
Details
What the Original Data Measures
BRIGHT's long-document variants test source-level retrieval rather than passage-level retrieval. For Economics, a query may ask about GDP accounting, taxes, central-bank financing, asset-market mechanics, welfare tradeoffs, or a derivation in a macroeconomic model. The relevant document can be a long report, paper, reference page, encyclopedia article, or finance education page.
The task therefore measures two abilities at once: matching the economic concept behind a detailed question and tolerating long-document noise. A relevant page may contain the needed section, but it can also include navigation material, definitions, tables, citations, examples, or adjacent economic topics that are not directly useful.
Observed Data Profile
The task contains 103 queries, 515 documents, and 109 relevance judgments. Unlike the compact Economics slice, the long-document version is mostly single-positive: it has 1.06 positives per query on average, a minimum of 1, a median of 1.0, a maximum of 3, and only 5 multi-positive queries, or 4.85% of the set.
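The positives-per-query figures above can be reproduced from a qrels mapping. The following is a minimal sketch, assuming qrels are available as a dictionary from query ID to a set of positive document IDs; the function name `positive_stats` and the toy IDs are hypothetical. The toy distribution is the one implied by the reported totals: 109 judgments over 103 queries leaves 6 extra positives spread over 5 multi-positive queries with a maximum of 3, so four queries have 2 positives and one has 3.

```python
from statistics import mean, median

def positive_stats(qrels):
    """Summarize positives per query from {query_id: set_of_positive_doc_ids}."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positives": sum(counts),
        "avg": round(mean(counts), 2),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_pct": round(100 * multi / len(counts), 2),
    }

# Toy qrels with hypothetical IDs: 98 queries with 1 positive, 4 with 2, 1 with 3.
counts = [1] * 98 + [2] * 4 + [3]
toy_qrels = {f"q{i}": {f"d{i}_{j}" for j in range(n)} for i, n in enumerate(counts)}
stats = positive_stats(toy_qrels)
print(stats)
```

Running the same function on the real qrels would be a quick consistency check against the Dataset Information table.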
Queries average 739.57 characters, while documents average 38,615.97 characters. The corpus is much smaller than the passage version, but each candidate is far longer. This changes the retrieval problem from selecting among many short supporting passages to identifying which long source contains the right piece of economic evidence.
BM25 Evaluation Profile
BM25 reaches nDCG@10 of 0.2658, hit@10 of 0.4369, and recall@100 of 0.7248 using the top-500 BM25 candidate subset. The recall is reasonably high because long economic documents contain many terms that can overlap with detailed queries, including formulas, policy vocabulary, institutional names, and market terminology.
The top-rank performance is much weaker. Long pages can mention the right words in unrelated sections, or repeat broad terms such as market, tax, GDP, capital, or policy without answering the query. BM25 can place the positive somewhere in the candidate pool but often fails to rank it high enough for direct use.
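The gap between candidate-pool recall and top-rank quality is easier to see with the metrics written out. The sketch below assumes binary relevance and the common log2-discounted form of nDCG; the exact evaluation code behind the reported numbers is not given here, so treat these definitions as an illustration rather than the reference implementation.

```python
import math

def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0

def recall_at_k(ranked, positives, k=100):
    """Fraction of the query's positives found in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives)

def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG with a log2 rank discount."""
    dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in positives)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(positives), k)))
    return dcg / ideal if ideal else 0.0

# Hypothetical ranking of 100 documents with the single positive at rank 40.
ranked = [f"d{i}" for i in range(1, 101)]
positives = {"d40"}
buried = (hit_at_k(ranked, positives), recall_at_k(ranked, positives), ndcg_at_k(ranked, positives))
# The same positive at rank 3 instead.
near_top = ndcg_at_k(ranked, {"d3"})
```

In the first case the query counts fully toward recall@100 but contributes nothing to hit@10 or nDCG@10, which is the pattern the BM25 numbers show in aggregate.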
Dense Evaluation Profile
The dense harrier-oss-270m run reaches nDCG@10 of 0.4266, hit@10 of 0.6602, and recall@100 of 0.9083. Dense retrieval is the strongest profile for all three headline metrics in this long-document task. It substantially improves over BM25 and slightly exceeds reranking_hybrid in recall@100.
This suggests that the long Economics task is dominated by semantic support matching. The model must connect a detailed economic puzzle to a source page whose overall content and central concepts align with the requested explanation. Embedding similarity is better than term frequency at recognizing the relevant economics source despite long-document noise.
Reranking Hybrid Evaluation Profile
The reranking_hybrid candidate set reaches nDCG@10 of 0.3764, hit@10 of 0.5728, and recall@100 of 0.8991. It uses a top-100 candidate range with an optional rank-101 safeguard; this task has 10 safeguard rows, candidate counts from 100 to 101, and a mean of 100.10 candidates.
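The candidate counts above are consistent with a pool that holds the top 100 fused results plus at most one extra document. The source does not define the safeguard precisely, so the sketch below assumes one plausible reading: when no positive survives in the top 100, a positive is appended at rank 101. The function name `build_pool` and the tie-break are hypothetical.

```python
def build_pool(ranked, positives, top_k=100):
    """Top-k pool; append one positive at rank top_k + 1 if none made the cut (assumed behaviour)."""
    pool = list(ranked[:top_k])
    if positives and not positives.intersection(pool):
        pool.append(min(positives))  # hypothetical deterministic tie-break
    return pool

ranked = [f"d{i}" for i in range(500)]
covered = build_pool(ranked, {"d7"})    # positive already inside the top 100
rescued = build_pool(ranked, {"d250"})  # positive outside the top 100

# 93 queries with 100 candidates and 10 with 101 reproduce the reported mean.
mean_size = (93 * len(covered) + 10 * len(rescued)) / 103
print(len(covered), len(rescued), round(mean_size, 2))
```

Under this reading an appended positive sits at rank 101 and therefore does not count toward recall@100, which would explain how safeguard rows and a recall@100 below 1.0 coexist.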
Hybrid retrieval remains strong as a compact candidate pool, but dense retrieval is better in the reported metrics. The useful interpretation is that sparse signals add coverage for some formulaic or named-concept queries, while the fused ordering can dilute the strongest semantic matches from dense retrieval. For reranking experiments, reranking_hybrid is still valuable because it keeps recall high in only about 100 candidates.
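How a fused ordering can dilute a strong dense match can be shown with a toy example. The fusion method behind reranking_hybrid is not stated here, so the sketch swaps in reciprocal rank fusion (RRF) with the conventional constant k=60 purely as an assumption; the document IDs are hypothetical.

```python
def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over all input rankings."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# "A" is the positive: dense ranks it first, BM25 ranks it last.
dense_ranking = ["A", "B", "C", "D"]
bm25_ranking = ["C", "D", "B", "A"]
fused = rrf([dense_ranking, bm25_ranking])
print(fused)
```

Here the document that dense retrieval placed first drops to second after fusion because a lexically strong distractor gains from the sparse list. The positive stays in the pool, so recall is preserved while top-rank quality falls.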
Metric Interpretation for Model Researchers
This task is a clear example where dense retrieval is stronger than BM25 and stronger than the hybrid fused order at the top of the ranking. BM25's recall@100 of 0.7248 against its nDCG@10 of 0.2658 indicates that exact terms can often locate relevant pages but cannot rank them reliably. Dense retrieval provides the best top-rank signal.
For model researchers, the important difficulty is not corpus size but document granularity. With only 515 documents, the challenge might look small, yet each document is long enough to contain many distracting economic terms. A system that represents only the global topic may retrieve plausible pages; a system that preserves the relevant economic mechanism will rank better.
Query and Relevance Type Tendencies
Queries include long-form questions about national accounting, RBC model transformations, deficit financing, equity versus efficiency, order-book matching, and requests for credible economic sources. Positive documents include full reference pages, long policy or research documents, encyclopedia-like pages, and finance explanations.
Relevance is usually tied to a specific section or argument inside the document. A long source may be relevant because one paragraph defines a model, one table supports an empirical point, or one section explains an institutional process. The rest of the document may be only loosely related.
Representative Failure Modes
Common failures include ranking a long page because it repeats the query's economic vocabulary while omitting the needed explanation, confusing a related policy topic with the specific claim, missing a full paper whose abstract uses different language from the query, and losing evidence because the positive signal is a small part of a long page.
BM25 is especially exposed to repeated terms and boilerplate. Dense retrieval can still over-rank documents that share the broad topic but not the exact mechanism. Hybrid retrieval improves candidate robustness, but final quality depends on a reranker or document model that can inspect the relevant section.
Training Data That May Help
Useful training data includes long economics reports aligned to questions, document-level paper recommendation data, cited-source retrieval from economics forums, and passage-to-full-document distillation where a model learns to map a short evidence span back to its source page.
Synthetic data should generate long economics documents with abstracts, definitions, examples, tables, and policy context, then create detailed questions answerable by one section. Hard negatives should be long documents from the same economic topic with the wrong model, country, time period, or empirical claim.
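The hard-negative rule described above can be sketched as a filter over document metadata. Everything in this block is hypothetical: the `Doc` fields, the `min_chars` threshold, and the toy corpus are illustrative assumptions, not part of the dataset.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Doc:
    doc_id: str
    topic: str
    model: str
    country: str
    period: str
    n_chars: int

def hard_negatives(positive, corpus, min_chars=10_000, limit=4):
    """Long documents on the same topic that differ in model, country, or period."""
    key = (positive.model, positive.country, positive.period)
    picked = []
    for doc in corpus:
        if doc.doc_id == positive.doc_id or doc.topic != positive.topic:
            continue  # skip the positive itself and off-topic documents
        if doc.n_chars < min_chars:
            continue  # keep only long documents
        if (doc.model, doc.country, doc.period) != key:
            picked.append(doc)
    return picked[:limit]

positive = Doc("p", "monetary policy", "quantity theory", "IE", "2004", 30_000)
corpus = [
    positive,
    Doc("n1", "monetary policy", "quantity theory", "US", "2004", 28_000),  # wrong country
    Doc("n2", "monetary policy", "quantity theory", "IE", "2004", 31_000),  # same attributes
    Doc("n3", "labor markets", "search model", "US", "1990", 40_000),       # wrong topic
    Doc("n4", "monetary policy", "fiscal theory", "US", "2019", 2_000),     # too short
]
negatives = [d.doc_id for d in hard_negatives(positive, corpus)]
print(negatives)
```

Documents that match the positive on every attribute are skipped on purpose, since they may be unlabeled positives rather than true negatives.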
Model Improvement Notes
For this task, dense retrieval is the strongest observed first-stage method, but practical systems should still preserve sparse signals for named models, formulas, and institutional phrases. Long-document models may benefit from hierarchical pooling, passage aggregation, late interaction, or source-page reranking over extracted sections.
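Passage aggregation, one of the options named above, can be sketched as scoring overlapping chunks and keeping the maximum. The `density` scorer below is a deliberately crude stand-in for a real embedding or cross-encoder score, and the chunk size, stride, and toy document are assumptions chosen for illustration.

```python
def chunk_words(text, size=50, stride=25):
    """Split text into overlapping word windows that together cover the whole text."""
    words = text.split()
    if len(words) <= size:
        return [" ".join(words)]
    return [" ".join(words[i:i + size]) for i in range(0, len(words) - size + stride, stride)]

def density(query, text):
    """Toy relevance score: share of words in the text that are query terms."""
    terms = set(query.lower().split())
    words = text.lower().split()
    return sum(w in terms for w in words) / max(len(words), 1)

def max_chunk_score(query, text, size=50, stride=25):
    """Document score as the best chunk score (max pooling over passages)."""
    return max(density(query, chunk) for chunk in chunk_words(text, size, stride))

# A long page where the useful section is a small part of the content.
filler = " ".join(["noise"] * 200)
section = "the capital accumulation equation in the rbc model"
long_doc = f"{filler} {section} {filler}"
query = "rbc capital equation"

whole = density(query, long_doc)
pooled = max_chunk_score(query, long_doc)
print(whole, pooled)
```

With this scorer the whole-document score is diluted by the surrounding filler, while the pooled score reflects the one window that contains the relevant section. Other aggregations, such as a mean over the top few chunks, trade some of that sensitivity for robustness to a single spurious chunk.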
Because most queries have only one positive, small rank changes matter. Training should emphasize exact source support rather than broad topical relevance. The reranking_hybrid pool is a useful high-recall diagnostic, but the top ranking should be judged against the strong dense baseline.
Example Data
| Query | Positive document |
| --- | --- |
| Would a GDP measure be improved by excluding foreign interest paid? The income method of calculating GDP is as follows: GDP = wages + profits + rents + interest + depreciation + taxes + NFFI. If an economy has high external debt, for instance, because it used external financing to buy machinery and equipment, then foreign interest payments will be high. In that case, wouldn't GDP (per capita) be a poor measure of economic well-being since a significant portion of the generated income is leaving... [500 / 684 chars] | OECD Better Life Index  * Index * Responses * Countries __ * Australia * Austria * Belgium * Brazil * Chile * Denmark * Germany * Estonia * Finland * France * Greece * Ireland * Iceland * Israel * Italy * Japan * Canada * Korea * Luxembourg * Mexico * Netherlands * New Zealand * Norway * Poland * [ Portugal ](/countries/portugal/... [1,000 / 13,187 chars] |
| Derivative to ln(K(t)) in the RBC model In the calculation of the equation of motion for capital in the RBC model, I came across this equation: d ln K_(t+1) / d ln K_t = (d K_(t+1) / d K_t) * (K_t / K_(t+1)) Can someone explain what are the mathematical steps in between? I don't see how exactly the derivative to ln(K(t)) gets us an almost elasticity-like equation. Would be thankful for any leads. :) [406 chars] | Jump to content Main menu Main menu move to sidebar hide Navigation * Main page * Contents * Current events * Random article * About Wikipedia * Contact us * Donate Contribute * Help * Learn to edit * [ Community portal ](/wiki/Wikipedia:Community_portal "The hub for editors"... [1,000 / 34,111 chars] |
| What is the purpose of taxes if central banks can fund deficit spending? Somewhat straight forward. If the federal reserve can print money to buy treasuries to fund deficit spending, what is the purpose of taxes? Sure, taxes reduce the amount of deficit that needs to be picked up by the Fed, but if, as ive seen argued, money “printing” doesn’t necessarily lead to inflation whats the point of levying taxes? Why doesn’t the fed just procure all of the money itself if it could theoretically do so w... [500 / 524 chars] | The Economic and Social Review, Vol. 35, No. 3, Winter, 2004, pp. 251-266 Inflation and Money Growth: Evidence from a Multi-Country Data-Set JOHN C. FRAIN* Central Bank and Financial Services Regulatory Authority of Ireland and Trinity College Dublin Abstract: Using a multi-country data set strong correlation are found between average growth rates of monetary aggregates and average inflation. The correlation remains strong when countries with higher average inflation rates are removed from the sample. These results confirm the strong correlation found in the traditional literature but contradict those in De Grauwe and Polan (2001) who, in a recent analysis, find that the strong link vanishes when higher inflation countries are excluded. Further analysis confirms the unit response and bears out the value of monetary aggregates as an input to the making of monetary policy. I INTRODUCTION Monetary theory predicts a strong long-run correlation between money growth and inflation. One strand... [1,000 / 29,854 chars] |
Source Reference Table
| Item | Reference |
| --- | --- |
| Original benchmark paper | BRIGHT |
| Project page | BRIGHT project page |
| Source dataset | xlangai/BRIGHT |
| NanoBRIGHT dataset | hakari-bench/NanoBRIGHT |
Representative query and positive source snippets:
| Query | Positive document snippet |
| --- | --- |
| Would a GDP measure be improved by excluding foreign interest paid? | A long OECD-style page provides country-level well-being and economic context. |
| Why does a derivative expression appear in an RBC capital equation? | A long reference page includes production-function and substitution-elasticity material. |
| What is the purpose of taxes if central banks can fund deficit spending? | A long economics paper discusses money growth, inflation, and policy relationships. |
| Is there always a tradeoff between efficiency and equity? | A long economics association page or article discusses welfare-related distortions. |
| How are stock prices determined when orders match? | A finance education page explains matching orders and how exchanges pair buy and sell requests. |
Dataset Information
| Field | Value |
| --- | --- |
| Nano set | NanoBRIGHT |
| Backing dataset | NanoBRIGHT |
| Task / split | NanoBrightEconomicsLong |
| Hugging Face dataset | hakari-bench/NanoBRIGHT |
| Language | en |
| Category | natural_language |
| Queries | 103 |
| Documents | 515 |
| Positive qrels | 109 |
| Positives / query avg | 1.06 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 3 |
| Multi-positive queries | 5 (4.85%) |
| Query length avg chars | 739.57 |
| Document length avg chars | 38,615.97 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| --- | --- | --- | --- | --- | --- |
| BM25 | bm25 | 0.2658 | 0.4369 | 0.7248 | top-500 |
| Dense | harrier_oss_v1_270m | 0.4266 | 0.6602 | 0.9083 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.3764 | 0.5728 | 0.8991 | top-100 |
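The differences between profiles can be read off the table with a few lines of arithmetic. This is a small sketch using only the values reported above; the dictionary keys and the `gaps` helper are hypothetical names.

```python
# Headline metrics copied from the Candidate Subsets table.
METRICS = {
    "bm25": {"ndcg@10": 0.2658, "hit@10": 0.4369, "recall@100": 0.7248},
    "dense": {"ndcg@10": 0.4266, "hit@10": 0.6602, "recall@100": 0.9083},
    "reranking_hybrid": {"ndcg@10": 0.3764, "hit@10": 0.5728, "recall@100": 0.8991},
}

def gaps(reference, other):
    """Absolute metric differences (reference minus other), rounded to 4 places."""
    return {m: round(METRICS[reference][m] - METRICS[other][m], 4) for m in METRICS[reference]}

dense_vs_bm25 = gaps("dense", "bm25")
dense_vs_hybrid = gaps("dense", "reranking_hybrid")
print(dense_vs_bm25)
print(dense_vs_hybrid)
```

Dense leads BM25 by 0.1608 nDCG@10, 0.2233 hit@10, and 0.1835 recall@100. Against reranking_hybrid the lead is 0.0502, 0.0874, and 0.0092, so the two are close on recall and differ mainly at the top of the ranking.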
Training and Leakage Metadata
- Original train split: unknown
- Evaluation split origin: BRIGHT Economics long-document evaluation split
- Train/eval overlap audit: not_audited
- Leakage note: exclude NanoBRIGHT EconomicsLong queries and full cited source pages
- Multi-positive training: single_positive_question_document_focus
- Useful training data: long economics reports aligned to questions, document-level paper recommendation data, cited-source retrieval from economics forums
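The leakage note can be applied as a simple filter before training. The sketch below assumes training data is a list of (query text, document ID) pairs and that the evaluation queries and document IDs are available as collections; these shapes and the function names are hypothetical, and exact matching after whitespace and case normalization will not catch paraphrased duplicates.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivially different copies match."""
    return " ".join(text.lower().split())

def drop_leaked(train_pairs, eval_queries, eval_doc_ids):
    """Remove training pairs that reuse an evaluation query or an evaluation document."""
    blocked_queries = {normalize(q) for q in eval_queries}
    blocked_docs = set(eval_doc_ids)
    return [
        (query, doc_id)
        for query, doc_id in train_pairs
        if normalize(query) not in blocked_queries and doc_id not in blocked_docs
    ]

# Toy data with hypothetical IDs.
eval_queries = ["What is the purpose of taxes if central banks can fund deficit spending?"]
eval_doc_ids = {"econ_long_0001"}
train_pairs = [
    ("what is the purpose of taxes  if central banks can fund deficit spending?", "web_17"),
    ("How do exchanges match buy and sell orders?", "econ_long_0001"),
    ("Why do tariffs raise consumer prices?", "web_42"),
]
clean = drop_leaked(train_pairs, eval_queries, eval_doc_ids)
print(clean)
```

Since the overlap audit is marked not_audited, a filter like this is only a first pass; near-duplicate detection over the full cited source pages would be a stricter check.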