NanoRTEB / NanoFinanceBench
Overview
NanoFinanceBench is an English financial filing evidence retrieval task from NanoRTEB. The query is an analyst-style finance question, and the relevant document is the filing excerpt, table, or statement section needed to answer it. Each query has one positive document among 145 candidates. Dense retrieval has the strongest top-rank profile (nDCG@10 of 0.7694), reranking_hybrid is the only profile that reaches full recall@100, and BM25 is the weakest (nDCG@10 of 0.4267) because lexical overlap alone is not enough to identify the right filing evidence.
Details
What the Original Data Measures
FinanceBench was introduced as a benchmark for financial question answering grounded in public company filings. It emphasizes realistic analyst questions that require evidence from financial statements, footnotes, and management discussion.
RTEB repurposes the evidence-finding portion as a retrieval task. The system receives a finance question and must retrieve the exact filing excerpt needed before any answer generation or numerical calculation takes place.
Observed Data Profile
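The Nano split contains 150 queries, 145 documents, and 150 positive qrel rows. Every query has exactly one positive. Because there are more queries than documents, some documents serve as the positive for more than one query. Queries average 161.09 characters, while documents average 1,676.96 characters.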
Example questions ask for adjusted non-GAAP EBITDA, registered debt securities, days payable outstanding, gross margin profile, and total revenue growth rate. Positive documents include cash-flow statements, balance sheets, segment tables, security listings, and explanatory filing excerpts.
BM25 Evaluation Profile
The BM25 candidate subset covers the full 145-document pool, since its top-500 cutoff exceeds the corpus size. It reaches nDCG@10 of 0.4267, hit@10 of 0.6533, and recall@100 of 0.9467. BM25 can match company names, years, table labels, and financial terms.
Its weakness is that many filing sections share similar vocabulary. A question may require a particular statement line, footnote, or table relation, and term overlap alone can retrieve the wrong section from the same company or period.
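The shared-vocabulary problem can be illustrated with a minimal sketch. The three-document corpus, the query, and the `bm25_scores` helper below are hypothetical and are not drawn from the benchmark; the scoring uses a standard Okapi BM25 form with a Lucene-style IDF and whitespace tokenization, which may differ from the configuration behind the reported numbers.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic Okapi BM25."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized.values()) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for toks in tokenized.values() for term in set(toks))
    scores = {}
    for doc_id, toks in tokenized.items():
        tf = Counter(toks)
        length_norm = k1 * (1 - b + b * len(toks) / avg_len)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores[doc_id] = score
    return scores

# Hypothetical excerpts from one company and one fiscal year.
DOCS = {
    "income_statement": "corning 2020 cost of sales gross margin",
    "mdna": "corning 2020 gross margin fell as gross margin pressure rose",
    "balance_sheet": "corning 2020 accounts payable inventories",
}
QUERY = "corning 2020 gross margin"

scores = bm25_scores(QUERY, DOCS)
for doc_id, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{doc_id}: {score:.3f}")
```

In this toy corpus every section matches the company and year tokens, so those terms barely discriminate, and the narrative section that repeats the metric name outscores the statement table that would actually hold the figures.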
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m also covers the full 145-document pool and reaches nDCG@10 of 0.7694, hit@10 of 0.9533, and recall@100 of 0.9933. Dense retrieval is the best top-rank profile by a large margin: its nDCG@10 exceeds the hybrid profile by about 0.11 and BM25 by about 0.34.
This indicates that semantic matching is highly useful for analyst-style questions. Dense retrieval can map concepts such as margin profile, debt securities, and payable outstanding to the filing excerpt that supports the calculation or assessment.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates and does not need the rank-101 safeguard. It reaches nDCG@10 of 0.6613, hit@10 of 0.9133, and recall@100 of 1.0000. Hybrid retrieval has the best recall@100 but lower nDCG@10 than dense retrieval.
The pattern suggests that sparse matching helps complete the candidate pool, while dense retrieval orders the correct filing evidence better. For reranking, hybrid is attractive because it exposes all positives by rank 100.
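The source does not state how reranking_hybrid combines sparse and dense candidates. Purely as an illustration of how a fused list can cover positives that either retriever finds alone, the sketch below uses reciprocal rank fusion (RRF), a common fusion method. The function name, the constant `k=60`, and the document IDs are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse several ranked lists of doc IDs with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in.
    """
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(fused, key=lambda d: (-fused[d], d))[:top_n]

# Hypothetical ranked lists: "d2" appears only in the sparse list,
# "d4" only in the dense list.
bm25_ranking = ["d3", "d1", "d2"]
dense_ranking = ["d1", "d4", "d3"]

fused_ranking = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused_ranking)
```

The fused list contains the union of both candidate sets, which is the property that matters for reranking coverage. Its ordering can still differ from the dense ordering, which is in line with the lower hybrid nDCG@10 reported above.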
Metric Interpretation for Model Researchers
With one positive per query, nDCG@10 measures how early the exact evidence excerpt is placed, hit@10 measures whether it appears among the first ten candidates, and recall@100 measures whether it is available to a reranker that works over the top 100 candidates.
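With exactly one positive per query, the ideal DCG is 1, so nDCG@10 reduces to 1 / log2(1 + rank) when the positive is ranked in the top ten and 0 otherwise. The sketch below computes all three metrics from the rank of the positive. The function name and the toy ranks are hypothetical, and the benchmark's own evaluation code may differ in details.

```python
import math

def single_positive_metrics(ranks, k_ndcg=10, k_hit=10, k_recall=100):
    """Compute mean nDCG@k, hit@k, and recall@k for single-positive queries.

    ranks: 1-based rank of the positive for each query, or None if it
    was not retrieved at all.
    """
    n = len(ranks)
    found = [r for r in ranks if r is not None]
    ndcg = sum(1.0 / math.log2(r + 1) for r in found if r <= k_ndcg) / n
    hit = sum(1 for r in found if r <= k_hit) / n
    recall = sum(1 for r in found if r <= k_recall) / n
    return {"ndcg": ndcg, "hit": hit, "recall": recall}

# Toy example: positives at ranks 1, 3, and 12, plus one query that missed.
metrics = single_positive_metrics([1, 3, 12, None])
print(metrics)
```

Here the rank-1 query contributes 1.0 and the rank-3 query contributes 0.5 to nDCG@10. The rank-12 query counts only toward recall@100, and the missed query counts toward nothing.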
For NanoFinanceBench, dense nDCG@10 is the key first-stage signal. Recall@100 is also useful because finance rerankers may need to inspect tables and numeric context to select the final evidence.
Query and Relevance Type Tendencies
Queries are analyst-style financial questions, often asking for a metric, comparison, rate, or interpretation from a company filing. Relevant documents are excerpts from annual or quarterly reports, often containing tables and numeric statements.
Relevance is defined by evidence sufficiency, not by topical similarity. A document from the same company can be wrong if it does not contain the table or section needed for the requested calculation.
Representative Failure Modes
Common failures include retrieving the right company but wrong statement section, confusing fiscal years, matching metric names without the needed calculation inputs, and overranking broad management discussion when a table is required. BM25 is vulnerable to repeated finance terminology; dense retrieval can still miss exact numeric rows.
Training Data That May Help
Useful training data includes financial QA evidence retrieval, SEC filing search, annual-report table retrieval, analyst-question datasets, and hard negatives from the same company and year but different statement sections. The NanoFinanceBench evaluation questions, filing excerpts, and qrels should be excluded from training to avoid leakage.
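Selecting same-company, same-year hard negatives could be sketched as follows. The `Excerpt` fields and the helper name are hypothetical: they assume training excerpts carry company, fiscal year, and section metadata, which the source does not specify. The corpus passed in should be training data only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Excerpt:
    doc_id: str
    company: str
    fiscal_year: int
    section: str  # e.g. "income_statement", "balance_sheet", "mdna"

def same_filing_hard_negatives(positive, corpus):
    """Return excerpts from the same company and year but another section."""
    return [
        doc for doc in corpus
        if doc.company == positive.company
        and doc.fiscal_year == positive.fiscal_year
        and doc.section != positive.section
    ]

# Hypothetical training corpus.
corpus = [
    Excerpt("t1", "ExampleCo", 2020, "income_statement"),
    Excerpt("t2", "ExampleCo", 2020, "balance_sheet"),
    Excerpt("t3", "ExampleCo", 2020, "mdna"),
    Excerpt("t4", "ExampleCo", 2019, "balance_sheet"),
    Excerpt("t5", "OtherCo", 2020, "balance_sheet"),
]

negatives = same_filing_hard_negatives(corpus[0], corpus)
print([doc.doc_id for doc in negatives])
```

Excerpts from a different fiscal year or a different company are left out here because they are easier to separate. Depending on the failure mode being targeted, a wrong-year negative from the same company may also be worth including.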
Model Improvement Notes
Models should represent company, period, metric, table role, and calculation intent. Hard negatives should use nearby filing sections with overlapping vocabulary. Dense retrieval is the strongest first-stage profile, while hybrid retrieval is best when maximizing reranking coverage.
Example Data
| Query | Positive document |
| What Was AMCOR's Adjusted Non GAAP EBITDA for FY 2023 [53 chars] | Twelve Months Ended June 30, 2022 Twelve Months Ended June 30, 2023 ($ million) EBITDA EBIT Net Income EPS (Diluted US cents)(1) EBITDA EBIT Net Income EPS (Diluted US cents)(1) Net income attributable to Amcor 805 805 805 52.9 1,048 1,048 1,048 70.5 Net income attributable to non-controlling interests 10 10 10 10 Tax expense 300 300 193 193 Interest expense, net 135 135 259 259 Depreciation and amortization 579 569 EBITDA, EBIT, Net income and EPS 1,829 1,250 805 52.9 2,080 1,510 1,048 70.5 2019 Bemis Integration Plan 37 37 37 2.5 — — — — Net loss on disposals(2) 10 10 10 0.7 — — — — Impact of hyperinflation 16 16 16 1.0 24 24 24 1.9 Property and other losses, net(3) 13 13 13 0.8 2 2 2 0.1 Russia-Ukraine conflict impacts(4) 200 200 200 13.2 (90) (90) (90) (6.0) Pension settlements 8 8 8 0.5 5 5 5 0.3 Other 4 4 4 0.3 (3) (3) (3) (0.3) Amortization of acquired intangibles (5) 163 163 10.7 160 160 10.8 Tax effect of above items (32) (2.1) (57) (4.0) Adjusted EBITDA, EBIT, Net income and... [1,000 / 1,049 chars] |
| Which debt securities are registered to trade on a national securities exchange under 3M's name as of Q2 of 2023? [113 chars] | Title of each class Trading Symbol(s) Name of each exchange on which registered Common Stock, Par Value $.01 Per Share MMM New York Stock Exchange MMM Chicago Stock Exchange, Inc. 1.500% Notes due 2026 MMM26 New York Stock Exchange 1.750% Notes due 2030 MMM30 New York Stock Exchange 1.500% Notes due 2031 MMM31 New York Stock Exchange [335 chars] |
| Based on the information provided primarily in the balance sheet and the statement of income, what is FY2020 days payable outstanding (DPO) for Corning? DPO is defined as: 365 * (average accounts payable between FY2019 and FY2020) / (FY2020 COGS + change in inventory between FY2019 and FY2020). Round your answer to two decimal places. [336 chars] | Index Consolidated Statements of Income Corning Incorporated and Subsidiary Companies Year ended December 31, (In millions, except per share amounts) 2020 2019 2018 Net sales $ 11,303 $ 11,503 $ 11,290 Cost of sales 7,772 7,468 6,829 Gross margin 3,531 4,035 4,461 Operating expenses: Selling, general and administrative expenses 1,747 1,585 1,799 Research, development and engineering expenses 1,154 1,031 993 Amortization of purchased intangibles 121 113 94 Operating income 509 1,306 1,575 Equity in (losses) earnings of affiliated companies (Note 3) (25) 17 390 Interest income 15 21 38 Interest expense (276) (221) (191) Translated earnings contract (loss) gain, net (Note 15) (38) 248 (93) Transaction-related gain, net (Note 4) 498 Other expense, net (60) (155) (216) Income before income taxes 623 1,216 1,503 Provision for income taxes (Note 8) (111) (256) (437) Net income attributable to Corning Incorporated $ 512 $ 960 $ 1,066 Earnings per common share attributable to Corning Incorporat... [1,000 / 4,015 chars] |
Source Reference Table
| Title | Year | Type | URL |
| FinanceBench: A New Benchmark for Financial Question Answering | 2023 | task paper | https://arxiv.org/abs/2311.11944 |
| virattt/financebench | | dataset card | https://huggingface.co/datasets/virattt/financebench |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark article | https://huggingface.co/blog/rteb |
Dataset Information
| Field | Value |
| Nano set | NanoRTEB |
| Backing dataset | NanoRTEB |
| Task / split | NanoFinanceBench |
| Hugging Face dataset | hakari-bench/NanoRTEB |
| Language | en |
| Category | natural_language |
| Queries | 150 |
| Documents | 145 |
| Positive qrels | 150 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 161.09 |
| Document length avg chars | 1,676.96 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.4267 | 0.6533 | 0.9467 | top-500 |
| Dense | harrier_oss_v1_270m | 0.7694 | 0.9533 | 0.9933 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.6613 | 0.9133 | 1.0000 | top-100 |
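Because the split has 150 queries, the rates in the table can be converted into approximate counts of missed queries. The sketch below does that arithmetic, assuming the reported values are rounded per-query averages; the variable and function names are hypothetical.

```python
N_QUERIES = 150

# Rates copied from the Candidate Subsets table.
PROFILES = {
    "bm25": {"hit@10": 0.6533, "recall@100": 0.9467},
    "harrier_oss_v1_270m": {"hit@10": 0.9533, "recall@100": 0.9933},
    "reranking_hybrid": {"hit@10": 0.9133, "recall@100": 1.0000},
}

def missed_queries(rate, n_queries=N_QUERIES):
    """Convert a hit or recall rate into a count of queries that missed."""
    return n_queries - round(rate * n_queries)

misses = {
    name: {metric: missed_queries(rate) for metric, rate in rates.items()}
    for name, rates in PROFILES.items()
}
for name, counts in misses.items():
    print(name, counts)
```

Read this way, BM25 leaves the positive outside the top 100 for 8 queries, the dense profile for 1, and the hybrid profile for none. Within the top 10, the misses are 52, 7, and 13 queries respectively.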