NanoMTEB-v2 / fi_qa2018
Overview
NanoMTEB-v2 / fi_qa2018 is an English financial question-answer retrieval task. Queries are investor, personal-finance, or market-related questions, and relevant documents are answers or answer-like passages from a financial QA collection. FiQA was created for financial opinion mining and question answering, and MTEB includes it as a retrieval benchmark. This Nano split contains 200 queries over 10,000 documents and is strongly multi-positive, with many questions having several acceptable answers. It is useful for studying retrieval in a domain where terminology matters, but where the best answer often depends on scenario interpretation, financial concepts, and advice-like reasoning rather than simple keyword overlap.
Details
What the Original Data Measures
FiQA measures question answering and opinion mining in the financial domain. The retrieval task asks whether a system can match a short finance question with passages that answer it or provide useful explanation. The text differs from encyclopedic retrieval because answers often contain personal advice, legal or tax caveats, investor reasoning, and informal forum language.
This makes the task a good test of domain-specific semantic retrieval. A model must understand financial products, tax concepts, market mechanics, and user intent.
Observed Data Profile
The Nano split contains 200 queries, 10,000 documents, and 534 positive qrel rows. Queries have 2.67 positives on average, with a median of 2 and a maximum of 12. There are 128 multi-positive queries, or 64.0% of the query set. Queries average 61.70 characters, while documents average 780.39 characters.
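The profile numbers above can be reproduced from a qrel listing with a few lines of Python. The following is a minimal sketch, assuming qrels are available as unique `(query_id, doc_id)` pairs; the function name `qrel_profile` and the toy IDs are hypothetical and do not come from the dataset.

```python
from collections import Counter
from statistics import mean, median


def qrel_profile(qrels):
    """Summarise positives per query from unique (query_id, doc_id) pairs."""
    per_query = Counter(query_id for query_id, _doc_id in qrels)
    counts = list(per_query.values())
    multi = sum(1 for c in counts if c > 1)  # queries with 2+ positives
    return {
        "queries": len(counts),
        "positive_rows": len(qrels),
        "avg_positives": mean(counts),
        "median_positives": median(counts),
        "max_positives": max(counts),
        "multi_positive_queries": multi,
        "multi_positive_rate": multi / len(counts),
    }


# Toy qrels with hypothetical IDs (not rows from the dataset).
toy_qrels = [
    ("q1", "d1"), ("q1", "d2"),
    ("q2", "d3"),
    ("q3", "d4"), ("q3", "d5"), ("q3", "d6"),
]
profile = qrel_profile(toy_qrels)
```

Applied to the real qrels, the same arithmetic gives 534 / 200 = 2.67 positives per query and 128 / 200 = 64.0% multi-positive queries.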
Examples include stock and ETF taxation, exchange-rate conversion, brokerage exchange fees, freelance income earned abroad, and inflation measurement. Documents are usually explanatory answers rather than formal article passages.
BM25 Evaluation Profile
The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.3799, hit@10 of 0.6600, and recall@100 of 0.7097. BM25 is moderately useful because finance questions and answers share domain terms such as tax, exchange rate, brokerages, inflation, ETF, trade, or income.
However, sparse matching is limited by the advice-like nature of the task. The answer may use a different framing from the question, discuss the relevant rule indirectly, or require connecting a scenario to a financial concept. BM25 can also over-rank passages that share a product term but answer a different decision problem.
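The vocabulary dependence described above can be illustrated with a small BM25 scorer. This is a sketch only: the card does not state which BM25 implementation, tokenizer, or parameters produced the reported numbers, so the whitespace tokenizer, `k1=1.2`, `b=0.75`, and the Lucene-style IDF below are assumptions, and the toy passages are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming.
    return text.lower().split()


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for t in tokenized for term in set(t))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


query = "tax on etf"
docs = [
    "etf etf fees are listed by the etf provider",                # shares the product term only
    "capital gains from selling shares are taxed when realized",  # relevant, different wording
    "a savings account pays interest monthly",                    # unrelated
]
scores = bm25_scores(query, docs)
```

In this toy case the fee passage is the only one with a non-zero score, because "taxed" does not match "tax" without stemming and the relevant passage never mentions "etf". It shows the product-term over-ranking pattern in miniature.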
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.5494, hit@10 of 0.8000, and recall@100 of 0.8258. Dense retrieval beats BM25 on every reported metric: +0.1695 nDCG@10, +0.1400 hit@10, and +0.1161 recall@100. This indicates that semantic matching is important for mapping short finance questions to longer explanatory answers.
The dense advantage is especially meaningful because questions often encode intent rather than exact answer wording: a user asks what exchange rate applies, whether state tax is owed, or how inflation is calculated. Dense models can connect those intents to explanatory passages even when surface terms differ.
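Dense first-stage retrieval typically reduces to ranking documents by vector similarity. The sketch below assumes cosine similarity over precomputed embeddings; the three-dimensional vectors are hand-made placeholders, not outputs of harrier_oss_v1_270m, and the card does not state which similarity function that model uses.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs, top_k=10):
    """Return the top_k (doc_id, score) pairs, highest similarity first."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder embeddings (hypothetical values for illustration only).
query_vec = [0.9, 0.1, 0.0]
doc_vecs = {
    "answers_the_intent": [0.8, 0.2, 0.1],
    "same_term_other_decision": [0.1, 0.9, 0.2],
    "unrelated": [0.0, 0.1, 0.9],
}
ranking = rank_by_cosine(query_vec, doc_vecs)
ranked_ids = [doc_id for doc_id, _score in ranking]
```

The ranking depends only on the direction of the vectors, so a passage can score highly without sharing any surface term with the query.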
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 17 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.5258, hit@10 of 0.8000, and recall@100 of 0.8390. The hybrid pool has the highest recall@100 (0.8390 versus 0.8258 for dense) and matches dense hit@10 at 0.8000, but dense retrieval remains strongest by nDCG@10 (0.5494 versus 0.5258).
This suggests that sparse retrieval adds complementary candidates through exact financial terms, while dense retrieval gives better rank ordering. A reranker can benefit from hybrid coverage if it can distinguish answers that address the user's actual financial decision from those that merely share vocabulary.
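One way to build such a pool is sketched below. The card does not describe how reranking_hybrid is actually constructed, so everything here is an assumption: reciprocal rank fusion (RRF) with `k=60` stands in for the unknown fusion method, and the "rank-101 safeguard" is read as appending one known positive after the top-100 cut when the pool would otherwise contain none.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists; each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def build_pool(bm25_ranking, dense_ranking, positives, pool_size=100):
    """Top pool_size fused candidates, plus one safeguard positive if none survived."""
    fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
    pool = fused[:pool_size]
    if not set(pool) & set(positives):
        # Assumed safeguard: place one positive at rank pool_size + 1.
        pool.append(sorted(positives)[0])
    return pool


# Toy rankings with hypothetical IDs and a pool size of 2 for readability.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d4", "d1"]
pool_with_safeguard = build_pool(bm25_ranking, dense_ranking, {"d9"}, pool_size=2)
pool_without_safeguard = build_pool(bm25_ranking, dense_ranking, {"d3"}, pool_size=2)
```

Under this reading, a safeguard positive guarantees that a reranker always has at least one relevant candidate to promote, while keeping it outside the top-100 window that recall@100 measures.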
Metric Interpretation for Model Researchers
The high multi-positive rate means recall@100 is important: a useful system should expose multiple acceptable financial answers for a downstream ranker. nDCG@10 still matters because users need the best answer early, and finance answers may differ in relevance, specificity, and caveat quality.
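For reference, the three reported metrics can be computed per query as follows. This is a sketch assuming binary relevance (a document is either a positive or not) and the common log2 discount for nDCG; the card does not state the exact evaluation code, and the reported values would be averages of these per-query scores.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in positives for doc_id in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of all positives that appear in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal


# Toy query with three positives, two of which are retrieved (hypothetical IDs).
ranked = ["d5", "d1", "d7", "d2"]
positives = {"d1", "d2", "d3"}
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

With several positives per query, hit@10 saturates as soon as one positive is found, whereas recall@100 and nDCG@10 keep rewarding additional positives, which is why they separate the systems more clearly on this split.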
Dense retrieval is the strongest first-stage baseline, while hybrid retrieval is the better candidate pool for reranking. BM25 alone leaves too much semantic matching on the table.
Query and Relevance Type Tendencies
Queries are short finance questions, often about taxes, brokerage mechanics, exchange rates, credit, investment products, or personal-finance decisions. Relevant documents are longer answer passages containing explanations, caveats, or practical advice.
The relevance relation is answer usefulness. A passage should answer the specific financial question, not merely mention the same product or market term.
Representative Failure Modes
Common failures include retrieving a passage about the same financial instrument but a different tax situation, matching a keyword such as ETF or exchange without answering the user's scenario, missing jurisdictional details, and confusing general market explanation with personal-finance advice. Dense models may retrieve plausible but non-specific advice; sparse models may over-rank exact term overlap.
Training Data That May Help
Useful training data includes financial QA pairs, investor-forum answers, personal-finance FAQ retrieval data, and hard negatives from the same topic or product type. Multi-positive training is recommended because many finance questions have several acceptable explanatory answers.
Model Improvement Notes
Models should learn financial intent, not only financial vocabulary. Hard negatives should share the same product, tax term, or market concept while answering a different decision. Rerankers should account for answer specificity, jurisdictional caveats, and whether the passage directly resolves the question.
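A simple way to obtain such hard negatives is to take highly ranked non-positives from a first-stage ranking. The sketch below assumes a BM25 ranking is used as a proxy for "shares the same vocabulary"; the function name and the `skip_top` parameter are hypothetical.

```python
def mine_hard_negatives(ranked_candidates, positives, num_negatives=4, skip_top=0):
    """Pick the highest-ranked candidates that are not labeled positive.

    skip_top drops the first few ranks, which are the most likely to hold
    unlabeled positives.
    """
    negatives = []
    for doc_id in ranked_candidates[skip_top:]:
        if doc_id in positives:
            continue
        negatives.append(doc_id)
        if len(negatives) == num_negatives:
            break
    return negatives


# Toy ranking with hypothetical IDs; "d2" is the only labeled positive.
ranked_candidates = ["d1", "d2", "d3", "d4", "d5"]
hard_negatives = mine_hard_negatives(ranked_candidates, {"d2"}, num_negatives=2)
```

Because many finance questions have several acceptable answers, some mined "negatives" may be unlabeled positives, so a filtering step or a nonzero `skip_top` may be worth considering.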
Example Data
| Query | Positive document |
| Tax on Stocks or ETF's [22 chars] | If you sell a stock, with no distributions, then your gain is taxable under §1001. But not all realized gains will be recognized as taxable. And some gains which are arguably not realized, will be recognized as taxable. The stock is usually a capital asset for investors, who will generate capital gains under §1(h), but dealers, traders, and hedgers will get different treatment. If you are an investor, and you held the stock for a year or more, then you can get the beneficial capital gain rates (e.g. 20% instead of 39.6%). If the asset was held short-term, less than a year, then your tax will generally be calculated at the higher ordinary income rates. There is also the problem of the net investment tax under §1411. I am eliding many exceptions, qualifications, and permutations of these rules. If you receive a §316 dividend from a stock, then that is §61 income. Qualified dividends are ordinary income but will generally be taxed at capital gains rates under §1(h)(11). Distributions in... [1,000 / 1,997 chars] |
| What exchange rate does El Al use when converting final payment amount to shekels? [82 chars] | The rate for "checks and transfers" is set by each bank multiple times during the day based on the market. It is as opposed to the rate for "cash/banknotes", also set by each bank, and the "representative rate" (שער היציג) set by the Bank of Israel. These rates can be found on the websites of most banks. Here is Bank Hapoalim and Bank Leumi. The question is which bank's rate will be used. It might be the bank that issued your card, El Al's bank, or the credit card company (ie Poalim for Isracard or Leumi for CAL). You will need to call El Al to verify, but since these are market rates, they shouldn't be too different. [633 chars] |
| How much do brokerages pay exchanges per trade? [47 chars] | There is no one answer to this question, but there are some generalities. Most exchanges make a distinction between the passive and the aggressive sides of a trade. The passive participant is the order that was resting on the market at the time of the trade. It is an order that based on its price was not executable at the time, and therefore goes into the order book. For example, I'm willing to sell 100 shares of a stock at $9.98 but nobody wants to buy that right now, so it remains as an open order on the exchange. Then somebody comes along and is willing to meet my price (I am glossing over lots of details here). So they aggressively take out my order by either posting a market-buy, or specifically that they want to buy 100 shares at either $9.98, or at some higher price. Most exchanges will actually give me, as the passive (i.e. liquidity making) investor a small rebate, while the other person is charged a few fractions of a cent. Google found NYSEArca details, and most other exchan... [1,000 / 1,241 chars] |
Source Reference Table
| Title | Year | Type | URL |
| Financial Opinion Mining and Question Answering | 2018 | source task paper | https://doi.org/10.1145/3184558.3192301 |
| MTEB: Massive Text Embedding Benchmark | 2023 | benchmark paper | https://arxiv.org/abs/2210.07316 |
| mteb/fiqa | | dataset card | https://huggingface.co/datasets/mteb/fiqa |
Dataset Information
| Field | Value |
| Nano set | NanoMTEB-v2 |
| Backing dataset | NanoMTEB-v2 |
| Task / split | fi_qa2018 |
| Hugging Face dataset | hakari-bench/NanoMTEB-v2 |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 10,000 |
| Positive qrels | 534 |
| Positives / query avg | 2.67 |
| Positives / query min | 1 |
| Positives / query median | 2.00 |
| Positives / query max | 12 |
| Multi-positive queries | 128 (64.00%) |
| Query length avg chars | 61.70 |
| Document length avg chars | 780.39 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.3799 | 0.6600 | 0.7097 | top-500 |
| Dense | harrier_oss_v1_270m | 0.5494 | 0.8000 | 0.8258 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.5258 | 0.8000 | 0.8390 | top-100 |
Training and Leakage Metadata
- Original train split: available
- Evaluation split origin: MTEB FiQA2018 test split
- Train/eval overlap audit: not_audited
- Leakage note: exclude NanoMTEB-v2 fi_qa2018 question-answer pairs
- Multi-positive training: recommended
- Useful training data: financial QA pairs, investor forum answers, personal-finance FAQ retrieval data