NanoMTEB-v2 / fi_qa2018

Overview

NanoMTEB-v2 / fi_qa2018 is an English financial question-answer retrieval task. Queries are investor, personal-finance, or market-related questions, and relevant documents are answers or answer-like passages from a financial QA collection. FiQA was created for financial opinion mining and question answering, and MTEB includes it as a retrieval benchmark. This Nano split contains 200 queries over 10,000 documents and is strongly multi-positive: 128 of the 200 queries (64%) have more than one relevant answer. It is useful for studying retrieval in a domain where terminology matters but where the best answer often depends on scenario interpretation, financial concepts, and advice-like reasoning rather than simple keyword overlap.

Details

What the Original Data Measures

FiQA targets question answering and opinion mining in the financial domain. The retrieval task asks whether a system can match a short finance question with passages that answer it or provide a useful explanation. The text differs from encyclopedic retrieval because answers often contain personal advice, legal or tax caveats, investor reasoning, and informal forum language.

This makes the task a good test of domain-specific semantic retrieval. A model must understand financial products, tax concepts, market mechanics, and user intent.

Observed Data Profile

The Nano split contains 200 queries, 10,000 documents, and 534 positive qrel rows. Queries have 2.67 positives on average, with a median of 2 and a maximum of 12. There are 128 multi-positive queries, or 64.0% of the query set. Queries average 61.70 characters, while documents average 780.39 characters.
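The profile figures above are simple counts over the qrels. The following is a minimal sketch of how they could be reproduced, assuming the qrels have already been loaded into a `{query_id: set_of_doc_ids}` mapping; the helper name `qrel_profile` and the toy IDs are hypothetical and not part of the dataset.

```python
from statistics import mean, median


def qrel_profile(qrels):
    """Summarize positives per query for a {query_id: set(doc_ids)} mapping."""
    counts = [len(docs) for docs in qrels.values()]
    multi = sum(1 for c in counts if c > 1)
    return {
        "queries": len(counts),
        "positive_rows": sum(counts),
        "avg": mean(counts),
        "median": median(counts),
        "max": max(counts),
        "multi_positive": multi,
        "multi_positive_rate": multi / len(counts),
    }


# Toy qrels with hypothetical IDs; this is NOT the real split.
toy_qrels = {
    "q1": {"d1", "d2", "d3"},
    "q2": {"d4"},
    "q3": {"d5", "d6"},
    "q4": {"d7"},
}
profile = qrel_profile(toy_qrels)

# Consistency check on the reported figures: 534 rows over 200 queries,
# and 128 multi-positive queries out of 200.
reported_avg = round(534 / 200, 2)
reported_multi_rate = 128 / 200
```

The last two lines confirm that the reported average (2.67) and multi-positive rate (64.0%) follow directly from the reported row and query counts.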

Examples include stock and ETF taxation, exchange-rate conversion, brokerage exchange fees, freelance income earned abroad, and inflation measurement. Documents are usually explanatory answers rather than formal article passages.

BM25 Evaluation Profile

The BM25 candidate subset uses top-500 candidates and reaches nDCG@10 of 0.3799, hit@10 of 0.6600, and recall@100 of 0.7097. BM25 is moderately useful: it places at least one relevant answer in the top 10 for about two-thirds of queries, because finance questions and answers share domain terms such as tax, exchange rate, brokerage, inflation, ETF, trade, or income.

However, sparse matching is limited by the advice-like nature of the task. The answer may use a different framing from the question, discuss the relevant rule indirectly, or require connecting a scenario to a financial concept. BM25 can also over-rank passages that share a product term but answer a different decision problem.
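The over-ranking effect described above can be illustrated with a small Okapi BM25 scorer. This is a sketch under stated assumptions: the tokenization (whitespace split), the parameters `k1=1.2` and `b=0.75`, and the three toy documents are all chosen for illustration and are not the configuration used to build the `bm25` subset.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in set(query):
            if tf[term] == 0:
                continue  # term absent from this doc contributes nothing
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Invented toy corpus: doc 0 repeats the product term "etf" but is about
# expense ratios, doc 1 actually discusses tax, doc 2 is off-topic.
docs = [
    "etf etf expense ratio comparison etf".split(),
    "selling shares triggers capital gains tax".split(),
    "inflation is measured with a price index".split(),
]
query = "tax on etf".split()
scores = bm25_scores(query, docs)
```

In this toy setup the expense-ratio passage outscores the tax passage purely because it repeats the product term, which is the kind of vocabulary-only match the paragraph above warns about.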

Dense Evaluation Profile

The dense candidate subset from harrier_oss_v1_270m uses top-500 candidates and reaches nDCG@10 of 0.5494, hit@10 of 0.8000, and recall@100 of 0.8258. Dense retrieval is clearly stronger than BM25 across all reported metrics. This indicates that semantic matching is important for mapping short finance questions to longer explanatory answers.

The dense advantage is especially meaningful because questions often encode intent rather than exact answer wording: a user asks what exchange rate applies, whether state tax is owed, or how inflation is calculated. Dense models can connect those intents to explanatory passages even when surface terms differ.
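To make "clearly stronger" concrete, the gap between the two profiles can be worked out from the reported metrics. The snippet below is only arithmetic on the numbers quoted in this card; the helper name `gains` is hypothetical.

```python
# Metrics as reported in this card for the two first-stage profiles.
bm25 = {"ndcg@10": 0.3799, "hit@10": 0.6600, "recall@100": 0.7097}
dense = {"ndcg@10": 0.5494, "hit@10": 0.8000, "recall@100": 0.8258}


def gains(baseline, candidate):
    """Return {metric: (absolute delta, relative gain in percent)}."""
    out = {}
    for name, base in baseline.items():
        delta = candidate[name] - base
        out[name] = (round(delta, 4), round(100 * delta / base, 1))
    return out


dense_vs_bm25 = gains(bm25, dense)
for name, (delta, pct) in dense_vs_bm25.items():
    print(f"{name}: +{delta:.4f} absolute, +{pct:.1f}% relative")
```

The largest relative gain is on nDCG@10 (roughly 45%), with smaller gains on hit@10 (roughly 21%) and recall@100 (roughly 16%), so the dense model improves rank ordering more than it improves coverage.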

Reranking Hybrid Evaluation Profile

The reranking_hybrid subset uses top-100 candidates, with 17 queries carrying a rank-101 safeguard positive. It reaches nDCG@10 of 0.5258, hit@10 of 0.8000, and recall@100 of 0.8390. The hybrid pool has the best recall@100 and matches dense hit@10, but dense retrieval remains strongest by nDCG@10.

This suggests that sparse retrieval adds complementary candidates through exact financial terms, while dense retrieval gives better rank ordering. A reranker can benefit from hybrid coverage if it can distinguish answers that address the user's actual financial decision from those that merely share vocabulary.
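This card does not state how the `reranking_hybrid` pool is constructed. As one common way to combine sparse and dense candidates, the sketch below uses reciprocal rank fusion (RRF); the fusion method, the constant `k=60`, and the toy rankings are assumptions for illustration, not a description of the actual subset.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=100):
    """Fuse ranked doc-id lists; each list contributes 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    ordered = sorted(fused, key=lambda d: (-fused[d], d))
    return ordered[:top_n]


# Hypothetical rankings for a single query.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d4", "d3"]
pool = reciprocal_rank_fusion([bm25_ranking, dense_ranking], top_n=3)
```

Documents that appear in both lists (`d1`, `d3`) rise to the top of the fused pool, while documents found by only one retriever still enter it, which is how a hybrid pool can raise recall without guaranteeing a better top-10 ordering.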

Metric Interpretation for Model Researchers

The high multi-positive rate means recall@100 is important: a useful system should expose multiple acceptable financial answers for a downstream ranker. nDCG@10 still matters because users need the best answer early, and finance answers may differ in relevance, specificity, and caveat quality.
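For reference, the three metrics can be sketched per query as follows. This assumes binary relevance (a document is either a positive or not) and is not necessarily the exact evaluation code behind the reported numbers; the function names are illustrative.

```python
import math


def hit_at_k(ranked, positives, k=10):
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(d in positives for d in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k=100):
    """Fraction of this query's positives found in the top k."""
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k=10):
    """Binary-relevance nDCG: discounted gain over the ideal ordering."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, d in enumerate(ranked[:k])
        if d in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal


# Toy query with three positives, two of which are retrieved.
ranked = ["d1", "d9", "d3", "d8"]
positives = {"d1", "d3", "d5"}
hit = hit_at_k(ranked, positives)
recall = recall_at_k(ranked, positives)
ndcg = ndcg_at_k(ranked, positives)
```

With several positives per query, hit@10 saturates as soon as one answer is found, whereas recall@100 and nDCG@10 keep rewarding additional positives, which is why they are more informative on this split.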

Dense retrieval is the strongest first-stage baseline, while hybrid retrieval is the better candidate pool for reranking. BM25 alone trails dense retrieval by about 0.17 nDCG@10 and 0.12 recall@100, which shows how much of the task depends on semantic matching.

Query and Relevance Type Tendencies

Queries are short finance questions, often about taxes, brokerage mechanics, exchange rates, credit, investment products, or personal-finance decisions. Relevant documents are longer answer passages containing explanations, caveats, or practical advice.

The relevance relation is answer usefulness. A passage should answer the specific financial question, not merely mention the same product or market term.

Representative Failure Modes

Common failures include retrieving a passage about the same financial instrument but a different tax situation, matching a keyword such as ETF or exchange without answering the user's scenario, missing jurisdictional details, and confusing general market explanation with personal-finance advice. Dense models may retrieve plausible but non-specific advice; sparse models may over-rank exact term overlap.

Training Data That May Help

Useful training data includes financial QA pairs, investor-forum answers, personal-finance FAQ retrieval data, and hard negatives from the same topic or product type. Multi-positive training is recommended because many finance questions have several acceptable explanatory answers.
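One way to use several positives per question is a contrastive loss that treats all labeled positives as correct at once. The sketch below shows one common formulation (negative log of the softmax mass assigned to the positive set); the choice of loss, the temperature value, and the toy scores are assumptions, not a method prescribed by this card.

```python
import math


def multi_positive_nce(scores, positive_idx, temperature=0.05):
    """-log(sum of softmax mass on positives) for one query.

    scores: similarity of the query to every candidate (positives and negatives).
    positive_idx: indices in `scores` that are labeled positives.
    """
    logits = [s / temperature for s in scores]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    pos_mass = sum(exps[i] for i in positive_idx)
    return -math.log(pos_mass / sum(exps))


# Two positives (indices 0 and 1) among four candidates.
uninformed = multi_positive_nce([0.5, 0.5, 0.5, 0.5], [0, 1])
well_separated = multi_positive_nce([0.9, 0.8, 0.2, 0.1], [0, 1])
```

When all candidates score the same, the loss equals log 2 because half of the softmax mass falls on the two positives; it approaches zero as the positives pull away from the negatives. A single-positive loss would instead penalize the second acceptable answer as if it were a negative.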

Model Improvement Notes

Models should learn financial intent, not only financial vocabulary. Hard negatives should share the same product, tax term, or market concept while answering a different decision. Rerankers should account for answer specificity, jurisdictional caveats, and whether the passage directly resolves the question.
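Hard negatives of the kind described above can be approximated by taking highly ranked candidates that are not labeled positives, for example from a BM25 or dense candidate list. The following is a minimal sketch; the function name, the `skip_top` parameter, and the toy IDs are hypothetical, and unlabeled documents near the top may still be valid answers.

```python
def mine_hard_negatives(ranked, positives, n=4, skip_top=0):
    """Return up to n highest-ranked non-positive doc ids.

    skip_top drops the first few ranks, a simple guard against
    unlabeled positives (false negatives) sitting at the very top.
    """
    negatives = [d for d in ranked[skip_top:] if d not in positives]
    return negatives[:n]


# Hypothetical candidate list for one query, best first.
ranked = ["d2", "d1", "d5", "d9", "d4"]
positives = {"d1", "d4"}
hard = mine_hard_negatives(ranked, positives, n=2)
hard_guarded = mine_hard_negatives(ranked, positives, n=2, skip_top=1)
```

Because the candidates come from a retriever run on the same question, the mined negatives tend to share its product or tax vocabulary while answering something else, which matches the hard-negative profile recommended here.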

Example Data

Query	Positive document
Tax on Stocks or ETF's [22 chars]	If you sell a stock, with no distributions, then your gain is taxable under §1001. But not all realized gains will be recognized as taxable. And some gains which are arguably not realized, will be recognized as taxable. The stock is usually a capital asset for investors, who will generate capital gains under §1(h), but dealers, traders, and hedgers will get different treatment. If you are an investor, and you held the stock for a year or more, then you can get the beneficial capital gain rates (e.g. 20% instead of 39.6%). If the asset was held short-term, less than a year, then your tax will generally be calculated at the higher ordinary income rates. There is also the problem of the net investment tax under §1411. I am eliding many exceptions, qualifications, and permutations of these rules. If you receive a §316 dividend from a stock, then that is §61 income. Qualified dividends are ordinary income but will generally be taxed at capital gains rates under §1(h)(11). Distributions in... [1,000 / 1,997 chars]
What exchange rate does El Al use when converting final payment amount to shekels? [82 chars]	The rate for "checks and transfers" is set by each bank multiple times during the day based on the market. It is as opposed to the rate for "cash/banknotes", also set by each bank, and the "representative rate" (שער היציג) set by the Bank of Israel. These rates can be found on the websites of most banks. Here is Bank Hapoalim and Bank Leumi. The question is which bank's rate will be used. It might be the bank that issued your card, El Al's bank, or the credit card company (ie Poalim for Isracard or Leumi for CAL). You will need to call El Al to verify, but since these are market rates, they shouldn't be too different. [633 chars]
How much do brokerages pay exchanges per trade? [47 chars]	There is no one answer to this question, but there are some generalities. Most exchanges make a distinction between the passive and the aggressive sides of a trade. The passive participant is the order that was resting on the market at the time of the trade. It is an order that based on its price was not executable at the time, and therefore goes into the order book. For example, I'm willing to sell 100 shares of a stock at $9.98 but nobody wants to buy that right now, so it remains as an open order on the exchange. Then somebody comes along and is willing to meet my price (I am glossing over lots of details here). So they aggressively take out my order by either posting a market-buy, or specifically that they want to buy 100 shares at either $9.98, or at some higher price. Most exchanges will actually give me, as the passive (i.e. liquidity making) investor a small rebate, while the other person is charged a few fractions of a cent. Google found NYSEArca details, and most other exchan... [1,000 / 1,241 chars]

Source Reference Table

Title	Year	Type	URL
Financial Opinion Mining and Question Answering	2018	source task paper	https://doi.org/10.1145/3184558.3192301
MTEB: Massive Text Embedding Benchmark	2023	benchmark paper	https://arxiv.org/abs/2210.07316
mteb/fiqa		dataset card	https://huggingface.co/datasets/mteb/fiqa

Dataset Information

Field	Value
Nano set	NanoMTEB-v2
Backing dataset	NanoMTEB-v2
Task / split	fi_qa2018
Hugging Face dataset	hakari-bench/NanoMTEB-v2
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	534
Positives / query avg	2.67
Positives / query min	1
Positives / query median	2.00
Positives / query max	12
Multi-positive queries	128 (64.00%)
Query length avg chars	61.70
Document length avg chars	780.39

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.3799	0.6600	0.7097	top-500
Dense	`harrier_oss_v1_270m`	0.5494	0.8000	0.8258	top-500
Reranking hybrid	`reranking_hybrid`	0.5258	0.8000	0.8390	top-100

Training and Leakage Metadata

Original train split: available
Evaluation split origin: MTEB FiQA2018 test split
Train/eval overlap audit: not_audited
Leakage note: exclude NanoMTEB-v2 fi_qa2018 question-answer pairs from training data
Multi-positive training: recommended
Useful training data: financial QA pairs, investor forum answers, personal-finance FAQ retrieval data
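
The leakage note can be applied as a simple filter over training pairs. The sketch below drops any training pair whose question matches an evaluation query after light normalization; the helper names and the normalization rule are assumptions, and exact matching will not catch paraphrased duplicates.

```python
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def drop_eval_overlap(train_pairs, eval_queries):
    """Keep only (question, answer) pairs whose question is not an eval query."""
    blocked = {normalize(q) for q in eval_queries}
    return [(q, a) for q, a in train_pairs if normalize(q) not in blocked]


# Evaluation query taken from the example table; training pairs are invented.
eval_queries = ["How much do brokerages pay exchanges per trade?"]
train_pairs = [
    ("how much do brokerages pay exchanges per trade", "placeholder answer"),
    ("What is an index fund?", "placeholder answer"),
]
clean_pairs = drop_eval_overlap(train_pairs, eval_queries)
```

Since the train/eval overlap audit is marked not_audited, a stricter variant could also block training pairs whose answer text matches a document that is a labeled positive in this split.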