NanoRTEB / NanoHC3Finance
Overview
NanoHC3Finance is an English finance-domain answer retrieval task from NanoRTEB. The query is a short personal-finance or investing prompt, and the relevant document is the paired explanatory answer from the HC3 finance subset. Each query has one positive answer among 415 documents. Dense retrieval has the strongest top-rank profile, reranking_hybrid has the best recall@100, and BM25 is weaker because many prompts are terse and do not provide enough lexical signal to identify the paired response.
Details
What the Original Data Measures
HC3 compares human expert and ChatGPT answers across several domains, including finance. The original corpus was designed for studying answer style and detection, not as a classic retrieval benchmark.
RTEB recasts the finance portion as a retrieval task. A system receives a user finance question and must retrieve the answer or explanation paired with it. The task is therefore answer retrieval over practical finance advice, not evidence retrieval from financial statements.
Observed Data Profile
The Nano split contains 200 queries, 415 documents, and 200 positive qrel rows. Every query has exactly one positive. Queries average 61.41 characters, while answer documents average 991.30 characters.
Example prompts ask whether a web scheme is legitimate, how to interpret Google Finance dividend data, how to track credit card transactions for fraud prevention, whether stock options encourage long-term investment, and whether a lender cares what borrowed money is used for.
BM25 Evaluation Profile
The BM25 candidate subset uses the full 415-document pool and reaches nDCG@10 of 0.3079, hit@10 of 0.4750, and recall@100 of 0.7800. BM25 can match distinctive finance terms such as dividends, credit cards, stock options, or loans.
The limitation is that many queries are short and broad. A general prompt, such as a beginner's question about investing, can match many plausible answers, and the paired response may use different vocabulary from the user question.
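The vocabulary gap can be illustrated with a minimal Okapi BM25 sketch. The tokenizer, the parameter values `k1=1.5` and `b=0.75`, and the three toy documents are assumptions made for illustration; they are not the configuration or data behind the bm25 subset.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric tokens; a deliberately simple, assumed tokenizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one Okapi BM25 score per document for the query."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in set(tokenize(query)):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Toy documents (hypothetical, not taken from the dataset).
docs = [
    "Index funds are a low cost way to begin. Start small and add monthly.",
    "Investing for beginners: how should I start investing my savings?",
    "Pay off high interest credit card debt before anything else.",
]
query = "How should a beginner start investing?"
scores = bm25_scores(query, docs)
```

In this toy setup the second document, which merely restates the question, outscores the first document, which actually gives the advice but shares only two query terms. The third document shares no terms and scores zero.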
Dense Evaluation Profile
The dense candidate subset from harrier_oss_v1_270m uses the full 415-document pool and reaches nDCG@10 of 0.4654, hit@10 of 0.6650, and recall@100 of 0.9150. Dense retrieval has the strongest top-rank profile of the three subsets.
This indicates that embedding similarity captures finance-topic intent and advice type better than term frequency. It is especially useful when the response explains a concept or risk without repeating the exact wording of the prompt.
Reranking Hybrid Evaluation Profile
The reranking_hybrid subset uses top-100 candidates, with 13 rows receiving the optional rank-101 safeguard. It reaches nDCG@10 of 0.4177, hit@10 of 0.5950, and recall@100 of 0.9350. Hybrid retrieval has the best broad coverage but weaker early ranking than dense retrieval.
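The 13 safeguard rows line up with the reported recall@100, assuming the safeguard applies exactly to the queries whose positive is missing from the top 100 (this card does not spell out the mechanism, so that reading is an assumption). A quick arithmetic check:

```python
# Figures reported for the reranking_hybrid subset.
num_queries = 200
recall_at_100 = 0.9350
safeguard_rows = 13

# With one positive per query, recall@100 is the fraction of queries
# whose positive is inside the top 100.
found = round(num_queries * recall_at_100)  # 187
missed = num_queries - found                # 13
```

Under that assumption, the 13 queries missed at rank 100 are the same 13 rows that carry a rank-101 candidate.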
This makes hybrid useful as a reranking pool. Sparse terms recover exact finance topics, while dense retrieval ranks paired responses more effectively. A downstream reranker can benefit from the extra coverage.
Metric Interpretation for Model Researchers
With one positive per query, nDCG@10 measures how early the paired answer appears, hit@10 measures whether it appears in the first ten candidates, and recall@100 measures whether it appears in the top 100 and is therefore available to a reranker.
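Because each query has a single positive, all three metrics reduce to functions of that positive's rank. The sketch below shows this reduction; the function names and the input layout (a ranked list of document ids plus one positive id) are assumptions for illustration, not the benchmark's evaluation code.

```python
import math


def single_positive_metrics(ranked_ids, positive_id):
    """Metrics for one query that has exactly one relevant document."""
    # 1-based rank of the positive, or None if it was not retrieved.
    rank = ranked_ids.index(positive_id) + 1 if positive_id in ranked_ids else None
    in_top10 = rank is not None and rank <= 10
    return {
        # Ideal DCG is 1 with a single positive, so nDCG@10 = 1 / log2(rank + 1).
        "ndcg@10": 1.0 / math.log2(rank + 1) if in_top10 else 0.0,
        "hit@10": 1.0 if in_top10 else 0.0,
        "recall@100": 1.0 if rank is not None and rank <= 100 else 0.0,
    }


def mean_metrics(runs):
    """Average per-query metrics over (ranked_ids, positive_id) pairs."""
    per_query = [single_positive_metrics(r, p) for r, p in runs]
    return {k: sum(m[k] for m in per_query) / len(per_query) for k in per_query[0]}
```

A positive at rank 1 contributes nDCG@10 of 1.0, a positive at rank 3 contributes 0.5, and a positive at rank 11 contributes nothing to nDCG@10 or hit@10 but still counts toward recall@100.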
For NanoHC3Finance, dense nDCG@10 is the main first-stage signal. Recall@100 matters because many answers are topically plausible and a reranker may need to compare advice specificity.
Query and Relevance Type Tendencies
Queries are short personal-finance prompts. Relevant documents are long explanatory answers about investing, loans, fraud, dividends, taxes, or credit. The answer often expands the topic far beyond the wording of the query.
Relevance is the original question-answer pairing. A finance answer can be topically reasonable and still be non-relevant if it is not the paired response.
Representative Failure Modes
Common failures include retrieving a generic finance explanation, confusing nearby topics such as investing and retirement accounts, overmatching exact terms while missing advice intent, and ranking broad answers above the paired answer. BM25 is limited by short queries; dense retrieval can still blur similar finance-advice categories.
Training Data That May Help
Useful training data includes personal-finance QA retrieval, finance forum question-answer pairs, answer ranking, and hard negatives from nearby topics such as budgeting, investing, taxes, loans, and credit. Evaluation questions, answers, and qrels should be excluded.
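One way to enforce the exclusion is an exact-match filter over normalized text, sketched below. The function names and the pair layout are hypothetical, and exact matching after normalization is only a baseline: paraphrased or truncated copies of evaluation items would need a near-duplicate check on top of it.

```python
import re


def normalize(text):
    # Lowercase and drop punctuation so trivial formatting edits still match.
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def drop_eval_overlap(train_pairs, eval_queries, eval_docs):
    """Keep only (question, answer) pairs that share no text with the eval set."""
    blocked_q = {normalize(q) for q in eval_queries}
    blocked_d = {normalize(d) for d in eval_docs}
    return [
        (q, a)
        for q, a in train_pairs
        if normalize(q) not in blocked_q and normalize(a) not in blocked_d
    ]
```

A pair is dropped if either its question or its answer matches an evaluation item, since leaking only the answer text is still contamination for a retrieval task.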
Model Improvement Notes
Models should represent user intent, topic, risk framing, and answer specificity. Hard negatives should share the same finance keyword but give advice for a different context. Dense retrieval is the best first-stage ranker, while hybrid retrieval is useful for higher-recall candidate generation.
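A minimal sketch of keyword-matched hard-negative mining follows. The keyword list is a hypothetical one drawn from the topics named in this card, matching is exact on tokens, and answers to other questions that share a keyword are treated as negatives; a real pipeline would likely add a retriever and a false-negative filter.

```python
import re

# Hypothetical keyword list based on the finance topics named in this card.
FINANCE_KEYWORDS = {"budgeting", "investing", "taxes", "loans", "credit", "dividends"}


def keywords(text):
    # Finance keywords that occur as exact tokens in the text.
    return set(re.findall(r"[a-z]+", text.lower())) & FINANCE_KEYWORDS


def mine_hard_negatives(pairs, max_negatives=2):
    """For each (question, answer) pair, collect answers to other questions
    that share at least one finance keyword with this question."""
    result = []
    for i, (question, answer) in enumerate(pairs):
        q_kw = keywords(question)
        negatives = [
            other_answer
            for j, (other_question, other_answer) in enumerate(pairs)
            if j != i and q_kw & keywords(other_question)
        ]
        result.append((question, answer, negatives[:max_negatives]))
    return result
```

Each returned triple holds the question, its paired answer, and up to `max_negatives` answers that share a keyword but belong to a different question, which mirrors the "same keyword, different context" criterion above.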
Example Data
| Query | Positive document |
| Is socialtrend.com or/and feelthetrend.com legitimate? [54 chars] | It's called a "Pyramid scheme". Its illegal in almost every country of the Western world. You're not going to earn lifetime income, of course, and these things collapse pretty quickly. Most of the "common folks" don't return the investment, its the organizers who take the money. Sometimes they run, most times they end up in jail. The way these schemes work is that they pay the early "investors" from the fees paid by new "investors". As long as a steady stream of new people keep signing up and paying into it those who got in very early make money. The idea is based on the geometric procession of each new person signing up two or more people, and those people doing the same. Pretty quickly at that rate you need to sign up every human being on the planet to keep the new money flowing in to make it work, which obviously is not realistic. Ultimately a small % of the people (if they can stay out of jail) will make a big amount of money the vast majority of "investors" get stiffed. [989 chars] |
| How to read Google Finance data on dividends [44 chars] | However, you have to remember that not all dividends are paid quarterly. For example one stock I recently purchased has a price of $8.03 and the Div/yield = 0.08/11.9 . $.08 * 4 = $0.32 which is only 3.9% (But this stock pays monthly dividends). $.08 * 12 = $0.96 which is 11.9 %. So over the course of a year assuming the stock price and the dividends didn't change you would make 11.9% [392 chars] |
| What is a good way to keep track of your credit card transactions, to reduce likelihood of fraud? [97 chars] | Read your bill, question things that don't look familiar. People who steal credit card numbers don't bother to conceal themselves well. So if you live in Florida, and all of the sudden charges appear in Idaho, you should investigate. Keeping charge slips seems counter-productive to me. I already know that I bought gasoline from the station down the street, a slip of paper whose date may or may not align with the credit card bill is not very useful. The half-life for a stolen card is hours. So you tend to see a bunch of charges appearing quickly. If someone is stealing $20 a week from you over an extended period of time, the theif is probably someone you live or work with, and paper slips won't help you there either. [725 chars] |
Source Reference Table
| Title | Year | Type | URL |
| How Close is ChatGPT to Human Experts? Comparison Corpus, Evaluation, and Detection | 2023 | task paper | https://arxiv.org/abs/2301.07597 |
| Hello-SimpleAI/HC3 | | dataset card | https://huggingface.co/datasets/Hello-SimpleAI/HC3 |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark article | https://huggingface.co/blog/rteb |
Dataset Information
| Field | Value |
| Nano set | NanoRTEB |
| Backing dataset | NanoRTEB |
| Task / split | NanoHC3Finance |
| Hugging Face dataset | hakari-bench/NanoRTEB |
| Language | en |
| Category | natural_language |
| Queries | 200 |
| Documents | 415 |
| Positive qrels | 200 |
| Positives / query avg | 1.00 |
| Positives / query min | 1 |
| Positives / query median | 1.00 |
| Positives / query max | 1 |
| Multi-positive queries | 0 (0.00%) |
| Query length avg chars | 61.41 |
| Document length avg chars | 991.30 |
Candidate Subsets
| Profile | Config | nDCG@10 | Hit@10 | Recall@100 | Candidates |
| BM25 | bm25 | 0.3079 | 0.4750 | 0.7800 | top-500 |
| Dense | harrier_oss_v1_270m | 0.4654 | 0.6650 | 0.9150 | top-500 |
| Reranking hybrid | reranking_hybrid | 0.4177 | 0.5950 | 0.9350 | top-100 |