NanoRTEB
Overview
NanoRTEB is the Nano task group for the open English portion of RTEB, the Retrieval Embedding Benchmark. It covers 14 retrieval tasks across legal, finance, code, healthcare, and technical-document settings. The group is intentionally heterogeneous: some tasks retrieve statutes or case law from long legal fact patterns, some retrieve financial filing evidence, some retrieve code or SQL from programming questions, and some retrieve medical or developer support answers.
The group contains 2,390 queries, 33,864 task-local documents, and 9,150 positive qrel rows. It evaluates retrieval-first behavior across practical domains rather than one academic task family. Models must handle long queries, short queries, long documents, code snippets, SQL strings, tables, multi-positive medical evidence, and enterprise-style technical search.
What This Group Measures
RTEB focuses on realistic retrieval evaluation across law, healthcare, code, and finance. NanoRTEB preserves that practical-domain mixture. Legal tasks retrieve case law, statutes, and legal clauses. Finance tasks retrieve filing evidence, financial answer passages, or table-backed numerical evidence. Code tasks retrieve Python solutions, data-science code, or SQL queries. Healthcare tasks retrieve patient responses or clinical evidence. FreshStack retrieves technical documentation for developer questions.
Several tasks are repurposed from generation or QA datasets, so relevance is not always ordinary passage similarity. A programming problem may retrieve an implementation; a table question may retrieve SQL; a legal summary may retrieve a clause; a clinical question may retrieve multiple biomedical passages. This variety in what counts as relevant is the main value of the group for retrieval-model research.
Task Families
- Legal retrieval: NanoAILACasedocs, NanoAILAStatutes, and NanoLegalSummarization.
- Finance retrieval: NanoFinanceBench, NanoHC3Finance, and NanoFinQA.
- Code and structured-query retrieval: NanoApps, NanoDS1000, NanoHumanEval, NanoMBPP, and NanoWikiSQL.
- Healthcare retrieval: NanoChatDoctor and NanoCUREv1.
- Technical-document retrieval: NanoFreshStack.
Dataset Shape
All tasks are English. Query and document length vary substantially. AILA legal queries are long fact patterns, while HC3Finance, MBPP, FinQA, and legal summarization use much shorter queries. Code and WikiSQL tasks often have long problem statements but short target code or SQL. AILA case documents are very long; WikiSQL target documents are short SQL strings.
Positive density also varies. NanoCUREv1 dominates the qrel count with 5,163 positives and 28.37 positives per query. NanoFreshStack (7.61), the two AILA tasks (3.90 and 4.34), and NanoLegalSummarization (1.72) are also multi-positive. Every code and finance task, along with NanoChatDoctor, averages 1.00 positives per query, so exact top-rank retrieval matters more there.
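The sensitivity of single-positive tasks to top-rank placement can be illustrated with a small nDCG@10 sketch. It assumes binary relevance and the common log2 rank discount; the source does not state which evaluation library or gain function produced the reported scores, and the helper names here are illustrative only.

```python
import math

def ndcg_at_k(ranked_relevance, num_positives, k=10):
    """nDCG@k for binary relevance (assumed; the source does not specify gains).

    ranked_relevance: 0/1 flags for the retrieved documents, best first.
    num_positives: total number of positives for the query in the qrels.
    """
    dcg = sum(rel / math.log2(rank + 1)
              for rank, rel in enumerate(ranked_relevance[:k], start=1))
    # The ideal ranking places as many positives as fit in the top k.
    ideal_hits = min(num_positives, k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def single_positive_run(rank, depth=10):
    """Relevance flags for a run whose only positive sits at a 1-based rank."""
    return [1 if position == rank else 0 for position in range(1, depth + 1)]

# With one positive, the score depends only on where that positive lands.
for rank in (1, 2, 5, 10):
    score = ndcg_at_k(single_positive_run(rank), num_positives=1)
    print(f"single positive at rank {rank}: nDCG@10 = {score:.4f}")
```

With a single positive, the score reduces to 1 / log2(rank + 1), so a drop from rank 1 to rank 2 already costs more than a third of the score. Multi-positive queries spread that credit across several ranks.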
Retrieval Behavior
BM25 Profile
BM25 is the best profile on only one task, NanoFinQA, where it scores 0.7330 nDCG@10 against 0.7309 for the reranking hybrid and 0.6051 for dense. That task often shares company names, years, financial metrics, and table labels between the query and evidence, giving sparse retrieval strong anchors. BM25 is also reasonable for NanoCUREv1, NanoLegalSummarization, and NanoWikiSQL, where terminology or schema terms often overlap with the target.
BM25 fails badly on code-generation-style retrieval. NanoApps has 0.0084 nDCG@10, and NanoMBPP has 0.0875, because problem statements and correct implementations share little surface text. Long legal and technical-document tasks also show that finding one overlapping term is not enough; ranking the right precedent, provision, or document requires more than token frequency.
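The lexical-overlap effect described above can be reproduced with a minimal Okapi BM25 sketch. The tokenizer, the `k1` and `b` values, and all example texts (including the company name "Acme") are assumptions made for illustration; they are not taken from the benchmark data or from the BM25 configuration used for the reported scores.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately simple, assumed tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())

def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    doc_tokens = [tokenize(doc) for doc in documents]
    n_docs = len(doc_tokens)
    avg_len = (sum(len(tokens) for tokens in doc_tokens) / n_docs) or 1.0
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in doc_tokens for term in set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in doc_tokens:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical code-retrieval pair: the problem text and the solutions share no tokens.
code_query = "Write a function that returns the largest value in a list of integers."
code_docs = ["def solve(xs):\n    return max(xs)",
             "def solve(xs):\n    return sorted(xs)[0]"]

# Hypothetical finance pair: entity, metric, and year all appear in the evidence.
finance_query = "What was Acme net revenue in 2019?"
finance_docs = ["Acme reported net revenue of 4.2 billion in 2019.",
                "Acme reported operating expenses of 1.1 billion in 2018."]

print("code:", bm25_scores(code_query, code_docs))
print("finance:", bm25_scores(finance_query, finance_docs))
```

In the toy code example, BM25 cannot separate the correct implementation from the wrong one because neither shares a token with the problem statement. In the finance example, shared names, metrics, and years are enough to rank the right passage first.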
Dense Profile
Dense retrieval with harrier-oss-270m is the strongest group-level profile at 0.5764 nDCG@10. It is the best profile on 9 of the 14 tasks: the single-answer semantic and code tasks NanoApps, NanoChatDoctor, NanoDS1000, NanoFinanceBench, NanoHC3Finance, NanoMBPP, and NanoWikiSQL, plus both AILA legal tasks. The gains on code and SQL are large, especially for MBPP and WikiSQL, where dense similarity can connect a natural-language task to an implementation or query.
Dense is not uniformly best. It trails BM25 on NanoFinQA, where exact financial evidence terms are highly useful, and trails hybrid on several multi-positive or evidence-heavy tasks. Still, dense retrieval is the best single profile for the group because many NanoRTEB tasks require semantic matching beyond exact lexical overlap.
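The group-level figure can be checked against the per-task numbers. The sketch below copies query counts and nDCG@10 values from the Task Summary table and computes both a query-weighted mean and an unweighted task mean. The 0.5764 dense figure is consistent with the query-weighted mean; that weighting is inferred from the arithmetic, not stated by the source.

```python
# (task, queries, BM25, dense, reranking hybrid) nDCG@10, copied from the Task Summary table.
ROWS = [
    ("NanoAILACasedocs", 50, 0.2805, 0.4003, 0.3667),
    ("NanoAILAStatutes", 50, 0.2070, 0.2711, 0.2564),
    ("NanoApps", 200, 0.0084, 0.2528, 0.1655),
    ("NanoCUREv1", 182, 0.5102, 0.5479, 0.5688),
    ("NanoChatDoctor", 200, 0.2952, 0.5533, 0.4671),
    ("NanoDS1000", 200, 0.4424, 0.6835, 0.6053),
    ("NanoFinQA", 200, 0.7330, 0.6051, 0.7309),
    ("NanoFinanceBench", 150, 0.4267, 0.7694, 0.6613),
    ("NanoFreshStack", 200, 0.2768, 0.3396, 0.3482),
    ("NanoHC3Finance", 200, 0.3079, 0.4654, 0.4177),
    ("NanoHumanEval", 158, 0.3405, 0.5666, 0.5770),
    ("NanoLegalSummarization", 200, 0.5678, 0.5861, 0.6085),
    ("NanoMBPP", 200, 0.0875, 0.7599, 0.2305),
    ("NanoWikiSQL", 200, 0.4898, 0.9507, 0.7763),
]
PROFILE_COLUMN = {"bm25": 2, "dense": 3, "hybrid": 4}

def query_weighted_mean(rows, profile):
    """Average per-task scores, weighting each task by its number of queries."""
    column = PROFILE_COLUMN[profile]
    total_queries = sum(row[1] for row in rows)
    return sum(row[1] * row[column] for row in rows) / total_queries

def task_mean(rows, profile):
    """Plain average over tasks, ignoring query counts."""
    column = PROFILE_COLUMN[profile]
    return sum(row[column] for row in rows) / len(rows)

for profile in PROFILE_COLUMN:
    print(f"{profile:>6}: query-weighted {query_weighted_mean(ROWS, profile):.4f}, "
          f"task mean {task_mean(ROWS, profile):.4f}")
```

The two averages differ because the two AILA tasks have only 50 queries each, so they count for less under query weighting than under a per-task mean.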
Reranking Hybrid Profile
The reranking hybrid profile is best for NanoCUREv1, NanoFreshStack, NanoHumanEval, and NanoLegalSummarization, and it has the best group-level recall@100. These tasks benefit from combining exact anchors with semantic matching: biomedical terms plus clinical meaning, documentation terms plus developer intent, function identifiers plus code behavior, and legal vocabulary plus clause meaning.
Hybrid is less effective than dense on NanoApps, NanoMBPP, NanoWikiSQL, and NanoFinanceBench, where sparse evidence can add noise to a strong semantic or structured-code signal. The group therefore supports task-aware retrieval design: dense is a strong default, hybrid is useful for evidence-rich and multi-positive tasks, and BM25 remains important for highly lexical financial evidence.
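The source does not describe how the reranking hybrid profile combines sparse and dense evidence. As a stand-in illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to merge two ranked lists. The constant `k=60` is a conventional choice, and the document IDs are hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) to every document it contains,
    so documents ranked well by several retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for deterministic output.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

Documents found by both retrievers (`d1`, `d3`) outrank documents found by only one. That is the behavior the hybrid profile relies on when exact anchors and semantic matches agree, and it is also how a noisy sparse list can push weaker candidates above a strong dense result.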
Task Summary
| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoAILACasedocs | Legal precedent retrieval | en | 50 | 186 | 195 | 3.90 | 0.2805 | 0.4003 | 0.3667 | Dense |
| NanoAILAStatutes | Statute retrieval | en | 50 | 82 | 217 | 4.34 | 0.2070 | 0.2711 | 0.2564 | Dense |
| NanoApps | Code retrieval | en | 200 | 8,754 | 200 | 1.00 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoCUREv1 | Clinical evidence retrieval | en | 182 | 10,000 | 5,163 | 28.37 | 0.5102 | 0.5479 | 0.5688 | Reranking hybrid |
| NanoChatDoctor | Medical answer retrieval | en | 200 | 5,545 | 200 | 1.00 | 0.2952 | 0.5533 | 0.4671 | Dense |
| NanoDS1000 | Data-science code retrieval | en | 200 | 997 | 200 | 1.00 | 0.4424 | 0.6835 | 0.6053 | Dense |
| NanoFinQA | Financial evidence retrieval | en | 200 | 380 | 200 | 1.00 | 0.7330 | 0.6051 | 0.7309 | BM25 |
| NanoFinanceBench | Filing evidence retrieval | en | 150 | 145 | 150 | 1.00 | 0.4267 | 0.7694 | 0.6613 | Dense |
| NanoFreshStack | Technical-document retrieval | en | 200 | 3,770 | 1,522 | 7.61 | 0.2768 | 0.3396 | 0.3482 | Reranking hybrid |
| NanoHC3Finance | Finance answer retrieval | en | 200 | 415 | 200 | 1.00 | 0.3079 | 0.4654 | 0.4177 | Dense |
| NanoHumanEval | Code retrieval | en | 158 | 158 | 158 | 1.00 | 0.3405 | 0.5666 | 0.5770 | Reranking hybrid |
| NanoLegalSummarization | Legal clause retrieval | en | 200 | 438 | 345 | 1.72 | 0.5678 | 0.5861 | 0.6085 | Reranking hybrid |
| NanoMBPP | Code retrieval | en | 200 | 972 | 200 | 1.00 | 0.0875 | 0.7599 | 0.2305 | Dense |
| NanoWikiSQL | Text-to-SQL retrieval | en | 200 | 2,022 | 200 | 1.00 | 0.4898 | 0.9507 | 0.7763 | Dense |
Interpretation Notes for Model Researchers
NanoRTEB is best interpreted by domain and target type. Code and SQL tasks favor dense retrieval, often by a wide margin; NanoHumanEval is the one exception, where hybrid is marginally ahead (0.5770 vs. 0.5666). Multi-positive clinical, legal, and technical documentation tasks often benefit from hybrid candidate generation. Finance evidence can remain highly lexical. A single aggregate score can hide whether a model is improving code retrieval, financial filing retrieval, legal search, or healthcare evidence retrieval.
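One way to look past the aggregate is a per-family breakdown. The sketch below groups the per-task nDCG@10 values from the Task Summary table by the families listed under Task Families and reports a query-weighted mean per family. The query weighting is an assumption chosen to match how the group-level score appears to be computed.

```python
# family -> list of (queries, BM25, dense, reranking hybrid) nDCG@10 per task,
# copied from the Task Summary table and grouped as in the Task Families list.
FAMILIES = {
    "legal": [(50, 0.2805, 0.4003, 0.3667),    # NanoAILACasedocs
              (50, 0.2070, 0.2711, 0.2564),    # NanoAILAStatutes
              (200, 0.5678, 0.5861, 0.6085)],  # NanoLegalSummarization
    "finance": [(150, 0.4267, 0.7694, 0.6613),   # NanoFinanceBench
                (200, 0.3079, 0.4654, 0.4177),   # NanoHC3Finance
                (200, 0.7330, 0.6051, 0.7309)],  # NanoFinQA
    "code": [(200, 0.0084, 0.2528, 0.1655),   # NanoApps
             (200, 0.4424, 0.6835, 0.6053),   # NanoDS1000
             (158, 0.3405, 0.5666, 0.5770),   # NanoHumanEval
             (200, 0.0875, 0.7599, 0.2305),   # NanoMBPP
             (200, 0.4898, 0.9507, 0.7763)],  # NanoWikiSQL
    "healthcare": [(200, 0.2952, 0.5533, 0.4671),   # NanoChatDoctor
                   (182, 0.5102, 0.5479, 0.5688)],  # NanoCUREv1
    "technical": [(200, 0.2768, 0.3396, 0.3482)],   # NanoFreshStack
}
PROFILES = ("bm25", "dense", "hybrid")

def family_scores(tasks):
    """Query-weighted mean nDCG@10 per profile for one family."""
    total_queries = sum(task[0] for task in tasks)
    return {
        profile: sum(task[0] * task[index + 1] for task in tasks) / total_queries
        for index, profile in enumerate(PROFILES)
    }

breakdown = {family: family_scores(tasks) for family, tasks in FAMILIES.items()}
best = {family: max(scores, key=scores.get) for family, scores in breakdown.items()}

for family, scores in breakdown.items():
    formatted = ", ".join(f"{name} {value:.4f}" for name, value in scores.items())
    print(f"{family:>10}: {formatted} -> best: {best[family]}")
```

A family-level view can disagree with both the group aggregate and the per-task winners, which is why per-task inspection remains useful alongside it.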
The group is also sensitive to memorization. Several code datasets have exact solutions; legal and finance tasks have small document pools; CURE and FreshStack include many positives. Per-task inspection and leakage audits matter when using NanoRTEB for model comparison.
Training and Leakage Notes
Useful training data includes legal precedent and statute retrieval, contract clause retrieval, SEC filing evidence retrieval, table QA retrieval, problem-to-code and docstring-to-code pairs, text-to-SQL examples, clinical evidence retrieval, patient-question-to-answer ranking, and developer documentation retrieval. Multi-positive tasks should retain multiple support documents when possible.
Leakage control should exclude NanoRTEB evaluation queries, qrels, positive documents, exact code solutions, SQL targets, legal clauses, financial tables, and near-duplicate source records. For code tasks, exact solution memorization is a serious risk; for legal, finance, and healthcare tasks, passage overlap can inflate scores without improving retrieval generalization.
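A minimal sketch of such a filter is shown below, under stated assumptions: exact matches are detected by hashing whitespace- and case-normalized text, and near-duplicates by Jaccard overlap of word 5-grams with a threshold of 0.8. The function names, the shingle size, and the threshold are illustrative choices, not values prescribed by the benchmark.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so formatting changes do not hide a match."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def fingerprint(text):
    """Stable hash of the normalized text, used for exact-match detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()

def shingles(text, size=5):
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    tokens = normalize(text).split()
    if len(tokens) < size:
        return {" ".join(tokens)} if tokens else set()
    return {" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)}

def jaccard(left, right):
    """Jaccard similarity of two sets; empty sets count as no overlap."""
    if not left or not right:
        return 0.0
    return len(left & right) / len(left | right)

def filter_training_texts(training_texts, eval_texts, near_dup_threshold=0.8):
    """Drop training texts that exactly or nearly duplicate any evaluation text."""
    eval_hashes = {fingerprint(text) for text in eval_texts}
    eval_shingles = [shingles(text) for text in eval_texts]
    kept = []
    for text in training_texts:
        if fingerprint(text) in eval_hashes:
            continue  # exact match after normalization
        candidate = shingles(text)
        if any(jaccard(candidate, other) >= near_dup_threshold for other in eval_shingles):
            continue  # near-duplicate of an evaluation text
        kept.append(text)
    return kept

# Hypothetical evaluation targets and training candidates.
eval_texts = ["def add(a, b):\n    return a + b",
              "SELECT name FROM players WHERE team = 'Ajax'"]
training_texts = ["def add(a, b):   return a + b",          # same code, different spacing
                  "def multiply(a, b):\n    return a * b",   # unrelated, should be kept
                  "SELECT name FROM players WHERE team = 'Ajax'"]

print(filter_training_texts(training_texts, eval_texts))
```

The pairwise comparison is quadratic and only suits small pools. It also cannot catch paraphrased or semantically equivalent solutions, so it complements a manual leakage audit and does not replace one.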
Source Reference Table
| Source | Year | Type | URL |
| --- | --- | --- | --- |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark page | https://huggingface.co/blog/rteb |
| Overview of the FIRE 2019 AILA Track | 2019 | source task paper | https://ceur-ws.org/Vol-2517/T1-1.pdf |
| Plain English Summarization of Contracts | 2019 | source task paper | https://aclanthology.org/W19-2201/ |
| FinanceBench | 2023 | source task paper | https://arxiv.org/abs/2311.11944 |
Metadata Summary
| Field | Value |
| --- | --- |
| Task pages | 14 |
| Queries | 2,390 |
| Task-local documents | 33,864 |
| Positive qrels | 9,150 |
| Languages | en |
| Categories | code, natural_language |
| Positives / query avg | 3.83 |
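The group totals above follow directly from the per-task counts. As a quick consistency check, the sketch below sums the query, document, and positive counts copied from the task tables and recomputes positives per query; the variable names are illustrative.

```python
# task -> (queries, documents, positive qrels), copied from the task tables.
COUNTS = {
    "NanoAILACasedocs": (50, 186, 195),
    "NanoAILAStatutes": (50, 82, 217),
    "NanoApps": (200, 8754, 200),
    "NanoCUREv1": (182, 10000, 5163),
    "NanoChatDoctor": (200, 5545, 200),
    "NanoDS1000": (200, 997, 200),
    "NanoFinQA": (200, 380, 200),
    "NanoFinanceBench": (150, 145, 150),
    "NanoFreshStack": (200, 3770, 1522),
    "NanoHC3Finance": (200, 415, 200),
    "NanoHumanEval": (158, 158, 158),
    "NanoLegalSummarization": (200, 438, 345),
    "NanoMBPP": (200, 972, 200),
    "NanoWikiSQL": (200, 2022, 200),
}

total_queries = sum(queries for queries, _, _ in COUNTS.values())
total_docs = sum(docs for _, docs, _ in COUNTS.values())
total_positives = sum(positives for _, _, positives in COUNTS.values())

# Positives per query for each task and for the group as a whole.
density = {task: positives / queries for task, (queries, _, positives) in COUNTS.items()}
group_density = total_positives / total_queries

print(f"queries={total_queries}, docs={total_docs}, positives={total_positives}")
print(f"group positives/query = {group_density:.2f}")
for task, value in sorted(density.items(), key=lambda item: -item[1]):
    print(f"{task}: {value:.2f}")
```

Sorting by density puts NanoCUREv1 first by a wide margin, which is why it accounts for more than half of the group's positive qrels.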
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoAILACasedocs | NanoRTEB | en | natural_language | 50 | 186 | 195 | 0.2805 | 0.4003 | 0.3667 | Dense |
| NanoAILAStatutes | NanoRTEB | en | natural_language | 50 | 82 | 217 | 0.2070 | 0.2711 | 0.2564 | Dense |
| NanoApps | NanoRTEB | en | code | 200 | 8,754 | 200 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoChatDoctor | NanoRTEB | en | natural_language | 200 | 5,545 | 200 | 0.2952 | 0.5533 | 0.4671 | Dense |
| NanoCUREv1 | NanoRTEB | en | natural_language | 182 | 10,000 | 5,163 | 0.5102 | 0.5479 | 0.5688 | Reranking hybrid |
| NanoDS1000 | NanoRTEB | en | code | 200 | 997 | 200 | 0.4424 | 0.6835 | 0.6053 | Dense |
| NanoFinanceBench | NanoRTEB | en | natural_language | 150 | 145 | 150 | 0.4267 | 0.7694 | 0.6613 | Dense |
| NanoFinQA | NanoRTEB | en | natural_language | 200 | 380 | 200 | 0.7330 | 0.6051 | 0.7309 | BM25 |
| NanoFreshStack | NanoRTEB | en | natural_language | 200 | 3,770 | 1,522 | 0.2768 | 0.3396 | 0.3482 | Reranking hybrid |
| NanoHC3Finance | NanoRTEB | en | natural_language | 200 | 415 | 200 | 0.3079 | 0.4654 | 0.4177 | Dense |
| NanoHumanEval | NanoRTEB | en | code | 158 | 158 | 158 | 0.3405 | 0.5666 | 0.5770 | Reranking hybrid |
| NanoLegalSummarization | NanoRTEB | en | natural_language | 200 | 438 | 345 | 0.5678 | 0.5861 | 0.6085 | Reranking hybrid |
| NanoMBPP | NanoRTEB | en | code | 200 | 972 | 200 | 0.0875 | 0.7599 | 0.2305 | Dense |
| NanoWikiSQL | NanoRTEB | en | code | 200 | 2,022 | 200 | 0.4898 | 0.9507 | 0.7763 | Dense |