HAKARI-Bench

NanoRTEB

Overview

NanoRTEB is the Nano task group for the open English portion of RTEB, the Retrieval Embedding Benchmark. It covers 14 retrieval tasks across legal, finance, code, healthcare, and technical-document settings. The group is intentionally heterogeneous: some tasks retrieve statutes or case law from long legal fact patterns, some retrieve financial filing evidence, some retrieve code or SQL from programming questions, and some retrieve medical or developer support answers.

The group contains 2,390 queries, 33,864 task-local documents, and 9,150 positive qrel rows. It evaluates retrieval-first behavior across practical domains rather than one academic task family. Models must handle long queries, short queries, long documents, code snippets, SQL strings, tables, multi-positive medical evidence, and enterprise-style technical search.

What This Group Measures

RTEB focuses on realistic retrieval evaluation across law, healthcare, code, and finance. NanoRTEB preserves that practical-domain mixture. Legal tasks retrieve case law, statutes, and legal clauses. Finance tasks retrieve filing evidence, financial answer passages, or table-backed numerical evidence. Code tasks retrieve Python solutions, data-science code, or SQL queries. Healthcare tasks retrieve patient responses or clinical evidence. FreshStack retrieves technical documentation for developer questions.

Several tasks are repurposed from generation or QA datasets, so relevance is not always ordinary passage similarity. A programming problem may retrieve an implementation; a table question may retrieve SQL; a legal summary may retrieve a clause; a clinical question may retrieve multiple biomedical passages. This gap between surface similarity and relevance is the main value of the group for retrieval-model research.

Task Families

Dataset Shape

All tasks are English. Query and document length vary substantially. AILA legal queries are long fact patterns, while HC3Finance, MBPP, FinQA, and legal summarization use much shorter queries. Code and WikiSQL tasks often have long problem statements but short target code or SQL. AILA case documents are very long; WikiSQL target documents are short SQL strings.

Positive density also varies. NanoCUREv1 dominates the qrel count with 5,163 positives and 28.37 positives per query. NanoFreshStack (7.61 positives per query), the two AILA tasks (3.90 and 4.34), and NanoLegalSummarization (1.72) are also multi-positive. Every code and finance split, along with NanoChatDoctor, has exactly one positive per query, so exact top-rank retrieval matters more there.
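The density figures above follow directly from the query and positive counts in the Task Summary table. A minimal sketch of that arithmetic, using a subset of the tasks; the 1.0 cutoff for "multi-positive" is an assumption made for illustration, not a rule stated by the benchmark:

```python
# Query and positive-qrel counts copied from the Task Summary table.
counts = {
    "NanoAILACasedocs": (50, 195),
    "NanoAILAStatutes": (50, 217),
    "NanoCUREv1": (182, 5163),
    "NanoFreshStack": (200, 1522),
    "NanoLegalSummarization": (200, 345),
    "NanoMBPP": (200, 200),
}


def positives_per_query(queries: int, positives: int) -> float:
    """Average number of positive qrel rows per query."""
    return positives / queries


density = {task: positives_per_query(q, p) for task, (q, p) in counts.items()}

# Assumed cutoff: more than one positive per query on average.
multi_positive = sorted(task for task, d in density.items() if d > 1.0)

for task, d in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{task:24s} {d:6.2f}")
```

Sorting by density makes it easy to see which tasks reward recall over many positives and which reward a single exact hit.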

Retrieval Behavior

BM25 Profile

BM25 is best only for NanoFinQA. That task often shares company names, years, financial metrics, and table labels between the query and evidence, giving sparse retrieval strong anchors. BM25 is also reasonable for NanoCUREv1, NanoLegalSummarization, and NanoWikiSQL, where terminology or schema terms often overlap with the target.

BM25 fails badly on code-generation-style retrieval: NanoApps reaches only 0.0084 nDCG@10 and NanoMBPP 0.0875, because problem statements and correct implementations share little surface text. Long legal and technical-document tasks, such as NanoAILAStatutes (0.2070) and NanoFreshStack (0.2768), also show that finding one overlapping term is not enough; ranking the right precedent, provision, or document requires more than token frequency.
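The vocabulary gap between a problem statement and its implementation can be illustrated with a toy BM25 scorer. This is a minimal sketch, not the benchmark's BM25 configuration: the tokenizer, the `k1` and `b` defaults, and the example query and documents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on non-alphanumeric characters (assumed tokenizer)."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    doc_tokens = [tokenize(d) for d in docs]
    n_docs = len(docs)
    avg_len = sum(len(t) for t in doc_tokens) / n_docs
    df = Counter()
    for tokens in doc_tokens:
        df.update(set(tokens))

    query_terms = set(tokenize(query))
    scores = []
    for tokens in doc_tokens:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical example: the correct answer shares no tokens with the query.
query = "Write a function to find the largest number in a list"
docs = [
    "def f(xs): return max(xs)",                               # correct code
    "A list of the largest companies by revenue in a filing",  # distractor
]
scores = bm25_scores(query, docs)
print(scores)
```

In this toy case the correct implementation receives a score of zero because it shares no tokens with the query, while the unrelated sentence is ranked first on overlap alone.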

Dense Profile

Dense retrieval with harrier-oss-270m is the strongest group-level profile at 0.5764 nDCG@10. It is best for most single-answer semantic or code tasks, including NanoApps, NanoChatDoctor, NanoDS1000, NanoFinanceBench, NanoHC3Finance, NanoMBPP, and NanoWikiSQL. The gains over BM25 on code and SQL are large, especially for NanoMBPP (0.0875 to 0.7599) and NanoWikiSQL (0.4898 to 0.9507), where dense similarity can connect a natural-language task to an implementation or query.
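The group-level figure can be related to the per-task scores in the Task Summary table. The sketch below computes both a query-weighted mean and an unweighted macro mean of the dense nDCG@10 column. The query-weighted value comes out at roughly 0.5764, which suggests the group score is weighted by query count; that is an inference from the published numbers, not a documented aggregation rule.

```python
# (task, queries, dense nDCG@10) copied from the Task Summary table.
dense_rows = [
    ("NanoAILACasedocs", 50, 0.4003),
    ("NanoAILAStatutes", 50, 0.2711),
    ("NanoApps", 200, 0.2528),
    ("NanoCUREv1", 182, 0.5479),
    ("NanoChatDoctor", 200, 0.5533),
    ("NanoDS1000", 200, 0.6835),
    ("NanoFinQA", 200, 0.6051),
    ("NanoFinanceBench", 150, 0.7694),
    ("NanoFreshStack", 200, 0.3396),
    ("NanoHC3Finance", 200, 0.4654),
    ("NanoHumanEval", 158, 0.5666),
    ("NanoLegalSummarization", 200, 0.5861),
    ("NanoMBPP", 200, 0.7599),
    ("NanoWikiSQL", 200, 0.9507),
]


def query_weighted_mean(rows) -> float:
    """Average per-task scores, weighting each task by its query count."""
    total_queries = sum(q for _, q, _ in rows)
    return sum(q * score for _, q, score in rows) / total_queries


def macro_mean(rows) -> float:
    """Average per-task scores with every task weighted equally."""
    return sum(score for _, _, score in rows) / len(rows)


weighted = query_weighted_mean(dense_rows)
macro = macro_mean(dense_rows)
print(f"query-weighted: {weighted:.4f}  macro: {macro:.4f}")
```

The two averages differ by about two points here because the small AILA tasks, where dense scores are low, count less under query weighting.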

Dense is not uniformly best. It trails BM25 on NanoFinQA, where exact financial evidence terms are highly useful, and trails hybrid on several multi-positive or evidence-heavy tasks. Still, dense retrieval is the best single profile for the group because many NanoRTEB tasks require semantic matching beyond exact lexical overlap.

Reranking Hybrid Profile

The reranking hybrid profile is best for NanoCUREv1, NanoFreshStack, NanoHumanEval, and NanoLegalSummarization, and it has the best group-level recall@100. These tasks benefit from combining exact anchors with semantic matching: biomedical terms plus clinical meaning, documentation terms plus developer intent, function identifiers plus code behavior, and legal vocabulary plus clause meaning.
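The source does not describe how the reranking hybrid profile combines sparse and dense signals. As one common stand-in for the candidate-fusion step only, the sketch below uses reciprocal rank fusion (RRF); the function name, the `k=60` constant, and the document IDs are illustrative assumptions, and any reranker applied afterwards is out of scope.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion.

    Each document receives 1 / (k + rank) from every list it appears in;
    ties are broken by document ID so the output is deterministic.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical candidate lists from a sparse and a dense retriever.
bm25_ranking = ["d3", "d1", "d7"]
dense_ranking = ["d1", "d5", "d3"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)
```

Documents that appear in both lists rise to the top, which matches the intuition of combining exact anchors with semantic matches, while documents found by only one retriever stay in the candidate pool.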

Hybrid is less effective than dense on NanoApps, NanoMBPP, NanoWikiSQL, and NanoFinanceBench, where sparse evidence can add noise to a strong semantic or structured-code signal. The group therefore supports task-aware retrieval design: dense is a strong default, hybrid is useful for evidence-rich and multi-positive tasks, and BM25 remains important for highly lexical financial evidence.

Task Summary

| Task | Family | Language | Queries | Docs | Positives | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoAILACasedocs | Legal precedent retrieval | en | 50 | 186 | 195 | 3.90 | 0.2805 | 0.4003 | 0.3667 | Dense |
| NanoAILAStatutes | Statute retrieval | en | 50 | 82 | 217 | 4.34 | 0.2070 | 0.2711 | 0.2564 | Dense |
| NanoApps | Code retrieval | en | 200 | 8,754 | 200 | 1.00 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoCUREv1 | Clinical evidence retrieval | en | 182 | 10,000 | 5,163 | 28.37 | 0.5102 | 0.5479 | 0.5688 | Reranking hybrid |
| NanoChatDoctor | Medical answer retrieval | en | 200 | 5,545 | 200 | 1.00 | 0.2952 | 0.5533 | 0.4671 | Dense |
| NanoDS1000 | Data-science code retrieval | en | 200 | 997 | 200 | 1.00 | 0.4424 | 0.6835 | 0.6053 | Dense |
| NanoFinQA | Financial evidence retrieval | en | 200 | 380 | 200 | 1.00 | 0.7330 | 0.6051 | 0.7309 | BM25 |
| NanoFinanceBench | Filing evidence retrieval | en | 150 | 145 | 150 | 1.00 | 0.4267 | 0.7694 | 0.6613 | Dense |
| NanoFreshStack | Technical-document retrieval | en | 200 | 3,770 | 1,522 | 7.61 | 0.2768 | 0.3396 | 0.3482 | Reranking hybrid |
| NanoHC3Finance | Finance answer retrieval | en | 200 | 415 | 200 | 1.00 | 0.3079 | 0.4654 | 0.4177 | Dense |
| NanoHumanEval | Code retrieval | en | 158 | 158 | 158 | 1.00 | 0.3405 | 0.5666 | 0.5770 | Reranking hybrid |
| NanoLegalSummarization | Legal clause retrieval | en | 200 | 438 | 345 | 1.72 | 0.5678 | 0.5861 | 0.6085 | Reranking hybrid |
| NanoMBPP | Code retrieval | en | 200 | 972 | 200 | 1.00 | 0.0875 | 0.7599 | 0.2305 | Dense |
| NanoWikiSQL | Text-to-SQL retrieval | en | 200 | 2,022 | 200 | 1.00 | 0.4898 | 0.9507 | 0.7763 | Dense |
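The nDCG@10 columns can be read with the standard definition in mind. The sketch below assumes binary relevance (a document is either a positive qrel or not) and the usual log2 rank discount; the benchmark's exact evaluation code is not shown in the source, so treat this as an illustration of the metric rather than its implementation. The document IDs are hypothetical.

```python
import math


def dcg_at_k(gains: list[float], k: int) -> float:
    """Discounted cumulative gain over the top-k gains (rank 1 has discount log2(2) = 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: list[str], positives: set[str], k: int = 10) -> float:
    """nDCG@k for one query, assuming binary relevance."""
    gains = [1.0 if doc_id in positives else 0.0 for doc_id in ranked_ids]
    ideal = [1.0] * min(len(positives), k)  # best case: all positives ranked first
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


# Single-positive query: dropping the positive from rank 1 to rank 2 costs ~37%.
at_rank_1 = ndcg_at_k(["p", "a", "b"], {"p"})
at_rank_2 = ndcg_at_k(["a", "p", "b"], {"p"})
print(at_rank_1, round(at_rank_2, 4))
```

For single-positive tasks the score depends entirely on where the one positive lands, which is why those rows are more sensitive to top-rank precision than the multi-positive rows.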

Interpretation Notes for Model Researchers

NanoRTEB is best interpreted by domain and target type. Code and SQL tasks mostly favor dense retrieval, with NanoHumanEval a narrow exception (0.5770 hybrid against 0.5666 dense). Multi-positive clinical, legal, and technical documentation tasks often benefit from hybrid candidate generation. Finance evidence can remain highly lexical. A single aggregate score can hide whether a model is improving code retrieval, financial filing retrieval, legal search, or healthcare evidence retrieval.
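One way to look past the aggregate is to average each profile within a category. The sketch below groups the per-task nDCG@10 values by the `code` and `natural_language` categories from the Task Metadata Summary table and takes an unweighted mean per profile. The grouping and the choice of an unweighted mean are illustrative assumptions; a finer split by domain (legal, finance, healthcare) would work the same way.

```python
from collections import defaultdict
from statistics import mean

# (task, category, BM25, dense, reranking hybrid) nDCG@10 from the task tables.
profile_rows = [
    ("NanoAILACasedocs", "natural_language", 0.2805, 0.4003, 0.3667),
    ("NanoAILAStatutes", "natural_language", 0.2070, 0.2711, 0.2564),
    ("NanoApps", "code", 0.0084, 0.2528, 0.1655),
    ("NanoChatDoctor", "natural_language", 0.2952, 0.5533, 0.4671),
    ("NanoCUREv1", "natural_language", 0.5102, 0.5479, 0.5688),
    ("NanoDS1000", "code", 0.4424, 0.6835, 0.6053),
    ("NanoFinanceBench", "natural_language", 0.4267, 0.7694, 0.6613),
    ("NanoFinQA", "natural_language", 0.7330, 0.6051, 0.7309),
    ("NanoFreshStack", "natural_language", 0.2768, 0.3396, 0.3482),
    ("NanoHC3Finance", "natural_language", 0.3079, 0.4654, 0.4177),
    ("NanoHumanEval", "code", 0.3405, 0.5666, 0.5770),
    ("NanoLegalSummarization", "natural_language", 0.5678, 0.5861, 0.6085),
    ("NanoMBPP", "code", 0.0875, 0.7599, 0.2305),
    ("NanoWikiSQL", "code", 0.4898, 0.9507, 0.7763),
]


def per_category_means(rows) -> dict[str, dict[str, float]]:
    """Unweighted mean nDCG@10 per retrieval profile within each category."""
    grouped = defaultdict(lambda: defaultdict(list))
    for _task, category, bm25, dense, hybrid in rows:
        grouped[category]["bm25"].append(bm25)
        grouped[category]["dense"].append(dense)
        grouped[category]["hybrid"].append(hybrid)
    return {
        category: {profile: mean(values) for profile, values in profiles.items()}
        for category, profiles in grouped.items()
    }


category_summary = per_category_means(profile_rows)
for category, profiles in category_summary.items():
    print(category, {name: round(value, 4) for name, value in profiles.items()})
```

Reporting these per-category rows next to the group score makes it visible when a gain comes from one target type only.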

The group is also sensitive to memorization. Several code datasets have exact solutions; legal and finance tasks have small document pools; CURE and FreshStack include many positives. Per-task inspection and leakage audits matter when using NanoRTEB for model comparison.

Training and Leakage Notes

Useful training data includes legal precedent and statute retrieval, contract clause retrieval, SEC filing evidence retrieval, table QA retrieval, problem-to-code and docstring-to-code pairs, text-to-SQL examples, clinical evidence retrieval, patient-question-to-answer ranking, and developer documentation retrieval. Multi-positive tasks should retain multiple support documents when possible.

Leakage control should exclude NanoRTEB evaluation queries, qrels, positive documents, exact code solutions, SQL targets, legal clauses, financial tables, and near-duplicate source records. For code tasks, exact solution memorization is a serious risk; for legal, finance, and healthcare tasks, passage overlap can inflate scores without improving retrieval generalization.
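A simple leakage filter along these lines can be sketched as follows. It drops training records that match an evaluation text exactly after whitespace and case normalization, or that overlap heavily in word 3-gram shingles. The function names, the shingle size, and the 0.7 Jaccard threshold are hypothetical choices for illustration; the pairwise comparison is quadratic, so a real audit over large corpora would likely need an indexed method such as MinHash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """Stable hash of the normalized text, used for exact-match detection."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Set of word n-grams; short texts fall back to one shingle of all tokens."""
    tokens = normalize(text).split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets; empty inputs count as dissimilar."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_training_records(train_texts: list[str], eval_texts: list[str],
                            threshold: float = 0.7) -> list[str]:
    """Keep only training texts that are neither exact nor near duplicates of eval texts."""
    blocked = {fingerprint(t) for t in eval_texts}
    eval_shingles = [shingles(t) for t in eval_texts]
    kept = []
    for text in train_texts:
        if fingerprint(text) in blocked:
            continue  # exact match after normalization
        candidate = shingles(text)
        if any(jaccard(candidate, ref) >= threshold for ref in eval_shingles):
            continue  # near duplicate by shingle overlap
        kept.append(text)
    return kept


# Hypothetical SQL targets: one exact duplicate, one near duplicate, one clean record.
eval_targets = ["SELECT name FROM t WHERE id = 1"]
train_pool = [
    "select  name from t where id = 1 ",
    "select name from t where id = 2",
    "SELECT age FROM t",
]
clean_pool = filter_training_records(train_pool, eval_targets)
print(clean_pool)
```

The same check applies to queries, positive documents, code solutions, and legal clauses by passing the corresponding evaluation texts as `eval_texts`.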

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| Introducing RTEB: A New Standard for Retrieval Evaluation | 2025 | benchmark page | https://huggingface.co/blog/rteb |
| Overview of the FIRE 2019 AILA Track | 2019 | source task paper | https://ceur-ws.org/Vol-2517/T1-1.pdf |
| Plain English Summarization of Contracts | 2019 | source task paper | https://aclanthology.org/W19-2201/ |
| FinanceBench | 2023 | source task paper | https://arxiv.org/abs/2311.11944 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 14 |
| Queries | 2,390 |
| Split-local documents | 33,864 |
| Positive qrels | 9,150 |
| Languages | en |
| Categories | code, natural_language |
| Positives / query avg | 3.83 |

Task Metadata Summary

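| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoAILACasedocs | NanoRTEB | en | natural_language | 50 | 186 | 195 | 0.2805 | 0.4003 | 0.3667 | Dense |
| NanoAILAStatutes | NanoRTEB | en | natural_language | 50 | 82 | 217 | 0.2070 | 0.2711 | 0.2564 | Dense |
| NanoApps | NanoRTEB | en | code | 200 | 8,754 | 200 | 0.0084 | 0.2528 | 0.1655 | Dense |
| NanoChatDoctor | NanoRTEB | en | natural_language | 200 | 5,545 | 200 | 0.2952 | 0.5533 | 0.4671 | Dense |
| NanoCUREv1 | NanoRTEB | en | natural_language | 182 | 10,000 | 5,163 | 0.5102 | 0.5479 | 0.5688 | Reranking hybrid |
| NanoDS1000 | NanoRTEB | en | code | 200 | 997 | 200 | 0.4424 | 0.6835 | 0.6053 | Dense |
| NanoFinanceBench | NanoRTEB | en | natural_language | 150 | 145 | 150 | 0.4267 | 0.7694 | 0.6613 | Dense |
| NanoFinQA | NanoRTEB | en | natural_language | 200 | 380 | 200 | 0.7330 | 0.6051 | 0.7309 | BM25 |
| NanoFreshStack | NanoRTEB | en | natural_language | 200 | 3,770 | 1,522 | 0.2768 | 0.3396 | 0.3482 | Reranking hybrid |
| NanoHC3Finance | NanoRTEB | en | natural_language | 200 | 415 | 200 | 0.3079 | 0.4654 | 0.4177 | Dense |
| NanoHumanEval | NanoRTEB | en | code | 158 | 158 | 158 | 0.3405 | 0.5666 | 0.5770 | Reranking hybrid |
| NanoLegalSummarization | NanoRTEB | en | natural_language | 200 | 438 | 345 | 0.5678 | 0.5861 | 0.6085 | Reranking hybrid |
| NanoMBPP | NanoRTEB | en | code | 200 | 972 | 200 | 0.0875 | 0.7599 | 0.2305 | Dense |
| NanoWikiSQL | NanoRTEB | en | code | 200 | 2,022 | 200 | 0.4898 | 0.9507 | 0.7763 | Dense |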