HAKARI-Bench

NanoBRIGHT

Overview

NanoBRIGHT is the compact Nano set for BRIGHT, a reasoning-intensive retrieval benchmark. It contains English retrieval tasks from math problem solving, theorem use, programming, StackExchange-style evidence retrieval, and long-document web evidence retrieval. The positive document is often useful because it supports a reasoning step, not because it paraphrases the query.

This group is useful for evaluating whether retrievers can connect a query to a mechanism, theorem, algorithm, cited source, API behavior, or supporting evidence. Many queries contain enough domain vocabulary for BM25 to find topical neighbors, but topical neighbors are often not the labeled positives. The dense profile tests whether embedding similarity captures the hidden reasoning relation, and the reranking_hybrid profile is valuable when exact technical terms and semantic problem structure recover different positives.

What This Group Measures

The BRIGHT paper, "BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval", introduces retrieval tasks where standard lexical or semantic matching is not enough. NanoBRIGHT keeps that premise in smaller form. AoPS and TheoremQA tasks retrieve problems or theorem statements connected by solution skill. LeetCode and Pony tasks retrieve algorithmic or language-reference evidence. Biology, Earth Science, Economics, Psychology, Robotics, Stack Overflow, and Sustainable Living tasks retrieve cited support for complex user questions, with long variants retrieving full source pages.

The group measures reasoning-aware first-stage retrieval. A strong model should retrieve evidence that helps solve or justify the query, even when the evidence does not repeat the query wording.

Task Families

Dataset Shape

NanoBRIGHT contains 20 task pages, 2,245 queries, 121,771 split-local documents, and 9,287 positive qrel rows. All tasks are English, but their formats differ substantially. LeetCode, Robotics, and Stack Overflow queries can be long and technical; theorem queries can be compact but require abstract matching; long variants have small document pools with very large documents.

The group is multi-positive overall, averaging 4.14 positives per query. Passage-style StackExchange tasks and NanoBrightAops average roughly 4 to 8 positives per query, and NanoBrightPony averages about 20. Most long-document variants and the theorem tasks are closer to single-positive, at between 1 and about 2 positives per query; NanoBrightPonyLong, at about 7, is the exception. This makes Recall@100 important: a retriever may find a plausible supporting document but still miss much of the relevant set.
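The effect of multiple positives on Recall@100 can be made concrete with a minimal sketch. The function name `recall_at_k` and the document ids are illustrative assumptions, not part of any benchmark tooling; the per-task counts are the ones reported on this page.

```python
def recall_at_k(ranked_doc_ids, positive_ids, k=100):
    """Fraction of a query's positives that appear in the top-k results."""
    positives = set(positive_ids)
    if not positives:
        return 0.0
    retrieved = set(ranked_doc_ids[:k])
    return len(retrieved & positives) / len(positives)


# Hypothetical query with five positives: the ranking surfaces two of them.
ranked = ["d7", "d2", "d9", "d4", "d1"]
relevant = {"d2", "d4", "d11", "d12", "d13"}
print(recall_at_k(ranked, relevant, k=5))  # 0.4

# (queries, positive qrel rows) as reported for three tasks in this group.
task_counts = {
    "NanoBrightPony": (112, 2219),
    "NanoBrightBiologyLong": (103, 134),
    "NanoBrightTheoremQATheorems": (76, 151),
}
for task, (num_queries, num_positives) in task_counts.items():
    print(f"{task}: {num_positives / num_queries:.2f} positives per query")
```

A retriever that places one positive at rank 1 can score well on a top-heavy metric for this hypothetical query while still recovering only part of the relevant set, which is the gap Recall@100 is meant to expose.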

Retrieval Behavior

BM25 Profile

BM25 performs best when the query contains exact technical phrases that appear in the support document. Stack Overflow long documents, Earth Science, Biology, and Sustainable Living have visible sparse signal. The theorem-statement task is the hardest BM25 case because applied problem text rarely looks like a formal theorem statement. Pony is also hard because many programming tasks share surface vocabulary without sharing the relevant language behavior.

The key lesson is that technical vocabulary does not make retrieval easy. It can retrieve domain neighbors, but the benchmark rewards documents that support the reasoning path.
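The following toy example illustrates how shared vocabulary can rank a topical neighbor above the document that supports the reasoning step. It is a from-scratch sketch of Okapi BM25 with a commonly used smoothed idf; the parameter values `k1=1.5` and `b=0.75`, the whitespace tokenizer, and all texts are assumptions for illustration and do not reproduce the benchmark's BM25 configuration.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25 (toy version)."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    num_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized.values()) / num_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized.values() for term in set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        term_freq = Counter(tokens)
        length_norm = 1.0 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in set(query.lower().split()):
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            idf = math.log((num_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            score += idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)
        scores[doc_id] = score
    return scores


# Hypothetical texts: the neighbor shares wording, the theorem supports the solution.
docs = {
    "topical_neighbor": "find the area of a triangle given two sides and the height",
    "supporting_theorem": "law of cosines c squared equals a squared plus b squared minus two a b cos gamma",
    "unrelated": "binary search halves the interval at each step",
}
query = "find the third side of a triangle given two sides and the included angle"
for doc_id, score in sorted(bm25_scores(query, docs).items(), key=lambda item: -item[1]):
    print(f"{doc_id}: {score:.3f}")
```

In this toy corpus the neighbor outranks the theorem because it repeats most of the query's wording, even though only the theorem helps solve the problem.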

Dense Profile

Dense retrieval is helpful on many reasoning tasks, especially Biology, Earth Science, Psychology, Sustainable Living, and theorem-question retrieval. It can connect a problem to supporting evidence even when wording differs. Dense retrieval is also valuable on most long-document variants, where exact terms may be buried among unrelated page text; the exceptions are StackoverflowLong and PonyLong, where BM25 scores higher than dense.

Dense retrieval still struggles when the relevance relation is highly formal or algorithmic. Theorem statements, Pony tasks, and some programming cases require precise matching of method, API behavior, or proof concept, not just topical semantic similarity.

Reranking Hybrid Profile

The reranking_hybrid profile is particularly informative in NanoBRIGHT. It is the best profile for 8 of the 20 tasks: the Earth Science, LeetCode, Robotics, and Stack Overflow passage tasks, both Pony tasks, and the Robotics and Stack Overflow long variants. These are cases where sparse anchors and dense problem structure recover complementary candidates.
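One way to see how sparse and dense candidates can complement each other is rank fusion. The source does not state how reranking_hybrid combines its candidates, so the sketch below uses reciprocal rank fusion purely as an illustrative stand-in; the function name, the constant `k=60`, and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each list adds 1 / (k + rank) to a document's score."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by id for determinism.
    return sorted(fused, key=lambda doc_id: (-fused[doc_id], doc_id))


# Hypothetical top-3 lists for one developer question.
bm25_ranking = ["api_reference", "forum_post", "tutorial"]
dense_ranking = ["concept_page", "api_reference", "blog_post"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
# ['api_reference', 'concept_page', 'forum_post', 'blog_post', 'tutorial']
```

The fused pool contains candidates that only one of the two retrievers found, and a document found by both rises to the top.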

For reranker experiments, NanoBRIGHT is a strong stress test because candidates are easy to lose: if the initial pool misses the theorem, algorithm, or cited page, the reranker cannot recover the correct reasoning evidence.
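Candidate loss can be measured before any reranker runs. The sketch below, with hypothetical names and data, computes the share of queries whose candidate pool contains at least one positive; a reranker that only reorders the pool cannot produce a hit for the remaining queries.

```python
def candidate_coverage(candidates_by_query, positives_by_query):
    """Share of queries whose candidate pool contains at least one positive."""
    if not positives_by_query:
        return 0.0
    covered = sum(
        1
        for query_id, positives in positives_by_query.items()
        if set(candidates_by_query.get(query_id, [])) & set(positives)
    )
    return covered / len(positives_by_query)


# Hypothetical pools: the only positive for q2 never reaches the reranker.
candidates = {"q1": ["d3", "d8", "d1"], "q2": ["d5", "d6"], "q3": ["d9"]}
positives = {"q1": {"d1"}, "q2": {"d7"}, "q3": {"d9", "d4"}}
print(candidate_coverage(candidates, positives))
```

Here two of the three queries are covered, so one third of the queries are already lost regardless of reranker quality.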

Task Summary

| Task | Retrieval shape | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|
| NanoBrightAops | math problem to same-skill problem | 111 | 10,000 | 524 | 0.1433 | 0.2623 | 0.2167 | Dense |
| NanoBrightBiology | biology question to cited passage | 103 | 10,000 | 372 | 0.3425 | 0.4945 | 0.4690 | Dense |
| NanoBrightBiologyLong | biology question to full source page | 103 | 498 | 134 | 0.3708 | 0.5779 | 0.4897 | Dense |
| NanoBrightEarthScience | earth science question to cited passage | 116 | 10,000 | 579 | 0.4611 | 0.5406 | 0.5518 | Reranking hybrid |
| NanoBrightEarthScienceLong | earth science question to full source page | 116 | 587 | 186 | 0.3526 | 0.5786 | 0.4971 | Dense |
| NanoBrightEconomics | economics question to cited passage | 103 | 10,000 | 800 | 0.3029 | 0.4095 | 0.3875 | Dense |
| NanoBrightEconomicsLong | economics question to full source page | 103 | 515 | 109 | 0.2658 | 0.4266 | 0.3764 | Dense |
| NanoBrightLeetcode | programming problem to algorithmic neighbor | 142 | 10,000 | 262 | 0.2655 | 0.3024 | 0.3048 | Reranking hybrid |
| NanoBrightPony | Pony task to support passage | 112 | 6,183 | 2,219 | 0.0496 | 0.0219 | 0.0780 | Reranking hybrid |
| NanoBrightPonyLong | Pony task to long reference page | 112 | 577 | 769 | 0.2244 | 0.0767 | 0.2871 | Reranking hybrid |
| NanoBrightPsychology | psychology question to cited passage | 101 | 10,000 | 692 | 0.2474 | 0.4591 | 0.4124 | Dense |
| NanoBrightPsychologyLong | psychology question to full source page | 101 | 509 | 116 | 0.3010 | 0.5069 | 0.4149 | Dense |
| NanoBrightRobotics | robotics question to cited passage | 101 | 10,000 | 518 | 0.2607 | 0.2589 | 0.2976 | Reranking hybrid |
| NanoBrightRoboticsLong | robotics question to full source page | 101 | 505 | 106 | 0.2490 | 0.2851 | 0.2866 | Reranking hybrid |
| NanoBrightStackoverflow | developer question to support passage | 117 | 10,000 | 478 | 0.3685 | 0.4033 | 0.4686 | Reranking hybrid |
| NanoBrightStackoverflowLong | developer question to full source page | 117 | 1,846 | 129 | 0.4440 | 0.3894 | 0.4744 | Reranking hybrid |
| NanoBrightSustainableLiving | sustainability question to cited passage | 108 | 10,000 | 575 | 0.4189 | 0.5338 | 0.5198 | Dense |
| NanoBrightSustainableLivingLong | sustainability question to full page | 108 | 551 | 129 | 0.3277 | 0.5501 | 0.4436 | Dense |
| NanoBrightTheoremQAQuestions | STEM question to solved theorem-use question | 194 | 10,000 | 439 | 0.1646 | 0.2798 | 0.2316 | Dense |
| NanoBrightTheoremQATheorems | STEM question to theorem statement | 76 | 10,000 | 151 | 0.0198 | 0.1653 | 0.0895 | Dense |
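
For readers unfamiliar with the metric reported in the task summary, nDCG@10 for a single query can be sketched as follows. This assumes binary relevance and the standard log2 rank discount; the source does not state which evaluator produced the reported scores, so treat the function as an illustration of the metric, not the benchmark's scoring code. Document ids are hypothetical.

```python
import math


def ndcg_at_k(ranked_doc_ids, positive_ids, k=10):
    """nDCG@k with binary gains and a log2 rank discount."""
    positives = set(positive_ids)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ranking places every positive (up to k of them) at the top.
    ideal_hits = min(len(positives), k)
    ideal_dcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical query with two positives, found at ranks 1 and 3.
print(round(ndcg_at_k(["d1", "d5", "d2", "d9"], {"d1", "d2"}), 4))  # 0.9197
```

A per-task score would then be the mean of this value over the task's queries.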

Interpretation Notes for Model Researchers

NanoBRIGHT is best read as a first-stage reasoning retrieval probe. High scores suggest the model can connect queries to useful evidence, not just similar text. The long variants should be interpreted separately from passage variants: document pools are smaller, but documents contain much more irrelevant context.

Profile differences are especially important. BM25-led behavior points to exact technical anchors; no task here has BM25 as its best profile, but BM25 outscores dense on Pony, PonyLong, Robotics, and StackoverflowLong. Dense-led behavior points to semantic problem-structure matching. Hybrid-led behavior points to candidate complementarity, often in code and long-document tasks where exact identifiers and broader semantic cues both matter.
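Profile differences can be read directly off the reported scores. The sketch below copies the nDCG@10 values for four tasks from this page and derives, for each, the leading profile, the dense minus BM25 gap, and the hybrid gain over the better single retriever. The helper name `describe` and the dictionary layout are illustrative assumptions.

```python
# nDCG@10 per profile for four tasks, copied from the task summary.
scores = {
    "NanoBrightPony": {"bm25": 0.0496, "dense": 0.0219, "reranking_hybrid": 0.0780},
    "NanoBrightRobotics": {"bm25": 0.2607, "dense": 0.2589, "reranking_hybrid": 0.2976},
    "NanoBrightStackoverflowLong": {"bm25": 0.4440, "dense": 0.3894, "reranking_hybrid": 0.4744},
    "NanoBrightTheoremQATheorems": {"bm25": 0.0198, "dense": 0.1653, "reranking_hybrid": 0.0895},
}


def describe(task_scores):
    """Return (best profile, dense minus BM25, hybrid gain over best single retriever)."""
    best = max(task_scores, key=task_scores.get)
    dense_minus_bm25 = round(task_scores["dense"] - task_scores["bm25"], 4)
    best_single = max(task_scores["bm25"], task_scores["dense"])
    hybrid_gain = round(task_scores["reranking_hybrid"] - best_single, 4)
    return best, dense_minus_bm25, hybrid_gain


for task, task_scores in scores.items():
    print(task, describe(task_scores))
```

A negative dense minus BM25 gap with a positive hybrid gain (Pony, Robotics, StackoverflowLong) is the complementarity pattern; a large positive gap with a negative hybrid gain (TheoremQATheorems) is a dense-led task.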

Training and Leakage Notes

Useful training data includes theorem-labeled solved problems, contest math problem families, algorithm-problem similarity data, programming documentation retrieval, question-to-cited-source pairs, scientific and technical QA with references, and long-document evidence retrieval. Hard negatives should share the same domain vocabulary while requiring a different theorem, algorithm, mechanism, API behavior, or cited source.
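One simple way to obtain hard negatives of this kind is to take the highest-ranked first-stage results that are not labeled positive. The sketch below is a minimal illustration under that assumption; the function name and the theorem ids are hypothetical, and the training data is assumed to be separate from NanoBRIGHT itself.

```python
def mine_hard_negatives(ranked_doc_ids, positive_ids, num_negatives=4):
    """Keep the highest-ranked retrieved documents that are not labeled positive."""
    positives = set(positive_ids)
    negatives = [doc_id for doc_id in ranked_doc_ids if doc_id not in positives]
    return negatives[:num_negatives]


# Hypothetical first-stage ranking for a theorem-use training query.
ranking = ["thm_pythagoras", "thm_cosine_law", "thm_sine_law", "thm_heron", "thm_fermat"]
print(mine_hard_negatives(ranking, {"thm_cosine_law"}, num_negatives=3))
# ['thm_pythagoras', 'thm_sine_law', 'thm_heron']
```

In multi-positive data, a top-ranked document that is not labeled may still be relevant, so mined negatives of this kind usually deserve a filtering or review step.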

Exclude NanoBRIGHT evaluation queries, positives, qrels, and source pages. Long-document variants are especially leakage-sensitive because one full page can contain passages related to many questions.
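A basic exclusion filter can be sketched by fingerprinting normalized text. The function names and example strings below are hypothetical, and the check catches only exact matches after lowercasing and whitespace normalization; for long-document variants a passage-level or near-duplicate check would likely be needed in addition.

```python
import hashlib


def fingerprint(text):
    """Hash of lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_leaked_pairs(training_pairs, evaluation_texts):
    """Remove (query, document) pairs whose query or document matches evaluation text."""
    blocked = {fingerprint(text) for text in evaluation_texts}
    return [
        (query, document)
        for query, document in training_pairs
        if fingerprint(query) not in blocked and fingerprint(document) not in blocked
    ]


# Hypothetical data: the second pair reuses an evaluation query with different spacing.
evaluation_texts = ["Why do leaves change color in autumn?"]
training_pairs = [
    ("How do plants absorb water?", "Roots take up water through osmosis."),
    ("why do leaves  change color in autumn?", "Chlorophyll breaks down in autumn."),
]
print(drop_leaked_pairs(training_pairs, evaluation_texts))
```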

Source Reference Table

| Source | Year | Type | URL |
|---|---|---|---|
| BRIGHT: A Realistic and Challenging Benchmark for Reasoning-Intensive Retrieval | 2024 | paper | https://arxiv.org/abs/2407.12883 |

Metadata Summary

| Field | Value |
|---|---|
| Task pages | 20 |
| Queries | 2,245 |
| Split-local documents | 121,771 |
| Positive qrels | 9,287 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 4.14 |
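
The group totals can be cross-checked against the per-task rows with a few lines of arithmetic. The counts below are copied from the per-task tables on this page; the variable names are illustrative.

```python
# (task, queries, docs, positives) copied from the per-task rows of this page.
rows = [
    ("NanoBrightAops", 111, 10000, 524),
    ("NanoBrightBiology", 103, 10000, 372),
    ("NanoBrightBiologyLong", 103, 498, 134),
    ("NanoBrightEarthScience", 116, 10000, 579),
    ("NanoBrightEarthScienceLong", 116, 587, 186),
    ("NanoBrightEconomics", 103, 10000, 800),
    ("NanoBrightEconomicsLong", 103, 515, 109),
    ("NanoBrightLeetcode", 142, 10000, 262),
    ("NanoBrightPony", 112, 6183, 2219),
    ("NanoBrightPonyLong", 112, 577, 769),
    ("NanoBrightPsychology", 101, 10000, 692),
    ("NanoBrightPsychologyLong", 101, 509, 116),
    ("NanoBrightRobotics", 101, 10000, 518),
    ("NanoBrightRoboticsLong", 101, 505, 106),
    ("NanoBrightStackoverflow", 117, 10000, 478),
    ("NanoBrightStackoverflowLong", 117, 1846, 129),
    ("NanoBrightSustainableLiving", 108, 10000, 575),
    ("NanoBrightSustainableLivingLong", 108, 551, 129),
    ("NanoBrightTheoremQAQuestions", 194, 10000, 439),
    ("NanoBrightTheoremQATheorems", 76, 10000, 151),
]

total_queries = sum(queries for _, queries, _, _ in rows)
total_docs = sum(docs for _, _, docs, _ in rows)
total_positives = sum(positives for _, _, _, positives in rows)
print(len(rows), total_queries, total_docs, total_positives,
      round(total_positives / total_queries, 2))
# 20 2245 121771 9287 4.14
```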

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| NanoBrightAops | NanoBRIGHT | en | natural_language | 111 | 10,000 | 524 | 0.1433 | 0.2623 | 0.2167 | Dense |
| NanoBrightBiology | NanoBRIGHT | en | natural_language | 103 | 10,000 | 372 | 0.3425 | 0.4945 | 0.4690 | Dense |
| NanoBrightBiologyLong | NanoBRIGHT | en | natural_language | 103 | 498 | 134 | 0.3708 | 0.5779 | 0.4897 | Dense |
| NanoBrightEarthScience | NanoBRIGHT | en | natural_language | 116 | 10,000 | 579 | 0.4611 | 0.5406 | 0.5518 | Reranking hybrid |
| NanoBrightEarthScienceLong | NanoBRIGHT | en | natural_language | 116 | 587 | 186 | 0.3526 | 0.5786 | 0.4971 | Dense |
| NanoBrightEconomics | NanoBRIGHT | en | natural_language | 103 | 10,000 | 800 | 0.3029 | 0.4095 | 0.3875 | Dense |
| NanoBrightEconomicsLong | NanoBRIGHT | en | natural_language | 103 | 515 | 109 | 0.2658 | 0.4266 | 0.3764 | Dense |
| NanoBrightLeetcode | NanoBRIGHT | en | natural_language | 142 | 10,000 | 262 | 0.2655 | 0.3024 | 0.3048 | Reranking hybrid |
| NanoBrightPony | NanoBRIGHT | en | natural_language | 112 | 6,183 | 2,219 | 0.0496 | 0.0219 | 0.0780 | Reranking hybrid |
| NanoBrightPonyLong | NanoBRIGHT | en | natural_language | 112 | 577 | 769 | 0.2244 | 0.0767 | 0.2871 | Reranking hybrid |
| NanoBrightPsychology | NanoBRIGHT | en | natural_language | 101 | 10,000 | 692 | 0.2474 | 0.4591 | 0.4124 | Dense |
| NanoBrightPsychologyLong | NanoBRIGHT | en | natural_language | 101 | 509 | 116 | 0.3010 | 0.5069 | 0.4149 | Dense |
| NanoBrightRobotics | NanoBRIGHT | en | natural_language | 101 | 10,000 | 518 | 0.2607 | 0.2589 | 0.2976 | Reranking hybrid |
| NanoBrightRoboticsLong | NanoBRIGHT | en | natural_language | 101 | 505 | 106 | 0.2490 | 0.2851 | 0.2866 | Reranking hybrid |
| NanoBrightStackoverflow | NanoBRIGHT | en | natural_language | 117 | 10,000 | 478 | 0.3685 | 0.4033 | 0.4686 | Reranking hybrid |
| NanoBrightStackoverflowLong | NanoBRIGHT | en | natural_language | 117 | 1,846 | 129 | 0.4440 | 0.3894 | 0.4744 | Reranking hybrid |
| NanoBrightSustainableLiving | NanoBRIGHT | en | natural_language | 108 | 10,000 | 575 | 0.4189 | 0.5338 | 0.5198 | Dense |
| NanoBrightSustainableLivingLong | NanoBRIGHT | en | natural_language | 108 | 551 | 129 | 0.3277 | 0.5501 | 0.4436 | Dense |
| NanoBrightTheoremQAQuestions | NanoBRIGHT | en | natural_language | 194 | 10,000 | 439 | 0.1646 | 0.2798 | 0.2316 | Dense |
| NanoBrightTheoremQATheorems | NanoBRIGHT | en | natural_language | 76 | 10,000 | 151 | 0.0198 | 0.1653 | 0.0895 | Dense |