HAKARI-Bench

NanoLongEmbed

Overview

NanoLongEmbed is the compact long-context retrieval group derived from LongEmbed. It tests whether a retriever can select the correct long document when evidence may be buried in books, scripts, meeting transcripts, Wikipedia bundles, or synthetic long contexts. The group includes both real long-document retrieval tasks and synthetic passkey or needle settings.

This group differs from ordinary passage retrieval. Documents are often tens of thousands of characters long, and NanoNarrativeQA contains whole narratives averaging more than 300,000 characters. A successful model must retain enough signal from a long source to identify the correct document. Each profile probes a different aspect of that problem: BM25 is unusually strong because long documents contain many distinctive names and events, dense retrieval tests long-context compression, and reranking_hybrid shows whether semantic candidates can compensate for lexical dilution.

What This Group Measures

LongEmbed: Extending Embedding Models for Long Context Retrieval introduces retrieval tasks designed for embedding long documents and long contexts. NanoLongEmbed keeps six compact splits from that benchmark: NarrativeQA, SummScreen, QMSum, 2WikiMultiHopQA, Passkey, and Needle.

The group measures document-level long-context retrieval. NarrativeQA, SummScreen, QMSum, and 2WikiMultiHopQA retrieve real or dataset-derived long sources. Passkey and Needle are synthetic probes where a small answer-bearing statement is inserted into a long document. In all cases, the target is the whole long document, not just a short passage.

Task Families

Dataset Shape

NanoLongEmbed contains 6 task pages, 998 queries, 2,788 split-local documents, and 998 positive qrel rows. Every task is single-positive in the current metadata. Candidate pools are small compared with web retrieval, but documents are extremely long.
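Because every query has exactly one positive, nDCG@10 depends only on the rank of that positive. The sketch below assumes the standard binary-gain, log2-discount definition of nDCG; the source does not state which variant is used, and the function name is hypothetical.

```python
import math

def single_positive_ndcg_at_k(rank, k=10):
    """nDCG@k when exactly one relevant document exists.

    rank is the 1-based position of the positive, or None if it was not retrieved.
    The ideal DCG is 1 (positive at rank 1), so nDCG reduces to 1 / log2(1 + rank).
    """
    if rank is None or rank > k:
        return 0.0
    return 1.0 / math.log2(1 + rank)

# Positive at rank 1, 2, 3, outside the top 10, and not retrieved at all.
for r in (1, 2, 3, 11, None):
    print(r, round(single_positive_ndcg_at_k(r), 4))
```

Under this assumption, moving the positive from rank 1 to rank 2 already costs about 0.37 nDCG, so the reported scores are sensitive to small ranking errors near the top.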

Document length is the defining feature. NanoNarrativeQA averages more than 326,000 characters per document. QMSum, 2WikiMultiHopQA, Needle, Passkey, and SummScreen also average tens of thousands of characters. Queries range from short direct questions to long recaps and meeting requests. The challenge is not corpus scale; it is representing enough of each long document to rank the right source.

Retrieval Behavior

BM25 Profile

BM25 is the best profile for every NanoLongEmbed task in the current metadata. This is not surprising: long documents contain many rare names, events, phrases, and transcript terms that overlap with queries. NanoSummScreenFD (0.9813 nDCG@10) and Nano2WikiMultihopQA (0.9503) are especially strong because recaps, evidence bundles, and source documents often share distinctive terms.

BM25 is still imperfect. NarrativeQA, Needle, Passkey, and QMSum can contain many distractor terms, and the key evidence may be a small part of a very long source. Sparse retrieval benefits from more words, but it can also over-rank documents that share surface terms without containing the right fact.
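To make the role of rare terms concrete, here is a minimal Okapi BM25 sketch in plain Python. It is not the benchmark's implementation: the tokenization (whitespace split), the parameters k1=1.2 and b=0.75, and the toy documents are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        # Longer-than-average documents get a larger normalizer.
        length_norm = k1 * (1 - b + b * len(d) / avg_len)
        score = 0.0
        for term in query_tokens:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical toy corpus; real NanoLongEmbed documents are far longer.
docs = [
    "the meeting discussed the budget for the remote control prototype".split(),
    "the meeting discussed the schedule".split(),
    "a story about a dragon".split(),
]
scores = bm25_scores("budget prototype".split(), docs)
print([round(s, 3) for s in scores])
```

Only the first toy document contains the rare query terms, so it is the only one with a non-zero score. The same mechanism also explains the failure mode described above: any long document that happens to repeat the query's surface terms accumulates score, whether or not it contains the right fact.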

Dense Profile

Dense retrieval is weaker than BM25 on every task in this group, with the largest gaps on NanoNarrativeQA (0.3315 vs. 0.7619 nDCG@10) and NanoQMSum (0.3660 vs. 0.7440). The likely issue is long-context compression: a single embedding must represent documents that are much longer than ordinary passage inputs. Important evidence can be diluted by story, meeting, or transcript context.
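The dilution effect can be illustrated with a deliberately simplified toy model. It assumes a document embedding is the mean of its chunk vectors and that chunks are mutually orthogonal one-hot vectors; real embedding models do neither, so this only sketches the intuition and does not describe any evaluated model.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pooled_document(num_chunks):
    """Mean of num_chunks one-hot chunk vectors; chunk 0 is the evidence chunk."""
    return [1.0 / num_chunks] * num_chunks

def evidence_similarity(num_chunks):
    """Cosine between a query that matches only chunk 0 and the pooled document."""
    query = [1.0] + [0.0] * (num_chunks - 1)
    return cosine(query, mean_pooled_document(num_chunks))

for n in (1, 4, 100):
    print(n, round(evidence_similarity(n), 3))
```

In this toy setting the similarity falls as one over the square root of the number of chunks, so a query that matches a single chunk of a 100-chunk document sees only about a tenth of the similarity it would get against that chunk alone.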

Dense scores are still useful diagnostically. If a model improves on NarrativeQA, QMSum, or synthetic Needle/Passkey without losing BM25-like exact anchors, it likely has better long-document representation rather than only better passage semantics.

Reranking Hybrid Profile

reranking_hybrid sits between BM25 and dense on all six tasks in the current metadata. It benefits from sparse anchors while adding some semantic robustness, but it does not beat BM25 on any task. The hybrid profile is still relevant for reranking because it has strong candidate coverage on single-positive long-document tasks.

For reranker experiments, the critical question is whether the correct long document is present in the candidate pool. Once it is present, downstream reranking or reading can focus on locating the evidence inside the document.
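Candidate coverage can be checked directly before running a reranker. The sketch below assumes rankings are stored as a mapping from query ID to an ordered list of document IDs and positives as a mapping from query ID to its single positive; both layouts and all IDs are hypothetical.

```python
def candidate_recall_at_k(rankings, positives, k):
    """Fraction of queries whose single positive appears in the top-k candidates."""
    hits = sum(1 for qid, ranked in rankings.items() if positives[qid] in ranked[:k])
    return hits / len(rankings)

# Hypothetical first-stage output for two queries.
rankings = {"q1": ["d3", "d1", "d2"], "q2": ["d2", "d3", "d1"]}
positives = {"q1": "d1", "q2": "d1"}

print(candidate_recall_at_k(rankings, positives, k=2))
print(candidate_recall_at_k(rankings, positives, k=3))
```

A reranker can only promote documents it is given, so recall at the candidate depth is an upper bound on the fraction of queries it can answer correctly.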

Task Summary

| Task | Retrieval focus | Queries | Docs | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|
| Nano2WikiMultihopQA | multi-hop question to Wikipedia evidence bundle | 200 | 300 | 0.9503 | 0.8400 | 0.9111 | BM25 |
| NanoNarrativeQA | story question to whole narrative document | 200 | 355 | 0.7619 | 0.3315 | 0.5120 | BM25 |
| NanoNeedle | fact query to long document with inserted needle | 98 | 800 | 0.7207 | 0.6099 | 0.6823 | BM25 |
| NanoPasskey | passkey query to long context | 100 | 800 | 0.7717 | 0.6473 | 0.7294 | BM25 |
| NanoQMSum | meeting request to transcript | 200 | 197 | 0.7440 | 0.3660 | 0.6097 | BM25 |
| NanoSummScreenFD | episode recap to transcript | 200 | 336 | 0.9813 | 0.9198 | 0.9443 | BM25 |
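The per-task scores in the table above can be summarized with a few lines of arithmetic. The sketch below computes an unweighted macro average per profile and the BM25-to-dense gap per task; the choice of an unweighted mean is an assumption, since the source does not define a group-level aggregate.

```python
# nDCG@10 per task as (BM25, dense, reranking_hybrid), copied from the table above.
NDCG_AT_10 = {
    "Nano2WikiMultihopQA": (0.9503, 0.8400, 0.9111),
    "NanoNarrativeQA": (0.7619, 0.3315, 0.5120),
    "NanoNeedle": (0.7207, 0.6099, 0.6823),
    "NanoPasskey": (0.7717, 0.6473, 0.7294),
    "NanoQMSum": (0.7440, 0.3660, 0.6097),
    "NanoSummScreenFD": (0.9813, 0.9198, 0.9443),
}
PROFILES = ("bm25", "dense", "reranking_hybrid")

def macro_average(scores):
    """Unweighted mean over tasks for each profile."""
    n_tasks = len(scores)
    return {
        name: sum(row[i] for row in scores.values()) / n_tasks
        for i, name in enumerate(PROFILES)
    }

def bm25_dense_gaps(scores):
    """BM25 minus dense nDCG@10 per task, largest gap first."""
    gaps = {task: row[0] - row[1] for task, row in scores.items()}
    return sorted(gaps.items(), key=lambda item: item[1], reverse=True)

averages = macro_average(NDCG_AT_10)
gaps = bm25_dense_gaps(NDCG_AT_10)
print({name: round(value, 4) for name, value in averages.items()})
print([(task, round(gap, 4)) for task, gap in gaps[:2]])
```

Under this unweighted average the three profiles come out at roughly 0.82, 0.62, and 0.73, and the two largest BM25-to-dense gaps are on NanoNarrativeQA and NanoQMSum.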

Interpretation Notes for Model Researchers

NanoLongEmbed should be read as a long-document representation benchmark, not as a web-scale search benchmark. Candidate pools are small, but documents are large. A model can fail because it cannot encode the whole source, because it loses a small inserted fact, or because the query only references one part of a long narrative.

The BM25 dominance is meaningful. It shows that exact rare terms remain very powerful when documents are long. Dense models should be judged by whether they can close the gap without losing those anchors. Improvements on NarrativeQA and QMSum are especially informative because those tasks require selecting a whole source from rich context.

Training and Leakage Notes

Useful training data includes long-document QA, story and screenplay question answering, meeting transcript retrieval, multi-hop Wikipedia retrieval, and synthetic passkey or needle tasks with varied evidence positions. Training should preserve full-document length or realistic truncation behavior.
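A synthetic passkey example with a controllable evidence position might be generated as sketched below. The filler sentences, the passkey phrasing, and the function name are hypothetical and are not taken from the LongEmbed construction; the point is only that the insertion position is a parameter that can be varied across training examples.

```python
import random

def build_passkey_document(passkey, num_filler_sentences, position, seed=0):
    """Insert a passkey statement at a relative position (0.0 = start, 1.0 = end)."""
    rng = random.Random(seed)
    # Hypothetical filler text; any distractor sentences would do.
    filler_pool = [
        "The grass is green.",
        "The sky is blue.",
        "The sun is yellow.",
    ]
    sentences = [rng.choice(filler_pool) for _ in range(num_filler_sentences)]
    insert_at = round(position * num_filler_sentences)
    needle = f"The pass key is {passkey}. Remember it."
    sentences.insert(insert_at, needle)
    return " ".join(sentences), insert_at

# One document per evidence position: start, middle, end.
for position in (0.0, 0.5, 1.0):
    document, index = build_passkey_document("48213", 100, position)
    print(position, index, len(document))
```

Sampling the position uniformly, rather than always inserting near the start, avoids teaching a model that evidence lives in the first part of a document, which is also the part that survives truncation.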

Exclude NanoLongEmbed evaluation queries, positives, qrels, books, scripts, transcripts, synthetic contexts, and direct variants. Source datasets such as NarrativeQA, SummScreen, QMSum, and 2WikiMultiHopQA should be split-audited before training.
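A first-pass exclusion filter can be sketched as exact matching on normalized text. The normalization below (lowercasing and whitespace collapsing) and the function names are assumptions; this catches only trivial variants, and paraphrased or truncated copies would need fuzzy or n-gram overlap checks on top.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of lowercased text with whitespace collapsed, so trivial variants collide."""
    normalized = re.sub(r"\s+", " ", text.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def drop_leaked(training_texts, evaluation_texts):
    """Return training texts whose fingerprint does not match any evaluation text."""
    blocked = {fingerprint(t) for t in evaluation_texts}
    return [t for t in training_texts if fingerprint(t) not in blocked]

# Hypothetical queries; the same filter applies to documents and transcripts.
evaluation = ["who killed the king?"]
training = ["Who killed  the king?", "A new question"]
print(drop_leaked(training, evaluation))
```

The same filter would be run separately over queries and over documents, since a leaked book or transcript is as harmful as a leaked query.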

Source Reference Table

| Source | Year | Type | URL |
|---|---|---|---|
| LongEmbed: Extending Embedding Models for Long Context Retrieval | 2024 | paper | https://aclanthology.org/2024.emnlp-main.47/ |
| The NarrativeQA Reading Comprehension Challenge | 2018 | paper | https://aclanthology.org/Q18-1023 |
| SummScreen: A Dataset for Abstractive Screenplay Summarization | 2022 | paper | https://aclanthology.org/2022.acl-long.589 |
| QMSum: A New Benchmark for Query-based Multi-domain Meeting Summarization | 2021 | paper | https://aclanthology.org/2021.naacl-main.472 |
| Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps | 2020 | paper | https://aclanthology.org/2020.coling-main.580 |

Metadata Summary

| Field | Value |
|---|---|
| Task pages | 6 |
| Queries | 998 |
| Split-local documents | 2,788 |
| Positive qrels | 998 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 1.00 |

Task Metadata Summary

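| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| Nano2WikiMultihopQA | NanoLongEmbed | en | natural_language | 200 | 300 | 200 | 0.9503 | 0.8400 | 0.9111 | BM25 |
| NanoNarrativeQA | NanoLongEmbed | en | natural_language | 200 | 355 | 200 | 0.7619 | 0.3315 | 0.5120 | BM25 |
| NanoNeedle | NanoLongEmbed | en | natural_language | 98 | 800 | 98 | 0.7207 | 0.6099 | 0.6823 | BM25 |
| NanoPasskey | NanoLongEmbed | en | natural_language | 100 | 800 | 100 | 0.7717 | 0.6473 | 0.7294 | BM25 |
| NanoQMSum | NanoLongEmbed | en | natural_language | 200 | 197 | 200 | 0.7440 | 0.3660 | 0.6097 | BM25 |
| NanoSummScreenFD | NanoLongEmbed | en | natural_language | 200 | 336 | 200 | 0.9813 | 0.9198 | 0.9443 | BM25 |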