HAKARI-Bench

NanoDAPFAM

Overview

NanoDAPFAM is the compact Nano set for DAPFAM, a domain-aware patent-family retrieval benchmark. It evaluates citation-linked prior-art retrieval at the patent-family level. The group contains eighteen variants formed by three domain conditions (All, In, and Out), two query representations (title-abstract or title-abstract-claims), and three target representations (title-abstract, title-abstract-claims, or full text).
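The three-by-two-by-three design can be enumerated directly. The sketch below assumes the naming pattern visible in the task names on this page (`NanoDAPFAM` + domain + query fields + `To` + target fields); the helper name `variant_names` is illustrative and not part of any published tooling.

```python
from itertools import product

# Name fragments inferred from the task names listed on this page (an assumption).
DOMAINS = ["All", "In", "Out"]
QUERY_FIELDS = {"TitlAbs": "title+abstract", "TitlAbsClm": "title+abstract+claims"}
TARGET_FIELDS = {
    "TitlAbs": "title+abstract",
    "TitlAbsClm": "title+abstract+claims",
    "FullText": "full text",
}


def variant_names():
    """Return the task names implied by the 3 x 2 x 3 design."""
    return [
        f"NanoDAPFAM{domain}{query}To{target}"
        for domain, query, target in product(DOMAINS, QUERY_FIELDS, TARGET_FIELDS)
    ]


names = variant_names()
print(len(names))  # 18
print(names[0])    # NanoDAPFAMAllTitlAbsToTitlAbs
```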

The group is useful because it separates ordinary lexical patent similarity from cross-domain prior-art retrieval. Same-domain and all-domain variants give retrievers many technical anchors: components, materials, methods, and claim phrases. Out-domain variants remove shared IPC3 technical classes, so the model must retrieve cited families related by transferable mechanisms or problem-solution patterns rather than by the same surface vocabulary. BM25, dense retrieval, and reranking_hybrid each reveal a different part of that domain gap.

What This Group Measures

The source paper, "DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval", introduces a patent-family retrieval benchmark with citation-based relevance judgments and explicit domain partitions. The source benchmark aggregates patents at the family level to reduce international duplicate publications and uses IPC3 overlap to distinguish same-domain from cross-domain retrieval.
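A minimal sketch of an IPC3 overlap check follows. It assumes that "IPC3" means the first three characters of an IPC symbol (section letter plus two-digit class) and that a pair counts as same-domain when at least one such class is shared. Both are assumptions made for illustration; the exact rule should be taken from the DAPFAM paper. The function names are hypothetical.

```python
def ipc3(code):
    """Truncate an IPC symbol such as 'H04L 9/32' to its 3-character class (assumed meaning of IPC3)."""
    return code.strip().upper()[:3]


def domain_relation(query_ipc_codes, target_ipc_codes):
    """Label a query/target pair 'In' if any IPC3 class is shared, else 'Out'."""
    query_classes = {ipc3(c) for c in query_ipc_codes}
    target_classes = {ipc3(c) for c in target_ipc_codes}
    return "In" if query_classes & target_classes else "Out"


print(domain_relation(["H04L 9/32"], ["H04W 12/06", "G06F 21/00"]))  # In
print(domain_relation(["A61K 31/00"], ["H04L 9/32"]))                # Out
```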

NanoDAPFAM preserves the DAPFAM design in compact 200-query splits. Each split uses the same high-level prior-art retrieval task but changes the domain condition and patent text fields. This makes the group a controlled probe of three factors: whether positives are same-domain or cross-domain, whether the query includes claims, and whether the target is a compact summary or a very long patent text.

Task Families

Dataset Shape

NanoDAPFAM contains 18 task pages, 3,600 queries, 180,000 split-local documents, and 49,879 positive qrel rows. Each split has 200 queries and 10,000 candidate documents. All variants are multi-positive, but density changes by domain: All-domain variants average about 20 positives per query, In-domain variants about 15, and Out-domain variants about 6.
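The totals above follow from the per-split figures. The short check below uses the per-task positive counts listed in the Task Metadata Summary on this page; the variable names are illustrative.

```python
# Positive qrel rows per split, grouped by domain (from the Task Metadata Summary).
POSITIVES = {
    "All": [3989, 3981, 3989, 3989, 3982, 3989],
    "In": [3069, 3062, 3069, 3072, 3066, 3072],
    "Out": [1259, 1257, 1259, 1259, 1257, 1259],
}
QUERIES_PER_SPLIT = 200
DOCS_PER_SPLIT = 10_000

n_splits = sum(len(counts) for counts in POSITIVES.values())
total_queries = n_splits * QUERIES_PER_SPLIT
total_docs = n_splits * DOCS_PER_SPLIT
total_positives = sum(sum(counts) for counts in POSITIVES.values())

print(n_splits, total_queries, total_docs, total_positives)
for domain, counts in POSITIVES.items():
    per_query = sum(counts) / (len(counts) * QUERIES_PER_SPLIT)
    print(f"{domain}: {per_query:.2f} positives per query")
print(f"overall: {total_positives / total_queries:.2f} positives per query")
```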

Text length is a core variable. Title-abstract queries average under 800 characters, while title-abstract-claims queries range from about 8,300 to 9,300 characters. Target documents range from short title-abstract records around 778 characters to full-text patent-family documents around 69,000 to 72,000 characters. The group should therefore be read as a matrix of domain difficulty and representation length, not as eighteen independent tasks.

Retrieval Behavior

BM25 Profile

BM25 is much stronger on All and In variants (nDCG@10 of roughly 0.29 to 0.36) than on Out variants (roughly 0.04 to 0.07). In the same-domain setting, shared IPC3 areas provide repeated technical terms, components, material names, and claim language. In Out-domain retrieval, those anchors are weaker or absent, so BM25 must rely on partial mechanism overlap.

Representation also changes sparse behavior. Claim-bearing targets often expose more components and operations than title-abstract targets. Full text gives even more vocabulary but adds large amounts of boilerplate and unrelated legal or descriptive context. A higher BM25 score on long targets does not always mean better semantic retrieval; it may mean more lexical chances.
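To make the length effect concrete, here is a minimal sketch of one common Okapi BM25 variant. It is not the benchmark's BM25 implementation: whitespace tokenization, the `k1` and `b` defaults, and the smoothed IDF are assumptions chosen for illustration. The `b` parameter controls how strongly long documents are penalized, while every additional matching query term still adds to the score.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score whitespace-tokenized docs against a query with a basic Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of docs that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    query_terms = set(query.lower().split())
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Length normalization: above 1 for longer-than-average docs when b > 0.
        length_norm = 1 - b + b * len(tokens) / avg_len
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores


docs = ["a valve with a rubber seal", "an antenna array for satellites"]
print(bm25_scores("valve seal", docs))
```

Scoring the same query against title-abstract and full-text versions of a target is one way to see how much of a BM25 gain comes from extra vocabulary rather than a closer match.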

Dense Profile

Dense retrieval is the best profile for 13 of the 18 NanoDAPFAM variants in the current metadata, including all six Out variants. Its relative gain over BM25 is generally largest in Out-domain tasks (for example, 0.0461 to 0.1010 on the title-abstract-claims to full-text variant), where the positive may share an abstract mechanism with the query even when patent-field vocabulary differs. Dense retrieval also benefits title-abstract targets, where there is less text for exact matching.

Dense performance should still be interpreted carefully. Patent retrieval depends on exact technical distinctions, claim scope, and family-level invention identity. A semantically related patent can be wrong if it solves a different problem or lacks the cited technical relation.

Reranking Hybrid Profile

reranking_hybrid falls between BM25 and dense in nDCG@10 on 13 of the 18 variants and edges past dense by less than 0.01 on the other five, all of which use title-abstract queries in the All or In condition. Its main value is in candidate generation: patent retrieval needs both exact technical anchors and broader mechanism matching, and a hybrid pool can preserve candidates found by either signal.
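One simple way to build a pool that keeps candidates from either signal is reciprocal rank fusion (RRF). The sketch below is an assumption for illustration: this page does not state how reranking_hybrid combines its inputs, and the function name, `k=60`, and `top_n` are hypothetical choices.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=100):
    """Fuse ranked lists of doc IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_top = ["a", "b", "c"]
dense_top = ["c", "d", "a"]
print(rrf_fuse([bm25_top, dense_top]))  # ['a', 'c', 'b', 'd']
```

Documents found by only one retriever ("b" and "d" here) stay in the pool, which is the property that matters for a downstream reranker.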

For reranker experiments, Out-domain variants are the most important stress test. If the first-stage candidate pool misses cross-domain positives, reranking cannot recover the analogy or mechanism relation.
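Because a reranker can only reorder what it receives, it helps to measure how many positives the first-stage pool contains before reranking. The helper below is a hypothetical sketch of that check; the pool size `k` is an arbitrary choice.

```python
def pool_recall(candidates, positives, k=100):
    """Fraction of a query's positives that appear in its top-k candidate pool."""
    positives = set(positives)
    if not positives:
        return 0.0
    return len(set(candidates[:k]) & positives) / len(positives)


# Hypothetical example: two of three cited families made it into the pool.
print(pool_recall(["f1", "f9", "f2", "f7"], {"f1", "f2", "f3"}, k=4))
```

The value returned here is an upper bound on the recall any reranker can reach on that query.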

Task Summary

| Task | Domain | Query fields | Target fields | Positives/query | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoDAPFAMAllTitlAbsClmToFullText | All | title+abstract+claims | full text | 19.95 | 0.3365 | 0.4352 | 0.4215 | Dense |
| NanoDAPFAMAllTitlAbsClmToTitlAbs | All | title+abstract+claims | title+abstract | 19.91 | 0.2864 | 0.3997 | 0.3767 | Dense |
| NanoDAPFAMAllTitlAbsClmToTitlAbsClm | All | title+abstract+claims | title+abstract+claims | 19.95 | 0.3360 | 0.4156 | 0.3989 | Dense |
| NanoDAPFAMAllTitlAbsToFullText | All | title+abstract | full text | 19.95 | 0.3489 | 0.4149 | 0.4175 | Reranking hybrid |
| NanoDAPFAMAllTitlAbsToTitlAbs | All | title+abstract | title+abstract | 19.91 | 0.3281 | 0.3786 | 0.3790 | Reranking hybrid |
| NanoDAPFAMAllTitlAbsToTitlAbsClm | All | title+abstract | title+abstract+claims | 19.95 | 0.3510 | 0.4056 | 0.4088 | Reranking hybrid |
| NanoDAPFAMInTitlAbsClmToFullText | In | title+abstract+claims | full text | 15.35 | 0.3505 | 0.4484 | 0.4375 | Dense |
| NanoDAPFAMInTitlAbsClmToTitlAbs | In | title+abstract+claims | title+abstract | 15.31 | 0.2970 | 0.4135 | 0.3805 | Dense |
| NanoDAPFAMInTitlAbsClmToTitlAbsClm | In | title+abstract+claims | title+abstract+claims | 15.35 | 0.3473 | 0.4325 | 0.4157 | Dense |
| NanoDAPFAMInTitlAbsToFullText | In | title+abstract | full text | 15.36 | 0.3490 | 0.4255 | 0.4228 | Dense |
| NanoDAPFAMInTitlAbsToTitlAbs | In | title+abstract | title+abstract | 15.33 | 0.3386 | 0.3923 | 0.3942 | Reranking hybrid |
| NanoDAPFAMInTitlAbsToTitlAbsClm | In | title+abstract | title+abstract+claims | 15.36 | 0.3593 | 0.4125 | 0.4220 | Reranking hybrid |
| NanoDAPFAMOutTitlAbsClmToFullText | Out | title+abstract+claims | full text | 6.29 | 0.0461 | 0.1010 | 0.0869 | Dense |
| NanoDAPFAMOutTitlAbsClmToTitlAbs | Out | title+abstract+claims | title+abstract | 6.29 | 0.0439 | 0.0872 | 0.0714 | Dense |
| NanoDAPFAMOutTitlAbsClmToTitlAbsClm | Out | title+abstract+claims | title+abstract+claims | 6.29 | 0.0640 | 0.0952 | 0.0811 | Dense |
| NanoDAPFAMOutTitlAbsToFullText | Out | title+abstract | full text | 6.29 | 0.0638 | 0.0952 | 0.0858 | Dense |
| NanoDAPFAMOutTitlAbsToTitlAbs | Out | title+abstract | title+abstract | 6.29 | 0.0583 | 0.0872 | 0.0762 | Dense |
| NanoDAPFAMOutTitlAbsToTitlAbsClm | Out | title+abstract | title+abstract+claims | 6.29 | 0.0699 | 0.0909 | 0.0901 | Dense |
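
For readers reproducing the nDCG@10 columns, the sketch below shows the metric for a single query. It assumes binary relevance (a document is either a cited positive or not) and unique document IDs in the ranking; the benchmark's own evaluation code may differ in details such as tie handling.

```python
import math


def ndcg_at_k(ranked_ids, positives, k=10):
    """Binary-relevance nDCG@k for one query with possibly many positives."""
    positives = set(positives)
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in positives
    )
    # Ideal ranking places min(number of positives, k) hits at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


print(round(ndcg_at_k(["p1", "x", "p2"], {"p1", "p2"}), 4))  # 0.9197
```

With about 20 positives per query in the All variants and about 6 in the Out variants, the ideal DCG differs by split, so the per-query scores are normalized against different ceilings.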

Interpretation Notes for Model Researchers

NanoDAPFAM is best interpreted by domain condition first. All and In variants measure patent-family retrieval when the target is usually in or near the same technical area. Out variants measure harder cross-domain prior-art retrieval, where models need analogy and mechanism transfer. A strong model should reduce the Out-domain gap without merely memorizing patent families.
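One way to track the Out-domain gap is to average nDCG@10 per domain and profile. The sketch below uses the values from the Task Summary table; the "retained" ratio (Out mean divided by In mean) is an illustrative summary, not a metric defined by the benchmark.

```python
from statistics import mean

# (BM25, Dense, Reranking hybrid) nDCG@10 per variant, copied from the Task Summary table.
NDCG = {
    "All": [(0.3365, 0.4352, 0.4215), (0.2864, 0.3997, 0.3767), (0.3360, 0.4156, 0.3989),
            (0.3489, 0.4149, 0.4175), (0.3281, 0.3786, 0.3790), (0.3510, 0.4056, 0.4088)],
    "In": [(0.3505, 0.4484, 0.4375), (0.2970, 0.4135, 0.3805), (0.3473, 0.4325, 0.4157),
           (0.3490, 0.4255, 0.4228), (0.3386, 0.3923, 0.3942), (0.3593, 0.4125, 0.4220)],
    "Out": [(0.0461, 0.1010, 0.0869), (0.0439, 0.0872, 0.0714), (0.0640, 0.0952, 0.0811),
            (0.0638, 0.0952, 0.0858), (0.0583, 0.0872, 0.0762), (0.0699, 0.0909, 0.0901)],
}
PROFILES = ["BM25", "Dense", "Reranking hybrid"]


def domain_means():
    """Mean nDCG@10 for each domain and profile."""
    return {
        domain: {p: mean(row[i] for row in rows) for i, p in enumerate(PROFILES)}
        for domain, rows in NDCG.items()
    }


means = domain_means()
for p in PROFILES:
    gap = means["In"][p] - means["Out"][p]
    retained = means["Out"][p] / means["In"][p]
    print(f"{p}: In={means['In'][p]:.4f} Out={means['Out'][p]:.4f} "
          f"gap={gap:.4f} retained={retained:.1%}")
```

On these figures dense retrieval keeps a larger share of its In-domain score than BM25 does, even though its absolute In-to-Out gap is not smaller, so it is worth reporting both views.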

Representation effects should be interpreted second. Claims add legal and component detail; full text adds enormous context. Better performance on full text may reflect useful mechanism evidence, but it may also reflect more opportunities for lexical overlap. Comparing title-abstract targets with claim-bearing and full-text targets helps separate concise semantic matching from long-document term coverage.

Training and Leakage Notes

Useful training data includes patent-family citation retrieval, prior-art search pairs, patent semantic similarity, cross-IPC citation prediction, patent analogy retrieval, and field-aware training over titles, abstracts, claims, and full descriptions. Hard negatives should include same-IPC patents that share terminology but are not cited, plus cross-domain patents that share surface terms without the relevant mechanism.
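The same-IPC hard negatives described above could be mined along the following lines. This is a sketch under stated assumptions: the record fields (`family_id`, `ipc3`) and the function name are hypothetical, and a real pipeline would usually rank candidates by lexical similarity before truncating.

```python
def same_ipc_hard_negatives(query, candidates, cited_family_ids, limit=5):
    """Pick uncited candidate families that share an IPC3 class with the query."""
    cited = set(cited_family_ids)
    query_classes = set(query["ipc3"])
    negatives = [
        c["family_id"]
        for c in candidates
        if c["family_id"] != query["family_id"]
        and c["family_id"] not in cited
        and query_classes & set(c["ipc3"])
    ]
    return negatives[:limit]


# Hypothetical records for illustration only.
query = {"family_id": "q1", "ipc3": ["H04"]}
candidates = [
    {"family_id": "f1", "ipc3": ["H04"]},         # cited positive, skipped
    {"family_id": "f2", "ipc3": ["H04", "G06"]},  # same class, not cited
    {"family_id": "f3", "ipc3": ["A61"]},         # different class
]
print(same_ipc_hard_negatives(query, candidates, cited_family_ids=["f1"]))  # ['f2']
```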

Exclude NanoDAPFAM evaluation family IDs, qrels, positive target families, same-family duplicate publications, and near-duplicate patent publications from other jurisdictions. Family-level aggregation is important: using another member of the same patent family can leak the invention.
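A family-level exclusion filter can be sketched as follows. The field names `query_family_id` and `target_family_id` are hypothetical, and the filter only works if every publication has already been mapped to its patent family ID; near-duplicate detection across jurisdictions would need a separate step.

```python
def filter_training_pairs(pairs, eval_family_ids):
    """Drop training pairs whose query or target belongs to an evaluation family."""
    blocked = set(eval_family_ids)
    return [
        pair
        for pair in pairs
        if pair["query_family_id"] not in blocked
        and pair["target_family_id"] not in blocked
    ]


# Hypothetical training pairs; "fam_eval" stands in for a NanoDAPFAM family ID.
pairs = [
    {"query_family_id": "fam_a", "target_family_id": "fam_b"},
    {"query_family_id": "fam_eval", "target_family_id": "fam_c"},
    {"query_family_id": "fam_d", "target_family_id": "fam_eval"},
]
print(filter_training_pairs(pairs, eval_family_ids=["fam_eval"]))
```

The blocked set should cover both query families and positive target families from every split, since the same family can appear on either side of a citation pair.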

Source Reference Table

| Source | Year | Type | URL |
| --- | --- | --- | --- |
| DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval | 2025 | paper | https://arxiv.org/abs/2506.22141 |

Metadata Summary

| Field | Value |
| --- | --- |
| Task pages | 18 |
| Queries | 3,600 |
| Split-local documents | 180,000 |
| Positive qrels | 49,879 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 13.86 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoDAPFAMAllTitlAbsClmToFullText | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,989 | 0.3365 | 0.4352 | 0.4215 | Dense |
| NanoDAPFAMAllTitlAbsClmToTitlAbs | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,981 | 0.2864 | 0.3997 | 0.3767 | Dense |
| NanoDAPFAMAllTitlAbsClmToTitlAbsClm | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,989 | 0.3360 | 0.4156 | 0.3989 | Dense |
| NanoDAPFAMAllTitlAbsToFullText | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,989 | 0.3489 | 0.4149 | 0.4175 | Reranking hybrid |
| NanoDAPFAMAllTitlAbsToTitlAbs | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,982 | 0.3281 | 0.3786 | 0.3790 | Reranking hybrid |
| NanoDAPFAMAllTitlAbsToTitlAbsClm | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,989 | 0.3510 | 0.4056 | 0.4088 | Reranking hybrid |
| NanoDAPFAMInTitlAbsClmToFullText | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,069 | 0.3505 | 0.4484 | 0.4375 | Dense |
| NanoDAPFAMInTitlAbsClmToTitlAbs | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,062 | 0.2970 | 0.4135 | 0.3805 | Dense |
| NanoDAPFAMInTitlAbsClmToTitlAbsClm | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,069 | 0.3473 | 0.4325 | 0.4157 | Dense |
| NanoDAPFAMInTitlAbsToFullText | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,072 | 0.3490 | 0.4255 | 0.4228 | Dense |
| NanoDAPFAMInTitlAbsToTitlAbs | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,066 | 0.3386 | 0.3923 | 0.3942 | Reranking hybrid |
| NanoDAPFAMInTitlAbsToTitlAbsClm | NanoDAPFAM | en | natural_language | 200 | 10,000 | 3,072 | 0.3593 | 0.4125 | 0.4220 | Reranking hybrid |
| NanoDAPFAMOutTitlAbsClmToFullText | NanoDAPFAM | en | natural_language | 200 | 10,000 | 1,259 | 0.0461 | 0.1010 | 0.0869 | Dense |
| NanoDAPFAMOutTitlAbsClmToTitlAbs | NanoDAPFAM | en | natural_language | 200 | 10,000 | 1,257 | 0.0439 | 0.0872 | 0.0714 | Dense |
| NanoDAPFAMOutTitlAbsClmToTitlAbsClm | NanoDAPFAM | en | natural_language | 200 | 10,000 | 1,259 | 0.0640 | 0.0952 | 0.0811 | Dense |
| NanoDAPFAMOutTitlAbsToFullText | NanoDAPFAM | en | natural_language | 200 | 10,000 | 1,259 | 0.0638 | 0.0952 | 0.0858 | Dense |
| NanoDAPFAMOutTitlAbsToTitlAbs | NanoDAPFAM | en | natural_language | 200 | 10,000 | 1,257 | 0.0583 | 0.0872 | 0.0762 | Dense |
| NanoDAPFAMOutTitlAbsToTitlAbsClm | NanoDAPFAM | en | natural_language | 200 | 10,000 | 1,259 | 0.0699 | 0.0909 | 0.0901 | Dense |