NanoDAPFAM / NanoDAPFAMOutTitlAbsToFullText

Overview

NanoDAPFAMOutTitlAbsToFullText is an English patent-family retrieval task. The query contains only the title and abstract of a source patent family, while target documents contain the full patent-family text. Positives are DAPFAM OUT-domain citation relations, meaning that relevant target families do not share an IPC3 domain with the source.

This split tests compact-query cross-domain prior-art retrieval over very long target documents. The model must infer a cross-domain technical relationship from a short patent summary and find it inside full-text records from different fields.

Details

What the Original Data Measures

DAPFAM benchmarks family-level patent retrieval using citation links and IPC3 domain labels. OUT-domain positives are citation-related families outside the source domain. This split uses title-abstract queries and full-text targets.
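
The OUT-domain rule can be illustrated with a minimal sketch. It assumes each family carries a set of IPC3 codes and that a citation pair counts as OUT-domain when the two sets are disjoint. The function names, family IDs, and toy codes below are hypothetical and are not taken from the DAPFAM tooling.

```python
from typing import Dict, List, Set, Tuple


def is_out_domain(source_ipc3: Set[str], target_ipc3: Set[str]) -> bool:
    """Return True when the two families share no IPC3 code (assumed rule)."""
    return source_ipc3.isdisjoint(target_ipc3)


def out_domain_positives(
    citations: List[Tuple[str, str]],
    ipc3_by_family: Dict[str, Set[str]],
) -> List[Tuple[str, str]]:
    """Keep only (source, target) citation pairs whose IPC3 sets are disjoint."""
    return [
        (src, tgt)
        for src, tgt in citations
        if is_out_domain(ipc3_by_family[src], ipc3_by_family[tgt])
    ]


# Toy data: hypothetical family IDs and IPC3 codes.
ipc3_by_family = {
    "F1": {"B62"},
    "F2": {"D06", "B32"},
    "F3": {"B62", "A63"},
}
citations = [("F1", "F2"), ("F1", "F3")]
out_pairs = out_domain_positives(citations, ipc3_by_family)
print(out_pairs)
```

In the toy data, F1 and F3 share B62, so only the F1 to F2 pair survives the filter.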

It measures cross-domain retrieval with minimal query context (title and abstract only) and maximal target length (full text). The setting approximates prior-art discovery that starts from a compact patent summary.

Observed Data Profile

This Nano split contains 200 queries, 10,000 documents, and 1,259 positive qrels. Of the queries, 159 (79.50%) have more than one positive. Positives per query average 6.295 (1,259 / 200), with a minimum of 1, a median of 4.0, and a maximum of 20. Queries average 786.61 characters, while target documents average 71,902.31 characters, roughly 91 times longer.
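
The per-query statistics above can be recomputed from a qrels mapping. The sketch below assumes the qrels are available as a dictionary from query ID to a set of positive document IDs. The function name and the toy IDs are illustrative and do not reflect the dataset's actual schema.

```python
from statistics import mean, median
from typing import Dict, Set


def qrels_profile(qrels: Dict[str, Set[str]]) -> Dict[str, float]:
    """Summarize positives per query for a {query_id: {doc_id, ...}} mapping."""
    counts = [len(docs) for docs in qrels.values()]
    return {
        "queries": len(counts),
        "positive_qrels": sum(counts),
        "avg": mean(counts),
        "min": min(counts),
        "median": median(counts),
        "max": max(counts),
        # Queries with more than one positive document.
        "multi_positive": sum(1 for c in counts if c > 1),
    }


# Toy qrels with hypothetical IDs.
toy_qrels = {
    "q1": {"d1"},
    "q2": {"d2", "d3", "d4"},
    "q3": {"d5", "d6"},
}
profile = qrels_profile(toy_qrels)
print(profile)
```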

The short query and very long target create both semantic and length challenges. Relevant cross-domain evidence may be sparse inside the target full text.

BM25 Evaluation Profile

BM25 reaches nDCG@10 of 0.0638, hit@10 of 0.2100, and recall@100 of 0.1875 with a top-500 candidate pool. Exact title and abstract terms are not reliable enough for cross-domain search, and full-text targets introduce many incidental matches.

BM25 can help when a cross-domain target uses shared mechanism or material names, but most OUT-domain positives require more abstract matching than term frequency provides.
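
For reference, the three reported metrics can be sketched per query as follows. This assumes binary relevance and that the reported figures are averages of per-query values. The source does not state the exact evaluation code, so treat this as one common definition and not as the benchmark's implementation.

```python
import math
from typing import List, Set


def hit_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """1.0 if any positive appears in the top k, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Fraction of the query's positives found in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc in enumerate(ranked[:k])
        if doc in positives
    )
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(positives))))
    return dcg / ideal if ideal > 0 else 0.0


# Toy ranking with hypothetical document IDs.
ranked = ["d9", "d1", "d7", "d2"]
positives = {"d1", "d2", "d3"}
hit = hit_at_k(ranked, positives, 10)
rec = recall_at_k(ranked, positives, 100)
ndcg = ndcg_at_k(ranked, positives, 10)
print(hit, rec, ndcg)
```

Because the ideal DCG uses all of a query's positives up to k, a query with many positives needs several of them near the top to score well on nDCG@10, while hit@10 needs only one.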

Dense Evaluation Profile

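The dense harrier-oss-270m profile (config `harrier_oss_v1_270m`) is strongest by nDCG@10 and hit@10, with nDCG@10 of 0.0952, hit@10 of 0.3350, and recall@100 of 0.2518. Relative to BM25, that is a gain of 0.0314 in nDCG@10, 0.1250 in hit@10, and 0.0643 in recall@100, which suggests that dense retrieval matches technical similarity beyond shared terms.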

The absolute scores remain low because the query is compact and the relevant target is cross-domain and very long. Dense retrieval must represent analogy or technical transfer, not just topical similarity.

Reranking Hybrid Evaluation Profile

The reranking_hybrid candidate subset reaches nDCG@10 of 0.0858, hit@10 of 0.3050, and recall@100 of 0.2653. It uses top-100 candidates with optional rank-101 safeguards; 74 rows contain 101 candidates and 74 safeguard-positive rows are recorded. Hybrid provides the best recall@100, while dense has better top-rank ordering.

This suggests that lexical evidence adds some additional positives but also introduces noise. A downstream reranker would need to identify the cross-domain relationship among many weakly related candidates.
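
The source does not describe how the hybrid list is fused or exactly how the rank-101 safeguard is applied. The sketch below rests on two explicit assumptions: BM25 and dense rankings are merged with reciprocal rank fusion (RRF, with the commonly used constant k = 60), and the safeguard appends one positive at rank 101 when none appears in the top 100. Both function names are hypothetical.

```python
from typing import Dict, List, Set


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Reciprocal rank fusion: sum 1 / (k + rank) over the input rankings."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def with_safeguard(
    fused: List[str], positives: Set[str], top_n: int = 100
) -> List[str]:
    """Assumed safeguard: append one positive if the top_n contains none."""
    candidates = fused[:top_n]
    if positives and not any(doc in positives for doc in candidates):
        # Deterministic choice of which positive to append (an assumption).
        candidates = candidates + [sorted(positives)[0]]
    return candidates


# Toy rankings with hypothetical document IDs.
bm25_ranking = ["a", "b", "c"]
dense_ranking = ["b", "d", "a"]
fused = rrf_fuse([bm25_ranking, dense_ranking])
guarded = with_safeguard(fused, positives={"z"}, top_n=2)
unguarded = with_safeguard(fused, positives={"a"}, top_n=2)
print(fused, guarded, unguarded)
```

Under this reading, a row with 101 candidates is one where the first 100 contained no positive, so a reranker evaluated on such rows always has at least one positive to find.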

Metric Interpretation for Model Researchers

This is a difficult short-query, long-target, cross-domain patent retrieval task. The gap between BM25 and dense shows that semantic matching helps, but the best recall@100 across the three profiles is only 0.2653, so most positives are still missing from the top 100 candidates.

The task is useful for testing cross-domain prior-art discovery and long-document retrieval, especially under limited query context.

Query and Relevance Type Tendencies

Queries are title-abstract summaries. Documents are full patent texts. Positives are citation-related families outside the query's IPC3 class.

The relevant relationship may involve analogous functions, mechanisms, control methods, materials, or processes that are described differently across domains.

Representative Failure Modes

BM25 retrieves long documents with incidental term overlap. Dense retrieval returns broadly related technology but misses the specific citation-related families. Hybrid retrieval can recover more positives by rank 100 but may rank lexical distractors above them.

Training Data That May Help

Useful training data includes cross-domain title-abstract patent retrieval, cross-IPC patent citation pairs, and long-target prior-art search. Training should exclude NanoDAPFAM evaluation family IDs, positives, and qrels.
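
The exclusion rule can be applied as a simple filter over training pairs. The sketch assumes training data is a list of (query family ID, document family ID) pairs and that the evaluation family IDs, covering both query families and positive target families, have been collected into one set. All names and IDs are hypothetical.

```python
from typing import List, Set, Tuple


def filter_training_pairs(
    pairs: List[Tuple[str, str]], eval_family_ids: Set[str]
) -> List[Tuple[str, str]]:
    """Drop any pair in which either side is an evaluation family."""
    return [
        (query_id, doc_id)
        for query_id, doc_id in pairs
        if query_id not in eval_family_ids and doc_id not in eval_family_ids
    ]


# Toy data with hypothetical family IDs.
eval_family_ids = {"E1", "E2"}
train_pairs = [("T1", "T2"), ("E1", "T3"), ("T4", "E2"), ("T5", "T6")]
clean_pairs = filter_training_pairs(train_pairs, eval_family_ids)
print(clean_pairs)
```

Filtering on both sides of the pair is the conservative choice: it also removes pairs in which an evaluation positive appears as a training document.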

Synthetic data should pair compact source patent summaries with long full-text patent records from different technical classes.

Model Improvement Notes

Improving on this task requires cross-domain semantic expansion from short summaries and evidence extraction from long targets. Models should learn functional analogy and technical-effect matching across IPC classes.

Passage-level target representations are likely important because a full-text patent may contain the relevant cross-domain evidence only in a small section.
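
One way to picture a passage-level representation is to split each target into overlapping word windows and score the document by its best passage. The sketch below uses token Jaccard overlap purely as a stand-in scorer; a real system would more plausibly score passages with BM25 or dense embeddings. The window size, stride, and function names are illustrative assumptions.

```python
from typing import Callable, List


def split_passages(text: str, size: int = 200, stride: int = 100) -> List[str]:
    """Split text into overlapping windows of `size` words, stepping by `stride`."""
    words = text.split()
    passages = []
    for start in range(0, len(words), stride):
        passages.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # The last window already reaches the end of the text.
    return passages


def jaccard_score(query: str, passage: str) -> float:
    """Stand-in scorer: Jaccard overlap between lowercase token sets."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    union = q | p
    return len(q & p) / len(union) if union else 0.0


def max_passage_score(
    query: str,
    document: str,
    score: Callable[[str, str], float] = jaccard_score,
    size: int = 200,
    stride: int = 100,
) -> float:
    """Score a document by its best-scoring passage."""
    return max(
        (score(query, p) for p in split_passages(document, size, stride)),
        default=0.0,
    )


# Toy document: one relevant sentence buried between unrelated tokens.
before = " ".join(f"w{i}" for i in range(100))
after = " ".join(f"x{i}" for i in range(100))
document = f"{before} woven fiber layer coated with plastic {after}"
query = "fiber layer plastic"

whole_doc = jaccard_score(query, document)
best_passage = max_passage_score(query, document, size=8, stride=4)
print(whole_doc, best_passage)
```

In the toy case the whole-document score is diluted by the unrelated tokens, while the best passage isolates the relevant sentence and scores much higher.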

Example Data

Query	Positive document
bicycle handlebar grip a bicycle handlebar grip contains a plastic inner shell having a tubular shape and an outer surface; a fiber layer having an inner surface and an outer surface and includes a plurality of fibers interweaving with each other and a plurality of weaving gaps located between the fibers; a plastic layer enclosed around the outer surface of the fiber layer and combined with the fiber layer integrally and including a holding portion coated on the outer surface of the fiber layer,... [500 / 821 chars]	durable flexible membrane and method of making same a flexible membrane having a valuable combination of desirable properties is composed of a generally heavy, dense supporting and reinforcing reticulated base fabric constituted of thick, generally loose bundles of multiple continuous filaments arranged in a mechanically interengated reticular array having an overall weight within the range of about 3-12 oz/yd.sup.2 the continuous filaments being of a synthetic polymer having good dimensional stability and high resistance to heat and light; and a solidified base coating composition completely impregnating the interstices of the base fabric and as forming a continuous coating along the opposite sides thereof, the composition having as essential ingredients a pvc resin and a plasticizer therefor in the amount of 35-75% of the resin of a trimellitate ester which exhibits high resistance to separation from the resin and imparts to the coating high flexibility at temperatures at least as lo... [1,000 / 28,042 chars]
method for improving belt press dewatering a method for increasing the removal of a higher fraction of liquid from the press cake in any belt press is described. specifically, the invention incorporates a series of rollers that create multiple pinch points to compress the solid fraction while removing liquid. after each pinch point, the solid material is allowed to separate from the belt, fall by gravity, and repack so that more liquid can be released at each successive pinch point than is possi... [500 / 620 chars]	artificial human anti-factor b antibody problem to be solved: to provide novel engineered forms of a monoclonal antibody and antigen-binding fragments thereof that bind complement protein factor b and selectively inhibit the alternative complement pathway.solution: artificial human anti-factor b antibodies or antigen-binding fragments thereof are derived from murine monoclonal antibody 1379 "mab1379", which selectively binds factor b in the third short consensus repeat ("scr") domain and prevents formation of the c3bbb complex. 1. a humaneered anti-factor b antibody or antigen-binding fragment thereof that selectively binds to factor b within the third short consensus repeat (“scr”) domain and prevents formation of the c3bbb complex, wherein the antibody comprises a v κ -region polypeptide comprising the amino acid sequence of seq id no: 16 and a v h -region polypeptide comprising the amino acid sequence of seq id no: 35. 2. the humaneered anti-factor b antibody or antigen-binding frag... [1,000 / 108,109 chars]
stitch distribution control system for tufting machines a stitch distribution control system for a tufting machine for controlling placement of yarns being fed to the needles of the tufting machine by yarn feed mechanisms to form a desired pattern. a backing material is fed through the tufting machine at an increased stitch rate as the needles are shifted according to calculated pattern steps. a series of loopers or hooks engage and pick loops of yarns from the needles. the yarn feed mechanisms... [500 / 647 chars]	method and apparatus for measuring direction or position of weft yarn of fabric the measurement of the pick or stitches course position in continuously moved fabrics involves examining at least one gap-shaped segment in a top illumination or transillumination. the width of the segment is small and its legnth long in comparison to the thickness of the picks. the brightness value inside the segment is divided into two stages or areas (light, dark). those sections within the segment, in which the value is continuously associated with one stage, are determined. the number or total length of the sections of s stage is determined or, alternatively the speed at which the sections of a stage move in the segment is determined and the drafting angle of the pick is deduced from this value. 1. a process for measuring the draft angle .alpha. of a weft thread in a travelling textile sheet which comprises: (a) intercepting light transmitted or reflected from a long narrow field of the travelling text... [1,000 / 24,253 chars]

Source Reference Table

Source	Role
DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval	Source benchmark paper for family-level patent retrieval.
DAPFAM DOI record	DOI record for the DAPFAM paper.
datalyes/DAPFAM_patent	Public source dataset card.
hakari-bench/NanoDAPFAM	Nano benchmark dataset containing this split.

Dataset Information

Field	Value
Nano set	NanoDAPFAM
Backing dataset	NanoDAPFAM
Task / split	NanoDAPFAMOutTitlAbsToFullText
Hugging Face dataset	hakari-bench/NanoDAPFAM
Language	en
Category	natural_language
Queries	200
Documents	10,000
Positive qrels	1,259
Positives / query avg	6.295
Positives / query min	1
Positives / query median	4.00
Positives / query max	20
Multi-positive queries	159 (79.50%)
Query length avg chars	786.61
Document length avg chars	71,902.31

Candidate Subsets

Profile	Config	nDCG@10	Hit@10	Recall@100	Candidates
BM25	`bm25`	0.0638	0.2100	0.1875	top-500
Dense	`harrier_oss_v1_270m`	0.0952	0.3350	0.2518	top-500
Reranking hybrid	`reranking_hybrid`	0.0858	0.3050	0.2653	top-100

Training and Leakage Metadata

Original train split: not_confirmed
Evaluation split origin: DAPFAM OUT-domain title-abstract to full-text patent-family retrieval
Train/eval overlap audit: not_audited
Leakage note: exclude NanoDAPFAM evaluation family IDs, positives, and qrels
Multi-positive training: citation_family_multi_positive
Useful training data: cross-domain title-abstract patent retrieval, cross-IPC patent citation pairs, long-target prior-art search