NanoBuiltBench
Overview
NanoBuiltBench is a compact English benchmark for built-asset information retrieval. It evaluates whether a model can align architecture, engineering, construction, and operations terminology across entity descriptions and classification-system descriptions. The retrieval target is not generic web relevance: the model must connect IFC-style building, infrastructure, product, equipment, or facility-management entities to relevant Uniclass-style product or class descriptions.
The group has two tasks: a broader retrieval split and a reranking-oriented variant. Both are terminology-heavy and multi-positive. A single asset can map to several acceptable classifications, and a correct match may depend on function, hierarchy, material, or system context rather than exact wording. BM25 measures how far controlled-vocabulary overlap goes, dense retrieval tests semantic alignment across taxonomy language, and reranking_hybrid shows whether exact terms and embedding similarity recover complementary candidates.
What This Group Measures
The source paper, "Benchmarking pre-trained text embedding models in aligning built asset information", studies how well text embedding models align built-asset information across classification systems. NanoBuiltBench adapts that setting into compact retrieval tasks where queries are IFC-style names and definitions and documents are Uniclass-style product or class descriptions.
The shared relevance relation is classification compatibility. A model must rank descriptions that represent the same or compatible built-asset concept. This is harder than matching words such as door, duct, valve, or sensor, because nearby classes can share vocabulary while differing in asset type, use, installation context, or level of generality.
Task Families
- Built-asset classification retrieval: NanoBuiltBench retrieves Uniclass-style product descriptions for IFC-style asset descriptions.
- Built-asset reranking: NanoBuiltBenchReranking uses the same domain but emphasizes ranking candidate class descriptions for entity definitions.
- Multi-positive taxonomy alignment: both tasks contain several positives per query, so ranking quality matters beyond first-hit success.
Dataset Shape
NanoBuiltBench contains 2 task pages, 282 queries, 5,659 split-local documents, and 2,054 positive qrel rows. The main retrieval task has 200 queries and 1,480 positives; the reranking variant has 82 queries and 574 positives. Both average about seven positives per query (7.4 and 7.0, respectively).
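The totals and per-query averages can be re-derived from the per-task counts. The sketch below is only an arithmetic check; the dictionary layout and the helper name `total` are illustrative assumptions, and the numbers are copied from the task tables in this document.

```python
# Per-task counts copied from the task tables in this document.
tasks = {
    "NanoBuiltBench": {"queries": 200, "docs": 2761, "positives": 1480},
    "NanoBuiltBenchReranking": {"queries": 82, "docs": 2898, "positives": 574},
}


def total(field):
    """Sum one count field across both tasks."""
    return sum(task[field] for task in tasks.values())


# Average number of positive qrel rows per query, per task and overall.
positives_per_query = {
    name: task["positives"] / task["queries"] for name, task in tasks.items()
}
overall_avg = total("positives") / total("queries")

print(total("queries"), total("docs"), total("positives"))
print(positives_per_query)
print(round(overall_avg, 2))
```

The group totals come out to 282 queries, 5,659 documents, and 2,054 positives, with an overall average of 7.28 positives per query.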
The texts are compact but technical. Query averages range from about 102 to 138 characters, while documents average about 309 to 342 characters. This is enough text to include definition-like cues, but not enough to hide behind long-context reasoning. The benchmark mainly tests domain vocabulary, taxonomy alignment, and ranking among close class descriptions.
Retrieval Behavior
BM25 Profile
BM25 benefits from exact built-environment terms. Asset names, product labels, and classification words often overlap between the IFC-style query and Uniclass-style documents. BM25 scores higher on the broader NanoBuiltBench task (nDCG@10 0.5235) than on the reranking variant (0.2681), suggesting that direct vocabulary overlap helps but does not fully order the relevant class set.
Sparse retrieval is limited when the query and document use different taxonomy levels or synonyms. A term like pipe, fitting, terminal, or control can appear in many nearby classes, and BM25 may rank related but incompatible descriptions above the intended classification.
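The shared-term effect can be illustrated with a minimal Okapi BM25 sketch. This is not the benchmark's scoring code: the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the three class descriptions are all assumptions made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (assumption, no stemming).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


# Hypothetical class descriptions, not taken from Uniclass.
docs = [
    "pipe fitting for water supply systems",
    "pipe insulation products",
    "valves for flow control in piped systems",
]
query = "pipe fitting"
scores = bm25_scores(query, docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, [round(s, 3) for s in scores])
```

In this toy setup the insulation class still receives a positive score purely from the shared token "pipe", while the valve class scores zero because "piped" does not match without stemming. Both effects are the kind of lexical behavior described above.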
Dense Profile
Dense retrieval is the strongest profile for both tasks in the current metadata. It improves over BM25 because it can connect definitions by function and built-asset semantics, not only by repeated tokens. This is important when classification descriptions use different wording from IFC entity names.
Dense retrieval still needs domain sensitivity. Overly broad embedding similarity can group assets by general topic while missing level of detail or system role. For this group, dense gains are most meaningful when paired with hard negatives from neighboring classes.
Reranking Hybrid Profile
reranking_hybrid sits between BM25 and dense in the current task scores. That does not make it unimportant: in taxonomy alignment, exact terms and semantic definitions often recover different relevant classes. The hybrid profile is therefore useful as a candidate-pool diagnostic for downstream reranking.
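The source does not describe how reranking_hybrid combines its signals. As one common option, the sketch below merges a sparse and a dense candidate list with reciprocal rank fusion; the function name, the constant `k=60`, and the candidate IDs are assumptions for illustration, not the benchmark's method.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists; each hit contributes 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by ID for a deterministic order.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical candidate lists from two retrievers.
bm25_top = ["c1", "c2", "c3"]
dense_top = ["c4", "c1", "c5"]

fused = reciprocal_rank_fusion([bm25_top, dense_top])
print(fused)
```

The fused pool contains candidates that only one retriever surfaced, which is what makes a hybrid list useful as a candidate-coverage check before reranking.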
If a reranker is evaluated on this group, it should be checked against both ranking quality and candidate coverage. The relevant set is multi-positive, so dropping secondary valid classifications can lower recall even when a first hit is found.
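The difference between first-hit success and coverage of the full positive set can be made concrete with two small metrics. This is a minimal sketch; the function names and the toy ranking are hypothetical.

```python
def hit_at_k(ranked, positives, k):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return float(any(doc_id in positives for doc_id in ranked[:k]))


def recall_at_k(ranked, positives, k):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    found = sum(1 for doc_id in ranked[:k] if doc_id in positives)
    return found / len(positives)


# Hypothetical query with four valid classifications.
positives = {"p1", "p2", "p3", "p4"}
ranked = ["p1", "n1", "n2", "p2", "n3"]

print(hit_at_k(ranked, positives, k=5))     # 1.0
print(recall_at_k(ranked, positives, k=5))  # 0.5
```

Here the first hit is found immediately, yet half of the valid classifications are missing from the top five.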
Task Summary
| Task | Retrieval focus | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| NanoBuiltBench | IFC-style asset to Uniclass description | 200 | 2,761 | 1,480 | 0.5235 | 0.6209 | 0.5751 | Dense |
| NanoBuiltBenchReranking | entity definition to candidate class | 82 | 2,898 | 574 | 0.2681 | 0.3650 | 0.3277 | Dense |
Interpretation Notes for Model Researchers
NanoBuiltBench is a domain-alignment benchmark. Strong performance indicates that a retriever understands built-environment terminology and classification semantics, not simply that it can match English definitions. The multi-positive structure matters: a query may have several valid class descriptions, and a model should rank the relevant cluster well.
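To see why the multi-positive structure affects the reported metric, the sketch below computes nDCG@10 with binary relevance. Binary gains and the log2 discount are assumptions (the source does not state the exact nDCG variant), and the document IDs are hypothetical.

```python
import math


def ndcg_at_k(ranked, positives, k=10):
    """nDCG@k with binary gains: 1 for a positive, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    # Ideal ranking places all positives (up to k) at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


positives = {"a", "b", "c"}
whole_cluster = ["a", "b", "c", "x", "y"]  # all positives ranked first
first_hit_only = ["a", "x", "y", "z", "w"]  # only one positive retrieved

print(ndcg_at_k(whole_cluster, positives))
print(ndcg_at_k(first_hit_only, positives))
```

A ranking that places one positive at rank 1 but misses the rest scores well below one that surfaces the whole relevant cluster.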
The group is also useful for testing domain adaptation. If a general embedding model performs well on web QA but poorly here, the gap likely reflects missing construction-taxonomy knowledge. If BM25 is competitive, exact controlled vocabulary is still carrying much of the task; if dense retrieval is much stronger, the model is likely capturing synonymy and hierarchy across classification systems that token overlap misses.
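One way to read "competitive" versus "much stronger" is to compare the dense and BM25 scores directly. The sketch below does this for the nDCG@10 values reported in the task tables of this document; the helper name `relative_gain` is an assumption.

```python
# nDCG@10 values copied from the task tables in this document.
scores = {
    "NanoBuiltBench": {"bm25": 0.5235, "dense": 0.6209, "hybrid": 0.5751},
    "NanoBuiltBenchReranking": {"bm25": 0.2681, "dense": 0.3650, "hybrid": 0.3277},
}


def relative_gain(task, baseline="bm25", candidate="dense"):
    """Relative improvement of the candidate profile over the baseline."""
    s = scores[task]
    return (s[candidate] - s[baseline]) / s[baseline]


for task, s in scores.items():
    abs_gain = s["dense"] - s["bm25"]
    print(f"{task}: +{abs_gain:.4f} nDCG@10 ({relative_gain(task):.1%} relative)")
```

Dense retrieval improves on BM25 by about 18.6% relative on the main task and about 36.1% on the reranking variant, so controlled-vocabulary overlap carries more of the main task than of the reranking one.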
Training and Leakage Notes
Useful training data includes non-overlapping built-asset entity-to-class pairs, IFC and Uniclass descriptions, construction taxonomy mappings, product-class alignment data, and facility-management terminology. Hard negatives should come from nearby product classes that share words but differ in function, level, or asset type.
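A simple lexical heuristic for finding such hard negatives is sketched below: rank non-positive classes by token overlap with the query and keep the closest ones. Jaccard overlap, the function names, and the example class descriptions are assumptions for illustration; a real pipeline might mine negatives with a retriever instead.

```python
def tokens(text):
    # Naive lowercase whitespace tokenization (assumption).
    return set(text.lower().split())


def jaccard(a, b):
    """Token-set overlap between two texts, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def mine_hard_negatives(query, candidates, positive_ids, top_n=2):
    """Return non-positive candidate IDs that share the most words with the query."""
    scored = [
        (jaccard(query, text), doc_id)
        for doc_id, text in candidates.items()
        if doc_id not in positive_ids
    ]
    # Keep only candidates with some lexical overlap, highest overlap first.
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


# Hypothetical class descriptions, not taken from Uniclass.
candidates = {
    "c1": "air terminal for ventilation supply",
    "c2": "electrical terminal block",
    "c3": "roof membrane sheet",
}
hard_negatives = mine_hard_negatives("air terminal unit", candidates, positive_ids={"c1"})
print(hard_negatives)
```

The electrical class is selected because it shares "terminal" with the query while describing a different asset type; the roofing class is skipped because it shares no vocabulary and would be an easy negative.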
Exclude NanoBuiltBench evaluation queries, positives, qrels, and near-duplicate classification descriptions. If source taxonomy tables are used for training, audit exact row overlap before evaluating.
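An exact-row overlap audit can be as small as the sketch below. The normalization (lowercasing and whitespace collapsing) and the function names are assumptions, and the example strings are hypothetical. Note that this only catches exact matches after normalization, so near-duplicate descriptions would need a fuzzier comparison.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_overlap(train_texts, eval_texts):
    """Return training rows whose normalized text also appears in the eval set."""
    eval_norm = {normalize(t) for t in eval_texts}
    return [t for t in train_texts if normalize(t) in eval_norm]


# Hypothetical rows for illustration.
train_texts = ["Air terminal  unit", "Cable tray"]
eval_texts = ["air terminal unit"]

leaked = exact_overlap(train_texts, eval_texts)
print(leaked)
```

Any rows returned here should be dropped from the training set before evaluating.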
Source Reference Table
| Source | Year | Type | URL |
| Benchmarking pre-trained text embedding models in aligning built asset information | 2025 | paper | https://www.nature.com/articles/s41598-025-09052-5 |
Metadata Summary
| Field | Value |
| Task pages | 2 |
| Queries | 282 |
| Split-local documents | 5,659 |
| Positive qrels | 2,054 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 7.28 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| NanoBuiltBench | NanoBuiltBench | en | natural_language | 200 | 2,761 | 1,480 | 0.5235 | 0.6209 | 0.5751 | Dense |
| NanoBuiltBenchReranking | NanoBuiltBench | en | natural_language | 82 | 2,898 | 574 | 0.2681 | 0.3650 | 0.3277 | Dense |