NanoBIRCO
Overview
NanoBIRCO is the compact Nano set for BIRCO, a benchmark focused on retrieval tasks with complex objectives. The queries are often long descriptions of a goal rather than short keyword searches: refute an argument, find matching clinical trials for a patient, retrieve scientific abstracts for a nuanced research need, recover a literary quotation from its surrounding context, or identify a book from an incomplete memory.
The group is useful because topical similarity is not enough. A retrieved text can share vocabulary with the query and still fail the objective: a clinical trial may exclude the patient, an abstract may study a nearby but wrong problem, a same-topic argument may agree rather than rebut, and a book description may match the setting but not the remembered work. BM25 shows how far lexical overlap goes on these long queries, dense retrieval tests objective-level semantic matching, and reranking_hybrid indicates whether combining both signals gives a better candidate pool.
What This Group Measures
BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives frames retrieval as objective satisfaction. NanoBIRCO keeps five of those objective-heavy task families in compact English form: argument retrieval, clinical-trial matching, DORIS-MAE scientific abstract retrieval, RELIC literary quotation retrieval, and WhatsThatBook-style book identification.
The shared difficulty is constraint handling. The query usually contains multiple clues or requirements, and the positive document must satisfy the whole retrieval goal. This makes NanoBIRCO a good group for testing whether an embedding model retrieves by intent and constraints, rather than only by topic, genre, or repeated words.
Task Families
- Counterargument retrieval: NanoBIRCOArguAna uses long arguments as queries and expects documents that directly challenge the stance.
- Clinical-trial retrieval: NanoBIRCOClinicalTrial maps patient cases to trial records. Inclusion and exclusion criteria matter.
- Scientific abstract retrieval: NanoBIRCODorisMae retrieves abstracts for detailed research needs, often with many valid positives.
- Literary quotation retrieval: NanoBIRCORelic uses context around a missing quotation and requires recovering the relevant literary passage.
- Book identification: NanoBIRCOWTB retrieves books from vague reader memories of plot, characters, setting, or atmosphere.
Dataset Shape
NanoBIRCO contains 5 task pages, 408 queries, 18,789 split-local documents, and 2,909 positive qrel rows. Three tasks are single-positive in the observed metadata: ArguAna, RELIC, and WTB. ClinicalTrial and DORIS-MAE are strongly multi-positive, with average positives per query above 20.
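The group totals and the positives-per-query figures can be re-derived from the per-task counts. The following is a minimal sketch that copies the counts from the Task Summary table; the dictionary layout and function names are illustrative, not part of any NanoBIRCO tooling.

```python
# Per-task counts copied from the Task Summary table: (queries, docs, positives).
TASKS = {
    "NanoBIRCOArguAna": (98, 3081, 98),
    "NanoBIRCOClinicalTrial": (50, 3375, 1042),
    "NanoBIRCODorisMae": (60, 5544, 1569),
    "NanoBIRCORelic": (100, 5023, 100),
    "NanoBIRCOWTB": (100, 1766, 100),
}


def group_totals(tasks):
    """Sum queries, documents, and positive qrel rows across all tasks."""
    queries = sum(q for q, _, _ in tasks.values())
    docs = sum(d for _, d, _ in tasks.values())
    positives = sum(p for _, _, p in tasks.values())
    return queries, docs, positives


def positives_per_query(tasks):
    """Average number of positive qrel rows per query, for each task."""
    return {name: p / q for name, (q, _, p) in tasks.items()}


total_queries, total_docs, total_positives = group_totals(TASKS)
print(total_queries, total_docs, total_positives)
print(f"group average: {total_positives / total_queries:.2f}")
for name, avg in positives_per_query(TASKS).items():
    print(f"{name}: {avg:.2f} positives per query")
```

Tasks whose ratio is exactly 1.0 are the single-positive ones; the two multi-positive tasks stand out immediately.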
This is one of the long-query Nano groups. Query averages range from about 497 characters in ClinicalTrial to more than 1,100 characters in ArguAna. Documents are also substantial: clinical trial records, scientific abstracts, arguments, and book descriptions often exceed paragraph scale. Long text alone is not the core challenge; the harder problem is matching all constraints in the query.
Retrieval Behavior
BM25 Profile
BM25 is strongest on NanoBIRCOArguAna (0.4293 nDCG@10) and NanoBIRCODorisMae (0.3866), where long queries share useful topical vocabulary with counterarguments or scientific abstracts. It is weaker on WTB (0.2669) and much weaker on RELIC (0.1314), because the important clue may be indirect, partial, or absent from the query surface. ClinicalTrial is about as hard as RELIC (0.1322): medical terms overlap, but eligibility depends on inclusion and exclusion criteria.
The BM25 results are a reminder that long queries do not automatically make sparse retrieval easy. More words can provide more anchors, but they can also retrieve many plausible neighbors that fail the objective.
Dense Profile
Dense retrieval improves objective matching on several tasks, especially ArguAna (0.4293 to 0.5062 nDCG@10 over BM25), ClinicalTrial (0.1322 to 0.2152), and DORIS-MAE (0.3866 to 0.4140). These tasks reward semantic alignment between a complex need and a document that satisfies it. Dense retrieval falls below BM25 on RELIC (0.0725 vs. 0.1314), where the missing-quotation problem often depends on specific literary context, and it barely improves WTB (0.2669 to 0.2714) because vague book memories can describe many similar works.
For model researchers, this group is useful for checking whether dense retrieval handles long, constraint-heavy queries without collapsing them into generic topical similarity.
Reranking Hybrid Profile
reranking_hybrid is the best profile for NanoBIRCOWTB (0.3376 nDCG@10, against 0.2714 for dense and 0.2669 for BM25) and ranks second on the other four tasks. It is particularly useful when a sparse match captures a rare clue while dense retrieval captures the broader objective. In the reported scores that complementarity pays off most clearly in book identification; on the clinical and literary tasks the hybrid profile stays below the best single signal.
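This page does not describe how reranking_hybrid merges its sparse and dense candidates. As one illustration of combining the two signals, the sketch below uses reciprocal rank fusion (RRF), which is an assumed stand-in and not necessarily the method behind the reported scores. The doc ids and the `k=60` constant are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=10):
    """Fuse several ranked lists of doc ids with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is a commonly used default, not a value
    taken from the NanoBIRCO description.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


bm25_top = ["d7", "d2", "d9"]   # hypothetical sparse ranking
dense_top = ["d2", "d4", "d7"]  # hypothetical dense ranking
print(reciprocal_rank_fusion([bm25_top, dense_top]))
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still stays in the candidate pool.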
For reranker experiments, NanoBIRCO is a candidate-coverage stress test. A reranker cannot fix the task if first-stage retrieval drops the one document that satisfies the objective.
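Candidate coverage can be checked before any reranker runs. The sketch below counts the share of queries whose first-stage top-k list contains at least one positive; the input layout (dictionaries keyed by query id) and the function name are assumptions made for illustration.

```python
def candidate_coverage(candidates, qrels, k=100):
    """Fraction of queries whose top-k candidates contain at least one positive.

    candidates: dict mapping query id -> ranked list of doc ids (first stage)
    qrels: dict mapping query id -> set of positive doc ids
    """
    if not qrels:
        return 0.0
    covered = 0
    for query_id, positives in qrels.items():
        top_k = candidates.get(query_id, [])[:k]
        if positives.intersection(top_k):
            covered += 1
    return covered / len(qrels)


# Hypothetical toy data: q1's positive survives first-stage retrieval, q2's does not.
toy_qrels = {"q1": {"a"}, "q2": {"b"}}
toy_candidates = {"q1": ["x", "a"], "q2": ["y", "z"]}
print(candidate_coverage(toy_candidates, toy_qrels, k=2))
```

For the single-positive tasks this number is an upper bound on what any reranker over the same candidates can recover.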
Task Summary
| Task | Retrieval focus | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoBIRCOArguAna | argument to counterargument | 98 | 3,081 | 98 | 0.4293 | 0.5062 | 0.4932 | Dense |
| NanoBIRCOClinicalTrial | patient case to trial record | 50 | 3,375 | 1,042 | 0.1322 | 0.2152 | 0.1959 | Dense |
| NanoBIRCODorisMae | research need to abstract | 60 | 5,544 | 1,569 | 0.3866 | 0.4140 | 0.4012 | Dense |
| NanoBIRCORelic | literary context to quotation | 100 | 5,023 | 100 | 0.1314 | 0.0725 | 0.1276 | BM25 |
| NanoBIRCOWTB | vague memory to book | 100 | 1,766 | 100 | 0.2669 | 0.2714 | 0.3376 | Reranking hybrid |
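For readers who want to reproduce the metric, here is a minimal sketch of nDCG@10. It assumes binary relevance (gain 1 for a positive, 0 otherwise), a log2 rank discount, and a ranking in which each doc id appears at most once. The page does not state the exact gain or discount used for the reported scores, so graded-relevance setups may give different numbers.

```python
import math


def ndcg_at_k(ranked_doc_ids, positives, k=10):
    """nDCG@k with binary relevance: gain 1 for a positive, 0 otherwise."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_doc_ids[:k], start=1)
        if doc_id in positives
    )
    # The ideal ranking places as many positives as fit at the top.
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


# Single-positive query (as in ArguAna, RELIC, WTB): the positive sits at rank 3.
print(ndcg_at_k(["d5", "d8", "d1", "d2"], {"d1"}))
```

With one positive per query, the score depends only on the rank of that document, which is why a near miss at rank 11 scores the same as a complete miss.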
Interpretation Notes for Model Researchers
NanoBIRCO should be read as a constraint-following retrieval benchmark. Strong performance implies more than matching topic labels: the model must retrieve a document that satisfies the query's purpose. The most informative failures are near misses, such as same-disease trials with wrong eligibility, same-topic arguments with wrong stance, or similar books that do not match the remembered details.
The group is also useful for separating long-context encoding from objective matching. ArguAna and DORIS-MAE provide enough query text for lexical and dense signals, while RELIC and WTB show that long or descriptive queries can still be underspecified. Compare BM25, dense, and hybrid profiles before attributing a score change to model size or context length alone.
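A per-task comparison of the three profiles can be scripted directly from the Task Summary table. This sketch copies the nDCG@10 values from that table; the helper names are illustrative.

```python
# nDCG@10 values copied from the Task Summary table.
SCORES = {
    "NanoBIRCOArguAna": {"bm25": 0.4293, "dense": 0.5062, "reranking_hybrid": 0.4932},
    "NanoBIRCOClinicalTrial": {"bm25": 0.1322, "dense": 0.2152, "reranking_hybrid": 0.1959},
    "NanoBIRCODorisMae": {"bm25": 0.3866, "dense": 0.4140, "reranking_hybrid": 0.4012},
    "NanoBIRCORelic": {"bm25": 0.1314, "dense": 0.0725, "reranking_hybrid": 0.1276},
    "NanoBIRCOWTB": {"bm25": 0.2669, "dense": 0.2714, "reranking_hybrid": 0.3376},
}


def best_profile(task_scores):
    """Name of the profile with the highest nDCG@10 for one task."""
    return max(task_scores, key=task_scores.get)


def dense_minus_bm25(scores):
    """Per-task gap between dense and BM25; negative means BM25 wins."""
    return {task: s["dense"] - s["bm25"] for task, s in scores.items()}


def macro_average(scores, profile):
    """Unweighted mean of one profile's nDCG@10 across tasks."""
    return sum(s[profile] for s in scores.values()) / len(scores)


for task, task_scores in SCORES.items():
    gap = dense_minus_bm25(SCORES)[task]
    print(f"{task}: best={best_profile(task_scores)}, dense-bm25={gap:+.4f}")
```

Looking at the per-task gaps before the macro average helps avoid crediting a model change for what is really one task moving.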
Training and Leakage Notes
Useful training data includes non-overlapping BIRCO-style objective retrieval pairs, argument-counterargument pairs, clinical-trial eligibility matching, scientific abstract recommendation, literary quotation retrieval, and book description search. ClinicalTrial and DORIS-MAE should use multi-positive training objectives where possible.
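One possible multi-positive objective is a contrastive loss that averages the log-likelihood over all positives of a query. The sketch below is a plain-Python illustration of that idea for a single query; it is an assumed formulation, not a training recipe prescribed by NanoBIRCO, and the temperature value is hypothetical.

```python
import math


def multi_positive_info_nce(scores, positive_mask, temperature=0.05):
    """Loss for one query: negative mean log-softmax over its positives.

    scores: similarity between the query and each candidate document
    positive_mask: parallel list of booleans, True where the document is positive
    """
    if len(scores) != len(positive_mask):
        raise ValueError("scores and positive_mask must have the same length")
    if not any(positive_mask):
        raise ValueError("at least one positive is required")
    logits = [s / temperature for s in scores]
    # Log-sum-exp with max subtraction for numerical stability.
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    positive_log_probs = [
        l - log_norm for l, is_pos in zip(logits, positive_mask) if is_pos
    ]
    return -sum(positive_log_probs) / len(positive_log_probs)


# Two positives scored above two negatives give a lower loss than the reverse.
mask = [True, True, False, False]
print(multi_positive_info_nce([0.9, 0.8, 0.1, 0.2], mask))
print(multi_positive_info_nce([0.1, 0.2, 0.9, 0.8], mask))
```

Averaging over positives keeps queries with many positives, such as those in ClinicalTrial and DORIS-MAE, from dominating the batch loss.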
Exclude NanoBIRCO evaluation queries, positives, qrels, and direct synthetic variants. For synthetic data, generate hard negatives that share topic or vocabulary but fail a key requirement: wrong trial eligibility, wrong stance, wrong literary passage, wrong abstract contribution, or wrong book identity.
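A simple first pass at the exclusion step is to fingerprint every evaluation text and drop training pairs that collide with one. The sketch below assumes training data is a list of (query, document) string pairs and that evaluation queries and positives are available as plain strings; both assumptions and all function names are illustrative. It only catches near-verbatim copies of English text, so paraphrased synthetic variants would need fuzzy or embedding-based matching on top.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse punctuation and whitespace so trivial edits still match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def fingerprint(text):
    """Stable hash of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def filter_training_pairs(train_pairs, eval_texts):
    """Drop training pairs whose query or document matches an evaluation text."""
    blocked = {fingerprint(text) for text in eval_texts}
    return [
        (query, doc)
        for query, doc in train_pairs
        if fingerprint(query) not in blocked and fingerprint(doc) not in blocked
    ]


# Hypothetical example: the first pair repeats an evaluation query with different casing.
eval_texts = ["Find a trial for this patient."]
train_pairs = [
    ("find a trial   for THIS patient", "doc a"),
    ("other query", "doc b"),
]
print(filter_training_pairs(train_pairs, eval_texts))
```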
Source Reference Table
| Source | Year | Type | URL |
| --- | --- | --- | --- |
| BIRCO: A Benchmark of Information Retrieval Tasks with Complex Objectives | 2024 | paper | https://arxiv.org/abs/2402.14151 |
Metadata Summary
| Field | Value |
| --- | --- |
| Task pages | 5 |
| Queries | 408 |
| Split-local documents | 18,789 |
| Positive qrels | 2,909 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 7.13 |
Task Metadata Summary
| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| NanoBIRCOArguAna | NanoBIRCO | en | natural_language | 98 | 3,081 | 98 | 0.4293 | 0.5062 | 0.4932 | Dense |
| NanoBIRCOClinicalTrial | NanoBIRCO | en | natural_language | 50 | 3,375 | 1,042 | 0.1322 | 0.2152 | 0.1959 | Dense |
| NanoBIRCODorisMae | NanoBIRCO | en | natural_language | 60 | 5,544 | 1,569 | 0.3866 | 0.4140 | 0.4012 | Dense |
| NanoBIRCORelic | NanoBIRCO | en | natural_language | 100 | 5,023 | 100 | 0.1314 | 0.0725 | 0.1276 | BM25 |
| NanoBIRCOWTB | NanoBIRCO | en | natural_language | 100 | 1,766 | 100 | 0.2669 | 0.2714 | 0.3376 | Reranking hybrid |