HAKARI-Bench

NanoIFIR

Overview

NanoIFIR is the compact Nano subset of IFIR, an instruction-following retrieval benchmark for expert-domain search. It covers legal retrieval, clinical decision support, finance QA, medical and nutrition retrieval, precision-medicine trial matching, and scientific-evidence retrieval. The queries are often instructions, fact patterns, or case descriptions rather than plain keyword searches.

The group is useful because a topically related document can still be wrong. A legal result must satisfy the precedent need, a clinical result must match the patient or decision context, a precision-medicine result must satisfy trial eligibility, and a scientific result must provide evidence for the claim. BM25 exposes when expert terminology is enough; dense retrieval tests semantic and instruction-following alignment; reranking_hybrid shows where exact domain anchors and semantic constraints recover complementary candidates.

What This Group Measures

The paper "IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval" introduces retrieval tasks where expert-domain instructions matter. NanoIFIR samples seven task families from that setting: AILA-style legal retrieval, clinical decision support, FiQA, FIRE legal retrieval, NFCorpus, precision-medicine patient-to-trial retrieval, and SciFact.

The shared measurement target is instruction-sensitive expert retrieval. The model must not only identify the topic but also respect the requested evidence type, domain constraint, patient profile, legal issue, or scientific claim.

Task Families

Dataset Shape

NanoIFIR contains 7 task pages, 637 queries, 48,246 split-local documents, and 3,872 positive qrel rows. Every task is multi-positive in the current metadata. Precision medicine has the densest relevance set, averaging more than 20 positive clinical trials per query.
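The totals and the positives-per-query density above can be re-derived from the per-task counts. The following is a minimal sketch that copies the counts from the Task Summary table; the names `TASKS` and `positives_per_query` are illustrative and not part of any benchmark tooling.

```python
# Per-task counts copied from the Task Summary table: (queries, docs, positives).
TASKS = {
    "NanoIFIRAila": (40, 2914, 119),
    "NanoIFIRCds": (42, 10000, 466),
    "NanoIFIRFiQA": (200, 10000, 1010),
    "NanoIFIRFire": (167, 1739, 563),
    "NanoIFIRNFCorpus": (86, 3593, 242),
    "NanoIFIRPm": (59, 10000, 1217),
    "NanoIFIRScifact": (43, 10000, 255),
}


def positives_per_query(tasks):
    """Return the mean number of positive qrel rows per query for each task."""
    return {name: pos / queries for name, (queries, _docs, pos) in tasks.items()}


# Group-level totals, summed over the seven tasks.
total_queries = sum(q for q, _, _ in TASKS.values())
total_docs = sum(d for _, d, _ in TASKS.values())
total_positives = sum(p for _, _, p in TASKS.values())

# Per-task density and the task with the densest relevance set.
density = positives_per_query(TASKS)
densest_task = max(density, key=density.get)

for name, value in sorted(density.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {value:.2f} positives per query")
```

Sorting by density makes it easy to see which tasks reward deep ranking of many positives and which have only a few positives per query.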

The group mixes very long expert queries with short user-style questions. NanoIFIRAila and NanoIFIRFire have legal queries averaging thousands of characters, while finance, NFCorpus, and SciFact queries are much shorter. Documents are also long in the legal tasks, especially FIRE, where judgments or case records can be tens of thousands of characters. This makes NanoIFIR a joint test of domain expertise, long-text handling, and multi-positive ranking.

Retrieval Behavior

BM25 Profile

BM25 is strongest on NanoIFIRScifact, where scientific claims and evidence abstracts often share domain terms. It is also useful on NanoIFIRPm and NanoIFIRFire, helped by biomedical or legal vocabulary and multiple positives. BM25 is weakest on NanoIFIRAila, where long legal fact patterns and judgments require more than term overlap.

Sparse retrieval can find domain-near documents, but instruction satisfaction is harder. A legal document can share many words and still be the wrong precedent; a trial can share disease terms and still be ineligible.

Dense Profile

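Dense retrieval is the best profile on three of the seven NanoIFIR tasks (NanoIFIRCds, NanoIFIRFiQA, and NanoIFIRNFCorpus) and beats BM25 on a fourth, NanoIFIRPm, where reranking_hybrid is only marginally ahead. It improves clinical decision support, finance, NFCorpus, and precision medicine by matching the semantic constraints of the query. It is especially useful when the answer passage or trial record does not repeat the user's wording.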

Dense retrieval is not always enough. Legal retrieval remains hard because long fact patterns and precedential relevance can be subtle, and SciFact shows that exact scientific terms can still be decisive.

Reranking Hybrid Profile

reranking_hybrid is the best profile on NanoIFIRFire, NanoIFIRPm, and NanoIFIRScifact in the current metadata. These tasks benefit from both exact domain anchors and semantic evidence matching. In multi-positive expert retrieval, the hybrid pool can be valuable even when dense has a slightly higher nDCG@10, because the reranker can only promote documents that are already in the candidate pool.

For reranker experiments, NanoIFIR should be evaluated with recall and listwise ranking in mind. Finding one relevant document is not enough when a query has many valid cases, papers, or trials.
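The effect of candidate coverage on the reranker ceiling can be illustrated with a toy pool. This is a sketch under stated assumptions: the source does not describe how reranking_hybrid builds its candidate pool, so the simple top-k union below, the helper names `hybrid_pool` and `recall_ceiling`, and the document ids are all hypothetical.

```python
def hybrid_pool(bm25_ranked, dense_ranked, k):
    """Union of the top-k doc ids from each retriever, de-duplicated in order.

    Assumption: a plain union; the benchmark's actual pooling may differ.
    """
    pool = []
    seen = set()
    for doc_id in bm25_ranked[:k] + dense_ranked[:k]:
        if doc_id not in seen:
            seen.add(doc_id)
            pool.append(doc_id)
    return pool


def recall_ceiling(candidates, positives):
    """Fraction of positives present among the candidates.

    A reranker only reorders candidates, so it cannot retrieve more
    positives than this fraction.
    """
    if not positives:
        return 0.0
    return len(set(candidates) & set(positives)) / len(positives)


# Toy query with four positives (ids starting with "d") and some distractors.
positives = {"d1", "d2", "d3", "d4"}
bm25_ranked = ["d1", "x1", "d2", "x2"]
dense_ranked = ["d3", "d1", "x3", "d4"]
k = 3

bm25_only = recall_ceiling(bm25_ranked[:k], positives)
dense_only = recall_ceiling(dense_ranked[:k], positives)
pooled = recall_ceiling(hybrid_pool(bm25_ranked, dense_ranked, k), positives)
print(bm25_only, dense_only, pooled)
```

In this toy case each retriever alone covers half of the positives at k = 3, while the union covers three of the four, because the two lists contribute different relevant documents.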

Task Summary

| Task | Retrieval focus | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|
| NanoIFIRAila | legal fact pattern to prior case | 40 | 2,914 | 119 | 0.0988 | 0.0878 | 0.0798 | BM25 |
| NanoIFIRCds | clinical case to biomedical evidence | 42 | 10,000 | 466 | 0.2258 | 0.4073 | 0.3376 | Dense |
| NanoIFIRFiQA | finance question to advice passage | 200 | 10,000 | 1,010 | 0.3422 | 0.5328 | 0.4678 | Dense |
| NanoIFIRFire | legal case summary to precedent | 167 | 1,739 | 563 | 0.3566 | 0.3421 | 0.3996 | Reranking hybrid |
| NanoIFIRNFCorpus | health topic to medical research | 86 | 3,593 | 242 | 0.3338 | 0.4580 | 0.4108 | Dense |
| NanoIFIRPm | patient profile to clinical trial | 59 | 10,000 | 1,217 | 0.4232 | 0.5448 | 0.5468 | Reranking hybrid |
| NanoIFIRScifact | scientific claim to evidence abstract | 43 | 10,000 | 255 | 0.8682 | 0.8516 | 0.9055 | Reranking hybrid |
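The "Best profile" column follows directly from the three nDCG@10 columns. A small sketch, with the scores copied from the table and the names `SCORES` and `best_profile` chosen only for illustration, recomputes the winner per task and how often each profile wins:

```python
from collections import Counter

# nDCG@10 per retrieval profile, copied from the Task Summary table.
SCORES = {
    "NanoIFIRAila": {"BM25": 0.0988, "Dense": 0.0878, "Reranking hybrid": 0.0798},
    "NanoIFIRCds": {"BM25": 0.2258, "Dense": 0.4073, "Reranking hybrid": 0.3376},
    "NanoIFIRFiQA": {"BM25": 0.3422, "Dense": 0.5328, "Reranking hybrid": 0.4678},
    "NanoIFIRFire": {"BM25": 0.3566, "Dense": 0.3421, "Reranking hybrid": 0.3996},
    "NanoIFIRNFCorpus": {"BM25": 0.3338, "Dense": 0.4580, "Reranking hybrid": 0.4108},
    "NanoIFIRPm": {"BM25": 0.4232, "Dense": 0.5448, "Reranking hybrid": 0.5468},
    "NanoIFIRScifact": {"BM25": 0.8682, "Dense": 0.8516, "Reranking hybrid": 0.9055},
}


def best_profile(profile_scores):
    """Return the profile name with the highest nDCG@10 for one task."""
    return max(profile_scores, key=profile_scores.get)


# Winner per task, and a tally of wins per profile.
best = {task: best_profile(scores) for task, scores in SCORES.items()}
wins = Counter(best.values())

# Margin between the best and second-best profile, to flag near-ties.
margins = {}
for task, scores in SCORES.items():
    top, runner_up = sorted(scores.values(), reverse=True)[:2]
    margins[task] = round(top - runner_up, 4)

for task in SCORES:
    print(f"{task}: {best[task]} (margin {margins[task]:.4f})")
```

The margins are worth reading alongside the winners: on NanoIFIRPm the hybrid and dense profiles are separated by only 0.0020 nDCG@10, so the "best profile" label there reflects a near-tie.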

Interpretation Notes for Model Researchers

NanoIFIR should be read as an expert retrieval and instruction-following benchmark. The model has to satisfy the retrieval instruction, not merely find the same topic. This is most visible in legal and clinical tasks, where the wrong precedent or wrong trial can look lexically similar.

Because every task is multi-positive, nDCG@10 and Recall@100 are more informative than hit@10 alone. High hit@10 can hide poor ranking of the broader evidence set. Compare BM25, dense, and hybrid profiles by domain: legal, clinical, finance, and scientific retrieval have different failure modes.
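The gap between hit@10 and the other two metrics is easy to see on a toy ranking. The sketch below assumes binary relevance (the source does not state whether the qrels are graded) and uses the standard log2-discounted nDCG; the function names and document ids are illustrative.

```python
import math


def hit_at_k(ranked, positives, k):
    """1.0 if at least one positive appears in the top k, else 0.0."""
    return 1.0 if any(doc in positives for doc in ranked[:k]) else 0.0


def recall_at_k(ranked, positives, k):
    """Fraction of all positives that appear in the top k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def ndcg_at_k(ranked, positives, k):
    """Binary-relevance nDCG@k with a log2(rank + 1) discount (ranks from 1)."""
    dcg = sum(
        1.0 / math.log2(index + 2)
        for index, doc in enumerate(ranked[:k])
        if doc in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(index + 2) for index in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# A query with 20 positives, similar in density to the precision-medicine task.
positives = {f"p{i}" for i in range(20)}
# The system ranks one positive first and then 99 non-relevant documents.
ranked = ["p0"] + [f"n{i}" for i in range(99)]

hit10 = hit_at_k(ranked, positives, 10)
recall100 = recall_at_k(ranked, positives, 100)
ndcg10 = ndcg_at_k(ranked, positives, 10)
print(hit10, recall100, round(ndcg10, 4))
```

Here hit@10 is perfect, yet only one of the twenty positives is retrieved in the top 100 and nDCG@10 stays low, which is the failure pattern that hit@10 alone would hide.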

Training and Leakage Notes

Useful training data includes IFIR-style instruction-query retrieval pairs, legal case retrieval, clinical decision support retrieval, FiQA-style finance QA, NFCorpus-style medical search, precision-medicine patient-to-trial matching, and SciFact claim-evidence pairs. Training objectives should preserve multiple relevant documents per query.

Exclude NanoIFIR evaluation queries, positives, qrels, legal cases, clinical trials, scientific abstracts, and direct synthetic variants. Expert-domain datasets are often reused across benchmarks, so source split and text-overlap audits are important before training.
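A text-overlap audit can start from something as simple as normalized exact matching plus word n-gram overlap. The following is a minimal sketch, not a prescribed procedure: the 5-gram shingle size, the 0.5 Jaccard threshold, the function names, and the example texts are all assumptions chosen for illustration, and the pairwise loop is only practical for small samples.

```python
import hashlib
import re


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def fingerprint(text):
    """Hash of the normalized text, used to detect exact duplicates."""
    return hashlib.sha1(normalize(text).encode("utf-8")).hexdigest()


def shingles(text, n=5):
    """Set of word n-grams; texts shorter than n words yield one short shingle."""
    tokens = normalize(text).split()
    if not tokens:
        return set()
    if len(tokens) < n:
        return {tuple(tokens)}
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def find_leaks(train_texts, eval_texts, threshold=0.5, n=5):
    """Return (train_index, eval_index, reason) for exact or near duplicates."""
    eval_prints = {fingerprint(t): j for j, t in enumerate(eval_texts)}
    eval_shingles = [shingles(t, n) for t in eval_texts]
    leaks = []
    for i, text in enumerate(train_texts):
        fp = fingerprint(text)
        if fp in eval_prints:
            leaks.append((i, eval_prints[fp], "exact"))
            continue
        train_shingles = shingles(text, n)
        for j, candidate in enumerate(eval_shingles):
            if jaccard(train_shingles, candidate) >= threshold:
                leaks.append((i, j, "near"))
    return leaks


# Hypothetical texts for illustration only.
eval_texts = ["Does aspirin reduce the risk of heart attack in older adults?"]
train_texts = [
    "does aspirin reduce the risk of heart attack in older adults",
    "Does aspirin reduce the risk of heart attack in older adults with diabetes?",
    "What is the capital gains tax on a home sale?",
]
leaks = find_leaks(train_texts, eval_texts)
print(leaks)
```

The same check applies to documents as well as queries; for corpus-scale audits, an approximate method such as MinHash would replace the pairwise comparison.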

Source Reference Table

| Source | Year | Type | URL |
|---|---|---|---|
| IFIR: A Comprehensive Benchmark for Evaluating Instruction-Following in Expert-Domain Information Retrieval | 2025 | paper | https://aclanthology.org/2025.naacl-long.511/ |
| Overview of the FIRE 2019 AILA Track: Artificial Intelligence for Legal Assistance | 2019 | paper | https://ceur-ws.org/Vol-2517/T1-1.pdf |

Metadata Summary

| Field | Value |
|---|---|
| Task pages | 7 |
| Queries | 637 |
| Split-local documents | 48,246 |
| Positive qrels | 3,872 |
| Languages | en |
| Categories | natural_language |
| Positives / query avg | 6.08 |

Task Metadata Summary

| Task | Backing dataset | Lang | Category | Queries | Docs | Positives | BM25 nDCG@10 | Dense nDCG@10 | Reranking hybrid nDCG@10 | Best profile |
|---|---|---|---|---|---|---|---|---|---|---|
| NanoIFIRAila | NanoIFIR | en | natural_language | 40 | 2,914 | 119 | 0.0988 | 0.0878 | 0.0798 | BM25 |
| NanoIFIRCds | NanoIFIR | en | natural_language | 42 | 10,000 | 466 | 0.2258 | 0.4073 | 0.3376 | Dense |
| NanoIFIRFiQA | NanoIFIR | en | natural_language | 200 | 10,000 | 1,010 | 0.3422 | 0.5328 | 0.4678 | Dense |
| NanoIFIRFire | NanoIFIR | en | natural_language | 167 | 1,739 | 563 | 0.3566 | 0.3421 | 0.3996 | Reranking hybrid |
| NanoIFIRNFCorpus | NanoIFIR | en | natural_language | 86 | 3,593 | 242 | 0.3338 | 0.4580 | 0.4108 | Dense |
| NanoIFIRPm | NanoIFIR | en | natural_language | 59 | 10,000 | 1,217 | 0.4232 | 0.5448 | 0.5468 | Reranking hybrid |
| NanoIFIRScifact | NanoIFIR | en | natural_language | 43 | 10,000 | 255 | 0.8682 | 0.8516 | 0.9055 | Reranking hybrid |