HAKARI-Bench

Paper

HAKARI-Bench: A Lightweight Benchmark for Comparing Retrieval Architectures and Efficiency Settings under Unified Conditions

Benchmark documentation

Dataset and benchmark task group descriptions used by the leaderboard viewer.

NanoMMTEB-v2
NanoMMTEB-v2 is a compact retrieval group drawn from multilingual MTEB/MMTEB retrieval tasks. It is intentionally heterogeneous: legal statute retrieval, counterargument retrieval, multilingual reading comprehension, Chinese COVID policy search, FAQ-style retrieval, legal bill retrieval, long-context passkey retrieval, MIRACL, MLQA, scientific related-paper retrieval, spatial and temporal reasoning, StackOverflow QA, StatCan dialogue-to-table retrieval, TREC-COVID, Danish Twitter advice retrieval, multilingual Wikipedia QA, and WinoGrande-style referent retrieval all appear in one group. The group is useful as a mixed-domain multilingual stress test. It does not isolate one source benchmark, one language, or one relevance relation. BM25 identifies tasks where answer text, legal terms, or web vocabulary repeat; dense retrieval identifies semantic, multilingual, and reasoning-style gains; reranking hybrid highlights tasks where exact anchors and semantic candidates recover different positives.
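The idea that sparse and dense candidates "recover different positives" can be illustrated by comparing the recall of each candidate list with the recall of their union. This is a minimal sketch under stated assumptions: the document ids and ranked lists are made up, the helper names `recall_at_k` and `union_pool` are hypothetical, and the plain union is only a stand-in, not the benchmark's actual reranking hybrid pipeline.

```python
from typing import Iterable, List, Set


def recall_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Fraction of positive documents found in the top-k of one ranked list."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


def union_pool(lists: Iterable[List[str]], k: int) -> Set[str]:
    """Union of the top-k candidates of several retrievers (order ignored)."""
    pool: Set[str] = set()
    for ranked in lists:
        pool.update(ranked[:k])
    return pool


# Toy data, invented for illustration only (not benchmark output).
positives = {"d1", "d2", "d3", "d4"}
bm25_top = ["d1", "d9", "d2", "d8"]   # lexical anchors reach d1 and d2
dense_top = ["d3", "d1", "d7", "d6"]  # semantic matches reach d3 and d1

k = 4
bm25_recall = recall_at_k(bm25_top, positives, k)
dense_recall = recall_at_k(dense_top, positives, k)
pool = union_pool([bm25_top, dense_top], k)
pool_recall = len(pool & positives) / len(positives)

print(bm25_recall, dense_recall, pool_recall)
```

In this toy case each retriever alone finds half of the positives, while the combined pool finds three of four, which is the kind of complementarity a second-stage reranker can exploit.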
NanoRTEB
NanoRTEB is the Nano task group for the open English portion of RTEB, the Retrieval Embedding Benchmark. It covers 14 retrieval tasks across legal, finance, code, healthcare, and technical document settings. The group is intentionally heterogeneous: some tasks retrieve statutes or case law from long legal fact patterns, some retrieve financial filing evidence, some retrieve code or SQL from programming questions, and some retrieve medical or developer-support answers. The group contains 2,390 queries, 33,864 task-local documents, and 9,150 positive qrel rows. It evaluates retrieval-first behavior across practical domains rather than one academic task family. Models must handle long queries, short queries, long documents, code snippets, SQL strings, tables, multi-positive medical evidence, and enterprise-style technical search.
MNanoBEIR
MNanoBEIR is the multilingual NanoBEIR group: a grid of compact BEIR style retrieval tasks across Arabic, German, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, Serbian, Swedish, Thai, and Vietnamese. Each language variant contains the same thirteen source tasks, so the group separates two questions that are often mixed together: whether a model understands the underlying BEIR retrieval relation, and whether that behavior survives in non English text. The source task mix is deliberately heterogeneous. Some tasks retrieve duplicate questions, some retrieve Wikipedia evidence, some retrieve biomedical or scientific documents, some retrieve debate arguments, and some retrieve answer bearing web passages. A single average score is therefore not enough to understand the group. The useful reading is by language, task family, and retrieval profile. BM25 exposes exact term and named entity dependence; dense retrieval exposes semantic transfer and paraphrase handling; reranking hybrid shows where sparse and dense candidates complement each other.
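Reading the group by language and by task family rather than by one average can be sketched as a simple two-way aggregation. The scores, language codes, and task labels below are placeholders invented for illustration, and `mean_by` is a hypothetical helper; only the 13 x 13 grid size comes from the description above.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Tuple

# Placeholder scores keyed by (language, source task). Values are invented.
scores: Dict[Tuple[str, str], float] = {
    ("de", "task_a"): 0.75,
    ("de", "task_b"): 0.25,
    ("ja", "task_a"): 0.50,
    ("ja", "task_b"): 0.00,
}


def mean_by(table: Dict[Tuple[str, str], float], axis: int) -> Dict[str, float]:
    """Average scores over one axis of the key: 0 = language, 1 = task."""
    buckets = defaultdict(list)
    for key, value in table.items():
        buckets[key[axis]].append(value)
    return {name: mean(values) for name, values in buckets.items()}


by_language = mean_by(scores, 0)
by_task = mean_by(scores, 1)
overall = mean(scores.values())

# Thirteen languages times thirteen source tasks gives the full grid size.
grid_cells = 13 * 13

print(by_language, by_task, overall, grid_cells)
```

Here the single overall mean hides that one placeholder task is much harder than the other and that one language trails the other, which is the reason for reading the grid along both axes.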
NanoBIRCO
NanoBIRCO is the compact Nano set for BIRCO, a benchmark focused on retrieval tasks with complex objectives. The queries are often long descriptions of a goal rather than short keyword searches: refute an argument, find matching clinical trials for a patient, retrieve scientific abstracts for a nuanced research need, recover a literary quotation from its surrounding context, or identify a book from an incomplete memory. The group is useful because topical similarity is not enough. A retrieved text can share vocabulary with the query and still fail the objective: a clinical trial may exclude the patient, an abstract may study a nearby but wrong problem, a same topic argument may agree rather than rebut, and a book description may match the setting but not the remembered work. BM25 shows how far lexical overlap goes on these long queries, dense retrieval tests objective level semantic matching, and reranking hybrid indicates whether combining both signals gives a better candidate pool.
NanoMLDR
NanoMLDR is the compact Nano set for MLDR, a multilingual long document retrieval benchmark. It covers 13 monolingual retrieval splits: Arabic, German, English, Spanish, French, Hindi, Italian, Japanese, Korean, Portuguese, Russian, Thai, and Chinese. Each query is a question generated from a paragraph inside a long article, while the positive document is the full article rather than the short answer bearing paragraph. The group is useful because it isolates a difficult document level retrieval problem. The query may point to one small region of a very long same language document. A successful retriever must preserve language coverage, exact entity and phrase anchors, and enough long document representation to select the whole source article. BM25 is the dominant profile for most languages in the current metadata, dense retrieval is weaker on long document compression, and reranking hybrid is useful where sparse and dense candidates recover different long documents.
NanoLongEmbed
NanoLongEmbed is the compact long context retrieval group derived from LongEmbed. It tests whether a retriever can select the correct long document when evidence may be buried in books, scripts, meeting transcripts, Wikipedia bundles, or synthetic long contexts. The group includes both real long document retrieval tasks and synthetic passkey or needle settings. This group is different from ordinary passage retrieval. Documents are often tens of thousands of characters long, and NanoNarrativeQA contains whole narratives averaging more than 300,000 characters. A successful model must retain enough signal from a long source to identify the correct document. BM25 is unusually strong because long documents contain many distinctive names and events; dense retrieval tests long context compression; reranking hybrid shows whether semantic candidates can recover from lexical dilution.
NanoDAPFAM
NanoDAPFAM is the compact Nano set for DAPFAM, a domain aware patent family retrieval benchmark. It evaluates citation linked prior art retrieval at the patent family level. The group contains eighteen variants formed by three domain conditions (All, In, and Out), two query representations (title abstract or title abstract claims), and three target representations (title abstract, title abstract claims, or full text). The group is useful because it separates ordinary lexical patent similarity from cross domain prior art retrieval. Same domain and all domain variants give retrievers many technical anchors: components, materials, methods, and claim phrases. OUT domain variants remove shared IPC3 technical classes, so the model must retrieve cited families related by transferable mechanisms or problem solution patterns rather than by the same surface vocabulary. BM25, dense retrieval, and reranking hybrid all reveal different parts of that domain gap.
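The eighteen variants follow from the three design axes described above (3 x 2 x 3). A minimal sketch that enumerates the grid is shown here; the short labels are hypothetical shorthand chosen for this example and may not match the real task identifiers.

```python
from itertools import product

# Hypothetical labels for the three axes named in the description.
DOMAINS = ("all", "in", "out")
QUERY_REPRS = ("title_abstract", "title_abstract_claims")
TARGET_REPRS = ("title_abstract", "title_abstract_claims", "full_text")

# One dict per variant: every combination of domain, query, and target.
variants = [
    {"domain": d, "query": q, "target": t}
    for d, q, t in product(DOMAINS, QUERY_REPRS, TARGET_REPRS)
]

# The cross-domain subset discussed above: only the OUT-domain variants.
out_domain = [v for v in variants if v["domain"] == "out"]

print(len(variants), len(out_domain))
```

Filtering on one axis, as with `out_domain`, is a convenient way to compare the same query and target representations across domain conditions.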
NanoCoIR
NanoCoIR is the compact Nano set for CoIR, a code information retrieval benchmark. It covers ten English code oriented retrieval settings: natural language developer requests retrieving code, code retrieving text, code retrieving code, programming dialogue retrieving assistant responses, StackOverflow style QA, and Text to SQL retrieval. The group is useful because it does not reduce code retrieval to one query shape. The CoIR setting treats code retrieval as a family of format mismatches. Developer intent, program behavior, identifiers, API usage, SQL schemas, dialogue history, and code summaries can all be the relevant signal. BM25 shows where identifiers and repeated technical terms dominate, dense retrieval tests whether program semantics and developer intent align, and reranking hybrid shows whether exact code tokens and semantic similarity recover different candidate sets.
NanoIFIR
NanoIFIR is the compact Nano subset of IFIR, an instruction following retrieval benchmark for expert domain search. It covers legal retrieval, clinical decision support, finance QA, medical and nutrition retrieval, precision medicine trial matching, and scientific evidence retrieval. The queries are often instructions, fact patterns, or case descriptions rather than plain keyword searches. The group is useful because a topically related document can still be wrong. A legal result must satisfy the precedent need, a clinical result must match the patient or decision context, a precision medicine result must satisfy trial eligibility, and a scientific result must provide evidence for the claim. BM25 exposes when expert terminology is enough; dense retrieval tests semantic and instruction following alignment; reranking hybrid shows where exact domain anchors and semantic constraints recover complementary candidates.
NanoLaw
NanoLaw is a compact legal retrieval group spanning English, German, and Chinese legal data. It includes Indian precedent and statute retrieval, German legal passage and QA retrieval, Chinese criminal case retrieval, LegalBench derived consumer contract and corporate lobbying retrieval, and plain English contract summary retrieval. The group is useful because legal retrieval is not one search pattern. Some tasks map long fact scenarios to cases or statutes. Others match contract questions to clauses, bill descriptions to summaries, German questions to judgments, or Chinese criminal cases to related cases. A model can be topically close and still be wrong if it misses jurisdiction, legal role, statutory function, contract obligation, procedural posture, or case analogy. BM25, dense retrieval, and reranking hybrid expose different legal matching behaviors.
NanoMedical
NanoMedical is a multilingual medical, biomedical, and public health retrieval group. It covers Chinese medical consultation answer selection, clinician-facing clinical passage retrieval, consumer medical FAQ retrieval, nutrition and health literature search, public health FAQ retrieval in Arabic, scientific claim evidence retrieval, and COVID-19 literature retrieval in English and Polish. The group is useful because it treats medical retrieval as several different evidence-matching problems rather than one domain. The group contains 1,586 queries, 66,052 task-local documents, and 10,438 positive qrel rows. It should be read as a retrieval benchmark, not as a clinical decision tool. Some tasks retrieve scientific abstracts, some retrieve public guidance, and some retrieve online consultation answers. Those settings have different risks, document styles, and training requirements.
NanoRARb
NanoRARb is the Nano task group for RAR-b, the Reasoning as Retrieval Benchmark. It converts reasoning problems into retrieval tasks: the query is a question, story context, code prompt, math problem, spatial scene, or temporal reasoning prompt, and the relevant document is the correct answer, continuation, entity, solution, or implementation from a large answer pool. The group tests whether a retriever can rank logically correct answers, not just topically related text. The group contains 3,400 queries, 156,037 task-local documents, and 3,584 positive qrel rows. Most tasks are single-positive; NanoSpartQA is the main multi-positive exception. Documents are often short answer strings, while some queries are very long, especially the TempReason context tasks. This makes the group a compact stress test for reasoning-level semantic retrieval.
NanoBRIGHT
NanoBRIGHT is the compact Nano set for BRIGHT, a reasoning intensive retrieval benchmark. It contains English retrieval tasks from math problem solving, theorem use, programming, StackExchange style evidence retrieval, and long document web evidence retrieval. The positive document is often useful because it supports a reasoning step, not because it paraphrases the query. This group is useful for evaluating whether retrievers can connect a query to a mechanism, theorem, algorithm, cited source, API behavior, or supporting evidence. Many queries contain enough domain vocabulary for BM25 to find topical neighbors, but topical neighbors are often wrong. Dense retrieval tests whether embedding similarity captures the hidden reasoning relation, and reranking hybrid is valuable when exact technical terms and semantic problem structure recover different positives.
NanoCodeRAG
NanoCodeRAG is the compact Nano set for CodeRAG-Bench, a benchmark for retrieval-augmented code generation. The group evaluates whether a retriever can find programming context that would help a generation model answer a developer request. It covers four source genres: Python library documentation, online tutorials, compact programming solutions, and Stack Overflow-style posts. The tasks are all English code retrieval tasks, but their retrieval shapes are very different. Documentation and tutorials are long explanatory resources; Stack Overflow posts mix problem statements, answers, and code; and programming solutions are short snippets whose semantics may not repeat the query wording. BM25 exposes source genres where exact API names or topic words dominate; dense retrieval tests whether developer intent can be connected to code or long prose; reranking hybrid shows when exact code tokens and semantic matching form a better candidate pool.
NanoChemTEB
NanoChemTEB is the compact chemistry-domain retrieval group. It combines two chemistry-filtered QA retrieval tasks, NanoChemHotpotQA and NanoChemNQ, with NanoChemRxiv, a chemical literature retrieval task over ChemRxiv-style paragraphs. The shared challenge is not English retrieval in general, but retrieval under chemical names, compounds, reactions, materials, methods, and scientific passage style. The group contrasts Wikipedia-derived QA evidence with chemistry literature retrieval. The QA tasks pose familiar question-to-passage retrieval problems in a chemistry-heavy subset. The ChemRxiv task is closer to scientific literature search, where exact chemical terminology can be very informative but the passage may be longer and more technical. BM25, dense retrieval, and reranking hybrid are all strong on this group, so the interesting signal is which task benefits from exact chemical terms, semantic answerability, or a combined candidate pool.
NanoR2MED
NanoR2MED is the Nano task group for R2MED, a reasoning driven medical retrieval benchmark. It contains eight English retrieval tasks spanning biomedical StackExchange style reference search, diagnostic and examination evidence retrieval, treatment evidence retrieval, and clinical case retrieval. The group is deliberately difficult because many queries require an implicit medical, scientific, or clinical inference before the relevant passage can be identified. The group contains 876 queries, 80,000 task local documents, and 2,678 positive qrel rows. Every task uses a 10,000 document candidate pool, but query count and positive density vary. NanoR2MED should be treated as a research evaluation resource, not as a clinical decision system.
NanoBuiltBench
NanoBuiltBench is a compact English benchmark for built asset information retrieval. It evaluates whether a model can align architecture, engineering, construction, and operations terminology across entity descriptions and classification system descriptions. The retrieval target is not generic web relevance: the model must connect IFC style building, infrastructure, product, equipment, or facility management entities to relevant Uniclass style product or class descriptions. The group has two tasks: a broader retrieval split and a reranking oriented variant. Both are terminology heavy and multi positive. A single asset can map to several acceptable classifications, and a correct match may depend on function, hierarchy, material, or system context rather than exact wording. BM25 measures how far controlled vocabulary overlap goes, dense retrieval tests semantic alignment across taxonomy language, and reranking hybrid shows whether exact terms and embedding similarity recover complementary candidates.
NanoCMTEB
NanoCMTEB is a compact Chinese retrieval group based on C-MTEB retrieval tasks. It covers medical consultation retrieval, COVID policy and news retrieval, general Chinese web passage retrieval, e-commerce product retrieval, translated MS MARCO-style passage ranking, T2Ranking, and entertainment video retrieval. Most queries are short Chinese search intents, while documents range from product or video titles to long web and policy passages. The group is useful because it tests Chinese retrieval across several practical domains rather than one homogeneous web search setting. Short queries, mixed scripts, product codes, translated MS MARCO artifacts, medical wording, and multi-positive relevance sets all appear. BM25 measures the strength of Chinese term and phrase overlap, dense retrieval tests intent matching across terse queries and domain language, and reranking hybrid shows where sparse and dense signals recover complementary candidates.
NanoIndicQA
NanoIndicQA is a language specific Nano benchmark for IndicQA retrieval. It covers eleven Indic language splits: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Odia, Punjabi, Tamil, and Telugu. Each split turns an IndicQA reading comprehension example into retrieval: the query is a question in the target language, and the positive document is the context paragraph containing the evidence needed to answer it. The group is useful as a controlled multilingual passage selection benchmark. All languages share the same retrieval shape, so differences mainly reflect script, morphology, paragraph length, named entities, and model coverage for Indic languages. BM25 shows how far exact same language term matching goes, dense retrieval tests cross script semantic passage matching, and reranking hybrid shows whether sparse and dense candidates complement each other in small paragraph pools.
NanoMuPLeR
NanoMuPLeR is a language-specific, translated/parallel legal retrieval benchmark for MuPLeR. It derives from European Union legal text and covers 14 European languages: Greek, English, Spanish, Finnish, French, Italian, Lithuanian, Latvian, Dutch, Polish, Portuguese, Slovak, Slovenian, and Swedish. Each split contains same-language synthetic legal queries and DGT-Acquis-derived parallel legal passages. The group contains 2,800 queries, 140,000 task-local documents, and 2,800 positive qrel rows. Every language has exactly 200 queries, 10,000 documents, and one positive per query. This parallel construction makes NanoMuPLeR useful for comparing whether a retrieval model preserves legal search quality across languages, scripts, morphology, and translation variation.
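The group totals follow directly from the per-language figures. The short check below recomputes them from the numbers stated above; the variable names are illustrative only.

```python
# The fourteen languages listed in the description.
LANGUAGES = [
    "Greek", "English", "Spanish", "Finnish", "French", "Italian",
    "Lithuanian", "Latvian", "Dutch", "Polish", "Portuguese",
    "Slovak", "Slovenian", "Swedish",
]

# Per-language figures as stated in the description.
QUERIES_PER_LANGUAGE = 200
DOCUMENTS_PER_LANGUAGE = 10_000
POSITIVES_PER_QUERY = 1

total_queries = len(LANGUAGES) * QUERIES_PER_LANGUAGE
total_documents = len(LANGUAGES) * DOCUMENTS_PER_LANGUAGE
total_positives = total_queries * POSITIVES_PER_QUERY

print(total_queries, total_documents, total_positives)
```

Because every split has the same shape, per-language score differences are not confounded by pool size or positive density.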
NanoMTEB-v2
NanoMTEB-v2 is the English retrieval group derived from MTEB/BEIR-style retrieval tasks. It combines ten compact splits covering counterargument retrieval, claim evidence retrieval, StackExchange duplicate question retrieval, financial QA, multi-hop Wikipedia QA, scientific paper relatedness, controversial question argument retrieval, and biomedical literature search. The group is useful because it is not a single English passage retrieval task: the relevant item may be a counterargument, evidence passage, duplicate question, answer passage, related paper, argument, or literature record. The group contains 1,698 queries, 98,626 task-local documents, and 10,158 positive qrel rows. Most tasks are multi-positive, but the number of positives varies sharply: the argu ana task is single-positive, while treccovid has 4,584 positives for only 50 queries. Aggregate scores therefore mix exact-target retrieval, many-relevant-document ranking, and relation types that are not simple semantic similarity.
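The uneven positive density can be made concrete by dividing positives by queries, using only the counts stated above. The helper name `positives_per_query` is hypothetical.

```python
def positives_per_query(positives: int, queries: int) -> float:
    """Average number of positive qrel rows per query."""
    return positives / queries


# Counts taken from the group description.
GROUP_QUERIES, GROUP_POSITIVES = 1_698, 10_158
TRECCOVID_QUERIES, TRECCOVID_POSITIVES = 50, 4_584

group_density = positives_per_query(GROUP_POSITIVES, GROUP_QUERIES)
treccovid_density = positives_per_query(TRECCOVID_POSITIVES, TRECCOVID_QUERIES)

# Density of the remaining nine splits once treccovid is set aside.
rest_density = positives_per_query(
    GROUP_POSITIVES - TRECCOVID_POSITIVES,
    GROUP_QUERIES - TRECCOVID_QUERIES,
)

print(round(group_density, 2), round(treccovid_density, 2), round(rest_density, 2))
```

One split with about 92 positives per query contributes roughly 45% of all positive rows, so the group-level density of about 6 overstates what the other nine splits look like (about 3.4 on average).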
NanoMTEB-Dutch
NanoMTEB-Dutch is a compact Dutch retrieval group covering translated BEIR-NL tasks, native Dutch MTEB tasks, cross-lingual Belebele retrieval, legal and medical retrieval, scientific evidence retrieval, news and tender retrieval, web FAQ retrieval, and Dutch Wikipedia retrieval. Ten of the twenty-seven tasks are Dutch CQADupStack duplicate question splits, but the group is broader than duplicate QA. The group should be read as both a Dutch language benchmark and a translation robustness benchmark. Some tasks are native Dutch resources, such as legal, news, tender, FAQ, or bibliography retrieval. Others carry BEIR-style relevance relations into Dutch. BM25 exposes exact Dutch terms and named entities; dense retrieval tests paraphrase, translation, and cross-lingual matching; reranking hybrid is useful when sparse and dense candidates recover different positives.
NanoMTEB-French
NanoMTEB French is a compact French retrieval group drawn from MTEB French and related MTEB tasks. It combines educational resource retrieval, Belgian statutory article retrieval, French Wikipedia QA passage retrieval, French Mintaka answer retrieval, Syntec collective agreement retrieval, and French English product question answering. The group tests monolingual French retrieval and cross lingual product QA in one small suite. The source tasks differ more than their French surface suggests. Alloprof and BSARD map user facing questions to long educational or legal documents. FQuAD retrieves answer bearing passages. Mintaka retrieves short entity like answers. Syntec retrieves labor agreement clauses. xPQA tests product answerability across French and English directions. BM25 exposes exact wording and named entities; dense retrieval tests French paraphrase and cross lingual matching; reranking hybrid shows where both signals are useful.
NanoMTEB-German
NanoMTEB German is a compact German retrieval group covering five tasks from the MTEB and multilingual MTEB ecosystem. It brings together legal retrieval, open domain question answering, reading comprehension context retrieval, municipal service search, and e commerce product retrieval. The group is useful because it does not describe one single German retrieval problem: it tests whether a model can move between formal legal prose, encyclopedic passages, citizen facing administrative language, and short marketplace labels. The benchmark contains 982 queries, 23,455 task local documents, and 4,959 positive qrel rows. Most tasks are single positive retrieval tasks, but xmarket de is heavily multi positive, with category queries linked to many acceptable product records. This mixture makes the group a good diagnostic for German retrieval systems that need both precise answer bearing passage retrieval and broader many relevant item ranking.
NanoJMTEB-v2
NanoJMTEB-v2 is a compact Japanese retrieval group derived from JMTEB, MTEB, and related Japanese datasets. It covers Japanese casual web search, government FAQ matching, quiz-to-entity retrieval, answer-label retrieval, MIRACL and Mr. TyDi passage retrieval, long-document retrieval, and four Japanese NLP Journal paper-component matching tasks. The group is useful because it is not simply Japanese passage retrieval. Some tasks retrieve short answer labels, some retrieve noisy web snippets or FAQ answers, some retrieve Wikipedia-like passages or full entity pages, and some match titles or abstracts to academic paper sections. BM25 exposes Japanese term and entity anchoring, dense retrieval tests semantic passage and label matching, and reranking hybrid indicates whether sparse and dense retrieval recover complementary candidates.
NanoMTEB-Korean
NanoMTEB Korean is a compact Korean retrieval group with five tasks spanning RAG evidence retrieval, implicit reasoning evidence retrieval, legal article lookup, MIRACL style Wikipedia retrieval, and KorQuAD/SQuAD style context retrieval. It is a useful group for model researchers because Korean retrieval quality is shaped by morphology, spacing variation, domain specific terms, and the difference between literal evidence matching and semantic answerability. The group contains 914 queries, 24,493 task local documents, and 1,400 positive qrel rows. autorag, lawir ko, and squad kor v1 are single positive in the Nano splits. ko strategy qa and miracl ko are multi positive, so a model can receive credit for retrieving several acceptable evidence passages. This means the group combines exact target retrieval with broader evidence list ranking.
NanoFaMTEB-v2
NanoFaMTEB v2 is the compact Persian retrieval group for FaMTEB. It covers seventeen Persian natural language retrieval tasks, including argument retrieval, fact verification evidence, finance QA, multi hop QA, MIRACL and Natural Questions style passage retrieval, MS MARCO style search, NeuCLIR news retrieval, Persian web search, duplicate question retrieval, scientific and biomedical search, synthetic QA, chatbot RAG FAQ retrieval, WebFAQ, and Wikipedia QA. The group is useful because Persian retrieval is not treated as a single uniform problem. Some tasks have short web like queries, some have paragraph or dialogue queries, and several have many relevant documents per query. A model must handle Persian script, morphology, translated benchmark artifacts, native web material, synthetic Persian data, and domain vocabulary from finance, medicine, science, news, and Wikipedia. BM25, dense retrieval, and reranking hybrid separate lexical anchoring, semantic matching, and candidate complementarity across this mix.
NanoMTEB-Polish
NanoMTEB Polish is a Polish retrieval group dominated by translated community question duplicate retrieval. Ten of its fourteen tasks are Polish CQADupStack domains, covering Android, English usage, GIS, Mathematica, Physics, Programmers, Statistics, TeX, Webmasters, and WordPress. The remaining tasks cover Polish financial QA retrieval, Natural Questions style fact retrieval, PUGG Polish Wikipedia QA retrieval, and Quora duplicate question retrieval. The group contains 2,800 queries, 140,000 task local documents, and 8,151 positive qrel rows. Every task has 200 queries and a 10,000 document candidate pool, so per task differences are easy to compare. The group is useful because it asks whether a model can retrieve Polish paraphrases and duplicates across technical domains while also handling finance, Wikipedia QA, and short duplicate question retrieval.
NanoRuMTEB
NanoRuMTEB is a compact Russian retrieval group based on ruMTEB retrieval tasks. It contains three Russian language subtasks: MIRACL style Wikipedia passage retrieval, RIA news headline to article retrieval, and RuBQ question to Wikipedia paragraph retrieval. The group is small, but it covers two important Russian retrieval modes: short question answering over Wikipedia and headline/article matching in news. The group contains 600 queries, 30,000 task local documents, and 1,113 positive qrel rows. All tasks are Russian, and all candidate pools contain 10,000 documents. The group is useful for checking whether a retriever handles Russian morphology, named entities, inflection, and native language query phrasing rather than only English or translated retrieval.
NanoMTEB-Scandinavian
NanoMTEB Scandinavian is a compact retrieval group for Danish, Norwegian, and Swedish tasks from the Scandinavian Embedding Benchmark ecosystem. It covers fact verification, extractive QA answer selection, encyclopedia article lookup, FAQ retrieval, news retrieval, and informal social question answering. The group is small in task count, but it is not a single domain benchmark: it moves from highly lexical title or evidence retrieval to conversational answer retrieval where lexical overlap is much weaker. The group contains 1,273 queries, 9,737 task local documents, and 1,753 positive qrel rows. Most tasks are monolingual within one Scandinavian language, while the group as a whole is multilingual because it spans Danish, Norwegian, and Swedish. Its value is that it tests whether a model can handle closely related North Germanic languages while preserving different source task relevance relations.
NanoMTEB-Spanish
NanoMTEB Spanish is a compact Spanish and Spanish English retrieval group. It covers complex entity answer QA, Spanish Wikipedia passage retrieval, Spanish consumer health passage and document retrieval, and product question answering in Spanish English, English Spanish, and Spanish Spanish directions. The group is useful because the target is not always a Spanish paragraph with obvious word overlap: some positives are short entity answers, health passages, or compact product snippets. The group contains 1,334 queries, 25,262 task local documents, and 4,806 positive qrel rows. It is multi positive overall, with MIRACL, Spanish Passage Retrieval, and xPQA contributing multiple relevant documents or snippets per query. This makes the group a good diagnostic for Spanish retrieval systems that need to combine semantic answerability, domain evidence, and cross lingual product matching.
NanoMTEB-Thai
NanoMTEB Thai is a compact Thai and Thai English retrieval group aligned with MTEB style task families. It includes Belebele reading comprehension retrieval in cross lingual and monolingual directions, Thai MIRACL and Mr. TyDi Wikipedia retrieval, Thai MKQA answer label retrieval, Thai long document retrieval, WebFAQ question answer retrieval, and Thai XQuAD context retrieval. The group tests Thai retrieval across script handling, word segmentation, answer granularity, cross lingual alignment, and document length. The group contains 1,800 queries, 48,356 task local documents, and 2,077 positive qrel rows. Most tasks are single positive or near single positive, but MIRACL, MKQA, and Mr. TyDi include multiple positives for some queries. It is a useful diagnostic because Thai retrieval quality changes sharply depending on whether the target is a passage, a short answer label, a long document, or a document in another language.
NanoVNMTEB
NanoVNMTEB is the Nano task group for VN-MTEB retrieval. It contains Vietnamese retrieval versions of widely used MTEB and BEIR-style tasks, including duplicate question retrieval, fact-checking evidence retrieval, web search, natural question answering, finance QA, argument retrieval, biomedical retrieval, and scientific paper retrieval. The group evaluates Vietnamese retrieval quality and robustness to translated benchmark artifacts. The group contains 4,768 queries, 247,475 task-local documents, and 24,671 positive qrel rows across 26 tasks. Most tasks are Vietnamese, while nfcorpus vn is marked multilingual because biomedical terminology and translation artifacts cross language boundaries. The group is large enough that aggregate scores can hide very different retrieval relations.
NanoMTEB-Misc
NanoMTEB Misc is a mixed multilingual retrieval group for NanoMTEB family tasks that do not belong cleanly to one language specific benchmark set. It combines NeuCLIR 2022 Persian, Russian, and Chinese news retrieval; RuSciBench Russian scientific citation and co citation retrieval; EuroPIRQ English, Finnish, and Portuguese legal passage retrieval; and German French CLSD translation pair retrieval from WMT19 and WMT21. The group contains 1,636 queries, 99,624 task local documents, and 7,538 positive qrel rows. It should be read as a stress test for mixed relevance definitions rather than as one coherent domain benchmark. Some tasks have broad multi positive news or citation relevance, while others are single positive legal question retrieval or cross lingual translation equivalence retrieval.
NanoMIRACL
NanoMIRACL is a language specific Nano benchmark for MIRACL, a multilingual ad hoc retrieval benchmark built around Wikipedia passage retrieval. The original MIRACL work covers eighteen languages and asks a monolingual retrieval question in each split: an Arabic query retrieves Arabic passages, a Japanese query retrieves Japanese passages, and so on. This group keeps that retrieval setting while making the task small enough to inspect one language at a time. The group is valuable because it holds the high level task constant while changing script, morphology, tokenization behavior, resource level, and Wikipedia coverage. The model is not translating and is not answering from a fixed article. It must rank the passage that contains the answer bearing evidence for a short natural language question. In the current Nano metadata, BM25 is often a strong lexical anchor, dense retrieval from harrier oss v1 270m is usually the best top rank signal, and reranking hybrid gives the broadest top 100 candidate coverage.
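The difference between a strong top-rank signal and broad candidate coverage can be shown with two standard metrics. The sketch below assumes binary relevance and uses nDCG@k for top-rank quality and recall@k for coverage; these metric choices, the toy rankings, and the document ids are assumptions made for illustration, not the benchmark's reported configuration.

```python
import math
from typing import List, Set


def ndcg_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k: rewards placing positives near the top."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked[:k])
        if doc_id in positives
    )
    ideal_hits = min(len(positives), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def recall_at_k(ranked: List[str], positives: Set[str], k: int) -> float:
    """Fraction of positives that appear anywhere in the top-k."""
    if not positives:
        return 0.0
    return len(set(ranked[:k]) & positives) / len(positives)


# Toy rankings, invented for illustration only.
positives = {"p1", "p2"}
top_heavy = ["p1", "x1", "x2", "x3"]  # one positive at rank 1, one missed
broad = ["x1", "x2", "p1", "p2"]      # both positives, but lower in the list

k = 4
ndcg_top_heavy = ndcg_at_k(top_heavy, positives, k)
ndcg_broad = ndcg_at_k(broad, positives, k)
recall_top_heavy = recall_at_k(top_heavy, positives, k)
recall_broad = recall_at_k(broad, positives, k)

print(ndcg_top_heavy, ndcg_broad, recall_top_heavy, recall_broad)
```

In this toy case the top-heavy list wins on nDCG while the broad list wins on recall, so a profile can lead on one reading and trail on the other.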