
Technical report · 2026

Smaller models, sharper retrieval: rethinking RAG at scale.

We studied 40 retrieval configurations across three model sizes. The result: a 7B model with disciplined retrieval beats a 70B model with naive context — at a tenth of the cost.

Download the report (PDF) · Read the findings

Atlas Lab · Technical Report

Fig. 2

arXiv:2026.01432

40 configs · 3 model sizes · Open data

Key findings

What the data showed.

10× lower cost (vs. naive 70B baseline)

+18% answer accuracy (with disciplined retrieval)

3.4× faster responses (median latency)

40 configs tested (fully reproducible)

Figures

Selected figures from the report.

Full-resolution versions and data are in the appendix.

Accuracy vs. model size

Figure 2

Cost per correct answer

Figure 4

Latency distribution

Figure 6

Retrieval ablation

Figure 9
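As a rough guide to reading Figure 4, one natural interpretation of "cost per correct answer" is total spend on a run divided by the number of questions it answered correctly. The report's exact definition may differ. The function name and all numbers below are hypothetical and are not results from the study.

```python
def cost_per_correct_answer(total_cost: float, num_correct: int) -> float:
    """Total spend divided by correct answers (assumed definition)."""
    if num_correct <= 0:
        raise ValueError("num_correct must be positive")
    return total_cost / num_correct


# Illustrative numbers only, NOT taken from the report.
run_a = cost_per_correct_answer(total_cost=12.0, num_correct=800)
run_b = cost_per_correct_answer(total_cost=30.0, num_correct=1000)

print(f"run A: {run_a:.4f} per correct answer")
print(f"run B: {run_b:.4f} per correct answer")
print(f"ratio B/A: {run_b / run_a:.1f}x")
```

Dividing by correct answers rather than by all questions means a cheap but inaccurate setup is penalized, which is why the metric is useful for comparing model sizes.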

View the full appendix

Methodology

How we ran it.

Everything here is reproducible from the public repo.

01
Dataset
12k question–answer pairs across five domains.
02
Configurations
40 retrieval setups × 3 model sizes.
03
Evaluation
Blind human grading plus automated scoring.
04
Reproduction
Seeds, prompts, and data released openly.
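
To make the size of the experiment grid concrete, here is a minimal sketch that enumerates every (retrieval setup, model size) pair, assuming each of the 40 setups is run at each of the 3 sizes. The config identifiers are hypothetical placeholders, and this page names only the 7B and 70B sizes, so the middle entry is a stand-in rather than a real value.

```python
from itertools import product

# Hypothetical identifiers: the page states only the counts (40 and 3).
RETRIEVAL_CONFIGS = [f"retrieval_{i:02d}" for i in range(1, 41)]
# 7B and 70B are named on this page; the third size is not stated here.
MODEL_SIZES = ["7B", "mid (size not stated)", "70B"]


def build_run_grid(configs, sizes):
    """Return one (config, size) pair per run, in a stable order."""
    return list(product(configs, sizes))


runs = build_run_grid(RETRIEVAL_CONFIGS, MODEL_SIZES)
print(len(runs))  # 120
print(runs[0])    # ('retrieval_01', '7B')
```

Under that assumption the grid has 40 × 3 = 120 runs.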

Fully reproducible

Every number, regenerated from scratch.

Clone the repo, pull the data, and re-run the exact configs behind every figure.

bash

$ git clone https://github.com/atlas-lab/rag-at-scale
$ cd rag-at-scale && uv sync

# reproduce Figure 2 (accuracy vs. model size)
$ python -m experiments.run --config configs/fig2.yaml
# → writes results/fig2.csv + figures/fig2.pdf

Seeds, prompts, and the 12k-pair dataset are released under CC-BY.
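
For scripting several reproductions in a row, the command above can be assembled programmatically. This is a sketch, not part of the repo: `reproduction_command` is a hypothetical helper, and only `configs/fig2.yaml` appears on this page, so applying the same file-naming pattern to Figures 4, 6, and 9 is an assumption to check against the repo.

```python
import shlex


def reproduction_command(figure_number: int) -> list:
    """Build the command for one figure without running it.

    Only configs/fig2.yaml is confirmed by this page; other figure
    numbers assume the same naming pattern.
    """
    config_path = f"configs/fig{figure_number}.yaml"
    return ["python", "-m", "experiments.run", "--config", config_path]


# Figure numbers listed on this page; paths other than fig2 are assumed.
for n in (2, 4, 6, 9):
    print(shlex.join(reproduction_command(n)))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root once `uv sync` has finished.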

Authors

Who did the work.

Dr. Mara Ellison

Lead author

Retrieval & evaluation.

Sam Okonkwo

Co-author

Infrastructure & reproduction.

Dr. Priya Anand

Co-author

Statistics & analysis.

Read the full report.

32 pages, full results, and a link to the reproducible code and data.

Download the PDF · View the code

Released under CC-BY. Cite freely.