PaperTrail
Technical report · 2026

Smaller models, sharper retrieval: rethinking RAG at scale.

We studied 40 retrieval configurations across three model sizes. The result: a 7B model with disciplined retrieval beats a 70B model with naive context — at a tenth of the cost.

40 configs · 3 model sizes · Open data
Key findings

What the data showed.

10× lower cost (vs. naive 70B baseline)
+18% answer accuracy (with disciplined retrieval)
3.4× faster responses (median latency)
40 configs tested (fully reproducible)
Figures

Selected figures from the report.

Full-resolution versions and data are in the appendix.

Figure 2: Accuracy vs. model size

Figure 4: Cost per correct answer

Figure 6: Latency distribution

Figure 9: Retrieval ablation

Methodology

How we ran it.

Everything here is reproducible from the public repo.

  1. Dataset: 12k question–answer pairs across five domains.

  2. Configurations: 40 retrieval setups × 3 model sizes.

  3. Evaluation: blind human grading plus automated scoring.

  4. Reproduction: seeds, prompts, and data released openly.
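To make the size of the sweep concrete, here is a minimal sketch of the grid implied by "40 retrieval setups × 3 model sizes". It is an illustration, not the repo's code: the setup identifiers (`retrieval_00` to `retrieval_39`) and the `build_grid` helper are hypothetical, and the third model size is left as a placeholder because only the 7B and 70B models are named on this page.

```python
from itertools import product

# Hypothetical placeholders: this page names only the 7B and 70B models,
# and does not list the retrieval setups, so both lists are illustrative.
MODEL_SIZES = ["7B", "third size (not stated)", "70B"]
RETRIEVAL_SETUPS = [f"retrieval_{i:02d}" for i in range(40)]


def build_grid(setups, sizes):
    """Return every (retrieval setup, model size) pair as a list of dicts."""
    return [{"retrieval": s, "model": m} for s, m in product(setups, sizes)]


grid = build_grid(RETRIEVAL_SETUPS, MODEL_SIZES)
print(len(grid))  # 40 setups x 3 sizes = 120 combinations
```

Under these assumptions, a full sweep covers 120 (setup, size) combinations.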

Fully reproducible

Every number, regenerated from scratch.

Clone the repo, pull the data, and re-run the exact configs behind every figure.

```bash
$ git clone https://github.com/atlas-lab/rag-at-scale
$ cd rag-at-scale && uv sync

# reproduce Figure 2 (accuracy vs. model size)
$ python -m experiments.run --config configs/fig2.yaml
# → writes results/fig2.csv + figures/fig2.pdf
```

Seeds, prompts, and the 12k-pair dataset are released under CC-BY.
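After running the Figure 2 command, a quick check that the run produced its files can save a debugging round trip. The sketch below is an assumption-laden helper, not part of the repo: `missing_outputs` is a hypothetical name, and the only paths it knows are the two mentioned in the command comment above (`results/fig2.csv` and `figures/fig2.pdf`).

```python
from pathlib import Path

# Output paths taken from the comment next to the Figure 2 command.
EXPECTED_OUTPUTS = ["results/fig2.csv", "figures/fig2.pdf"]


def missing_outputs(repo_root, expected=EXPECTED_OUTPUTS):
    """Return the expected output files that are absent or empty."""
    root = Path(repo_root)
    missing = []
    for rel in expected:
        path = root / rel
        # Treat a zero-byte file as missing: it usually means a failed run.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    gaps = missing_outputs(".")
    print("all outputs present" if not gaps else f"missing: {gaps}")
```

Run it from the repository root. It only confirms that the files exist and are non-empty, not that the numbers match the report.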

Authors

Who did the work.

Dr. Mara Ellison

Lead author

Retrieval & evaluation.

Sam Okonkwo

Co-author

Infrastructure & reproduction.

Dr. Priya Anand

Co-author

Statistics & analysis.

Read the full report.

32 pages, full results, and a link to the reproducible code and data.

Released under CC-BY. Cite freely.