Accuracy vs. model size
Figure 2
We studied 40 retrieval configurations across three model sizes. The result: a 7B model with disciplined retrieval beats a 70B model with naive context — at a tenth of the cost.
Full-resolution versions and data are in the appendix.
Figure 2
Figure 4
Figure 6
Figure 9
Everything here is reproducible from the public repo.
12k question–answer pairs across five domains.
40 retrieval setups × 3 model sizes.
Blind human grading plus automated scoring.
Seeds, prompts, and data released openly.
Clone the repo, pull the data, and re-run the exact configs behind every figure.
$ git clone https://github.com/atlas-lab/rag-at-scale
$ cd rag-at-scale && uv sync
# reproduce Figure 2 (accuracy vs. model size)
$ python -m experiments.run --config configs/fig2.yaml
# → writes results/fig2.csv + figures/fig2.pdf
Seeds, prompts, and the 12k-pair dataset are released under CC-BY.
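Once the run above has written results/fig2.csv, a quick sanity check is to average accuracy per model size before opening the PDF. The sketch below is illustrative only: the column names `model_size` and `accuracy` and both helper functions are assumptions, not the repo's documented schema or API, so adjust them to the actual CSV header.

```python
import csv
import io
from collections import defaultdict


def mean_accuracy_by_size(csv_text, size_col="model_size", acc_col="accuracy"):
    """Return {model size: mean accuracy} computed from CSV text.

    size_col and acc_col are hypothetical column names; change them
    to match the real header of results/fig2.csv.
    """
    totals = defaultdict(lambda: [0.0, 0])  # size -> [running sum, row count]
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = totals[row[size_col]]
        entry[0] += float(row[acc_col])
        entry[1] += 1
    return {size: total / count for size, (total, count) in totals.items()}


def summarize_file(path):
    """Read a results CSV from disk and return the per-size means."""
    with open(path, newline="") as f:
        return mean_accuracy_by_size(f.read())
```

Calling `summarize_file("results/fig2.csv")` from the repo root returns one mean per model size, which can be compared against the curve in figures/fig2.pdf.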
Lead author
Retrieval & evaluation.
Co-author
Infrastructure & reproduction.
Co-author
Statistics & analysis.
32 pages, full results, and a link to the reproducible code and data.
Released under CC-BY. Cite freely.