A CONTROLLED STUDY ON THE IMPACT OF INPUT-ONLY VS. INPUT-OUTPUT CONTAMINATION ON BENCHMARK EVALUATION OF LARGE LANGUAGE MODELS
Benchmark datasets are widely used to evaluate Large Language Models (LLMs), but the inclusion of evaluation data in pretraining corpora raises significant contamination concerns. Such exposure can artificially inflate performance, obscuring true model capabilities. This work conducts a controlled empirical study of benchmark contamination using four open-weight models (TinyLlama, Qwen2.5-1.5B, Phi-2, and Gemma-2B-it) evaluated on OpenBookQA. We compare two conditions via LoRA finetuning: input-only contamination (exposure to questions) and input-output contamination (exposure to questions and answers).
Beyond standard accuracy, we evaluate contamination using behavioral metrics, including prediction agreement and prediction transitions. Our results demonstrate that input-output contamination consistently drives stronger performance gains and larger behavioral shifts. Notably, partial benchmark exposure (input-only) substantially alters benchmark prediction behavior, whereas explicit answer exposure (input-output) primarily promotes memorization. Furthermore, contamination substantially alters prediction behavior even when aggregate benchmark accuracy changes remain small. These findings demonstrate that accuracy alone is an insufficient metric for capturing contamination-induced changes, highlighting the necessity of behavioral evaluation in LLM assessment.
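To make the behavioral metrics concrete, the following is a minimal sketch of how prediction agreement and prediction transitions could be computed from per-question answers before and after contamination. The abstract does not define these metrics formally, so the definitions here are assumptions: agreement is taken as the fraction of questions on which the two models pick the same option, and transitions are counted over the four correct/wrong combinations. The function names and the toy answer lists are hypothetical and are not results from the study.

```python
from collections import Counter


def accuracy(preds, gold):
    """Fraction of predictions that match the gold answers."""
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def prediction_agreement(before, after):
    """Fraction of questions on which both models choose the same option.

    Assumed definition: gold labels are not used, only whether the
    answer changed between the base and contaminated model.
    """
    if len(before) != len(after) or not before:
        raise ValueError("prediction lists must be non-empty and equal in length")
    return sum(b == a for b, a in zip(before, after)) / len(before)


def prediction_transitions(before, after, gold):
    """Count correct/wrong transitions from the base to the contaminated model."""
    if not (len(before) == len(after) == len(gold)):
        raise ValueError("all lists must be equal in length")
    counts = Counter({
        "correct->correct": 0,
        "correct->wrong": 0,
        "wrong->correct": 0,
        "wrong->wrong": 0,
    })
    for b, a, g in zip(before, after, gold):
        src = "correct" if b == g else "wrong"
        dst = "correct" if a == g else "wrong"
        counts[f"{src}->{dst}"] += 1
    return dict(counts)


# Toy multiple-choice answers (illustrative only, not data from the paper).
gold = ["A", "B", "C", "D", "A"]
base_preds = ["A", "C", "C", "B", "D"]
contaminated_preds = ["B", "B", "C", "A", "D"]

base_acc = accuracy(base_preds, gold)
cont_acc = accuracy(contaminated_preds, gold)
agreement = prediction_agreement(base_preds, contaminated_preds)
transitions = prediction_transitions(base_preds, contaminated_preds, gold)

print(f"accuracy before={base_acc:.2f} after={cont_acc:.2f}")
print(f"agreement={agreement:.2f}")
print(transitions)
```

In this toy case both models score the same accuracy, yet they agree on fewer than half of the questions and one answer flips in each direction. This is the kind of shift that an accuracy-only comparison would miss.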
Deol, V., Ahmad, S. & Iqbal, D. (2026). A Controlled Study on the Impact of Input-Only Vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.310
Deol, Vansh, et al. "A Controlled Study on the Impact of Input-Only Vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models." International Journal of Science, Strategic Management and Technology, vol. 02, no. 05, 2026. https://doi.org/10.55041/ijsmt.v2i5.310.
Deol, Vansh, Sami Ahmad, and Danish Iqbal. "A Controlled Study on the Impact of Input-Only Vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models." International Journal of Science, Strategic Management and Technology 02, no. 05 (2026). https://doi.org/10.55041/ijsmt.v2i5.310.
2. TinyLlama Team, “TinyLlama: An open-source small language model,” Tech. Rep., 2023. [Online]. Available: https://arxiv.org/abs/2401.02385
3. Qwen Team, “Qwen2.5 technical report,” Tech. Rep., 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/
4. M. Javaheripi, S. Bubeck, M. Abdin, J. Aneja, C. C. T. Mendes, W. Chen, A. Del Giorno, R. Eldan, S. Gopi, and S. Gunasekar, “Phi-2: The surprising power of small language models,” Microsoft Research, Tech. Rep., 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
5. Gemma Team, “Gemma: Open models based on Gemini research and technology,” arXiv preprint arXiv:2403.08295, 2024. [Online]. Available: https://arxiv.org/abs/2403.08295
6. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165
7. OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023. [Online]. Available: https://arxiv.org/abs/2303.08774
8. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI, Technical Report, February 2019. [Online]. Available: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf