IJSMT Journal

International Journal of Science, Strategic Management and Technology

An International, Peer-Reviewed, Open Access Scholarly Journal. Indexed in recognized academic databases · DOI via Crossref. The journal adheres to established scholarly publishing, peer-review, and research ethics guidelines set by the UGC.

ISSN: 3108-1762 (Online)


A CONTROLLED STUDY ON THE IMPACT OF INPUT-ONLY VS. INPUT-OUTPUT CONTAMINATION ON BENCHMARK EVALUATION OF LARGE LANGUAGE MODELS

AUTHORS:
Vansh Deol
Sami Ahmad
Danish Iqbal
Mentor
Affiliation
Department of Information Technology Noida Institute of Engineering & Technology Greater Noida, India
CC BY 4.0 License:
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract

Benchmark datasets are widely used to evaluate Large Language Models (LLMs), but the inclusion of evaluation data in pretraining corpora raises significant contamination concerns. Such exposure can artificially inflate performance, obscuring true model capabilities. This work conducts a controlled empirical study of benchmark contamination using four open-weight models (TinyLlama, Qwen2.5-1.5B, Phi-2, and Gemma-2B-it) evaluated on OpenBookQA. We compare two conditions, both introduced via LoRA fine-tuning: input-only contamination (exposure to the benchmark questions) and input-output contamination (exposure to the questions together with their answers).
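As a rough illustration of how the two conditions differ, the sketch below formats one multiple-choice item as fine-tuning text with and without the answer. The `MCQItem` fields, the prompt template, and the example question are assumptions made for this illustration, not the authors' actual data format.

```python
from dataclasses import dataclass
from typing import List

LABELS = "ABCD"  # assumed four-way multiple-choice labels


@dataclass
class MCQItem:
    """Hypothetical container for one multiple-choice benchmark item."""
    question: str
    choices: List[str]
    answer_index: int  # index of the correct choice


def format_input_only(item: MCQItem) -> str:
    """Input-only condition: question and choices, no answer."""
    lines = [f"Question: {item.question}"]
    lines += [f"{LABELS[i]}. {choice}" for i, choice in enumerate(item.choices)]
    return "\n".join(lines)


def format_input_output(item: MCQItem) -> str:
    """Input-output condition: the same text plus the correct answer label."""
    return format_input_only(item) + f"\nAnswer: {LABELS[item.answer_index]}"


# Made-up item, used only to show the two formats side by side.
example = MCQItem(
    question="Which material conducts electricity?",
    choices=["wood", "copper", "rubber", "glass"],
    answer_index=1,
)
input_only_text = format_input_only(example)
input_output_text = format_input_output(example)
```

Under this sketch the only difference between the two training corpora is the final answer line, which keeps the comparison between conditions controlled.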


Beyond standard accuracy, we evaluate contamination using behavioral metrics, including prediction agreement and prediction transitions. Our results demonstrate that input-output contamination consistently drives stronger performance gains and larger behavioral shifts. Notably, partial benchmark exposure (input-only) substantially alters benchmark prediction behavior, whereas explicit answer exposure (input-output) primarily promotes memorization. Furthermore, contamination substantially alters prediction behavior even when changes in aggregate benchmark accuracy remain small. These findings demonstrate that accuracy alone is an insufficient metric for capturing contamination-induced changes, highlighting the necessity of behavioral evaluation in LLM assessment.
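The behavioral metrics named above can be sketched as follows. The definitions used here are assumptions for illustration: agreement is taken as the fraction of items on which the clean and contaminated models give the same answer, and transitions as counts of correct/wrong changes per item. The paper's exact formulations may differ, and the toy predictions are made up.

```python
from typing import Dict, Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def prediction_agreement(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of items on which both models pick the same option."""
    if len(before) != len(after) or not before:
        raise ValueError("prediction lists must be non-empty and equal length")
    return sum(b == a for b, a in zip(before, after)) / len(before)


def prediction_transitions(
    before: Sequence[str], after: Sequence[str], gold: Sequence[str]
) -> Dict[str, int]:
    """Count per-item moves between correct and wrong predictions."""
    counts = {
        "correct->correct": 0,
        "correct->wrong": 0,
        "wrong->correct": 0,
        "wrong->wrong": 0,
    }
    for b, a, g in zip(before, after, gold):
        src = "correct" if b == g else "wrong"
        dst = "correct" if a == g else "wrong"
        counts[f"{src}->{dst}"] += 1
    return counts


# Toy data (hypothetical): accuracy is identical before and after,
# yet half of the individual predictions have changed.
gold = ["A", "B", "C", "D"]
before = ["A", "C", "C", "A"]
after = ["B", "B", "C", "A"]

acc_before = accuracy(before, gold)       # 0.5
acc_after = accuracy(after, gold)         # 0.5
agreement = prediction_agreement(before, after)  # 0.5
transitions = prediction_transitions(before, after, gold)
```

In this toy case the aggregate accuracy does not move, while the agreement and transition counts expose that two of four answers flipped, which is the kind of change an accuracy-only evaluation would miss.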

HOW TO CITE
APA: Deol, V., Ahmad, S., & Iqbal, D. (2026). A Controlled Study on the Impact of Input-Only Vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models. International Journal of Science, Strategic Management and Technology, 02(05). https://doi.org/10.55041/ijsmt.v2i5.310

MLA: Deol, Vansh, et al. "A Controlled Study on the Impact of Input-Only Vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models." International Journal of Science, Strategic Management and Technology, vol. 02, no. 05, 2026. https://doi.org/10.55041/ijsmt.v2i5.310.

Chicago: Deol, Vansh, Sami Ahmad, and Danish Iqbal. "A Controlled Study on the Impact of Input-Only Vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models." International Journal of Science, Strategic Management and Technology 02, no. 05 (2026). https://doi.org/10.55041/ijsmt.v2i5.310.

References
1. T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal, “Can a suit of armor conduct electricity? A new dataset for open book question answering,” in Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018. [Online]. Available: https://arxiv.org/abs/1809.02789

2. TinyLlama Team, “TinyLlama: An open-source small language model,” Tech. Rep., 2023. [Online]. Available: https://arxiv.org/abs/2401.02385

3. Qwen Team, “Qwen2.5 technical report,” Tech. Rep., 2024. [Online]. Available: https://qwenlm.github.io/blog/qwen2.5/

4. M. Javaheripi, S. Bubeck, M. Abdin, J. Aneja, C. C. T. Mendes, W. Chen, A. Del Giorno, R. Eldan, S. Gopi, and S. Gunasekar, “Phi-2: The surprising power of small language models,” Microsoft Research, Tech. Rep., 2023. [Online]. Available: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/

5. Gemma Team, “Gemma: Open models based on Gemini research and technology,” arXiv preprint arXiv:2403.08295. [Online]. Available: https://arxiv.org/abs/2403.08295

6. T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei, “Language models are few-shot learners,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 33, 2020. [Online]. Available: https://arxiv.org/abs/2005.14165

7. OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774. [Online]. Available: https://arxiv.org/abs/2303.08774

8. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever, “Language models are unsupervised multitask learners,” OpenAI, Tech. Rep., February 2019. [Online]. Available: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

 
Ethics and Compliance
This article has undergone plagiarism screening and double-blind peer review. Editorial policies have been followed. Authors retain copyright under CC BY-NC 4.0 license. The research complies with ethical standards and institutional guidelines.