A Controlled Study on the Impact of Input-Only vs. Input-Output Contamination on Benchmark Evaluation of Large Language Models

Deol, Vansh; Ahmad, Sami; Iqbal, Danish

doi:https://doi.org/10.55041/ijsmt.v2i5.310

Plagiarism Passed

Peer reviewed

Open Access


AUTHORS:

Vansh Deol

Sami Ahmad

Danish Iqbal

Mentor

Affiliation

Department of Information Technology Noida Institute of Engineering & Technology Greater Noida, India


CC BY 4.0 License:

This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Abstract

Benchmark datasets are widely used to evaluate Large Language Models (LLMs), but the inclusion of evaluation data in pretraining corpora raises significant contamination concerns. Such exposure can artificially inflate performance, obscuring true model capabilities. This work conducts a controlled empirical study of benchmark contamination using four open-weight models (TinyLlama, Qwen2.5-1.5B, Phi-2, and Gemma-2B-it) evaluated on OpenBookQA. We compare two conditions via LoRA finetuning: input-only contamination (exposure to questions) and input-output contamination (exposure to questions and answers).
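The two contamination conditions differ only in what text the model sees during finetuning. The following is a minimal sketch of how a multiple-choice item might be turned into training text under each condition. The field names (`question`, `choices`, `labels`, `answer`), the prompt template, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
from typing import Dict


def format_question(item: Dict) -> str:
    """Render the question and its labeled answer choices as plain text."""
    lines = [f"Question: {item['question']}"]
    for label, text in zip(item["labels"], item["choices"]):
        lines.append(f"{label}. {text}")
    return "\n".join(lines)


def build_training_text(item: Dict, mode: str) -> str:
    """Build one contaminated training example.

    mode="input_only":   the model is exposed to the question and choices only.
    mode="input_output": the model is also exposed to the gold answer label.
    (Template and mode names are hypothetical.)
    """
    prompt = format_question(item)
    if mode == "input_only":
        return prompt
    if mode == "input_output":
        return f"{prompt}\nAnswer: {item['answer']}"
    raise ValueError(f"unknown contamination mode: {mode!r}")


# Hypothetical item in an OpenBookQA-like shape.
example_item = {
    "question": "Which of these is a source of light?",
    "labels": ["A", "B", "C", "D"],
    "choices": ["a rock", "the sun", "a shadow", "a puddle"],
    "answer": "B",
}

input_only_text = build_training_text(example_item, "input_only")
input_output_text = build_training_text(example_item, "input_output")
```

Under this sketch the input-output text is the input-only text with one extra answer line, so any difference between the two finetuned models can be attributed to answer exposure.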

Beyond standard accuracy, we evaluate contamination using behavioral metrics, including prediction agreement and prediction transitions. Our results show that input-output contamination consistently drives stronger performance gains and larger behavioral shifts. Notably, partial benchmark exposure (input-only) substantially alters prediction behavior, whereas explicit answer exposure (input-output) primarily promotes memorization. Moreover, contamination alters prediction behavior even when changes in aggregate benchmark accuracy remain small. These findings indicate that accuracy alone is insufficient for capturing contamination-induced changes and highlight the need for behavioral evaluation in LLM assessment.
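To make the behavioral metrics concrete, the sketch below computes accuracy, prediction agreement, and prediction transitions for a base model and a contaminated model on the same items. The definitions used here are assumptions, since the abstract does not spell them out: agreement is taken as the fraction of items on which both models give the same answer, and transitions are counted as correct/wrong status before versus after contamination.

```python
from collections import Counter
from typing import Dict, Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def prediction_agreement(before: Sequence[str], after: Sequence[str]) -> float:
    """Fraction of items where both models choose the same answer (assumed definition)."""
    if len(before) != len(after) or not before:
        raise ValueError("prediction lists must be non-empty and of equal length")
    return sum(b == a for b, a in zip(before, after)) / len(before)


def prediction_transitions(
    before: Sequence[str], after: Sequence[str], gold: Sequence[str]
) -> Dict[str, int]:
    """Count per-item moves between correct and wrong (assumed definition)."""
    counts: Counter = Counter()
    for b, a, g in zip(before, after, gold):
        src = "correct" if b == g else "wrong"
        dst = "correct" if a == g else "wrong"
        counts[f"{src}->{dst}"] += 1
    return dict(counts)


# Toy data, invented for illustration only (not results from the study).
gold = ["A", "B", "C", "D"]
base_preds = ["A", "C", "C", "A"]
contaminated_preds = ["B", "B", "C", "A"]

acc_before = accuracy(base_preds, gold)
acc_after = accuracy(contaminated_preds, gold)
agreement = prediction_agreement(base_preds, contaminated_preds)
transitions = prediction_transitions(base_preds, contaminated_preds, gold)
```

In this toy case accuracy is 0.5 both before and after, yet the two models agree on only half of the items and one item moves in each direction between correct and wrong. That is the kind of shift an accuracy-only comparison would miss.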


Ethics and Compliance

✓ All ethical standards met

This article has undergone plagiarism screening and double-blind peer review. Editorial policies have been followed. Authors retain copyright under CC BY-NC 4.0 license. The research complies with ethical standards and institutional guidelines.


International Journal of Science, Strategic Management and Technology

ISSN: 3108-1762 (Online)
