IJSMT Journal

International Journal of Science, Strategic Management and Technology

An International, Peer-Reviewed, Open Access Scholarly Journal. Indexed in recognized academic databases · DOI via Crossref. The journal adheres to established scholarly publishing, peer-review, and research ethics guidelines set by the UGC.

ISSN: 3108-1762 (Online)


DATA DUPLICATION DETECTION AND REMOVAL SYSTEM USING MACHINE LEARNING

AUTHORS:
ANSH BALGOTRA
Mentor
Affiliation
Department of Information Technology, Maharaja Agrasen Institute of Technology, New Delhi, India
CC BY 4.0 License:
This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
The problem of missing data is a critical issue in various domains, as it can lead to inaccurate analysis and flawed decision-making. Machine learning techniques have increasingly replaced traditional methods for handling missing values, offering more efficient solutions. Research in this area has explored various approaches to data imputation, analyzing their strengths and limitations. A systematic literature review of studies from 2016 to 2021 identified key factors influencing the effectiveness of these methods, providing valuable insights for researchers and data analysts. In parallel, the rapid expansion of data storage and processing has led to challenges in managing large-scale information, particularly in deduplication. Duplicate data originating from multiple sources reduces storage efficiency and retrieval accuracy. Cloud service providers have adopted data deduplication techniques to optimize storage costs and bandwidth usage. However, encryption for security conflicts with deduplication efficiency, which presents a challenge. To address this, hybrid chunking methods, such as the Two Threshold Two Divisor (TTTD) and Dynamic Prime Coding (DPC) algorithms, have been proposed. These techniques improve deduplication performance while balancing security requirements. Furthermore, entity resolution plays a crucial role in information integration, aiming to consolidate and organize data from diverse sources. Deduplication, as a key step in this process, enhances data quality by identifying and eliminating redundant records. Research in this domain spans machine learning, data mining, and information retrieval, focusing on both supervised and unsupervised approaches. By analyzing various methodologies, researchers can refine existing techniques to improve accuracy, processing speed, and computational efficiency.
Overall, advancements in machine learning, deduplication, and entity resolution contribute to more effective data management, addressing challenges in missing data imputation, secure deduplication, and large-scale information integration.
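To make the idea of identifying and eliminating redundant records concrete, the following is a minimal sketch of an unsupervised, similarity-threshold approach. It is not the method evaluated in this article: the function names (normalize, similarity, deduplicate), the 0.9 threshold, and the sample records are all assumptions chosen for illustration, and the similarity measure is Python's standard-library difflib.SequenceMatcher rather than a trained model.

```python
from difflib import SequenceMatcher


def normalize(record: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(record.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized records."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def deduplicate(records: list[str], threshold: float = 0.9) -> list[str]:
    """Keep the first record of each group of near-duplicates.

    Every record is compared against the records already kept, so the
    cost grows quadratically with the number of records. The threshold
    is an illustrative assumption, not a recommended value.
    """
    kept: list[str] = []
    for record in records:
        if all(similarity(record, k) < threshold for k in kept):
            kept.append(record)
    return kept


# Hypothetical sample data: the second record differs from the first
# only in letter case and spacing.
records = [
    "John Smith, 12 Main Street",
    "john  smith, 12 main street",
    "Acme Corp, 42 Industrial Road",
]
unique = deduplicate(records)
```

Here the second record normalizes to the same string as the first and is dropped, while the third is kept. A supervised variant would replace the fixed threshold with a classifier trained on labeled duplicate and non-duplicate pairs, and larger datasets would typically need a blocking step to avoid comparing every pair.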
Keywords
Missing Data, Data Quality, Machine Learning, Processing Speed, Computational Efficiency, Structured Data, Unstructured Data, Database Management, Encryption, Accuracy, Performance
HOW TO CITE

APA: Balgotra, A. (2026). Data Duplication Detection and Removal System using Machine Learning. International Journal of Science, Strategic Management and Technology, 10(01). https://doi.org/10.55041/ijsmt.v2i2.032

MLA: Balgotra, Ansh. "Data Duplication Detection and Removal System using Machine Learning." International Journal of Science, Strategic Management and Technology, vol. 10, no. 01, 2026. https://doi.org/10.55041/ijsmt.v2i2.032.

Chicago: Balgotra, Ansh. "Data Duplication Detection and Removal System using Machine Learning." International Journal of Science, Strategic Management and Technology 10, no. 01 (2026). https://doi.org/10.55041/ijsmt.v2i2.032.

Ethics and Compliance
This article has undergone plagiarism screening and double-blind peer review. Editorial policies have been followed. Authors retain copyright under CC BY-NC 4.0 license. The research complies with ethical standards and institutional guidelines.