Research Article

Generative AI for Synthetic Patient Data Generation to Enhance Identity Matching and Deduplication Models

by Saiteja Jonnalagadda
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 103
Year of Publication: 2026
Authors: Saiteja Jonnalagadda
10.5120/ijcaea79a3291b37

Saiteja Jonnalagadda. Generative AI for Synthetic Patient Data Generation to Enhance Identity Matching and Deduplication Models. International Journal of Computer Applications. 187, 103 (May 2026), 32-38. DOI=10.5120/ijcaea79a3291b37

@article{ 10.5120/ijcaea79a3291b37,
author = { Saiteja Jonnalagadda },
title = { Generative AI for Synthetic Patient Data Generation to Enhance Identity Matching and Deduplication Models },
journal = { International Journal of Computer Applications },
issue_date = { May 2026 },
volume = { 187 },
number = { 103 },
month = { May },
year = { 2026 },
issn = { 0975-8887 },
pages = { 32-38 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number103/generative-ai-for-synthetic-patient-data-generation-to-enhance-identity-matching-and-deduplication-models/ },
doi = { 10.5120/ijcaea79a3291b37 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-05-17T02:29:11.149730+05:30
%A Saiteja Jonnalagadda
%T Generative AI for Synthetic Patient Data Generation to Enhance Identity Matching and Deduplication Models
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 103
%P 32-38
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

This paper examines how generative artificial intelligence, specifically Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), can address the problem of patient identity matching and deduplication in healthcare informatics. Privacy restrictions and data fragmentation complicate the development of effective record linkage algorithms. To work around this limitation, the study uses a synthetic data generation framework that produces high-fidelity patient records reflecting the statistical characteristics of real-world clinical datasets. The experiment uses the Synthea patient population simulator and Python-based GAN libraries to generate a specialized sample of 389 cases. Each case includes demographic attributes, longitudinal medical records, and deliberate clerical errors such as phonetic misspellings and reversed numbers. Effectiveness is assessed by training deduplication models on the synthetically augmented data and measuring the improvement in accuracy and recall when matching equivalent entries across different systems. The software stack comprises TensorFlow for the model architecture, RecordLinkage toolkits for matching, and Pandas for data manipulation. The findings indicate that generative models can capture the characteristics of human error and significantly increase the sensitivity of deduplication models without violating patient privacy. The study concludes that, in contemporary electronic health record settings, synthetic data is an effective tool for improving identity resolution.
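The abstract's idea of injecting clerical errors into records and then measuring how many true duplicates a matcher recovers can be illustrated with a minimal Python sketch. This is not the paper's pipeline: it uses rule-based corruption instead of a GAN or VAE, the standard-library `difflib` module instead of a RecordLinkage toolkit, and two invented records instead of Synthea output. The substitution table, the function names, the field names, and the 0.85 threshold are all assumptions made for illustration.

```python
import random
from difflib import SequenceMatcher

# Hypothetical phonetic substitutions, chosen only for illustration.
PHONETIC_SWAPS = [("ph", "f"), ("ck", "k"), ("ie", "y"), ("ee", "ea")]


def phonetic_misspell(name):
    """Apply the first phonetic substitution that matches, if any."""
    lowered = name.lower()
    for old, new in PHONETIC_SWAPS:
        if old in lowered:
            return lowered.replace(old, new, 1).title()
    return name


def reverse_adjacent_digits(value, rng):
    """Swap one randomly chosen pair of adjacent digits in a string."""
    positions = [i for i in range(len(value) - 1)
                 if value[i].isdigit() and value[i + 1].isdigit()]
    if not positions:
        return value
    i = rng.choice(positions)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def corrupt(record, rng):
    """Return a copy of the record with simulated clerical errors."""
    return {
        "name": phonetic_misspell(record["name"]),
        "dob": reverse_adjacent_digits(record["dob"], rng),
    }


def similarity(a, b):
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_match(rec_a, rec_b, threshold=0.85):
    """Flag two records as duplicates if the mean field similarity is high."""
    score = (similarity(rec_a["name"], rec_b["name"])
             + similarity(rec_a["dob"], rec_b["dob"])) / 2
    return score >= threshold


def recall(true_pairs, threshold=0.85):
    """Fraction of known duplicate pairs that the matcher flags."""
    hits = sum(is_match(a, b, threshold) for a, b in true_pairs)
    return hits / len(true_pairs)


# Invented example records (not taken from the paper or from Synthea).
rng = random.Random(0)
clean = [
    {"name": "Stephen Dickson", "dob": "1984-03-12"},
    {"name": "Sophie Keene", "dob": "1990-11-27"},
]
pairs = [(rec, corrupt(rec, rng)) for rec in clean]
print("recall at 0.85:", recall(pairs, threshold=0.85))
print("recall at 0.99:", recall(pairs, threshold=0.99))
```

Because each pair is a known duplicate by construction, recall can be computed directly. In the study's setting, a learned generator would presumably replace `corrupt` and a trained deduplication model would replace `is_match`.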

Index Terms

Computer Science
Information Sciences

Keywords

Generative AI, Synthetic Data, Patient Matching, Data Deduplication, Healthcare Informatics