AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems

Rama Krishna Reddy Arumalla

Research Article

International Journal of Computer Applications

Foundation of Computer Science (FCS), NY, USA

Volume 187 - Number 98

Year of Publication: 2026

Authors: Rama Krishna Reddy Arumalla

10.5120/ijcadab8ea8eb453

Rama Krishna Reddy Arumalla. AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems. International Journal of Computer Applications. 187, 98 (Apr 2026), 6-11. DOI=10.5120/ijcadab8ea8eb453

@article{10.5120/ijcadab8ea8eb453,
  author = {Rama Krishna Reddy Arumalla},
  title = {AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems},
  journal = {International Journal of Computer Applications},
  issue_date = {Apr 2026},
  volume = {187},
  number = {98},
  month = {Apr},
  year = {2026},
  issn = {0975-8887},
  pages = {6-11},
  numpages = {9},
  url = {https://ijcaonline.org/archives/volume187/number98/ai-assisted-incident-detection-and-automated-recovery-in-distributed-e-commerce-systems/},
  doi = {10.5120/ijcadab8ea8eb453},
  publisher = {Foundation of Computer Science (FCS), NY, USA},
  address = {New York, USA}
}

%0 Journal Article
%1 2026-04-28T21:29:18.354909+05:30
%A Rama Krishna Reddy Arumalla
%T AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 98
%P 6-11
%D 2026
%I Foundation of Computer Science (FCS), NY, USA

Abstract

Distributed e-commerce systems now face unprecedented uptime and performance challenges because of the complexity of microservice architectures. This study proposes an Intelligent Observability and Incident Response Framework that actively detects bottlenecks and automates recovery. It is based on a filtered dataset of 452 operational telemetry samples covering request latency, CPU utilization, memory pressure, and error rates recorded during peak-traffic scenarios. The framework combines a stack of open-source monitoring agents, time-series databases, and automated orchestration engines to move toward predictive observability. The findings show that Mean Time to Detect and Mean Time to Repair are reduced significantly. These results indicate that machine learning, used together with conventional telemetry, can identify silent failures that conventional threshold-based alerts miss. The paper describes the architecture design, the implementation of the intelligent layer, and an overall discussion of system performance under different load conditions, and it can serve as a blueprint for resilient digital commerce infrastructure.
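The abstract's contrast between threshold-based alerts and learned detection of silent failures can be illustrated with a small sketch. The abstract does not say which model the framework uses, so the rolling z-score below is only a stand-in for statistical anomaly detection, not the paper's method. The function names, window size, limits, and latency values are all hypothetical.

```python
from statistics import mean, stdev


def threshold_alerts(latencies_ms, limit_ms=500.0):
    """Conventional alerting: indices where latency exceeds a fixed limit."""
    return [i for i, value in enumerate(latencies_ms) if value > limit_ms]


def zscore_alerts(latencies_ms, window=10, z_limit=3.0):
    """Indices whose latency deviates from the trailing window's mean by
    more than z_limit sample standard deviations (assumed stand-in model)."""
    alerts = []
    for i in range(window, len(latencies_ms)):
        history = latencies_ms[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            continue  # flat history: the z-score is undefined, so skip
        if abs(latencies_ms[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Synthetic request latencies in milliseconds (illustrative, not from the paper).
baseline = [100.0, 102.0, 98.0, 101.0, 99.0, 100.0, 103.0, 97.0, 100.0, 101.0]
degraded = [180.0, 185.0, 190.0]  # slower than normal, still under 500 ms
series = baseline + degraded

print("threshold alerts:", threshold_alerts(series))
print("z-score alerts:", zscore_alerts(series))
```

In this synthetic series the degraded samples never cross the fixed 500 ms limit, so the threshold rule stays silent, while the rolling check flags the first degraded sample at index 10. Because the trailing window absorbs the degraded values, later samples in a sustained shift score lower, which is one reason a production detector would likely need a more robust baseline than this sketch.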

Index Terms

Computer Science

Information Sciences

Keywords

AIOps, Intelligent Observability, Microservices Monitoring, Automated Incident Response, Self-Healing Systems, Distributed Tracing, Anomaly Detection, E-Commerce Infrastructure