Research Article

AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems

by Rama Krishna Reddy Arumalla
International Journal of Computer Applications
Foundation of Computer Science (FCS), NY, USA
Volume 187 - Number 98
Year of Publication: 2026
Authors: Rama Krishna Reddy Arumalla
10.5120/ijcadab8ea8eb453

Rama Krishna Reddy Arumalla. AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems. International Journal of Computer Applications. 187, 98 (Apr 2026), 6-11. DOI=10.5120/ijcadab8ea8eb453

@article{ 10.5120/ijcadab8ea8eb453,
author = { Rama Krishna Reddy Arumalla },
title = { AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems },
journal = { International Journal of Computer Applications },
issue_date = { Apr 2026 },
volume = { 187 },
number = { 98 },
month = { Apr },
year = { 2026 },
issn = { 0975-8887 },
pages = { 6-11 },
numpages = {9},
url = { https://ijcaonline.org/archives/volume187/number98/ai-assisted-incident-detection-and-automated-recovery-in-distributed-e-commerce-systems/ },
doi = { 10.5120/ijcadab8ea8eb453 },
publisher = {Foundation of Computer Science (FCS), NY, USA},
address = {New York, USA}
}
%0 Journal Article
%1 2026-04-28T21:29:18.354909+05:30
%A Rama Krishna Reddy Arumalla
%T AI-Assisted Incident Detection and Automated Recovery in Distributed E-Commerce Systems
%J International Journal of Computer Applications
%@ 0975-8887
%V 187
%N 98
%P 6-11
%D 2026
%I Foundation of Computer Science (FCS), NY, USA
Abstract

Distributed e-commerce systems face growing uptime and performance challenges as microservice architectures become more complex. This study proposes an Intelligent Observability and Incident Response Framework that proactively detects bottlenecks and automates recovery. The study is based on a filtered dataset of 452 operational telemetry samples covering request latency, CPU utilization, memory pressure, and error rates recorded during peak-traffic scenarios. The framework combines a stack of open-source monitoring agents, time-series databases, and automated orchestration engines to shift monitoring toward predictive observability. The findings show a significant reduction in Mean Time to Detect (MTTD) and Mean Time to Repair (MTTR). These results indicate that machine learning, used together with conventional telemetry, can identify silent failures that conventional threshold-based alerts do not detect. The paper describes the architecture design, the implementation of the intelligent layer, and an overall discussion of system performance under different load conditions, and it can serve as a blueprint for resilient digital commerce infrastructure.
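
The contrast the abstract draws between threshold-based alerts and learned detection can be illustrated with a minimal sketch. The abstract does not say which model the framework uses, so the code below substitutes a simple rolling z-score as a stand-in. The function names, the 500 ms limit, the window size, and the synthetic latency series are all assumptions made for illustration and are not taken from the paper or its dataset.

```python
from statistics import mean, stdev


def threshold_alerts(latencies_ms, limit_ms=500.0):
    """Return indices where latency exceeds a fixed limit (conventional alerting)."""
    return [i for i, value in enumerate(latencies_ms) if value > limit_ms]


def zscore_alerts(latencies_ms, window=20, z_limit=3.0):
    """Return indices where a sample deviates strongly from its trailing window.

    Hypothetical stand-in for a learned detector: each sample is compared
    with the mean and standard deviation of the previous `window` samples.
    """
    alerts = []
    for i in range(window, len(latencies_ms)):
        baseline = latencies_ms[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline) or 1e-9  # guard against a perfectly flat baseline
        if abs(latencies_ms[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


# Synthetic series (assumed values): steady latency near 100 ms, then a
# degradation to about 250 ms that stays well below the 500 ms limit.
normal = [100.0 + (i % 5) for i in range(40)]
degraded = [250.0 + (i % 5) for i in range(10)]
series = normal + degraded

print("threshold alerts:", threshold_alerts(series))
print("first z-score alert:", zscore_alerts(series)[0])
```

On this synthetic series the fixed limit never fires, while the rolling baseline flags the first degraded sample at index 40. That is the kind of silent failure the abstract refers to: a large change relative to normal behavior that remains under the static threshold.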

Index Terms

Computer Science
Information Sciences

Keywords

AIOps, Intelligent Observability, Microservices Monitoring, Automated Incident Response, Self-Healing Systems, Distributed Tracing, Anomaly Detection, E-Commerce Infrastructure