by Gibu Mathew
We faced intermittent problems starting Oct 12, 2015, 21:52 PDT /
Oct 13, 2015, 10:22 AM IST due to a problem with our database server cluster.
Data collection and polling were intermittent during this time for many
customers, and you may notice some points missing from the charts in the
web console. The Site24x7 Web Console was also unreachable.
Our engineers are working on this problem and we will update here
once we have found the root cause.
We sincerely apologize for the inconvenience this is causing.
- Gibu
Replies (1)
Please find the RCA for today's outage:
On Oct 12th, around 21:50 PDT, we were provisioning user space for new users in the system. This caused unexpected slowness in our primary data cluster, degrading the performance of the whole system. To resolve the performance problem, we stopped the provisioning of user space to reduce the load on our system. This did not help, and the problem worsened under the heavy load from backlogs on our servers.
Finally, we killed the main DB server in the cluster and promoted the
slave in its place to resolve the performance problem.
Timeline of Events
At 21:50 PDT User space provisioning started.
At 21:55 PDT Received an alert on the performance problem.
We had the following issues during this period:
Our Web Console was intermittently inaccessible.
Monitoring was running at a slower pace.
Server Monitoring data were not persisted immediately.
At 00:00 PDT Stopped the hung user provisioning thread and restarted all the application servers.
At 00:10 PDT All monitoring started working properly at the expected poll frequency. The Site24x7 Web Console still had intermittent accessibility issues.
At 03:00 PDT Promoted the slave DB server to master in our primary data cluster. Issues with client servers and server agents started stabilizing from this point.
At 05:30 PDT Cleared all the backlogs, and the service was completely stable.
We sincerely apologize for the inconvenience this caused.
Please get back to us if you need any further clarification.