Important Announcement: Technical Difficulties, US-1 DC

This week we experienced atypical behavior affecting our metadata clusters, related to the recently announced issue with our US-1 datacenter (United States, Chicago, IL).

For more information about the datacenters, click here.

Unfortunately, some workspaces have become inaccessible due to an overload of the underlying component responsible for handling workspace metadata. The overload has recurred at around 6:18 PM PDT since Tuesday, July 16th, and our DevOps engineers have had to step in each time to repair the metadata clusters.

The metadata clusters store information about logical data models (LDMs), dashboards, reports, and other related data, all of which are essential for workspace functionality. During the initial investigation, we identified unexpectedly high traffic as the root cause. We pinpointed the main contributors to the significant load increase, communicated with their owners about the changes and further actions, and applied some limits. We have also increased the capacity of the clusters to handle the incoming connections.

However, the issue then occurred again, which means the root cause has yet to be identified.

Our engineering team has since collected and analyzed a much more detailed set of dedicated logs, and we are applying new measures to minimize the impact on our customers.

How are you affected?

Some workspaces hosted in the US-1 datacenter (United States, Chicago, IL) become inaccessible at around 6:18 PM PDT.

No action required!

There is no action required from you right now. We appreciate your patience and understanding as we work through these challenges. We will continue to provide updates as more information becomes available.

To receive email notifications about updates, subscribe to this announcement.

We greatly appreciate your patience in this matter. If you have any direct questions, please reach out to GoodData Support.

Updates

Radek Novacek

While we are reviewing all platform logs, customer activity, and recent potentially related changes, we have so far been unable to identify the root cause of the overload issue, which occurs around 6:18 PM PDT. With this in mind, we are doing everything we can to prevent another occurrence and have taken several steps to reduce the chances of the issue recurring. We added two additional metadata clusters to increase the datacenter's overall capacity to handle more traffic and to decrease the impact should the issue reoccur, and we are currently performing a zero-downtime migration of existing workspace metadata to the new clusters. We also paused non-essential GoodData automated tasks, and we upgraded half of the clusters to a newer database engine version that includes potentially helpful improvements.

We are actively monitoring the situation and will be ready to take action and quickly restore full functionality to the platform if the issue reoccurs in spite of our preventative measures.

Radek Novacek

We have been monitoring the situation, and the platform is currently stable as the issue has not reappeared today. We will continue to monitor our systems closely.

Martin Burian

I am pleased to provide an additional update. The platform remains stable, and the platform-wide issue has not reappeared since our last update. We did notice a minor isolated issue impacting one customer; we are investigating it and are in contact with the affected customer. Our earlier efforts most likely isolated the issue and significantly decreased its blast radius. We will keep you informed.

Jakub Kopecky

Following up with an update on the situation: we can confirm that our platform has remained stable, with no issues since our last update. We believe the steps we have taken have mitigated the issue, as the results so far seem to confirm our suspicions about the root cause. As we transition into the business week, we are closely monitoring for any potential impact from the increased traffic of business days. We will continue the extra monitoring for the next few days until we are confident that the issue is fully resolved.

Jakub Kopecky

The situation is stable, and we are optimistic it will remain so going forward. The extra monitoring is still in place, and we will keep you informed.

Jakub Kopecky

As the issue has not recurred, we are pleased to officially declare it fully resolved. The root cause of this issue was a combination of increased overall load on the metadata clusters from customer activity and internal metadata collection activities used for troubleshooting purposes. This overload led to the exhaustion of resources on the metadata clusters during peak hours, making them unreachable and consequently impacting workspace access, report computation, and ETL processes.

To resolve the issue, we increased the resources in the datacenter by adding two new metadata clusters and adjusted the internal processes for metadata collection activities. We are implementing further actions to ensure this issue does not happen again.

An official RCA (root cause analysis) document is available upon request; requests can be directed to either Support or your Account Owner.

We apologize for any inconvenience this situation has caused. A stable platform is our priority, and we are committed to providing the best possible service.
