Yesterday evening, between approximately 19:30 CET and 1:00 CET today, parts of our tado° backend systems had limited availability. For affected customers, this meant that the tado° app was very slow to open or, once it opened, showed devices and zones as disconnected. Manual control on the devices themselves was not affected and continued to work normally.
We want to sum up what happened:
At around 19:45 CET, our monitoring and alerting systems notified us of a slowdown in our central server application. Our immediate reaction was to resize internal queues and database connection pools to meet the increasing load. It took multiple iterations of deploying new versions until our central application was healthy again.
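For readers interested in the technical side, the sketch below illustrates what resizing a database connection pool can look like. It assumes a Java service using the HikariCP library; the library choice, the connection URL, and the numbers are illustrative assumptions, not our actual configuration.

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

public class PoolSizingExample {
    public static void main(String[] args) {
        // Illustrative only: HikariCP and all values here are assumptions,
        // not a description of our actual backend configuration.
        HikariConfig config = new HikariConfig();
        config.setJdbcUrl("jdbc:postgresql://db.example.internal:5432/app"); // hypothetical database URL
        config.setMaximumPoolSize(100);     // raised from a smaller default to absorb the load spike
        config.setMinimumIdle(20);          // keep warm connections ready for bursts
        config.setConnectionTimeout(5_000); // fail fast instead of queueing callers indefinitely
        try (HikariDataSource dataSource = new HikariDataSource(config)) {
            // hand the data source to the application's persistence layer
        }
    }
}
```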
To achieve this, we also had to temporarily shed load from this service by stopping some of our device gateway servers, which caused some devices to lose their connection. This first part of the incident was resolved at 22:15 CET, and all customers were able to open the tado° app again and see their home status.
When we brought the device gateways back up at around 22:30 CET, they first had to handle our entire fleet of devices reconnecting at practically the same time. This wave of reconnections repeatedly overwhelmed device gateway instances right after they were started. Only after we scaled up the number of device gateways were we able to work through the load from reconnecting devices more quickly. This brought most devices back online at around 0:15 CET. Another restart of a slow device gateway led to all devices being connected again at around 0:45 CET.
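As background on why such a reconnection wave is hard to absorb: a common way to spread reconnections over time is randomized exponential backoff on the client side. The sketch below is purely illustrative of that general technique and does not describe our actual device firmware or gateway logic.

```java
import java.util.concurrent.ThreadLocalRandom;

public class ReconnectBackoff {
    // Illustrative only: randomized exponential backoff spreads a reconnection
    // wave over time so that a freshly started server is not hit all at once.
    public static void reconnectWithBackoff(Runnable connect) throws InterruptedException {
        long delayMillis = 1_000;           // start with a short wait
        final long maxDelayMillis = 60_000; // cap the wait at one minute
        while (true) {
            try {
                connect.run();              // attempt to (re)connect to the gateway
                return;                     // connected: stop retrying
            } catch (RuntimeException e) {
                // add random jitter so devices do not all retry at the same instant
                long jitter = ThreadLocalRandom.current().nextLong(delayMillis);
                Thread.sleep(delayMillis + jitter);
                delayMillis = Math.min(delayMillis * 2, maxDelayMillis);
            }
        }
    }
}
```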
We use various auto-scaling features provided by our cloud provider, Amazon Web Services, to adjust the number of backend servers to the observed load at all times, without manual intervention. We have monitoring and alerting in place for cases where scaling up is not enough or other errors occur. We are continuously working on improving our monitoring and our communication towards customers.
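For the technically curious, the sketch below shows what a target-tracking auto-scaling policy can look like when created with the AWS SDK for Java. The group name, policy name, and target value are hypothetical examples, not our actual setup.

```java
import software.amazon.awssdk.services.autoscaling.AutoScalingClient;
import software.amazon.awssdk.services.autoscaling.model.MetricType;
import software.amazon.awssdk.services.autoscaling.model.PredefinedMetricSpecification;
import software.amazon.awssdk.services.autoscaling.model.PutScalingPolicyRequest;
import software.amazon.awssdk.services.autoscaling.model.TargetTrackingConfiguration;

public class TargetTrackingPolicyExample {
    public static void main(String[] args) {
        // Illustrative only: names and numbers are hypothetical, not our AWS configuration.
        try (AutoScalingClient autoScaling = AutoScalingClient.create()) {
            autoScaling.putScalingPolicy(PutScalingPolicyRequest.builder()
                    .autoScalingGroupName("device-gateway-asg") // hypothetical group name
                    .policyName("keep-cpu-around-60-percent")
                    .policyType("TargetTrackingScaling")
                    .targetTrackingConfiguration(TargetTrackingConfiguration.builder()
                            .predefinedMetricSpecification(PredefinedMetricSpecification.builder()
                                    .predefinedMetricType(MetricType.ASG_AVERAGE_CPU_UTILIZATION)
                                    .build())
                            .targetValue(60.0) // add or remove instances to hold average CPU near 60%
                            .build())
                    .build());
        }
    }
}
```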
Yesterday’s incident, however, was related to the rapid growth in our user base and in the number of connected devices that we have been seeing again since September this year.
Sometimes, simply increasing the number of servers through auto-scaling is not enough. For the past weeks we have been working on improving our systems to deal with this increased load, and following yesterday’s incident we will give this work even higher priority across our development department. In this way, we aim to make sure that an incident like this does not happen again.
We are very sorry for any inconvenience this caused!
Your tado° Team