On Tuesday, Feb 28th (both PST and UTC), our main REST API experienced an extended outage.
The root cause was a service disruption of AWS S3 (Simple Storage Service) in the US EAST 1 hosting region (Virginia). Our APIs have now fully recovered.
The REST API is the single gateway through which customers interact with our service, so the outage impacted the majority of our functionality.
The issues were caused by a service disruption of AWS S3 (Simple Storage Service) in the US EAST 1 hosting region (Virginia).
Imports rely on the ability to upload data to the S3 buckets we use for storage.
Query execution needs to be able to read data from our S3 storage buckets.
The new web Console (https://console.treasuredata.com/app/) was not reachable because the JavaScript assets were hosted on S3.
Some functionality remained available in our REST API and legacy web Console, but because of the outages described above, usability remained severely limited.
The incident began on Tue Feb 28th, 09:40 PST / Tue Feb 28th, 17:40 UTC.
At around Tue Feb 28th, 12:50 PST / Tue Feb 28th, 20:50 UTC the AWS status page reported that S3 read access was recovering while they continued to experience issues with write operations.
As a result, functionality that only needs to read from S3 began to recover.
While query execution recovered in the backend, our service was still unable to complete the entire lifecycle of a query, which ends with uploading the query results and logs to S3. As a consequence, queries ran to completion but failed to upload their output, so they were retried with an increasing time gap between retries (exponential backoff).
At the same time, the API still had intermittent trouble accepting new query requests and sporadically returned errors because of the elevated error rates.
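The retry behavior described above can be sketched as follows. This is only a minimal illustration of exponential backoff, not our actual implementation: the function names, the base delay, the multiplier, and the cap are all assumptions chosen for the example.

```python
import time


def backoff_delays(base=1.0, factor=2.0, cap=300.0, retries=5):
    """Return the wait in seconds before each retry: base * factor**n, capped.

    All default values are hypothetical and chosen only for illustration.
    """
    return [min(cap, base * factor ** n) for n in range(retries)]


def upload_with_backoff(upload, delays, sleep=time.sleep):
    """Call upload() until it succeeds, sleeping longer after each failure.

    `upload` is a hypothetical callable that raises OSError when the
    storage backend rejects the write.
    """
    for delay in delays:
        try:
            return upload()
        except OSError:
            sleep(delay)  # wait before the next attempt
    # Final attempt: if this one fails too, the error propagates to the caller.
    return upload()
```

With these example values the waits grow as 1, 2, 4, 8 and 16 seconds. This is why, during a long outage, a query whose upload keeps failing is retried less and less frequently rather than hammering the storage service.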
At Tue Feb 28th, 14:08 PST / Tue Feb 28th, 22:08 UTC, the AWS status page declared the outage completely recovered, including write operations (creation of new objects).
At around the same time our monitoring highlighted the gradual recovery of disrupted functionalities.
Because provisioning of EC2 EBS (Elastic Block Store) volumes was also impaired at this point, we were still unable to launch new server instances to work through the requests that had accumulated during the outage and were waiting to be processed.
Starting at Tue Feb 28th, 14:36 PST / Tue Feb 28th, 22:36 UTC, we recovered the ability to add new servers. We began adding capacity, nearly doubling it in certain cases.
The extra capacity gave our system the resources it needed to work through the backlog of requests accumulated during the outage more quickly.
On Tue Feb 28th, 15:32 PST / Tue Feb 28th, 23:32 UTC, our system returned to normal operations. We continued to monitor the system closely, but all issues had been resolved.
While the root cause of this problem was completely outside of our control, this incident was an important learning opportunity for us.
Our system was built to rely on AWS S3. While AWS S3 is well known for its high availability and reliability, being completely dependent on its availability renders us vulnerable, because we have no viable mitigation strategy when issues do occur. As we are gearing up to revisit our import infrastructure, this is a fundamental design requirement we plan to take into account.
In the grand scheme of things, the inability to execute queries or access the web Console is minor compared to the issues arising from import requests being rejected (Streaming ingestion, Bulk import, Web Uploader) or staying unprocessed (Mobile SDK / JavaScript SDK ingestion APIs, Data Connector, Heroku logdrain). While the former is a problem the system is generally designed to recover from, the latter is likely to cause longer-term impact.
More specifically, for Streaming ingestion, the inability to accept incoming payloads caused back-pressure on our customers’ td-agent / fluentd-based logging instances, forcing them to buffer events for the duration of the outage, which in this case was extensive. We realize that the current architecture does not handle this type of trouble well, and we are already actively pursuing changes to our Streaming ingestion pipeline that will make it capable of ‘staging’ customers’ incoming payloads in our infrastructure even in the event of an outage similar to the one experienced today.
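To illustrate the 'staging' idea in general terms, the sketch below accepts a payload even when the primary store is failing and forwards it later. It is a simplified assumption of how such a pipeline could behave, not a description of our planned design: the class, its methods, and the in-memory queue are hypothetical, and a real staging area would need durable storage that does not depend on the failing service.

```python
from collections import deque


class StagingIngest:
    """Hypothetical ingestion front end that stages payloads during an outage."""

    def __init__(self, store, max_staged=10000):
        self.store = store            # callable; raises OSError when storage is down
        self.staged = deque()         # in-memory staging area (illustration only)
        self.max_staged = max_staged  # bound on staged payloads

    def accept(self, payload):
        """Return True if the payload was stored or staged, False if rejected."""
        try:
            self.store(payload)
            return True
        except OSError:
            if len(self.staged) >= self.max_staged:
                return False          # staging is full: back-pressure reaches the client
            self.staged.append(payload)
            return True

    def drain(self):
        """Forward staged payloads in order; stop at the first failure."""
        sent = 0
        while self.staged:
            try:
                self.store(self.staged[0])
            except OSError:
                break                 # storage still failing; try again later
            self.staged.popleft()
            sent += 1
        return sent
```

In this sketch clients only see a rejection once the staging area itself is full, so a short outage of the primary store would not force them to buffer on their side. Note that the sketch does not preserve ordering between staged payloads and payloads accepted after recovery.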
Last but not least, we are fully aware that whilst the outage was not caused by us, it remains our responsibility to take the necessary precautions to guarantee our customers as high a quality of service and reliability as possible, which transcends the infrastructure provider(s) we rely on.
We apologize for the inconvenience this outage has caused and appreciate your patience.
If you have any questions concerning this incident, please feel free to contact us at support@treasuredata.com.
Sincerely,
The Treasure Data Team