Unable to access API and Web Interface

Incident Report for Treasure AI

Postmortem

On Tuesday, Feb 28th (09:40 to 15:32 PST / 17:40 to 23:32 UTC), our main REST API experienced an extended outage.

The root cause was a service disruption of AWS S3 (Simple Storage Service) in the US EAST 1 hosting region (Virginia). Operation of our APIs has now fully recovered.

The REST API is the single gateway through which customers interact with our service, so the outage impacted the majority of our functionality.

Root Cause

The issues were caused by a service disruption of AWS S3 (Simple Storage Service) in the US EAST 1 hosting region (Virginia).

Impaired services

Imports rely on the ability to upload data to our S3 buckets used for storage:

Streaming import (td-agent, fluentd) exhibited high API error rates that backed up td-agent / fluentd instances for up to 4 hours. As long as the buffers allocated to those td-agent / fluentd instances were large enough to hold the dispatched logs, no data was lost. All requests were eventually accepted and processed after the incident was resolved.
Bulk Import showed high API error rates that caused the related CLI commands to error out. Those commands will need to be executed again manually by the customer in order to recover.
Data Connector had high API error rates that caused the related CLI commands to error out. Those commands will need to be executed again manually in order to recover.

The Connector UI in the new Console was unreachable altogether.

Execution of Data Connector jobs and schedules was delayed for the duration of the incident but eventually all pending work was completed.
The Web Uploader in the old web Console was unable to stage files for upload in the Console. The problem persisted for the duration of the incident and prevented use of the file uploader entirely.
Mobile SDK (Android, iOS, Unity) and JavaScript SDK APIs successfully received all the data without any loss. However, the incoming data remained buffered for 4 hours, and the events were fully processed once the incident was resolved.
Heroku Logdrain (from TD Heroku addons) correctly received all requests and buffered the events for 4 hours. No data was lost, and the events were fully processed once the incident was resolved.
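Whether a streaming import client rides out an outage like this one without data loss comes down to simple arithmetic: the buffer has to hold everything produced while the API is rejecting requests. The sketch below illustrates that calculation. The function names, the event rate, the event size, and the headroom factor are hypothetical values chosen for illustration, not measurements from this incident or td-agent / fluentd settings.

```python
import math


def required_buffer_bytes(events_per_sec: float, avg_event_bytes: float,
                          outage_hours: float, headroom: float = 1.5) -> int:
    """Bytes needed to hold every event produced during an outage.

    headroom is an assumed safety multiplier (>= 1) for bursts and overhead.
    """
    if events_per_sec < 0 or avg_event_bytes < 0 or outage_hours < 0:
        raise ValueError("rates, sizes, and durations must be non-negative")
    if headroom < 1:
        raise ValueError("headroom must be at least 1")
    outage_seconds = outage_hours * 3600
    return math.ceil(events_per_sec * avg_event_bytes * outage_seconds * headroom)


def survives_outage(buffer_bytes: int, events_per_sec: float,
                    avg_event_bytes: float, outage_hours: float) -> bool:
    """True if the buffer can hold the raw backlog (no headroom applied)."""
    needed = required_buffer_bytes(events_per_sec, avg_event_bytes,
                                   outage_hours, headroom=1.0)
    return buffer_bytes >= needed


if __name__ == "__main__":
    # Hypothetical workload: 500 events/s of 300 bytes each, 4-hour outage.
    print(required_buffer_bytes(500, 300, 4))
```

With these assumed numbers the raw backlog is 500 x 300 x 14,400 = 2.16 GB, so a 1 GB buffer would have overflowed before the 4 hours were up.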

Query execution needs to be able to read/retrieve data from our S3 storage buckets:

The execution of Presto and Hadoop/Hive queries was blocked during the incident. New requests to execute queries (from the web Console and REST APIs) were sporadically rejected due to the high API load; those that were accepted remained queued until the incident was resolved. Eventually all queued queries executed normally to completion.
The execution of Result Output to 3rd party services for queries submitted from the web Console, submitted from the REST APIs, or run on a schedule was delayed and retried several times. Once the incident was resolved, queries resumed normal execution.

The new web Console (https://console.treasuredata.com/app/) was not reachable because the JavaScript assets were hosted on S3.

Some functionality remained available in our REST APIs and legacy web Console, but because of the outages described above, usability remained severely limited.

Timeline

Incident detected

The incident began on Tue Feb 28th, 09:40 PST / Tue Feb 28th, 17:40 UTC.

Read operations recovered

At around Tue Feb 28th, 12:50 PST / Tue Feb 28th, 20:50 UTC the AWS status page reported that S3 read access was recovering while they continued to experience issues with write operations.

As a result, the following functionality recovered:

Execution of Presto and Hadoop/Hive queries
New web Console (https://console.treasuredata.com/app/)

While query execution recovered in the backend, our service was still unable to complete the entire lifecycle of a query, which culminates in uploading the query results and logs to S3. As a consequence, queries ran and completed but failed to upload their output, and were therefore retried with an increasing delay between attempts (exponential backoff).
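For readers unfamiliar with the term, exponential backoff can be sketched as follows. This is a generic illustration rather than our production retry logic: the function names, the base delay, the multiplication factor, the cap, and the attempt limit are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def delay_for(attempt: int, base: float = 1.0, factor: float = 2.0,
              cap: float = 300.0) -> float:
    """Delay before the retry that follows failed attempt number `attempt` (0-based)."""
    return min(cap, base * factor ** attempt)


def retry_with_backoff(operation: Callable[[], T], max_attempts: int = 8,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call operation until it succeeds, waiting longer after each failure.

    Only OSError is treated as retryable here; that choice is an assumption.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return operation()
        except OSError as exc:
            last_error = exc
            if attempt == max_attempts - 1:
                break  # no point sleeping after the final attempt
            sleep(delay_for(attempt))
    raise RuntimeError(f"gave up after {max_attempts} attempts") from last_error
```

With these assumed defaults the waits grow as 1, 2, 4, 8, ... seconds up to a 300-second cap. Real systems often add random jitter to each delay so that many retrying clients do not hit a recovering service at the same instant.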

At the same time, the API still had intermittent trouble accepting new query requests and sporadically returned errors.

Write operations recovered

At Tue Feb 28th, 14:08 PST / Tue Feb 28th, 22:08 UTC, the AWS status page declared the outage completely recovered, including write operations (creation of new objects).

At around the same time our monitoring highlighted the gradual recovery of disrupted functionalities.

Because provisioning of EC2 EBS (Elastic Block Store) volumes was also impaired at this point, we were still unable to launch new server instances to consume the requests that had accumulated during the outage and were waiting to be processed.

Mitigation of access flood

Starting at Tue Feb 28th, 14:36 PST / Tue Feb 28th, 22:36 UTC, we recovered the ability to add new servers. We began adding capacity, nearly doubling it in certain cases.

The extra capacity gave our system the resources it needed to work through the backlog of requests accumulated during the outage more quickly.

Resolution

On Tue Feb 28th, 15:32 PST / Tue Feb 28th, 23:32 UTC our system returned to normal operations. We continued to monitor the system closely, but all issues had been resolved.

Learnings and Conclusions

While the root cause of this problem was completely outside of our control, this incident was an important learning opportunity for us.

Our system was built to rely on AWS S3. While AWS S3 is well known for high availability and reliability, being completely dependent on its availability renders us vulnerable, because we have no viable mitigation strategy when issues occur. As we gear up to revisit our import infrastructure, this is a fundamental design requirement we plan to take into account.

In the grand scheme of things, the inability to execute queries or access the web Console is minor compared to the issues arising from import requests being rejected (Streaming ingestion, Bulk import, Web Uploader) or staying unprocessed (Mobile SDK / JavaScript SDK ingestion APIs, Data Connector, Heroku logdrain). While the former is a problem the system is generally designed to recover from, the latter is liable to cause longer-term impact.

More specifically for Streaming ingestion, the inability to accept incoming payloads caused back-pressure on our customers’ td-agent / fluentd-based logging instances, causing them to buffer events for the duration of the outage, which in this case was extensive. We realize that the architecture doesn’t lend itself well to handling this type of trouble, and we are already actively pursuing changes to our Streaming ingestion pipeline that will make it capable of ‘staging’ customers’ incoming payloads in our infrastructure even in the event of an outage similar to the one experienced today.
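The general idea of ‘staging’ can be sketched as an ingestion path that falls back to a local spool when the primary store is unavailable and drains that spool once the store recovers. The sketch below is purely illustrative and does not describe our actual or planned design: the `StagingWriter` class, its method names, the file naming scheme, and the use of `OSError` to signal an unavailable store are all hypothetical.

```python
from pathlib import Path
from typing import Callable


class StagingWriter:
    """Accept payloads even when the primary store is down by spooling to disk."""

    def __init__(self, upload: Callable[[bytes], None], spool_dir: Path):
        self._upload = upload          # hypothetical uploader; raises OSError on failure
        self._spool_dir = spool_dir
        self._spool_dir.mkdir(parents=True, exist_ok=True)
        # Continue numbering after any files left over from an earlier run.
        existing = [int(p.stem) for p in self._spool_dir.glob("*.payload")]
        self._seq = max(existing, default=0)

    def accept(self, payload: bytes) -> str:
        """Store the payload, or stage it locally if the store is unavailable."""
        try:
            self._upload(payload)
            return "stored"
        except OSError:
            self._seq += 1
            path = self._spool_dir / f"{self._seq:012d}.payload"
            path.write_bytes(payload)
            return "staged"

    def drain(self) -> int:
        """Upload staged payloads oldest first; stop at the first failure."""
        drained = 0
        for path in sorted(self._spool_dir.glob("*.payload")):
            try:
                self._upload(path.read_bytes())
            except OSError:
                break
            path.unlink()
            drained += 1
        return drained
```

The point of the design is that the client always receives an acknowledgement, so back-pressure never reaches the customer's logging agents. Note that in this simplified version a payload accepted after recovery can be stored ahead of older staged payloads, so it suits only data that tolerates reordering.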

Last but not least, we are fully aware that whilst the outage was not caused by us, it remains our responsibility to take the necessary precautions to guarantee our customers as high a quality of service and reliability as possible, which transcends the infrastructure provider(s) we rely on.

We apologize for the inconvenience this outage has caused and appreciate your patience.

If you have any questions concerning this incident, please feel free to contact us at support@treasuredata.com.

Sincerely,

The Treasure Data Team

Posted Feb 28, 2017 - 23:51 PST

Resolved

All systems have been operating normally for over an hour. This incident has been resolved. We will follow up with a postmortem later.

Posted Feb 28, 2017 - 16:59 PST

Monitoring

The service is now fully operational.

Posted Feb 28, 2017 - 15:38 PST

Update

The extra import server capacity added earlier completed the processing of the waiting import task queue. The streaming import pipeline has now returned to normal operations.

We will monitor the capacity for a while longer before proceeding to remove the redundant capacity.

Posted Feb 28, 2017 - 15:37 PST

Update

We are now able to add new server capacity to process the pending requests, jobs, and queries accumulated during the incident faster.
In the meantime, our monitoring shows the increase of waiting requests/tasks is now under control.

Posted Feb 28, 2017 - 14:49 PST

Update

The AWS S3 problem has been resolved, but AWS still has problems adding more machine resources.

We will keep monitoring the system so that it can safely process the remaining jobs that accumulated during the incident.

Posted Feb 28, 2017 - 14:31 PST

Update

We have observed some recovery of the data processing. We will keep monitoring the system until it becomes fully functional.

Posted Feb 28, 2017 - 13:39 PST

Update

AWS has reported a recovery of reading data from S3. Recovery of uploading data to S3 is ongoing.

Posted Feb 28, 2017 - 13:04 PST

Update

We are still experiencing the AWS S3 issue. We will report if there is any update.
Thank you for your patience.

Posted Feb 28, 2017 - 11:41 PST

Identified

We have identified the issue in our backend storage service (AWS S3). We will keep monitoring the service status.

To access our web interface, you can use this address as a temporary solution: https://console.treasuredata.com/databases
Note, however, that the query engines are still not operational.

We sincerely apologize for this inconvenience.

Posted Feb 28, 2017 - 10:33 PST

Update

We are seeing issues in AWS S3, which is used for our backend storage.

Posted Feb 28, 2017 - 10:04 PST

Investigating

We are currently experiencing problems accessing the TD API and web interface. Query processing (Presto and Hive) is also affected.

Posted Feb 28, 2017 - 09:53 PST

This incident affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Data Connector Integrations, Hadoop / Hive Query Engine, Presto Query Engine, Presto JDBC/ODBC Gateway).