[US/Tokyo Region] Performance issues

Incident Report for Treasure Data

Postmortem

Post-Mortem Summary

This incident was related to database maintenance performed on January 19, during which management of Plazma table resources was migrated from the existing RDBMS to a dedicated Aurora database. The purpose of this change was to better isolate table resource workloads and improve long-term scalability.

The migration itself completed successfully, and both the original database and the newly introduced Aurora database were operating normally immediately after the maintenance.

On the following day, when internal batch processes began running, the newly introduced database experienced unexpected load. Investigation determined that the internal batch system was configured to connect to the master database endpoint instead of a reader endpoint. As a result, internal batch processing competed with user-facing requests, causing increased latency when accessing table resources.
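
As an illustration of the endpoint mix-up described above, the following is a minimal sketch of a check that flags read-oriented workloads pointed at a writer endpoint. It relies on the standard Aurora hostname pattern, in which reader endpoints contain `.cluster-ro-` and writer (cluster) endpoints contain `.cluster-`. The workload names, the `read_only` flag, and the hostnames are hypothetical and are not taken from our actual configuration.

```python
def aurora_endpoint_role(host: str) -> str:
    """Classify an Aurora hostname by its naming pattern.

    Returns "reader", "writer", "custom", "instance", or "unknown".
    """
    normalized = host.lower().rstrip(".")
    labels = normalized.split(".")
    if len(labels) < 2 or not normalized.endswith(".rds.amazonaws.com"):
        return "unknown"
    second = labels[1]
    # Order matters: "cluster-ro-" and "cluster-custom-" also start with "cluster-".
    if second.startswith("cluster-ro-"):
        return "reader"
    if second.startswith("cluster-custom-"):
        return "custom"
    if second.startswith("cluster-"):
        return "writer"
    return "instance"


def find_misrouted_workloads(workloads: dict) -> list:
    """Return messages for read-only workloads that do not use a reader endpoint.

    `workloads` maps a workload name to a dict with the hypothetical keys
    "db_host" (str) and "read_only" (bool).
    """
    problems = []
    for name, cfg in workloads.items():
        role = aurora_endpoint_role(cfg["db_host"])
        if cfg.get("read_only") and role != "reader":
            problems.append(f"{name}: read-only workload uses {role} endpoint {cfg['db_host']}")
    return problems


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    example = {
        "internal-batch": {
            "db_host": "example-tables.cluster-abc123.us-east-1.rds.amazonaws.com",
            "read_only": True,
        },
        "api": {
            "db_host": "example-tables.cluster-abc123.us-east-1.rds.amazonaws.com",
            "read_only": False,
        },
    }
    for problem in find_misrouted_workloads(example):
        print(problem)
```

A check of this kind could run at deploy time, so that a misrouted workload is caught before it competes with user-facing traffic.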

To mitigate the impact and enable rapid recovery, temporary database tuning was applied. This tuning was intended solely as a short-term measure to stabilize the system and has since been fully reverted. The permanent fix consisted of correcting the configuration of the internal batch system so that it accesses the appropriate database endpoint.

Following these actions, system performance recovered in both the US and Tokyo regions, and all services returned to normal operation. No data loss or data corruption occurred.

To prevent similar issues in the future, we are reviewing our configuration and deployment practices, including reducing configuration differences between staging and production environments and strengthening validation of database endpoints used by internal workloads.
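
One way to surface configuration differences between staging and production is a simple key-by-key comparison. The sketch below is illustrative only: the configuration keys, values, and the set of keys expected to differ are hypothetical and do not describe our actual tooling.

```python
MISSING = "<missing>"  # Placeholder shown when a key exists in only one environment.


def diff_configs(staging: dict, production: dict, expected_to_differ=frozenset()) -> dict:
    """Return {key: (staging_value, production_value)} for every unexpected difference.

    Keys listed in `expected_to_differ` (for example, hostnames) are skipped.
    """
    diffs = {}
    for key in sorted(set(staging) | set(production)):
        if key in expected_to_differ:
            continue
        s_val = staging.get(key, MISSING)
        p_val = production.get(key, MISSING)
        if s_val != p_val:
            diffs[key] = (s_val, p_val)
    return diffs


if __name__ == "__main__":
    # Hypothetical settings for an internal batch workload.
    staging = {"db_host": "staging-db.example", "db_endpoint_role": "reader", "pool_size": 10}
    production = {"db_host": "prod-db.example", "db_endpoint_role": "writer", "pool_size": 10}
    for key, (s_val, p_val) in diff_configs(staging, production, {"db_host"}).items():
        print(f"{key}: staging={s_val!r} production={p_val!r}")
```

Listing the keys that are allowed to differ makes every other difference an explicit review item rather than something discovered under load.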

Posted Jan 26, 2026 - 10:38 PST

Resolved

Incident Impact Summary (Plazma / Table Resources)

We have confirmed that the system in the Tokyo region is operating normally with no remaining issues.

What happened
- All operations accessing Plazma table resources experienced increased latency.

When
- US region: 17:00 PST – 18:40 PST
- Tokyo region: 09:00 JST – 12:50 JST

Temporary impact during recovery
- Updates to table row counts and preview data were temporarily paused.

Recovery status
- US region: Table metadata updates resumed at 20:00 PST.
- Tokyo region: Table metadata updates resumed at 13:40 JST.
Both regions are now fully recovered.

Data integrity
- No data loss or data corruption occurred.

We will prepare and publish a post-mortem document to explain the circumstances under which this incident occurred one day after the recent system maintenance.

With this confirmation, the incident is now closed.

Posted Jan 20, 2026 - 21:07 PST

Monitoring

The system in the US region is operating normally with no observed issues.

In the Tokyo region, table metadata updates have also been resumed. At this point, the Tokyo region is expected to be fully recovered, and we will continue to monitor system performance.

Once stability is fully confirmed, we plan to close this status page.

Posted Jan 20, 2026 - 20:41 PST

Update

In the US region, table metadata updates have been resumed. At this point, the system in the US region is expected to be fully recovered, and we are continuing to monitor it closely.

For the Tokyo region, metadata updates will be resumed after we complete confirmation of system performance. We will provide another update when this step begins.

Posted Jan 20, 2026 - 20:12 PST

Update

Performance in the Tokyo region improved at approximately 12:50 JST. The Tokyo region had been experiencing gradually increasing performance degradation since around 09:00 JST.
We are currently proceeding with the steps to resume updates of table metadata in Data Workbench.

Because metadata updates are temporarily paused, application metrics that rely on this information, such as Parent Segment size, are also temporarily not being updated.

We will continue monitoring the system and share further updates as needed.

Posted Jan 20, 2026 - 20:06 PST

Update

We are continuing to work on a fix for this issue.

Posted Jan 20, 2026 - 19:40 PST

Identified

We identified the same issue in the Tokyo region and applied database parameter tuning. We are currently assessing the impact of this change on system performance.
As a result of this temporary mitigation, applied in both the US and Tokyo regions, the following impact has been observed:
- In Data Workbench, updates to table metadata (row counts and preview data) are temporarily paused.
We will continue monitoring and provide further updates as we make progress.

Posted Jan 20, 2026 - 19:37 PST

Investigating

Starting at approximately 17:00 PST, we observed increased latency affecting all operations accessing table resources in Plazma.
At around 19:00 PST, we applied temporary tuning to the database managing table resources and confirmed recovery based on internal performance metrics.

We are currently continuing to monitor the system to ensure overall performance remains stable.

Visible impacts are:

- Streaming, Mobile, and JavaScript/Browser imports may be delayed
- Job execution may be delayed
- Table creation may fail with errors
- Console execution may be delayed
- Plazma Public API may return errors
- Treasure Workflow may be delayed or fail
- Workflow REST API may be unavailable
- Workflow operations in Console and CLI may become unavailable
- Presto JDBC/ODBC queries and CDP segmentation queries may fail
- Console Table preview updates may be delayed
- The REST API to submit and cancel jobs may be unavailable
- Data Connector Integrations may become unavailable
- ADH (Ads Data Hub) and DCR (Data Clean Room) services may be unavailable
- ADL (Active Data Layer) service may be unavailable

We will send an additional update in 30 minutes.

Posted Jan 20, 2026 - 18:55 PST

This incident affected: US (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Data Connector Integrations, Hadoop / Hive Query Engine, Presto Query Engine, Presto JDBC/ODBC Gateway, Workflow, CDP API, Data Access API (beta), ADL) and Tokyo (Web Interface, REST API, Streaming Import REST API, Mobile/Javascript REST API, Data Connector Integrations, Hadoop / Hive Query Engine, Presto Query Engine, Presto JDBC/ODBC Gateway, Workflow, CDP API, Data Access API (beta), ADL).