how easy it was to get ourselves into a bad state with tools that aren’t typically considered dangerous
if your Fivetran sync ever goes down, the WAL will grow continuously and unboundedly as the production DB keeps adding and changing data.
Note: In the case of the MySQL binlog, you just set a retention period and… you lose data older than that window, but you don’t get unbounded growth.
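Note: a minimal sketch of what that looks like in practice, assuming MySQL 8.0+ and the mysql-connector-python driver (the connection details and the 7-day window are illustrative, not from the article):

```python
# Sketch: cap MySQL binlog retention so a stalled consumer can't fill the disk.
# Assumes MySQL 8.0+ (binlog_expire_logs_seconds) and mysql-connector-python.
import mysql.connector

conn = mysql.connector.connect(
    host="db.example.internal",  # placeholder connection details
    user="admin",
    password="...",
)
cur = conn.cursor()

# Purge binary logs older than 7 days, regardless of whether a downstream
# consumer (e.g. a stalled Fivetran sync) has read them yet.
cur.execute("SET GLOBAL binlog_expire_logs_seconds = 604800")  # 7 days

# Check how much binlog is currently retained on disk.
cur.execute("SHOW BINARY LOGS")
total_bytes = sum(row[1] for row in cur.fetchall())
print(f"binlog retained: {total_bytes / 1e9:.1f} GB")

cur.close()
conn.close()
```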
because Fivetran wasn’t consuming the WAL, it started building up (and up… and up…) on our production DB instance.
Note: But I wonder why nobody noticed for 8 days that Fivetran replication wasn’t working…
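Note: this is exactly the kind of check that would have caught it early. A minimal monitoring sketch, assuming psycopg2 and Postgres 10+ (the connection string and the 50 GB threshold are made up); pg_replication_slots shows each slot and how much WAL it is holding back:

```python
# Sketch: alert when a replication slot is retaining too much WAL, which is
# what a stalled Fivetran sync looks like from the database's point of view.
# Assumes psycopg2 and Postgres 10+; threshold and DSN are illustrative.
import psycopg2

ALERT_THRESHOLD_BYTES = 50 * 1024**3  # e.g. page someone past 50 GiB of retained WAL

conn = psycopg2.connect("dbname=prod host=db.example.internal user=monitor")
with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT slot_name,
               active,
               pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
        FROM pg_replication_slots
    """)
    for slot_name, active, retained_bytes in cur.fetchall():
        if retained_bytes and retained_bytes > ALERT_THRESHOLD_BYTES:
            # Hook this into PagerDuty/Slack/etc. instead of printing.
            print(f"ALERT: slot {slot_name} (active={active}) "
                  f"is retaining {retained_bytes / 1024**3:.1f} GiB of WAL")
conn.close()
```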
we were stuck waiting for the optimization step, unable to further scale our database until it finished. This was the point at which we called AWS Support, where we learned there was absolutely nothing they could do to terminate or accelerate the optimization process. And worse, after 12 hours, we were only 75% of the way done.
So we kicked off a backup, expecting it to take about 20 minutes, the length of our typical nightly backup.
Reader, it did not take 20 minutes.
What we had neglected to account for was that the nightly backups of our production database are incremental. But our replica had never been backed up before, which means the backup was starting from scratch. And doing anything with 1.8 TB – even copying it from one place to another – takes a long time.
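Note: a quick back-of-the-envelope check makes the point; the throughput figure is an assumption, not something the article states:

```python
# Back-of-envelope: why a first, non-incremental copy of 1.8 TB is slow no matter what.
size_bytes = 1.8e12            # 1.8 TB
throughput = 200e6             # assumed ~200 MB/s effective snapshot throughput

print(f"~{size_bytes / throughput / 3600:.1f} hours")  # ~2.5 hours at this assumed rate
```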
Back on the phone with AWS, they helpfully informed us that after 20-some minutes, our backup was only 39% complete, and that there was absolutely nothing they could do to terminate or accelerate the process (sound familiar?).
PostgreSQL 13 has the ability to limit the size of the WAL, which in this scenario might lead to some internal data consistency issues, but will prevent a larger outage
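Note: presumably this refers to max_slot_wal_keep_size, added in PostgreSQL 13. A sketch of setting it and checking slot health, assuming psycopg2 (the 200GB cap and connection string are illustrative; on RDS the parameter would be set through a DB parameter group rather than ALTER SYSTEM):

```python
# Sketch: cap how much WAL a lagging replication slot may retain, and check
# which slots are at risk. Assumes Postgres 13+ and psycopg2; values are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=prod host=db.example.internal user=postgres")
conn.autocommit = True  # ALTER SYSTEM cannot run inside a transaction block
with conn.cursor() as cur:
    # Once a slot falls more than 200 GB behind, Postgres invalidates it instead
    # of filling the disk. (On RDS this is set via a parameter group instead.)
    cur.execute("ALTER SYSTEM SET max_slot_wal_keep_size = '200GB'")
    cur.execute("SELECT pg_reload_conf()")

    # New in Postgres 13: wal_status / safe_wal_size on pg_replication_slots.
    cur.execute("""
        SELECT slot_name, wal_status, pg_size_pretty(safe_wal_size) AS headroom
        FROM pg_replication_slots
    """)
    for row in cur.fetchall():
        print(row)
conn.close()
```

The trade-off the highlight mentions is the flip side of this cap: once a slot is invalidated, the consumer (here, Fivetran) can no longer resume from the WAL and needs a full re-sync.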
conduct quarterly fire drills with all engineers