
So AWS Went Down. Here’s How You Can Be Prepared If It Happens Again


Rafał Wiliński

Updated Feb 21, 2024 • 4 min read

Yesterday, some of us developers were hit hard by the AWS outage.

Many websites went down for a few hours, and some of us couldn’t even continue working on our projects because services like CircleCI were out as well.

Let me give you a quick recap, in case you don’t know what I’m talking about. Amazon’s S3 web-based storage service was experiencing widespread issues, which left the websites, apps and devices that rely on it either partially or fully broken.

There’s one essential misconception about what happened. Everybody is saying “AWS is down”, but in fact, only 1 out of 14 regions went down. Your services were only affected if your infrastructure was based in the us-east-1 region. What can you do to prepare for the next outage?

1. Active-Active or Active-Passive Failover on Route53

The term Active-Active is not AWS-specific; it comes from the world of “high availability”. Basically, it means that the traffic intended for a failed node is either passed on to a working node or load-balanced across the remaining nodes. In our case, the failed “node” might be a whole region or a whole AZ. This approach is somewhat expensive because it requires you to run a full copy of your infrastructure in another region.

The Active-Passive approach is somewhat cheaper. Fully redundant instances are only brought online when their associated primary node fails. Note that this approach comes with extra time overhead, because you need to provision and warm up your servers before they can take traffic.
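To make the difference between the two modes concrete, here is a minimal sketch in plain Python. It is not AWS code: the `Node` class, the `standby` flag and both function names are made up for illustration, and the region names are just example values.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """A hypothetical node: could stand for a server, an AZ or a whole region."""
    name: str
    healthy: bool = True
    standby: bool = False  # True for a passive node that normally gets no traffic


def active_active_targets(nodes):
    """Active-Active: every healthy node shares the traffic; failed nodes are skipped."""
    return [n.name for n in nodes if n.healthy]


def active_passive_targets(nodes):
    """Active-Passive: standbys only take traffic when no primary is healthy."""
    primaries = [n.name for n in nodes if n.healthy and not n.standby]
    if primaries:
        return primaries
    return [n.name for n in nodes if n.healthy and n.standby]
```

In the Active-Passive case, the standby node shows up in the result only after the primary is marked unhealthy, which is exactly where the provisioning and warm-up delay would hit you in real life.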

Luckily, AWS provides a very handy tool to detect this type of failure and automatically re-route all traffic. It’s called Route53. By definition, it’s a scalable cloud Domain Name System (DNS), but it also comes with some additional features.

The two additional features I’m talking about are “Health Checks” and “Traffic Policies”. We use them often at Netguru. Once a health check detects that your endpoint is unreachable, Route53 starts routing traffic to a failover endpoint/node.
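The idea behind a health check can be sketched in a few lines of Python. This is a simplified, single-checker model and not how Route53 is implemented internally (the real service probes from many locations and aggregates the results). The class name and its attributes are hypothetical; the defaults of three failures and a 30-second interval mirror the values used in the console example in this post.

```python
class HealthCheckTracker:
    """Simplified model: an endpoint is unhealthy after N consecutive failed probes."""

    def __init__(self, failure_threshold=3, interval_seconds=30):
        self.failure_threshold = failure_threshold
        self.interval_seconds = interval_seconds
        self.consecutive_failures = 0

    def record(self, probe_ok):
        """Record one probe result; return True while the endpoint counts as healthy."""
        if probe_ok:
            # Simplification: a single successful probe resets the counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.is_healthy

    @property
    def is_healthy(self):
        return self.consecutive_failures < self.failure_threshold

    @property
    def approx_detection_seconds(self):
        """Rough time to notice an outage: threshold times probe interval."""
        return self.failure_threshold * self.interval_seconds
```

With the defaults, it takes roughly a minute and a half of failed probes before the endpoint is declared down, so failover is quick but not instant.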

Configuring such a scheme is fairly simple:

1. Go to Route53 -> Health Checks -> Create health check

2. You should see the following screen:

image00-6.png

In the example above, I’m checking if netguru.com is up and running. Checks are performed every 30 seconds, and if three consecutive checks fail, the health check reports a failure.

3. The second piece of our scheme is the policy. Go to Route53 -> Traffic Policies and create a new policy.

4. Creating a policy is pretty straightforward. All we need to do is add a Failover Rule and add primary and secondary endpoints, like so:

image01-4.png

During the next outage, all your traffic should be redirected to an unaffected node automatically, without your supervision or need to issue explicit commands.
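If you prefer scripting over clicking, a similar setup can be sketched with boto3. Note that this sketch swaps in a different mechanism than the console steps above: instead of a Traffic Policy, it uses plain failover record sets (one PRIMARY, one SECONDARY) tied to a health check. The helper names, the record name, the IP addresses and the TTL are all assumptions for illustration; `route53` is assumed to be a `boto3.client("route53")` instance that you pass in.

```python
import uuid


def health_check_config(domain, path="/", interval=30, threshold=3):
    """Build the HealthCheckConfig payload for create_health_check."""
    return {
        "Type": "HTTPS",
        "FullyQualifiedDomainName": domain,
        "Port": 443,
        "ResourcePath": path,
        "RequestInterval": interval,    # seconds between probes
        "FailureThreshold": threshold,  # consecutive failures before "unhealthy"
    }


def failover_change_batch(record_name, primary_ip, secondary_ip, health_check_id, ttl=60):
    """Build a ChangeBatch with a PRIMARY and a SECONDARY failover record."""
    def record(role, ip, extra):
        rrset = {
            "Name": record_name,
            "Type": "A",
            "SetIdentifier": f"{role.lower()}-endpoint",
            "Failover": role,
            "TTL": ttl,
            "ResourceRecords": [{"Value": ip}],
        }
        rrset.update(extra)
        return {"Action": "UPSERT", "ResourceRecordSet": rrset}

    return {
        "Comment": "Failover pair (illustrative)",
        "Changes": [
            # Only the primary record is tied to the health check.
            record("PRIMARY", primary_ip, {"HealthCheckId": health_check_id}),
            record("SECONDARY", secondary_ip, {}),
        ],
    }


def apply_failover(route53, hosted_zone_id, domain, record_name, primary_ip, secondary_ip):
    """Create the health check, then upsert both failover records."""
    created = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),  # must be unique per request
        HealthCheckConfig=health_check_config(domain),
    )
    check_id = created["HealthCheck"]["Id"]
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch=failover_change_batch(record_name, primary_ip, secondary_ip, check_id),
    )
    return check_id
```

Keeping the payload builders separate from the API calls makes it easy to review or unit-test the configuration before anything touches your hosted zone.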

2. Cross-region asset replication

If you aren’t going to use an Active-Active/Passive configuration, you can still use automatic asset replication. Once enabled, every object uploaded to a particular S3 bucket is automatically replicated to a designated destination bucket located in a different AWS region. This is particularly useful in cases similar to the recent outage.
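For those who would rather script it, here is a rough boto3 sketch of the same idea. The function names, the rule ID and the IAM role ARN are placeholders I made up; `s3` is assumed to be a `boto3.client("s3")` instance, the destination bucket is assumed to already exist in another region with versioning enabled, and the role is assumed to grant S3 permission to replicate objects.

```python
def versioning_configuration():
    """Versioning has to be enabled before replication can be configured."""
    return {"Status": "Enabled"}


def replication_configuration(role_arn, destination_bucket, rule_id="replicate-everything"):
    """Build a ReplicationConfiguration that copies every object to another bucket."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to replicate objects (placeholder)
        "Rules": [
            {
                "ID": rule_id,
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches all objects
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(s3, source_bucket, destination_bucket, role_arn):
    """Turn on versioning for the source bucket, then attach the replication rule."""
    s3.put_bucket_versioning(
        Bucket=source_bucket,
        VersioningConfiguration=versioning_configuration(),
    )
    s3.put_bucket_replication(
        Bucket=source_bucket,
        ReplicationConfiguration=replication_configuration(role_arn, destination_bucket),
    )
```

Keep in mind that replication only applies to objects uploaded after the rule is in place, so existing assets would need to be copied over separately.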

Enabling this feature is also very easy and might save you a lot of money.

1. Head to the bucket that you’d like to replicate

2. Go to Properties. Before enabling cross-region replication, you have to enable Versioning first.

3. Once you’ve done that, you also need to adjust how your app serves static content. The quickest solution is to switch the S3 assets’ hostname (from http://mybucket.s3.amazonaws.com to http://my-backup-bucket.s3.amazonaws.com) whenever a particular CloudWatch alarm (in our case, the one associated with health checks) is triggered.

image02-2.png
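The hostname switch from the last step could look roughly like this in Python. The two hostnames come from the example above; the function names and the alarm name you would pass in are hypothetical, and `cloudwatch` is assumed to be a `boto3.client("cloudwatch")` instance. How often you poll the alarm and where you cache the result is up to your app.

```python
PRIMARY_ASSET_HOST = "http://mybucket.s3.amazonaws.com"
BACKUP_ASSET_HOST = "http://my-backup-bucket.s3.amazonaws.com"


def asset_host(alarm_state):
    """Pick the asset hostname; only the ALARM state switches to the backup bucket."""
    return BACKUP_ASSET_HOST if alarm_state == "ALARM" else PRIMARY_ASSET_HOST


def asset_url(path, alarm_state):
    """Build the full URL of a static asset for the current alarm state."""
    return f"{asset_host(alarm_state)}/{path.lstrip('/')}"


def current_alarm_state(cloudwatch, alarm_name):
    """Read the alarm state (OK, ALARM or INSUFFICIENT_DATA) from CloudWatch."""
    response = cloudwatch.describe_alarms(AlarmNames=[alarm_name])
    alarms = response["MetricAlarms"]
    # Treat a missing alarm as "no data" so the app keeps using the primary bucket.
    return alarms[0]["StateValue"] if alarms else "INSUFFICIENT_DATA"
```

Falling back to the primary bucket on anything other than ALARM is a deliberate choice here: a missing or undecided alarm shouldn’t silently move all your asset traffic to the backup.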

In conclusion, there are indeed a few things you can do yourself to mitigate an outage as big as the recent one. In other, less serious cases, things like Multi-AZ deployments should be enough.
