Safe Production Deployments (Part 2) – Traffic Shifting

Published:

November 12, 2020

June 7, 2021

In a previous blog post we demonstrated that it is possible to automatically deploy an isolated test environment of a service whenever a new pull request is opened. In this blog post we extend that proof of concept (POC) by also deploying the service to a production environment. This happens automatically every time we merge something to the master branch.

As described in the previous post, there are at least two ways of achieving this. The first option is to have a dedicated (mutable) production stack which is updated each time a pull request is merged. The alternative is to promote an existing test environment (i.e., a stack that was automatically launched from a development branch when the pull request was opened) to become the new production stack. For this POC we decided to go with the second option, mainly for the following reasons:

  • Faster deploys and rollbacks. Since the new stack is already up and running, the production deployment is as simple as shifting traffic to the new instance. Rollbacks are just as straightforward, by simply reversing the traffic shift.
  • The code that will be run in production is guaranteed to be exactly the same code as the one tested in the test environment.
  • By always deploying new, immutable stacks instead of updating the old one, the risk of the stack getting stuck in a bad state (for example when performing tricky refactorings of resources) is much reduced.

There are several ways to implement traffic shifting, but for this POC we decided to try out a DNS-based approach. Our organization's AWS account already had a hosted zone for the organization's domain name, so in the stack with less frequently changed resources we added a hosted zone for a new subdomain. For the scope of the POC we also performed one manual setup step: adding the NS records that connect the two hosted zones.

The main change compared to the POC from part 1 is the addition of a workflow in our GitHub Actions-based pipeline that is triggered upon every merge to the master branch. The workflow basically does the following:

  • Obtain some information about the service instance deployed in this PR (such as the domain name of the new ALB instance).
  • Use the AWS CLI to point the DNS record of our subdomain to the new ALB instead of the old one.
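
The second step could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: it only builds the change batch that `aws route53 change-resource-record-sets` accepts, and the function name, record name, ALB DNS name and hosted zone IDs are all hypothetical placeholders.

```python
import json


def build_alias_upsert(record_name, alb_dns_name, alb_zone_id):
    """Build a Route 53 change batch that points an alias A record at an ALB.

    alb_zone_id is the load balancer's own canonical hosted zone ID,
    not the ID of the hosted zone that holds the record.
    """
    return {
        "Comment": f"Shift traffic for {record_name} to {alb_dns_name}",
        "Changes": [
            {
                # UPSERT creates the record or overwrites its current target.
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "A",
                    "AliasTarget": {
                        "HostedZoneId": alb_zone_id,
                        "DNSName": alb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    batch = build_alias_upsert(
        "service.example.com.",
        "my-new-alb-123456.eu-west-1.elb.amazonaws.com.",
        "Z0000000EXAMPLE",
    )
    # Save the output to change-batch.json and apply it with the AWS CLI:
    #   aws route53 change-resource-record-sets \
    #     --hosted-zone-id <ID of the subdomain's hosted zone> \
    #     --change-batch file://change-batch.json
    print(json.dumps(batch, indent=2))
```

Under these assumptions a rollback is the same operation with the previous ALB's DNS name and zone ID, which is what makes reversing the shift cheap.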

A downside of DNS-based traffic shifting is that we have limited control over how quickly changes propagate to clients. In our implementation the time-to-live (TTL) of the DNS record is 60 seconds (since we use an alias record, the TTL is inherited from the DNS record of the underlying resource), so changes should propagate relatively quickly, assuming that clients and intermediary hosts respect the TTL. Still, remember to stay backwards-compatible when performing database migrations or API changes, since both the old and the new web clients will be reachable by your end users for a short period of time (due to this potential delay in DNS propagation).

As this is just a POC to demonstrate traffic shifting in practice, there are several improvements that might need to be addressed before using this pipeline in a real application:

  • Currently, if a pull request modifies the stack containing the resources that are shared between all instances, those changes also affect the code running in the production environment. This would need to be mitigated somehow in order to ensure a safe development environment.
  • Stacks that are no longer in use should be removed automatically after a while, perhaps after 1-2 weeks, when we can be sure we do not need to roll back to that version. A scheduled GitHub Actions workflow could be used to achieve this.
  • Weighted routing could be used for doing the traffic shift between environments gradually (i.e. canary deployments). This could also be used for A/B testing.
  • It would be useful to run automated integration tests and monitoring during or after performing the traffic shift, and roll back to the old version automatically in case something goes wrong.
  • More environments, such as a staging environment, could be added between the test environments and the production environment.
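
To make the weighted routing idea above more concrete, here is a rough sketch of a change batch that splits traffic between the old and the new ALB. It is not part of the POC: the function name, set identifiers and targets are hypothetical, and it assumes Route 53 weighted alias records, where each record's share of traffic is its weight divided by the sum of all weights for that name.

```python
def build_canary_batch(record_name, old_target, new_target, new_percent):
    """Build a change batch that sends new_percent of traffic to the new ALB.

    old_target and new_target are (set_identifier, alb_dns_name, alb_zone_id)
    tuples. The two weights always sum to 100, so a weight reads as a percentage.
    """
    if not 0 <= new_percent <= 100:
        raise ValueError("new_percent must be between 0 and 100")

    def weighted_record(target, weight):
        set_identifier, alb_dns_name, alb_zone_id = target
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "A",
                # SetIdentifier tells the weighted records for one name apart.
                "SetIdentifier": set_identifier,
                "Weight": weight,
                "AliasTarget": {
                    "HostedZoneId": alb_zone_id,
                    "DNSName": alb_dns_name,
                    "EvaluateTargetHealth": False,
                },
            },
        }

    return {
        "Comment": f"Send {new_percent}% of traffic to {new_target[0]}",
        "Changes": [
            weighted_record(old_target, 100 - new_percent),
            weighted_record(new_target, new_percent),
        ],
    }


if __name__ == "__main__":
    # Hypothetical placeholders for the two stacks.
    old = ("stack-old", "old-alb.eu-west-1.elb.amazonaws.com.", "Z0000000EXAMPLE")
    new = ("stack-new", "new-alb.eu-west-1.elb.amazonaws.com.", "Z0000000EXAMPLE")
    for percent in (10, 50, 100):  # one possible canary schedule
        batch = build_canary_batch("service.example.com.", old, new, percent)
        print(batch["Comment"])
```

A pipeline could apply one batch per step and check monitoring in between. As far as we know, Route 53 does not let a simple record and weighted records coexist for the same name and type, so the existing simple alias record would have to be replaced before this approach can be used.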

As with the previous blog post, feel free to fork the example repo and try it out yourself. Thanks for reading!

Written by:

Devies