Safe production deployments (part 1) – isolated test environments

Published: November 6, 2020

Updated: June 7, 2021

In this blog post we describe how we implemented a proof of concept for deploying an isolated instance of an AWS-hosted service on every pull request. The purpose is to enable extensive quality control of the service by actually deploying it before any code is merged, which in turn gives faster feedback cycles and shorter lead times. This workflow also eliminates the risk of changes made by different developers interfering with each other when tested in parallel.

First, some background. In many projects we have worked with, the process for deploying new code has been:

  1. Open a pull request
  2. Static code analysis
  3. Unit tests
  4. Build step
  5. Manual deploy to test environment (optional)
  6. Merge the pull request
  7. Deploy to test environment; run integration tests during/after the deploy (and roll back if needed)
  8. Deploy to production environment

This traditional approach has some drawbacks. First, since new code isn't deployed before it's merged, there is a risk that issues, whether in the new code or in infrastructure changes, aren't discovered until after the merge. Any issue discovered after merging leaves the master branch broken, and the team is blocked from deploying until it is fixed. Second, infrastructure changes may fail to apply to the currently running stack. In the best case the changes are rolled back, but in the worst case the rollback fails and the stack ends up in a bad state, which may or may not impact end users.

What if we could deploy to an isolated test environment for every pull request? This would let us discover potential issues that arise during deployment, and build more confidence in our change by running integration tests against the isolated test environment, all of this before we merge the code to the master branch.

To try this out, we built a simple web application running on an ECS cluster in AWS (you will find the example code here). This proof of concept successfully demonstrates that it's possible to launch any number of instances of the same service, isolated from each other, and to access each of these instances separately.

The infrastructure consists of two stacks: one with resources that rarely change (VPC, ECR, ECS cluster), and one with the actual web application (an ECS service) together with an Application Load Balancer (ALB). In addition to the service itself, we built a pipeline with GitHub Actions. The pipeline runs when a pull request is opened or updated, and deploys a completely new "service stack" (an isolated test environment) for each individual pull request. Since each new stack is fronted by its own Application Load Balancer, it gets a unique URL for accessing the web application. This enables you to try out your changes in isolation, without fear of affecting anyone else who depends on your service.
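The pipeline needs a deterministic, collision-free name for each per-pull-request stack (and, if you route by hostname, a matching URL). The example repository may well derive these differently; the following is a minimal illustrative sketch in which the function names, the `{service}-pr-{number}` scheme, and the domain are all assumptions, not taken from the post:

```python
import re


def pr_stack_name(service: str, pr_number: int) -> str:
    """Build a per-pull-request stack name (hypothetical naming scheme).

    CloudFormation stack names may contain only letters, digits, and
    hyphens, must start with a letter, and are capped at 128 characters,
    so we validate against those constraints.
    """
    name = f"{service}-pr-{pr_number}"
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9-]{0,127}", name):
        raise ValueError(f"invalid stack name: {name}")
    return name


def pr_service_url(service: str, pr_number: int, domain: str) -> str:
    """Hypothetical URL scheme: one hostname per isolated PR environment."""
    return f"https://{service}-pr-{pr_number}.{domain}"
```

For example, `pr_stack_name("webapp", 42)` yields `webapp-pr-42`, so re-running the pipeline for the same pull request updates the same stack instead of creating a new one.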

Note that we do not duplicate the whole service for each pull request, only a part of it, because the duplicated stacks depend on some resources that must already exist, such as the ECR registry. To define which resources are duplicated in each environment, we have drawn what we call an "isolation boundary": resources that make up the actual service reside within the boundary, and less frequently changed infrastructure outside it. The downside is that we can't test the complete service for each pull request. However, some resources will always need to live outside the boundary in order to handle traffic shifting between different stacks in a real-life scenario.

This proof of concept currently lacks a production environment deployed from the master branch. There are at least two ways to achieve this. The most straightforward option is a permanent production environment, on which we trigger a stack update with the new changes every time the master branch is updated. The downside is that the same stack update was never performed in the test environment (since that environment was created from scratch), so we cannot be certain that the resulting stacks will be identical.

Another option is to leave the test environment untouched and instead promote it to become the new production environment, by shifting traffic to it from the old production environment. This approach is slightly more complicated, but it has some benefits. Most importantly, we can be certain that the code we have tested is exactly what will run in production. Deployment is also faster, since we just shift traffic from one stack to another, and a rollback is as simple as reversing the traffic shift. This will be the topic of part 2 of this series.

Feel free to fork the example repository and play around with it yourself. Note that this strategy is not limited to Dockerized services: it should be fairly easy to apply the same approach to any kind of architecture, for example a set of AWS Lambdas fronted by an Application Load Balancer.
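Promotion by traffic shifting, as described above, is typically implemented by gradually adjusting the weights of a weighted forward action on a load balancer listener, where ALB accepts integer weights. As a minimal sketch of what such a weight schedule could look like (the function name and the 0–100 percentage scale are illustrative assumptions, not details from the post):

```python
def shift_weights(step: int, total_steps: int) -> tuple[int, int]:
    """Return (old_weight, new_weight) for one step of a gradual traffic shift.

    Weights are expressed on a 0-100 scale so they read as percentages.
    step=0 sends all traffic to the old stack; step=total_steps completes
    the promotion. Rolling back just means replaying the steps in reverse.
    """
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("step out of range")
    new_weight = round(100 * step / total_steps)
    return 100 - new_weight, new_weight
```

For example, a four-step promotion produces the schedule (100, 0), (75, 25), (50, 50), (25, 75), (0, 100); each pair would be applied to the listener rule's two target groups, pausing between steps to watch error rates.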

Written by:

Devies