Canary deployment with Ruby on Rails on Kubernetes


If you are reading this, chances are that you are using Ruby on Rails as your framework of choice. Historically, there have been two main options for deploying Rails apps. One was Heroku, with its easy git push buildpacks and fire-and-forget deployments. The other was Capistrano, which targets people who manage their own infrastructure and are willing to dedicate more time to maintaining it. Heroku is a great option when you are small enough for it to be cheap. After a certain point, it’s better to manage everything yourself (unless you are a funded startup, then go and burn the money).

Ruby on Rails also has database migrations, which keep the db schema consistent across many databases. All those migrations have to run at some point during your deployment cycle. Heroku builds a packaged app and runs the database migrations and asset precompilation (if necessary) before it activates every deploy. The same thing comes out of the box with Capistrano. It copies the code over to the server, precompiles the assets, and then runs the database migrations. If everything was successful, it creates the current symlink and restarts the server.

Using Kubernetes, you package your application into a (more or less stateless) Docker container. The ‘deployment’ strategy is pulling a new Docker image and restarting the deployment pods using that image. We use rolling updates to achieve zero-downtime deployments, since we tend to deploy many times per day. After a trial-and-error period using initContainers on every deployment pod, I came up with an idea that could produce a true canary deployment strategy. You have to be careful, because there will be a short period when old code is running against the new database schema. I plan to cover database migrations in one of my following posts.

There is one canary deployment whose pod is set up with an initContainer that does the work, plus a busybox container that does nothing once the init finishes:

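	initContainers:
	  # Runs the migrations; the pod only becomes ready if this exits successfully
	  - name: db-migrator
	    image: your-repo/your-container
	    imagePullPolicy: Always
	    command: ["/app/bin/rake", "db:migrate"]
	containers:
	  # Idle placeholder that keeps the pod alive after the init step is done
	  - name: canary-busybox
	    image: busybox
	    ports:
	      - containerPort: 80
	    command: ["/bin/sh", "-ec", "while :; do echo '.'; sleep 5 ; done"]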

In Kubernetes, you use initContainers to set up the state before you run the actual container with the application code. You can also assert some configuration rules (e.g., that the database exists) before the actual container starts. If you don’t trust your CI process, you can run the specs before running the main container. Whatever you run in the initContainers list has to exit successfully before the main app code runs. We are using this to assert that the database migrations (and other pre-deploy tasks we run) succeed.
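As an illustration of asserting a precondition in an initContainer, here is a minimal sketch of a retry helper. It is not part of our actual setup: the `retry` function name, the attempt counts, and the suggested `db:version` check in the comment are assumptions for the sake of the example.

```shell
#!/bin/sh
# retry: run a command until it succeeds or the attempts run out.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while ! "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}

# Hypothetical initContainer entrypoint (assumed, not from the original setup):
# wait until the database answers, then run the migrations.
#   retry 10 3 /app/bin/rake db:version && exec /app/bin/rake db:migrate
```

A non-zero exit from this script would fail the initContainer, so the main container would not start.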

The way we do this is simple: we roll out the new version of the canary deployment and then scale it to 1 replica. The replica starts and runs the initContainer, which runs the database migrations and other pre-deploy scripts until they succeed. When all of those exit with success, the canary pod becomes ready, and we can restart the deployment running the main app. Then we scale the canary deployment back to 0 replicas. This also creates an opportunity to run certain smoke tests before you restart the web services. Doing so can make the system even more reliable and resilient to errors.
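One possible shape for such a smoke test gate is sketched below. The `run_smoke_tests` function and the script paths in the usage comment are hypothetical; what the checks actually verify depends entirely on your application.

```shell
#!/bin/sh
# run_smoke_tests: run each argument as a shell command, in order.
# Returns 0 only if every check exits successfully; stops at the first failure.
run_smoke_tests() {
  for check in "$@"; do
    echo "smoke test: $check"
    if ! sh -c "$check"; then
      echo "smoke test failed: $check" >&2
      return 1
    fi
  done
  echo "all smoke tests passed"
}

# Hypothetical usage, after the canary is ready and before the app restart
# (the script paths are made up for illustration):
#   run_smoke_tests "./smoke/check_schema.sh" "./smoke/check_queue.sh" || exit 1
```

Exiting the deploy script on a failed check leaves the old app pods running, which is the point of gating the restart.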

The actual code that does the whole deployment process is here:

kubectl -n app rollout restart deployment app-canary
kubectl -n app scale deployment/app-canary --replicas=1
# Poll until the canary reports 1/1 ready replicas, i.e. the initContainer succeeded
while : ; do
  kubectl -n app get deployment/app-canary | grep app-canary | gawk '{ print $2 }' | grep "1/1"
  if [ $? -eq 1 ]; then
    echo "Canary container is not up yet, retrying in 30 seconds"
    sleep 30
  else
    echo "Canary container up and running, restarting deployment"
    break
  fi
done

kubectl -n app rollout restart deployment app
kubectl -n app rollout restart deployment app-worker
kubectl -n app scale deployment/app-canary --replicas=0

If you are automating your deployment process, it would be wise to add a timeout and a notification if the deployment fails to complete in a given time. There could be some underlying error frying your production database, and you might not be aware of it while it’s happening.
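A minimal sketch of that idea follows, assuming a 10 minute budget and a placeholder notification. The function names `wait_with_timeout`, `notify_failure` and `canary_ready` are made up for this example, and `date +%s` is assumed to be available (it is on GNU, BSD and busybox).

```shell
#!/bin/sh
# wait_with_timeout: poll a check command until it succeeds or a deadline passes.
# Usage: wait_with_timeout <timeout_seconds> <interval_seconds> <command> [args...]
wait_with_timeout() {
  timeout=$1
  interval=$2
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
  return 0
}

# notify_failure: placeholder hook (hypothetical); swap the echo for a chat
# webhook, pager or mail command of your choice.
notify_failure() {
  echo "DEPLOY FAILED: $*" >&2
}

# canary_ready: succeeds once the canary deployment reports one ready replica.
canary_ready() {
  [ "$(kubectl -n app get deployment/app-canary -o jsonpath='{.status.readyReplicas}')" = "1" ]
}

# Hypothetical usage in the deploy script (10 minute budget, 30 second polls):
#   wait_with_timeout 600 30 canary_ready || { notify_failure "canary not ready"; exit 1; }
```

Depending on your setup, `kubectl rollout status` with its `--timeout` flag may be a simpler way to get the same deadline behaviour.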