ZenHub recently migrated its entire production and CI/CD infrastructure to Kubernetes. This blog post documents the history of containers at ZenHub, why we eventually went with Kubernetes, how we got there, and what we are looking to accomplish in the future.
When I joined ZenHub almost two years ago, our deployment process was a script in TravisCI that manually deployed the frontend and backend code to AWS VMs. It had problems; one of the biggest was that it did not always update all of our VMs, causing debugging headaches. We began to research improved deployment solutions, and it quickly became clear that containers were the way to go. Docker has all but become the standard container system, and in addition, our on-prem customers were asking for ZenHub to be delivered as containers instead of a VM.
We researched container deployment technologies and settled on Docker Swarm because it was cloud-agnostic and relatively easy to understand with a small scope of features. After getting our application running inside Docker containers, we started to build out Docker Swarm and additional tools needed to complete the infrastructure:
- Terraform to manage the base cloud infrastructure on GCP.
- Packer to create base images for VMs running Swarm workers.
- Consul to store application configuration and operate as service discovery.
- Vault to store application secrets.
- Jenkins to handle CI and CD.
- Pritunl to manage access to internal services.
After all these tools were set up and ZenHub was running on Docker Swarm, we started to feel the pain of having a completely self-built system. In isolation, each tool is very powerful, but gluing them all together was difficult and hard to manage for our team of two DevOps engineers. The glue became very fragile, and we were not confident making changes to the infrastructure. In addition, it was hard for developers to inspect the running system to debug application problems. Ingesting and searching logs, running one-off commands, or inspecting configuration required finding the VM where the Docker service was running, SSHing into that instance, finding the container ID, and finally running the correct docker commands. This was tedious and annoying to perform on a daily basis, let alone while solving an urgent problem.
After running Docker Swarm for around a year, we came together to review our infrastructure and discuss whether we wanted to migrate. We unanimously decided to investigate Kubernetes, hosted on GKE, to run our container infrastructure. It would solve many of the headaches we were having with Docker Swarm and also set us up to deliver Kubernetes to the on-prem customers who had been requesting it.
After a week-long spike, we were able to run the ZenHub application in Kubernetes locally and on GKE. We reviewed the outcome with the team and decided to continue refining our implementation. We picked a few tools to help us:
- Terraform to manage the base cloud infrastructure on GCP.
- Helm to template our Kubernetes resources across multiple environments.
- Helmfile to manage multiple sets of configuration and secrets for each environment.
- SOPS to allow us to commit encrypted secrets to our Helm chart repository.
- Buildkite for CI/CD.
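As a sketch of how Helmfile, Helm, and SOPS fit together, a minimal `helmfile.yaml` might look like the following. The environment names, chart path, and file layout here are hypothetical, not ZenHub's actual configuration:

```yaml
# helmfile.yaml — hypothetical layout: one values/secrets pair per environment
environments:
  staging:
    values:
      - environments/staging/values.yaml
    secrets:
      # SOPS-encrypted; decrypted transparently at deploy time
      - environments/staging/secrets.yaml
  production:
    values:
      - environments/production/values.yaml
    secrets:
      - environments/production/secrets.yaml

releases:
  - name: zenhub
    namespace: zenhub
    chart: ./charts/zenhub
```

With a layout like this, `helmfile -e staging apply` deploys the chart with the staging values and decrypted staging secrets.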
Along with deploying Kubernetes, we had the challenge of picking a new CI/CD system. Our previous infrastructure used Jenkins, which is a very powerful tool but suffered from the same general problem as the rest of that stack: it was hard to set up and manage. We also briefly researched GCP’s Cloud Build but, unfortunately, it cannot connect to VPCs. Using it would have required us to publicly expose our Kubernetes cluster, which we did not want to do for security reasons. We eventually settled on Buildkite: it provides a great web UI and GitHub integration, and it lets you run its build agents inside your own infrastructure. This gives you a lot of flexibility in how the agents are set up and configured while avoiding having to manage any of the glue. As soon as you have build agents running, you can start creating pipelines in your repositories and triggering builds via GitHub. It also provides additional security, since you don’t have to allow connections from outside your infrastructure.
After two more weeks, we had our staging environment migrated to Kubernetes, and after another few weeks of improvements and testing, we started migrating traffic from our existing Docker Swarm infrastructure to our new Kubernetes infrastructure.
We immediately gained massive usability improvements:
- Logs were automatically ingested and parsed (JSON) for easy searching, filtering, and metric creation.
- GKE provides a relatively simple web interface for quickly inspecting the state of the cluster and each deployment.
- For developers, accessing the cluster was trivial using kubectl commands.
- Simple autoscaling via CPU metrics.
- Easier understanding of how our application is deployed and how to make changes to it.
- Fine-grained permissions via RBAC to grant access to specific Kubernetes resources.
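To illustrate the last few points: the old multi-step SSH workflow collapses into one-line kubectl commands. The deployment and pod names below are hypothetical, but the commands are standard kubectl:

```shell
# Tail logs across a deployment's pods
kubectl logs -f deployment/api --namespace production

# Run a one-off command inside a running pod
kubectl exec -it api-6d4cf56db6-x7kqp -- /bin/sh

# Inspect configuration and recent events
kubectl describe deployment/api --namespace production

# Simple autoscaling on CPU utilization
kubectl autoscale deployment/api --min=3 --max=10 --cpu-percent=70
```

Because access is mediated by kubectl and RBAC, developers never need SSH access to the underlying VMs at all.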
Everything wasn’t perfect, however; we ran into a few gotchas as we rolled out Kubernetes:
- Helm’s templating syntax (Go templates) leaves a lot to be desired, and can sometimes be confusing to read and use.
- Kubernetes resources can be killed and rescheduled when cluster nodes are “rebalanced” - you must set up a PodDisruptionBudget to control how and when this happens. This bit us with long-running Kubernetes jobs that were created right after scaling down other resources: the cluster would reclaim nodes that were no longer needed, which sometimes involved killing the job while it was running, and restarting it on another node.
- Having multiple Docker images deployed by a single Helm chart is hard to manage: you need to either know all the image tags at deploy time or be OK with deploying a default image tag (i.e. master/production), which may trigger an unnecessary deployment update.
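For reference on the rebalancing gotcha above: a PodDisruptionBudget tells Kubernetes how many replicas must stay up during voluntary disruptions such as node drains. A minimal one looks roughly like this (the name and labels are illustrative; the `apiVersion` is `policy/v1` on recent clusters, `policy/v1beta1` on older ones):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  # Never evict below one available replica during a voluntary disruption
  minAvailable: 1
  selector:
    matchLabels:
      app: api
```

Any pods matching the selector are counted against the budget, so node drains triggered by cluster scale-down will wait rather than evict the last available replica.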
We are still refining our Kubernetes deployment and have some ideas to explore in the future:
- Improved autoscaling via custom metrics.
- Using Kustomize, instead of Helm, to configure our Kubernetes resources. We hope this will help fix our Helm deployment problem mentioned above and give the modularity needed to push forward with per pull request environments and ZenHub Enterprise.
- Packaging ZenHub Enterprise 3 using Kubernetes and rolling it out to our on-prem customers.
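The Kustomize idea above maps directly onto the Helm image-tag problem: Kustomize’s built-in `images` transformer lets each overlay pin its own tags without templating. A hypothetical production overlay might look like this (image names and tags are illustrative):

```yaml
# overlays/production/kustomization.yaml — hypothetical overlay
resources:
  - ../../base
images:
  - name: zenhub/api
    newTag: "1.42.0"
  - name: zenhub/frontend
    newTag: "1.42.0"
```

Running `kubectl apply -k overlays/production` then renders the shared base with only those tags changed, so deploying one image never forces an unnecessary update of the others.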
Where are we today? Overall we are very pleased with Kubernetes and our new infrastructure. There are still many things to improve but, in general, we are in a much better position than we were with Docker Swarm.
Our infrastructure is more stable, easier to scale and modify, and (maybe most importantly) easier to understand. We also gained better insights via the out-of-the-box metrics provided by Kubernetes and GKE, and finally we improved our security via GCP IAM and Kubernetes RBAC.
So, what’s next? Currently, we are focusing our efforts on bringing this new Kubernetes-based infrastructure to our on-prem customers so they can start seeing some of the same benefits.