In my current project, we are building a platform based on the Amazon Kubernetes offering (EKS). Our goal is to provide a blueprint to deploy a fully operational Kubernetes cluster with the minimum effort possible. One of the advantages of using a managed service such as EKS is that it removes a lot of the complicated work of setting up a Kubernetes cluster.

Creating a platform is much more than just having a Kubernetes cluster up and running. There are other important aspects my team needs to oversee, such as security, logging and auditing, networking, maintenance, and CI/CD. In this blog post I will provide a quick overview of some of these problems and how we tackled them.

BUILDING THE BLUEPRINT

Since the beginning of the blueprint our goal was clear: we wanted to provide a fully automated deployment of a Kubernetes cluster and all the required components. The tool of choice was Terraform. We were aware that Terraform is not a perfect tool and that there are some constraints when using it. Terraform can break miserably: it relies on a state file to keep track of existing resources, and that file sometimes does not match the real state of the resources it creates, especially when deploying against a Kubernetes cluster.

There are also circular dependencies to deal with, which can be a bit cumbersome, among other issues. Nevertheless, Terraform provided a solid foundation for us to build our blueprint and to manage multiple kinds of AWS and Kubernetes resources in the same code base.

Another important aspect from the start was that we wanted to make the blueprint as modular as possible: users should be able to enable or disable certain features in the cluster. Here we faced a major issue with Terraform due to the way it processes the resource graph; depends_on can become a nightmare to manage, and you can easily find yourself in a situation where Terraform breaks miserably.
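
Here is a minimal sketch of how such a feature toggle can look; the variable and module names are illustrative, not our actual blueprint code:

    # Feature toggle: the whole module is absent from the plan when disabled.
    variable "enable_monitoring" {
      description = "Whether to deploy the monitoring stack."
      type        = bool
      default     = true
    }

    variable "cluster_name" {
      type = string
    }

    module "monitoring" {
      source = "./modules/monitoring"

      # count keeps the resource graph small when the feature is off, but it
      # is exactly here that depends_on chains become fragile.
      count = var.enable_monitoring ? 1 : 0

      cluster_name = var.cluster_name
    }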


Here are some of the important guidelines we adopted from the beginning:

  • use coding guidelines to make sure all team members write code in a similar manner;
  • avoid null_resource or other kinds of resources whose state you cannot track;
  • avoid using a wide range of providers;
  • avoid embedding raw YAML in Terraform files; use yamlencode instead (see the sketch after this list);
  • always test your changes in at least 2 scenarios: a deployment from zero and an update from the latest version;
  • avoid repeating the same code over and over; if it’s a pattern, create a module.
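
To illustrate the yamlencode guideline, here is a hypothetical example, assuming the hashicorp/kubernetes provider is configured; the configuration data lives as native HCL and is only serialized at the edge:

    # Instead of pasting a raw YAML heredoc that Terraform cannot validate,
    # the settings are written as HCL and serialized with yamlencode.
    resource "kubernetes_config_map" "app_settings" {
      metadata {
        name      = "app-settings"
        namespace = "default"
      }

      data = {
        "settings.yaml" = yamlencode({
          logLevel = "info"
          features = {
            tracing = true
            metrics = true
          }
        })
      }
    }

This way a typo becomes a plan-time error instead of a broken manifest inside the cluster.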


HANDLING KUBERNETES RESOURCES

There are multiple ways to create Kubernetes resources: plain YAML manifests, Helm charts, and tools such as Kustomize. The good news is that Terraform provides a way to use all three types of deployment. The challenge was how we would maintain all these resources inside the blueprint and how we would “patch” them to match our internal requirements.
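
For reference, here is an illustrative sketch of the first two styles in Terraform; the chart, paths, and names are examples, and the Kustomize style is sketched after the guidelines below:

    # Plain YAML manifest, decoded into HCL and applied directly.
    resource "kubernetes_manifest" "namespace" {
      manifest = yamldecode(file("${path.module}/manifests/namespace.yaml"))
    }

    # Helm chart, installed through the hashicorp/helm provider.
    resource "helm_release" "ingress_nginx" {
      name             = "ingress-nginx"
      repository       = "https://kubernetes.github.io/ingress-nginx"
      chart            = "ingress-nginx"
      namespace        = "ingress-nginx"
      create_namespace = true
    }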


Here are some important guidelines we adopted:

  • changing an upstream chart/manifest is not allowed; whenever possible use Kustomize, which lets you patch the upstream resources so the update process produces fewer errors, and fortunately Terraform also has support for using it (a sketch follows this list);
  • if possible, automate the update process of the upstream resources in the base source (e.g. with a shell script);
  • avoid using templates to generate dynamic Kubernetes manifests.
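
As a hypothetical sketch of the Kustomize guideline, using the community kbst/kustomization provider (the overlay path is illustrative), the upstream manifests stay untouched while an overlay carries our patches:

    # Build the overlay that patches the upstream resources.
    data "kustomization_build" "monitoring" {
      path = "${path.module}/kustomize/overlays/monitoring"
    }

    # Apply every rendered manifest, tracked individually in the state.
    resource "kustomization_resource" "monitoring" {
      for_each = data.kustomization_build.monitoring.ids

      manifest = data.kustomization_build.monitoring.manifests[each.value]
    }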

BLUEPRINT COMPONENTS

In the last two topics I described the building blocks of our blueprint. Next I will briefly discuss other important aspects of our platform. As stated before, having a Kubernetes cluster running is just part of the problem; to have a full platform we must tackle other concerns such as security, networking, logging and auditing, and CI/CD.

In this blog post I will only talk about security and networking.

Security

A critical part of any platform is security. If a platform is not secure by default, it will eventually lead to higher operational and maintenance costs and, in the worst-case scenario, business costs. We all know that it is impossible to foresee all possible threats, so the best approach here is to provide a framework that allows us to protect our platform.

There are multiple solutions to this problem. One of the paths we chose was to control all API requests using an admission controller, in our case Gatekeeper, because it allowed us to write our own policies using a well-known language. Gatekeeper is the first component to be deployed; this way we ensure that all other components comply with our security rules.
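
A minimal sketch of that ordering in Terraform, assuming the hashicorp/helm provider is configured; the second component is just an example:

    # Gatekeeper is installed before anything else.
    resource "helm_release" "gatekeeper" {
      name             = "gatekeeper"
      repository       = "https://open-policy-agent.github.io/gatekeeper/charts"
      chart            = "gatekeeper"
      namespace        = "gatekeeper-system"
      create_namespace = true
    }

    resource "helm_release" "logging" {
      name             = "fluent-bit"
      repository       = "https://fluent.github.io/helm-charts"
      chart            = "fluent-bit"
      namespace        = "logging"
      create_namespace = true

      # Every other component waits for the admission controller, so our
      # policies are enforced from the moment the component is installed.
      depends_on = [helm_release.gatekeeper]
    }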


Some of the topics we covered on security were:

  • comply with the CIS Benchmark for Amazon EKS;
  • only allow images from repositories we trust, and by trust we mean our own repositories; all images are scanned for viruses and vulnerabilities (see the sketch after this list);
  • benchmark our cluster using tools like kube-bench on a regular basis.
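
For the trusted-repositories rule, a Gatekeeper constraint can be expressed directly in Terraform. This is a hypothetical sketch that assumes the allowedrepos template from the gatekeeper-library is installed; the registry URL is a placeholder:

    resource "kubernetes_manifest" "allowed_repos" {
      manifest = {
        apiVersion = "constraints.gatekeeper.sh/v1beta1"
        kind       = "K8sAllowedRepos"
        metadata = {
          name = "repos-must-be-trusted"
        }
        spec = {
          match = {
            kinds = [{
              apiGroups = [""]
              kinds     = ["Pod"]
            }]
          }
          parameters = {
            # Only images pulled from our own registry are admitted.
            repos = ["registry.example.com/"]
          }
        }
      }
    }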

Networking

When you deploy a Kubernetes cluster, it provides a software-defined network ready to be used, but it does not offer a full solution; it is up to the platform engineers to adapt it to the platform requirements. In our case the focus was security and making life easier for the development teams.


Some topics we covered on networking were:

  • Network protection: security groups for pods (see the sketch after this list);
  • Service mesh out-of-the-box: AWS App Mesh;
  • Application exposure: API gateway integration.
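
As an example of the first item, EKS lets you bind security groups to individual pods through the SecurityGroupPolicy CRD. The sketch below is hypothetical; the security group ID and labels are placeholders, and it assumes the VPC CNI plugin has pod ENIs enabled:

    resource "kubernetes_manifest" "db_clients_sg" {
      manifest = {
        apiVersion = "vpcresources.k8s.aws/v1beta1"
        kind       = "SecurityGroupPolicy"
        metadata = {
          name      = "db-clients"
          namespace = "default"
        }
        spec = {
          # Pods with this label get the security group attached to their ENI.
          podSelector = {
            matchLabels = { role = "db-client" }
          }
          securityGroups = {
            groupIds = ["sg-0123456789abcdef0"]
          }
        }
      }
    }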


CONCLUSION

As you can see, when building a platform based on Kubernetes we need to focus our efforts on multiple fronts, and because technologies are always evolving, every platform needs to evolve with them. These days there are a lot of tools that allow platform engineers to develop using the same concepts used in software development:

  • abstraction of components allows teams to maintain them with the least possible impact on the platform;
  • identifying patterns and reusing them as modules ensures lower maintenance costs for the platform;
  • describing your platform as IaC ensures stability, and developers can always replicate the same environment;
  • offering our Kubernetes platform as a software artifact subjects it to the same life cycle as a software product; your team can use the same flow your development teams use.

Thank you for reading this blog post. I hope my thoughts here can contribute something to the community of engineers working in this field on a daily basis.
