Use GitOps workflow for building a production grade on-premise Kubernetes cluster on cheap VPS provider, with complete CI/CD 🎉
The goal 🎯
This guide is mainly intended for any developers or some SRE who want to build a Kubernetes cluster that respect following conditions:
- On-Premise management (The Hard Way), so no vendor lock in to any managed Kubernetes provider (KaaS/CaaS)
- Hosted on affordable VPS provider (Hetzner), with strong Terraform support, allowing GitOps principles
- High Availability with cloud Load Balancer, resilient storage and DB with replication, allowing automatic upgrades or maintenance without any downtime for production apps
- Include complete monitoring, logging and tracing stacks
- Complete CI/CD pipeline
- Budget target ~$60/month for complete cluster with all above tools, can be far less if no need for HA, CI or monitoring features
What you’ll learn 📚
- Use Terraform to manage your infrastructure, for both cloud provider and Kubernetes, following the GitOps principles
- How to set up an On-Premise resilient Kubernetes cluster, from the ground up, with automatic upgrades and reboot
- Use K3s as lightweight Kubernetes distribution
- Use Traefik as ingress controller, combined to cert-manager for distributed SSL certificates, and secured access attempt to our cluster through Hetzner Load Balancer
- Use Longhorn as resilient storage, installed to dedicated storage nodes pool and volumes, include PVC incremental backups to S3
- Install and configure data stateful components as PostgreSQL and Redis clusters to specific nodes pool via well-known Bitnami Helms
- Test our resilient storage with some No Code apps, as n8n and nocodb, managed by Flux
- Complete monitoring and logging stack with Prometheus, Grafana, Loki
- Mount a complete self-hosted CI pipeline with the lightweight Gitea + Concourse CI combo
- Test above CI tools with a sample .NET API app, using database cluster, with automatic CD using Flux
- Integrate the app to our monitoring stack with OpenTelemetry, and use Tempo for distributed tracing
- Go further with SonarQube for Continuous Inspection on code quality, including automatic code coverage reports
- Do some load testing scenarios with k6 and frontend SPA sample deployment using the .NET API.
You probably don’t need Kubernetes 🪧
All of this is of course overkill for any personal usage, and is only intended for learning purpose or getting a low-cost semi-pro grade K3s cluster.
Docker Swarm is probably the best solution for 99% of people that need a simple container orchestration system. Swarm stays an officially supported project, as it’s built in into the Docker Engine, even if we shouldn’t expect any new features.
I wrote a complete dedicated 2022 guide here that explains all steps in order to have a semi-pro grade Swarm cluster (but no GitOps oriented, using only Portainer UI).
Cluster Architecture 🏘️
Here are the node pools that we’ll need for a complete self-hosted Kubernetes cluster, where each node pool is scalable independently:
| Node pool | Description |
|---|---|
controller | The control planes nodes, use at least 3 or any greater odd number (when etcd) for HA kube API server |
worker | Workers for your production/staging stateless apps |
storage | Dedicated nodes for running Longhorn for resilient storage and DB, in case you won’t use managed databases |
monitor | Workers dedicated for monitoring, optional |
runner | Workers dedicated for CI/CD pipelines execution, optional |
Here a HA architecture sample with replicated storage (via Longhorn) and PostgreSQL DB (controllers, monitoring and runners are excluded for simplicity):
Cloud provider choice ☁️
As a HA Kubernetes cluster can be quickly expensive, a good cloud provider is an essential part.
After testing many providers, as Digital Ocean, Vultr, Linode, Civo, OVH, Scaleway, it seems like Hetzner is very well suited in my opinion:
- Very competitive price for middle-range performance (plan only around $6 for 2 CPU/4 GB for each node)
- No frills, just the basics, VMs, block volumes, load balancer, firewall, and that’s it
- Nice UI + efficient CLI tool
- Official strong Terraform support, so GitOps ready
Please let me know in below comments if you have other better suggestions !
Final cost estimate 💰
| Server Name | Type | Quantity | Unit Price |
|---|---|---|---|
worker | LB1 | 1 | 5.39 |
manager-0x | CX21 | 1 or 3 for HA cluster | 0.5 + 4.85 |
worker-0x | CX21 | 2 or 3 | 0.5 + 4.85 |
storage-0x | CX21 | 2 for HA database | 0.5 + 4.85 |
monitor-0x | CX21 | 1 | 0.5 + 4.85 |
runner-0x | CX21 | 1 | 0.5 + 4.85 |
€0.5 is for primary IPs.
We will also need some expendable block volumes for our storage nodes. Let’s start with 20 GB, 2*0.88.
(5.39+8*(0.5+4.85)+2*0.88)*1.2 = €59.94 / month
We targeted €60/month for a minimal working CI/CD cluster, so we are good !
You can also prefer to take 2 larger cx31 worker nodes (8 GB RAM) instead of 3 smaller ones, which will optimize resource usage, so:
(5.39+7*0.5+5*4.85+2*9.2+2*0.88)*1.2 = €63.96 / month
For an HA cluster, you’ll need to put 2 more cx21 controllers, so €72.78 (for 3 small workers option) or €76.80 / month (for 2 big workers option).
Let’s party 🎉
Enough talk, let’s go Charles !