Building a Production Kubernetes Cluster on Bare Metal
Setting up a complete Kubernetes cluster on physical servers was one of the most challenging and rewarding projects I've undertaken. Rather than relying on managed services, I wanted to understand how everything actually works under the hood.
Why Bare Metal?
Cloud costs were becoming prohibitive for the startup I was working with, and I wanted to dive deep into infrastructure to understand failure modes better. Managed services are great, but when things go wrong, you need to understand the underlying systems.
Architecture Overview
The setup included:
- 3 physical servers for control plane redundancy
- 5 worker nodes for application workloads
- MetalLB for load balancing
- Calico for networking
- Ceph/Rook for distributed storage
- Istio for service mesh
- Prometheus + Grafana for monitoring
- Keycloak for authentication
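On bare metal there is no cloud load balancer to hand out external IPs, which is why MetalLB appears in the list above. A minimal layer-2 setup looks roughly like this; the pool name, namespace, and address range are illustrative, not the values from my cluster:

```yaml
# Hypothetical MetalLB layer-2 configuration; the address range is an example
# and must be a block of unused IPs on the node network.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.200-192.168.10.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: production-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
```

With this applied, any Service of type LoadBalancer gets an IP from the pool, and one node answers ARP for it on the local network.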
Key Challenges
When MetalLB wouldn't assign IPs, I had to debug networking at the packet level. When Ceph storage crashed, I learned about consensus algorithms the hard way. These weren't abstract concepts anymore—they were real problems affecting production systems.
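The consensus lesson from Ceph shows up directly in configuration. Ceph's monitors maintain cluster state through a Paxos quorum, so the monitor count determines how many failures the cluster tolerates. A sketch of the relevant Rook fragment, with illustrative names and counts:

```yaml
# Hypothetical Rook CephCluster fragment; names and counts are examples.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  mon:
    # Monitors form a Paxos quorum: with 3 monitors the cluster survives
    # losing one, but loses quorum (and blocks I/O) if two fail.
    count: 3
    # Spreading monitors across physical nodes is what makes the quorum
    # meaningful on bare metal.
    allowMultiplePerNode: false
```

Three monitors on three separate servers is the usual floor: an even count adds no extra failure tolerance, since a majority is still required.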
Lessons Learned
The biggest lesson was that building infrastructure isn't about memorizing kubectl commands—it's about understanding how distributed systems fail. Monitor for the failures you can't observe directly, because that's where problems hide.
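"Monitoring what you can't see" can be made concrete with an alert on the absence of data rather than on bad values. A sketch using the Prometheus Operator's PrometheusRule resource; the rule name, namespace, and threshold are assumptions:

```yaml
# Hypothetical PrometheusRule; alert names and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-visibility
  namespace: monitoring
spec:
  groups:
    - name: blind-spots
      rules:
        # Fires when a node stops reporting entirely -- the failure mode
        # that dashboards charting only healthy nodes never show.
        - alert: NodeScrapeMissing
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "node-exporter on {{ $labels.instance }} has stopped reporting"
```

The `up` metric is generated by Prometheus itself for every scrape target, which makes it a reliable signal even when the target's own metrics have vanished.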