Building a Production Kubernetes Cluster on Bare Metal
Setting up a complete Kubernetes cluster on physical servers was one of the most challenging and rewarding projects I've undertaken. Rather than relying on managed services, I wanted to understand how everything actually works under the hood.
Why Bare Metal?
Cloud costs were becoming prohibitive for the startup I was working with, and I wanted to dive deep into infrastructure to understand failure modes better. Managed services are great, but when things go wrong, you need to understand the underlying systems.
Architecture Overview
The setup included:
- 3 physical servers for control plane redundancy
- 5 worker nodes for application workloads
- MetalLB for load balancing
- Calico for networking
- Ceph/Rook for distributed storage
- Istio for service mesh
- Prometheus + Grafana for monitoring
- Keycloak for authentication
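On bare metal there is no cloud load balancer to hand out external IPs, which is why MetalLB appears in the list above. A minimal layer-2 setup looks roughly like this; the pool name, namespace, and address range are illustrative, not the values from my cluster:

```yaml
# Hypothetical MetalLB layer-2 configuration; the address range is an example
# and must be a block of unused IPs on the node network.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: production-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.10.200-192.168.10.250
---
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: production-l2
  namespace: metallb-system
spec:
  ipAddressPools:
    - production-pool
```

With this applied, any Service of type LoadBalancer gets an IP from the pool, and one node answers ARP for it on the local network.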
Key Challenges
When MetalLB wouldn't assign IPs, I had to debug networking at the packet level. When Ceph storage crashed, I learned about consensus algorithms the hard way. These weren't abstract concepts anymore—they were real problems affecting production systems.
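The consensus lesson from Ceph shows up directly in configuration. Ceph's monitors maintain cluster state through a Paxos quorum, so the monitor count determines how many failures the cluster tolerates. A sketch of the relevant Rook fragment, with illustrative names and counts:

```yaml
# Hypothetical Rook CephCluster fragment; names and counts are examples.
apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  mon:
    # Monitors form a Paxos quorum: with 3 monitors the cluster survives
    # losing one, but loses quorum (and blocks I/O) if two fail.
    count: 3
    # Spreading monitors across physical nodes is what makes the quorum
    # meaningful on bare metal.
    allowMultiplePerNode: false
```

Three monitors on three separate servers is the usual floor: an even count adds no extra failure tolerance, since a majority is still required.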
Lessons Learned
The biggest lesson was that building infrastructure isn't about memorizing kubectl commands—it's about understanding how distributed systems fail. Monitor for the failures you can't observe directly, because that's where problems hide.
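"Monitoring what you can't see" can be made concrete with an alert on the absence of data rather than on bad values. A sketch using the Prometheus Operator's PrometheusRule resource; the rule name, namespace, and threshold are assumptions:

```yaml
# Hypothetical PrometheusRule; alert names and thresholds are illustrative.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-visibility
  namespace: monitoring
spec:
  groups:
    - name: blind-spots
      rules:
        # Fires when a node stops reporting entirely -- the failure mode
        # that dashboards charting only healthy nodes never show.
        - alert: NodeScrapeMissing
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "node-exporter on {{ $labels.instance }} has stopped reporting"
```

The `up` metric is generated by Prometheus itself for every scrape target, which makes it a reliable signal even when the target's own metrics have vanished.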