Scratch-building infra - cluster bare autoscaler
Scratch-building infra?
This is a post in a series where I explain how I build, configure, and maintain the whole infrastructure that I use on a daily basis. It is a private data center (DC), consisting of 14+ k8s nodes (depending on the day, week, or month) and hosting a number of various services. Since this has been an ongoing effort for many years now, I decided to shed some light on it and share it with others.
You may follow this blog (there is an RSS feed), or just follow the RSS feed for the tag #scratchbuild-infra
Cluster Bare Autoscaler (CBA)?
My basement "data-center" currently runs about 15 k8s nodes. Three of those are control-plane nodes, and the rest are workers. For the last few years I've been running the nodes underutilized, meaning I tried to keep the overall load average low. I've been adding more and more services, but I've also added a few nodes here and there. The hardware setup varies a bit; generally these are HP thin clients (T-620 and T-630, with 8-16 GB of RAM and dual or quad cores). Additionally, these nodes are low-power devices (average consumption per unit is 9 W).
Since I've recently been working with the K8S Cluster Autoscaler, I started thinking about automatic scale-down of my cluster. And scale-up, of course!
Cluster Autoscaler has pluggable implementations for various cloud providers, and it treats nodeGroups as first-class citizens. Scale-ups and scale-downs happen on a nodeGroup basis.
This is something that does not exist in my cluster. I don't have a nodeGroup notion - just k8s workers (of various kinds, properly labeled across hardware types). I thought about forking the CA project and hacking around it, but I gave up, as it would take a lot of spaghetti code and many additional abstractions to get it working with my setup.
Also, my nodes are bare-metal boxes bootstrapped with Kubespray. The bootstrap process is time-consuming, so I don't think a scale-up (in my conditions) should mean bootstrapping a brand-new k8s node.
This is when I started thinking about building a different kind of autoscaler, one that would match my use case. I called it Cluster Bare Autoscaler.
This project is "work-in-progress" now. Follow this blog (or the GitHub repo) for updates.
Scale-down operation
The idea is simple - with 15 nodes running in my cluster, I could scale down one by one once utilization falls below a specific threshold. The threshold could be based on CPU, memory, IO waits, load average, or some combination of those. Maybe even some higher-level metrics like the SLOs of the services running on top of the cluster (this could be tricky, as there are a lot of dependencies that can cause a service SLO to flap - but you get the point).
And from here it's quite a simple design (at least at the top level - the details are of course more complex!). When underutilization is detected, the autoscaler triggers a scale-down of one node. After this operation there is a cool-down period, which is required for the performance metrics to stabilize. Once this period finishes, the autoscaler can verify its decision and consider an additional scale-down.
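To make the idea more concrete, here is a minimal sketch of such a control loop in Go. Everything in it is hypothetical - the threshold, the check interval, and the clusterUtilization/scaleDownOneNode helpers are placeholders for illustration, not the actual CBA code:

```go
package main

import (
	"log"
	"time"
)

const (
	scaleDownThreshold = 0.4              // hypothetical: utilization below which we try to scale down
	coolDownPeriod     = 15 * time.Minute // wait for metrics to stabilize after a scale operation
	checkInterval      = time.Minute
)

// clusterUtilization is a placeholder for real metric aggregation
// (CPU, memory, load average, IO waits - whatever combination is chosen).
func clusterUtilization() float64 { return 0.3 }

// scaleDownOneNode is a placeholder for the drain + power-off operation.
func scaleDownOneNode() error { return nil }

func main() {
	var lastAction time.Time
	for {
		time.Sleep(checkInterval)

		// Respect the cool-down period after the previous operation.
		if !lastAction.IsZero() && time.Since(lastAction) < coolDownPeriod {
			continue
		}

		if u := clusterUtilization(); u < scaleDownThreshold {
			log.Printf("underutilization detected (%.2f), scaling down one node", u)
			if err := scaleDownOneNode(); err != nil {
				log.Printf("scale-down failed: %v", err)
				continue
			}
			lastAction = time.Now()
		}
	}
}
```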
How this "scale-down" operation would work technically? It would be a simple k8s drain -> systemctl poweroff. In essence, a node would be powered off.
Scale-up operation
As opposed to scale-down, a scale-up would be triggered when an overutilization threshold is crossed. This could mean that a combination of the selected metrics (mentioned above), observed over a specific period of time, indicates that the nodes are simply doing too much and overall system performance is degraded.
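One simple way to express "over the threshold for a specific period of time" is to require that the last N samples are all above the limit, so a short spike doesn't wake up a node. A small sketch (the window size and threshold values are made up):

```go
package main

import "fmt"

// sustainedAbove reports whether the last `window` samples were all above
// the threshold, i.e. the overutilization was not just a short spike.
func sustainedAbove(samples []float64, threshold float64, window int) bool {
	if len(samples) < window {
		return false
	}
	for _, s := range samples[len(samples)-window:] {
		if s <= threshold {
			return false
		}
	}
	return true
}

func main() {
	loadSamples := []float64{0.6, 0.9, 0.95, 0.92, 0.97}
	fmt.Println(sustainedAbove(loadSamples, 0.85, 4)) // true - last 4 samples above 0.85
}
```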
Technically, the scale-up is just a "power on the node" operation. It would be pluggable, as there are a number of ways to power-cycle a server - be it Wake-on-LAN, IPMI, or any other means. In my case this will be WOL, as that is what my thin clients support.
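For reference, a Wake-on-LAN magic packet is just 6 bytes of 0xFF followed by the target MAC address repeated 16 times, sent over UDP to a broadcast address. A minimal Go sketch (the MAC and broadcast address below are example values only):

```go
package main

import (
	"bytes"
	"fmt"
	"net"
)

// sendWOL broadcasts a Wake-on-LAN magic packet for the given MAC address.
// Port 9 is the common convention; adjust the broadcast address to your subnet.
func sendWOL(mac, broadcast string) error {
	hwAddr, err := net.ParseMAC(mac)
	if err != nil {
		return err
	}

	// Magic packet: 6 bytes of 0xFF followed by the MAC repeated 16 times.
	packet := bytes.Repeat([]byte{0xFF}, 6)
	for i := 0; i < 16; i++ {
		packet = append(packet, hwAddr...)
	}

	conn, err := net.Dial("udp", broadcast)
	if err != nil {
		return err
	}
	defer conn.Close()

	_, err = conn.Write(packet)
	return err
}

func main() {
	// Example values only - use the node's MAC and your subnet's broadcast address.
	if err := sendWOL("aa:bb:cc:dd:ee:ff", "192.168.1.255:9"); err != nil {
		fmt.Println("WOL failed:", err)
	}
}
```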
Node Groups vs. Node Types
I mentioned before that Cluster Autoscaler works with the nodeGroup notion. My situation is different, as I don't have any node groups. What I do instead is label nodes with specific nodeType labels to differentiate them.
Afterward, cluster services target specific node types (using taints/tolerations or node affinities/anti-affinities).
So considering this, scale-up and scale-down should not treat the cluster as a whole, but rather consider each node type separately.
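A rough sketch of what this could look like: nodes are grouped by their nodeType label, and each group gets its own scaling settings. The struct fields and the "node-type" label key are illustrative assumptions, not the final CBA configuration:

```go
package main

import "fmt"

// NodeTypeConfig holds hypothetical per-nodeType scaling settings.
type NodeTypeConfig struct {
	Label              string  // value of the node-type label, e.g. "t630"
	ScaleDownThreshold float64 // utilization below which a node may be powered off
	ScaleUpThreshold   float64 // utilization above which a node should be woken up
}

// Node is a simplified view of a cluster node for this sketch.
type Node struct {
	Name   string
	Labels map[string]string
}

// groupByNodeType buckets nodes by their node-type label so that each
// group can be evaluated independently.
func groupByNodeType(nodes []Node) map[string][]Node {
	groups := map[string][]Node{}
	for _, n := range nodes {
		key := n.Labels["node-type"]
		groups[key] = append(groups[key], n)
	}
	return groups
}

func main() {
	nodes := []Node{
		{Name: "worker-01", Labels: map[string]string{"node-type": "t620"}},
		{Name: "worker-02", Labels: map[string]string{"node-type": "t630"}},
		{Name: "worker-03", Labels: map[string]string{"node-type": "t630"}},
	}
	for nodeType, group := range groupByNodeType(nodes) {
		fmt.Printf("%s: %d nodes\n", nodeType, len(group))
	}
}
```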
Additional considerations
One of the most important questions is: what happens if a k8s worker node is turned off for a long time - days, weeks? I tried answering this, and it looks like nothing goes wrong. It just joins the cluster again. Worker nodes are stateless, so nothing is lost there.
The main consideration is operating system updates. When I want to update my worker nodes (the operating system version or any specific configuration setting), I will need to remember to apply the change to all nodes, including those that are currently shut down by the autoscaler.
Another question is "what is the minimum acceptable number of nodes?". This one is simple - I think it should be configurable at the cluster level, but also at a "per node type" level.
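In code, that check could be as simple as the sketch below (both minimums are hypothetical configuration values):

```go
package main

import "fmt"

// canScaleDown reports whether another node may be powered off, respecting
// both the cluster-wide minimum and the per-nodeType minimum.
func canScaleDown(runningInCluster, clusterMin, runningInGroup, nodeTypeMin int) bool {
	return runningInCluster > clusterMin && runningInGroup > nodeTypeMin
}

func main() {
	// e.g. 15 nodes total (minimum 5), 4 of this nodeType (minimum 2)
	fmt.Println(canScaleDown(15, 5, 4, 2)) // true
}
```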
What's next?
The project has been started: cluster-bare-autoscaler. I will work on the design over the next few days and will post about it later. In the meantime, I'll also write a post on the Wake-on-LAN tests I ran recently, as there were a few interesting findings (like network boundaries, sending multicasts from Golang code, and some others).
You may follow that GitHub project if you're interested, but also remember to follow this blog. It provides an RSS feed for all posts, as well as for specific tags: #scratchbuild-infra