Intro
This post covers the first part of the CBA implementation. The initial release (a working POC) on GitHub was 0.3.3, but there were plenty of decisions and implementations before that. This post and the next few will describe the lengthy road to version 0.3.3.
The goal of the first step was to create the smallest useful control loop that could look at a Kubernetes cluster, choose a safe candidate node for shutdown, verify that the scale-down operation is safe, possibly perform it, and then enter a cooldown period so the cluster has time to stabilize. Afterwards, the whole flow repeats forever. Well - that was the idea 🙂
The controller pattern
In the previous posts, I described why CBA exists and why a bare-metal Kubernetes cluster needs a different shape of autoscaling than a cloud environment (in other words - why I did not use the upstream Cluster Autoscaler project).
The important difference is that CBA does not create and terminate cloud instances. It works with a fixed pool of physical machines. Some of them are active Kubernetes nodes. Some of them may be powered off. The autoscaler has to move between those states carefully.
So the first useful question was:
Can I safely remove one existing node from service and power it off?
That question feels small, but it is not simple. A Kubernetes node is a scheduling target plus a place where pods are running and a source of capacity. Powering it off without respecting Kubernetes first would be asking for trouble.
That is why the first real implementation of CBA became a controller, not a full autoscaler. More precisely, it was a controller loop that observed the cluster, chose one candidate for shutdown, and decided whether it was allowed to take this kind of disruptive action.
In Kubernetes projects, a controller is both a design pattern and a running component.
As a pattern, a controller continuously observes the current state of the system, compares it with the desired state, and performs small corrective actions. This is how Kubernetes itself works: a Deployment controller notices that a Deployment wants three replicas but only two Pods exist, so it creates another Pod. So the controller pattern is like:
- observe current state
- compare it with desired state
- take corrective action
- repeat
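The four steps above can be sketched in a few lines of Go. This is only an illustration of the pattern, not code from CBA or from Kubernetes: the cluster type, the reconcileOnce function, and the replica numbers are made up for the example.

```go
package main

import "fmt"

// cluster is a hypothetical stand-in for the observed system.
type cluster struct {
	desiredReplicas int
	runningPods     int
}

// reconcileOnce performs one observe/compare/act pass and reports
// whether it changed anything. It takes one small step per call.
func reconcileOnce(c *cluster) bool {
	switch {
	case c.runningPods < c.desiredReplicas:
		c.runningPods++ // corrective action: create one pod
		return true
	case c.runningPods > c.desiredReplicas:
		c.runningPods-- // corrective action: remove one pod
		return true
	}
	return false // current state already matches the desired state
}

func main() {
	c := &cluster{desiredReplicas: 3, runningPods: 1}
	// Repeat until there is nothing left to correct.
	for pass := 1; reconcileOnce(c); pass++ {
		fmt.Printf("pass %d: running=%d desired=%d\n", pass, c.runningPods, c.desiredReplicas)
	}
}
```

Each pass changes at most one thing, so a wrong decision stays small and the next pass gets a chance to observe its effect.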
CBA’s first version followed the same idea, but in a simpler polling-based form. It did not need informers, watches, or work queues yet. It loaded config, connected to Kubernetes, listed nodes, filtered unsafe candidates, selected one node, evaluated scale-down, optionally acted, and then waited for the next loop.
If you want to learn more about the controller pattern in the k8s world, there are two books I found helpful when I was learning about it:
- Programming Kubernetes (by Michael Hausenblas, Stefan Schimanski, ISBN: 9781492047094, 2019)
- Kubernetes Patterns (by Bilgin Ibryam, Roland Huß, ISBN: 9781492050278, 2019)
The reconciler pattern
A reconciler is the part of the controller that performs one decision pass. It reads the current state, compares it with the desired or acceptable state, decides whether anything should change, performs a small action if needed, and returns. The controller then runs the reconciler again later.
In a typical Kubernetes controller, this may be driven by watches, informers, and work queues. For the first version of CBA, I did not need that machinery yet. A polling loop was enough. The main loop was simple:
for {
	if err := r.Reconcile(ctx); err != nil {
		slog.Error("reconcile error", "err", err)
	}
	time.Sleep(cfg.PollInterval)
}
This first version of the reconciler looked like:
func (r *Reconciler) Reconcile(ctx context.Context) error {
	slog.Info("Running reconcile loop")
	nodes, err := r.client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	eligibleNodes := []v1.Node{}
	for _, node := range nodes.Items {
		skip := false
		for key, val := range r.cfg.IgnoreLabels {
			if nodeVal, ok := node.Labels[key]; ok && nodeVal == val {
				slog.Info("Skipping node due to ignoreLabels", "node", node.Name, "label", key)
				skip = true
				break
			}
		}
		if !skip {
			eligibleNodes = append(eligibleNodes, node)
		}
	}
	slog.Info("Filtered nodes", "eligible", len(eligibleNodes), "total", len(nodes.Items))
	// TODO: Evaluate if any nodes can be safely shut down
	return nil
}
As you can see, the above code was not production-ready yet - it was just a first attempt at reconciliation:
1. observe the current state (list nodes)
2. define the desired state by building the list of eligible nodes and picking a candidate for shutdown
3. now, the idea was to shut down that candidate (the "small corrective action" of the reconciler)
4. wait
5. observe again
In this shape, it is obviously a very naive approach. This kind of loop can become dangerous if it acts too often or tries to fix too much at once. If the decision logic is wrong, repeating it every few seconds can amplify the mistake.
That's why it ran with --dry-run from the very beginning, doing almost nothing - just writing log output, so I could read it and understand the behavior. Once I was confident that this basic loop worked as intended, I could continue.
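A dry-run switch like this can be a very small wrapper around the disruptive call. The sketch below is an assumption about the general shape, not CBA's actual code: actuator, dryRun, and powerOff are hypothetical names.

```go
package main

import (
	"fmt"
	"log/slog"
)

// actuator is a hypothetical wrapper around a disruptive operation.
type actuator struct {
	dryRun   bool
	powerOff func(node string) error // the real action, injected by the caller
}

// Shutdown logs the intent and only calls the real action
// when dry-run mode is disabled.
func (a *actuator) Shutdown(node string) error {
	if a.dryRun {
		slog.Info("dry-run: would shut down node", "node", node)
		return nil
	}
	return a.powerOff(node)
}

func main() {
	calls := 0
	a := &actuator{
		dryRun:   true,
		powerOff: func(string) error { calls++; return nil },
	}
	_ = a.Shutdown("node-1")
	fmt.Println("real shutdown calls:", calls)
}
```

Keeping the check in one place means the rest of the loop runs exactly the same code path in both modes, which is what makes the dry-run logs trustworthy.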
It is also worth noting that at this point I was mainly running this from my laptop. There was no need to deploy it into the cluster yet.
Start simple, make sure things work the way you expect, and move forward.
The reconciler struct
Conceptually, it had this kind of shape:
type Reconciler struct {
	cfg             *config.Config
	clientset       kubernetes.Interface
	powerController power.Controller
	strategy        strategy.ScaleDownStrategy
	state           *NodeState
}
The exact code evolved, but the responsibility split was already visible:
- config tells the controller what it is allowed to manage
- the Kubernetes client tells it what exists right now
- the power controller hides the physical shutdown mechanism
- the scale-down strategy decides whether a candidate may be removed
- state (node state) prevents repeated destructive actions
At the very beginning, the list of nodes was... hardcoded in the configuration, not autodetected! It was the easiest way to ensure that CBA operated only on nodes I was certain were safe to touch. This was quickly changed into dynamic node autodiscovery, but I think it is important to underline how it all started.
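To make the hardcoded list concrete, here is a guess at what such a config could look like. PollInterval, Cooldown, and IgnoreLabels are names that appear in the snippets in this post. ManagedNodes, isManaged, and all the values are hypothetical and only serve the illustration.

```go
package main

import (
	"fmt"
	"time"
)

// Config is an assumed early config shape, not the real CBA type.
type Config struct {
	PollInterval time.Duration
	Cooldown     time.Duration
	IgnoreLabels map[string]string
	ManagedNodes []string // hypothetical: the hardcoded list of nodes CBA may touch
}

// isManaged reports whether the named node is on the hardcoded allow list.
func (c *Config) isManaged(name string) bool {
	for _, n := range c.ManagedNodes {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	cfg := &Config{
		PollInterval: 30 * time.Second, // example value
		Cooldown:     10 * time.Minute, // example value
		ManagedNodes: []string{"worker-1", "worker-2"},
	}
	fmt.Println(cfg.isManaged("worker-1"), cfg.isManaged("home-assistant"))
}
```

An allow list fails safe: a node that is missing from the config is simply never considered.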
The power controller in this struct was my answer to the idea of a pluggable mechanism that provides various ways to power a node on and off. In my case it was (and still is) WOL (Wake-on-LAN, which I described in CBA-00: Wake on Lan). But I wanted to leave the door open for any other power-on/off method.
The same goes for the scale-down strategy (and later also scale-up). In the very beginning, my thinking was simple: "just fetch the load average from each node, possibly calculate the 95th percentile across the whole cluster, and decide based on those numbers".
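For reference, the "95th percentile across the cluster" part of that idea is only a few lines. The sketch below uses the nearest-rank method, which is one of several percentile definitions and is my choice for the example, not necessarily what CBA ended up using. The load values are invented.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0 < p <= 100) of values using
// the nearest-rank method. It returns false for an empty input.
func percentile(values []float64, p float64) (float64, bool) {
	if len(values) == 0 {
		return 0, false
	}
	sorted := append([]float64(nil), values...) // copy, so the caller's slice is not reordered
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1], true
}

func main() {
	// Hypothetical 1-minute load averages, one value per node.
	loads := []float64{0.4, 3.0, 0.2, 1.0, 0.5}
	if p95, ok := percentile(loads, 95); ok {
		fmt.Println("cluster p95 load:", p95)
	}
}
```

With only a handful of nodes, a high percentile is effectively the busiest node, which is worth keeping in mind before basing decisions on it.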
The last part was the state (node state). Remembering it somehow (keep in mind that I wanted to keep CBA stateless) was important, so that the same box would not be shut down and powered on over and over again. I'll describe this later; for now, I'll just say that the cooldown period was my answer to this problem (a time window during which no scale-up/scale-down operations were run).
At this stage, a simplified version of the flow looked like this:
func (r *Reconciler) Reconcile(ctx context.Context) error {
	nodes, err := r.clientset.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	candidate := r.pickCandidate(nodes.Items)
	if candidate == nil {
		return nil
	}
	allowed, err := r.strategy.ShouldScaleDown(ctx, candidate)
	if err != nil || !allowed {
		return err
	}
	return r.scaleDown(ctx, candidate)
}
So:
- Get the current nodes.
- Pick one possible node.
- Ask whether it can be scaled down.
- If yes, perform the scale-down path.
Calculations and cluster safety
One of the most important early design choices was this: before CBA tried to calculate whether a node was underused (e.g. had the lowest load average across the whole cluster, and so might be a candidate for shutdown), it first had to decide whether the node was even allowed to be touched.
The first safety layer was eligibility filtering. Some nodes should not reach the scale-down strategy at all. For example:
- control-plane nodes should be ignored
- explicitly excluded nodes should be ignored (for instance my Home Assistant node, which has the USB dongle responsible for the Zigbee network, must not be shut down)
- already cordoned nodes should usually be ignored (cordoning happens for a specific reason)
- nodes in cooldown should be ignored (to prevent powering on and off the same node again in a short time window)
As you can see, node eligibility was the first safety layer. Strategy came after that.
That principle stayed important later as CBA grew:
Do not start by making the calculations clever. Start by shrinking the set of things the controller is allowed to damage.
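The four eligibility rules listed earlier in this section can be sketched as a single filter. To keep the example self-contained, node is a simplified stand-in for the Kubernetes Node object, inCooldown is a hypothetical hook into the controller's state, and the example.com/no-scale-down label is invented. The control-plane label is the one commonly set on such nodes (for example by kubeadm).

```go
package main

import "fmt"

// node is a simplified stand-in for the Kubernetes Node object.
type node struct {
	Name          string
	Labels        map[string]string
	Unschedulable bool // true when the node is cordoned
}

const controlPlaneLabel = "node-role.kubernetes.io/control-plane"

// hasIgnoredLabel reports whether any key/value pair in ignore matches a node label.
func hasIgnoredLabel(n node, ignore map[string]string) bool {
	for k, v := range ignore {
		if got, ok := n.Labels[k]; ok && got == v {
			return true
		}
	}
	return false
}

// eligible returns only the nodes that may be handed to the scale-down strategy.
func eligible(nodes []node, ignore map[string]string, inCooldown func(name string) bool) []node {
	var out []node
	for _, n := range nodes {
		if _, ok := n.Labels[controlPlaneLabel]; ok {
			continue // control-plane node
		}
		if hasIgnoredLabel(n, ignore) {
			continue // explicitly excluded
		}
		if n.Unschedulable {
			continue // already cordoned
		}
		if inCooldown(n.Name) {
			continue // recently scaled down
		}
		out = append(out, n)
	}
	return out
}

func main() {
	nodes := []node{
		{Name: "cp-1", Labels: map[string]string{controlPlaneLabel: ""}},
		{Name: "home-assistant", Labels: map[string]string{"example.com/no-scale-down": "true"}},
		{Name: "worker-1", Unschedulable: true},
		{Name: "worker-2"},
	}
	ignore := map[string]string{"example.com/no-scale-down": "true"}
	for _, n := range eligible(nodes, ignore, func(string) bool { return false }) {
		fmt.Println("eligible:", n.Name)
	}
}
```

Every check here can only remove a node from the list, never add one, so the filter can only shrink what the strategy gets to see.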
Choosing scale-down candidate
After filtering, the first reconciler needed to choose a candidate for a scale-down operation. At this stage, the goal was not to find the globally optimal node or to solve placement, topology, maintenance rotation, group balancing, or any other advanced scheduling problem. The first version needed a controlled path from "there are eligible nodes" to "let us evaluate one of them". In other words:
- list eligible nodes
- pick one candidate
- evaluate only that candidate
- take at most one disruptive action per loop
In the very beginning, the selection of a candidate was very simple - just take the last element from the list of eligible nodes:
// Pick the last eligible node as candidate for now (simplest approach)
candidate := eligibleNodes[len(eligibleNodes)-1]
slog.Info("Candidate for scale-down", "node", candidate.Name)
This was a very naive approach, but also a very good step for that early POC stage - not making things too complex from the very beginning. Baby steps, and learn from there.
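One detail the one-liner leaves out is the empty list: indexing with len-1 panics when no node is eligible. A small helper in the spirit of the pickCandidate call from the simplified flow earlier could guard against that. The body below is my assumption, and node is a stand-in type, not the real Kubernetes object.

```go
package main

import "fmt"

// node is a stand-in for the Kubernetes Node object.
type node struct{ Name string }

// pickCandidate returns the last eligible node, or nil when there is
// nothing to evaluate. Without the length check, indexing would panic.
func pickCandidate(eligible []node) *node {
	if len(eligible) == 0 {
		return nil
	}
	return &eligible[len(eligible)-1]
}

func main() {
	if c := pickCandidate([]node{{Name: "worker-1"}, {Name: "worker-2"}}); c != nil {
		fmt.Println("candidate:", c.Name)
	}
	fmt.Println("empty list gives nil:", pickCandidate(nil) == nil)
}
```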
There is, however, one practical problem with deterministic selection that I need to mention here. If the controller always picks nodes in the same order, it may keep considering the same machine over and over again. To handle this, I introduced a per-node cooldown period: after a node was scaled down, it could not be a scale-down candidate for the next 24h.
Obviously, this approach did not make much sense on its own, since it could pick the node with the highest current utilization, which would hurt overall cluster performance. But hey, remember - we were running in dry-run mode at this point, slowly moving forward and evolving.
Why scale-down came first
Deciding whether CBA should prefer scale-down over scale-up (or the other way around) was a joyful process! In the end, I decided to prioritize scaling down (remember - one of CBA's main goals was power saving), but of course only when such a scale-down would not hurt the overall performance of the cluster.
Later, I planned to implement the scale-up logic for the case when the cluster was overutilized. In the meantime, I could always power on a shut-down node manually via WOL.
Cordon and drain at a high level
The first real scale-down path had to respect Kubernetes before sending the power-off command to the node. The high-level order was:
- mark the node unschedulable
- list pods running on that node
- skip pods that should not be evicted
- evict regular workload pods
- call the power controller
The key design decision is the order. CBA should never treat poweroff as the first Kubernetes action. A physical shutdown is the last step in a sequence that starts inside the cluster.
First, stop new pods from being scheduled there (that's the cordon operation). Then move existing workloads away (that's the drain operation). Only after that does it make sense to cross the boundary from Kubernetes control into machine power control (in other words: once the k8s part is done, shut down the box).
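The ordering can be made explicit in code. The sketch below is not the cordonAndDrain() method from CBA. It uses hypothetical clusterOps and powerOps interfaces plus a recording fake so it runs without a cluster, and it assumes that DaemonSet-managed pods are the ones to skip. Returning on the first error is also a choice made for this sketch.

```go
package main

import (
	"context"
	"fmt"
)

// pod is a simplified stand-in for a Kubernetes Pod.
type pod struct {
	Name      string
	DaemonSet bool // assumption: DaemonSet-managed pods are skipped, not evicted
}

// clusterOps and powerOps are hypothetical interfaces. Real code would
// back them with client-go calls and a power controller.
type clusterOps interface {
	Cordon(ctx context.Context, nodeName string) error
	PodsOn(ctx context.Context, nodeName string) ([]pod, error)
	Evict(ctx context.Context, p pod) error
}

type powerOps interface {
	Shutdown(ctx context.Context, nodeName string) error
}

// scaleDown runs the steps in order and returns on the first error,
// so Shutdown is only reached after cordon and eviction succeeded.
func scaleDown(ctx context.Context, k clusterOps, pw powerOps, nodeName string) error {
	if err := k.Cordon(ctx, nodeName); err != nil {
		return fmt.Errorf("cordon %s: %w", nodeName, err)
	}
	pods, err := k.PodsOn(ctx, nodeName)
	if err != nil {
		return fmt.Errorf("list pods on %s: %w", nodeName, err)
	}
	for _, p := range pods {
		if p.DaemonSet {
			continue // skip pods that should not be evicted
		}
		if err := k.Evict(ctx, p); err != nil {
			return fmt.Errorf("evict %s: %w", p.Name, err)
		}
	}
	return pw.Shutdown(ctx, nodeName)
}

// recorder is a fake that implements both interfaces and records call order.
type recorder struct {
	steps     []string
	failEvict bool
}

func (r *recorder) Cordon(context.Context, string) error {
	r.steps = append(r.steps, "cordon")
	return nil
}

func (r *recorder) PodsOn(context.Context, string) ([]pod, error) {
	r.steps = append(r.steps, "list")
	return []pod{{Name: "web"}, {Name: "node-exporter", DaemonSet: true}}, nil
}

func (r *recorder) Evict(_ context.Context, p pod) error {
	if r.failEvict {
		return fmt.Errorf("eviction of %s blocked", p.Name)
	}
	r.steps = append(r.steps, "evict "+p.Name)
	return nil
}

func (r *recorder) Shutdown(context.Context, string) error {
	r.steps = append(r.steps, "poweroff")
	return nil
}

func main() {
	r := &recorder{}
	err := scaleDown(context.Background(), r, r, "worker-2")
	fmt.Println(r.steps, err)
}
```

Because every step returns early on failure, a blocked eviction leaves the node cordoned but still powered on, which is the safer side to fail on.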
For the details, you may find the old shape of the cordonAndDrain() method interesting.
The power controller
The first reconciler needed a way to say:
Kubernetes is done with its part. Now power off this machine.
But the reconciler should not care how that happens. Maybe the first implementation only logs the action. Maybe a later one calls some HTTP endpoint. Maybe another one uses IPMI, SSH, Wake-on-LAN related infrastructure, or a host-local shutdown helper. Those are important details, but they should not be mixed into the core control loop.
That is why the first power controller abstraction was useful. Conceptually, it was as small as this:
type PowerController interface {
	Shutdown(ctx context.Context, nodeName string) error
}
Small interfaces are often the best early interfaces. This also made the early implementation easier to test. The reconciler could call a power controller without needing real hardware integration from day one.
And what did the first Shutdown method actually do?
func (l *LogPowerController) Shutdown(ctx context.Context, nodeName string) error {
	slog.Info("PowerController: simulated shutdown", "node", nodeName)
	return nil
}
Yes, it just logs that the simulated shutdown ran correctly. Remember - this was a POC.
Cooldowns and local state
A loop that drains and shuts down nodes needs a brake. Without a brake, a bad condition could repeat every poll interval. If the strategy was too permissive, or if the cluster looked empty enough for several loops, the controller could keep selecting nodes and trying to remove them one after another.
So the first version introduced local state and a global cooldown. The idea was simple:
- after one scale-down attempt, remember that it happened
- do not immediately try another disruptive action
- give Kubernetes time to react
- give the operator time to observe
- come back later and re-evaluate the cluster
The cooldown period was the solution for this. Basically, after a node was shut down, the whole reconciler entered a cooldown state, during which no other operations (scale-down or scale-up) were permitted:
if r.state.IsGlobalCooldownActive(time.Now(), r.cfg.Cooldown) {
slog.Info("Global cooldown active — skipping scale-down")
return nil
}
There was also a secondary cooldown, tied to a single node. If a node had just been scaled down, it could not be picked as the next scale-down candidate (to prevent shutting down and powering on the same node over and over again):
if r.state.IsInCooldown(node.Name, time.Now(), r.cfg.Cooldown) {
	slog.Info("Skipping node due to cooldown", "node", node.Name)
	continue
}
And here is the related structure that held the state information:
type NodeStateTracker struct {
	mu               sync.Mutex
	recentlyShutdown map[string]time.Time
	poweredOff       map[string]bool
	lastShutdownTime time.Time
}
So as you can see, this is an in-memory structure that is not persisted to any disk or database. Restarting CBA would mean losing this state information. That was fine for the POC stage.
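The two cooldown checks used above could be implemented on top of this struct roughly as follows. The method names IsInCooldown and IsGlobalCooldownActive and their arguments come from the call sites in this post. The method bodies, NewNodeStateTracker, and MarkShutdown are my assumptions about how the pieces might fit together.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// NodeStateTracker mirrors the in-memory struct shown in the post.
type NodeStateTracker struct {
	mu               sync.Mutex
	recentlyShutdown map[string]time.Time
	poweredOff       map[string]bool
	lastShutdownTime time.Time
}

// NewNodeStateTracker is a hypothetical constructor that initializes the maps.
func NewNodeStateTracker() *NodeStateTracker {
	return &NodeStateTracker{
		recentlyShutdown: make(map[string]time.Time),
		poweredOff:       make(map[string]bool),
	}
}

// MarkShutdown (hypothetical) records that a node was scaled down at time now.
func (s *NodeStateTracker) MarkShutdown(name string, now time.Time) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.recentlyShutdown[name] = now
	s.poweredOff[name] = true
	s.lastShutdownTime = now
}

// IsInCooldown reports whether this node was shut down less than cooldown ago.
func (s *NodeStateTracker) IsInCooldown(name string, now time.Time, cooldown time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	t, ok := s.recentlyShutdown[name]
	return ok && now.Sub(t) < cooldown
}

// IsGlobalCooldownActive reports whether any node was shut down less than cooldown ago.
func (s *NodeStateTracker) IsGlobalCooldownActive(now time.Time, cooldown time.Duration) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return !s.lastShutdownTime.IsZero() && now.Sub(s.lastShutdownTime) < cooldown
}

func main() {
	s := NewNodeStateTracker()
	start := time.Now()
	cooldown := 10 * time.Minute // example value
	s.MarkShutdown("worker-2", start)
	fmt.Println("global cooldown after 1m:", s.IsGlobalCooldownActive(start.Add(time.Minute), cooldown))
	fmt.Println("node cooldown after 11m:", s.IsInCooldown("worker-2", start.Add(11*time.Minute), cooldown))
}
```

Passing the current time in as an argument, as the call sites in the post do, keeps these checks easy to test with fixed timestamps.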
That's it for this part
To summarize, this early version provided only some basic components and actions:
- run as a normal Go process
- load CBA config
- create a Kubernetes client
- list nodes
- filter unsafe candidates
- choose one node
- run a scale-down decision
- simulate the operation in dry-run mode
- perform the high-level cordon and drain flow
- call a power controller abstraction
- back off through a global cooldown
Next time, we'll introduce the first scale-down strategy, ResourceAwareScaleDown, discuss metrics emission and calculation, and shuffle scale-down candidates to prevent picking the same node over and over again.