A curious case of MariaDB on Longhorn/K8S
What is it about?
In my home (not "lab", but "prod" XD) datacenter (ok, a basement one) I have this HA setup of MariaDB (by Bitnami).
The topology is very simple (well, it's inside Kubernetes, so let's say "simple"):
- The primary node is hosted on top of a Longhorn volume - thanks to that there is no network communication between the MariaDB process and its storage, so no added latency. Also, the volume is mounted as a local filesystem, so any kind of file-locking operation is available. This is cool, because under the hood the Longhorn volume is distributed network storage, but from the application's perspective it's a local filesystem.
- The secondary node is hosted on top of QNAP/NFS - it's a bit slower, but it's fine. The storage is very resilient, and I can also afford to take snapshots and mysqldump backups there. (A sketch of the Helm values for this storage split is below.)
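For reference, with the Bitnami MariaDB chart (replication architecture) this split looks roughly like below. A minimal sketch only - the exact value keys depend on the chart version, and the storage class names and sizes are just my assumptions (longhorn for Longhorn, nfs-client for the QNAP/NFS provisioner):

# values.yaml for bitnami/mariadb (sketch)
architecture: replication

primary:
  persistence:
    enabled: true
    storageClass: longhorn      # distributed block storage that looks local to the pod
    size: 20Gi

secondary:
  replicaCount: 1
  persistence:
    enabled: true
    storageClass: nfs-client    # assumed name of the QNAP/NFS storage class
    size: 20Gi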
So what is the problem?
A couple of days ago I needed to perform a rolling restart of all K8S nodes. All was fine, but shortly afterwards I realised that MariaDB was constantly crashing due to filesystem errors (various ones). One of them:
[ERROR] [FATAL] InnoDB: fdatasync() returned 5
[ERROR] mysqld got signal 6
This could be because you hit a bug. It is also possible
that this binary or one of the libraries it was linked
against is corrupt, improperly built, or misconfigured.
This error can also be caused by malfunctioning hardware.
To report this bug, see https://mariadb.com/kb/en/reporting-bugs
We will try our best to scrape up some info that will
hopefully help diagnose the problem, but since we have
already crashed, something is definitely wrong and this
may fail.
And I couldn't find the root cause. Error code 5 returned by fdatasync() is EIO, i.e. an I/O error from the underlying storage, so hardware was the obvious suspect. I performed a hardware analysis, ran disk repairs (fsck) and verified in dmesg that there were no actual hardware problems. Still no luck.
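For the record, the checks were more or less like this (a sketch - /dev/sda and /dev/sda1 are placeholders for the actual devices, smartctl needs smartmontools installed, and the partition has to be unmounted before fsck):

# kernel-level I/O or filesystem errors
sudo dmesg -T | grep -iE 'i/o error|ext4|xfs'

# overall SMART health of the physical disk
sudo smartctl -H /dev/sda

# forced filesystem check on the (unmounted!) partition
sudo fsck -f /dev/sda1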
Closer to solution
I started thinking about Longhorn itself. It had been running perfectly in this setup for the last 2 months, but maybe something was off here.
So I looked into the logs of instance-manager (a Longhorn component hosted on each node that serves volumes to workloads) and found this, a few seconds before yet another MariaDB crash:
time="2022-01-14T12:45:51Z" level=debug msg="Stop monitoring instance manager instance-manager-e-ed9b8934" controller=longhorn-instance-manager instance manager=instance-manager-e-ed9b8934 node=node6
time="2022-01-14T12:45:51Z" level=error msg="error receiving next item in engine watch: rpc error: code = Canceled desc = grpc: the client connection is closing" controller=longhorn-instance-manager instance manager=instance-manager-e-ed9b8934 node=node6
time="2022-01-14T12:45:51Z" level=debug msg="removed the engine from imc.instanceManagerMonitorMap" controller=longhorn-instance-manager instance manager=instance-manager-e-ed9b8934 node=node6
time="2022-01-14T12:45:51Z" level=warning msg="Cannot find the instance manager for the running instance pvc-a66176a1-f874-4974-a53c-28e69578dc32-e-b41f41ca, will mark the instance as state ERROR"
time="2022-01-14T12:45:51Z" level=debug msg="Instance handler updated instance pvc-a66176a1-f874-4974-a53c-28e69578dc32-e-b41f41ca state, old state running, new state error"
time="2022-01-14T12:45:51Z" level=debug msg="Requeue volume due to error <nil> or Operation cannot be fulfilled on engines.longhorn.io \"pvc-a66176a1-f874-4974-a53c-28e69578dc32-e-b41f41ca\": the object has been modified; please apply your changes to the latest version and try again" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=node6 owner=node6 state=attached volume=pvc-a66176a1-f874-4974-a53c-28e69578dc32
time="2022-01-14T12:45:51Z" level=warning msg="Engine of volume dead unexpectedly, reattach the volume" accessMode=rwo controller=longhorn-volume frontend=blockdev migratable=false node=node6 owner=node6 state=attached volume=pvc-a66176a1-f874-4974-a53c-28e69578dc32
time="2022-01-14T12:45:51Z" level=info msg="Event(v1.ObjectReference{Kind:\"Volume\", Namespace:\"longhorn-system\", Name:\"pvc-a66176a1-f874-4974-a53c-28e69578dc32\", UID:\"d00b2f85-a6ad-494b-ac81-906b4b01ce31\", APIVersion:\"longhorn.io/v1beta1\", ResourceVersion:\"15833564\", FieldPath:\"\"}): type: 'Warning' reason: 'DetachedUnexpectly' Engine of volume pvc-a66176a1-f874-4974-a53c-28e69578dc32 dead unexpectedly, reattach the volume"
time="2022-01-14T12:45:51Z" level=debug msg="Stop monitoring engine" controller=longhorn-engine engine=pvc-a66176a1-f874-4974-a53c-28e69578dc32-e-b41f41ca node=node6
Bingo! Something odd was going on with Longhorn itself. I confirmed the problem: instance-manager pods were restarting every 60 minutes. A very weird problem.
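This kind of churn is easy to spot just from pod ages - instance-manager pods should normally live as long as the node itself, so ages resetting every ~60 minutes give it away. A quick sketch:

# pod ages resetting every hour = instance-managers being recreated
kubectl -n longhorn-system get pods -o wide | grep instance-manager

# recent events often show what triggered the restarts
kubectl -n longhorn-system get events --sort-by=.metadata.creationTimestamp | tail -n 30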
Even closer to solution
So the question now was: "why did this start happening shortly after the nodes were rebooted?"
I even reinstalled the whole of Longhorn along the way, just to rule out any manual changes I might have made in the meantime.
I scrolled through the git log of changes to my Longhorn deployment and quickly found out that 2 weeks earlier I had added priorityClasses to the Longhorn components - just to make sure no volume pods ever get evicted.
So I verified those priorityClasses and found out that, for some reason, the instance-manager pods had the default priorityClass. Wow.
Not sure why, as I had configured the priorityClasses correctly in my Helm charts and it was fine before the nodes were rebooted.
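As far as I understand Longhorn, the instance-manager pods don't take the per-component priorityClass at all - they follow the Longhorn setting priority-class, which the Helm chart exposes as defaultSettings.priorityClass. A sketch of the values I mean (the exact keys may differ between chart versions, and high-ppriority is my custom class from the next section):

# values.yaml for the longhorn chart (sketch)
defaultSettings:
  priorityClass: high-ppriority   # this is what instance-manager pods follow

longhornManager:
  priorityClass: high-ppriority
longhornDriver:
  priorityClass: high-ppriority
longhornUI:
  priorityClass: high-ppriority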
Solution?
Basically, make sure all Longhorn components have some high-priority class. I created a custom one (called high-ppriority :) ) with a very high value.
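The PriorityClass itself is a tiny resource - something like this (the value 1000000 is just my pick for "very high"; user-defined classes must stay at or below 1000000000):

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-ppriority
value: 1000000            # "very high", still below the 1000000000 cap for user classes
globalDefault: false
description: "Keeps Longhorn / storage components from being evicted before regular workloads."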
And verify it:
kubectl -n longhorn-system describe pods | grep 'Priority Class'
PS
I even found a bug report for that: https://github.com/longhorn/longhorn/issues/2820