Maintenance
Backup
On your provisioning machine
During the installation process, a cluster config directory is created
on your provisioning machine, located in the top-level sub-directory
clusters in your clone of the stackspin git repository. Although
these files are not essential for your Stackspin cluster to continue
functioning, you may want to back this folder up because it allows easy
access to your cluster.
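As a minimal sketch of such a backup, the following shell function archives the clusters directory into a timestamped tarball. The function name backup_cluster_config and both path arguments are hypothetical; adjust them to where your clone of the repository lives and where you keep backups. Because the directory gives access to your cluster, store the resulting archive somewhere safe.

```shell
# Sketch only: archive the clusters/ directory of a stackspin clone.
# Arguments (both are assumptions, not fixed Stackspin paths):
#   $1 - path to your clone of the stackspin repository
#   $2 - directory in which to store the archive
backup_cluster_config() {
    local repo_dir="$1" backup_dir="$2"
    local archive
    archive="$backup_dir/clusters-$(date +%Y%m%d-%H%M%S).tar.gz"
    mkdir -p "$backup_dir" || return 1
    # -C makes the paths inside the archive relative to the repository root.
    tar -czf "$archive" -C "$repo_dir" clusters || return 1
    # Print the archive path so callers can record or copy it.
    echo "$archive"
}
```

For example, assuming your clone lives in ~/stackspin, running backup_cluster_config ~/stackspin ~/backups would write an archive named clusters-TIMESTAMP.tar.gz into ~/backups.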
On your cluster
Stackspin supports using the program Velero to make backups of your Stackspin instance to external storage via the S3 API. See Backups with Velero (Optional) in the installation instructions for setup details.
For the maintenance operations described below – in particular, restoring
backups – you need the velero client program installed, typically on your
provisioning machine, although you can also run it on the VPS if preferred.
You can find it on Velero’s GitHub release page.
By default Velero will make nightly backups of the entire cluster (minus Prometheus data). To make a manual backup, run
cluster$ velero create backup BACKUP_NAME --exclude-namespaces velero --wait
from your VPS. See velero --help for other commands, and Velero’s
documentation for more information.
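To keep manual backups apart from the nightly ones, it can help to give them a timestamped name. The following is a minimal sketch that wraps the manual backup command; the function name make_manual_backup and the default prefix "manual" are hypothetical. After the backup finishes, it calls velero backup describe so that you can check the reported phase.

```shell
# Sketch only: create a manually triggered backup with a timestamped name.
#   $1 - optional name prefix (hypothetical default: "manual")
make_manual_backup() {
    local prefix="${1:-manual}"
    local name
    name="$prefix-$(date +%Y%m%d-%H%M%S)"
    # Same command as in the text; --wait blocks until Velero has finished.
    velero create backup "$name" --exclude-namespaces velero --wait || return 1
    # Show a summary of the new backup, including its phase.
    velero backup describe "$name"
}
```

Calling make_manual_backup pre-restore, for instance, would create a backup whose name starts with pre-restore- followed by the current date and time.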
Note: in case you want to make an (additional) backup of application
data via alternate means, all persistent volume data of the cluster are
stored in directories under /var/lib/Stackspin/local-storage.
Restore
Restoring from backups is a process that for now has to be done via the command line. We intend to allow doing this from the Stackspin dashboard instead in the future.
These instructions explain how to restore the persistent data of an individual app (such as Nextcloud or Zulip) to a previous point in time, from a backup to S3-compatible storage made using velero, on a Stackspin cluster that is in a healthy state. Using backups to recover from more severe problems, like a broken or completely destroyed Stackspin cluster, is also possible, by reinstalling the cluster from scratch and restoring individual app data on top of that. However, that procedure is less streamlined and is not documented here. If you are in that situation, please reach out to us for advice or assistance.
Select backup
To show a list of available backups, perform the following command on your VPS:
$ kubectl get backup -A
Once you have chosen a backup to restore from, record its name as written in
the kubectl output.
Note
Please be aware that for technical reasons the restore operation will restore not only the persistent data from this backup, but also the app’s software version that was running at that time. Although the auto-update mechanism should in turn update the app to a recent version, and the recent app version should be able to automatically perform any necessary data format migrations on the old data, this operation has not been tested for older backups, so please proceed carefully. As an example of what could go wrong, Nextcloud requires upgrades to be done in a serial fashion, never skipping a major version upgrade, so if your backup is from two or more major Nextcloud versions ago, some manual intervention is required. If you have any doubts, please reach out to us.
Restore app data
Warning
Please note that restoring data is a destructive operation! It will replace the app’s data as they are now. There is no way to undo a restore operation, unless you have a copy of the current app data, in the form of a current Stackspin backup or an app-specific data export. For that reason, we recommend making another backup right before beginning a restore operation.
To restore the data of app $app (for restoring the dashboard, see the note
at the end of this subsection) from the backup named $backup, perform the
following commands:
$ flux suspend kustomization $app
$ flux suspend helmrelease -n stackspin-apps $app
$ kubectl delete all -n stackspin-apps -l stackspin.net/backupSet=$app
$ kubectl delete secret -n stackspin-apps -l stackspin.net/backupSet=$app
$ kubectl delete configmap -n stackspin-apps -l stackspin.net/backupSet=$app
$ kubectl delete pvc -n stackspin-apps -l stackspin.net/backupSet=$app
$ velero restore create arbitrary-name-of-restore-operation --from-backup=$backup -l stackspin.net/backupSet=$app
At this point, wait for the restore operation to finish before continuing (see the text below).
$ flux resume helmrelease -n stackspin-apps $app
$ flux resume kustomization $app
Note
Specifically for Nextcloud, the kubectl delete pvc ... command might hang due
to a Kubernetes job that references that PVC. To solve that, look for such jobs
using kubectl get job -n stackspin-apps and delete any finished ones using
kubectl delete job ... . That should let the kubectl delete pvc ...
command finish; if it was already terminated, run it again.
The velero restore create ... command initiates the restore operation, but
it doesn’t wait until the operation is complete. You may use the commands
suggested in the terminal output to check on the status of the operation.
Additionally, once the restore operation is finished, it may take some more
time for the various app components to be fully started and for the app to be
operational again.
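If you prefer to script the waiting step, the following is a minimal sketch that polls the phase of the restore object until it reaches a final state. It assumes that Velero runs in the velero namespace and that the restore is visible as a restores.velero.io resource; the function name wait_for_restore and the POLL_INTERVAL variable are hypothetical.

```shell
# Sketch only: block until a Velero restore reaches a final phase.
#   $1 - name of the restore operation
#   $2 - namespace Velero runs in (assumed default: "velero")
wait_for_restore() {
    local name="$1" namespace="${2:-velero}" phase
    while true; do
        phase=$(kubectl get restores.velero.io -n "$namespace" "$name" \
            -o jsonpath='{.status.phase}')
        case "$phase" in
            Completed)
                echo "restore $name completed"
                return 0 ;;
            PartiallyFailed|Failed|FailedValidation)
                echo "restore $name ended in phase $phase" >&2
                return 1 ;;
            *)
                # Any other phase (or none yet) means: still in progress.
                echo "restore $name is in phase '${phase:-unknown}', waiting"
                sleep "${POLL_INTERVAL:-10}" ;;
        esac
    done
}
```

With such a helper, the flux resume commands could be run only after wait_for_restore arbitrary-name-of-restore-operation returns successfully. A completed restore still does not guarantee that the app itself is already up.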
Note
To restore the “dashboard” data, which contains among other things the set
of Stackspin users, follow the instructions above, using dashboard as
$app, except that the kustomization to suspend and resume is the
single-sign-on one, and the helmrelease to suspend and resume is the
single-sign-on-database one in the stackspin namespace.
Change the IP of your cluster
In case your cluster needs to migrate to another IP, make sure to update
the IP address in /etc/rancher/k3s/k3s.yaml and, if applicable, your
local kube config and inventory.yml in the cluster directory
clusters/stackspin.example.org.
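As a rough sketch, the address can be replaced in each of these files with sed. The helper below assumes GNU sed (for -i and \b); the function name replace_ip is hypothetical, and it keeps a .bak copy of the original file so you can compare the result.

```shell
# Sketch only: replace an old IP address with a new one in a single file.
#   $1 - old IP address, $2 - new IP address, $3 - file to edit
replace_ip() {
    local old_ip="$1" new_ip="$2" file="$3"
    # Escape the dots so that the old address is matched literally.
    local pattern="${old_ip//./\\.}"
    # Keep a copy of the original file next to it.
    cp -p "$file" "$file.bak" || return 1
    # \b prevents matching a longer address that starts with the old one.
    sed -i "s/\b${pattern}\b/${new_ip}/g" "$file"
}
```

For example, with the placeholder addresses 203.0.113.1 and 198.51.100.7, you would run replace_ip 203.0.113.1 198.51.100.7 /etc/rancher/k3s/k3s.yaml as root on the VPS, and repeat the call for your local kube config and inventory.yml where applicable.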
Delete evicted pods
In case your cluster disk is full, Kubernetes taints the node with
DiskPressure. Then it tries to evict pods, which is pointless in a single
node setup but can still happen. We have experienced hundreds of pods in
evicted state that still showed up after DiskPressure had recovered. See
also the out of resource handling with kubelet documentation.
You can delete all evicted pods with this command:
$ kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -L1 bash -c 'kubectl delete pods $0 --namespace=$1'
Run Nextcloud occ commands
Nextcloud includes a CLI tool called occ (“OwnCloud Console”).
This tool can be used for all kinds of tasks
you might want to do as a system administrator.
To use the tool, you need to enter Nextcloud’s “pod” and change to the correct user. The following commands achieve that:
exec opens a root terminal inside the pod:
$ kubectl -n stackspin-apps exec deploy/nc-nextcloud -it -- bash
Change to the www-data user:
$ su -s /bin/bash www-data
Run occ:
$ php occ list
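For a single command, the three steps can also be combined so that no interactive shell is needed. The following is a minimal sketch under the same assumptions as above (deployment nc-nextcloud in the stackspin-apps namespace); the function name nc_occ is hypothetical.

```shell
# Sketch only: run a single occ command without an interactive shell.
# Assumes the deployment name and namespace used in the steps above.
nc_occ() {
    # "$*" joins all arguments with spaces, so arguments that themselves
    # contain spaces or quotes are not preserved.
    kubectl -n stackspin-apps exec deploy/nc-nextcloud -- \
        su -s /bin/bash www-data -c "php occ $*"
}
```

For example, nc_occ list would print the same command overview as the interactive session. For arguments that need careful quoting, the interactive steps remain the safer route.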