Maintenance

Backup

On your provisioning machine

During the installation process, a cluster config directory is created on your provisioning machine, in the top-level sub-directory clusters of your clone of the stackspin git repository. These files are not essential for your Stackspin cluster to continue functioning, but we recommend backing up this folder because it gives you easy administrative access to your cluster.
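
As a minimal sketch, assuming your clone lives at ~/stackspin and your cluster is named stackspin.example.org (both hypothetical), you could archive the directory like this:

$ tar czf stackspin-cluster-config.tar.gz -C ~/stackspin clusters/stackspin.example.org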

On your cluster

Stackspin supports using the program Velero to make backups of your Stackspin instance to external storage via the S3 API. See Backups with Velero (Optional) in the installation instructions for setup details.

For the maintenance operations described below – in particular, restoring backups – you need the velero client program installed, typically on your provisioning machine, although you can also run it on the VPS if preferred. You can download it from Velero’s GitHub releases page.
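
For example, on a Linux machine the installation could look like this (the version number is only an illustration; pick the release that matches your Velero server installation):

$ wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.2/velero-v1.13.2-linux-amd64.tar.gz
$ tar xzf velero-v1.13.2-linux-amd64.tar.gz
$ sudo mv velero-v1.13.2-linux-amd64/velero /usr/local/bin/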

By default Velero will make nightly backups of the entire cluster (minus Prometheus data). To make a manual backup, run

cluster$ velero create backup BACKUP_NAME --exclude-namespaces velero --wait

from your VPS. See velero --help for other commands, and Velero’s documentation for more information.
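
For example, to list all backups and inspect the one you just made:

cluster$ velero backup get
cluster$ velero backup describe BACKUP_NAME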

Note: in case you want to make an (additional) backup of application data by other means, all persistent volume data of the cluster are stored in directories under /var/lib/Stackspin/local-storage.
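
As an illustration (the destination host and path are hypothetical), you could copy this data to another machine with rsync:

cluster$ rsync -a /var/lib/Stackspin/local-storage/ backup-host:/backups/stackspin-local-storage/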

Restore

Restoring from backups is a process that for now has to be done via the command line; we intend to make it possible from the Stackspin dashboard in the future.

These instructions explain how to restore the persistent data of an individual app (such as Nextcloud or Zulip) to a previous point in time, from a backup made with velero to S3-compatible storage, on a Stackspin cluster that is in a healthy state. It is also possible to use backups to recover from more severe problems, like a broken or completely destroyed Stackspin cluster, by reinstalling the cluster from scratch and restoring individual app data on top of that. However, that procedure is not as streamlined and is not documented here. If you find yourself in that situation, please reach out to us for advice or assistance.

Select backup

To show a list of available backups, perform the following command on your VPS:

$ kubectl get backup -A

Once you have chosen a backup to restore from, record its name as written in the kubectl output.

Note

Please be aware that for technical reasons the restore operation will restore not only the persistent data from this backup, but also the app’s software version that was running at that time. Although the auto-update mechanism should in turn update the app to a recent version, and the recent app version should be able to automatically perform any necessary data format migrations on the old data, this operation has not been tested for older backups, so please proceed carefully. As an example of what could go wrong, Nextcloud requires upgrades to be done in a serial fashion, never skipping a major version upgrade, so if your backup is from two or more major Nextcloud versions ago, some manual intervention is required. If you have any doubts, please reach out to us.

Restore app data

Warning

Please note that restoring data is a destructive operation! It will replace the app’s data as they are now, and there is no way to undo a restore operation unless you have a copy of the current app data, in the form of a recent Stackspin backup or an app-specific data export. For that reason, we recommend making another backup right before beginning a restore operation.

To restore the data of app $app (for restoring the dashboard, see the note at the end of this subsection) from the backup named $backup, perform the following commands:

$ flux suspend kustomization $app
$ flux suspend helmrelease -n stackspin-apps $app
$ kubectl delete all -n stackspin-apps -l stackspin.net/backupSet=$app
$ kubectl delete secret -n stackspin-apps -l stackspin.net/backupSet=$app
$ kubectl delete configmap -n stackspin-apps -l stackspin.net/backupSet=$app
$ kubectl delete pvc -n stackspin-apps -l stackspin.net/backupSet=$app
$ velero restore create arbitrary-name-of-restore-operation --from-backup=$backup -l stackspin.net/backupSet=$app

At this point, please wait for the restore operation to finish (see below), then run:

$ flux resume helmrelease -n stackspin-apps $app
$ flux resume kustomization $app

Note

Specifically for Nextcloud, the kubectl delete pvc ... command might hang because of a Kubernetes job that references the PVC. To solve that, look for such jobs using kubectl get job -n stackspin-apps and delete any finished ones using kubectl delete job .... That should let the kubectl delete pvc ... command finish; if you already terminated it, run it again.
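
For example (JOB_NAME stands for the name of a finished job found in the listing):

$ kubectl get job -n stackspin-apps
$ kubectl delete job -n stackspin-apps JOB_NAME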

The velero restore create ... command initiates the restore operation, but it doesn’t wait until the operation is complete. You may use the commands suggested in the terminal output to check on the status of the operation. Additionally, once the restore operation is finished, it may take some more time for the various app components to be fully started and for the app to be operational again.
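
For example, for the restore operation created above:

$ velero restore describe arbitrary-name-of-restore-operation
$ velero restore logs arbitrary-name-of-restore-operation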

Note

To restore the “dashboard” data, which contains among other things the set of Stackspin users, follow the instructions above, using dashboard as $app, except that the kustomization to suspend and resume is the single-sign-on one, and the helmrelease to suspend and resume is the single-sign-on-database one in the stackspin namespace.
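
A sketch of that variant, based on the note above (the steps in between follow the generic procedure with dashboard as $app):

$ flux suspend kustomization single-sign-on
$ flux suspend helmrelease -n stackspin single-sign-on-database
# ... generic delete and restore steps, with $app set to dashboard ...
$ flux resume helmrelease -n stackspin single-sign-on-database
$ flux resume kustomization single-sign-on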

Change the IP of your cluster

In case your cluster needs to migrate to another IP address, make sure to update the IP address in /etc/rancher/k3s/k3s.yaml and, if applicable, in your local kube config and in inventory.yml in the cluster directory clusters/stackspin.example.org.
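
As a hypothetical example, if the address changes from 203.0.113.10 to 203.0.113.20:

cluster$ sudo sed -i 's/203.0.113.10/203.0.113.20/g' /etc/rancher/k3s/k3s.yaml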

Delete evicted pods

In case your cluster disk is full, Kubernetes taints the node with DiskPressure and starts evicting pods, which is pointless in a single-node setup but can still happen. We have seen hundreds of pods in the Evicted state that still showed up after the DiskPressure condition had cleared. See also the Kubernetes documentation on out-of-resource handling with the kubelet.

You can delete all evicted pods with this command:

$ kubectl get pods --all-namespaces -o json \
    | jq -r '.items[] | select(.status.reason != null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' \
    | xargs -n2 bash -c 'kubectl delete pods $0 --namespace=$1'
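
Afterwards, you can verify that no evicted pods remain:

$ kubectl get pods --all-namespaces | grep Evicted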

Run Nextcloud occ commands

Nextcloud includes a CLI tool called occ (“OwnCloud Console”). This tool can be used for all kinds of tasks you might want to do as a system administrator.

To use the tool, you need to enter Nextcloud’s “pod” and change to the correct user. The following commands achieve that:

exec opens a root terminal inside the pod:

$ kubectl -n stackspin-apps exec deploy/nc-nextcloud -it -- bash

Change to the www-data user:

$ su -s /bin/bash www-data

Run occ:

$ php occ list
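
For example, to list all users, or to turn maintenance mode on and off:

$ php occ user:list
$ php occ maintenance:mode --on
$ php occ maintenance:mode --off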