Maintenance
===========

.. _backup:

Backup
------

On your provisioning machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During the installation process, a cluster config directory is created on
your provisioning machine, located in the top-level sub-directory
``clusters`` in your clone of the Stackspin git repository. Although these
files are not essential for your Stackspin cluster to continue functioning,
you may want to back this folder up because it allows easy access to your
cluster.

On your cluster
~~~~~~~~~~~~~~~

Stackspin supports using the program Velero to make backups of your
Stackspin instance to external storage via the S3 API. See
:ref:`backups-with-velero` in the installation instructions for setup
details.

For the maintenance operations described below -- in particular, restoring
backups -- you need the ``velero`` client program installed, typically on
your provisioning machine, although you can also run it on the VPS if
preferred. You can find it on `Velero's GitHub release page`_.

By default, Velero makes nightly backups of the entire cluster (minus
Prometheus data). To make a manual backup, run

.. code:: console

   cluster$ velero create backup BACKUP_NAME --exclude-namespaces velero --wait

from your VPS. See ``velero --help`` for other commands, and
`Velero's documentation`_ for more information.

.. note::

   In case you want to make an (additional) backup of application data via
   alternate means: all persistent volume data of the cluster are stored in
   directories under ``/var/lib/Stackspin/local-storage``.

Restore
-------

Restoring from backups currently has to be done via the command line; we
intend to make this possible from the Stackspin dashboard in the near
future.

These instructions explain how to restore the persistent data of an
individual app (such as Nextcloud, or Zulip) to a previous point in time,
from a backup to S3-compatible storage made using Velero, on a Stackspin
cluster that is in a healthy state.

Using backups to recover from more severe problems, like a broken or
completely destroyed Stackspin cluster, is also possible, by reinstalling
the cluster from scratch and restoring individual app data on top of that.
However, that procedure is not as streamlined and is not documented here.
If you are in that situation, please `reach out to us`_ for advice or
assistance.

Select backup
~~~~~~~~~~~~~

To show a list of available backups, run the following command on your VPS:

.. code:: console

   $ kubectl get backup -A

Once you have chosen a backup to restore from, record its name as written
in the ``kubectl`` output.

.. note::

   Please be aware that for technical reasons the restore operation will
   restore not only the persistent data from this backup, but also the
   app's software version that was running at that time. Although the
   auto-update mechanism should in turn update the app to a recent version,
   and the recent app version should be able to automatically perform any
   necessary data format migrations on the old data, this operation has not
   been well tested for older backups, so please proceed carefully. As an
   example of what could go wrong: Nextcloud requires upgrades to be done
   serially, never skipping a major version upgrade, so if your backup is
   from two or more major Nextcloud versions ago, some manual intervention
   is required. If you have any doubts, please `reach out to us`_.
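If you want to double-check what a backup contains before restoring from
it, the ``velero`` client can describe it. A quick sketch using standard
``velero`` subcommands, where ``BACKUP_NAME`` is a placeholder for the name
you recorded above:

.. code:: console

   $ velero backup get
   $ velero backup describe BACKUP_NAME --details

The ``--details`` flag lists the individual resources and volumes included
in the backup.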
Restore app data
~~~~~~~~~~~~~~~~

.. warning::

   Please note that restoring data is a destructive operation! It will
   replace the app's data as they are now. There is no way to undo a
   restore operation, unless you have a copy of the current app data, in
   the form of a current Stackspin backup or an app-specific data export.
   For that reason, we recommend making another backup right before
   beginning a restore operation.

To restore the data of app ``$app`` from the backup named ``$backup``, run
the following commands:

.. code:: console

   $ flux suspend kustomization $app
   $ flux suspend helmrelease -n stackspin-apps $app
   $ kubectl delete all -n stackspin-apps -l stackspin.net/backupSet=$app
   $ kubectl delete pvc -n stackspin-apps -l stackspin.net/backupSet=$app
   $ velero restore create arbitrary-name-of-restore-operation --from-backup=$backup -l stackspin.net/backupSet=$app

At this point, wait for the restore operation to finish; see the text
below. Then resume the flux objects:

.. code:: console

   $ flux resume helmrelease -n stackspin-apps $app
   $ flux resume kustomization $app

.. note::

   Specifically for Nextcloud, the ``kubectl delete pvc ...`` command might
   hang due to a Kubernetes job that references that PVC. To solve that,
   look for such jobs using ``kubectl get job -n stackspin-apps`` and
   delete any finished ones using ``kubectl delete job ...``. That should
   let the ``kubectl delete pvc ...`` command finish; if you already
   interrupted it, run it again.

The ``velero restore create ...`` command initiates the restore operation,
but it doesn't wait until the operation is complete. You may use the
commands suggested in the terminal output to check on the status of the
operation. Additionally, once the restore operation is finished, it may
take some more time for the various app components to be fully started and
for the app to be operational again.
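As a sketch of such a status check, using standard ``velero`` subcommands
and the restore name chosen in the ``velero restore create`` command above:

.. code:: console

   $ velero restore get
   $ velero restore describe arbitrary-name-of-restore-operation
   $ velero restore logs arbitrary-name-of-restore-operation

The restore operation has finished once the ``describe`` output reports the
phase ``Completed``.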
Change the IP of your cluster
-----------------------------

In case your cluster needs to migrate to another IP address, make sure to
update the IP address in ``/etc/rancher/k3s/k3s.yaml`` and, if applicable,
in your local kube config and in ``inventory.yml`` in the cluster directory
``clusters/stackspin.example.org``.

Delete evicted pods
-------------------

In case your cluster disk is full, Kubernetes `taints`_ the node with
``DiskPressure``. It then tries to evict pods, which is pointless in a
single-node setup but can still happen. We have experienced hundreds of
pods in the ``Evicted`` state that still showed up after the
``DiskPressure`` condition had recovered. See also the
`out of resource handling with kubelet`_ documentation.

You can delete all evicted pods with this command:

.. code:: console

   $ kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'

Apply changes to flux variables
-------------------------------

Before installing, you configured cluster variables in ``.flux.env`` in
your cluster directory. If you change any of these variables after
installation, you can apply the changes by following the
:ref:`install_core_apps` instructions until the step
``kubectl apply -k $CLUSTER_DIR``. Then run the following command, which
applies the changes to all installed kustomizations:

.. code:: console

   $ flux get -A kustomizations --no-header | awk -F' ' '{system("flux reconcile -n " $1 " kustomization " $2)}'

Run Nextcloud ``occ`` commands
------------------------------

Nextcloud includes a CLI tool called ``occ`` ("ownCloud console"). This
tool can be used for all kinds of tasks you might want to do as a system
administrator.

To use the tool, you need to enter Nextcloud's "pod" and change to the
correct user. The following commands achieve that.

``exec`` opens a root shell inside the pod:

.. code:: console

   $ kubectl -n stackspin-apps exec deploy/nc-nextcloud -it -- bash

Change to the ``www-data`` user:

.. code:: console

   $ su -s /bin/bash www-data

Run ``occ``:

.. code:: console

   $ php occ list

.. _Velero's GitHub release page: https://github.com/vmware-tanzu/velero/releases/latest
.. _Velero's documentation: https://velero.io/docs/v1.4/
.. _reach out to us: https://stackspin.net/contact.html
.. _taints: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _out of resource handling with kubelet: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
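For illustration, here is a small ``occ`` session you could run as the
``www-data`` user inside the pod; the subcommands shown are standard
Nextcloud ``occ`` commands, so adjust them to your actual task:

.. code:: console

   $ php occ user:list
   $ php occ maintenance:mode --on
   $ php occ maintenance:mode --off

Remember to turn maintenance mode off again when you are done, or the app
will stay unavailable to users.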