Maintenance
===========

.. _backup:

Backup
------

On your provisioning machine
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

During the installation process, a cluster config directory is created on
your provisioning machine, located in the top-level sub-directory
``clusters`` in your clone of the Stackspin git repository. Although these
files are not essential for your Stackspin cluster to continue functioning,
you may want to back this folder up because it allows easy access to your
cluster.

On your cluster
~~~~~~~~~~~~~~~

Stackspin supports using the program Velero to make backups of your
Stackspin instance to external storage via the S3 API. See
:ref:`backups-with-velero` in the installation instructions for setup
details.

For the maintenance operations described below -- in particular, restoring
backups -- you need the ``velero`` client program installed, typically on
your provisioning machine, although you can also run it on the VPS if
preferred. You can find it on `Velero's GitHub release page`_.

By default, Velero makes nightly backups of the entire cluster (minus
Prometheus data). To make a manual backup, run

.. code:: console

   cluster$ velero create backup BACKUP_NAME --exclude-namespaces velero --wait

from your VPS. See ``velero --help`` for other commands, and
`Velero's documentation`_ for more information.

.. note::

   In case you want to make an (additional) backup of application data via
   alternate means: all persistent volume data of the cluster are stored in
   directories under ``/var/lib/Stackspin/local-storage``.

Restore
-------

Restoring from backups currently has to be done via the command line; we
intend to make this possible from the Stackspin dashboard in the near
future.

These instructions explain how to restore the persistent data of an
individual app (such as Nextcloud, or Zulip) to a previous point in time,
from a backup to S3-compatible storage made using Velero, on a Stackspin
cluster that is in a healthy state.

Using backups to recover from more severe problems, like a broken or
completely destroyed Stackspin cluster, is also possible, by reinstalling
the cluster from scratch and restoring individual app data on top of that.
However, that procedure is not as streamlined and is not documented here.
If you are in that situation, please `reach out to us`_ for advice or
assistance.

Select backup
~~~~~~~~~~~~~

To show a list of available backups, run the following command on your VPS:

.. code:: console

   $ kubectl get backup -A

Once you have chosen a backup to restore from, record its name as written
in the ``kubectl`` output.

.. note::

   Please be aware that for technical reasons the restore operation will
   restore not only the persistent data from this backup, but also the
   app's software version that was running at that time. Although the
   auto-update mechanism should in turn update the app to a recent version,
   and the recent app version should be able to automatically perform any
   necessary data format migrations on the old data, this operation has not
   been well tested for older backups, so please proceed carefully. As an
   example of what could go wrong: Nextcloud requires upgrades to be done
   serially, never skipping a major version upgrade, so if your backup is
   from two or more major Nextcloud versions ago, some manual intervention
   is required. If you have any doubts, please `reach out to us`_.
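If you want to double-check what a backup contains before restoring from
it, the ``velero`` client can describe it. A quick sketch using standard
``velero`` subcommands, where ``BACKUP_NAME`` is a placeholder for the name
you recorded above:

.. code:: console

   $ velero backup get
   $ velero backup describe BACKUP_NAME --details

The ``--details`` flag lists the individual resources and volumes included
in the backup.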
Restore app data
~~~~~~~~~~~~~~~~

.. warning::

   Please note that restoring data is a destructive operation! It will
   replace the app's data as they are now. There is no way to undo a
   restore operation, unless you have a copy of the current app data, in
   the form of a current Stackspin backup or an app-specific data export.
   For that reason, we recommend making another backup right before
   beginning a restore operation.

To restore the data of app ``$app`` from the backup named ``$backup``, run
the following commands:

.. code:: console

   $ flux suspend kustomization $app
   $ flux suspend helmrelease -n stackspin-apps $app
   $ kubectl delete all -n stackspin-apps -l stackspin.net/backupSet=$app
   $ kubectl delete pvc -n stackspin-apps -l stackspin.net/backupSet=$app
   $ velero restore create arbitrary-name-of-restore-operation --from-backup=$backup -l stackspin.net/backupSet=$app

At this point, wait for the restore operation to finish; see the text
below. Then resume the flux objects:

.. code:: console

   $ flux resume helmrelease -n stackspin-apps $app
   $ flux resume kustomization $app

.. note::

   Specifically for Nextcloud, the ``kubectl delete pvc ...`` command might
   hang due to a Kubernetes job that references that PVC. To solve that,
   look for such jobs using ``kubectl get job -n stackspin-apps`` and
   delete any finished ones using ``kubectl delete job ...``. That should
   let the ``kubectl delete pvc ...`` command finish; if you already
   interrupted it, run it again.

The ``velero restore create ...`` command initiates the restore operation,
but it doesn't wait until the operation is complete. You may use the
commands suggested in the terminal output to check on the status of the
operation. Additionally, once the restore operation is finished, it may
take some more time for the various app components to be fully started and
for the app to be operational again.
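As a sketch of such a status check, using standard ``velero`` subcommands
and the restore name chosen in the ``velero restore create`` command above:

.. code:: console

   $ velero restore get
   $ velero restore describe arbitrary-name-of-restore-operation
   $ velero restore logs arbitrary-name-of-restore-operation

The restore operation has finished once the ``describe`` output reports the
phase ``Completed``.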
Change the IP of your cluster
-----------------------------

In case your cluster needs to migrate to another IP address, make sure to
update the IP address in ``/etc/rancher/k3s/k3s.yaml`` and, if applicable,
in your local kube config and in ``inventory.yml`` in the cluster directory
``clusters/stackspin.example.org``.

Delete evicted pods
-------------------

In case your cluster disk is full, Kubernetes `taints`_ the node with
``DiskPressure``. It then tries to evict pods, which is pointless in a
single-node setup but can still happen. We have experienced hundreds of
pods in the ``Evicted`` state that still showed up after the
``DiskPressure`` condition had recovered. See also the
`out of resource handling with kubelet`_ documentation.

You can delete all evicted pods with this command:

.. code:: console

   $ kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'

Apply changes to flux variables
-------------------------------

Before installing, you configured cluster variables in ``.flux.env`` in
your cluster directory. If you change any of these variables after
installation, you can apply the changes by following the
:ref:`install_core_apps` instructions until the step
``kubectl apply -k $CLUSTER_DIR``. Then run the following command, which
applies the changes to all installed kustomizations:

.. code:: console

   $ flux get -A kustomizations --no-header | awk -F' ' '{system("flux reconcile -n " $1 " kustomization " $2)}'

Run Nextcloud ``occ`` commands
------------------------------

Nextcloud includes a CLI tool called ``occ`` ("ownCloud console"). This
tool can be used for all kinds of tasks you might want to do as a system
administrator.

To use the tool, you need to enter Nextcloud's "pod" and change to the
correct user. The following commands achieve that.

``exec`` opens a root shell inside the pod:

.. code:: console

   $ kubectl -n stackspin-apps exec deploy/nc-nextcloud -it -- bash

Change to the ``www-data`` user:

.. code:: console

   $ su -s /bin/bash www-data

Run ``occ``:

.. code:: console

   $ php occ list

.. _Velero's GitHub release page: https://github.com/vmware-tanzu/velero/releases/latest
.. _Velero's documentation: https://velero.io/docs/v1.4/
.. _reach out to us: https://stackspin.net/contact.html
.. _taints: https://kubernetes.io/docs/concepts/scheduling-eviction/taint-and-toleration/
.. _out of resource handling with kubelet: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
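For illustration, here is a small ``occ`` session you could run as the
``www-data`` user inside the pod; the subcommands shown are standard
Nextcloud ``occ`` commands, so adjust them to your actual task:

.. code:: console

   $ php occ user:list
   $ php occ maintenance:mode --on
   $ php occ maintenance:mode --off

Remember to turn maintenance mode off again when you are done, or the app
will stay unavailable to users.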