.. _upgrades:

Upgrades
========

Most of the time, a Stackspin cluster will pull and apply upgrades from our
repository automatically. You can find details on automated upgrades in the
:ref:`system_administration/upgrades:Automated upgrades` section.

From time to time though, changes are introduced that require some manual
action. We mark this by increasing the first component of the Stackspin
version number, and call this a "major release". Please follow
:ref:`system_administration/upgrades:Manual upgrades` for more information.

Automated upgrades
------------------

Flux
~~~~

Flux maintenance window
^^^^^^^^^^^^^^^^^^^^^^^

Automated upgrades based on Flux are configured to run during the night, so
they don't cause possible downtime during working hours. Currently the
maintenance window runs from 2am until 4am (in the timezone of your cluster
host). We plan to make this window configurable via the Stackspin Dashboard.

During this window the Flux ``stackspin`` ``gitRepository`` is resumed;
outside of it, the repository is suspended. If you want to apply automated
upgrades outside this window, you can manually resume the ``stackspin``
``gitRepository``:

.. code:: console

   $ flux resume source git stackspin

And afterwards suspend it again:

.. code:: console

   $ flux suspend source git stackspin

.. _apply_flux_env:

Apply changes to flux variables
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Before installing, you configured cluster variables in your cluster
directory in ``.flux.env``. If you change any of these variables after
installation, you can apply the changes by following the
:ref:`install_core_apps` instructions until the step
``kubectl apply -k $CLUSTER_DIR``. Then, use the following command to apply
the changes to all installed kustomizations:

.. code:: console

   $ flux get -A kustomizations --no-header | awk -F' ' '{system("flux reconcile -n " $1 " kustomization " $2)}'

System-upgrade-controller
~~~~~~~~~~~~~~~~~~~~~~~~~

We use the `Rancher system-upgrade-controller
<https://github.com/rancher/system-upgrade-controller>`_ to auto-upgrade
``k3s`` on the cluster node. ``system-upgrade-controller`` is quite
versatile and could be used for other automated upgrades in the future, but
right now we only use it for ``k3s``.

Similar to automated application upgrades with Flux, automated upgrades for
``k3s`` are configured to run during the night, so they don't cause possible
downtime during working hours. Currently the maintenance window runs from
1am until 2am (in the timezone of your cluster host). We plan to make this
window configurable via the Stackspin Dashboard.

During this window the ``system-upgrade-controller`` deployment is scaled
up, so that its pod is running and performs upgrades, if there are any.
Outside this window the ``system-upgrade-controller`` deployment is scaled
down. If you want to apply automated ``k3s`` upgrades outside this window,
you can manually scale up the ``system-upgrade-controller`` ``deployment``:

.. code:: console

   $ kubectl -n system-upgrade create job --from=cronjob/system-upgrade-controller-scale-up system-upgrade-controller-up

And afterwards scale it down again:

.. code:: console

   $ kubectl -n system-upgrade create job --from=cronjob/system-upgrade-controller-scale-down system-upgrade-controller-down

Please be aware that it is currently not supported to apply a custom ``k3s``
version: Stackspin would override it again, resulting in a possible ``k3s``
downgrade.
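If you are unsure whether either maintenance mechanism is currently active,
you can inspect the state directly. A quick sketch using standard ``flux``
and ``kubectl`` commands; this assumes the controller's deployment carries
its default name, ``system-upgrade-controller``:

.. code:: console

   $ flux get sources git stackspin
   $ kubectl -n system-upgrade get deployment system-upgrade-controller

The first command shows whether the ``stackspin`` ``gitRepository`` is
suspended; the second shows whether the controller is scaled up
(``READY 1/1``) or down (``READY 0/0``).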
Manual upgrades
---------------

General instructions for upgrading to a new major version:

#. Check out the new branch on the provisioning machine:

   .. code:: console

      git checkout v2

#. Load the python virtualenv, if necessary.

#. Run the upgrade script. Enter the name of your cluster's folder in
   ``clusters/`` for ``<cluster_name>``:

   .. code:: console

      ./bin/upgrade-scripts/to-2.0/upgrade.sh <cluster_name>

#. Check that all components are being upgraded by flux:

   .. code:: console

      watch flux get kustomizations

   This may take quite a while, but all kustomizations should become ready
   after some time. If not, investigate and/or ask for support. In case of
   problems, please also take a look at the :ref:`troubleshooting` section.

From 1.0 to 2
~~~~~~~~~~~~~

Please follow the general upgrade guide above.

* If the upgrade script fails (e.g. with ``Helm upgrade failed: cannot patch
  "kube-prometheus-stack-prometheus-node-exporter" with kind DaemonSet``),
  please run the upgrade script again. If the issue persists, please reach
  out to us so we can help.

* If you have Nextcloud and/or Zulip installed, they will most likely end up
  in a failing state after you follow the upgrade instructions. If that
  happens, run ``./bin/upgrade-scripts/to-2.0/fix-app.sh <app>`` for each of
  them. This script requires the ``helm`` binary to be installed. Especially
  the Nextcloud installation can take a while; the script may need up to 20
  minutes to complete.

From 0.8 to 1.0
~~~~~~~~~~~~~~~

#. Check out the new branch on the provisioning machine:

   .. code:: console

      git checkout v1.0

#. Load the python virtualenv, if necessary.

#. Update the python packages:

   .. code:: console

      pip install -r requirements.txt

#. Run the ansible playbook:

   .. code:: console

      python -m stackspin stackspin.example.org install

#. Export the ``KUBECONFIG`` environment variable, pointing to the
   ``kube_config_cluster.yml`` in your cluster directory, e.g.:

   .. code:: console

      export KUBECONFIG=stackspin.example.org/kube_config_cluster.yml

#. Run the upgrade script:

   .. code:: console

      ./bin/upgrade-scripts/to-1.0/upgrade.sh

After the last step (the upgrade script), flux will update all components.
In some cases an error can occur at that point.

* If the Nextcloud upgrade fails because the ``setup-apps`` job consistently
  fails with ``App "ONLYOFFICE" cannot be installed because it is not
  compatible with this version of the server.``, manually run
  ``php occ app:update --all`` and then ``php occ app:enable onlyoffice``
  inside the Nextcloud container (see the sketch after this list).

* If the ``kube-system-config`` kustomization fails calling a validation
  webhook, simply retry it with
  ``flux reconcile kustomization kube-system-config``.
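The ONLYOFFICE fix above requires running ``occ`` inside the Nextcloud
container. A hedged sketch of doing that in one go: it assumes the Nextcloud
deployment is called ``nc-nextcloud`` in the ``stackspin-apps`` namespace
(as in the 0.8 upgrade notes below; verify with
``kubectl get deployments -n stackspin-apps``) and that ``occ`` has to be
run as the ``www-data`` user from ``/var/www/html``:

.. code:: console

   $ kubectl -n stackspin-apps exec -it deployment/nc-nextcloud -- \
       su www-data -s /bin/bash -c "cd /var/www/html && php occ app:update --all && php occ app:enable onlyoffice"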
Upgrading to 0.8
~~~~~~~~~~~~~~~~

.. note::

   0.8 introduces many breaking changes. We did our best to make the upgrade
   smooth, but it will require a lot of manual intervention. Please reach
   out to us for help if needed!

When upgrading to version 0.8, OpenAppStack is renamed to its final name:
*Stackspin*. This comes with many changes, some of which need to be applied
manually. We have written a script that automates a lot of the preparations
for the upgrade. However, afterwards you might need to get your hands dirty
to get all your applications to work again. **Read this whole upgrade guide
carefully before you get started!**

Log in to your Stackspin server:

.. code:: console

   $ ssh <your cluster domain>

Download our upgrade script:

.. code:: console

   $ wget https://open.greenhost.net/stackspin/stackspin/-/raw/main/bin/upgrade-scripts/to-0.8.0/rename-to-stackspin.sh
   $ chmod +x rename-to-stackspin.sh

First of all, if you have any ``-override`` configmaps or secrets, you'll
want to move them from the ``oas`` namespace to the ``stackspin`` namespace,
and from ``oas-apps`` to ``stackspin-apps`` (you also need to create these
namespaces first). You also need to rename them from ``oas-X`` to
``stackspin-X``. You can use a command like this to rename a configmap and
move it to the right namespace:

.. code:: console

   $ kubectl get cm -n oas-apps oas-$APP-override -o json | jq '.metadata.name="stackspin-$APP-override"' | jq '.metadata.namespace="stackspin-apps"' | kubectl apply -f -

**This script will cause serious downtime, and it will not do everything for
you.** Rather, it prepares your cluster for the upgrade. The script does the
following:

#. Install ``jq``.
#. Shut down the cluster, make a back-up of the data, and bring the cluster
   back up.
#. Copy all relevant ``oas-*`` secrets to ``stackspin-*``.
#. Move all PersistentVolumeClaims to the ``stackspin`` and
   ``stackspin-apps`` namespaces, and set the PersistentVolumes'
   ReclaimPolicy to "Retain" so your data is not accidentally deleted.
#. Delete all OAS ``flux`` kustomizations.
#. Delete the ``oas`` and ``oas-apps`` namespaces.
#. Create the new ``stackspin`` source and kustomization.

Because there are not many Stackspin users yet, the script may need some
manual adjustments. It was written for clusters on which all applications
are installed. If you have *not* installed some of the applications, please
remove those applications from the script manually.

Execute the upgrade preparation script:

.. code:: console

   $ ./rename-to-stackspin.sh

After this, you need to update the secrets and Flux in the cluster by
running ``install/install-stackspin.sh``. Then re-install applications by
running ``install/install-app.sh <app>`` from the Stackspin repository. See
the application specific upgrade guides below.

**After all your applications work again**, you can clean up the old secrets
and reset the PersistentVolumes' ReclaimPolicy to ``Delete``:

.. code:: console

   $ wget https://open.greenhost.net/stackspin/stackspin/-/raw/main/bin/upgrade-scripts/to-0.8.0/cleanup.sh
   $ chmod +x cleanup.sh
   $ ./cleanup.sh

Nextcloud
^^^^^^^^^

Your SSO users will have new usernames, because the OIDC provider has been
renamed from ``oas`` to ``stackspin``, and because the new SSO system uses
UUIDs to uniquely identify users. You can choose from these options:

1. Manually re-upload and re-share your files after logging in to your new
   user for the first time.

2. Transfer the files from your previous user to the new user. To do so,
   find your new username: it is visible in Settings -> Sharing behind "Your
   Federated Cloud ID" (the part *before* the ``@``) after you've logged out
   and back in to Nextcloud with the new SSO. Then:

   1. Exec into the Nextcloud container:

      .. code:: console

         $ kubectl exec -n stackspin-apps nc-nextcloud-xxx-xxx -it -- /bin/bash

   2. Change to the www-data user:

      .. code:: console

         $ su www-data -s /bin/bash

   3. Repeat this command for each username (see the loop sketch below):

      .. code:: console

         $ php occ files:transfer-ownership oas-<old username> <new username>

      Note: the files are transferred to a subfolder in the new user's
      directory.
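If you have more than a handful of users, you can script the loop mentioned
in step 3. A minimal sketch, to be run inside the Nextcloud container as the
``www-data`` user; the usernames in the mapping are hypothetical
placeholders that you need to replace with your own:

.. code:: bash

   #!/usr/bin/env bash
   # Map each old (pre-0.8) username to the new SSO username found behind
   # "Your Federated Cloud ID". Both entries below are placeholders.
   declare -A old_to_new=(
     [alice]="new-username-for-alice"
     [bob]="new-username-for-bob"
   )

   # Transfer the files of every old user to the matching new user.
   for old in "${!old_to_new[@]}"; do
     php occ files:transfer-ownership "oas-${old}" "${old_to_new[$old]}"
   done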
Depending on when you first installed Nextcloud, the ``setup-apps`` job may
fail during the upgrade. If that happens, execute these commands to update
the failing apps to their newest versions and to remove old files that can
cause problems:

.. code:: console

   kubectl exec -n stackspin-apps deployment/nc-nextcloud -- rm -r /var/www/html/custom_apps/onlyoffice
   kubectl exec -n stackspin-apps deployment/nc-nextcloud -- rm -r /var/www/html/custom_apps/sociallogin
   flux suspend hr -n stackspin-apps nextcloud && flux resume hr -n stackspin-apps nextcloud

Rocket.Chat
^^^^^^^^^^^

We replaced Rocket.Chat with `Zulip`_ in this release. If you want to
migrate your Rocket.Chat data to your new `Zulip`_ installation, please
refer to `Import from Rocket.Chat`_.

Monitoring
^^^^^^^^^^

The monitoring stack will work after the upgrade, but monitoring data from
the previous version will not be available.

Wekan
^^^^^

In our testing we didn't need to change anything for Wekan to work.

WordPress
^^^^^^^^^

In our testing we didn't need to change anything for WordPress to work.

Upgrading to 0.7.0
~~~~~~~~~~~~~~~~~~

Because of problems with Helm and secret management, we had to move away
from using a helm chart for application secrets; we now use scripts that run
during installation to manage secrets. Because we have removed the
``oas-secrets`` helm chart, Flux will remove the secrets it generated. **It
is important that you back up these secrets before switching from v0.6 to
v0.7!**

.. note::

   Before you start, please ensure that you have the right ``yq`` tool
   installed, because you will need it later. There are two very different
   versions of ``yq``. The one you need is the Go-based `yq from Mike
   Farah`_, which installs the same binary name as the `python-yq`_ one,
   while both have different command sets. The ``yq`` needed here can be
   installed by running ``sudo snap install yq``, ``brew install yq``, or
   with other methods from the `yq installation instructions`_. If you're
   unsure which ``yq`` you have installed, look at the output of
   ``yq --help`` and make sure ``eval`` shows up under
   ``Available Commands:``.

To back up your secrets, run the following script. It creates a directory
called ``secrets-backup`` and places the secrets that have been generated by
Helm in it as ``yaml`` files:

.. code:: bash

   #!/usr/bin/env bash

   mkdir secrets-backup
   kubectl get secret -o yaml -n flux-system oas-cluster-variables > secrets-backup/oas-cluster-variables.yaml
   kubectl get secret -o yaml -n flux-system oas-wordpress-variables > secrets-backup/oas-wordpress-variables.yaml
   kubectl get secret -o yaml -n flux-system oas-wekan-variables > secrets-backup/oas-wekan-variables.yaml
   kubectl get secret -o yaml -n flux-system oas-single-sign-on-variables > secrets-backup/oas-single-sign-on-variables.yaml
   kubectl get secret -o yaml -n flux-system oas-rocketchat-variables > secrets-backup/oas-rocketchat-variables.yaml
   kubectl get secret -o yaml -n flux-system oas-kube-prometheus-stack-variables > secrets-backup/oas-kube-prometheus-stack-variables.yaml
   kubectl get secret -o yaml -n oas oas-prometheus-basic-auth > secrets-backup/oas-prometheus-basic-auth.yaml
   kubectl get secret -o yaml -n oas oas-alertmanager-basic-auth > secrets-backup/oas-alertmanager-basic-auth.yaml
   kubectl get secret -o yaml -n flux-system oas-oauth-variables > secrets-backup/oas-oauth-variables.yaml
   kubectl get secret -o yaml -n flux-system oas-nextcloud-variables > secrets-backup/oas-nextcloud-variables.yaml

This script assumes you have all applications enabled. You might get an
error like:

.. code:: console

   Error from server (NotFound): secrets "oas-wekan-variables" not found

This is not a problem, but it *does* mean you need to add an oauth secret
for Wekan to the file ``secrets-backup/oas-oauth-variables.yaml``. Copy one
of the lines under ``data:``, rename the field to
``wekan_oauth_client_secret``, and enter a different random password. Make
sure to base64 encode it (e.g. ``echo -n "<password>" | base64``; use
``-n`` so you don't encode a trailing newline).
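If you need a random value for such a secret, you can generate and encode
one in a single pipeline. A minimal sketch, assuming ``openssl`` is
available:

.. code:: console

   $ openssl rand -hex 32 | tr -d '\n' | base64

``openssl rand -hex 32`` prints 64 random hex characters, ``tr`` strips the
trailing newline, and ``base64`` produces the value you can paste into the
``data:`` section.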
Now you can upgrade your cluster by running ``kubectl -n flux-system patch
gitrepository openappstack --type merge -p
'{"spec":{"ref":{"branch":"v0.7"}}}'``, or by editing the ``gitrepository``
object manually with ``kubectl -n flux-system edit gitrepository
openappstack`` and setting ``spec.ref.branch`` to ``v0.7``.

Flux will now start updating your cluster to version ``0.7``. This process
will fail, because it removes the secrets that you just backed up. Make sure
that the ``oas-secrets`` helmrelease has been removed by running
``flux get hr -A``. You might also see that some helmreleases fail to
install because important secrets do not exist anymore.

As soon as the ``oas-secrets`` helmrelease does not exist anymore, you can
run the following script to restore the secrets from your backup:

.. code:: bash

   #!/usr/bin/env bash

   # Again: make sure you use https://github.com/mikefarah/yq -- install with `snap install yq`
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-wordpress-variables.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-wekan-variables.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-single-sign-on-variables.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-rocketchat-variables.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-kube-prometheus-stack-variables.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-prometheus-basic-auth.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-alertmanager-basic-auth.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-oauth-variables.yaml | kubectl apply -f -
   yq eval 'del(.metadata.annotations,.metadata.labels,.metadata.creationTimestamp,.metadata.resourceVersion,.metadata.uid)' secrets-backup/oas-nextcloud-variables.yaml | kubectl apply -f -

Again, this script assumes you have all applications installed. If you get
the following error, you can ignore it:

.. code:: console

   error: error validating "STDIN": error validating data: [apiVersion not set, kind not set]; if you choose to ignore these errors, turn validation off with --validate=false

Now Flux should succeed in finishing the update.
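To follow along while Flux finishes the update, you can watch the
kustomizations and helmreleases until everything becomes ready, using the
same status commands as elsewhere in this guide:

.. code:: console

   $ watch flux get kustomizations -A
   $ watch flux get helmreleases -A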
Some helmreleases or kustomizations might have failed already because the
secrets did not exist. Once failed, you can re-trigger reconciliation of a
kustomization or helmrelease using ``flux reconcile kustomization ...`` or
``flux reconcile helmrelease ...``. This can take quite a while (sometimes
over an hour), because Flux waits for some long timeouts before giving up
and re-starting a reconciliation.

Potential upgrade issues
^^^^^^^^^^^^^^^^^^^^^^^^

Some errors we've seen during our own upgrade process, and how to solve
them:

SSO helm upgrade failed
'''''''''''''''''''''''

.. code::

   oas single-sign-on False Helm upgrade failed: template: single-sign-on/templates/secret-oauth2-clients.yaml:9:55: executing "single-sign-on/templates/secret-oauth2-clients.yaml" at : invalid value; expected string 0.2.2 False

This means that the ``single-sign-on`` helmrelease was created with empty
oauth secrets. The secrets will get a value once the ``core``
*kustomization* is reconciled: ``flux reconcile ks core`` should solve the
problem.

If that does not solve the problem, you should check whether the secret
contains a value for all the apps:

.. code::

   # kubectl get secret -n flux-system oas-oauth-variables -o yaml
   apiVersion: v1
   data:
     grafana_oauth_client_secret: ...
     nextcloud_oauth_client_secret: ...
     rocketchat_oauth_client_secret: ...
     userpanel_oauth_client_secret: ...
     wekan_oauth_client_secret: ...
     wordpress_oauth_client_secret: ...

If your secret lacks one of these variables, use ``kubectl edit`` to add
them. You can use any password generator to create a password for it. Make
sure to base64 encode the data before you enter it in the secret.

Loki upgrade retries exhausted
''''''''''''''''''''''''''''''

While running ``flux get helmrelease -A``, you'll see:

.. code::

   oas loki False upgrade retries exhausted 2.5.2 False

This happens sometimes because Loki takes a long time to upgrade. Usually it
is solved by running ``flux reconcile hr loki -n oas`` again.

Upgrading to 0.6.0
~~~~~~~~~~~~~~~~~~

A few things are important when upgrading to 0.6.0:

- We now use Flux 2, and the installation procedure has been overhauled. For
  this reason we advise you to set up a completely new cluster.
- Copy your configuration details from ``settings.yaml`` to a new
  ``.flux.env``. See ``install/.flux.env.example`` and the
  :ref:`installation_overview` instructions for more information.

Please `reach out to us`_ if you are using, or plan to use, OAS in
production.

Upgrading from 0.4.0 to 0.5.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Unfortunately we can't ensure a smooth upgrade for this version either.
Please read the section below on how to upgrade by installing the new OAS
version from scratch after backing up your data.

Upgrading from 0.3.0 to 0.4.0
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is no easy upgrade path from version 0.3.0 to version 0.4.0. As far as
we know, nobody was running OpenAppStack apart from the developers, so we
assume this is not a problem. If you do need to upgrade, this is how you can
migrate your data: back up all the data available under
``/var/lib/OpenAppStack/local-storage``, create a new cluster using the
installation instructions, and put back the data. This migration procedure
might not work perfectly.

Use ``kubectl get pvc -A`` on your old cluster to get a mapping of all the
PVC uuids (and thus their folder names in
``/var/lib/OpenAppStack/local-storage``) to the pods they are bound to; a
compact way to print this mapping is sketched below.
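A hedged one-liner for printing that mapping, using standard ``kubectl``
custom columns (the VOLUME column is the PersistentVolume name, which
appears in the folder names under ``/var/lib/OpenAppStack/local-storage``):

.. code:: console

   $ kubectl get pvc -A -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,VOLUME:.spec.volumeName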
Then, delete your old OpenAppStack, and install a new one with version
number 0.4.0 or higher. You can upload your backed up data into
``/var/lib/OpenAppStack/local-storage``. All PVCs will have new unique IDs
(and thus different folder names), so you have to manually match the folders
from your backup with the new folders.

Additionally, if you want to re-use your old ``settings.yaml`` file, this
data needs to be added to it:

.. code:: yaml

   backup:
     s3:
       # Disabled by default. To enable, change to `true` and configure the
       # settings below. You'll also want to add "velero" to the enabled
       # applications a bit further in this file.
       # Finally, you'll also need to provide access credentials as
       # secrets; see the documentation:
       # https://docs.openappstack.net/en/latest/installation_instructions.html#step-2-optional-cluster-backups-using-velero
       enabled: false
       # URL of S3 service. Please use the principal domain name here,
       # without the bucket name.
       url: "https://store.greenhost.net"
       # Region of S3 service that's used for backups. For some on-premise
       # providers this may be irrelevant, but the S3 client apparently
       # requires it at some point.
       region: "ceph"
       # Name of the S3 bucket that backups will be stored in.
       # This has to exist already: Velero will not create it for you.
       bucket: "openappstack-backup"
       # Prefix that's added to backup filenames.
       prefix: "test-instance"

   # A whitelist of applications that will be enabled.
   enabled_applications:
     # System components, necessary for the system to function.
     - 'cert-manager'
     - 'letsencrypt-production'
     - 'letsencrypt-staging'
     - 'ingress'
     - 'local-path-provisioner'
     - 'single-sign-on'
     # The backup system Velero is disabled by default, see settings under
     # `backup` above.
     # - 'velero'
     # Applications.
     - 'grafana'
     - 'loki'
     - 'promtail'
     - 'nextcloud'
     - 'prometheus'
     - 'rocketchat'
     - 'wordpress'

Upgrading to 0.3.0
~~~~~~~~~~~~~~~~~~

Upgrading from versions earlier than ``0.3.0`` requires manual intervention.

- Move your local ``settings.yml`` file to a different location:

  .. code:: console

     $ cd CLUSTER_DIR
     $ mkdir -p ./group_vars/all/
     $ mv settings.yml ./group_vars/all/

- `Flux`_ is now used to install and update applications. For that reason,
  we need you to remove all helm charts (WARNING: you will lose your data!):

  .. code:: console

     $ helm delete --purge oas-test-cert-manager oas-test-local-storage \
         oas-test-prometheus oas-test-proxy oas-test-files

- After removing all helm charts, you probably also want to remove all the
  ``pvc``\ s that are left behind. Flux will not re-use the database PVCs
  created for these applications. Find all the PVCs by running
  ``kubectl get pvc --namespace oas-apps`` and
  ``kubectl get pvc --namespace oas``.

.. _reach out to us: https://openappstack.net/contact.html
.. _Flux: https://fluxcd.io
.. _yq from Mike Farah: https://mikefarah.github.io/yq
.. _yq installation instructions: https://mikefarah.github.io/yq/#install
.. _python-yq: https://github.com/kislyuk/yq
.. _Zulip: https://zulip.com
.. _Import from Rocket.Chat: https://api.zulip.com/help/import-from-rocketchat