Troubleshooting
===============

If you run into problems, there are a few things you can do to research the
problem. This document describes what you can do.

.. note::

   **cluster$** indicates that the commands should be run as root on your
   Stackspin machine. All other commands need to be run on your *provisioning
   machine*.

**We would love to hear from you!** If you have problems, please create an
issue in our `issue tracker `__ or reach out as described on our `contact
page `__. We want to be in communication with our users, and we want to help
you if you run into problems.

Known issues
------------

If you run into a problem, please check our `issue tracker `__ to see if
others have run into the same problem. We might have suggested a workaround
or temporary solution in one of our issues. If your problem is not described
in an issue yet, please open a new one so we can solve the problems you
encounter.

SSH access
----------

You can log in to your VPS with SSH. Some programs that are available to the
root user on the VPS:

* ``kubectl``, the Kubernetes control program. The root user is connected to
  the cluster automatically.
* ``helm`` is the "Kubernetes package manager". Use for example ``helm ls
  --all-namespaces`` to see what apps are installed in your cluster. You can
  also use it to perform manual upgrades; see ``helm --help``.
* ``flux`` is the `flux`_ command line tool.

.. _flux: https://fluxcd.io

Using kubectl to debug your cluster
-----------------------------------

You can use ``kubectl``, the Kubernetes control program, to inspect and
manipulate your Kubernetes cluster. Once you have installed ``kubectl``, use
the Stackspin CLI to get access to your cluster:

.. code:: console

   $ python -m stackspin stackspin.example.org info

Look for these lines in the output:

.. code:: sh

   # To use kubectl with this cluster, copy-paste this in your terminal:
   export KUBECONFIG=/home/you/projects/stackspin/clusters/stackspin.example.org/kube_config_cluster.yml

Copy the whole ``export`` line into your terminal. **In the same terminal
window**, ``kubectl`` will from now on connect to your cluster.

Alternatively, use SSH to log into your machine, and ``kubectl`` will be
available there.

Application installation or upgrade failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Application installations and upgrades are managed by `flux`_. Flux uses
`helm-controller`_ to install and upgrade applications with `helm charts
`__.

An application installed with Flux consists of a *kustomization*. This
resource defines where the information about the application is stored in
our Git repository. The *kustomization* contains a *helmrelease*, which is
an object that represents an installation of a Helm chart. Read more about
the difference between *kustomizations* and *helmreleases* in the `flux
documentation `__. Be aware that there is a difference between `Flux
Kustomization objects `__ and `Kubernetes kustomizations `__. In this
section we refer to the Flux kustomizations.

To find out if all *kustomizations* have been applied correctly, run the
following flux command in your cluster or from the provisioning machine:

.. code:: console

   cluster$ flux get kustomizations --all-namespaces

If all your *kustomizations* are in a *Ready* state, take a look at your
*helmreleases*:

.. code:: console

   cluster$ flux get helmreleases -A

If there is an issue, use ``kubectl`` to inspect the respective service, for
example ``nginx``:

.. code:: console

   $ kubectl describe helmrelease -n stackspin nginx
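If the ``describe`` output does not make the cause clear, the recent events
in the application's namespace often do. This is a generic ``kubectl``
technique, not specific to Stackspin; sorting by timestamp lists the newest
events last:

.. code:: console

   $ kubectl --namespace stackspin get events --sort-by=.lastTimestamp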
If the error message mentions a problem in a ``HelmChart`` or
``GitRepository``, you can get information about those objects in a similar
fashion.

For git repositories:

.. code:: console

   $ flux get source git

For helm repositories:

.. code:: console

   $ flux get source helm

For helm charts:

.. code:: console

   $ flux get source chart

For more information, use ``flux --help``, or ``flux get --help``.

HelmReleases that have no problems with their sources, but still fail, can
often be fixed by simply suspending and resuming them. Use these ``flux``
commands:

.. code:: console

   $ flux --namespace NAMESPACE suspend helmrelease NAME
   $ flux --namespace NAMESPACE resume helmrelease NAME

If your HelmRelease is outdated, you can often resolve complications by
telling Flux to *reconcile* it. This tells Flux to compare the HelmRelease's
current state with the desired state:

.. code:: console

   cluster$ flux reconcile helmrelease nextcloud

Viewing upgrade history
'''''''''''''''''''''''

To see when an application was updated to which version, you can use the
``helm history`` command. For example, to see the update history for Zulip,
you can run:

.. code:: console

   $ helm history -n stackspin-apps zulip

Debugging on a lower level
~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also debug the ``pods`` that run applications. To get an overview of
all pods, run:

.. code:: console

   $ kubectl get pods --all-namespaces

This will show you all pods. Check for failing pods by looking at the
``READY`` column. If you find failing pods, you can access their logs with:

.. code:: console

   $ kubectl --namespace NAMESPACE logs POD

You can also enter the pod's shell by running:

.. code:: console

   $ kubectl --namespace NAMESPACE exec POD -it -- /bin/sh

.. _helm-controller: https://fluxcd.io/docs/components/helm/

HTTPS certificates
------------------

Stackspin uses `cert-manager `__ to automatically fetch `Let's Encrypt `__
certificates for all deployed services. If you experience invalid SSL
certificates, e.g. your browser warns you when visiting Zulip
(https://zulip.stackspin.example.org), a useful resource for troubleshooting
is the official cert-manager `Troubleshooting Issuing ACME Certificates `__
documentation. First, try the following. In this example we fix a failed
certificate request for *https://chat.stackspin.example.org*. We start by
checking if ``cert-manager`` is set up correctly.

Is your cluster using the live ACME server?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: console

   $ kubectl get clusterissuers -o yaml | grep 'server:'

This should return ``server: https://acme-v02.api.letsencrypt.org/directory``
and not something with the word *staging* in it.

Are all cert-manager pods in the *cert-manager* namespace in the *READY* state?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: console

   $ kubectl -n cert-manager get pods

Cert-manager uses a "custom resource" to keep track of your certificates, so
you can also check the status of your certificates by running the command
below. It returns the certificates for all applications on your system; the
example output shows healthy certificates:

.. code:: console

   $ kubectl get certificates -A
   NAMESPACE        NAME                           READY   SECRET                         AGE
   stackspin        hydra-public.tls               True    hydra-public.tls               14d
   stackspin        single-sign-on-userpanel.tls   True    single-sign-on-userpanel.tls   14d
   stackspin-apps   stackspin-nextcloud-files      True    stackspin-nextcloud-files      14d
   stackspin-apps   stackspin-nextcloud-office     True    stackspin-nextcloud-office     14d
   stackspin        grafana-tls                    True    grafana-tls                    13d
   stackspin        alertmanager-tls               True    alertmanager-tls               13d
   stackspin        prometheus-tls                 True    prometheus-tls                 13d
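If a certificate stays in *READY: False*, describing it shows its status
conditions and the events cert-manager recorded for it. ``NAMESPACE`` and
``NAME`` are placeholders; substitute the values from the listing above:

.. code:: console

   $ kubectl --namespace NAMESPACE describe certificate NAME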
If there are problems, you can check for the specific
*certificaterequests*:

.. code:: console

   $ kubectl get certificaterequests -A

For even more information, inspect the logs of the *cert-manager* pod:

.. code:: console

   $ kubectl -n cert-manager logs -l "app.kubernetes.io/name=cert-manager"

You can ``grep`` for your cluster domain or for any specific subdomain to
narrow down results.

Example
~~~~~~~

Query for failed certificates, certificate requests, challenges or orders:

.. code:: console

   $ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
   stackspin-apps   certificate.cert-manager.io/stackspin-zulip                                      False     stackspin-zulip              15h
   stackspin-apps   certificaterequest.cert-manager.io/stackspin-zulip-2045852889                    False                                  15h
   stackspin-apps   challenge.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563-837515681   pending   chat.stackspin.example.org   15h
   stackspin-apps   order.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563                 pending                                15h

We see that the Zulip certificate resources have been in a bad state for 15
hours.

Show the certificate resource status message:

.. code:: console

   $ kubectl -n stackspin-apps get certificate stackspin-zulip -o jsonpath="{.status.conditions[*]['message']}"
   Waiting for CertificateRequest "stackspin-zulip-2045852889" to complete

We see that the `certificate` is waiting for the `certificaterequest`, so
let's query its status message:

.. code:: console

   $ kubectl -n stackspin-apps get certificaterequest stackspin-zulip-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
   Waiting on certificate issuance from order stackspin-apps/stackspin-zulip-2045852889-1775447563: "pending"

Show the related order resource and look at the status and events:

.. code:: console

   $ kubectl -n stackspin-apps describe order stackspin-zulip-2045852889-1775447563

Show the failed challenge resource reason:

.. code:: console

   $ kubectl -n stackspin-apps get challenge stackspin-zulip-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
   Waiting for http-01 challenge propagation: wrong status code '503', expected '200'

In this example, deleting the challenge fixed the issue and a proper
certificate could be fetched:

.. code:: console

   $ kubectl -n stackspin-apps delete challenges.acme.cert-manager.io stackspin-zulip-2045852889-1775447563-837515681
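After the challenge is deleted, cert-manager creates a new one and retries
the validation. You can follow the retry with ``kubectl``'s watch flag until
*READY* turns *True* (shown here for the Zulip certificate from this
example):

.. code:: console

   $ kubectl -n stackspin-apps get certificate stackspin-zulip --watch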
Common installation failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

var substitution failed
'''''''''''''''''''''''

When you execute ``flux get kustomization`` and you see this error:

.. code:: console

   $ flux get kustomization
   var substitution failed for 'kube-prometheus-stack': YAMLToJSON: yaml: line 32: found character that cannot start any token

That can mean that one of your values contains a double quote (``"``) or
that you quoted a value in `.flux.env` during the :ref:`flux_config`. Make
sure that `.flux.env` does not contain any values that are quoted. If you
need to change `.flux.env`, apply your changes to the cluster by running:

.. code:: console

   $ kubectl apply -k $CLUSTER_DIR

Afterwards, you can speed up the process that fixes your *kustomization* by
running the following (replace *kube-prometheus-stack* with the
*kustomization* mentioned in the error message):

.. code:: console

   $ flux reconcile kustomization kube-prometheus-stack

Purge Stackspin and install from scratch
----------------------------------------

.. warning::

   You will lose all your data! This completely destroys Stackspin and takes
   everything offline. If you choose to do this, you will need to re-install
   Stackspin and make sure that your data is stored somewhere other than the
   VPS that runs Stackspin.

If things ever fail beyond possible recovery, here is how to completely
purge a Stackspin installation to start from scratch:

.. code:: console

   cluster$ /usr/local/bin/k3s-killall.sh
   cluster$ systemctl disable k3s
   cluster$ rm -rf /var/lib/{rancher,Stackspin,kubelet} /etc/rancher /var/log/{Stackspin,containers,pods} /tmp/k3s /etc/systemd/system/k3s.service
   cluster$ systemctl reboot
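After the reboot, you can do a quick sanity check that nothing of k3s is
left before re-installing. This check is not part of the official procedure;
both commands should report that the service and directory no longer exist:

.. code:: console

   cluster$ systemctl status k3s
   cluster$ ls /var/lib/rancher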