Troubleshooting
===============

If you run into problems, there are a few things you can do to research the
problem. This document describes what you can do.

.. note::

   **cluster$** indicates that the commands should be run as root on your
   Stackspin machine. All other commands need to be run on your *provisioning
   machine*.

**We would love to hear from you!** If you have problems, please create an
issue in our `issue tracker `__ or reach out as described on our `contact
page `__. We want to be in communication with our users, and we want to help
you if you run into problems.

Known issues
------------

If you run into a problem, please check our `issue tracker `__ to see if
others have run into the same problem. We might have suggested a workaround
or temporary solution in one of our issues. If your problem is not described
in an issue yet, please open a new one so we can solve the problems you
encounter.

SSH access
----------

You can log in to your VPS with SSH. Some programs that are available to the
root user on the VPS:

* ``kubectl``, the Kubernetes control program. The root user is connected to
  the cluster automatically.
* ``helm`` is the "Kubernetes package manager". Use for example ``helm ls
  --all-namespaces`` to see what apps are installed in your cluster. You can
  also use it to perform manual upgrades; see ``helm --help``.
* ``flux`` is the `flux`_ command line tool.

.. _flux: https://fluxcd.io

Using kubectl to debug your cluster
-----------------------------------

You can use ``kubectl``, the Kubernetes control program, to inspect and
manipulate your Kubernetes cluster. Once you have installed ``kubectl``, use
the Stackspin CLI to get access to your cluster:

.. code:: console

   $ python -m stackspin stackspin.example.org info

Look for these lines in the output:

.. code:: sh

   # To use kubectl with this cluster, copy-paste this in your terminal:
   export KUBECONFIG=/home/you/projects/stackspin/clusters/stackspin.example.org/kube_config_cluster.yml

Copy the whole ``export`` line into your terminal. **In the same terminal
window**, ``kubectl`` will from now on connect to your cluster.

Alternatively, use SSH to log into your machine, and ``kubectl`` will be
available there.

Application installation or upgrade failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Application installations and upgrades are managed by `flux`_. Flux uses
`helm-controller`_ to install and upgrade applications with `helm charts
`__.

An application installed with Flux consists of a *kustomization*. This
resource defines where the information about the application is stored in
our Git repository. The *kustomization* contains a *helmrelease*, which is
an object that represents an installation of a Helm chart. Read more about
the difference between *kustomizations* and *helmreleases* in the `flux
documentation `__. Be aware that there is a difference between `Flux
Kustomization objects `__ and `Kubernetes kustomizations `__. In this
section we refer to the Flux kustomizations.

To find out if all *kustomizations* have been applied correctly, run the
following flux command in your cluster or from the provisioning machine:

.. code:: console

   cluster$ flux get kustomizations --all-namespaces

If all your *kustomizations* are in a *Ready* state, take a look at your
*helmreleases*:

.. code:: console

   cluster$ flux get helmreleases -A

If there is an issue, use ``kubectl`` to inspect the respective service, for
example ``nginx``:

.. code:: console

   $ kubectl describe helmrelease -n stackspin nginx
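If the ``describe`` output does not make the cause clear, the recent events
in the application's namespace often do. This is a generic ``kubectl``
technique, not specific to Stackspin; sorting by timestamp lists the newest
events last:

.. code:: console

   $ kubectl --namespace stackspin get events --sort-by=.lastTimestamp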
If the error message mentions a problem in a ``HelmChart`` or
``GitRepository``, you can get information about those objects in a similar
fashion.

For git repositories:

.. code:: console

   $ flux get source git

For helm repositories:

.. code:: console

   $ flux get source helm

For helm charts:

.. code:: console

   $ flux get source chart

For more information, use ``flux --help``, or ``flux get --help``.

HelmReleases that have no problems with their sources, but still fail, can
often be fixed by simply suspending and resuming them. Use these ``flux``
commands:

.. code:: console

   $ flux --namespace NAMESPACE suspend helmrelease NAME
   $ flux --namespace NAMESPACE resume helmrelease NAME

If your HelmRelease is outdated, you can often resolve complications by
telling Flux to *reconcile* it. This tells Flux to compare the HelmRelease's
current state with the desired state:

.. code:: console

   cluster$ flux reconcile helmrelease nextcloud

Viewing upgrade history
'''''''''''''''''''''''

To see when an application was updated to which version, you can use the
``helm history`` command. For example, to see the update history for Zulip,
you can run:

.. code:: console

   $ helm history -n stackspin-apps zulip

Debugging on a lower level
~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also debug the ``pods`` that run applications. To get an overview of
all pods, run:

.. code:: console

   $ kubectl get pods --all-namespaces

This will show you all pods. Check for failing pods by looking at the
``READY`` column. If you find failing pods, you can access their logs with:

.. code:: console

   $ kubectl --namespace NAMESPACE logs POD

You can also enter the pod's shell by running:

.. code:: console

   $ kubectl --namespace NAMESPACE exec POD -it -- /bin/sh

.. _helm-controller: https://fluxcd.io/docs/components/helm/

HTTPS certificates
------------------

Stackspin uses `cert-manager `__ to automatically fetch `Let's Encrypt `__
certificates for all deployed services. If you experience invalid SSL
certificates, e.g. your browser warns you when visiting Zulip
(https://zulip.stackspin.example.org), a useful resource for troubleshooting
is the official cert-manager `Troubleshooting Issuing ACME Certificates `__
documentation. First, try the following. In this example we fix a failed
certificate request for *https://chat.stackspin.example.org*. We start by
checking if ``cert-manager`` is set up correctly.

Is your cluster using the live ACME server?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: console

   $ kubectl get clusterissuers -o yaml | grep 'server:'

This should return ``server: https://acme-v02.api.letsencrypt.org/directory``
and not something with the word *staging* in it.

Are all cert-manager pods in the *cert-manager* namespace in the *READY* state?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code:: console

   $ kubectl -n cert-manager get pods

Cert-manager uses a "custom resource" to keep track of your certificates, so
you can also check the status of your certificates by running the command
below. It returns the certificates for all applications on your system; the
example output shows healthy certificates:

.. code:: console

   $ kubectl get certificates -A
   NAMESPACE        NAME                           READY   SECRET                         AGE
   stackspin        hydra-public.tls               True    hydra-public.tls               14d
   stackspin        single-sign-on-userpanel.tls   True    single-sign-on-userpanel.tls   14d
   stackspin-apps   stackspin-nextcloud-files      True    stackspin-nextcloud-files      14d
   stackspin-apps   stackspin-nextcloud-office     True    stackspin-nextcloud-office     14d
   stackspin        grafana-tls                    True    grafana-tls                    13d
   stackspin        alertmanager-tls               True    alertmanager-tls               13d
   stackspin        prometheus-tls                 True    prometheus-tls                 13d
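If a certificate stays in *READY: False*, describing it shows its status
conditions and the events cert-manager recorded for it. ``NAMESPACE`` and
``NAME`` are placeholders; substitute the values from the listing above:

.. code:: console

   $ kubectl --namespace NAMESPACE describe certificate NAME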
If there are problems, you can check for the specific
*certificaterequests*:

.. code:: console

   $ kubectl get certificaterequests -A

For even more information, inspect the logs of the *cert-manager* pod:

.. code:: console

   $ kubectl -n cert-manager logs -l "app.kubernetes.io/name=cert-manager"

You can ``grep`` for your cluster domain or for any specific subdomain to
narrow down results.

Example
~~~~~~~

Query for failed certificates, certificate requests, challenges or orders:

.. code:: console

   $ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
   stackspin-apps   certificate.cert-manager.io/stackspin-zulip                                      False     stackspin-zulip              15h
   stackspin-apps   certificaterequest.cert-manager.io/stackspin-zulip-2045852889                    False                                  15h
   stackspin-apps   challenge.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563-837515681   pending   chat.stackspin.example.org   15h
   stackspin-apps   order.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563                 pending                                15h

We see that the Zulip certificate resources have been in a bad state for 15
hours.

Show the certificate resource status message:

.. code:: console

   $ kubectl -n stackspin-apps get certificate stackspin-zulip -o jsonpath="{.status.conditions[*]['message']}"
   Waiting for CertificateRequest "stackspin-zulip-2045852889" to complete

We see that the `certificate` is waiting for the `certificaterequest`, so
let's query its status message:

.. code:: console

   $ kubectl -n stackspin-apps get certificaterequest stackspin-zulip-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
   Waiting on certificate issuance from order stackspin-apps/stackspin-zulip-2045852889-1775447563: "pending"

Show the related order resource and look at the status and events:

.. code:: console

   $ kubectl -n stackspin-apps describe order stackspin-zulip-2045852889-1775447563

Show the failed challenge resource reason:

.. code:: console

   $ kubectl -n stackspin-apps get challenge stackspin-zulip-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
   Waiting for http-01 challenge propagation: wrong status code '503', expected '200'

In this example, deleting the challenge fixed the issue and a proper
certificate could be fetched:

.. code:: console

   $ kubectl -n stackspin-apps delete challenges.acme.cert-manager.io stackspin-zulip-2045852889-1775447563-837515681
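After the challenge is deleted, cert-manager creates a new one and retries
the validation. You can follow the retry with ``kubectl``'s watch flag until
*READY* turns *True* (shown here for the Zulip certificate from this
example):

.. code:: console

   $ kubectl -n stackspin-apps get certificate stackspin-zulip --watch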
Common installation failures
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

var substitution failed
'''''''''''''''''''''''

When you execute ``flux get kustomization`` and you see this error:

.. code:: console

   $ flux get kustomization
   var substitution failed for 'kube-prometheus-stack': YAMLToJSON: yaml: line 32: found character that cannot start any token

That can mean that one of your values contains a double quote (``"``) or
that you quoted a value in `.flux.env` during the :ref:`flux_config`. Make
sure that `.flux.env` does not contain any values that are quoted. If you
need to change `.flux.env`, apply your changes to the cluster by running:

.. code:: console

   $ kubectl apply -k $CLUSTER_DIR

Afterwards, you can speed up the process that fixes your *kustomization* by
running the following (replace *kube-prometheus-stack* with the
*kustomization* mentioned in the error message):

.. code:: console

   $ flux reconcile kustomization kube-prometheus-stack

Purge Stackspin and install from scratch
----------------------------------------

.. warning::

   You will lose all your data! This completely destroys Stackspin and takes
   everything offline. If you choose to do this, you will need to re-install
   Stackspin and make sure that your data is stored somewhere other than the
   VPS that runs Stackspin.

If things ever fail beyond possible recovery, here is how to completely
purge a Stackspin installation to start from scratch:

.. code:: console

   cluster$ /usr/local/bin/k3s-killall.sh
   cluster$ systemctl disable k3s
   cluster$ rm -rf /var/lib/{rancher,Stackspin,kubelet} /etc/rancher /var/log/{Stackspin,containers,pods} /tmp/k3s /etc/systemd/system/k3s.service
   cluster$ systemctl reboot
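After the reboot, you can do a quick sanity check that nothing of k3s is
left before re-installing. This check is not part of the official procedure;
both commands should report that the service and directory no longer exist:

.. code:: console

   cluster$ systemctl status k3s
   cluster$ ls /var/lib/rancher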