Troubleshooting

If you run into problems, there are a few things you can do to investigate them. This document describes your options.

Note

cluster$ indicates that the commands should be run as root on your Stackspin machine. All other commands need to be run on your provisioning machine.

We would love to hear from you! If you have problems, please create an issue in our issue tracker or reach out as described on our contact page. We want to be in communication with our users, and we want to help you if you run into problems.

Known issues

If you run into a problem, please check our issue tracker to see if others have run into the same problem. We might have suggested a workaround or temporary solution in one of our issues. If your problem is not described in an issue, please open a new one so we can solve the problems you encounter.

SSH access

You can log into your VPS over SSH. These programs are available to the root user on the VPS:

  • kubectl, the Kubernetes control program. The root user is connected to the cluster automatically.

  • helm is the “Kubernetes package manager”. Use e.g. helm ls --all-namespaces to see what apps are installed in your cluster. You can also use it to perform manual upgrades; see helm --help.

  • flux is the Flux command line tool.
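
For example, a quick health check after logging in could combine these tools (the exact output depends on your cluster):

cluster$ kubectl get nodes
cluster$ helm ls --all-namespaces
cluster$ flux check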

Using kubectl to debug your cluster

You can use kubectl, the Kubernetes control program, to inspect and manipulate your Kubernetes cluster. Once you have installed kubectl, use the Stackspin CLI to get access to your cluster:

$ python -m stackspin stackspin.example.org info

Look for these lines in the output:

# To use kubectl with this cluster, copy-paste this in your terminal:
export KUBECONFIG=/home/you/projects/stackspin/clusters/stackspin.example.org/kube_config_cluster.yml

Copy the whole export line into your terminal. From then on, kubectl in that terminal window will connect to your cluster.

Alternatively, use SSH to log into your machine, and kubectl will be available there.
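
To verify that kubectl can reach your cluster, you can list its nodes; your VPS should show up with STATUS Ready:

$ kubectl get nodes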

Application installation or upgrade failures

Application installations and upgrades are managed by flux. Flux uses helm-controller to install and upgrade applications with helm charts.

An application installed with Flux consists of a kustomization. This resource defines where the information about the application is stored in our Git repository. The kustomization contains a helmrelease, which is an object that represents an installation of a Helm chart. Read more about the difference between kustomizations and helmreleases in the flux documentation.

Be aware that there is a difference between Flux Kustomization objects and Kubernetes kustomizations. In this section we refer to the Flux kustomizations.
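
One way to see the distinction: Flux kustomizations are custom resources inside the cluster, so you can list them explicitly by their full resource name:

cluster$ kubectl get kustomizations.kustomize.toolkit.fluxcd.io --all-namespaces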

To find out if all kustomizations have been applied correctly, run the following flux command in your cluster or from the provisioning machine:

cluster$ flux get kustomizations --all-namespaces

If all your kustomizations are in a Ready state, take a look at your helmreleases:

cluster$ flux get helmreleases -A

If there is an issue, use kubectl to inspect the respective helmrelease, for example nginx:

$ kubectl describe helmrelease -n stackspin nginx

If the error message mentions a problem in a HelmChart or GitRepository, you can get information about those objects in a similar fashion:

For git repositories:

$ flux get sources git

For helm repositories:

$ flux get sources helm

For helm charts:

$ flux get sources chart

For more information, use flux --help, or flux get --help.

HelmReleases that have no problems with their sources, but still fail, can often be fixed by simply suspending and resuming them. Use these flux commands:

$ flux --namespace NAMESPACE suspend helmrelease NAME
$ flux --namespace NAMESPACE resume helmrelease NAME

If your HelmRelease is outdated, you can often resolve complications by asking Flux to reconcile it, which makes Flux compare the HelmRelease’s current state with the desired state.

cluster$ flux reconcile helmrelease nextcloud
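
If the HelmRelease lives in another namespace, pass --namespace; for example, assuming nextcloud is deployed in the stackspin-apps namespace:

cluster$ flux reconcile helmrelease --namespace stackspin-apps nextcloud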

Viewing upgrade history

To see when an application was updated to which version, you can use the helm history command. For example, to see the update history for Zulip, you can run:

$ helm history -n stackspin-apps zulip

Upgrade failed: another operation (install/upgrade/rollback) is in progress

In rare cases, helm upgrades may fail with this status on the HelmRelease:

upgrade failed: another operation (install/upgrade/rollback) is in progress

This appears to be a known flux issue. The workaround is described in the issue comments. It amounts to this (taking metallb as an example; you may need to replace the namespace and name of the helm release). First, inspect the history of the failing helm release:

$ helm history -n kube-system metallb

You’ll see the failing upgrade at the end of the list; if not, you’re facing a different problem. From this history list, copy down the numerical ID of the last successful deploy of this release, i.e., the one before the failing upgrade. Suppose that’s 6; then run:

$ helm rollback -n kube-system metallb 6

If that finishes successfully, your application is back in a healthy state, though at the version from before the failed upgrade. To retry the upgrade, now run:

$ flux reconcile hr -n kube-system metallb

If that finishes without errors, the upgrade was successful. To finish off, you may want to make the kustomization controller aware of this success so it immediately continues any other upgrades that were pending on this failing one:

$ flux reconcile ks metallb
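
To confirm that everything is healthy again, you can re-check the status of the HelmRelease and the kustomization:

$ flux get helmreleases -n kube-system
$ flux get kustomizations metallb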

Debugging on a lower level

You can also debug the pods that run applications. To get an overview of all pods, run:

$ kubectl get pods --all-namespaces

This will show you all pods. Check for failing pods by looking at the READY column. If you find failing pods, you can access their logs with:

$ kubectl --namespace NAMESPACE logs POD

You can also enter the pod’s shell, by running:

$ kubectl --namespace NAMESPACE exec POD -it -- /bin/sh
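
If a pod keeps restarting, its current logs may be empty; the --previous flag shows the logs of the last terminated container, and kubectl describe lists recent events for the pod:

$ kubectl --namespace NAMESPACE logs POD --previous
$ kubectl --namespace NAMESPACE describe pod POD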

HTTPS certificates

Stackspin uses cert-manager to automatically fetch Let’s Encrypt certificates for all deployed services. If you experience invalid SSL certificates, e.g. your browser warns you when visiting Zulip (https://zulip.stackspin.example.org), a useful resource for troubleshooting is the official cert-manager Troubleshooting Issuing ACME Certificates documentation.

In the following example we fix a failed certificate request for https://chat.stackspin.example.org. We start by checking whether cert-manager is set up correctly.

Is your cluster using the live ACME server?

$ kubectl get clusterissuers -o yaml | grep 'server:'

This should return server: https://acme-v02.api.letsencrypt.org/directory and not something with the word staging in it.
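
For more detail on the issuer, including recent error messages, you can also describe it:

$ kubectl describe clusterissuers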

Are all cert-manager pods in the cert-manager namespace in the READY state?

$ kubectl -n cert-manager get pods

Cert-manager uses a “custom resource” to keep track of your certificates, so you can also check the status of your certificates by running the command below. It returns all the certificates for all applications on your system; the output shown is an example of healthy certificates.

$ kubectl get certificates -A
NAMESPACE        NAME                           READY   SECRET                         AGE
stackspin        hydra-public.tls               True    hydra-public.tls               14d
stackspin        single-sign-on-userpanel.tls   True    single-sign-on-userpanel.tls   14d
stackspin-apps   stackspin-nextcloud-files      True    stackspin-nextcloud-files      14d
stackspin-apps   stackspin-nextcloud-office     True    stackspin-nextcloud-office     14d
stackspin        grafana-tls                    True    grafana-tls                    13d
stackspin        alertmanager-tls               True    alertmanager-tls               13d
stackspin        prometheus-tls                 True    prometheus-tls                 13d

If there are problems, you can check for the specific certificaterequests:

$ kubectl get certificaterequests -A

For even more information, inspect the logs of the cert-manager pod:

$ kubectl -n cert-manager logs -l "app.kubernetes.io/name=cert-manager"

You can grep for your cluster domain or for any specific subdomain to narrow down results.
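
For example, to narrow the logs down to the chat subdomain used in the example below:

$ kubectl -n cert-manager logs -l "app.kubernetes.io/name=cert-manager" | grep chat.stackspin.example.org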

Example

Query for failed certificates, certificaterequests, challenges or orders:

$ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
stackspin-apps    certificate.cert-manager.io/stackspin-zulip                 False   stackspin-zulip                 15h
stackspin-apps    certificaterequest.cert-manager.io/stackspin-zulip-2045852889                 False   15h
stackspin-apps    challenge.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563-837515681   pending   chat.stackspin.example.org   15h
stackspin-apps    order.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563                 pending   15h

We see that the zulip certificate resources have been in a bad state for 15 hours.

Show certificate resource status message:

$ kubectl -n stackspin-apps get certificate stackspin-zulip -o jsonpath="{.status.conditions[*]['message']}"
Waiting for CertificateRequest "stackspin-zulip-2045852889" to complete

We see that the certificate is waiting for the certificaterequest; let's query its status message:

$ kubectl -n stackspin-apps get certificaterequest stackspin-zulip-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
Waiting on certificate issuance from order stackspin-apps/stackspin-zulip-2045852889-1775447563: "pending"

Show the related order resource and look at the status and events:

$ kubectl -n stackspin-apps describe order stackspin-zulip-2045852889-1775447563

Show the failed challenge resource reason:

$ kubectl -n stackspin-apps get challenge stackspin-zulip-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
Waiting for http-01 challenge propagation: wrong status code '503', expected '200'

In this example, deleting the challenge fixed the issue, and a proper certificate could be fetched:

$ kubectl -n stackspin-apps delete challenges.acme.cert-manager.io stackspin-zulip-2045852889-1775447563-837515681
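
Cert-manager should then recreate the challenge and retry. You can watch the certificate until its READY column turns True:

$ kubectl -n stackspin-apps get certificate stackspin-zulip --watch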

Common installation failures

var substitution failed

When you execute flux get kustomizations and see this error:

$ flux get kustomizations
var substitution failed for 'kube-prometheus-stack': YAMLToJSON: yaml: line 32: found character that cannot start any token

This can mean that one of your values contains a double quote (") or that you quoted a value in .flux.env during Step 1: Flux configuration. Make sure that .flux.env does not contain any quoted values.
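
For example, assuming a hypothetical variable named ip_address (the variable names in your .flux.env may differ), this is the difference:

# Incorrect: the value is quoted
ip_address="203.0.113.10"
# Correct: no quotes
ip_address=203.0.113.10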

If you need to change .flux.env, apply your changes to the cluster by running:

$ kubectl apply -k $CLUSTER_DIR

Afterwards, you can speed up the process that fixes your kustomization by running the following (replace kube-prometheus-stack with the kustomization mentioned in the error message):

$ flux reconcile kustomization kube-prometheus-stack
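
You can keep an eye on the kustomizations while they reconcile by adding the watch flag:

$ flux get kustomizations --watch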

Purge Stackspin and install from scratch

Warning

You will lose all your data!

This completely destroys Stackspin and takes everything offline. If you choose to do this, you will need to re-install Stackspin afterwards, so make sure beforehand that any data you want to keep is stored somewhere other than the VPS that runs Stackspin.

If things ever fail beyond possible recovery, here is how to completely purge a Stackspin installation to start from scratch:

cluster$ /usr/local/bin/k3s-killall.sh
cluster$ systemctl disable k3s
cluster$ rm -rf /var/lib/{rancher,Stackspin,kubelet} /etc/rancher /var/log/{Stackspin,containers,pods} /tmp/k3s /etc/systemd/system/k3s.service
cluster$ systemctl reboot