Troubleshooting
If you run into problems, there are a few things you can do to investigate them. This document describes your options.
Note
cluster$ indicates that the commands should be run as root on your Stackspin machine. All other commands need to be run on your provisioning machine.
We would love to hear from you! If you have problems, please create an issue in our issue tracker or reach out as described on our contact page. We want to be in communication with our users, and we want to help you if you run into problems.
Known issues
If you run into a problem, please check our issue tracker to see if others have run into the same problem. We might have suggested a workaround or temporary solution in one of our issues. If your problem is not described in an issue, please open a new one so we can solve the problems you encounter.
SSH access
You can log in to your VPS over SSH. Some programs that are available to the root user on the VPS:
- kubectl, the Kubernetes control program. The root user is connected to the cluster automatically.
- helm, the “Kubernetes package manager”. Use e.g. helm ls --all-namespaces to see which apps are installed in your cluster. You can also use it to perform manual upgrades; see helm --help.
- flux is the Flux command line tool; see the quick check below.
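As a quick sanity check that these tools work, you could run a few read-only commands (a sketch; flux check verifies the Flux components):
cluster$ kubectl get nodes
cluster$ helm ls --all-namespaces
cluster$ flux check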
Using kubectl to debug your cluster
You can use kubectl, the Kubernetes control program, to inspect and manipulate your Kubernetes cluster. Once you have installed kubectl, get access to your cluster with the Stackspin CLI:
$ python -m stackspin stackspin.example.org info
Look for these lines in the output:
# To use kubectl with this cluster, copy-paste this in your terminal:
export KUBECONFIG=/home/you/projects/stackspin/clusters/stackspin.example.org/kube_config_cluster.yml
Copy the whole export line into your terminal. In the same terminal window, kubectl will from now on connect to your cluster. Alternatively, use SSH to log into your machine; kubectl will be available there.
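To verify that kubectl now talks to your cluster, you can for example list its nodes:
$ kubectl get nodes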
Application installation or upgrade failures
Application installations and upgrades are managed by Flux, which uses helm-controller to install and upgrade applications with helm charts.
An application installed with Flux consists of a kustomization. This resource defines where the information about the application is stored in our Git repository. The kustomization contains a helmrelease, which is an object that represents an installation of a Helm chart. Read more about the difference between kustomizations and helmreleases in the flux documentation.
Be aware that there is a difference between Flux Kustomization objects and Kubernetes kustomizations. In this section we refer to the Flux kustomizations.
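If you are unsure which kind a command refers to, you can address the Flux kustomizations unambiguously by their full resource name (a sketch; it lists the same objects as the flux command below):
$ kubectl get kustomizations.kustomize.toolkit.fluxcd.io --all-namespaces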
To find out if all kustomizations have been applied correctly, run the following flux command in your cluster or from the provisioning machine:
cluster$ flux get kustomizations --all-namespaces
If all your kustomizations are in a Ready state, take a look at your helmreleases:
cluster$ flux get helmreleases -A
If there is an issue, use kubectl to inspect the respective service, for example nginx:
$ kubectl describe helmrelease -n stackspin nginx
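If you only want the failure message, a jsonpath query can be a useful sketch (the same pattern is used for certificates later in this document):
$ kubectl -n stackspin get helmrelease nginx -o jsonpath="{.status.conditions[*]['message']}"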
If the error message mentions a problem in a HelmChart or GitRepository, you can get information about those objects in a similar fashion:
For git repositories:
$ flux get source git
For helm repositories:
$ flux get source helm
For helm charts:
$ flux get source chart
For more information, use flux --help or flux get --help.
HelmReleases that have no problems with their sources, but still fail, can often be fixed by simply suspending and resuming them. Use these flux commands:
$ flux --namespace NAMESPACE suspend helmrelease NAME
$ flux --namespace NAMESPACE resume helmrelease NAME
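For example, for the nginx HelmRelease in the stackspin namespace mentioned above:
$ flux --namespace stackspin suspend helmrelease nginx
$ flux --namespace stackspin resume helmrelease nginx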
If your HelmRelease is outdated, you can often resolve complications by telling Flux to reconcile it. This tells Flux to compare the HelmRelease’s current state with the desired state.
cluster$ flux reconcile helmrelease nextcloud --namespace stackspin-apps
Viewing upgrade history
To see when an application was updated to which version, you can use the helm history command. For example, to see the update history for Zulip, run:
$ helm history -n stackspin-apps zulip
Upgrade failed: another operation (install/upgrade/rollback) is in progress
In rare cases, helm upgrades may fail with this status on the HelmRelease:
upgrade failed: another operation (install/upgrade/rollback) is in progress
This appears to be a known flux issue. The workaround is described in the issue comments. It amounts to this (taking metallb as an example; you may need to replace the namespace and name of the helm release): first inspect the history of the failing helm release:
$ helm history -n kube-system metallb
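Hypothetical output (revision numbers, dates and chart versions are made up for illustration):
REVISION  UPDATED                   STATUS           CHART            APP VERSION  DESCRIPTION
5         Wed Mar  1 09:12:43 2023  superseded       metallb-0.13.7   v0.13.7      Upgrade complete
6         Wed Apr  5 09:14:02 2023  deployed         metallb-0.13.9   v0.13.9      Upgrade complete
7         Wed May  3 09:13:21 2023  pending-upgrade  metallb-0.13.10  v0.13.10     Preparing upgrade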
You’ll see the failing upgrade at the end of the list – if not, you’re facing a different problem. Now from this history list, copy down the numerical ID (REVISION) of the last successful deploy of this release, i.e., the one just before the failing upgrade. Suppose that’s 6, then do
$ helm rollback -n kube-system metallb 6
If that finishes successfully, your application is back in a healthy state, though at the previous version, before the failed upgrade. To continue and retry the upgrade, now do
$ flux reconcile hr -n kube-system metallb
If that finishes without errors, the upgrade was successful. To finish off, you may want to make the kustomization controller aware of this success so that it immediately continues any other upgrades that were waiting for this failed one:
$ flux reconcile ks metallb
Debugging on a lower level
You can also debug the pods that run applications. To get an overview of all pods, run:
$ kubectl get pods --all-namespaces
Check for failing pods by looking at the READY column.
If you find failing pods, you can access their logs with:
$ kubectl --namespace NAMESPACE logs POD
You can also enter the pod’s shell by running:
$ kubectl --namespace NAMESPACE exec POD -it -- /bin/sh
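If a pod keeps restarting, the current container’s logs may be empty. In that case you can inspect the previous container’s logs and the pod’s events (same NAMESPACE and POD placeholders as above):
$ kubectl --namespace NAMESPACE logs POD --previous
$ kubectl --namespace NAMESPACE describe pod POD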
HTTPS certificates
Stackspin uses cert-manager to automatically fetch Let’s Encrypt certificates for all deployed services. If you experience invalid SSL certificates, i.e. your browser warns you when visiting Zulip (https://zulip.stackspin.example.org), a useful resource for troubleshooting is the official cert-manager Troubleshooting Issuing ACME Certificates documentation. First, try this:
In this example we fix a failed certificate request for https://chat.stackspin.example.org.
We will start by checking if cert-manager is set up correctly.
Is your cluster using the live ACME server?
$ kubectl get clusterissuers -o yaml | grep 'server:'
Should return server: https://acme-v02.api.letsencrypt.org/directory and not something with the word staging in it.
Are all cert-manager pods in the cert-manager namespace in the READY state?
$ kubectl -n cert-manager get pods
Cert-manager uses a “custom resource” to keep track of your certificates, so you can also check the status of your certificates by running the following command. It returns the certificates for all applications on your system; the example output shows healthy certificates.
$ kubectl get certificates -A
NAMESPACE NAME READY SECRET AGE
stackspin hydra-public.tls True hydra-public.tls 14d
stackspin single-sign-on-userpanel.tls True single-sign-on-userpanel.tls 14d
stackspin-apps stackspin-nextcloud-files True stackspin-nextcloud-files 14d
stackspin-apps stackspin-nextcloud-office True stackspin-nextcloud-office 14d
stackspin grafana-tls True grafana-tls 13d
stackspin alertmanager-tls True alertmanager-tls 13d
stackspin prometheus-tls True prometheus-tls 13d
If there are problems, you can check for the specific certificaterequests:
$ kubectl get certificaterequests -A
For even more information, inspect the logs of the cert-manager pod:
$ kubectl -n stackspin logs -l "app.kubernetes.io/name=cert-manager"
You can grep for your cluster domain or for any specific subdomain to narrow down the results.
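For example, to narrow the logs down to the failing domain from the example that follows:
$ kubectl -n stackspin logs -l "app.kubernetes.io/name=cert-manager" | grep 'chat.stackspin.example.org'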
Example
Query for failed certificates, certificaterequests, challenges or orders:
$ kubectl get --all-namespaces certificate,certificaterequest,challenge,order | grep -iE '(false|pending)'
stackspin-apps certificate.cert-manager.io/stackspin-zulip False stackspin-zulip 15h
stackspin-apps certificaterequest.cert-manager.io/stackspin-zulip-2045852889 False 15h
stackspin-apps challenge.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563-837515681 pending chat.stackspin.example.org 15h
stackspin-apps order.acme.cert-manager.io/stackspin-zulip-2045852889-1775447563 pending 15h
We see that the zulip certificate resources have been in a bad state for 15 hours.
Show certificate resource status message:
$ kubectl -n stackspin-apps get certificate stackspin-zulip -o jsonpath="{.status.conditions[*]['message']}"
Waiting for CertificateRequest "stackspin-zulip-2045852889" to complete
We see that the certificate is waiting for the certificaterequest, so let’s query its status message:
$ kubectl -n stackspin-apps get certificaterequest stackspin-zulip-2045852889 -o jsonpath="{.status.conditions[*]['message']}"
Waiting on certificate issuance from order stackspin-apps/stackspin-zulip-2045852889-1775447563: "pending"
Show the related order resource and look at the status and events:
$ kubectl -n stackspin-apps describe order stackspin-zulip-2045852889-1775447563
Show the failed challenge resource reason:
$ kubectl -n stackspin-apps get challenge stackspin-zulip-2045852889-1775447563-837515681 -o jsonpath='{.status.reason}'
Waiting for http-01 challenge propagation: wrong status code '503', expected '200'
In this example, deleting the challenge fixed the issue and a proper certificate could be fetched:
$ kubectl -n stackspin-apps delete challenges.acme.cert-manager.io stackspin-zulip-2045852889-1775447563-837515681
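Afterwards, you can watch the certificate to verify that it gets issued, i.e. that its READY column turns True:
$ kubectl -n stackspin-apps get certificate stackspin-zulip --watch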
Common installation failures
var substitution failed
When you execute flux get kustomization and you see this error:
$ flux get kustomization
var substitution failed for 'kube-prometheus-stack': YAMLToJSON: yaml: line 32: found character that cannot start any token
That can mean that one of your values contains a double quote ("), or that you quoted a value in .flux.env during Step 1: Flux configuration. Make sure that .flux.env does not contain any quoted values.
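For illustration, a hypothetical .flux.env entry (the variable name is made up; check your own file for quoted values):
# Wrong: the value is quoted
ip_address="1.2.3.4"
# Right: no quotes
ip_address=1.2.3.4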
If you need to change .flux.env, run the following command:
$ kubectl apply -k $CLUSTER_DIR
Afterwards, you can speed up the process that fixes your kustomization by running the following (replace kube-prometheus-stack with the kustomization mentioned in the error message):
$ flux reconcile kustomization kube-prometheus-stack
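If you want to follow the reconciliation as it happens, flux get accepts a --watch flag:
$ flux get kustomizations --watch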
Purge Stackspin and install from scratch
Warning
- You will lose all your data!
This completely destroys Stackspin and takes everything offline. If you choose to do this, you will need to re-install Stackspin afterwards, so make sure beforehand that your data is stored somewhere other than the VPS that runs Stackspin.
If things ever fail beyond possible recovery, here is how to completely purge a Stackspin installation to start from scratch:
cluster$ /usr/local/bin/k3s-killall.sh
cluster$ systemctl disable k3s
cluster$ rm -rf /var/lib/{rancher,Stackspin,kubelet} /etc/rancher /var/log/{Stackspin,containers,pods} /tmp/k3s /etc/systemd/system/k3s.service
cluster$ systemctl reboot