Monitoring
To monitor your Stackspin cluster, we include the kube-prometheus-stack Helm chart, which bundles Grafana, Prometheus and Alertmanager along with pre-configured Prometheus alerts and Grafana dashboards.
Grafana
Grafana can be accessed by clicking on the Monitoring icon in the Utilities
section of the dashboard. Use Stackspin single sign-on to log in.
Dashboards
Browse through the pre-configured dashboards to explore metrics of your Stackspin cluster. Describing every dashboard would go beyond the scope of this page, so reach out to us if you don't find what you are looking for.
Prometheus
Prometheus can be reached by adding prometheus. in front of your cluster
domain, e.g. https://prometheus.stackspin.example.org. Until we configure single
sign-on for Prometheus, you need to log in using basic auth.
The user name is admin; the password can be retrieved by running:
python -m stackspin CLUSTERNAME secrets | grep prometheus-basic-auth
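For scripted checks, the same basic-auth credentials can also be used against the Prometheus HTTP API. The following is a minimal sketch and not part of Stackspin: the helper name prom_query and its arguments are hypothetical, and it assumes that the standard /api/v1/query endpoint is reachable on the same host and that curl is installed.

```shell
# Hypothetical helper (not part of Stackspin): run an instant query against
# the Prometheus HTTP API using the basic-auth credentials described above.
# Usage: prom_query CLUSTER_DOMAIN PASSWORD PROMQL_EXPRESSION
prom_query() {
  domain="$1"
  password="$2"
  query="$3"
  # --get sends the URL-encoded query as a GET parameter;
  # --fail makes curl exit non-zero on HTTP errors such as 401.
  curl --silent --fail --get \
    --user "admin:${password}" \
    --data-urlencode "query=${query}" \
    "https://prometheus.${domain}/api/v1/query"
}
```

For example, prom_query stackspin.example.org "$PASSWORD" up should return a JSON document containing the up metric for each scrape target.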
Alertmanager
Alertmanager can be reached by adding alertmanager. in front of your cluster
domain, e.g. https://alertmanager.stackspin.example.org. Until we configure single
sign-on for Alertmanager, you need to log in using basic auth.
The user name is admin; the password can be retrieved by running:
python -m stackspin CLUSTERNAME secrets | grep alertmanager-basic-auth
Occasionally it can be convenient to view firing alerts from the command line instead of going through the Alertmanager web interface. The following command does that (requires kubectl access to your cluster):
kubectl exec -it -n stackspin alertmanager-kube-prometheus-stack-alertmanager-0 -- amtool alert query --alertmanager.url=http://localhost:9093
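When only one kind of alert is of interest, amtool accepts label matchers after the query subcommand. The wrapper below is a sketch under assumptions: the function name firing_alerts is hypothetical, and it reuses the pod name and namespace from the command above, which may differ on your cluster. It omits the -it flags so that it also works from scripts without a terminal.

```shell
# Hypothetical wrapper: query firing alerts via amtool inside the
# Alertmanager pod, passing any extra arguments (e.g. label matchers) through.
firing_alerts() {
  kubectl exec -n stackspin \
    alertmanager-kube-prometheus-stack-alertmanager-0 -- \
    amtool alert query --alertmanager.url=http://localhost:9093 "$@"
}

# Example (requires kubectl access to your cluster):
#   firing_alerts alertname=KubeJobCompletion
```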
Email alerts
From time to time you might get email alerts sent by Alertmanager to the email
address you have set in the cluster configuration.
Common alerts include (listed by the alertname referenced in the email
body):
KubeJobCompletion: A job did not complete successfully. This often happens during the initial setup phase. If the alert persists, list all jobs in the stackspin-apps namespace with
kubectl -n stackspin-apps get jobs
and delete the failed job to silence the alert, e.g.
kubectl -n stackspin-apps delete job nc-nextcloud-cron-27444460
ReconciliationFailure: A Flux helmRelease could not be reconciled successfully. This also happens often during the initial setup phase, although it can have different root causes. To view the current state of all helmReleases in the stackspin-apps namespace, run
flux -n stackspin-apps get helmreleases
If the helmRelease in question is stuck in an "install retries exhausted" or "upgrade retries exhausted" state, you can force a reconciliation with
flux -n stackspin-apps suspend helmrelease zulip
flux -n stackspin-apps resume helmrelease zulip
Depending on the underlying cause, this may or may not fix the helmRelease state. For more information on this issue, see "helmrelease upgrade retries exhausted regression".
KubeClientCertificateExpiration: Kubernetes is being used with a certificate that will expire soon (in less than a week). Unfortunately there is currently no easy way to determine which client that is. Most of the time this is a Kubernetes-internal certificate, and it can be resolved by restarting k3s by running
systemctl restart k3s
on the node in question. It could also be any pod that is using the automatically injected certificate; in that case restarting the pod will help.
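The suspend and resume steps for a stuck helmRelease, described under ReconciliationFailure, can be wrapped in a small shell function. This is a sketch only: the function name force_reconcile is hypothetical, and it assumes the flux CLI is installed and has access to your cluster.

```shell
# Hypothetical helper: force Flux to retry a helmRelease by suspending
# and then resuming it. The resume only runs if the suspend succeeded.
# Usage: force_reconcile NAMESPACE HELMRELEASE
force_reconcile() {
  namespace="$1"
  release="$2"
  flux -n "$namespace" suspend helmrelease "$release" &&
    flux -n "$namespace" resume helmrelease "$release"
}

# Example, matching the commands above:
#   force_reconcile stackspin-apps zulip
```

As with the manual commands, whether this clears the failed state depends on the underlying cause.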