Monitoring

To monitor your Stackspin cluster, we include the kube-prometheus-stack Helm chart. It bundles Grafana, Prometheus and Alertmanager, along with pre-configured Prometheus alerts and Grafana dashboards.

Grafana

Grafana can be accessed by clicking the Monitoring icon in the Utilities section of the dashboard. Use Stackspin single sign-on to log in.

Dashboards

Browse the pre-configured dashboards to explore the metrics of your Stackspin cluster. Describing every dashboard here would be too much; reach out to us if you don’t find what you are looking for.

Prometheus

Prometheus can be reached by adding prometheus. in front of your cluster domain, e.g. https://prometheus.stackspin.example.org. Until we configure single sign-on for Prometheus, you need to log in using basic auth. The user name is admin; the password can be retrieved by running:

python -m stackspin CLUSTERNAME secrets | grep prometheus-basic-auth
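
Once you have the password, a quick way to confirm that basic auth works is to query the Prometheus health endpoint from the command line. The following is a minimal sketch, not part of Stackspin itself: the function names and the PROMETHEUS_PASSWORD variable are hypothetical, and it assumes that the /-/healthy endpoint is reachable through the same URL and credentials as the web interface.

```shell
# Build the URL of a monitoring service from its name and the cluster domain.
service_url() {
  printf 'https://%s.%s\n' "$1" "$2"
}

# Query the Prometheus health endpoint using basic auth.
# PROMETHEUS_PASSWORD is a hypothetical variable; set it to the
# password retrieved with the stackspin secrets command.
check_prometheus() {
  curl --fail --silent --show-error \
    --user "admin:${PROMETHEUS_PASSWORD}" \
    "$(service_url prometheus "$1")/-/healthy"
}

# Usage: check_prometheus stackspin.example.org
```

With --fail, curl exits with a non-zero status on an HTTP error such as a rejected login, so the function can also be used in scripts.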

Alertmanager

Alertmanager can be reached by adding alertmanager. in front of your cluster domain, e.g. https://alertmanager.stackspin.example.org. Until we configure single sign-on for Alertmanager, you need to log in using basic auth. The user name is admin; the password can be retrieved by running:

python -m stackspin CLUSTERNAME secrets | grep alertmanager-basic-auth
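
The same credentials can be used to list the current alerts without opening the web interface. This is a sketch under assumptions: the function names and the ALERTMANAGER_PASSWORD variable are hypothetical, and it assumes that the Alertmanager v2 API path /api/v2/alerts is exposed under the same host and basic auth as the web interface.

```shell
# Build the Alertmanager v2 API URL for listing alerts on a cluster domain.
alerts_url() {
  printf 'https://alertmanager.%s/api/v2/alerts\n' "$1"
}

# Fetch the current alerts as JSON using basic auth.
# ALERTMANAGER_PASSWORD is a hypothetical variable; set it to the
# password retrieved with the stackspin secrets command.
list_alerts() {
  curl --fail --silent --show-error \
    --user "admin:${ALERTMANAGER_PASSWORD}" \
    "$(alerts_url "$1")"
}

# Usage: list_alerts stackspin.example.org
```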

Email alerts

From time to time you might receive email alerts that Alertmanager sends to the email address set in the cluster configuration. Common alerts include (listed by the alertname referenced in the email body):

  • KubeJobCompletion: A job did not complete successfully. This often happens during the initial setup phase. If the alert persists, list all jobs in the stackspin-apps namespace with e.g. kubectl -n stackspin-apps get jobs, then delete the failed job to silence the alert with e.g. kubectl -n stackspin-apps delete job nc-nextcloud-cron-27444460.

  • ReconciliationFailure: A flux helmRelease could not be reconciled successfully. This also happens often during the initial setup phase, but it can have different root causes. Use flux -n stackspin-apps get helmreleases to view the current state of all helmReleases in the stackspin-apps namespace. If the helmRelease in question is stuck in an install retries exhausted or upgrade retries exhausted state, you can force a reconciliation with:

    flux -n stackspin-apps suspend helmrelease zulip
    flux -n stackspin-apps resume helmrelease zulip
    

    Depending on the underlying cause, this may or may not fix the helmRelease state. For more information on this issue, see "helmrelease upgrade retries exhausted regression".
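
When several helmReleases are stuck at once, the suspend and resume steps can be wrapped in small shell helpers. The following is a minimal sketch, not a Stackspin tool: the function names are hypothetical, and it assumes that the release name is the first column of the flux get helmreleases output and that stuck releases show the text retries exhausted in their status line.

```shell
# Print the names of helmReleases whose status line mentions
# "retries exhausted". Reads `flux get helmreleases` output on stdin and
# assumes the release name is the first column.
stuck_releases() {
  awk '/retries exhausted/ { print $1 }'
}

# Suspend and then resume one helmRelease to force a new reconciliation.
force_reconcile() {
  flux -n stackspin-apps suspend helmrelease "$1" &&
    flux -n stackspin-apps resume helmrelease "$1"
}

# Usage:
#   flux -n stackspin-apps get helmreleases | stuck_releases |
#     while read -r name; do force_reconcile "$name"; done
```

As noted above, a forced reconciliation does not address the underlying cause, so check the state of the helmRelease again afterwards.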