Monitoring
==========

For monitoring your Stackspin cluster we include the kube-prometheus-stack_
Helm chart, which bundles Grafana_, Prometheus_ and Alertmanager_ together with
pre-configured Prometheus alerts and Grafana dashboards.

Grafana
-------

Grafana can be accessed by clicking the ``Monitoring`` icon in the
``Utilities`` section of the dashboard. Use Stackspin single sign-on to log in.

Dashboards
~~~~~~~~~~

Browse the pre-configured dashboards to explore metrics of your Stackspin
cluster. Describing every dashboard would go beyond the scope of this page;
reach out to us if you can't find what you are looking for.

Prometheus
----------

Prometheus can be reached by adding ``prometheus.`` in front of your cluster
domain, e.g. ``https://prometheus.stackspin.example.org``. Until we
`configure single sign-on for prometheus`_ you need to log in using basic auth.
The user name is ``admin``; the password can be retrieved by running

.. code::

   python -m stackspin CLUSTERNAME secrets | grep prometheus-basic-auth

Alertmanager
------------

Alertmanager can be reached by adding ``alertmanager.`` in front of your
cluster domain, e.g. ``https://alertmanager.stackspin.example.org``. Until we
`configure single sign-on for prometheus`_ you need to log in using basic auth.
The user name is ``admin``; the password can be retrieved by running

.. code::

   python -m stackspin CLUSTERNAME secrets | grep alertmanager-basic-auth

Occasionally it is convenient to view firing alerts from the command line
instead of going through the Alertmanager web interface. The following command
does that (it requires ``kubectl`` access to your cluster):

.. code::

   kubectl exec -it -n stackspin alertmanager-kube-prometheus-stack-alertmanager-0 -- amtool alert query --alertmanager.url=http://localhost:9093

Email alerts
------------

From time to time you might get email alerts sent by Alertmanager_ to the email
address you have set in the cluster configuration.
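
When an alert email arrives, it can be handy to check the current alert state
over HTTP as well. The following is a minimal sketch, not part of Stackspin:
the helper names ``monitoring_url`` and ``alerts_endpoint`` are hypothetical.
They only combine the host naming described for Prometheus and Alertmanager
with the alert listing paths of the upstream HTTP APIs (``/api/v1/alerts`` for
Prometheus, ``/api/v2/alerts`` for Alertmanager). Whether the basic auth
credentials also grant access to these API paths on your cluster is an
assumption.

.. code:: shell

   # Hypothetical helpers, not shipped with Stackspin.

   # Build the base URL of a monitoring service from its prefix and the
   # cluster domain, e.g. "prometheus" + "stackspin.example.org".
   monitoring_url() {
       service="$1"
       domain="$2"
       echo "https://${service}.${domain}"
   }

   # Print the URL that lists alerts for the given service.
   # Assumption: the upstream API paths are exposed unchanged.
   alerts_endpoint() {
       case "$1" in
           prometheus)   echo "$(monitoring_url "$1" "$2")/api/v1/alerts" ;;
           alertmanager) echo "$(monitoring_url "$1" "$2")/api/v2/alerts" ;;
           *) echo "unknown service: $1" >&2; return 1 ;;
       esac
   }

With these in place, something like
``curl -u "admin:PASSWORD" "$(alerts_endpoint alertmanager stackspin.example.org)"``
could be used to fetch the alert list as JSON, where ``PASSWORD`` is the basic
auth password retrieved as described above.
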
Common alerts include (listed by the ``alertname`` referenced in the email
body):

* **KubeJobCompletion**: A job did not complete successfully. This often
  happens during the initial setup phase. If the alert persists, use e.g.
  ``kubectl -n stackspin-apps get jobs`` to see all jobs in the
  ``stackspin-apps`` namespace, and delete the failed job to silence the alert,
  e.g. with ``kubectl -n stackspin-apps delete job nc-nextcloud-cron-27444460``.
* **ReconciliationFailure**: A `flux helmRelease`_ could not be reconciled
  successfully. This also often happens during the initial setup phase, but it
  can have different root causes. Use
  ``flux -n stackspin-apps get helmreleases`` to view the current state of all
  ``helmReleases`` in the ``stackspin-apps`` namespace. If the ``helmRelease``
  in question is stuck in an ``install retries exhausted`` or
  ``upgrade retries exhausted`` state, you can force a reconciliation with

  .. code::

     flux -n stackspin-apps suspend helmrelease zulip
     flux -n stackspin-apps resume helmrelease zulip

  Depending on the underlying cause, this may or may not fix the
  ``helmRelease`` state. For more information on this issue see
  `helmrelease upgrade retries exhausted regression`_.

.. _kube-prometheus-stack: https://artifacthub.io/packages/helm/prometheus-community/kube-prometheus-stack
.. _Grafana: https://grafana.com
.. _Prometheus: https://prometheus.io
.. _Alertmanager: https://prometheus.io/docs/alerting/latest/alertmanager
.. _configure single sign-on for prometheus: https://open.greenhost.net/stackspin/stackspin/-/issues/371
.. _flux helmRelease: https://fluxcd.io/docs/guides/helmreleases
.. _helmrelease upgrade retries exhausted regression: https://github.com/fluxcd/flux2/issues/1878
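
The suspend and resume sequence described for **ReconciliationFailure** can be
wrapped in a small shell function so that it is easy to repeat for other
releases. This is only a sketch: ``force_reconcile`` and the ``DRY_RUN``
variable are hypothetical names introduced here, not part of Stackspin or
flux. The function runs the same two ``flux`` commands shown above and
defaults to the ``stackspin-apps`` namespace.

.. code:: shell

   # Hypothetical helper: suspend and then resume a helmRelease to force
   # a reconciliation.
   # Usage: force_reconcile RELEASE [NAMESPACE]
   force_reconcile() {
       release="$1"
       namespace="${2:-stackspin-apps}"
       if [ -z "$release" ]; then
           echo "usage: force_reconcile RELEASE [NAMESPACE]" >&2
           return 1
       fi
       for action in suspend resume; do
           if [ "${DRY_RUN:-0}" = "1" ]; then
               # Only print the command that would be run.
               echo "flux -n $namespace $action helmrelease $release"
           else
               # Stop if either step fails.
               flux -n "$namespace" "$action" helmrelease "$release" || return 1
           fi
       done
   }

Running ``DRY_RUN=1 force_reconcile zulip`` prints the two commands without
touching the cluster, which is a cheap way to check the release name and
namespace before running ``force_reconcile zulip`` for real.
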