Maintenance¶
Logging¶
Logs from pods and containers can be read in different ways:
- In the cluster filesystem at /var/log/pods/ or /var/log/containers/.
- Using kubectl logs.
- Querying aggregated logs with Grafana, see below.
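For example, to follow the logs of a single pod with kubectl (the namespace and pod name below are placeholders; list the pods on your cluster first with kubectl get pods --all-namespaces):
# Example only: replace the namespace and pod name with values from your cluster
kubectl logs --follow --namespace oas my-app-pod-12345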
Central log aggregation¶
We use Promtail, Loki and Grafana for easy access to aggregated logs. The Loki documentation is a good starting point to understand how this setup works, and the Using Loki in Grafana guide gets you started with querying your cluster logs with Grafana.
You will find the Loki Grafana integration on your cluster at https://grafana.oas.example.org/explore together with some generic query examples.
LogQL query examples¶
Please also refer to the LogQL documentation.
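If you want to run the logcli examples below instead of using the Grafana explore view, logcli needs to know where to reach Loki. One possible way, assuming Loki runs as a service named loki in the oas namespace (adjust to your setup), is to port-forward it and set LOKI_ADDR:
# Forward the Loki API to your local machine and point logcli at it
kubectl port-forward --namespace oas service/loki 3100:3100 &
export LOKI_ADDR=http://localhost:3100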
Query all aggregated logs (unfortunately we can’t find a better way of doing this, since LogQL always expects a stream label to be queried):
logcli query '{foo!="bar"}'
Query all logs for a keyword:
logcli query '{foo!="bar"} |= "error"'
Query all k8s apps for errors using a regular expression:
logcli query '{job=~".*"} |~ "error|fail|exception|fatal"'
Flux¶
Flux is responsible for installing applications. It uses four controllers:
- source-controller that tracks Helm and Git repositories like https://open.greenhost.net/openappstack/openappstack for updates.
- kustomize-controller to deploy kustomizations that often install helmreleases.
- helm-controller to deploy the helmreleases.
- notification-controller that is responsible for inbound and outbound flux messages.
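If you have the flux command line tool installed, it can also show the current state of the objects these controllers manage, for example:
# List the git sources and helm releases that flux manages
flux get sources git --all-namespaces
flux get helmreleases --all-namespaces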
Query all messages from the source-controller:
{app="source-controller"}
Query all messages from the source-controller and helm-controller:
{app=~"(source-controller|helm-controller)"}
helm-controller messages containing wordpress:
{app = "helm-controller"} |= "wordpress"
helm-controller messages containing wordpress without unchanged events (to only show the installation messages):
{app = "helm-controller"} |= "wordpress" != "unchanged"
Filter out redundant helm-controller messages:
{ app = "helm-controller" } !~ "(unchanged | event=refreshed | method=Sync | component=checkpoint)"
Debug OAuth2 single sign-on with Rocket.Chat:
{container_name=~"(hydra|rocketchat)"}
Query Kubernetes events processed by the eventrouter app containing warning:
logcli query '{app="eventrouter"} |~ "warning"'
Cert-manager¶
cert-manager is responsible for requesting Let’s Encrypt TLS certificates.
Query cert-manager messages containing chat:
{app="cert-manager"} |= "chat"
Hydra¶
Hydra is the single sign-on system.
Show only warnings and errors from hydra:
{container_name="hydra"} != "level=info"
Backup¶
On your provisioning machine¶
During the installation process, a cluster config directory is created on your provisioning machine, located in the top-level sub-directory clusters in your clone of the openappstack git repository. Although these files are not essential for your OpenAppStack cluster to continue functioning, you may want to back this folder up because it allows easy access to your cluster.
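One simple way to do that is to create a dated archive of the directory, for example (the path to your openappstack clone is a placeholder):
# Archive the cluster config directory from your openappstack clone
tar -czf oas-clusters-backup-$(date +%F).tar.gz -C /path/to/openappstack clusters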
On your cluster¶
OpenAppStack supports using the program Velero to make backups of your OpenAppStack instance to external storage via the S3 API. See Backups with Velero in the installation instructions for setup details. By default this will make nightly backups of the entire cluster (minus Prometheus data). To make a manual backup, run
cluster$ velero create backup BACKUP_NAME --exclude-namespaces velero --wait
from your VPS. See velero --help for other commands, and Velero’s documentation for more information.
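To list existing backups and inspect one in detail:
cluster$ velero backup get
cluster$ velero backup describe BACKUP_NAME --details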
Note: in case you want to make an (additional) backup of application data via alternate means, all persistent volume data of the cluster are stored in directories under /var/lib/OpenAppStack/local-storage.
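For example, a plain file-level copy to another machine could look like this (host name and target path are placeholders; note that file-level copies of running databases are not guaranteed to be consistent):
# Copy all persistent volume data to another host
rsync -av /var/lib/OpenAppStack/local-storage/ backup-host:/srv/oas-data-backup/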
Restore¶
Restore instructions will follow; please reach out to us if you need assistance.
Change the IP of your cluster¶
In case your cluster needs to migrate to another IP, make sure to update the IP address in /etc/rancher/k3s/k3s.yaml and, if applicable, your local kube config and inventory.yml in the cluster directory clusters/oas.example.org.
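After updating the address you can verify that the cluster is reachable and that the node reports the new IP, assuming your local kube config points at the cluster:
kubectl get nodes -o wide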
Delete evicted pods¶
In case your cluster disk is full, Kubernetes taints the node with DiskPressure. It then tries to evict pods, which is pointless in a single-node setup but can still happen. We have experienced hundreds of pods in Evicted state that still showed up after the DiskPressure condition had recovered. See also the out of resource handling with kubelet documentation.
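You can check whether the node currently reports disk pressure by looking at its conditions:
kubectl describe nodes | grep -i diskpressure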
You can delete all evicted pods with this command:
kubectl get pods --all-namespaces -ojson | jq -r '.items[] | select(.status.reason!=null) | select(.status.reason | contains("Evicted")) | .metadata.name + " " + .metadata.namespace' | xargs -n2 -l bash -c 'kubectl delete pods $0 --namespace=$1'
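Note that this one-liner requires jq to be installed on the machine where you run kubectl.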