Build your own cheap but powerful self-hosted cluster and be free from any SaaS solutions by following this opinionated guide.
This is Part V of a larger tutorial series. Go back to the first part for the introduction.
Note
This part is totally optional, as it's mainly focused on monitoring. Feel free to skip it.
Metrics with Prometheus
Prometheus has become the de facto standard for self-hosted monitoring, in part thanks to its architecture. It's a TSDB (Time Series Database) that polls (aka scrapes) standard REST metrics endpoints exposed by the tools being monitored. That's the case for Traefik, as we saw in part III. For tools that don't support it natively, like databases, you'll find many exporters that do the job for you.
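To give an idea of what Prometheus actually ingests, a /metrics endpoint is nothing more than plain text in the Prometheus exposition format, one sample per line. A purely illustrative sketch (the real metric names, help texts and values depend on the tool being scraped):

```txt
# HELP traefik_config_reloads_total Config reloads
# TYPE traefik_config_reloads_total counter
traefik_config_reloads_total 1
# HELP traefik_entrypoint_requests_total How many HTTP requests processed on an entrypoint
# TYPE traefik_entrypoint_requests_total counter
traefik_entrypoint_requests_total{code="200",entrypoint="https",method="GET",protocol="http"} 42
```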
Prometheus install
I'll not use a GlusterFS volume for storing Prometheus data, because:
- Only 1 Prometheus instance is needed, on the manager
- It's not critical data, just metrics
- No need for backups, and the data can get pretty huge
First go to the manager-01 node settings in Portainer, inside the Swarm cluster overview, and apply a new label indicating that this node is the host of the Prometheus data.

It's equivalent to doing:
```sh
docker node update --label-add prometheus.data=true manager-01
```

Then create the following config file at /etc/prometheus/prometheus.yml on manager-01:
```yml
global:
  scrape_interval: 5s

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "traefik"
    static_configs:
      - targets: ["traefik_traefik:8080"]
```

It consists of 2 scrape jobs; the targets entries tell Prometheus where each /metrics endpoint lives. I set a 5s interval, which means Prometheus will scrape the /metrics endpoints every 5 seconds.
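Optionally, since the file lives on manager-01 and promtool ships with the standard Prometheus release archive, you can validate the syntax before deploying (skip this if you don't have promtool around):

```sh
# Check that the configuration file is syntactically valid
promtool check config /etc/prometheus/prometheus.yml
```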
Finally, create the following stack in Portainer:
```yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus
    networks:
      - private
      - traefik_public
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.size=5GB
      - --storage.tsdb.retention.time=15d
    volumes:
      - /etc/hosts:/etc/hosts
      - /etc/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - data:/prometheus
    deploy:
      placement:
        constraints:
          - node.labels.prometheus.data == true
      labels:
        - traefik.enable=true
        - traefik.http.routers.prometheus.entrypoints=https
        - traefik.http.routers.prometheus.middlewares=admin-ip,admin-auth
        - traefik.http.services.prometheus.loadbalancer.server.port=9090

networks:
  private:
  traefik_public:
    external: true

volumes:
  data:
```

The private network will serve us later for the exporters. The next options are useful to keep the DB size under control, as metrics can grow very quickly:
| argument | description |
|---|---|
| storage.tsdb.retention.size | The maximum DB size |
| storage.tsdb.retention.time | The maximum data retention duration |
Deploy it, and https://prometheus.sw.dockerswarm.rocks should be available after a few seconds. Use the same Traefik credentials to log in.
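If the page doesn't answer, a quick sanity check from manager-01 looks like this (the service name assumes the stack was named prometheus in Portainer):

```sh
# Where is the task running and in which state?
docker service ps prometheus_prometheus
# Last log lines, useful if the task keeps restarting
docker service logs --tail 20 prometheus_prometheus
```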
You should now have access to some metrics!

In Status > Targets, you should see 2 endpoints enabled, which correspond to the scrape config above.
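You can double-check the same thing from the Graph tab with the tiniest possible PromQL query; every configured target should report a value of 1:

```promql
# 1 = scrape succeeded, 0 = scrape failing
up
```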

Get cluster metrics
We have the monitoring brain; now it's time to get more relevant metrics from all containers as well as from the Docker nodes themselves. It's doable thanks to exporters:
- cAdvisor from Google, which scrapes metrics of all running containers
- Node exporter for more global node (aka host) level metrics
Before editing the above stack, we need to make a specific Docker entrypoint for node exporter that will help us fetch the original hostname of the Docker host machine. This is because we run node exporter as a Docker container, which has no clue about the original node hostname.
Besides, node exporter (like cAdvisor) works as an agent that must be deployed in global mode. In order to avoid having to push a file onto each host, we'll use the Docker config feature available in Swarm mode.
Go to the Configs menu inside Portainer and add a node_exporter_entrypoint config file with the following content:
```sh
#!/bin/sh -e

NODE_NAME=$(cat /etc/nodename)
echo "node_meta{node_id=\"$NODE_ID\", container_label_com_docker_swarm_node_id=\"$NODE_ID\", node_name=\"$NODE_NAME\"} 1" > /home/node-meta.prom

set -- /bin/node_exporter "$@"

exec "$@"
```
It takes the node hostname and writes it as a metric that Prometheus can exploit.
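If you prefer the CLI over Portainer for this step, the equivalent would be something like this (assuming you saved the script above as docker-entrypoint.sh on manager-01):

```sh
# Create the Swarm config object from the local file, then list configs to confirm
docker config create node_exporter_entrypoint ./docker-entrypoint.sh
docker config ls
```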
Next we'll edit our Prometheus stack by expanding the YML config with 2 additional services:
```yml
#...
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.39.3
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    networks:
      - private
    command:
      - --docker_only
      - --housekeeping_interval=5s
      - --disable_metrics=disk,diskIO,tcp,udp,percpu,sched,process
    deploy:
      mode: global

  node-exporter:
    image: quay.io/prometheus/node-exporter:latest
    environment:
      - NODE_ID={{.Node.ID}}
    networks:
      - private
    volumes:
      - /:/host:ro,rslave
      - /etc/hostname:/etc/nodename
    command:
      - --collector.textfile.directory=/home
    configs:
      - source: node_exporter_entrypoint
        target: /docker-entrypoint.sh
    entrypoint:
      - /bin/sh
      - /docker-entrypoint.sh
    deploy:
      mode: global
#...

# Don't forget to add these lines at the end!
configs:
  node_exporter_entrypoint:
    external: true
```

Finally, add the next 2 jobs to the previous Prometheus config file:
```yml
#...
  - job_name: "cadvisor"
    dns_sd_configs:
      - names:
          - "tasks.cadvisor"
        type: "A"
        port: 8080

  - job_name: "node-exporter"
    dns_sd_configs:
      - names:
          - "tasks.node-exporter"
        type: "A"
        port: 9100
#...
```

The tasks.* names are a DNS feature specific to Docker Swarm which resolves to all tasks of a service at once when using global mode, similar to tcp://tasks.agent:9001 for Portainer.
You need to restart the Prometheus service in order to apply the above config.
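A simple way to do it from the CLI is to force a service update, which redeploys the task and re-reads the mounted config (again assuming the stack is named prometheus):

```sh
# Force a redeploy so the new prometheus.yml is taken into account
docker service update --force prometheus_prometheus
```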
Go back to the Prometheus Targets UI to confirm the appearance of the 2 new targets.

Confirm that you fetch the node_meta metric with the proper hostnames.
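The real value of node_meta shows up in joins, where it lets dashboards display the friendly hostname instead of a raw instance address. A sketch query, relying on default node exporter metric names:

```promql
# Attach node_name to each node's total memory (node_meta always has the value 1)
node_memory_MemTotal_bytes * on(instance) group_left(node_name) node_meta
```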

Visualization with Grafana
Okay, so now we have plenty of metrics from our cluster and containers, but the Prometheus Graph UI is a bit rough to use. It's time to go to the next level.
Redis
Before installing Grafana, let's quickly install a powerful key-value database cache on data-01:
```sh
sudo add-apt-repository ppa:redislabs/redis
sudo apt install -y redis-server
sudo systemctl enable redis-server.service
```

Now, let's enable remote connections by disabling protected-mode and the local bind address in the Redis config:
```conf
# bind 127.0.0.1 -::1
protected-mode no
```

Let's test it quickly from manager-01:
```sh
sudo add-apt-repository ppa:redislabs/redis
sudo apt install -y redis-tools
redis-cli -h data-01
```

Grafana install
As always, it's just a Swarm stack to deploy! Like N8N, we'll use a proper production database and a production cache.
First, connect to pgAdmin and create a new grafana user and database. Don't forget to tick Can login? in the Privileges tab, and set grafana as owner on database creation.
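If you'd rather use SQL than click through pgAdmin, a minimal equivalent executed as the postgres superuser could be (the password is obviously a placeholder):

```sql
-- Create the Grafana role and its database, with grafana as owner
CREATE ROLE grafana WITH LOGIN PASSWORD 'changeme';
CREATE DATABASE grafana OWNER grafana;
```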
Create the storage folder with:
```sh
sudo mkdir /mnt/storage-pool/grafana
sudo chown -R 472:472 /mnt/storage-pool/grafana
```

Next, create the following new stack:
```yml
version: '3.8'

services:
  grafana:
    image: grafana/grafana:latest
    environment:
      GF_SERVER_DOMAIN: grafana.sw.dockerswarm.rocks
      GF_SERVER_ROOT_URL: https://grafana.sw.dockerswarm.rocks
      GF_DATABASE_TYPE: postgres
      GF_DATABASE_HOST: data-01:5432
      GF_DATABASE_NAME: grafana
      GF_DATABASE_USER: grafana
      GF_DATABASE_PASSWORD:
      GF_REMOTE_CACHE_TYPE: redis
      GF_REMOTE_CACHE_CONNSTR: addr=data-01:6379,pool_size=100,db=0,ssl=false
    volumes:
      - /etc/hosts:/etc/hosts
      - /mnt/storage-pool/grafana:/var/lib/grafana
    networks:
      - traefik_public
    deploy:
      labels:
        - traefik.enable=true
        - traefik.http.routers.grafana.entrypoints=https
        - traefik.http.routers.grafana.middlewares=admin-ip
        - traefik.http.services.grafana.loadbalancer.server.port=3000
      placement:
        constraints:
          - node.role == manager

networks:
  traefik_public:
    external: true
```

Set a proper GF_DATABASE_PASSWORD and deploy. The database migration should be automatic (don't hesitate to check inside pgAdmin). Go to https://grafana.sw.dockerswarm.rocks and log in as admin / admin.

Docker Swarm dashboard
For the best showcase of Grafana, let's import an existing dashboard suited for a complete Swarm monitoring overview.
First we need to add Prometheus as the main metrics data source. Go to the Configuration > Data sources menu and click Add data source. Select Prometheus and set the internal Docker Prometheus URL, which should be http://prometheus:9090. A success message should appear when saving.

Then go to Create > Import, load 11939 as the dashboard ID, select the Prometheus data source, and voilà!

The Available Disk Space metric card should indicate N/A because it isn't properly configured for Hetzner disks. Just edit the card and change the PromQL inside the Metrics browser field by replacing device="rootfs", mountpoint="/" with device="/dev/sda1", mountpoint="/host".
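For reference, the corrected selector should end up looking roughly like this (the exact expression around it depends on the dashboard version, so adapt rather than copy blindly):

```promql
# Available space on the host partition, as seen by node exporter from inside its container
node_filesystem_avail_bytes{device="/dev/sda1", mountpoint="/host"}
```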
External node, MySQL and PostgreSQL exporters
We're done with the cluster metrics part, but what about the external data-01 host and its databases? Just more exporters, of course!
Node exporter for data
For node exporter, we have no choice but to install it locally as a service binary, so we must go through an old-fashioned install.
```sh
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
tar xzf node_exporter-1.3.1.linux-amd64.tar.gz
rm node_exporter-1.3.1.linux-amd64.tar.gz
sudo mv node_exporter-1.3.1.linux-amd64/node_exporter /usr/local/bin/
rm -r node_exporter-1.3.1.linux-amd64/
```

Create a new systemd service file at /etc/systemd/system/node-exporter.service:
```ini
[Unit]
Description=Node Exporter

[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter

[Install]
WantedBy=default.target
```

Then enable the service and check its status:
```sh
sudo systemctl enable node-exporter.service
sudo systemctl start node-exporter.service
sudo systemctl status node-exporter.service
```

Exporter for databases
For MySQL, we need to create a specific exporter user. Run sudo mysql and execute the following SQL (replace *** with your password):
```sql
CREATE USER 'exporter'@'10.0.0.0/8' IDENTIFIED BY '***' WITH MAX_USER_CONNECTIONS 3;
GRANT PROCESS, REPLICATION CLIENT, SELECT ON *.* TO 'exporter'@'10.0.0.0/8';
```

Then we just have to expand the above Prometheus stack with 2 new exporter services, one for MySQL and the other for PostgreSQL:
```yml
#...
  mysql-exporter:
    image: prom/mysqld-exporter
    environment:
      DATA_SOURCE_NAME: exporter:${MYSQL_PASSWORD}@(data-01:3306)/
    networks:
      - private
    volumes:
      - /etc/hosts:/etc/hosts
    deploy:
      placement:
        constraints:
          - node.role == manager

  postgres-exporter:
    image: quay.io/prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_URI: data-01:5432/postgres?sslmode=disable
      DATA_SOURCE_USER: swarm
      DATA_SOURCE_PASS: ${POSTGRES_PASSWORD}
    networks:
      - private
    volumes:
      - /etc/hosts:/etc/hosts
    deploy:
      placement:
        constraints:
          - node.role == manager
#...
```

Set proper MYSQL_PASSWORD and POSTGRES_PASSWORD environment variables and deploy the stack. Make sure the 2 new services have started.
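A quick way to verify from manager-01 that both exporters are running:

```sh
# Both exporter services should report 1/1 replicas
docker service ls | grep exporter
```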
Configure Prometheus
Expand the Prometheus config with 3 new jobs:
```yml
#...
  - job_name: "node-exporter-data-01"
    static_configs:
      - targets: ["data-01:9100"]

  - job_name: "mysql-exporter-data-01"
    static_configs:
      - targets: ["mysql-exporter:9104"]

  - job_name: "postgres-exporter-data-01"
    static_configs:
      - targets: ["postgres-exporter:9187"]
#...
```

Then restart the Prometheus service and go back to the Targets page to check that you have all the new data-01 endpoints.
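You can also query the scrape health of the 3 new jobs directly in the Graph tab; each series should have the value 1:

```promql
# Matches node-exporter-data-01, mysql-exporter-data-01 and postgres-exporter-data-01
up{job=~".*data-01"}
```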

Grafana dashboards for data
Now it's time to wrap this up with some new dashboards optimized for data metrics!
It's simple: just import the following dashboards, with Prometheus as the data source, the same way as the Docker Swarm dashboard above:
- 1860: For Node exporter
- 7362: For MySQL exporter
- 9628: For PostgreSQL exporter
Nothing more to do!
Node Dashboard

MySQL Dashboard

PostgreSQL Dashboard

4th check ✅
We've completed the whole monitoring part, with the installation of a time-series DB, exporters, and UI visualization.
What about logging and tracing, which are other essential aspects of production analysis and debugging? We'll see that in the next part.