Setup Validator Metrics

Setup Validator Metrics - InfluxDB.

Hayek Monitoring Environment

For monitoring our validator, we use Telegraf, a lightweight metrics collection agent. It runs directly on the validator nodes and gathers various hardware metrics such as:

  • CPU performance

  • NVMe health and usage

  • Network traffic

  • RAM usage

For validator-specific metrics (such as block production, vote credits, identity balance, etc.), we rely on the Stakeconomy scripts. All collected metrics are sent to an external time-series database powered by InfluxDB.

Alerting System

We use Watchtower for monitoring the validator's health across the Solana cluster. Watchtower runs on a separate machine and continuously checks validator status. If it detects any issues (delinquency, low balance), it sends alerts through multiple channels such as:

  • Telegraf

  • Discord

Hardware Alerts

For hardware-related alerts, we rely on Grafana Alerts. These are configured to notify us when metrics exceed defined thresholds, including:

  • High CPU usage

  • High memory usage

  • NVMe disks reaching critical usage levels

This setup ensures both the performance and reliability of our validator are actively monitored and issues are promptly addressed.

Setup Grafana

You can install it yourself or you can use a provider template such as Vultr, which is easy by selecting the server, operating system and at the marketplace center find Grafana. If you prefer to install Grafana you can use the official guide at https://grafana.com/docs/grafana/latest/setup-grafana/installation/ else you can use Grafana Cloud if you don't want to pay for a private server, you have to be aware Grafana Cloud has some retention metrics limitations https://grafana.com/docs/grafana-cloud/

Once your Grafana is running you need to open port 3000 in your firewall

UFW

ufw allow 3000/tcp
ufw reload

Else for proper monitoring system you need to add an SSL certificate to your Grafana Server

You can use an auto-signed certificate or much better you can use a free certificate through Let's Encrypt

Enable SSL

Install Certbot

apt install certbot

For NGINX

apt install python3-certbot-nginx

Get Certificate

certbot --nginx -d your-domain.com -d www.yourdomain.com --email [email protected] --agree-tos --no-eff-email

For Apache

apt install python3-certbot-apache

Get Certificate

certbot --apache -d your-domain.com -d www.yourdomain.com --email [email protected] --agree-tos --no-eff-email

After getting the certificates you need to add them to Grafana, you must go to Grafana folder configuration and add the certificates path

nano /etc/grafana/grafana.ini
## locate the certificates lines and add / edit the Let's Encrypt certificates
cert_file = /etc/letsencrypt/live/yourdomain.com/fullchain.pem
cert_key = /etc/letsencrypt/live/yourdomain.com/privkey.pem

You need to make sure the Grafana user has the read privileges over these files, for that identify which user Grafana is using for running the systemd service

systemctl show grafana-server -p User
###output message
#User=grafana

You need to grant read privileges for that user for certificates

chmod root:grafana /etc/letsencrypt/live/yourdomain.com/{fullchain.pem,privkey.pem}
chmod 640 root:grafana /etc/letsencrypt/live/yourdomain.com/{fullchain.pem,privkey.pem}

Restart Grafana Service

systemctl restart grafana-server

Check your Grafana https://yourdomain.com:3000

If you install Grafana through provider templates such as Vultr they will provide you the credentials.

If you used the self installation see the Grafana docs link above.

Setup InfluxDB

InfluxDB will receive metrics from the Telegraf agent installed on the validator servers as well as from other sources.

Installation

For DEB-based platforms (e.g. Ubuntu, Debian), add the InfluxData repository with the following commands:

wget -q https://repos.influxdata.com/influxdata-archive_compat.key
echo '393e8779c89ac8d958f81f942f9ad7fb82a25e133faddaf92e15b16e6ac9ce4c influxdata-archive_compat.key' | sha256sum -c && cat influxdata-archive_compat.key | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg > /dev/null
echo 'deb [signed-by=/etc/apt/trusted.gpg.d/influxdata-archive_compat.gpg] https://repos.influxdata.com/debian stable main' | sudo tee /etc/apt/sources.list.d/influxdata.list

Update package lists and install InfluxDB:

sudo apt-get update
sudo apt-get install influxdb -y

Start InfluxDB and enable it to run at system startup:

sudo systemctl enable influxdb
sudo systemctl start influxdb

Access to InfluxDB

Connect to the InfluxDB shell:

influx

Or, if you need to connect with SSL (for self-signed or invalid certificates):

influx -ssl -unsafeSsl

Create Databases

Our setup includes three main databases:

  1. Validator Metrics Database: Receives metrics from Telegraf agents installed on validator servers.

  2. Monitoring Box Metrics Database: Collects metrics from a separate monitoring system.

  3. Solana Block Production Database: Tracks block production statistics from Solana validators.

For each database, follow these steps:

create database <database_name>
use <database_name>

Create Users

For each database, create a user and grant appropriate permissions:

create user <username> with password '<password>'
grant all on <database_name> to <username>

Setup Watchtower

The watchtower is recommended to be installed in a separate box. We use watchtower for monitoring and alerting identity keys for Mainnet and Testnet. Critical metrics such as Identity balance and validator health are checked every minute.

Prerequisites

  • Solana CLI < URL Solana CLI Docs >

  • Python # Used for monitoring scripts

  • Telegram Group

  • Discord WEBHOOK

Installation

  1. Install Solana CLI.

  2. Create a service for agave watchtower. We recommend one service for each identity (Mainnet, Testnet, Debug).

nano /etc/systemd/system/agave-watchtower-mainnet.service
[Unit]
Description=Agave Watchtower Monitoring Service (Mainnet)
After=network.target

[Service]
ExecStart=/usr/local/bin/agave-watchtower \
  --url https://api.mainnet-beta.solana.com \
  --validator-identity [PUBKEY] \
  --interval 60 \                          
  --monitor-active-stake \
  --minimum-validator-identity-balance 5 \ # Minimum threshold for identity balance
  --rpc-timeout 30 \
  --name-suffix "server-name" \
  --unhealthy-threshold 1 \
  --ignore-http-bad-gateway
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target
  1. Enable the service

systemctl enable agave-watchtower-mainnet.service
  1. Start the service

systemctl start agave-watchtower-mainnet.service

At this point we don't send alerts to Telegram and Discord yet. We want to capture the metrics for agave watchtower and format them with Python. Then, through another service, we send the alerts to Discord and Telegram.

Create Formatting Service for Mainnet

  1. Create the Python script

nano /usr/local/bin/solana-alert-formatter-mainnet.py
  1. Script Execution Rights

chmod +x /usr/local/bin/solana-alert-formatter-mainnet.py
  1. Create a service

nano /etc/systemd/system/solana-alert-formatter-mainnet.service
[Unit]
Description=Agave Watchtower Alert Formatter
After=agave-watchtower-mainnet.service  # This means this service only starts after the agave-watchtower-mainnet.service has been activated
Wants=agave-watchtower-mainnet.service

[Service]
ExecStart=/usr/bin/python3 /usr/local/bin/solana-alert-formatter-mainnet.py
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

This script is charged to send the alert to DISCORD and TELEGRAM Channels. Inside this script we have some variables we need to be aware of:

# Alert Intervals (in seconds)
VALIDATOR_DELINQUENT_ALERT_INTERVAL = 180  # Time between delinquent alerts (3 minutes)
LOW_BALANCE_ALERT_INTERVAL = 60 # Time between alerts when the validator has low balance
SUGGESTED_BALANCE = 10   # Suggested balance in SOL for identity accounts
# Enable/Disable specific alert types
ENABLE_DELINQUENT_ALERTS = True      # Set to False to disable validator delinquent alerts
ENABLE_RECOVERY_ALERTS = True        # Set to False to disable validator recovery alerts
ENABLE_LOW_BALANCE_ALERTS = True     # Set to False to disable low balance alerts
ENABLE_BALANCE_RECOVERY_ALERTS = True # Set to False to disable balance recovery alerts
# Communication platforms
ENABLE_DISCORD = True
ENABLE_TELEGRAM = True 
# Discord configuration
DISCORD_WEBHOOK = "webhooks_url"
# Telegram configuration
TELEGRAM_BOT_TOKEN = ""
TELEGRAM_CHAT_ID = ""

Entire Script < URL >

  1. Enable the service

systemctl enable solana-alert-formatter-mainnet.service
  1. Start the service

systemctl start solana-alert-formatter-mainnet.service

Services Maintenance and Monitoring

We must be aware that every time a service is changed, we need to reload the daemon and then restart the service.

Reload/Restart Systemd

systemctl daemon-reload
systemctl restart <SERVICE>

Checking Service Logs

To monitor the services and troubleshoot issues, use these commands:

systemctl status <SERVICE>
journalctl -u <SERVICE> -f

Additional Configurations

Repeat the process for each identity (Mainnet, Testnet, Debug) by creating separate services with appropriate configurations for each environment.

Setup Metrics

We pull metrics from several sources such as Stakewiz, Solana API, Solana CLI, Jpool, etc.

Validator and Block Production Metrics

Create a service:

nano /etc/systemd/system/validator-metrics.service
[Unit]
Description=Stakewiz Validator Metrics Sender (every 2 minutes)
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/bin/bash -c 'while true; do /usr/local/bin/send_validator_metrics.sh & /usr/local/bin/send_block_metrics_v6.sh; wait; sleep 120; done'
Restart=always
RestartSec=5
User=root

[Install]
WantedBy=multi-user.target

The script "send_block_metrics_v6.sh" will send the metrics to a separate database which is only dedicated for block production metrics. This script collects the metrics for block production through Solana CLI and also collects epoch information:

solana block <blocknumber>
solana epoch-info

Configuration Variables

Here are some variables you should be aware of for this script:

# You can choose which networks to analyze by changing these variables to true/false
PROCESS_MAINNET=true
PROCESS_TESTNET=false  # Change to true if you want to process testnet
PROCESS_DEBUG=false    # Change to true if you want to process debug

# MAINNET
MAINNET_VOTE_ACCOUNT="<VOTEKEY>"
MAINNET_IDENTITY_KEY="<PUBKEY>"
MAINNET_RPC_API="https://api.mainnet-beta.solana.com"
MAINNET_HOST="<SERVERNAME>"

# ===== INFLUXDB CONFIGURATION =====
INFLUX_URL="https://influxdb-server-url:8086"
INFLUX_DB="validator_blocks"
INFLUX_USER="<DB_USER>"
INFLUX_PASS="<DB_PASSWORD>"  

# ===== PATH TO SOLANA BIN =====
SOLANA_BIN="/root/.local/share/solana/install/active_release/bin/solana"

This script "/usr/local/bin/send_validator_metrics.sh" obtains metrics from Solana clusters API and Solana CLI, else pulls metrics from stakewiz API.

Here are some variables you should be aware of for this script:

MAINNET_VOTE_ACCOUNT="<VOTEKEY>"
MAINNET_IDENTITY_KEY="<PUBKEY>"
MAINNET_RPC_API="https://api.mainnet-beta.solana.com"
MAINNET_HOST="<SERVERNAME>"
MAINNET_STAKEWIZ_ENABLED=true # If false don't pull metrics from stakewiz API 
MAINNET_GOSSIP_ENABLED=true 

# ===== INFLUXDB CONFIGURATION =====
INFLUX_URL="https://validator.secu.one:8086"
INFLUX_DB="<INFLUX_DATABASE>"
INFLUX_USER="<DB_USER>"
INFLUX_PASS="<DB_PASSWORD>"

# ===== ABSOLUTE PATH TO SOLANA BIN =====
SOLANA_BIN="/root/.local/share/solana/install/active_release/bin/solana"

Script Execution Rights

chmod +x /usr/local/bin/send_block_metrics_v6.sh
chmod +x /usr/local/bin/send_validator_metrics.sh

Setup JPool Rank Fetcher

JPool doesn't have an API which we can use to get metrics and scores, so we had to use scraping methods to get the metrics. JPool doesn't use Cloudflare Turnstile for captcha challenge, so we are able to get these metrics.

First we need to create a Python virtual environment to install some dependencies.

Install Python Venv

sudo apt update
sudo apt install python3 python3-venv

Create Virtual Environment

To isolate Python dependencies, create a virtual environment:

sudo python3 -m venv /root/venv

Activate virtual environment

source /root/venv/bin/activate

Install dependencies

Install the required Python packages inside the virtual environment:

/root/venv/bin/pip install flask playwright
# Install browser playwright, which is necessary for scraping 
/root/venv/bin/python -m playwright install

Deactivate Environment

deactivate

Create the service

nano /etc/systemd/system/tvc-api.service
[Unit]
Description=TVC Rank API with Flask and Playwright
After=network.target

[Service]
User=root
ExecStart=/root/venv/bin/python /usr/local/bin/get_tvc_rank.py
Restart=always
RestartSec=5
Environment=PYTHONUNBUFFERED=1

[Install]
WantedBy=multi-user.target

Script Execution Rights

chmod +x /usr/local/bin/get_tvc_rank.py

Enable the service

systemctl enable tvc-api.service

Start the service

systemctl start tvc-api.service

Last updated

Was this helpful?