Setup Validator Metrics
For monitoring our validator, we use Telegraf, a lightweight metrics collection agent. It runs directly on the validator nodes and gathers various hardware metrics such as:
CPU performance
NVMe health and usage
Network traffic
RAM usage
For validator-specific metrics (such as block production, vote credits, identity balance, etc.), we rely on the Stakeconomy scripts. All collected metrics are sent to an external time-series database powered by InfluxDB.
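As a rough illustration of what reaches InfluxDB, the sketch below formats one validator sample as an InfluxDB line-protocol record. The measurement, tag, and field names are hypothetical and are not taken from the Stakeconomy scripts; only the line-protocol syntax itself is standard.

```python
def _escape_key(value: str) -> str:
    """Escape commas, equals signs, and spaces in tag keys, tag values, and field keys."""
    for ch in (",", "=", " "):
        value = value.replace(ch, "\\" + ch)
    return value


def _escape_measurement(value: str) -> str:
    """Escape commas and spaces in a measurement name."""
    return value.replace(",", "\\,").replace(" ", "\\ ")


def _format_field(value) -> str:
    """Render a field value: booleans, integers (with 'i' suffix), floats, or quoted strings."""
    if isinstance(value, bool):  # must be tested before int, since bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return f"{value}i"
    if isinstance(value, float):
        return repr(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_line_protocol(measurement: str, tags: dict, fields: dict, timestamp_ns: int) -> str:
    """Build one line-protocol record; tags are sorted by key."""
    if not fields:
        raise ValueError("line protocol requires at least one field")
    tag_part = "".join(
        f",{_escape_key(k)}={_escape_key(str(v))}" for k, v in sorted(tags.items())
    )
    field_part = ",".join(f"{_escape_key(k)}={_format_field(v)}" for k, v in fields.items())
    return f"{_escape_measurement(measurement)}{tag_part} {field_part} {timestamp_ns}"


# Hypothetical sample: names below are illustrative only.
sample = to_line_protocol(
    "validator_metrics",
    {"host": "validator-1", "cluster": "mainnet"},
    {"identity_balance": 12.5, "vote_credits": 123456, "delinquent": False},
    1700000000000000000,
)
```

In practice Telegraf produces these records for you; the sketch is only meant to show the shape of the data stored in the time-series database.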
Validator Alerts
We use Watchtower for monitoring the validator's health across the Solana cluster. Watchtower runs on a separate machine and continuously checks validator status. If it detects any issues (delinquency, low balance), it sends alerts through multiple channels such as:
Telegram
Discord
Setup Watchtower
We recommend installing Watchtower on a separate machine. We use it to monitor and alert on the identity keys for Mainnet and Testnet. Critical metrics such as identity balance and validator health are checked every minute.
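To make the two critical checks concrete, here is a minimal Python sketch of the decision logic, assuming the balance and delinquency status have already been fetched. It is not Watchtower's implementation; the `ValidatorStatus` type, the `find_issues` helper, and the 1 SOL default threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ValidatorStatus:
    """Snapshot of one identity; how it is fetched is outside this sketch."""
    identity: str
    balance_sol: float
    delinquent: bool


def find_issues(status: ValidatorStatus, min_balance_sol: float = 1.0) -> list:
    """Return a human-readable message for every failed check (empty list if healthy)."""
    issues = []
    if status.delinquent:
        issues.append(f"{status.identity} is delinquent")
    if status.balance_sol < min_balance_sol:
        issues.append(
            f"{status.identity} balance {status.balance_sol:.2f} SOL "
            f"is below {min_balance_sol:.2f} SOL"
        )
    return issues
```

Calling `find_issues` once per minute for each identity and alerting whenever the returned list is non-empty mirrors the behaviour described above.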
Prerequisites
Solana CLI < URL Solana CLI Docs >
Python # Used for monitoring scripts
Telegram Group
Discord webhook
Installation
Install Solana CLI.
Create a service for agave watchtower. We recommend one service for each identity (Mainnet, Testnet, Debug).
Enable the service
Start the service
At this point we don't send alerts to Telegram and Discord yet. We want to capture the metrics for agave watchtower and format them with Python. Then, through another service, we send the alerts to Discord and Telegram.
Formatting Service
Create formatting service for Mainnet.
Create the Python script
Script Execution Rights
Create a service
This script is responsible for sending the alerts to the Discord and Telegram channels. It contains some variables you need to be aware of:
Entire Script < URL >
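The delivery step can be sketched with the Python standard library alone. This is a minimal illustration, not the script linked above: the function names are hypothetical, and the webhook URL, bot token, and chat ID are placeholders you would supply. It relies on two public endpoints, Discord's webhook execute call (JSON body with a `content` field) and the Telegram Bot API `sendMessage` method.

```python
import json
import urllib.parse
import urllib.request

DISCORD_MAX_CHARS = 2000   # Discord rejects longer "content" values
TELEGRAM_MAX_CHARS = 4096  # Telegram rejects longer "text" values


def build_discord_request(webhook_url: str, message: str) -> urllib.request.Request:
    """Prepare a POST to a Discord webhook; the message goes in the 'content' field."""
    body = json.dumps({"content": message[:DISCORD_MAX_CHARS]}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        # Assumption: an explicit User-Agent avoids the default urllib one being rejected.
        "User-Agent": "validator-alerts/1.0",
    }
    return urllib.request.Request(webhook_url, data=body, headers=headers, method="POST")


def build_telegram_request(bot_token: str, chat_id: str, message: str) -> urllib.request.Request:
    """Prepare a POST to the Telegram Bot API sendMessage method."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    form = {"chat_id": chat_id, "text": message[:TELEGRAM_MAX_CHARS]}
    body = urllib.parse.urlencode(form).encode("utf-8")
    return urllib.request.Request(url, data=body, method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send a prepared request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Keeping request construction separate from `send` makes it possible to check the payloads without touching the network, and lets one alert message be fanned out to both channels.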
Enable the service
Start the service
Hardware Alerts
For hardware-related alerts, we rely on Grafana Alerts. These are configured to notify us when metrics exceed defined thresholds, including:
High CPU usage
High memory usage
NVMe disks reaching critical usage levels
This setup ensures both the performance and reliability of our validator are actively monitored and issues are promptly addressed.
Setup Grafana
You can install Grafana yourself or use a provider template. On Vultr, for example, you select the server and operating system and then pick Grafana from the marketplace. To install it yourself, follow the official guide (https://grafana.com/docs/grafana/latest/setup-grafana/installation/). If you don't want to pay for a private server, you can use Grafana Cloud instead, but be aware that it has some metric retention limits (https://grafana.com/docs/grafana-cloud/).
Once your Grafana is running you need to open port 3000 in your firewall
UFW
A proper monitoring system should also serve Grafana over SSL, so add a certificate to your Grafana server. You can use a self-signed certificate or, much better, a free certificate from Let's Encrypt. You can enable SSL like this:
If you are using NGINX, use this:
If you are using Apache, use this:
After obtaining the certificates, add them to Grafana: open the Grafana configuration folder and set the certificate paths.
Make sure the Grafana user has read access to these files. To do that, first identify which user runs the Grafana systemd service.
Then grant that user read access to the certificates.
Restart Grafana Service
Check your Grafana https://yourdomain.com:3000
If you installed Grafana through a provider template such as Vultr, the provider will give you the credentials.
If you installed it yourself, see the Grafana documentation linked above.
Setup InfluxDB
InfluxDB will receive metrics from the Telegraf agent installed on the validator servers as well as from other sources. For DEB-based platforms (e.g. Ubuntu, Debian), add the InfluxData repository with the following commands:
Update package lists and install InfluxDB:
Start InfluxDB and enable it to run at system startup:
Connect to the InfluxDB shell:
Or, if you need to connect with SSL (for self-signed or invalid certificates):
Create Databases
Our setup includes three main databases:
Validator Metrics Database: Receives metrics from Telegraf agents installed on validator servers.
Monitoring Box Metrics Database: Collects metrics from a separate monitoring system.
Solana Block Production Database: Tracks block production statistics from Solana validators.
For each database, follow these steps:
Create Users
For each database, create a user and grant appropriate permissions:
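The per-database steps can be sketched as a small Python helper that generates InfluxQL (InfluxDB 1.x) statements to paste into the InfluxDB shell. The database name, user name, and password below are placeholders, and the helper itself is hypothetical; only the `CREATE DATABASE`, `CREATE USER ... WITH PASSWORD`, and `GRANT ... ON ... TO` statement forms come from InfluxQL.

```python
VALID_PRIVILEGES = {"READ", "WRITE", "ALL"}


def quote_identifier(name: str) -> str:
    """Double-quote an InfluxQL identifier (database or user name)."""
    return '"' + name.replace("\\", "\\\\").replace('"', '\\"') + '"'


def quote_string(value: str) -> str:
    """Single-quote an InfluxQL string literal such as a password."""
    return "'" + value.replace("\\", "\\\\").replace("'", "\\'") + "'"


def provisioning_statements(database: str, user: str, password: str,
                            privilege: str = "ALL") -> list:
    """Return the statements that create a database, a user, and the user's grant."""
    privilege = privilege.upper()
    if privilege not in VALID_PRIVILEGES:
        raise ValueError(f"privilege must be one of {sorted(VALID_PRIVILEGES)}")
    return [
        f"CREATE DATABASE {quote_identifier(database)}",
        f"CREATE USER {quote_identifier(user)} WITH PASSWORD {quote_string(password)}",
        f"GRANT {privilege} ON {quote_identifier(database)} TO {quote_identifier(user)}",
    ]


# Placeholder names: replace with your own database, user, and password.
for statement in provisioning_statements("validator_metrics", "telegraf", "change-me", "WRITE"):
    print(statement)
```

A reasonable split, under these assumptions, is to grant WRITE to the account Telegraf uses and READ to the account Grafana uses, so that neither holds more access than it needs.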
Services Maintenance
Whenever a service's unit file is changed, reload the systemd daemon and then restart the service.
Reload/Restart Systemd
Checking Service Logs
To monitor the services and troubleshoot issues, use these commands:
Additional Configurations
Repeat the process for each identity (Mainnet, Testnet, Debug) by creating separate services with appropriate configurations for each environment.
Setup Metrics
We pull metrics from several sources such as Stakewiz, Solana API, Solana CLI, Jpool, etc.
Validator Metrics
Create a service:
The script "send_block_metrics_v6.sh" will send the metrics to a separate database which is only dedicated for block production metrics. This script collects the metrics for block production through Solana CLI and also collects epoch information:
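The core calculation behind block production metrics can be sketched in Python as follows. This is not the content of send_block_metrics_v6.sh; the function and field names are hypothetical, and the inputs are assumed to be the leader-slot and produced-block counts that the Solana CLI reports for an identity.

```python
def skip_rate_percent(leader_slots: int, blocks_produced: int) -> float:
    """Percentage of assigned leader slots in which no block was produced."""
    if leader_slots < 0 or blocks_produced < 0:
        raise ValueError("counts must be non-negative")
    if blocks_produced > leader_slots:
        raise ValueError("blocks_produced cannot exceed leader_slots")
    if leader_slots == 0:
        return 0.0  # no leader slots yet this epoch, so nothing was skipped
    return 100.0 * (leader_slots - blocks_produced) / leader_slots


def block_production_fields(epoch: int, leader_slots: int, blocks_produced: int) -> dict:
    """Bundle the values that would be written to the block production database."""
    return {
        "epoch": epoch,
        "leader_slots": leader_slots,
        "blocks_produced": blocks_produced,
        "skipped_slots": leader_slots - blocks_produced,
        "skip_rate": round(skip_rate_percent(leader_slots, blocks_produced), 2),
    }
```

For example, 200 leader slots with 190 blocks produced gives 10 skipped slots and a skip rate of 5.0 percent.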
Configuration Variables
Here are some variables you should be aware of for this script:
The script "/usr/local/bin/send_validator_metrics.sh" obtains metrics from the Solana cluster API and the Solana CLI, and also pulls metrics from the Stakewiz API.
Here are some variables you should be aware of for this script:
Script Execution Rights
Setup JPool Rank Fetcher
JPool doesn't offer an API for retrieving metrics and scores, so we scrape them from its website instead. This works because JPool doesn't protect its pages with a Cloudflare Turnstile captcha challenge.
First, create a Python virtual environment to isolate the dependencies the fetcher needs:
Install the required Python packages inside the virtual environment:
Deactivate Environment