Below is Step 3 only: a deep, student-ready lab guide for creating Grafana 13.x alerts using Graphite + Telegraf Linux metrics.
This guide assumes you already completed:
Step 1: Explored Graphite metrics in Grafana Explore
Step 2: Created Linux Monitoring Dashboard

Datasource: Graphite
Metrics prefix: telegraf
Host: linux-demo
Dashboard: Linux Monitoring – Graphite Telegraf
Very Important Metric Validation
Based on the actual metrics collected in your Graphite storage, this lab uses the following metric structure:
telegraf.linux-demo.<measurement>.<field>
Examples:
telegraf.linux-demo.cpu.usage_active
telegraf.linux-demo.mem.used_percent
telegraf.linux-demo.disk.used_percent
telegraf.linux-demo.system.load1
telegraf.linux-demo.processes.zombies
Do not use the older assumed paths below in this lab:
telegraf.linux-demo.cpu.cpu-total.usage_active
telegraf.linux-demo.disk.*.used_percent
telegraf.linux-demo.net.eth0.bytes_recv
telegraf.linux-demo.net.ens5.bytes_recv
Reason:
Your current Telegraf Graphite output is storing metrics without CPU tag, disk mountpoint tag, and network interface tag in the Graphite path.
Grafana alert rules are based on three major parts: a query that selects data, a condition/threshold that decides when to fire, and evaluation settings that decide how often and how long the condition must be true. Grafana’s alerting model uses query, reduce expression, threshold expression, evaluation interval, and pending period.
Step 3: Create Alerts in Grafana 13.x for Graphite Linux Metrics
Lab Objective
By the end of this lab, students will create alerts for:
- High CPU Usage
- High Memory Usage
- High Disk Usage
- High Load Average
- Zombie Processes Detected
Students will also learn:
- What an alert rule is
- What Reduce expression does
- What Threshold expression does
- How to configure evaluation interval
- How to configure contact points
- How to route alert notifications
- How to test alerts safely
- How to troubleshoot No Data and Error states
3.1 Alerting Architecture
Your current monitoring flow is:
Linux Host
  ↓
Telegraf
  ↓
Graphite Carbon
  ↓
Whisper Storage
  ↓
Graphite Web
  ↓
Grafana Datasource
  ↓
Grafana Alert Rule
  ↓
Contact Point
  ↓
Email / Slack / Webhook / Teams
In simple words:
Telegraf collects metrics. Graphite stores metrics. Grafana queries Graphite. Grafana evaluates alert rules. Grafana sends notifications when rules fire.
3.2 Important Alerting Concepts
Alert Rule
An alert rule is the main object in Grafana Alerting.
It contains:
- Query
- Expression
- Condition
- Evaluation interval
- Pending period
- Labels
- Notification settings
Alert rules evaluate queries and expressions on a schedule and are the central component of Grafana Alerting.
Query
The query gets data from Graphite.
Correct example for this lab:
telegraf.linux-demo.cpu.usage_active
This returns CPU active usage time-series data.
Incorrect example for this lab:
telegraf.linux-demo.cpu.cpu-total.usage_active
This does not work with your current collected metrics because cpu-total is not present in your Graphite metric path.
Reduce Expression
Grafana alerting cannot directly alert on a full time-series graph. It normally reduces a time series into a single value.
Common reducer functions:
Reducer   Meaning
Last      Most recent value
Mean      Average value over selected range
Max       Maximum value over selected range
Min       Minimum value over selected range
For most infrastructure alerts, use:
Mean
or:
Last
Recommended for this lab:
Metric             Reducer
CPU                Mean
Memory             Last
Disk               Last
Load               Mean
Zombie Processes   Last
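The arithmetic behind each reducer can be illustrated with a small shell sketch. The five sample CPU values are invented for illustration; Grafana computes the reduction server-side, so this only shows what each reducer would return.

```shell
# Simulate Grafana's Last / Mean / Max reducers over sample datapoints.
samples="72 85 91 78 88"

last=$(echo $samples | awk '{ print $NF }')
mean=$(echo $samples | awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.1f", s / NF }')
max=$(echo $samples  | awk '{ m = $1; for (i = 2; i <= NF; i++) if ($i > m) m = $i; print m }')

echo "Last=$last Mean=$mean Max=$max"
```

Note how Mean (82.8) smooths the 91 spike that Max would report, which is why Mean is the safer choice for noisy metrics like CPU.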
Threshold Expression
Threshold checks whether the reduced value crosses a limit.
Example:
WHEN B IS ABOVE 80
Meaning:
If the reduced CPU value is above 80, alert should fire.
Evaluation Group
Evaluation group controls how often Grafana checks the alert.
For this lab:
Every 1 minute
Pending Period
Pending period controls how long the condition must remain true before firing.
Example:
For 2 minutes
Meaning:
The condition must be true for 2 continuous minutes before Grafana fires the alert.
This prevents alerts from firing due to short temporary spikes.
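The interplay between evaluation interval and pending period can be sketched as a tiny state machine. This is an illustration only, not Grafana's actual implementation, and the sample values are invented: with "evaluate every 1m" and "pending period 2m", the condition must hold for two further evaluations after first turning true.

```shell
# Sketch of the Normal -> Pending -> Firing transitions.
threshold=80
pending_evals=2          # pending period (2m) / evaluation interval (1m)
consecutive=0
states=""

for value in 40 85 90 95 50; do   # one sample per 1m evaluation
  if [ "$value" -gt "$threshold" ]; then
    consecutive=$((consecutive + 1))
    if [ "$consecutive" -gt "$pending_evals" ]; then
      state=Firing       # condition held for the full pending period
    else
      state=Pending      # condition true, pending period not yet over
    fi
  else
    consecutive=0        # any dip resets the pending timer
    state=Normal
  fi
  states="$states $state"
done

echo "States:$states"
```

The final sample (50) shows why the pending period matters: a short spike that ends before the pending period completes never reaches Firing.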
No Data and Error Handling
Every alert rule should define what happens if Grafana gets no data or query errors.
Recommended beginner setup:
State     Recommended Setting
No Data   No Data
Error     Error
For production, teams often tune this based on monitoring maturity.
3.3 Recommended Alert Rules for This Lab
Use these thresholds for demo and learning.
Alert Name          Query                                   Threshold
High CPU Usage      telegraf.linux-demo.cpu.usage_active    Above 80%
High Memory Usage   telegraf.linux-demo.mem.used_percent    Above 85%
High Disk Usage     telegraf.linux-demo.disk.used_percent   Above 85%
High Load Average   telegraf.linux-demo.system.load1        Above 2
Zombie Processes    telegraf.linux-demo.processes.zombies   Above 0
For student demos, you can temporarily lower thresholds to trigger alerts quickly.
Example:
CPU threshold: 20%
Memory threshold: 40%
Disk threshold: slightly below current disk usage
After testing, restore proper thresholds.
3.4 Create a Contact Point
Before creating alert rules, configure where Grafana should send notifications.
Grafana contact points contain the configuration for sending alert notifications, such as email, Slack, webhook, Microsoft Teams, and other integrations.
Steps
Go to:
Alerting → Contact points
Click:
Add contact point
Name:
student-demo-email
Integration type:

Email

Add email address:
For your own lab, use your real email.
Click:
Test
Then:
Save contact point
If Email Is Not Configured
In many self-hosted Grafana setups, email will not work unless SMTP is configured.
For learning, use a Webhook contact point instead.
Example contact point:
Name: student-demo-webhook
Type: Webhook
URL: https://webhook.site/your-generated-url
You can open webhook.site, copy the unique URL, and paste it into Grafana.
This is very useful for students because they can see alert payloads immediately without configuring SMTP.
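If you prefer to manage the contact point as code, Grafana also supports file provisioning for alerting resources. The sketch below is a minimal example, not a drop-in file: the file path and the webhook URL are placeholders for your environment, and the schema can vary between Grafana versions.

```yaml
# Hypothetical file: /etc/grafana/provisioning/alerting/contact-points.yaml
# URL is the placeholder from the lab; replace with your webhook.site URL.
apiVersion: 1
contactPoints:
  - orgId: 1
    name: student-demo-webhook
    receivers:
      - uid: student-demo-webhook-1
        type: webhook
        settings:
          url: https://webhook.site/your-generated-url
```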
3.5 Create Notification Policy
Notification policies decide how alerts are routed to contact points. Notification policies route alert instances to contact points using label matchers and can group multiple alerts together to reduce noise.
For this beginner lab, keep it simple.
Go to:
Alerting → Notification policies
Edit the default policy.
Set default contact point:
student-demo-email
or:
student-demo-webhook
Save.
This means all alerts without custom routing will go to that contact point.
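The same default routing can also be expressed with file provisioning. This is a hedged sketch under the same caveats as before (file path is hypothetical, schema may differ by version); the receiver name must match an existing contact point.

```yaml
# Hypothetical file: /etc/grafana/provisioning/alerting/policies.yaml
# Routes every alert without custom matchers to the lab contact point.
apiVersion: 1
policies:
  - orgId: 1
    receiver: student-demo-webhook
    group_by: ['grafana_folder', 'alertname']
```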
3.6 Create Folder for Alert Rules
Go to:
Dashboards → New folder
Create folder:
Linux Monitoring Alerts
Why?
Grafana-managed alert rules are usually stored under folders. This keeps alert rules organized.
3.7 Alert Rule 1: High CPU Usage
Purpose
This alert fires when CPU usage remains high.
Go to Alert Rules
Navigate to:
Alerting → Alert rules
Click:
New alert rule
Step A: Rule Name
Use:
Linux – High CPU Usage
Step B: Select Data Source Query
In Query A, select datasource:
Graphite
Query:
telegraf.linux-demo.cpu.usage_active
Optional display alias if using the query in Explore or dashboard:
alias(telegraf.linux-demo.cpu.usage_active, 'CPU Active %')
For alerting, the plain metric path is usually cleaner:
telegraf.linux-demo.cpu.usage_active
Set relative time range:
From: 5m
To: now
Step C: Add Reduce Expression
Click:
Add expression
Expression type:
Reduce
Input:
A
Function:
Mean
Name it:
B
Meaning:
Grafana takes CPU usage data from the last 5 minutes and calculates the average.
Step D: Add Threshold Expression
Click:
Add expression
Expression type:
Threshold
Input:
B
Condition:
IS ABOVE 80
Name it:
C
Set alert condition to:
C
Step E: Evaluation Behavior
Set:
Folder: Linux Monitoring Alerts
Evaluation group: linux-alerts-1m
Evaluate every: 1m
Pending period: 2m
Meaning:
Grafana checks every 1 minute. If CPU average stays above 80% for 2 minutes, alert fires.
Step F: Configure No Data and Error
Use:
No data state: No Data
Error state: Error
Step G: Labels
Add labels:
severity = warning
team = training
service = linux
host = linux-demo
metric = cpu
Labels help route and organize alerts.
Step H: Annotations
Add summary:
High CPU usage detected on linux-demo
Add description:
CPU active usage has been above 80% for more than 2 minutes. Check running processes, recent deployments, load average, and application workload.
Add a runbook URL if you have one.
Step I: Save Rule
Click:
Save rule and exit
Test CPU Alert
For demo, temporarily change threshold from:
80
to:
20
Then run:
stress --cpu 2 --timeout 180
Watch:
Alerting → Alert rules
Expected states:
Normal → Pending → Firing
After testing, restore threshold to:
80
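For reference, the whole rule above can also be expressed as a file-provisioned alert rule. The sketch below approximates Grafana's alert-rule provisioning format: the rule UID and datasource UID are placeholders, and field details may differ slightly between Grafana versions, so treat it as a reading aid for the Query → Reduce → Threshold chain rather than a drop-in file.

```yaml
# Hypothetical file: /etc/grafana/provisioning/alerting/linux-rules.yaml
apiVersion: 1
groups:
  - orgId: 1
    name: linux-alerts-1m
    folder: Linux Monitoring Alerts
    interval: 1m                      # evaluate every 1m
    rules:
      - uid: linux-high-cpu           # placeholder UID
        title: Linux – High CPU Usage
        condition: C                  # the threshold expression decides
        for: 2m                       # pending period
        noDataState: NoData
        execErrState: Error
        labels:
          severity: warning
          host: linux-demo
          metric: cpu
        annotations:
          summary: High CPU usage detected on linux-demo
        data:
          - refId: A                  # Graphite query
            datasourceUid: your-graphite-uid   # placeholder
            relativeTimeRange: { from: 300, to: 0 }   # last 5m
            model:
              target: telegraf.linux-demo.cpu.usage_active
          - refId: B                  # Reduce expression
            datasourceUid: __expr__
            model: { type: reduce, reducer: mean, expression: A }
          - refId: C                  # Threshold expression
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator: { type: gt, params: [80] }
```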
3.8 Alert Rule 2: High Memory Usage
Rule Name
Linux – High Memory Usage
Query A
Datasource:
Graphite
Query:
telegraf.linux-demo.mem.used_percent
Optional display alias:
alias(telegraf.linux-demo.mem.used_percent, 'Memory Used %')
Relative time range:
From: 5m
To: now
Reduce Expression B
Type: Reduce
Input: A
Function: Last
Why Last?
Memory usage is already a stable gauge. The most recent value is usually meaningful.
Threshold Expression C
Type: Threshold
Input: B
Condition: IS ABOVE 85
Evaluation
Evaluate every: 1m
Pending period: 3m
Labels
severity = warning
team = training
service = linux
host = linux-demo
metric = memory
Annotations
Summary:
High memory usage detected on linux-demo
Description:
Memory usage is above 85%. Check memory-heavy processes, application memory growth, cache usage, and swap activity.
Save
Click:
Save rule and exit
Optional Memory Test
To test memory safely, avoid exhausting the server.
Install stress if needed:
apt install -y stress
Run a small memory test:
stress --vm 1 --vm-bytes 256M --timeout 120
For small EC2 instances, reduce memory:
stress --vm 1 --vm-bytes 128M --timeout 120
For demo triggering, temporarily reduce threshold to a value slightly below current memory usage.
3.9 Alert Rule 3: High Disk Usage
Important Note
In this specific lab, the current Graphite metric path does not include disk mountpoint as a separate node.
Correct collected metric:
telegraf.linux-demo.disk.used_percent
Incorrect query for this lab:
highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)
Why the older query is incorrect:
Your Graphite storage does not contain paths like:
telegraf.linux-demo.disk.root.used_percent
telegraf.linux-demo.disk.var.used_percent
telegraf.linux-demo.disk.tmp.used_percent
It contains:
telegraf.linux-demo.disk.used_percent
So alerting should use the exact collected metric:
telegraf.linux-demo.disk.used_percent
Rule Name
Linux – High Disk Usage
Query A
Datasource:
Graphite
Query:
telegraf.linux-demo.disk.used_percent
Optional display alias:
alias(telegraf.linux-demo.disk.used_percent, 'Disk Used %')
Relative time range:
From: 5m
To: now
Reduce Expression B
Type: Reduce
Input: A
Function: Last
Why Last?
Disk usage percentage is a gauge. The latest value is usually the most useful value for alerting.
Threshold Expression C
Type: Threshold
Input: B
Condition: IS ABOVE 85
Evaluation
Evaluate every: 1m
Pending period: 5m
Disk issues usually do not need second-by-second alerting. A 5-minute pending period avoids noisy alerts.
Labels
severity = warning
team = training
service = linux
host = linux-demo
metric = disk
Annotations
Summary:
High disk usage detected on linux-demo
Description:
Disk usage is above 85%. Check large files, logs, Docker images, temporary files, and application data growth.
Save
Click:
Save rule and exit
Disk Alert Test
Create a temporary file:
dd if=/dev/zero of=/tmp/disk-alert-test.img bs=100M count=5
Check dashboard.
Remove after test:
rm -f /tmp/disk-alert-test.img
For a large disk, this may not trigger the alert. For demo, temporarily reduce threshold near current disk usage.
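Before running the dd test, it helps to estimate whether 500 MB will visibly move used_percent. The disk size in the sketch below is an assumption for illustration; substitute your instance's actual size.

```shell
# Back-of-the-envelope check: how much does the dd test file move
# disk.used_percent on an assumed 20 GB root disk?
file_mb=$((100 * 5))      # dd: bs=100M count=5  ->  500 MB
disk_mb=$((20 * 1024))    # assumed 20 GB disk; adjust to your instance

delta=$(awk -v f="$file_mb" -v d="$disk_mb" 'BEGIN { printf "%.1f", f / d * 100 }')
echo "Test file adds about ${delta}% to disk usage"
```

On a 20 GB disk the file adds only about 2.4%, which confirms why a threshold lowered to just below current usage is usually needed for the demo.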
3.10 Alert Rule 4: High Load Average
Rule Name
Linux – High Load Average
Query A
Datasource:
Graphite
Query:
telegraf.linux-demo.system.load1
Optional display alias:
alias(telegraf.linux-demo.system.load1, 'Load 1m')
Relative time range:
From: 5m
To: now
Reduce Expression B
Type: Reduce
Input: A
Function: Mean
Threshold Expression C
For demo, use:
IS ABOVE 2
For production, threshold depends on CPU count.
Simple teaching rule:
If load average is consistently higher than CPU core count, investigate.
For a 2-vCPU system:
Warning: above 2
Critical: above 4
Evaluation
Evaluate every: 1m
Pending period: 3m
Labels
severity = warning
team = training
service = linux
host = linux-demo
metric = load
Annotations
Summary:
High load average detected on linux-demo
Description:
1-minute load average is above the expected threshold. Check CPU usage, blocked processes, disk I/O wait, and application workload.
Save
Click:
Save rule and exit
Test Load Alert
Run:
stress --cpu 2 --timeout 180
Watch the alert state.
If it does not trigger, temporarily reduce threshold:
IS ABOVE 0.5
After testing, restore:
IS ABOVE 2
3.11 Alert Rule 5: Zombie Processes Detected
Rule Name
Linux – Zombie Processes Detected
Query A
Datasource:
Graphite
Query:
telegraf.linux-demo.processes.zombies
Optional display alias:
alias(telegraf.linux-demo.processes.zombies, 'Zombie Processes')
Relative time range:
From: 5m
To: now
Reduce Expression B
Type: Reduce
Input: A
Function: Last
Threshold Expression C
Type: Threshold
Input: B
Condition: IS ABOVE 0
Meaning:
If zombie process count is greater than 0, alert fires.
Evaluation
Evaluate every: 1m
Pending period: 1m
Labels
severity = warning
team = training
service = linux
host = linux-demo
metric = process
Annotations
Summary:
Zombie process detected on linux-demo
Description:
One or more zombie processes exist. Check parent processes and application process management.
Save
Click:
Save rule and exit
3.12 Link Alert Rules to Dashboard Panels
Grafana allows alert rules to be linked to dashboard panels so users can see alert state directly on panels.
For each alert rule:
Go to:
Alerting → Alert rules
Open the alert rule.
Find:
Configure notification message
or dashboard/panel link section.
Click:
Link dashboard and panel
Select dashboard:
Linux Monitoring – Graphite Telegraf
Select matching panel:
Alert Rule                          Dashboard Panel
Linux – High CPU Usage              CPU Usage %
Linux – High Memory Usage           Memory Used %
Linux – High Disk Usage             Disk Usage %
Linux – High Load Average           System Load Average
Linux – Zombie Processes Detected   Zombie Processes
Save the rule.
3.13 Verify Alert Rule State
Go to:
Alerting → Alert rules
You should see rule states.
Common states:
State     Meaning
Normal    Condition is false
Pending   Condition is true but pending period has not completed
Firing    Condition is true and pending period completed
No Data   Query returned no data
Error     Query or datasource failed
Paused    Rule evaluation is disabled
Expected first state:
Normal
If you lowered thresholds for testing, you may see:
Pending
then:
Firing
3.14 Testing All Alerts Safely
CPU Test
stress --cpu 2 --timeout 180
Expected:
CPU alert may go Pending/Firing if threshold is low enough.
Memory Test
stress --vm 1 --vm-bytes 256M --timeout 120
For small instances:
stress --vm 1 --vm-bytes 128M --timeout 120
Expected:
Memory panel increases. Memory alert may fire if threshold is low enough.
Disk Test
dd if=/dev/zero of=/tmp/disk-alert-test.img bs=100M count=5
Remove test file:
rm -f /tmp/disk-alert-test.img
Expected:
Disk usage may increase depending on disk size.
Load Test
stress --cpu 2 --timeout 180
Expected:
Load average increases gradually.
Zombie Test
Zombie process creation is not recommended for beginner labs unless you provide controlled code.
For training, test this alert by temporarily changing threshold:
IS ABOVE -1
Because normal zombie count is usually 0, this makes the alert condition true.
After testing, restore:
IS ABOVE 0
3.15 Recommended Alert Rule Summary Table
Alert Rule          Query                                   Reducer   Threshold   Evaluate   Pending
High CPU Usage      telegraf.linux-demo.cpu.usage_active    Mean      > 80        1m         2m
High Memory Usage   telegraf.linux-demo.mem.used_percent    Last      > 85        1m         3m
High Disk Usage     telegraf.linux-demo.disk.used_percent   Last      > 85        1m         5m
High Load Average   telegraf.linux-demo.system.load1        Mean      > 2         1m         3m
Zombie Processes    telegraf.linux-demo.processes.zombies   Last      > 0         1m         1m
3.16 Recommended Labels and Annotations
Use consistent labels for all rules.
Labels
team = training
service = linux
host = linux-demo
datasource = graphite
Per alert:
metric = cpu
metric = memory
metric = disk
metric = load
metric = process
Severity:
severity = warning
For production, you can add:
environment = demo
or:
environment = production
Annotation Template
Use this structure:
summary = Short human-readable alert title
description = What happened, why it matters, and first troubleshooting steps
runbook_url = Link to internal runbook
Example:
summary = High CPU usage detected on linux-demo
description = CPU active usage has been above 80% for more than 2 minutes. Check top processes, load average, recent deployments, and application traffic.
3.17 Common Student Mistakes
Mistake 1: Using old assumed CPU metric path
Incorrect:
telegraf.linux-demo.cpu.cpu-total.usage_active
Correct:
telegraf.linux-demo.cpu.usage_active
Reason:
Your current Graphite storage does not include cpu-total in the metric path.
Mistake 2: Using old assumed disk wildcard path
Incorrect:
telegraf.linux-demo.disk.*.used_percent
Incorrect for this lab:
highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)
Correct:
telegraf.linux-demo.disk.used_percent
Reason:
Your current Graphite storage does not include mountpoint as a separate path node.
Mistake 3: Forgetting Reduce expression
Grafana needs a single value for threshold comparison.
Correct flow:
A = Graphite query
B = Reduce A
C = Threshold B
Condition = C
Mistake 4: Wrong datasource
Make sure Query A uses:
Graphite
not Prometheus, Loki, or TestData.
Mistake 5: Wrong metric path
Check in Explore first:
telegraf.linux-demo.cpu.usage_active
If no data, discover with:
telegraf.*
Or run this command on Graphite server:
docker exec graphite find /opt/graphite/storage/whisper/telegraf -type f | sed 's#/opt/graphite/storage/whisper/##; s#/#.#g; s#\.wsp$##' | sort
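To see what that pipeline does without a running Graphite container, you can apply the same sed conversion to a sample whisper path:

```shell
# Demo of the whisper-path-to-metric-path conversion on one sample file.
sample=/opt/graphite/storage/whisper/telegraf/linux-demo/cpu/usage_active.wsp

# Strip the storage prefix, turn "/" into ".", drop the ".wsp" suffix.
metric=$(echo "$sample" | sed 's#/opt/graphite/storage/whisper/##; s#/#.#g; s#\.wsp$##')
echo "$metric"
```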
Mistake 6: Time range too short
If query range is too short, Grafana may not get enough datapoints.
Use:
From: 5m
To: now
Mistake 7: Expecting notification without contact point
Alert rule can fire but no message will be received if contact point/notification policy is not configured.
Check:
Alerting → Contact points
Alerting → Notification policies
3.18 Troubleshooting Guide
Problem: Alert rule shows No Data
Check Graphite query in Explore:
telegraf.linux-demo.cpu.usage_active
Check Graphite directly:
curl "http://localhost:8080/render?target=telegraf.linux-demo.cpu.usage_active&from=-10min&format=json"
Check Telegraf logs:
docker logs -f telegraf
Check Graphite files:
docker exec -it graphite find /opt/graphite/storage/whisper/telegraf -type f | head
Problem: Alert stays Normal during test
Check these:
- Is threshold too high?
- Is pending period too long?
- Is query returning expected value?
- Is time range set to 5m?
- Did you save the rule?
For demo, temporarily lower threshold.
Problem: Alert fires but no notification arrives
Check:
- Contact point test works
- Notification policy routes to contact point
- Alert has not been muted
- Alert is not grouped/delayed
- SMTP/webhook/Slack configuration is valid
Problem: Disk alert does not show mount name
In this lab, disk mountpoint is not available in the Graphite metric path because of the current Telegraf Graphite template.
Current available disk alert query:
telegraf.linux-demo.disk.used_percent
If you want mount-level disk metrics later, change the Telegraf Graphite output template to include mountpoint tags and regenerate metrics. But for this lab, use the actual collected metric path above.
3.19 Lab Completion Checklist
Students should complete:
☐ Created contact point
☐ Tested contact point
☐ Configured notification policy
☐ Created alert folder
☐ Created CPU alert
☐ Created memory alert
☐ Created disk alert
☐ Created load alert
☐ Created zombie process alert
☐ Added labels
☐ Added annotations
☐ Linked alerts to dashboard panels
☐ Tested at least one alert
☐ Restored production-like thresholds after testing
3.20 Final Student Summary
At the end of this lab, students should understand:
- Grafana can create alert rules from Graphite metrics.
- A good alert rule uses Query → Reduce → Threshold.
- Evaluation interval controls how often the rule runs.
- Pending period prevents noisy alerts from short spikes.
- Contact points define where notifications are sent.
- Notification policies define how alerts are routed.
- Labels help organize and route alerts.
- Annotations make alerts understandable for humans.
- Alert rules should be tested safely before production use.
- Metric paths must match the real Graphite storage structure.
- In this lab, the correct metric pattern is telegraf.linux-demo.<measurement>.<field>.
The final alerting setup is:
Graphite metric
  ↓
Grafana alert query
  ↓
Reduce expression
  ↓
Threshold expression
  ↓
Evaluation group
  ↓
Alert state
  ↓
Notification policy
  ↓
Contact point
This completes the full student workflow:
Step 1: Explore metrics
Step 2: Build dashboard
Step 3: Create alerts