Grafana Lab with Graphite Datasource metrics – Alerts

Below is Step 3 only: a deep, student-ready lab guide for creating Grafana 13.x alerts using Graphite + Telegraf Linux metrics.

This guide assumes you already completed:

Step 1: Explored Graphite metrics in Grafana Explore
Step 2: Created Linux Monitoring Dashboard
Datasource: Graphite
Metrics prefix: telegraf
Host: linux-demo
Dashboard: Linux Monitoring - Graphite Telegraf

Grafana alert rules have three major parts: a query that selects data, a condition/threshold that decides when to fire, and evaluation settings that decide how often the rule is checked and how long the condition must hold. Grafana’s current alerting docs define alert rules this way and explain that evaluation includes an evaluation group, a pending period, and keep-firing behavior. (Grafana Labs)


Step 3: Create Alerts in Grafana 13.x for Graphite Linux Metrics

Lab Objective

By the end of this lab, students will create alerts for:

1. High CPU Usage
2. High Memory Usage
3. High Disk Usage
4. High Load Average
5. Zombie Processes Detected

Students will also learn:

1. What an alert rule is
2. What Reduce expression does
3. What Threshold expression does
4. How to configure evaluation interval
5. How to configure contact points
6. How to route alert notifications
7. How to test alerts safely
8. How to troubleshoot No Data and Error states

3.1 Alerting Architecture

Your current monitoring flow is:

Linux Host
   ↓
Telegraf
   ↓
Graphite Carbon
   ↓
Whisper Storage
   ↓
Graphite Web
   ↓
Grafana Datasource
   ↓
Grafana Alert Rule
   ↓
Contact Point
   ↓
Email / Slack / Webhook / Teams

In simple words:

Telegraf collects metrics.
Graphite stores metrics.
Grafana queries Graphite.
Grafana evaluates alert rules.
Grafana sends notifications when rules fire.

3.2 Important Alerting Concepts

Alert Rule

An alert rule is the main object in Grafana Alerting.

It contains:

1. Query
2. Expression
3. Condition
4. Evaluation interval
5. Pending period
6. Labels
7. Notification settings

Grafana supports alert rules that evaluate queries and expressions over time, and alert rules are the central component of the alerting system. (Grafana Labs)


Query

The query gets data from Graphite.

Example:

telegraf.linux-demo.cpu.cpu-total.usage_active

This returns CPU usage time-series data.


Reduce Expression

Grafana alerting cannot directly alert on a full time-series graph. It normally reduces a time series into a single value.

Common reducer functions:

Last: most recent value
Mean: average value over the selected range
Max: maximum value over the selected range
Min: minimum value over the selected range

For most infrastructure alerts, use:

Mean

or:

Last

Recommended for this lab:

CPU: Mean
Memory: Last
Disk: Max
Load: Mean
Zombie processes: Last

Threshold Expression

Threshold checks whether the reduced value crosses a limit.

Example:

WHEN B IS ABOVE 80

Meaning:

If the reduced CPU value is above 80, alert should fire.

Evaluation Group

Evaluation group controls how often Grafana checks the alert.

For this lab:

Every 1 minute

Pending Period

Pending period controls how long the condition must remain true before firing.

Example:

For 2 minutes

Meaning:

The condition must be true for 2 continuous minutes before Grafana fires the alert.

This prevents alerts from firing due to short temporary spikes.
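The interaction between evaluation interval and pending period can be made concrete with a toy model (illustrative only, not Grafana's implementation), where each list element is one 1-minute evaluation:

```python
# Toy model of the pending-period behavior described above. The rule
# fires only after the condition has been true for more than
# `pending_evals` consecutive evaluations.

def alert_states(samples, threshold=80, pending_evals=2):
    states, breached = [], 0
    for value in samples:
        if value > threshold:
            breached += 1
            states.append("Firing" if breached > pending_evals else "Pending")
        else:
            breached = 0
            states.append("Normal")
    return states

# A single 1-minute spike never fires; a sustained breach does.
print(alert_states([50, 95, 50, 95, 95, 95]))
# ['Normal', 'Pending', 'Normal', 'Pending', 'Pending', 'Firing']
```

Note how the isolated spike at the second sample only reaches Pending, while the sustained breach at the end transitions to Firing.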


No Data and Error Handling

Every alert rule should define what happens if Grafana gets no data or query errors.

Recommended beginner setup:

No data state: No Data
Error state: Error

For production, teams often tune this based on monitoring maturity.


3.3 Recommended Alert Rules for This Lab

Use these thresholds for demo and learning.

Alert name | Query | Threshold
High CPU Usage | telegraf.linux-demo.cpu.cpu-total.usage_active | above 80%
High Memory Usage | telegraf.linux-demo.mem.used_percent | above 85%
High Disk Usage | highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1) | above 85%
High Load Average | telegraf.linux-demo.system.load1 | above 2
Zombie Processes | telegraf.linux-demo.processes.zombies | above 0

For student demos, you can temporarily lower thresholds to trigger alerts quickly.

Example:

CPU threshold: 20%
Memory threshold: 40%

After testing, restore proper thresholds.


3.4 Create a Contact Point

Before creating alert rules, configure where Grafana should send notifications.

Grafana contact points contain the configuration for sending alert notifications, such as email, Slack, webhook, Microsoft Teams, and other integrations. (Grafana Labs)

Steps

Go to:

Alerting → Contact points

Click:

Add contact point

Name:

student-demo-email

Integration type:

Email

Add email address:

student@example.com

For your own lab, use your real email.

Click:

Test

Then:

Save contact point

If Email Is Not Configured

In many self-hosted Grafana setups, email will not work unless SMTP is configured.

For learning, use a Webhook contact point instead.

Example contact point:

Name: student-demo-webhook
Type: Webhook
URL: https://webhook.site/your-generated-url

You can open webhook.site, copy the unique URL, and paste it into Grafana.

This is very useful for students because they can see alert payloads immediately without configuring SMTP.
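Once alerts route to webhook.site, students will see raw JSON payloads. A small sketch of pulling the useful fields out of one follows; the field names ("alerts", "labels", "annotations") match Grafana's webhook format, but exact payloads vary by version, so treat this as illustrative:

```python
import json

# Summarize a Grafana webhook notification body into one line per alert.
# Field names follow Grafana's webhook payload shape; verify against a
# real payload from webhook.site for your version.

def summarize_alerts(body):
    payload = json.loads(body)
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "(no summary)")
        lines.append(f"{alert.get('status', '?')}: "
                     f"{labels.get('alertname', '?')} on "
                     f"{labels.get('host', '?')} - {summary}")
    return lines

sample = json.dumps({
    "status": "firing",
    "alerts": [{
        "status": "firing",
        "labels": {"alertname": "Linux - High CPU Usage", "host": "linux-demo"},
        "annotations": {"summary": "High CPU usage detected on linux-demo"},
    }],
})
print(summarize_alerts(sample))
```

This is a handy classroom exercise: paste a payload copied from webhook.site into `sample` and compare the labels and annotations with what was configured in the rule.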


3.5 Create Notification Policy

Notification policies decide how alerts are routed to contact points. Grafana documentation explains that notification policies route alert instances to contact points using label matchers and can group multiple alerts together to reduce noise. (Grafana Labs)

For this beginner lab, keep it simple.

Go to:

Alerting → Notification policies

Edit the default policy.

Set default contact point:

student-demo-email

or:

student-demo-webhook

Save.

This means all alerts without custom routing will go to that contact point.


3.6 Create Folder for Alert Rules

Go to:

Dashboards → New folder

Create folder:

Linux Monitoring Alerts

Why?

Grafana-managed alert rules are usually stored under folders. This keeps alert rules organized.


3.7 Alert Rule 1: High CPU Usage

Purpose

This alert fires when CPU usage remains high.

Go to Alert Rules

Navigate to:

Alerting → Alert rules

Click:

New alert rule

Step A: Rule Name

Use:

Linux - High CPU Usage

Step B: Select Data Source Query

In Query A, select datasource:

Graphite

Query:

alias(telegraf.linux-demo.cpu.cpu-total.usage_active, 'CPU Active %')

Set relative time range:

From: 5m
To: now

Step C: Add Reduce Expression

Click:

Add expression

Expression type:

Reduce

Input:

A

Function:

Mean

Name it:

B

Meaning:

Grafana takes CPU data from the last 5 minutes and calculates the average.

Step D: Add Threshold Expression

Click:

Add expression

Expression type:

Threshold

Input:

B

Condition:

IS ABOVE 80

Name it:

C

Set alert condition to:

C

Step E: Evaluation Behavior

Set:

Folder: Linux Monitoring Alerts
Evaluation group: linux-alerts-1m
Evaluate every: 1m
Pending period: 2m

Meaning:

Grafana checks every 1 minute.
If CPU average stays above 80% for 2 minutes, alert fires.

Step F: Configure No Data and Error

Use:

No data state: No Data
Error state: Error

Step G: Labels

Add labels:

severity = warning
team = training
service = linux
host = linux-demo
metric = cpu

Labels help route and organize alerts.


Step H: Annotations

Add summary:

High CPU usage detected on linux-demo

Add description:

CPU active usage has been above 80% for more than 2 minutes. Check running processes, recent deployments, load average, and application workload.

Add runbook URL if you have one:

https://example.com/runbooks/high-cpu

Step I: Save Rule

Click:

Save rule and exit

Test CPU Alert

For demo, temporarily change threshold from:

80

to:

20

Then run:

stress --cpu 2 --timeout 180

Watch:

Alerting → Alert rules

Expected states:

Normal → Pending → Firing

After testing, restore threshold to:

80

3.8 Alert Rule 2: High Memory Usage

Rule Name

Linux - High Memory Usage

Query A

Datasource:

Graphite

Query:

alias(telegraf.linux-demo.mem.used_percent, 'Memory Used %')

Relative time range:

From: 5m
To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Last

Why Last?

Memory usage is already a stable gauge. The most recent value is usually meaningful.

Threshold Expression C

Type: Threshold
Input: B
Condition: IS ABOVE 85

Evaluation

Evaluate every: 1m
Pending period: 3m

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = memory

Annotations

Summary:

High memory usage detected on linux-demo

Description:

Memory usage is above 85%. Check memory-heavy processes, application memory growth, cache usage, and swap activity.

Save

Click:

Save rule and exit

Optional Memory Test

To test memory safely, avoid exhausting the server.

Install stress if needed:

apt install -y stress

Run a small memory test:

stress --vm 1 --vm-bytes 256M --timeout 120

For small EC2 instances, reduce memory:

stress --vm 1 --vm-bytes 128M --timeout 120

For demo triggering, temporarily reduce threshold to a value slightly below current memory usage.


3.9 Alert Rule 3: High Disk Usage

Important Note

Disk metrics may return multiple filesystems. For alerting, use a query that returns the highest disk usage.

Graphite query:

highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)

This selects the disk series with the highest current disk usage.


Rule Name

Linux - High Disk Usage

Query A

Datasource:

Graphite

Query:

highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)

Relative time range:

From: 5m
To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Max

Why Max?

For disk, the safest logic is:

Alert if any disk/mount point is too full.

Threshold Expression C

Type: Threshold
Input: B
Condition: IS ABOVE 85

Evaluation

Evaluate every: 1m
Pending period: 5m

Disk issues usually do not need second-by-second alerting. A 5-minute pending period avoids noisy alerts.


Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = disk

Annotations

Summary:

High disk usage detected on linux-demo

Description:

One or more filesystems are above 85% usage. Check large files, logs, Docker images, temporary files, and application data growth.

Save

Click:

Save rule and exit

Disk Alert Test

Create a temporary file:

dd if=/dev/zero of=/tmp/disk-alert-test.img bs=100M count=5

Check dashboard.

Remove after test:

rm -f /tmp/disk-alert-test.img

For a large disk, this may not trigger the alert. For demo, temporarily reduce threshold near current disk usage.


3.10 Alert Rule 4: High Load Average

Rule Name

Linux - High Load Average

Query A

Datasource:

Graphite

Query:

alias(telegraf.linux-demo.system.load1, 'Load 1m')

Relative time range:

From: 5m
To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Mean

Threshold Expression C

For demo, use:

IS ABOVE 2

For production, threshold depends on CPU count.

Simple teaching rule:

If load average is consistently higher than CPU core count, investigate.

For a 2-vCPU system:

Warning: above 2
Critical: above 4

Evaluation

Evaluate every: 1m
Pending period: 3m

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = load

Annotations

Summary:

High load average detected on linux-demo

Description:

1-minute load average is above the expected threshold. Check CPU usage, blocked processes, disk I/O wait, and application workload.

Save

Click:

Save rule and exit

Test Load Alert

Run:

stress --cpu 2 --timeout 180

Watch the alert state.

If it does not trigger, temporarily reduce threshold:

IS ABOVE 0.5

After testing, restore:

IS ABOVE 2

3.11 Alert Rule 5: Zombie Processes Detected

Rule Name

Linux - Zombie Processes Detected

Query A

Datasource:

Graphite

Query:

alias(telegraf.linux-demo.processes.zombies, 'Zombie Processes')

Relative time range:

From: 5m
To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Last

Threshold Expression C

Type: Threshold
Input: B
Condition: IS ABOVE 0

Meaning:

If zombie process count is greater than 0, alert fires.

Evaluation

Evaluate every: 1m
Pending period: 1m

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = process

Annotations

Summary:

Zombie process detected on linux-demo

Description:

One or more zombie processes exist. Check parent processes and application process management.

Save

Click:

Save rule and exit

3.12 Link Alert Rules to Dashboard Panels

Grafana allows alert rules to be linked to dashboard panels so users can see alert state directly on panels. Grafana’s docs explain that you can link an alert rule to an existing dashboard and panel while configuring the notification message section. (Grafana Labs)

For each alert rule:

Go to:

Alerting → Alert rules

Open the alert rule.

Find:

Configure notification message

or dashboard/panel link section.

Click:

Link dashboard and panel

Select dashboard:

Linux Monitoring - Graphite Telegraf

Select matching panel:

Alert rule | Dashboard panel
Linux - High CPU Usage | CPU Usage %
Linux - High Memory Usage | Memory Used %
Linux - High Disk Usage | Disk Usage %
Linux - High Load Average | System Load Average
Linux - Zombie Processes Detected | Zombie Processes

Save the rule.


3.13 Verify Alert Rule State

Go to:

Alerting → Alert rules

You should see rule states.

Common states:

State | Meaning
Normal | Condition is false
Pending | Condition is true but the pending period has not completed
Firing | Condition is true and the pending period has completed
No Data | Query returned no data
Error | Query or datasource failed
Paused | Rule evaluation is disabled

Expected first state:

Normal

If you lowered thresholds for testing, you may see:

Pending

then:

Firing

3.14 Testing All Alerts Safely

CPU Test

stress --cpu 2 --timeout 180

Expected:

CPU alert may go Pending/Firing if threshold is low enough.
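If `stress` is not installed, a Python fallback can generate similar load. This is a sketch: the worker count and duration are parameters you pick to mirror `stress --cpu 2 --timeout 180`.

```python
import multiprocessing
import time

# Fallback CPU load generator: spin `workers` processes in a busy
# loop for `seconds`, each keeping one core near 100%.

def burn(seconds):
    end = time.monotonic() + seconds
    while time.monotonic() < end:
        pass  # busy-wait

def generate_load(workers=2, seconds=180.0):
    procs = [multiprocessing.Process(target=burn, args=(seconds,))
             for _ in range(workers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

# generate_load(workers=2, seconds=180.0)  # equivalent of the stress command
```

As with `stress`, keep the duration short on shared lab machines.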

Memory Test

stress --vm 1 --vm-bytes 256M --timeout 120

For small instances:

stress --vm 1 --vm-bytes 128M --timeout 120

Expected:

Memory panel increases.
Memory alert may fire if threshold is low enough.

Disk Test

dd if=/dev/zero of=/tmp/disk-alert-test.img bs=100M count=5

Remove test file:

rm -f /tmp/disk-alert-test.img

Expected:

Disk usage may increase depending on disk size.

Load Test

stress --cpu 2 --timeout 180

Expected:

Load average increases gradually.

Zombie Test

Zombie process creation is not recommended for beginner labs unless you provide controlled code.

For training, test this alert by temporarily changing threshold:

IS ABOVE -1

Because normal zombie count is usually 0, this makes the alert condition true.

After testing, restore:

IS ABOVE 0
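For instructors who do want controlled code, the sketch below creates a single short-lived zombie and reaps it before returning, so nothing is left behind. It is Linux-only (it reads `/proc`) and should be run outside production:

```python
import os
import time

# Controlled zombie demo: the child exits immediately, but the parent
# delays reaping it, so for `lifetime` seconds the child sits in state
# "Z" and telegraf's processes.zombies metric should rise above 0.

def make_zombie(lifetime=120.0):
    pid = os.fork()
    if pid == 0:
        os._exit(0)          # child exits; parent has not wait()ed yet
    time.sleep(0.2)          # give the child time to exit
    with open(f"/proc/{pid}/stat") as f:
        state = f.read().rsplit(")", 1)[1].split()[0]   # "Z" = zombie
    print(f"child {pid} state: {state}")
    time.sleep(lifetime)     # keep the zombie visible to Telegraf
    os.waitpid(pid, 0)       # reap the child; the zombie disappears
    return state

# make_zombie(lifetime=120.0)  # run while watching the zombie alert
```

With a lifetime of a couple of minutes and the 1m evaluation/pending settings above, the alert should go Pending and then Firing, and return to Normal after the child is reaped.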

3.15 Recommended Alert Rule Summary Table

Alert rule | Query | Reducer | Threshold | Evaluate | Pending
High CPU Usage | telegraf.linux-demo.cpu.cpu-total.usage_active | Mean | > 80 | 1m | 2m
High Memory Usage | telegraf.linux-demo.mem.used_percent | Last | > 85 | 1m | 3m
High Disk Usage | highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1) | Max | > 85 | 1m | 5m
High Load Average | telegraf.linux-demo.system.load1 | Mean | > 2 | 1m | 3m
Zombie Processes | telegraf.linux-demo.processes.zombies | Last | > 0 | 1m | 1m
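Rules like those in the table above can also be provisioned from a file instead of clicked through the UI. The sketch below follows the general shape of Grafana's alerting file provisioning, but field names differ between versions, and the datasource UID is a placeholder; the reliable workflow is to build one rule in the UI, use its export-to-YAML option, and adapt the result.

```yaml
# Sketch of Grafana alert-rule file provisioning (verify against a
# rule exported from your own Grafana version). Typically placed under
# /etc/grafana/provisioning/alerting/.
apiVersion: 1
groups:
  - orgId: 1
    name: linux-alerts-1m
    folder: Linux Monitoring Alerts
    interval: 1m
    rules:
      - uid: linux-high-cpu
        title: Linux - High CPU Usage
        condition: C
        for: 2m
        labels:
          severity: warning
          host: linux-demo
        annotations:
          summary: High CPU usage detected on linux-demo
        data:
          - refId: A
            datasourceUid: YOUR_GRAPHITE_UID   # placeholder: your datasource UID
            relativeTimeRange: { from: 300, to: 0 }
            model:
              target: telegraf.linux-demo.cpu.cpu-total.usage_active
          - refId: B
            datasourceUid: __expr__
            model: { type: reduce, reducer: mean, expression: A }
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator: { type: gt, params: [80] }
```

File provisioning is useful once students have the concepts down: the same five rules can be version-controlled and recreated on a fresh Grafana instance.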

3.16 Recommended Labels and Annotations

Use consistent labels for all rules.

Labels

team = training
service = linux
host = linux-demo
datasource = graphite

Per alert:

metric = cpu
metric = memory
metric = disk
metric = load
metric = process

Severity:

severity = warning

For production, you can add:

environment = demo

or:

environment = production

Annotation Template

Use this structure:

summary = Short human-readable alert title
description = What happened, why it matters, and first troubleshooting steps
runbook_url = Link to internal runbook

Example:

summary = High CPU usage detected on linux-demo
description = CPU active usage has been above 80% for more than 2 minutes. Check top processes, load average, recent deployments, and application traffic.

3.17 Common Student Mistakes

Mistake 1: Alert rule query returns multiple series

Example:

telegraf.linux-demo.disk.*.used_percent

This may return many filesystems.

Fix for beginner lab:

highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)

Mistake 2: Forgetting Reduce expression

Grafana needs a single value for threshold comparison.

Correct flow:

A = Graphite query
B = Reduce A
C = Threshold B
Condition = C

Mistake 3: Wrong datasource

Make sure Query A uses:

Graphite

not Prometheus, Loki, or TestData.


Mistake 4: Wrong metric path

Check in Explore first:

telegraf.linux-demo.cpu.cpu-total.usage_active

If no data, discover with:

telegraf.*

Mistake 5: Time range too short

If query range is too short, Grafana may not get enough datapoints.

Use:

From: 5m
To: now

Mistake 6: Expecting notification without contact point

An alert rule can fire, yet no message will be received if no contact point or notification policy is configured.

Check:

Alerting → Contact points
Alerting → Notification policies

3.18 Troubleshooting Guide

Problem: Alert rule shows No Data

Check Graphite query in Explore:

telegraf.linux-demo.cpu.cpu-total.usage_active

Check Graphite directly:

curl "http://localhost:8080/render?target=telegraf.linux-demo.cpu.cpu-total.usage_active&from=-10min&format=json"
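The JSON that `curl` returns can also be checked programmatically. The helper below assumes the Graphite render API's JSON shape, a list of `{"target": ..., "datapoints": [[value, timestamp], ...]}` objects with nulls for missing points:

```python
import json

# Extract the most recent non-null datapoint from a Graphite render
# response. Returns (target, value, timestamp), or None if the series
# is empty or all-null (which would explain a No Data alert state).

def latest_datapoint(render_json):
    for series in json.loads(render_json):
        points = [p for p in series["datapoints"] if p[0] is not None]
        if points:
            value, ts = points[-1]
            return series["target"], value, ts
    return None

sample = json.dumps([{
    "target": "telegraf.linux-demo.cpu.cpu-total.usage_active",
    "datapoints": [[12.5, 1700000000], [None, 1700000010], [14.0, 1700000020]],
}])
print(latest_datapoint(sample))
```

Feed it the body of the `curl` command above: a `None` result means Graphite has no recent datapoints, which points at Telegraf or Carbon rather than at the Grafana rule.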

Check Telegraf logs:

docker logs -f telegraf

Check Graphite files:

docker exec -it graphite find /opt/graphite/storage/whisper/telegraf -type f | head

Problem: Alert stays Normal during test

Check these:

1. Is threshold too high?
2. Is pending period too long?
3. Is query returning expected value?
4. Is time range set to 5m?
5. Did you save the rule?

For demo, temporarily lower threshold.


Problem: Alert fires but no notification arrives

Check:

1. Contact point test works
2. Notification policy routes to contact point
3. Alert has not been muted
4. Alert is not grouped/delayed
5. SMTP/webhook/Slack configuration is valid

Problem: Disk alert not showing mount name

For beginner lab, the alert checks highest disk usage but may not clearly show which mount caused it.

Better dashboard query:

aliasByNode(telegraf.linux-demo.disk.*.used_percent, 3)

Better alert query for single filesystem:

telegraf.linux-demo.disk.root.used_percent

Use the exact filesystem name discovered in Explore.


3.19 Lab Completion Checklist

Students should complete:

[ ] Created contact point
[ ] Tested contact point
[ ] Configured notification policy
[ ] Created alert folder
[ ] Created CPU alert
[ ] Created memory alert
[ ] Created disk alert
[ ] Created load alert
[ ] Created zombie process alert
[ ] Added labels
[ ] Added annotations
[ ] Linked alerts to dashboard panels
[ ] Tested at least one alert
[ ] Restored production-like thresholds after testing

3.20 Final Student Summary

At the end of this lab, students should understand:

Grafana can create alert rules from Graphite metrics.
A good alert rule uses Query → Reduce → Threshold.
Evaluation interval controls how often the rule runs.
Pending period prevents noisy alerts from short spikes.
Contact points define where notifications are sent.
Notification policies define how alerts are routed.
Labels help organize and route alerts.
Annotations make alerts understandable for humans.
Alert rules should be tested safely before production use.

The final alerting setup is:

Graphite metric
   ↓
Grafana alert query
   ↓
Reduce expression
   ↓
Threshold expression
   ↓
Evaluation group
   ↓
Alert state
   ↓
Notification policy
   ↓
Contact point

This completes the full student workflow:

Step 1: Explore metrics
Step 2: Build dashboard
Step 3: Create alerts