Grafana Lab with Graphite Datasource metrics – Alerts


Below is Step 3 only: a deep, student-ready lab guide for creating Grafana 13.x alerts using Graphite + Telegraf Linux metrics.

This guide assumes you already completed:

Step 1: Explored Graphite metrics in Grafana Explore
Step 2: Created Linux Monitoring Dashboard

Datasource: Graphite
Metrics prefix: telegraf
Host: linux-demo
Dashboard: Linux Monitoring – Graphite Telegraf

Very Important Metric Validation

Based on the actual metrics collected in your Graphite storage, this lab uses the following metric structure:

telegraf.linux-demo.<measurement>.<field>

Examples:

telegraf.linux-demo.cpu.usage_active
telegraf.linux-demo.mem.used_percent
telegraf.linux-demo.disk.used_percent
telegraf.linux-demo.system.load1
telegraf.linux-demo.processes.zombies

Do not use the older assumed paths below in this lab:

telegraf.linux-demo.cpu.cpu-total.usage_active
telegraf.linux-demo.disk.*.used_percent
telegraf.linux-demo.net.eth0.bytes_recv
telegraf.linux-demo.net.ens5.bytes_recv

Reason:

Your current Telegraf Graphite output stores metrics without the CPU tag, disk mountpoint tag, or network interface tag in the Graphite path.

Grafana alert rules are based on three major parts: a query that selects data, a condition/threshold that decides when to fire, and evaluation settings that decide how often and how long the condition must be true. Grafana’s alerting model uses query, reduce expression, threshold expression, evaluation interval, and pending period.
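These three parts can be sketched in plain Python. This is a toy model for teaching, not Grafana's real evaluation engine, and the sample values are invented:

```python
# Toy model of a Grafana alert rule: query -> reduce -> threshold.
# Illustrative only; Grafana's actual engine evaluates server-side.

def query():
    # Stand-in for the Graphite query result: datapoints from the last 5 minutes.
    return [72.0, 85.5, 91.2, 88.0, 83.3]  # invented CPU usage samples

def reduce_mean(series):
    # Collapse the time series into a single number (the Reduce expression).
    return sum(series) / len(series)

def threshold_above(value, limit):
    # Compare the reduced value against a limit (the Threshold expression).
    return value > limit

series = query()                        # A: query
reduced = reduce_mean(series)           # B: reduce expression
firing = threshold_above(reduced, 80)   # C: threshold expression
print(f"mean={reduced:.1f}, firing={firing}")
```

In real Grafana, the evaluation interval and pending period then decide how often this pipeline runs and how long C must stay true before the alert fires.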


Step 3: Create Alerts in Grafana 13.x for Graphite Linux Metrics

Lab Objective

By the end of this lab, students will create alerts for:

  1. High CPU Usage
  2. High Memory Usage
  3. High Disk Usage
  4. High Load Average
  5. Zombie Processes Detected

Students will also learn:

  1. What an alert rule is
  2. What Reduce expression does
  3. What Threshold expression does
  4. How to configure evaluation interval
  5. How to configure contact points
  6. How to route alert notifications
  7. How to test alerts safely
  8. How to troubleshoot No Data and Error states

3.1 Alerting Architecture

Your current monitoring flow is:

Linux Host
  ↓
Telegraf
  ↓
Graphite Carbon
  ↓
Whisper Storage
  ↓
Graphite Web
  ↓
Grafana Datasource
  ↓
Grafana Alert Rule
  ↓
Contact Point
  ↓
Email / Slack / Webhook / Teams

In simple words:

Telegraf collects metrics. Graphite stores metrics. Grafana queries Graphite. Grafana evaluates alert rules. Grafana sends notifications when rules fire.

3.2 Important Alerting Concepts

Alert Rule

An alert rule is the main object in Grafana Alerting.

It contains:

  1. Query
  2. Expression
  3. Condition
  4. Evaluation interval
  5. Pending period
  6. Labels
  7. Notification settings

Alert rules are the central component of Grafana Alerting: each rule evaluates one or more queries and expressions on a schedule and changes state based on the result.

Query

The query gets data from Graphite.

Correct example for this lab:

telegraf.linux-demo.cpu.usage_active

This returns CPU active usage time-series data.

Incorrect example for this lab:

telegraf.linux-demo.cpu.cpu-total.usage_active

This does not work with your current collected metrics because cpu-total is not present in your Graphite metric path.

Reduce Expression

Grafana alerting cannot directly alert on a full time-series graph. It normally reduces a time series into a single value.

Common reducer functions:

Reducer   Meaning
Last      Most recent value
Mean      Average value over selected range
Max       Maximum value over selected range
Min       Minimum value over selected range

For most infrastructure alerts, use:

Mean

or:

Last

Recommended for this lab:

Metric             Reducer
CPU                Mean
Memory             Last
Disk               Last
Load               Mean
Zombie Processes   Last
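A minimal sketch of what these reducers compute, using an invented four-point series:

```python
# The four common reducers, mirroring Grafana's Last/Mean/Max/Min.
series = [40.0, 55.0, 70.0, 65.0]  # invented sample datapoints

reducers = {
    "Last": lambda s: s[-1],            # most recent value
    "Mean": lambda s: sum(s) / len(s),  # average over the range
    "Max": max,                         # highest value in the range
    "Min": min,                         # lowest value in the range
}

for name, fn in reducers.items():
    print(name, fn(series))
# Last -> 65.0, Mean -> 57.5, Max -> 70.0, Min -> 40.0
```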

Threshold Expression

Threshold checks whether the reduced value crosses a limit.

Example:

WHEN B IS ABOVE 80

Meaning:

If the reduced CPU value is above 80, alert should fire.

Evaluation Group

Evaluation group controls how often Grafana checks the alert.

For this lab:

Every 1 minute

Pending Period

Pending period controls how long the condition must remain true before firing.

Example:

For 2 minutes

Meaning:

The condition must be true for 2 continuous minutes before Grafana sends the alert.

This prevents alerts from firing due to short temporary spikes.
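The interaction between the evaluation interval and the pending period can be sketched as a toy loop. Assumptions: a 1-minute interval and a 2-minute pending period, so the condition must breach on two consecutive evaluations (this is an illustration, not Grafana's actual state machine):

```python
# Toy simulation of "evaluate every 1m, pending period 2m".
PENDING_EVALS = 2  # 2m pending period / 1m evaluation interval

def run(values, limit=80):
    # Walk through one value per evaluation and track the rule state.
    state, breaches, history = "Normal", 0, []
    for v in values:
        if v > limit:
            breaches += 1
            state = "Firing" if breaches >= PENDING_EVALS else "Pending"
        else:
            breaches, state = 0, "Normal"
        history.append(state)
    return history

# A short spike only reaches Pending; a sustained breach fires.
print(run([70, 90, 70, 95, 95, 95]))
```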

No Data and Error Handling

Every alert rule should define what happens if Grafana gets no data or query errors.

Recommended beginner setup:

State     Recommended Setting
No Data   No Data
Error     Error

For production, teams often tune this based on monitoring maturity.

3.3 Recommended Alert Rules for This Lab

Use these thresholds for demo and learning.

Alert Name          Query                                    Threshold
High CPU Usage      telegraf.linux-demo.cpu.usage_active     Above 80%
High Memory Usage   telegraf.linux-demo.mem.used_percent     Above 85%
High Disk Usage     telegraf.linux-demo.disk.used_percent    Above 85%
High Load Average   telegraf.linux-demo.system.load1         Above 2
Zombie Processes    telegraf.linux-demo.processes.zombies    Above 0

For student demos, you can temporarily lower thresholds to trigger alerts quickly.

Example:

CPU threshold: 20%
Memory threshold: 40%
Disk threshold: slightly below current disk usage

After testing, restore proper thresholds.

3.4 Create a Contact Point

Before creating alert rules, configure where Grafana should send notifications.

Grafana contact points contain the configuration for sending alert notifications, such as email, Slack, webhook, Microsoft Teams, and other integrations.

Steps

Go to:

Alerting → Contact points

Click:

Add contact point

Name:

student-demo-email

Integration type:

Email

Add email address:

student@example.com

For your own lab, use your real email.

Click:

Test

Then:

Save contact point

If Email Is Not Configured

In many self-hosted Grafana setups, email will not work unless SMTP is configured.

For learning, use a Webhook contact point instead.

Example contact point:

Name: student-demo-webhook
Type: Webhook
URL: https://webhook.site/your-generated-url

You can open webhook.site, copy the unique URL, and paste it into Grafana.

This is very useful for students because they can see alert payloads immediately without configuring SMTP.
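If webhook.site is unreachable from your lab network, a tiny local receiver works as a stand-in. This is a hedged sketch (port 9000 is an arbitrary choice, and Grafana's exact payload fields vary by version); it simply prints whatever JSON body arrives:

```python
# Minimal local webhook receiver for inspecting alert payloads.
# Run on a lab host and point the Grafana contact point URL at it.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class AlertHandler(BaseHTTPRequestHandler):
    received = []  # collected POST bodies, useful for inspection

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        AlertHandler.received.append(body)
        print("alert payload:", json.dumps(body, indent=2))
        self.send_response(200)
        self.end_headers()

    def log_message(self, *args):
        # Silence the default per-request access log.
        pass

def serve(port=9000):
    # Blocks forever; stop with Ctrl+C.
    HTTPServer(("0.0.0.0", port), AlertHandler).serve_forever()
```

With this running, a Webhook contact point pointing at http://<lab-host>:9000/ lets students see the raw notification payload without configuring SMTP.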

3.5 Create Notification Policy

Notification policies decide how alerts are routed: they match alert instances to contact points using label matchers and can group multiple alerts together to reduce noise.

For this beginner lab, keep it simple.

Go to:

Alerting → Notification policies

Edit the default policy.

Set default contact point:

student-demo-email

or:

student-demo-webhook

Save.

This means all alerts without custom routing will go to that contact point.
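Routing by label matchers can be sketched like this. A toy model only: the `oncall-pager` contact point and the policy list are invented for illustration, and Grafana's real matcher syntax is richer (regex matchers, grouping, timing options):

```python
# Toy model of notification-policy routing by label matchers.
policies = [
    {"match": {"severity": "critical"}, "contact_point": "oncall-pager"},       # invented
    {"match": {"team": "training"}, "contact_point": "student-demo-webhook"},
]
DEFAULT = "student-demo-email"  # the default policy's contact point

def route(labels):
    # First policy whose matchers all agree with the alert's labels wins;
    # anything unmatched falls through to the default contact point.
    for p in policies:
        if all(labels.get(k) == v for k, v in p["match"].items()):
            return p["contact_point"]
    return DEFAULT

print(route({"team": "training", "metric": "cpu"}))  # student-demo-webhook
print(route({"severity": "critical"}))               # oncall-pager
print(route({"team": "ops"}))                        # student-demo-email
```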

3.6 Create Folder for Alert Rules

Go to:

Dashboards → New folder

Create folder:

Linux Monitoring Alerts

Why?

Grafana-managed alert rules are usually stored under folders. This keeps alert rules organized.

3.7 Alert Rule 1: High CPU Usage

Purpose

This alert fires when CPU usage remains high.

Go to Alert Rules

Navigate to:

Alerting → Alert rules

Click:

New alert rule

Step A: Rule Name

Use:

Linux – High CPU Usage

Step B: Select Data Source Query

In Query A, select datasource:

Graphite

Query:

telegraf.linux-demo.cpu.usage_active

Optional display alias if using the query in Explore or dashboard:

alias(telegraf.linux-demo.cpu.usage_active, 'CPU Active %')

For alerting, the plain metric path is usually cleaner:

telegraf.linux-demo.cpu.usage_active

Set relative time range:

From: 5m To: now

Step C: Add Reduce Expression

Click:

Add expression

Expression type:

Reduce

Input:

A

Function:

Mean

Name it:

B

Meaning:

Grafana takes CPU usage data from the last 5 minutes and calculates the average.

Step D: Add Threshold Expression

Click:

Add expression

Expression type:

Threshold

Input:

B

Condition:

IS ABOVE 80

Name it:

C

Set alert condition to:

C

Step E: Evaluation Behavior

Set:

Folder: Linux Monitoring Alerts
Evaluation group: linux-alerts-1m
Evaluate every: 1m
Pending period: 2m

Meaning:

Grafana checks every 1 minute. If CPU average stays above 80% for 2 minutes, alert fires.

Step F: Configure No Data and Error

Use:

No data state: No Data
Error state: Error

Step G: Labels

Add labels:

severity = warning
team = training
service = linux
host = linux-demo
metric = cpu

Labels help route and organize alerts.

Step H: Annotations

Add summary:

High CPU usage detected on linux-demo

Add description:

CPU active usage has been above 80% for more than 2 minutes. Check running processes, recent deployments, load average, and application workload.

Add runbook URL if you have one:

https://example.com/runbooks/high-cpu

Step I: Save Rule

Click:

Save rule and exit

Test CPU Alert

For demo, temporarily change threshold from:

80

to:

20

Then run:

stress --cpu 2 --timeout 180

Watch:

Alerting → Alert rules

Expected states:

Normal → Pending → Firing

After testing, restore threshold to:

80

3.8 Alert Rule 2: High Memory Usage

Rule Name

Linux – High Memory Usage

Query A

Datasource:

Graphite

Query:

telegraf.linux-demo.mem.used_percent

Optional display alias:

alias(telegraf.linux-demo.mem.used_percent, 'Memory Used %')

Relative time range:

From: 5m To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Last

Why Last?

Memory usage is already a stable gauge. The most recent value is usually meaningful.

Threshold Expression C

Type: Threshold
Input: B
Condition: IS ABOVE 85

Evaluation

Evaluate every: 1m
Pending period: 3m

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = memory

Annotations

Summary:

High memory usage detected on linux-demo

Description:

Memory usage is above 85%. Check memory-heavy processes, application memory growth, cache usage, and swap activity.

Save

Click:

Save rule and exit

Optional Memory Test

To test memory safely, avoid exhausting the server.

Install stress if needed:

apt install -y stress

Run a small memory test:

stress --vm 1 --vm-bytes 256M --timeout 120

For small EC2 instances, reduce memory:

stress --vm 1 --vm-bytes 128M --timeout 120

For demo triggering, temporarily reduce threshold to a value slightly below current memory usage.

3.9 Alert Rule 3: High Disk Usage

Important Note

In this specific lab, the current Graphite metric path does not include disk mountpoint as a separate node.

Correct collected metric:

telegraf.linux-demo.disk.used_percent

Incorrect query for this lab:

highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)

Why the older query is incorrect:

Your Graphite storage does not contain paths like:

telegraf.linux-demo.disk.root.used_percent
telegraf.linux-demo.disk.var.used_percent
telegraf.linux-demo.disk.tmp.used_percent

It contains:

telegraf.linux-demo.disk.used_percent

So alerting should use the exact collected metric:

telegraf.linux-demo.disk.used_percent

Rule Name

Linux – High Disk Usage

Query A

Datasource:

Graphite

Query:

telegraf.linux-demo.disk.used_percent

Optional display alias:

alias(telegraf.linux-demo.disk.used_percent, 'Disk Used %')

Relative time range:

From: 5m To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Last

Why Last?

Disk usage percentage is a gauge. The latest value is usually the most useful value for alerting.

Threshold Expression C

Type: Threshold
Input: B
Condition: IS ABOVE 85

Evaluation

Evaluate every: 1m
Pending period: 5m

Disk issues usually do not need second-by-second alerting. A 5-minute pending period avoids noisy alerts.

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = disk

Annotations

Summary:

High disk usage detected on linux-demo

Description:

Disk usage is above 85%. Check large files, logs, Docker images, temporary files, and application data growth.

Save

Click:

Save rule and exit

Disk Alert Test

Create a temporary file:

dd if=/dev/zero of=/tmp/disk-alert-test.img bs=100M count=5

Check dashboard.

Remove after test:

rm -f /tmp/disk-alert-test.img

For a large disk, this may not trigger the alert. For demo, temporarily reduce threshold near current disk usage.

3.10 Alert Rule 4: High Load Average

Rule Name

Linux – High Load Average

Query A

Datasource:

Graphite

Query:

telegraf.linux-demo.system.load1

Optional display alias:

alias(telegraf.linux-demo.system.load1, 'Load 1m')

Relative time range:

From: 5m To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Mean

Threshold Expression C

For demo, use:

IS ABOVE 2

For production, threshold depends on CPU count.

Simple teaching rule:

If load average is consistently higher than CPU core count, investigate.

For a 2-vCPU system:

Warning: above 2
Critical: above 4
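The teaching rule above can be turned into a quick calculation. A sketch only, assuming warning at 1x core count and critical at 2x:

```python
# Derive load-average thresholds from the host's core count:
# warn when sustained load exceeds the cores, critical at double that.
import os

cores = os.cpu_count() or 1  # fall back to 1 if undetectable
warning, critical = cores, cores * 2
print(f"cores={cores} warning>{warning} critical>{critical}")
```

On the 2-vCPU lab host this reproduces the values above (warning above 2, critical above 4).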

Evaluation

Evaluate every: 1m
Pending period: 3m

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = load

Annotations

Summary:

High load average detected on linux-demo

Description:

1-minute load average is above the expected threshold. Check CPU usage, blocked processes, disk I/O wait, and application workload.

Save

Click:

Save rule and exit

Test Load Alert

Run:

stress --cpu 2 --timeout 180

Watch the alert state.

If it does not trigger, temporarily reduce threshold:

IS ABOVE 0.5

After testing, restore:

IS ABOVE 2

3.11 Alert Rule 5: Zombie Processes Detected

Rule Name

Linux – Zombie Processes Detected

Query A

Datasource:

Graphite

Query:

telegraf.linux-demo.processes.zombies

Optional display alias:

alias(telegraf.linux-demo.processes.zombies, 'Zombie Processes')

Relative time range:

From: 5m To: now

Reduce Expression B

Type: Reduce
Input: A
Function: Last

Threshold Expression C

Type: Threshold
Input: B
Condition: IS ABOVE 0

Meaning:

If zombie process count is greater than 0, alert fires.

Evaluation

Evaluate every: 1m
Pending period: 1m

Labels

severity = warning
team = training
service = linux
host = linux-demo
metric = process

Annotations

Summary:

Zombie process detected on linux-demo

Description:

One or more zombie processes exist. Check parent processes and application process management.

Save

Click:

Save rule and exit

3.12 Link Alert Rules to Dashboard Panels

Grafana allows alert rules to be linked to dashboard panels so users can see alert state directly on panels.

For each alert rule:

Go to:

Alerting → Alert rules

Open the alert rule.

Find:

Configure notification message

or dashboard/panel link section.

Click:

Link dashboard and panel

Select dashboard:

Linux Monitoring – Graphite Telegraf

Select matching panel:

Alert Rule                          Dashboard Panel
Linux – High CPU Usage              CPU Usage %
Linux – High Memory Usage           Memory Used %
Linux – High Disk Usage             Disk Usage %
Linux – High Load Average           System Load Average
Linux – Zombie Processes Detected   Zombie Processes

Save the rule.

3.13 Verify Alert Rule State

Go to:

Alerting → Alert rules

You should see rule states.

Common states:

State     Meaning
Normal    Condition is false
Pending   Condition is true but pending period has not completed
Firing    Condition is true and pending period completed
No Data   Query returned no data
Error     Query or datasource failed
Paused    Rule evaluation is disabled

Expected first state:

Normal

If you lowered thresholds for testing, you may see:

Pending

then:

Firing

3.14 Testing All Alerts Safely

CPU Test

stress --cpu 2 --timeout 180

Expected:

CPU alert may go Pending/Firing if threshold is low enough.

Memory Test

stress --vm 1 --vm-bytes 256M --timeout 120

For small instances:

stress --vm 1 --vm-bytes 128M --timeout 120

Expected:

Memory panel increases. Memory alert may fire if threshold is low enough.

Disk Test

dd if=/dev/zero of=/tmp/disk-alert-test.img bs=100M count=5

Remove test file:

rm -f /tmp/disk-alert-test.img

Expected:

Disk usage may increase depending on disk size.

Load Test

stress --cpu 2 --timeout 180

Expected:

Load average increases gradually.

Zombie Test

Zombie process creation is not recommended for beginner labs unless you provide controlled code.

For training, test this alert by temporarily changing threshold:

IS ABOVE -1

Because normal zombie count is usually 0, this makes the alert condition true.

After testing, restore:

IS ABOVE 0

3.15 Recommended Alert Rule Summary Table

Alert Rule          Query                                    Reducer  Threshold  Evaluate  Pending
High CPU Usage      telegraf.linux-demo.cpu.usage_active     Mean     > 80       1m        2m
High Memory Usage   telegraf.linux-demo.mem.used_percent     Last     > 85       1m        3m
High Disk Usage     telegraf.linux-demo.disk.used_percent    Last     > 85       1m        5m
High Load Average   telegraf.linux-demo.system.load1         Mean     > 2        1m        3m
Zombie Processes    telegraf.linux-demo.processes.zombies    Last     > 0        1m        1m

3.16 Recommended Labels and Annotations

Use consistent labels for all rules.

Labels

team = training
service = linux
host = linux-demo
datasource = graphite

Per alert:

metric = cpu
metric = memory
metric = disk
metric = load
metric = process

Severity:

severity = warning

For production, you can add:

environment = demo

or:

environment = production

Annotation Template

Use this structure:

summary = Short human-readable alert title
description = What happened, why it matters, and first troubleshooting steps
runbook_url = Link to internal runbook

Example:

summary = High CPU usage detected on linux-demo
description = CPU active usage has been above 80% for more than 2 minutes. Check top processes, load average, recent deployments, and application traffic.

3.17 Common Student Mistakes

Mistake 1: Using old assumed CPU metric path

Incorrect:

telegraf.linux-demo.cpu.cpu-total.usage_active

Correct:

telegraf.linux-demo.cpu.usage_active

Reason:

Your current Graphite storage does not include cpu-total in the metric path.

Mistake 2: Using old assumed disk wildcard path

Incorrect:

telegraf.linux-demo.disk.*.used_percent

Incorrect for this lab:

highestCurrent(telegraf.linux-demo.disk.*.used_percent, 1)

Correct:

telegraf.linux-demo.disk.used_percent

Reason:

Your current Graphite storage does not include mountpoint as a separate path node.

Mistake 3: Forgetting Reduce expression

Grafana needs a single value for threshold comparison.

Correct flow:

A = Graphite query
B = Reduce A
C = Threshold B
Condition = C

Mistake 4: Wrong datasource

Make sure Query A uses:

Graphite

not Prometheus, Loki, or TestData.

Mistake 5: Wrong metric path

Check in Explore first:

telegraf.linux-demo.cpu.usage_active

If no data, discover with:

telegraf.*

Or run this command on Graphite server:

docker exec -it graphite find /opt/graphite/storage/whisper/telegraf -type f | sed 's#/opt/graphite/storage/whisper/##; s#/#.#g; s#\.wsp$##' | sort

Mistake 6: Time range too short

If query range is too short, Grafana may not get enough datapoints.

Use:

From: 5m To: now

Mistake 7: Expecting notification without contact point

Alert rule can fire but no message will be received if contact point/notification policy is not configured.

Check:

Alerting → Contact points
Alerting → Notification policies

3.18 Troubleshooting Guide

Problem: Alert rule shows No Data

Check Graphite query in Explore:

telegraf.linux-demo.cpu.usage_active

Check Graphite directly:

curl "http://localhost:8080/render?target=telegraf.linux-demo.cpu.usage_active&from=-10min&format=json"

Check Telegraf logs:

docker logs -f telegraf

Check Graphite files:

docker exec -it graphite find /opt/graphite/storage/whisper/telegraf -type f | head

Problem: Alert stays Normal during test

Check these:

  1. Is threshold too high?
  2. Is pending period too long?
  3. Is query returning expected value?
  4. Is time range set to 5m?
  5. Did you save the rule?

For demo, temporarily lower threshold.

Problem: Alert fires but no notification arrives

Check:

  1. Contact point test works
  2. Notification policy routes to contact point
  3. Alert has not been muted
  4. Alert is not grouped/delayed
  5. SMTP/webhook/Slack configuration is valid

Problem: Disk alert does not show mount name

In this lab, disk mountpoint is not available in the Graphite metric path because of the current Telegraf Graphite template.

Current available disk alert query:

telegraf.linux-demo.disk.used_percent

If you want mount-level disk metrics later, change the Telegraf Graphite output template to include mountpoint tags and regenerate metrics. But for this lab, use the actual collected metric path above.
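For reference, the mount-level change would live in the Telegraf Graphite output template. A hedged sketch only; verify the option names against your Telegraf version's documentation, and note the server address is a placeholder:

```toml
# telegraf.conf -- Graphite output (sketch; check your Telegraf docs)
[[outputs.graphite]]
  servers = ["graphite:2003"]  # placeholder carbon address
  prefix = "telegraf"
  # Current lab behavior: tags are dropped from the path.
  template = "host.measurement.field"
  # To include tags such as disk mountpoint or CPU name in the path:
  # template = "host.tags.measurement.field"
```

Changing the template creates new Whisper files under new paths; existing paths keep their old history, so plan the change before relying on the data.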

3.19 Lab Completion Checklist

Students should complete:

Task                                                 Completed
Created contact point                                ☐
Tested contact point                                 ☐
Configured notification policy                       ☐
Created alert folder                                 ☐
Created CPU alert                                    ☐
Created memory alert                                 ☐
Created disk alert                                   ☐
Created load alert                                   ☐
Created zombie process alert                         ☐
Added labels                                         ☐
Added annotations                                    ☐
Linked alerts to dashboard panels                    ☐
Tested at least one alert                            ☐
Restored production-like thresholds after testing    ☐

3.20 Final Student Summary

At the end of this lab, students should understand:

  1. Grafana can create alert rules from Graphite metrics.
  2. A good alert rule uses Query → Reduce → Threshold.
  3. Evaluation interval controls how often the rule runs.
  4. Pending period prevents noisy alerts from short spikes.
  5. Contact points define where notifications are sent.
  6. Notification policies define how alerts are routed.
  7. Labels help organize and route alerts.
  8. Annotations make alerts understandable for humans.
  9. Alert rules should be tested safely before production use.
  10. Metric paths must match the real Graphite storage structure. In this lab, the correct metric pattern is telegraf.linux-demo.<measurement>.<field>.

The final alerting setup is:

Graphite metric
  ↓
Grafana alert query
  ↓
Reduce expression
  ↓
Threshold expression
  ↓
Evaluation group
  ↓
Alert state
  ↓
Notification policy
  ↓
Contact point

This completes the full student workflow:

Step 1: Explore metrics Step 2: Build dashboard Step 3: Create alerts