What is the Graphite monitoring tool?

Graphite is an open-source time-series monitoring system used to receive, store, query, and visualize metrics from servers, applications, databases, and infrastructure. Graphite does not gather metrics itself; a separate collector or script sends them to it.

In simple words:

Graphite stores numerical metrics over time so you can see trends, performance, and problems.

Example metrics Graphite can store:

CPU usage
Memory usage
Disk usage
Network traffic
Application request count
API response time
Error count
Database query time

A sample Graphite metric looks like this:

linux.server1.cpu.usage 72.5 1710000000

Meaning:

| Part | Meaning |
|---|---|
| linux.server1.cpu.usage | Metric name/path |
| 72.5 | Metric value |
| 1710000000 | Timestamp |
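
The three parts of the sample line above can be pulled apart in a few lines of Python. This is only a minimal sketch; the `Metric` type and the `parse_plaintext_line` function are hypothetical names used for illustration and are not part of Graphite.

```python
from typing import NamedTuple


class Metric(NamedTuple):
    path: str       # dotted metric name, e.g. linux.server1.cpu.usage
    value: float    # numeric value
    timestamp: int  # Unix timestamp in seconds


def parse_plaintext_line(line: str) -> Metric:
    """Split one '<path> <value> <timestamp>' line into its three parts."""
    path, value, timestamp = line.split()
    return Metric(path, float(value), int(timestamp))


metric = parse_plaintext_line("linux.server1.cpu.usage 72.5 1710000000")
print(metric.path.split("."))  # the dots in the path form a hierarchy
print(metric.value, metric.timestamp)
```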

How Graphite works

flowchart TD
    A[Server / Application] --> B[Metric Collector]
    B --> C[Carbon]
    C --> D[Whisper Storage]
    D --> E[Graphite Web / API]
    E --> F[Grafana Dashboard]

Main components

| Component | Role |
|---|---|
| Carbon | Receives metrics and writes them to storage |
| Whisper | Stores time-series metrics on disk |
| Graphite Web | Provides UI and API to query metrics |
| Grafana | Often used to create dashboards and alerts from Graphite data |

Where Graphite is used

Graphite is commonly used for:

Infrastructure monitoring
Linux server monitoring
Application performance metrics
Capacity planning
Historical trend analysis
Grafana dashboard backend

Graphite vs Grafana

| Tool | Purpose |
|---|---|
| Graphite | Stores and serves metrics |
| Grafana | Visualizes metrics and creates dashboards/alerts |

So the simplest explanation is:

Graphite is the metrics backend. Grafana is the visualization frontend.

The sections below describe the main components of a Graphite monitoring stack, with data flow and architecture diagrams in Mermaid.

1. Graphite Components Overview

A Graphite monitoring stack is mainly made of these components:

| Component | Purpose |
|---|---|
| Application / Server / Script | Generates metrics such as CPU usage, memory, disk, request count, and latency. |
| StatsD / CollectD / Telegraf / Carbon Agent | Collects metrics from servers or applications and forwards them to Graphite. |
| Carbon | Receives metrics, processes them, and writes them into Whisper storage. |
| Carbon Receiver | Listens for incoming metrics over TCP (and optionally UDP). By default carbon-cache uses port 2003 for plaintext and 2004 for pickle; carbon-aggregator uses port 2023 for plaintext. |
| Carbon Cache | Temporarily stores incoming metrics in memory before writing them to disk. |
| Carbon Relay | Optional component used to forward, split, or replicate metrics to multiple Carbon nodes. |
| Carbon Aggregator | Optional component used to aggregate metrics before storing them. |
| Whisper | Graphite's time-series database file format. Stores metrics on disk. |
| Graphite Web | Web application used to query, render, and visualize metrics. |
| Graphite API / Render API | Allows tools like Grafana to query Graphite metrics. |
| Grafana | Visualization and dashboarding tool that connects to Graphite as a data source. |
| Alerts | Usually handled in Grafana, not directly by classic Graphite. |

2. Simple Graphite Flow

Application / Server
        |
        v
Metric Collector
(StatsD / CollectD / Telegraf / Custom Script)
        |
        v
Carbon
        |
        v
Whisper Storage
        |
        v
Graphite Web / Graphite API
        |
        v
Grafana Dashboard / Alerts

3. Detailed Graphite Architecture Flow

flowchart TD
    A[Application / Linux Server / Service] --> B[Metric Collector]

    B --> B1[StatsD]
    B --> B2[CollectD]
    B --> B3[Telegraf]
    B --> B4[Custom Script]

    B1 --> C[Carbon Receiver]
    B2 --> C
    B3 --> C
    B4 --> C

    C --> D[Carbon Cache]

    D --> E[Whisper Storage]

    E --> F[Graphite Web]

    F --> G[Graphite Render API]

    G --> H[Grafana]

    H --> I[Dashboards]
    H --> J[Alerts]


4. Graphite Components with Ports

| Component | Common Port | Description |
|---|---|---|
| Carbon plaintext receiver | 2003 | Receives metrics in plain text format. |
| Carbon pickle receiver | 2004 | Receives metrics in Python pickle format. |
| Carbon cache query port | 7002 | Used by Graphite Web to query carbon-cache for recent data not yet written to disk. |
| Graphite Web | 80, 8080, or custom | Web UI and API access. |
| Grafana | 3000 | Grafana web interface. |
| StatsD | 8125/UDP | Receives application metrics and forwards to Graphite. |
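
The pickle receiver listed above accepts batches of datapoints rather than one line per metric. As a rough sketch, based on the message layout described in Graphite's "Feeding In Your Data" documentation (a list of `(path, (timestamp, value))` tuples, pickled with protocol 2 and prefixed with a 4-byte big-endian length), a batch could be built as follows. The function name `build_pickle_message` and the sample values are assumptions for illustration.

```python
import pickle
import struct


def build_pickle_message(datapoints):
    """Pack [(path, (timestamp, value)), ...] as a length-prefixed pickle."""
    payload = pickle.dumps(datapoints, protocol=2)
    header = struct.pack("!L", len(payload))  # 4-byte big-endian length
    return header + payload


batch = [
    ("linux.server1.cpu.usage", (1710000000, 72.5)),
    ("linux.server1.memory.usage", (1710000000, 61.0)),
]
message = build_pickle_message(batch)
print(len(message), "bytes ready to send to port 2004")
```

Batching this way avoids one network write per metric, which matters mostly when a sender forwards many metrics at once.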

5. Metric Flow Example

Example metric:

linux.server1.cpu.usage 72.5 1710000000

This means:

| Part | Meaning |
|---|---|
| linux.server1.cpu.usage | Metric path/name |
| 72.5 | Metric value |
| 1710000000 | Unix timestamp |

Flow:

Linux Server
  sends metric:
  linux.server1.cpu.usage 72.5 1710000000

Carbon receives it on port 2003

Carbon Cache stores it temporarily

Carbon Cache writes it into a Whisper .wsp file

Graphite Web reads it

Grafana displays it in dashboard

6. Full Graphite + Grafana Monitoring Flow

flowchart LR
    subgraph Sources["Metric Sources"]
        A1[Linux Servers]
        A2[Applications]
        A3[Databases]
        A4[Network Devices]
        A5[Custom Scripts]
    end

    subgraph Collectors["Metric Collectors"]
        B1[StatsD]
        B2[CollectD]
        B3[Telegraf]
        B4[Diamond]
    end

    subgraph Graphite["Graphite Stack"]
        C1[Carbon Receiver]
        C2[Carbon Cache]
        C3[Carbon Relay Optional]
        C4[Carbon Aggregator Optional]
        D1[Whisper Storage]
        E1[Graphite Web]
        E2[Graphite Render API]
    end

    subgraph Visualization["Visualization and Alerting"]
        F1[Grafana]
        F2[Dashboards]
        F3[Alerts]
    end

    A1 --> B2
    A2 --> B1
    A3 --> B3
    A4 --> B3
    A5 --> C1

    B1 --> C1
    B2 --> C1
    B3 --> C1
    B4 --> C1

    C1 --> C2
    C2 --> D1

    C1 --> C3
    C3 --> C2

    C1 --> C4
    C4 --> C2

    D1 --> E1
    E1 --> E2
    E2 --> F1
    F1 --> F2
    F1 --> F3


7. How Each Component Works

Application / Server

This is the original source of metrics.

Examples:

CPU usage
Memory usage
Disk usage
Network traffic
HTTP request count
API latency
Error count
Database query time

The server itself does not usually write directly to Whisper. It sends metrics to Graphite through a collector or script.


Metric Collector

Collectors gather metrics and send them to Carbon.

Common collectors:

| Collector | Use Case |
|---|---|
| StatsD | Application-level metrics such as counters, timers, gauges. |
| CollectD | Linux system metrics such as CPU, memory, disk, network. |
| Telegraf | Modern metrics agent with many plugins. |
| Diamond | Older Python-based Graphite collector. |
| Custom Scripts | Simple scripts that push metrics directly to Carbon. |

Example using shell:

echo "linux.server1.cpu.usage 75 $(date +%s)" | nc 127.0.0.1 2003
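
The same push can be done from a custom script without `nc`. The sketch below uses only Python's standard library; `format_metric` and `send_metric` are hypothetical helper names, and the host and port are the defaults assumed in the shell example above.

```python
import socket
import time


def format_metric(path, value, timestamp=None):
    """Return one plaintext-protocol line, terminated by a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"


def send_metric(line, host="127.0.0.1", port=2003, timeout=5.0):
    """Open a TCP connection to Carbon, send one line, and close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


line = format_metric("linux.server1.cpu.usage", 75)
print(line, end="")
# send_metric(line)  # uncomment when a Carbon listener is reachable
```

The trailing newline matters: the plaintext protocol is line-based, so a line without it may not be processed until the connection closes.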

Carbon

Carbon is the ingestion engine of Graphite.

It receives metrics and writes them to Whisper.

Carbon has multiple sub-components:

carbon-cache
carbon-relay
carbon-aggregator

Carbon Cache

Carbon Cache receives metrics and keeps them briefly in memory before writing them to disk.

Main responsibilities:

Receive metric data
Buffer data in memory
Apply storage schema
Write data to Whisper files
Serve recent cached data to Graphite Web
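
The buffering behaviour in the list above can be pictured with a toy model. This is not Carbon's implementation; `ToyMetricBuffer` and its methods are hypothetical names, and real carbon-cache adds write scheduling, limits, and schema handling that are left out here.

```python
from collections import defaultdict


class ToyMetricBuffer:
    """Hold datapoints in memory per metric until they are drained."""

    def __init__(self):
        self._points = defaultdict(list)

    def add(self, path, timestamp, value):
        """Buffer one incoming datapoint."""
        self._points[path].append((timestamp, value))

    def peek(self, path):
        """Return buffered points without removing them (a cache query)."""
        return list(self._points.get(path, []))

    def drain(self, path):
        """Remove and return buffered points, as a write to disk would."""
        return self._points.pop(path, [])


buffer = ToyMetricBuffer()
buffer.add("linux.server1.cpu.usage", 1710000000, 72.5)
buffer.add("linux.server1.cpu.usage", 1710000010, 74.0)
print(buffer.peek("linux.server1.cpu.usage"))   # still in memory
print(buffer.drain("linux.server1.cpu.usage"))  # handed off for writing
```

The `peek` step is why recent values can appear on a dashboard before they reach the disk.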

Carbon Relay

Carbon Relay is optional.

It is used when Graphite is scaled across multiple servers.

Main uses:

Forward metrics to multiple Carbon caches
Shard metrics across multiple storage nodes
Replicate metrics for high availability
Route metrics based on rules

Example:

server1 metrics -> carbon-cache-1
server2 metrics -> carbon-cache-2
app metrics    -> carbon-cache-3
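
Rule-based routing like the example above amounts to "first matching pattern wins". The sketch below illustrates that idea only; the metric prefixes, the `RULES` list, and the `route` function are assumptions made for this example, and it does not reproduce carbon-relay's configuration format or its consistent-hashing mode.

```python
import re

# Hypothetical rules mirroring the example above: (pattern, destination).
RULES = [
    (re.compile(r"^linux\.server1\."), "carbon-cache-1"),
    (re.compile(r"^linux\.server2\."), "carbon-cache-2"),
    (re.compile(r"^app\."), "carbon-cache-3"),
]
DEFAULT_DESTINATION = "carbon-cache-1"  # assumed fallback for unmatched metrics


def route(path):
    """Return the destination of the first rule whose pattern matches."""
    for pattern, destination in RULES:
        if pattern.search(path):
            return destination
    return DEFAULT_DESTINATION


for name in ("linux.server2.cpu.usage", "app.checkout.requests.count"):
    print(name, "->", route(name))
```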

Carbon Aggregator

Carbon Aggregator is optional.

It aggregates many metrics before storage.

Example:

app.server1.requests.count
app.server2.requests.count
app.server3.requests.count

Can be aggregated into:

app.all.requests.count

Useful for reducing query complexity. It also reduces storage volume, but only if the individual per-server metrics are not stored as well.
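
The aggregation in the example above is, at its core, a sum over every metric that matches a pattern. The following sketch shows that idea for a single interval; the sample values, `INPUT_PATTERN`, and `aggregate_sum` are hypothetical, and the real carbon-aggregator is driven by its own rules file rather than by code like this.

```python
import re

# One path segment (no dots) is allowed where the server name appears.
INPUT_PATTERN = re.compile(r"^app\.[^.]+\.requests\.count$")


def aggregate_sum(datapoints, pattern, output_path):
    """Sum the values of every metric whose path matches the pattern."""
    total = sum(value for path, value in datapoints.items() if pattern.match(path))
    return output_path, total


interval_values = {  # assumed values for one aggregation interval
    "app.server1.requests.count": 120,
    "app.server2.requests.count": 95,
    "app.server3.requests.count": 101,
    "app.server1.errors.count": 4,  # does not match, so it is ignored
}
print(aggregate_sum(interval_values, INPUT_PATTERN, "app.all.requests.count"))
```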


Whisper

Whisper is Graphite’s storage engine.

It stores each metric as a .wsp file.

Example metric:

linux.server1.cpu.usage

May be stored as:

/opt/graphite/storage/whisper/linux/server1/cpu/usage.wsp
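
The mapping from metric name to file path is mechanical: each dot becomes a directory separator and the last segment becomes the file name. A small sketch, assuming the storage root shown above (the function name `whisper_path` is made up for this example):

```python
from pathlib import PurePosixPath


def whisper_path(metric, root="/opt/graphite/storage/whisper"):
    """Map a dotted metric name to the .wsp file path under the storage root."""
    parts = metric.split(".")
    return str(PurePosixPath(root, *parts[:-1], parts[-1] + ".wsp"))


print(whisper_path("linux.server1.cpu.usage"))
```

Because every distinct metric name becomes its own file, a large number of metric names means a large number of files on disk.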

Whisper is similar to RRD storage. It stores fixed-size time-series data based on retention rules.

Example retention:

10s:6h,1m:7d,10m:5y

Meaning:

| Retention | Meaning |
|---|---|
| 10s:6h | Keep 10-second data for 6 hours |
| 1m:7d | Keep 1-minute data for 7 days |
| 10m:5y | Keep 10-minute data for 5 years |
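
Because each archive has a fixed resolution and duration, the number of points it holds can be worked out in advance. The sketch below does that arithmetic for the retention string above. The helper names are hypothetical, only a subset of time units is handled, and the size estimate assumes Whisper's layout of a 16-byte file header, 12 bytes of header per archive, and 12 bytes per point.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "y": 86400 * 365}


def to_seconds(spec):
    """Convert a value such as '10s' or '5y' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(definition):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    archives = []
    for archive in definition.split(","):
        precision, duration = archive.split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


archives = parse_retentions("10s:6h,1m:7d,10m:5y")
print(archives)  # [(10, 2160), (60, 10080), (600, 262800)]

total_points = sum(points for _, points in archives)
estimated_bytes = 16 + 12 * len(archives) + 12 * total_points
print(total_points, estimated_bytes)
```

Under those assumptions the file is a little over 3 MB per metric, and it is allocated at that size when the metric is first created, regardless of how much data has arrived.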

Graphite Web

Graphite Web is the web application and API layer.

It allows users and tools to:

Search metrics
Query metrics
Render graphs
Apply functions
Expose Render API

Example Graphite Render API:

/render?target=linux.server1.cpu.usage&from=-1h&format=json
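
A client can build this URL and read the JSON reply with the standard library alone. In the sketch below, `build_render_url` and `non_null_points` are hypothetical helpers, `graphite-server` is a placeholder host, and the sample reply is hand-written to match the shape the Render API returns for `format=json` (a list of series, each with a `target` and `[value, timestamp]` pairs).

```python
import json
from urllib.parse import urlencode


def build_render_url(base_url, target, start="-1h", fmt="json"):
    """Compose a Render API URL for one target."""
    query = urlencode({"target": target, "from": start, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


def non_null_points(response_text):
    """Yield (target, timestamp, value), skipping intervals with no data."""
    for series in json.loads(response_text):
        for value, timestamp in series["datapoints"]:
            if value is not None:
                yield series["target"], timestamp, value


url = build_render_url("http://graphite-server", "linux.server1.cpu.usage")
print(url)

# Hand-written sample reply; a real one would come from an HTTP GET on the URL.
sample = '[{"target": "linux.server1.cpu.usage", "datapoints": [[72.5, 1710000000], [null, 1710000010]]}]'
print(list(non_null_points(sample)))
```

Null values are normal in the reply: they mark intervals for which no datapoint was stored.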

Grafana

Grafana is commonly used as the visualization layer for Graphite.

Grafana connects to:

http://<graphite-server>

or

http://<graphite-server>:8080

Grafana uses Graphite’s API to search metrics and build dashboards.

Typical Grafana panels:

CPU Usage
Memory Usage
Disk Usage
Network In/Out
Load Average
Process Count
Application Request Rate
Error Rate
Latency

8. Graphite Data Flow with Example

sequenceDiagram
    participant App as Application / Linux Server
    participant Agent as StatsD / CollectD / Telegraf
    participant Carbon as Carbon Receiver
    participant Cache as Carbon Cache
    participant Whisper as Whisper Storage
    participant Web as Graphite Web API
    participant Grafana as Grafana

    App->>Agent: Generate metrics
    Agent->>Carbon: Send metric over TCP/UDP
    Carbon->>Cache: Accept and buffer metric
    Cache->>Whisper: Write metric to .wsp file
    Grafana->>Web: Query metric target
    Web->>Whisper: Read historical data
    Web->>Cache: Read recent cached data
    Web-->>Grafana: Return time-series data
    Grafana->>Grafana: Render dashboard / alert

9. End-to-End Example

Suppose a Linux server reports CPU usage.

Step 1: Metric is generated

linux.web01.cpu.usage 65 1710000000

Step 2: Metric is sent to Carbon

echo "linux.web01.cpu.usage 65 $(date +%s)" | nc graphite-server 2003

Step 3: Carbon receives it

Carbon plaintext receiver listens on port 2003

Step 4: Carbon writes to Whisper

/opt/graphite/storage/whisper/linux/web01/cpu/usage.wsp

Step 5: Graphite Web exposes it

http://graphite-server/render?target=linux.web01.cpu.usage&from=-1h

Step 6: Grafana visualizes it

Grafana panel query:

linux.web01.cpu.usage

10. Graphite Stack in One Diagram

flowchart TB
    A[Metric Sources] --> B[Metric Collection Layer]
    B --> C[Carbon Ingestion Layer]
    C --> D[Whisper Storage Layer]
    D --> E[Graphite Query Layer]
    E --> F[Grafana Visualization Layer]

    A1[Linux CPU, Memory, Disk, Network] --> A
    A2[Application Counters, Timers, Gauges] --> A
    A3[Database Metrics] --> A
    A4[Custom Business Metrics] --> A

    B1[StatsD] --> B
    B2[CollectD] --> B
    B3[Telegraf] --> B
    B4[Custom Scripts] --> B

    C1[carbon-cache] --> C
    C2[carbon-relay] --> C
    C3[carbon-aggregator] --> C

    D1[Whisper .wsp Files] --> D

    E1[Graphite Web UI] --> E
    E2[Graphite Render API] --> E
    E3[Graphite Functions] --> E

    F1[Grafana Dashboards] --> F
    F2[Grafana Explore] --> F
    F3[Grafana Alerts] --> F

11. Important Point for Students

Graphite itself is mainly responsible for:

Receiving metrics
Storing metrics
Querying metrics
Rendering metric data

Grafana is mainly responsible for:

Beautiful dashboards
Explore UI
Alert rules
Notification channels
Dashboard sharing

So in modern monitoring labs:

Graphite = Metrics backend
Grafana = Visualization and alerting frontend

12. Final Summary

Metric Source
    ↓
Collector or Agent
    ↓
Carbon Receiver
    ↓
Carbon Cache
    ↓
Whisper Storage
    ↓
Graphite Web / Render API
    ↓
Grafana
    ↓
Dashboard / Alert

In simple terms:

Graphite receives and stores time-series metrics. Grafana reads those metrics from Graphite and converts them into dashboards and alerts.

StatsD receives application and system metrics sent by apps, scripts, services, or servers, usually over UDP. It aggregates them over a flush interval and forwards the results to a backend such as Graphite. It does not collect logs or traces; it handles numeric time-series data.

Types of data collected using StatsD

| Metric type | What it means | Example use case |
|---|---|---|
| Counter | Counts how many times something happened | Number of API requests, login attempts, errors |
| Gauge | Current value at a point in time | Memory usage, queue size, active users |
| Timer | Measures how long something takes | API response time, DB query duration |
| Histogram / Distribution | Measures value distribution | Request latency percentiles like p95, p99 |
| Set | Counts unique values | Unique users, unique IPs, unique sessions |
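
Every metric type in the table travels in the same wire format, `name:value|type`. A minimal parser, written here only to make that structure concrete (`parse_statsd_line` is a hypothetical name, and optional extras such as a sample rate are ignored):

```python
def parse_statsd_line(line):
    """Split '<name>:<value>|<type>' into (name, value, type)."""
    name, _, rest = line.partition(":")
    fields = rest.split("|")
    value, metric_type = fields[0], fields[1]
    # Set members are arbitrary strings; the other types carry numbers.
    if metric_type != "s":
        value = float(value)
    return name, value, metric_type


for sample in ("api.requests:1|c", "queue.size:45|g",
               "db.query_time:38|ms", "unique.ips:192.168.1.10|s"):
    print(parse_statsd_line(sample))
```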

Examples

1. Counter data

Used to count events.

api.requests:1|c
login.success:1|c
login.failed:1|c
payment.errors:1|c

Meaning:

api.requests increased by 1
login.success increased by 1
login.failed increased by 1
payment.errors increased by 1

Common use cases:

Total HTTP requests
Total errors
Total signups
Total orders
Total failed payments
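
From application code, incrementing a counter is a single UDP datagram. The sketch below assumes a StatsD server on the default port 8125; `incr` and `send_statsd` are hypothetical helpers, not a real client library.

```python
import socket


def incr(name, count=1):
    """Format a counter increment in StatsD line format."""
    return f"{name}:{count}|c"


def send_statsd(line, host="127.0.0.1", port=8125):
    """Send one StatsD line as a UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("utf-8"), (host, port))


print(incr("api.requests"))
# send_statsd(incr("api.requests"))  # uncomment when a StatsD server is running
```

UDP is used so that a slow or missing StatsD server never blocks the application; the trade-off is that a lost datagram is a lost increment.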

2. Gauge data

Used to send the current value.

queue.size:45|g
memory.used:712|g
active.users:128|g
disk.used.percent:67|g

Common use cases:

CPU usage
Memory usage
Disk usage
Queue depth
Active connections
Number of running jobs

3. Timer data

Used to measure duration.

api.response_time:245|ms
db.query_time:38|ms
cache.lookup_time:4|ms

Meaning:

API response took 245 ms
Database query took 38 ms
Cache lookup took 4 ms

Common use cases:

API latency
Database query latency
External API call duration
File upload time
Job execution time
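
Timers are usually produced by wrapping the code being measured. One way to sketch that is a context manager that measures elapsed time and emits a `|ms` line. The `timed` helper and its `emit` parameter are assumptions for illustration; a real setup would pass a function that sends the line to StatsD.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(name, emit=print):
    """Measure the wrapped block and emit '<name>:<milliseconds>|ms'."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        emit(f"{name}:{elapsed_ms:.0f}|ms")


with timed("db.query_time"):
    time.sleep(0.02)  # stand-in for the real database call
```

The `finally` clause means a timing is emitted even when the wrapped code raises an exception.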

4. Histogram / distribution data

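Used to understand the spread of values. The h type is an extension supported by some StatsD implementations (for example DogStatsD and Telegraf) rather than one of the original StatsD metric types.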

Example:

request.size:2048|h
response.size:5120|h
order.amount:499|h

Common use cases:

Request payload size
Response size
Order amount
Latency distribution
File size distribution

Depending on backend support, this can help calculate:

avg
min
max
p50
p90
p95
p99
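
To make the percentile names concrete, here is one common definition (nearest rank) applied to a small made-up sample. This is a sketch only; the `percentile` function and the sample values are hypothetical, and a given StatsD implementation or backend may compute percentiles differently.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


latencies_ms = [12, 15, 18, 20, 22, 25, 31, 40, 95, 240]  # assumed sample
print("min", min(latencies_ms), "max", max(latencies_ms))
print("avg", sum(latencies_ms) / len(latencies_ms))
for pct in (50, 90, 95, 99):
    print(f"p{pct}", percentile(latencies_ms, pct))
```

In this sample the average is pulled up by two slow requests while p50 stays low, which is why percentiles are often preferred over averages for latency.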

5. Set data

Used to count unique values.

unique.users:raj@example.com|s
unique.ips:192.168.1.10|s
unique.sessions:abc123|s

Common use cases:

Unique users
Unique sessions
Unique visitors
Unique IP addresses
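
A set metric boils down to remembering which members were seen during an interval and reporting how many there were. The toy aggregator below illustrates that behaviour; `SetAggregator` is a hypothetical class, not StatsD's own code.

```python
from collections import defaultdict


class SetAggregator:
    """Track unique members per set metric until the next flush."""

    def __init__(self):
        self._members = defaultdict(set)

    def record(self, line):
        """Accept a '<name>:<member>|s' line; other types are ignored."""
        name, _, rest = line.partition(":")
        member, _, metric_type = rest.partition("|")
        if metric_type == "s":
            self._members[name].add(member)

    def flush(self):
        """Return unique counts per metric and start a new interval."""
        counts = {name: len(members) for name, members in self._members.items()}
        self._members.clear()
        return counts


aggregator = SetAggregator()
for line in ("unique.ips:192.168.1.10|s", "unique.ips:192.168.1.11|s",
             "unique.ips:192.168.1.10|s"):
    aggregator.record(line)
print(aggregator.flush())  # the repeated IP is counted once
```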

Real-world examples of StatsD metrics

For a web application:

web.requests:1|c
web.errors:1|c
web.response_time:180|ms
web.active_users:35|g
web.unique_visitors:visitor123|s

For Linux/server monitoring:

system.cpu.usage:72|g
system.memory.used:8045|g
system.disk.used_percent:61|g
system.loadavg.1min:2.4|g

For business monitoring:

orders.created:1|c
orders.failed:1|c
cart.checkout_time:3200|ms
payment.amount:1299|h
active.subscriptions:840|g

In simple words

StatsD collects data like:

How many times something happened?
What is the current value?
How long did something take?
What is the distribution of values?
How many unique things happened?

So, StatsD is mainly used for:

Application performance monitoring
Infrastructure metrics
Business metrics
Custom service metrics
Dashboarding in Graphite/Grafana
Alerting on abnormal behavior