Collecting Custom Metrics on AKS with Telegraf

Diego Casati · Microsoft App Innovation Global Blackbelt team · 13 min read

What if you need to collect your own custom metrics from workloads or nodes in AKS, but don't want to run a full monitoring stack? In this post, we'll show how to integrate custom metrics into Azure's managed monitoring stack with minimal setup: a Telegraf DaemonSet for flexible metric collection, Azure Monitor managed service for Prometheus for scraping and storage, and Azure Managed Grafana for visualization and alerting.

By default, AKS and Azure Monitor give you a rich set of out-of-the-box insights: CPU and memory utilization, pod restarts, node health, and Kubernetes control plane metrics. But many teams need more visibility into what’s happening inside their workloads — for example:

  • Application-specific metrics such as API request latency or queue depth
  • Custom business metrics like transactions per second or user sessions
  • System-level data such as network interface stats, disk I/O, or custom log counters

Traditionally, enabling this kind of deep observability required deploying and managing a full Prometheus stack — configuring storage, scaling scrapers, and handling upgrades. That adds operational complexity, especially when all you need is a few targeted custom metrics. This is where Azure Monitor managed service for Prometheus comes in — it takes care of high availability, storage, and scaling, so you can focus entirely on defining the metrics that matter most. And by using Telegraf as a lightweight collector, you can easily publish custom metrics from your workloads or nodes directly into your managed monitoring environment, with no self-managed Prometheus servers required.

While our example uses network metrics, the same pattern applies to any custom data source you want to monitor in AKS. If you want to take this example one step further, try the hands-on AKS Labs module Advanced Observability Concepts and the Observability with Managed Prometheus and Managed Grafana session at the Microsoft Reactor.

A common question we hear from Kubernetes users is:

"How can I scrape a specific set of custom metrics from my cluster without adding too much operational overhead?"

Before we dive into the setup, let’s look at why this approach is so effective for extending AKS monitoring.

Why This Matters

If you run workloads on AKS, you already get a solid baseline of metrics out of the box — node and pod resource usage, cluster health, and control plane telemetry through Azure Monitor. But those built-in signals don’t always tell the full story.

Engineers often need visibility into what’s actually happening inside their workloads or on the host — things like:

  • Network throughput or packet drops on specific interfaces.
  • Application-level metrics like queue depth or request latency.
  • Custom counters from scripts, logs, or local daemons.

You could deploy a full Prometheus stack to get those insights, but that means managing storage, scaling scrapers, maintaining alert rules, and patching over time. For many teams, that’s more operational effort than it’s worth — especially when you just need a handful of custom metrics.

This approach combines Telegraf, Azure Monitor managed service for Prometheus, and Azure Managed Grafana to bridge that gap. Telegraf runs as a DaemonSet, collecting metrics from every node (or from any command or script you define) and exposing them in Prometheus format. Azure Monitor managed service for Prometheus then handles scraping, scaling, and storage — so there’s no local Prometheus to manage — and Grafana provides dashboards and alerting without extra infrastructure.

The result is a lightweight, fully managed way to extend AKS observability with exactly the metrics you care about, using standard open-source tools and Azure’s managed services.

Solution at a Glance

Here’s the workflow we’re setting up:

+-----------------+    +------------------+    +-----------------+
|    AKS Nodes    |    |  Azure Managed   |    |  Azure Managed  |
|                 |    |    Prometheus    |    |     Grafana     |
| +-------------+ |    |                  |    |                 |
| |  Telegraf   | |--->|   Scrapes via    |--->|  Dashboards &   |
| |  DaemonSet  | |    |    PodMonitor    |    |    Alerting     |
| |:2112/metrics| |    |                  |    |                 |
| +-------------+ |    +------------------+    +-----------------+
+-----------------+
  • Telegraf DaemonSet: runs on every node and collects your custom metrics.
  • Prometheus PodMonitor: automatically scrapes those metrics endpoints.
  • Azure Managed Grafana: visualizes and alerts on metrics without extra servers.

Understanding the solution

For our example, we will collect the following metrics for each network interface by parsing the output of ip -s link:

Metric                                  Type     Description
network_interface_stats_mtu            gauge    Maximum Transmission Unit
network_interface_stats_rx_bytes       counter  Received bytes
network_interface_stats_rx_packets     counter  Received packets
network_interface_stats_rx_errors      counter  Receive errors
network_interface_stats_rx_dropped     counter  Received packets dropped
network_interface_stats_rx_missed      counter  Received packets missed (the true missed field from ip -s link)
network_interface_stats_rx_multicast   counter  Received multicast packets
network_interface_stats_tx_bytes       counter  Transmitted bytes
network_interface_stats_tx_packets     counter  Transmitted packets
network_interface_stats_tx_errors      counter  Transmission errors
network_interface_stats_tx_dropped     counter  Transmitted packets dropped
network_interface_stats_tx_carrier     counter  Carrier errors
network_interface_stats_tx_collisions  counter  Collision errors
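
For context, here is roughly what ip -s link prints for a single interface; the parsing script we build later walks this block line by line (the numbers are illustrative, and the column layout can vary slightly between iproute2 versions):

2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 00:0d:3a:12:34:56 brd ff:ff:ff:ff:ff:ff
    RX:  bytes   packets  errors  dropped  missed  mcast
    16876971289  4403845  0       0        0       1200
    TX:  bytes   packets  errors  dropped  carrier collsns
    2460051910   3432219  0       0        0       0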

Each metric includes the following labels:

  • cluster: AKS cluster identifier
  • environment: Environment tag (configurable)
  • host: Node hostname
  • hostname: Node hostname (duplicate for compatibility)
  • interface: Network interface name (eth0, eth1, etc.)
  • state: Interface operational state
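
Put together, a scraped sample ends up looking roughly like this (label values are illustrative):

network_interface_stats_rx_bytes{cluster="aks-telegraf",environment="aks",host="aks-node-1",hostname="aks-node-1",interface="eth0",state="up"} 16876971289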

Set up your environment variables and placeholders

In these next steps, we will set up a new AKS cluster, an Azure Managed Grafana instance, and an Azure Monitor Workspace.

export RG_NAME="rg-telegraf-on-aks"
export LOCATION="westus3"

# Azure Kubernetes Service Cluster
export AKS_CLUSTER_NAME="telegraf-on-aks"

# Azure Managed Grafana
export GRAFANA_NAME="aks-blog-${RANDOM}"

# Azure Monitor Workspace
export AZ_MONITOR_WORKSPACE_NAME="telegraf-on-aks"

Next, let's create our solution:

# Create resource group
az group create --name ${RG_NAME} --location ${LOCATION}

# Create an Azure Monitor Workspace
az monitor account create \
  --resource-group ${RG_NAME} \
  --location ${LOCATION} \
  --name ${AZ_MONITOR_WORKSPACE_NAME}

# Get the Azure Monitor Workspace ID
AZ_MONITOR_WORKSPACE_ID=$(az monitor account show \
  --resource-group ${RG_NAME} \
  --name ${AZ_MONITOR_WORKSPACE_NAME} \
  --query id -o tsv)
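
If you want a quick sanity check, the variable should now hold the workspace's full ARM resource ID:

# Should print something like (subscription and casing will vary):
# /subscriptions/<subscription-id>/resourceGroups/rg-telegraf-on-aks/providers/microsoft.monitor/accounts/telegraf-on-aks
echo ${AZ_MONITOR_WORKSPACE_ID}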

Create a Grafana instance. We'll use the Azure CLI extension for Azure Managed Grafana (amg) for this.

# Add the Azure Managed Grafana extension to az cli:
az extension add --name amg

# Create an Azure Managed Grafana instance:
az grafana create \
  --name ${GRAFANA_NAME} \
  --resource-group ${RG_NAME} \
  --location ${LOCATION}

# Once created, save the Grafana resource ID
GRAFANA_RESOURCE_ID=$(az grafana show \
  --name ${GRAFANA_NAME} \
  --resource-group ${RG_NAME} \
  --query id -o tsv)

We can now create the cluster, passing both the --grafana-resource-id and --azure-monitor-workspace-resource-id flags during cluster creation:

# Create the AKS cluster
az aks create \
  --name ${AKS_CLUSTER_NAME} \
  --resource-group ${RG_NAME} \
  --node-count 1 \
  --enable-managed-identity \
  --enable-azure-monitor-metrics \
  --grafana-resource-id ${GRAFANA_RESOURCE_ID} \
  --azure-monitor-workspace-resource-id ${AZ_MONITOR_WORKSPACE_ID}

# Get the cluster credentials
az aks get-credentials \
  --name ${AKS_CLUSTER_NAME} \
  --resource-group ${RG_NAME}
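
Before moving on, confirm that kubectl can reach the new cluster:

# The single node we created should report Ready
kubectl get nodes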

Verify that the PodMonitor CRD is now available in your cluster:

# Check if PodMonitor CRD exists
kubectl get crd | grep podmonitor

# Expected output (Azure Monitor managed service for Prometheus):
# podmonitors.azmonitoring.coreos.com 2025-07-23T19:12:02Z
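
Since the cluster was created with --enable-azure-monitor-metrics, the managed Prometheus agent pods (ama-metrics) should also be running in kube-system:

# The Azure Monitor metrics agent pods should be Running
kubectl get pods -n kube-system | grep ama-metrics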

Deploying the solution

We’ll deploy a set of YAML manifests that contain:

  • Two ConfigMaps: one for the Telegraf config and one for the custom metric script (parse_ip_stats.sh)
  • DaemonSet to run Telegraf on each node
  • ServiceAccount, Service, and PodMonitor
  1. Create the Telegraf Configuration

    First, let's create the main Telegraf configuration:

    cat <<EOF > 01-configmap.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: telegraf-config
      namespace: default
    data:
      telegraf.conf: |
        [global_tags]
          environment = "aks"
          cluster = "aks-telegraf"

        [agent]
          interval = "30s"
          round_interval = true
          metric_batch_size = 1000
          metric_buffer_limit = 10000
          collection_jitter = "5s"
          flush_interval = "30s"
          flush_jitter = "5s"
          precision = ""
          hostname = "\$HOSTNAME"
          omit_hostname = false

        # Custom script to parse ip -s link output
        [[inputs.exec]]
          commands = ["/usr/local/bin/parse_ip_stats.sh"]
          timeout = "10s"
          data_format = "influx"
          name_override = "network_interface_stats"

        # Prometheus metrics output
        [[outputs.prometheus_client]]
          listen = ":2112"
          metric_version = 2
          path = "/metrics"
          expiration_interval = "60s"
          collectors_exclude = ["gocollector", "process"]
    EOF
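
    Note that the backslash in \$HOSTNAME keeps your local shell from expanding the variable while writing the file; it must land in telegraf.conf as the literal string \$HOSTNAME so Telegraf can resolve it from the container environment at startup. A quick way to confirm:

    # Should print the literal line: hostname = "$HOSTNAME"
    grep HOSTNAME 01-configmap.yaml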
  2. Create the Network Parsing Script

    Now create the ConfigMap containing our custom script that parses network interface statistics:

    cat <<EOF > 02-scripts-configmap.yaml
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: telegraf-scripts
      namespace: default
    data:
      parse_ip_stats.sh: |
        #!/bin/bash
        # Script to parse ip -s link output and convert to InfluxDB line protocol

        # Get the current timestamp in nanoseconds
        timestamp=\$(date +%s%N)
        hostname=\$(hostname)

        # Parse ip -s link output for network statistics
        ip -s link | awk -v ts="\$timestamp" -v host="\$hostname" '
        BEGIN {
          interface = "";
          state = "";
          mtu = 0;
        }

        # Parse interface line (e.g., "2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...")
        /^[0-9]+:/ {
          # Extract interface name (handle both regular and @ notation)
          if (match(\$0, /^[0-9]+: ([^:@]+)/)) {
            interface_match = substr(\$0, RSTART, RLENGTH);
            # Remove the number and colon prefix, then trim spaces
            gsub(/^[0-9]+: */, "", interface_match);
            interface = interface_match;
          }

          # Extract state from flags
          if (match(\$0, /<[^>]+>/)) {
            flags = substr(\$0, RSTART+1, RLENGTH-2);
            if (index(flags, "UP")) {
              state = "up";
            } else {
              state = "down";
            }
          }

          # Extract MTU
          if (match(\$0, /mtu [0-9]+/)) {
            mtu_str = substr(\$0, RSTART+4, RLENGTH-4);
            mtu = mtu_str + 0;
          }
        }

        # Parse RX line header (RX: bytes packets errors dropped missed mcast)
        /^[[:space:]]*RX:.*bytes.*packets.*errors.*dropped.*missed.*mcast/ {
          getline; # Get the next line with the actual numbers
          gsub(/^[[:space:]]+/, ""); # Remove leading spaces
          n = split(\$0, rx_fields);
          if (n >= 6) {
            rx_bytes = rx_fields[1];
            rx_packets = rx_fields[2];
            rx_errors = rx_fields[3];
            rx_dropped = rx_fields[4];
            rx_missed = rx_fields[5];
            rx_multicast = rx_fields[6];
          }
        }

        # Parse TX line header (TX: bytes packets errors dropped carrier collsns)
        /^[[:space:]]*TX:.*bytes.*packets.*errors.*dropped.*carrier.*collsns/ {
          getline; # Get the next line with the actual numbers
          gsub(/^[[:space:]]+/, ""); # Remove leading spaces
          n = split(\$0, tx_fields);
          if (n >= 6 && interface != "" && interface != "lo") {
            tx_bytes = tx_fields[1];
            tx_packets = tx_fields[2];
            tx_errors = tx_fields[3];
            tx_dropped = tx_fields[4];
            tx_carrier = tx_fields[5];
            tx_collisions = tx_fields[6];

            # Output metrics after processing both RX and TX (skip loopback)
            printf "network_interface_stats,interface=%s,hostname=%s,state=\"%s\" ", interface, host, state;
            printf "mtu=%si,", mtu;
            printf "rx_bytes=%si,rx_packets=%si,rx_errors=%si,rx_dropped=%si,rx_missed=%si,rx_multicast=%si,", rx_bytes, rx_packets, rx_errors, rx_dropped, rx_missed, rx_multicast;
            printf "tx_bytes=%si,tx_packets=%si,tx_errors=%si,tx_dropped=%si,tx_carrier=%si,tx_collisions=%si ", tx_bytes, tx_packets, tx_errors, tx_dropped, tx_carrier, tx_collisions;
            printf "%s\n", ts;
          }
        }
        '
    EOF
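
    For a single interface, the script emits one line of InfluxDB line protocol per collection interval, roughly like this (values are illustrative):

    network_interface_stats,interface=eth0,hostname=aks-node-1,state="up" mtu=1500i,rx_bytes=16876971289i,rx_packets=4403845i,rx_errors=0i,rx_dropped=0i,rx_missed=0i,rx_multicast=1200i,tx_bytes=2460051910i,tx_packets=3432219i,tx_errors=0i,tx_dropped=0i,tx_carrier=0i,tx_collisions=0i 1753302722000000000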
  3. Create the Service Account

    Create a service account for RBAC permissions:

    cat <<EOF > 03-serviceaccount.yaml
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: telegraf-sa
      namespace: default
    EOF
  4. Create the DaemonSet

    Now create the main DaemonSet that runs Telegraf on each node:

    cat <<EOF > 04-daemonset.yaml
    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: telegraf
      namespace: default
      labels:
        app: telegraf
    spec:
      selector:
        matchLabels:
          app: telegraf
      template:
        metadata:
          labels:
            app: telegraf
        spec:
          serviceAccountName: telegraf-sa
          hostNetwork: true
          hostPID: true
          tolerations:
          - key: node-role.kubernetes.io/master
            operator: Exists
            effect: NoSchedule
          - key: node-role.kubernetes.io/control-plane
            operator: Exists
            effect: NoSchedule
          containers:
          - name: telegraf
            image: telegraf:1.28
            env:
            - name: HOSTNAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            ports:
            - name: prometheus
              containerPort: 2112
              protocol: TCP
            securityContext:
              privileged: true
              runAsUser: 0
            volumeMounts:
            - name: telegraf-config
              mountPath: /etc/telegraf
            - name: telegraf-scripts
              mountPath: /scripts
            - name: proc
              mountPath: /host/proc
              readOnly: true
            - name: sys
              mountPath: /host/sys
              readOnly: true
            - name: var-run-docker
              mountPath: /var/run/docker.sock
              readOnly: true
            resources:
              requests:
                memory: "64Mi"
                cpu: "100m"
              limits:
                memory: "128Mi"
                cpu: "200m"
            command:
            - /bin/bash
            - -c
            - |
              # Install iproute2 if not present
              if ! command -v ip > /dev/null 2>&1; then
                apt-get update && apt-get install -y iproute2
              fi

              # Copy the parsing script to the expected location
              cp /scripts/parse_ip_stats.sh /usr/local/bin/parse_ip_stats.sh
              chmod +x /usr/local/bin/parse_ip_stats.sh

              # Start telegraf
              exec telegraf --config /etc/telegraf/telegraf.conf
          volumes:
          - name: telegraf-config
            configMap:
              name: telegraf-config
              defaultMode: 0755
          - name: telegraf-scripts
            configMap:
              name: telegraf-scripts
              defaultMode: 0755
          - name: proc
            hostPath:
              path: /proc
          - name: sys
            hostPath:
              path: /sys
          - name: var-run-docker
            hostPath:
              path: /var/run/docker.sock
          terminationGracePeriodSeconds: 30
    EOF
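
    If you'd like, validate the manifest before applying it; a client-side dry run catches YAML and schema mistakes early:

    # Client-side validation only; nothing is created in the cluster
    kubectl apply --dry-run=client -f 04-daemonset.yaml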
  5. Create the Service

    Create a service to expose the Prometheus metrics endpoint:

    cat <<EOF > 05-service.yaml
    apiVersion: v1
    kind: Service
    metadata:
      name: telegraf-metrics
      namespace: default
      labels:
        app: telegraf
    spec:
      selector:
        app: telegraf
      ports:
      - name: prometheus
        port: 2112
        targetPort: 2112
        protocol: TCP
      type: ClusterIP
    EOF
  6. Create the PodMonitor

    Finally, create the PodMonitor that tells Azure Monitor managed service for Prometheus to scrape our metrics:

    cat <<EOF > 06-podmonitor.yaml
    apiVersion: azmonitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: telegraf-podmonitor
      namespace: default
      labels:
        app: telegraf
    spec:
      selector:
        matchLabels:
          app: telegraf
      podMetricsEndpoints:
      - port: prometheus
        interval: 30s
        path: /metrics
    EOF
  7. Deploy All Components

    Now deploy all the components in order:

    kubectl apply -f 01-configmap.yaml
    kubectl apply -f 02-scripts-configmap.yaml
    kubectl apply -f 03-serviceaccount.yaml
    kubectl apply -f 04-daemonset.yaml
    kubectl apply -f 05-service.yaml
    kubectl apply -f 06-podmonitor.yaml
  8. Verification

    After a minute or two, verify everything is running:

    kubectl get daemonset telegraf
    kubectl get pods -l app=telegraf
    kubectl get service telegraf-metrics
    kubectl get podmonitor telegraf-podmonitor
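
    You can also run the parsing script by hand inside one of the Telegraf pods to inspect its raw line-protocol output (pod names will differ in your cluster):

    # Grab the first telegraf pod and execute the script directly
    TELEGRAF_POD=$(kubectl get pods -l app=telegraf -o jsonpath='{.items[0].metadata.name}')
    kubectl exec ${TELEGRAF_POD} -- /usr/local/bin/parse_ip_stats.sh | head -3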
  9. Validate the Metrics

    You can check that the new metrics are being collected correctly by forwarding the telegraf-metrics service port locally and then running curl against it:

    kubectl port-forward svc/telegraf-metrics 2112:2112 &
    curl http://localhost:2112/metrics | head -20

    Sample output:

    # HELP network_interface_stats_rx_bytes Telegraf collected metric
    # TYPE network_interface_stats_rx_bytes untyped
    network_interface_stats_rx_bytes{interface="eth0",host="aks-node-1"} 16876971289

    Great! At this point we know that our collection is working. Next we will look into how to visualize these new metrics in Grafana.

Visualize in Grafana

You can now go to your new Azure Managed Grafana instance and try some queries. To get the URL for your Azure Managed Grafana, run the following command:

GRAFANA_UI=$(az grafana show \
  --name ${GRAFANA_NAME} \
  --resource-group ${RG_NAME} \
  --query "properties.endpoint" -o tsv)

echo "Your Azure Managed Grafana is accessible at: $GRAFANA_UI"

Now that you know the URL, open Azure Managed Grafana and go to the Drilldown tab.

Make sure the Data Source is Managed_Prometheus_telegraf-on-aks.

Try searching for network_interface_ metrics. You should see all of the new metrics being collected by Telegraf.

Next, you can create table panels with Instant queries for top-N views, or time series panels for trends over time. Here are some suggested queries:

# Network throughput by node
sum(rate(network_interface_stats_rx_bytes[5m])) by (host)

# Top interfaces by traffic
topk(10, network_interface_stats_tx_bytes)

# Packet drops
sum(rate(network_interface_stats_rx_dropped[5m])) by (interface)
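
The same queries can back Grafana alert rules. As a starting point, here is one possible alert condition for packet drops (the 10 packets/s threshold is purely illustrative; tune it for your environment):

# Fire when any interface on any node sustains packet drops for 5 minutes
sum(rate(network_interface_stats_rx_dropped[5m])) by (host, interface) > 10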

Cleaning up

To remove these resources, you can run this command:

az group delete --name ${RG_NAME} --yes --no-wait

Conclusion

In this post, we walked through an approach to integrating custom metrics into Azure’s managed monitoring stack with minimal setup: a Telegraf DaemonSet for flexible metric collection, Azure Monitor managed service for Prometheus for scraping and storage, and Azure Managed Grafana for visualization and alerting.

While our example used network metrics, the same pattern applies to any custom data source you want to monitor in AKS. To take it one step further, try the hands-on AKS Labs module Advanced Observability Concepts and the Observability with Managed Prometheus and Managed Grafana session at the Microsoft Reactor.