
Observability for Kubernetes: Metrics, Logs, and Traces

Overview

Monitoring is crucial for operating Kubernetes clusters effectively. This section covers the key monitoring concepts and tools for Kubernetes observability.

Three Pillars of Observability

| Pillar  | Purpose                        | Tools                          |
|---------|--------------------------------|--------------------------------|
| Metrics | Numerical time-series data     | Prometheus, Metrics Server     |
| Logs    | Event records and debugging    | Loki, ELK, Fluentd             |
| Traces  | Request paths through systems  | Jaeger, Tempo, OpenTelemetry   |

Monitoring Stack

┌──────────────────────────────────────────────────────────────┐
│                       Monitoring Stack                       │
│                                                              │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                Applications & Services                 │  │
│  │  • Emit Metrics  • Generate Logs  • Propagate Traces   │  │
│  └────────────────────────────────────────────────────────┘  │
│                              ↓                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Collection Agents                    │  │
│  │   • Metrics Exporter  • Log Collector  • Trace Agent   │  │
│  └────────────────────────────────────────────────────────┘  │
│                              ↓                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                   Storage & Backend                    │  │
│  │ • Prometheus (Metrics)  • Loki (Logs)  • Tempo (Traces)│  │
│  └────────────────────────────────────────────────────────┘  │
│                              ↓                               │
│  ┌────────────────────────────────────────────────────────┐  │
│  │                Visualization & Alerting                │  │
│  │    • Grafana (Dashboards)  • Alertmanager (Alerts)     │  │
│  └────────────────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────────────────┘

Quick Start

Install Metrics Server

bash
# Check if metrics-server is installed
kubectl get pods -n kube-system | grep metrics-server

# Install (if not present)
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Verify
kubectl top nodes
kubectl top pods
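On local clusters (kind, minikube, and similar), Metrics Server often fails to start because kubelets serve self-signed certificates. A common workaround, sketched below for dev clusters only, is to add the --kubelet-insecure-tls flag; the patch assumes the default deployment name metrics-server in kube-system.

bash
# Dev clusters only: let Metrics Server scrape kubelets with self-signed certs
kubectl patch deployment metrics-server -n kube-system --type=json \
  -p='[{"op": "add", "path": "/spec/template/spec/containers/0/args/-", "value": "--kubelet-insecure-tls"}]'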

Install Prometheus Stack

bash
# Add the prometheus-community Helm repo (hosts kube-prometheus-stack)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install
helm install prometheus prometheus-community/kube-prometheus-stack -n monitoring --create-namespace

# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open http://localhost:3000
# Default credentials: admin/prom-operator

Metrics

Key Metrics to Monitor

| Type        | Metrics                                      |
|-------------|----------------------------------------------|
| Cluster     | Node health, resource usage, pod counts      |
| Pod         | CPU, memory, restarts, uptime                |
| Application | Custom business metrics (requests, latency)  |
| Network     | Traffic, errors, latency                     |

Using Metrics Server

bash
# Check node resource usage
kubectl top nodes

# Check pod resource usage
kubectl top pods -A

# Check pods in namespace
kubectl top pods -n kube-system

Using Prometheus

bash
# Port forward to Prometheus (service name for a Helm release called "prometheus")
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Query metrics (PromQL)
# Example queries:
# - CPU usage: rate(container_cpu_usage_seconds_total[5m])
# - Memory usage: container_memory_working_set_bytes
# - Pod restarts: rate(kube_pod_container_status_restarts_total[1h])
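The same PromQL can also be run outside the UI through Prometheus's HTTP API, which is handy for scripting checks. A minimal sketch, assuming the port-forward above is active and jq is installed:

bash
# Instant query via the Prometheus HTTP API
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(container_cpu_usage_seconds_total[5m])) by (namespace)' \
  | jq '.data.result'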

Logging

Kubernetes Logging Architecture

Application Logs
       ↓
stdout/stderr (Container)
       ↓
kubelet (Node)
       ↓
Log Collector (Fluentd/Fluent Bit)
       ↓
Central Logging (Loki/ELK)
       ↓
Visualization (Grafana/Kibana)
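Getting from per-node stdout/stderr to a central backend requires a collector on every node, usually deployed as a DaemonSet. A minimal sketch using the official Fluent Bit Helm chart; the chart's default output will not match your backend, so set the outputs in values for Loki, Elasticsearch, or whatever you run:

bash
# Deploy Fluent Bit as a DaemonSet (one collector pod per node)
helm repo add fluent https://fluent.github.io/helm-charts
helm repo update
helm install fluent-bit fluent/fluent-bit -n logging --create-namespace

# Verify there is one collector pod per node
kubectl get pods -n logging -o wide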

View Pod Logs

bash
# View logs
kubectl logs <pod-name>

# Follow logs (stream)
kubectl logs -f <pod-name>

# View logs from previous container
kubectl logs <pod-name> --previous

# Stream logs from all pods matching a label (e.g. a Deployment's pods)
kubectl logs -l app=nginx --tail=100 -f

Logs from Multiple Pods

bash
# Logs from all replicas
kubectl logs -l app=ml-model --tail=50

# Logs with timestamps
kubectl logs -f <pod-name> --timestamps=true

# Logs since time
kubectl logs --since-time=2025-01-15T10:00:00Z <pod-name>
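For multi-container pods, kubectl needs to know which container's logs you want; these are standard kubectl flags:

bash
# Logs from a specific container in a pod
kubectl logs <pod-name> -c <container-name>

# Logs from every container in pods matching a selector
kubectl logs -l app=ml-model --all-containers=true --tail=50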

Tracing

Distributed Tracing Concepts

| Concept  | Description                        |
|----------|------------------------------------|
| Trace    | End-to-end journey of a request    |
| Span     | Single operation within a trace    |
| Trace ID | Unique identifier for a trace      |
| Span ID  | Unique identifier for a span       |
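A trace is stitched together from spans by propagating the trace context between services, most commonly in HTTP headers. A sketch using the W3C Trace Context traceparent header (format: version-traceid-spanid-flags); the service URL is a placeholder:

bash
# W3C Trace Context header: 00-<32-hex trace ID>-<16-hex span ID>-<flags>
curl http://ml-model:8080/predict \
  -H 'traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01'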

Installing Jaeger

bash
# Install Jaeger operator
kubectl create namespace observability
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/crds/jaegertracing.io_jaegers_crd.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/service_account.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/role_binding.yaml
kubectl apply -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/main/deploy/operator.yaml

# Create Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: allInOne
  allInOne:
    image: jaegertracing/all-in-one:latest
  ui:
    options:
      logLevel: info
EOF

# Access UI
kubectl port-forward -n observability svc/jaeger-query 16686:16686
# Open http://localhost:16686
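Applications still need to export spans to the collector. Recent all-in-one images accept OTLP, so an OpenTelemetry-instrumented service can be pointed at the collector service directly; a sketch assuming a deployment named ml-model and that OTLP ingest is enabled in the image:

bash
# Point an OpenTelemetry SDK at the Jaeger collector (OTLP/gRPC, port 4317)
kubectl set env deployment/ml-model \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger-collector.observability.svc:4317 \
  OTEL_SERVICE_NAME=ml-model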

Alerts

Alertmanager

bash
# List the alert rules loaded by the operator
kubectl get prometheusrules -n monitoring

# Port forward to Alertmanager (service name for a Helm release called "prometheus")
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-alertmanager 9093:9093

Alert Example

yaml
# Example alert rule
groups:
- name: ml-model-alerts
  rules:
  - alert: HighErrorRate
    expr: rate(http_requests_total{status="500"}[5m]) > 0.05
    for: 5m
    labels:
      severity: critical
      team: ml-ops
    annotations:
      summary: "High error rate on ML model API"
      description: "Error rate is {{ $value }} errors/sec"
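With kube-prometheus-stack, rules are delivered as PrometheusRule custom resources rather than raw config files. A sketch wrapping the rule above; the release: prometheus label is an assumption matching the chart's default rule selector for a Helm release named prometheus:

bash
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ml-model-alerts
  namespace: monitoring
  labels:
    release: prometheus  # must match the chart's ruleSelector
spec:
  groups:
  - name: ml-model-alerts
    rules:
    - alert: HighErrorRate
      expr: rate(http_requests_total{status="500"}[5m]) > 0.05
      for: 5m
      labels:
        severity: critical
EOF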

Dashboards

Key Dashboards to Create

  1. Cluster Overview - Node health, resource usage
  2. Pod Overview - Pod status, restarts, resource usage
  3. Application Metrics - Custom business metrics
  4. ML Model Metrics - Inference latency, throughput, error rates
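With kube-prometheus-stack, Grafana runs a sidecar that auto-loads any ConfigMap labeled grafana_dashboard. A sketch, assuming a dashboard exported as my-dashboard.json (a placeholder file name):

bash
# Provision a dashboard via Grafana's ConfigMap sidecar
kubectl create configmap ml-dashboard -n monitoring --from-file=my-dashboard.json
kubectl label configmap ml-dashboard -n monitoring grafana_dashboard=1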

Sample Grafana Queries

promql
# CPU Usage by Pod
sum(rate(container_cpu_usage_seconds_total{namespace="ml-apps"}[5m])) by (pod)

# Memory Usage by Pod
sum(container_memory_working_set_bytes{namespace="ml-apps"}) by (pod)

# Pod Restart Count
increase(kube_pod_container_status_restarts_total{namespace="ml-apps"}[1h])

# HTTP Request Rate
rate(http_requests_total{namespace="ml-apps"}[5m])

# P95 Latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

Best Practices

  1. Enable metrics server for basic resource monitoring
  2. Use labels effectively for grouping metrics
  3. Set up alerts for critical failures
  4. Create dashboards for quick visualization
  5. Centralize logs for analysis
  6. Implement tracing for microservices
  7. Monitor the monitoring stack itself (see the sketch below)
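For point 7, kube-prometheus-stack ships an always-firing Watchdog alert: if it ever stops arriving, the alerting pipeline itself is broken. A sketch checking for it via the Alertmanager API, assuming the Alertmanager port-forward from above is active:

bash
# The Watchdog alert should always be present; its absence means the
# Prometheus -> Alertmanager pipeline is down
curl -s 'http://localhost:9093/api/v2/alerts?filter=alertname%3D%22Watchdog%22'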

Monitoring Tools Comparison

| Tool           | Purpose                 | Complexity |
|----------------|-------------------------|------------|
| Metrics Server | Basic resource metrics  | Low        |
| Prometheus     | Full metrics stack      | Medium     |
| Grafana        | Visualization           | Low        |
| Loki           | Log aggregation         | Medium     |
| Jaeger         | Distributed tracing     | High       |
| Elastic Stack  | Logs + metrics          | High       |

Next Steps

  1. Install Metrics Server: [Metrics Server Docs](https://github.com/kubernetes-sigs/metrics-server)
  2. Deploy Prometheus Stack: [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack)
  3. Create Dashboards: [Grafana Dashboards](https://grafana.com/grafana/dashboards/)
