Observability with Grafana & Loki

Warmwind runs hundreds of ephemeral AI agent Pods, streams VNC frames over WebSockets, and processes job queues through BullMQ -- all of which can fail silently without proper observability. This article covers the three pillars (metrics with Prometheus, logs with Loki, traces with OpenTelemetry/Tempo), wiring each into a NestJS backend with real code. It then builds Grafana dashboards for the metrics that matter (active agents, frame rate, API error rate, connection pool health, queue depth, memory per pod), configures alerting rules with PagerDuty/Slack integration, defines SLOs and error budgets for agent session reliability, and includes a full Grafana provisioning configuration.


Glossary

Prometheus -- A pull-based time-series database. It scrapes HTTP /metrics endpoints at configured intervals and stores samples as {metric_name}{labels} value timestamp.

Loki -- Grafana's log aggregation system. Unlike Elasticsearch, Loki indexes only labels (not full text), making it cheaper to operate. Logs are queried with LogQL.

Tempo -- Grafana's distributed tracing backend. Stores trace spans and correlates them with logs and metrics via trace IDs.

OpenTelemetry (OTel) -- A vendor-neutral observability framework providing APIs and SDKs for metrics, logs, and traces. The standard instrumentation layer for modern backends.

LogQL -- Loki's query language. Combines label matchers ({app="api"}) with pipeline stages (| json | status >= 500).

Counter -- A Prometheus metric type that only goes up (e.g., total HTTP requests). Use rate() to compute per-second values.

Histogram -- A Prometheus metric that samples observations into configurable buckets (e.g., request duration). Enables percentile computation via histogram_quantile().

Gauge -- A Prometheus metric that can go up or down (e.g., active connections, queue depth, memory usage).

Span -- A single unit of work in a distributed trace. Spans have a start time, duration, parent span ID, and key-value attributes.

Trace context -- Metadata (trace ID, span ID, trace flags) propagated across service boundaries via HTTP headers (traceparent) or message metadata.

SLO -- Service Level Objective; a target for a measurable reliability metric (e.g., "99.5% of agent sessions complete without error").

Error budget -- The acceptable amount of unreliability within an SLO period. If your SLO is 99.5%, your error budget is 0.5% of total sessions.

Structured logging -- Emitting log lines as JSON objects with typed fields (level, message, traceId, sessionId) instead of unstructured text. Enables machine parsing and label extraction.
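To make the SLO and error budget entries concrete, here is the arithmetic as a tiny sketch (the session volume is a hypothetical number, not a Warmwind figure):

```typescript
// Error budget arithmetic for the glossary's example SLO: 99.5% of agent
// sessions must complete without error over the measurement window.
const slo = 0.995;
const totalSessions = 40_000; // hypothetical monthly session volume

// Budget = sessions allowed to fail before the SLO is breached
const errorBudget = Math.round((1 - slo) * totalSessions);

const failedSoFar = 120;
const budgetRemaining = errorBudget - failedSoFar;

console.log(errorBudget, budgetRemaining); // 200 80
```

Once the budget remaining hits zero, the usual policy is to freeze risky releases until reliability recovers.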


1. The Three Pillars

graph LR
    APP["NestJS Application"] -->|"/metrics endpoint<br/>prom-client"| PROM["Prometheus<br/>(Metrics)"]
    APP -->|"JSON stdout<br/>pino logger"| LOKI["Loki<br/>(Logs)"]
    APP -->|"OTLP gRPC<br/>@opentelemetry/sdk"| TEMPO["Tempo<br/>(Traces)"]
    PROM --> GRAFANA["Grafana<br/>(Dashboards + Alerts)"]
    LOKI --> GRAFANA
    TEMPO --> GRAFANA
    GRAFANA -->|"alerting rules"| PD["PagerDuty"]
    GRAFANA -->|"alerting rules"| SLACK["Slack"]
| Pillar | Tool | What It Answers | Warmwind Example |
|---|---|---|---|
| Metrics | Prometheus | "How much?" / "How fast?" | Active agents: 142, API P99: 45ms, BullMQ waiting: 7 |
| Logs | Loki | "What happened?" | agent-xyz failed: OOM killed at 4096MB RSS |
| Traces | Tempo | "Where did time go?" | HTTP request -> GraphQL resolver -> TypeORM query -> BullMQ job -> Agent Pod creation (2.3s total, 1.8s in K8s API) |

Metrics tell you THAT something is wrong; logs tell you WHAT; traces tell you WHERE

If you have used journalctl to debug systemd services and perf / strace to profile processes, think of Prometheus as the system-wide sar equivalent, Loki as centralized journalctl with a query language, and Tempo as distributed strace that follows a request across process boundaries.


2. Prometheus Metrics in NestJS

2.1 Setup with prom-client

The prom-client library is the standard Prometheus client for Node.js. The controller below exposes a /metrics endpoint for Prometheus to scrape.

// metrics.module.ts
import { Module, Global } from '@nestjs/common';
import { MetricsController } from './metrics.controller';
import { MetricsService } from './metrics.service';

@Global()
@Module({
  controllers: [MetricsController],
  providers: [MetricsService],
  exports: [MetricsService],
})
export class MetricsModule {}
// metrics.controller.ts
import { Controller, Get, Header } from '@nestjs/common';
import { register } from 'prom-client';

@Controller('metrics')
export class MetricsController {
  @Get()
  @Header('Content-Type', register.contentType)
  async getMetrics(): Promise<string> {
    return register.metrics();
  }
}
// metrics.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import {
  Counter,
  Histogram,
  Gauge,
  collectDefaultMetrics,
  register,
} from 'prom-client';

@Injectable()
export class MetricsService implements OnModuleInit {
  // ── HTTP Metrics ───────────────────────────────────────────
  readonly httpRequestsTotal: Counter;
  readonly httpRequestDuration: Histogram;

  // ── Agent Session Metrics ──────────────────────────────────
  readonly activeAgentSessions: Gauge;
  readonly agentSessionsCreatedTotal: Counter;
  readonly agentSessionDuration: Histogram;

  // ── VNC Streaming Metrics ──────────────────────────────────
  readonly vncFrameRate: Gauge;
  readonly vncDroppedFramesTotal: Counter;
  readonly vncClientBufferBytes: Gauge;

  // ── BullMQ Metrics ─────────────────────────────────────────
  readonly bullmqWaitingJobs: Gauge;
  readonly bullmqActiveJobs: Gauge;
  readonly bullmqCompletedTotal: Counter;
  readonly bullmqFailedTotal: Counter;
  readonly bullmqJobDuration: Histogram;

  // ── Database Metrics ───────────────────────────────────────
  readonly pgPoolActiveConnections: Gauge;
  readonly pgPoolIdleConnections: Gauge;
  readonly pgQueryDuration: Histogram;

  // ── WebSocket Metrics ──────────────────────────────────────
  readonly wsActiveConnections: Gauge;
  readonly wsMessagesTotal: Counter;

  constructor() {
    this.httpRequestsTotal = new Counter({
      name: 'http_requests_total',
      help: 'Total number of HTTP requests',
      labelNames: ['method', 'route', 'status_code'],
    });

    this.httpRequestDuration = new Histogram({
      name: 'http_request_duration_seconds',
      help: 'HTTP request duration in seconds',
      labelNames: ['method', 'route', 'status_code'],
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
    });

    this.activeAgentSessions = new Gauge({
      name: 'agent_sessions_active',
      help: 'Number of currently active agent sessions',
      labelNames: ['tenant'],
    });

    this.agentSessionsCreatedTotal = new Counter({
      name: 'agent_sessions_created_total',
      help: 'Total agent sessions created',
      labelNames: ['tenant', 'resource_tier'],
    });

    this.agentSessionDuration = new Histogram({
      name: 'agent_session_duration_seconds',
      help: 'Duration of agent sessions in seconds',
      labelNames: ['tenant', 'outcome'],
      buckets: [30, 60, 120, 300, 600, 1800, 3600],
    });

    this.vncFrameRate = new Gauge({
      name: 'vnc_frame_rate_fps',
      help: 'Current VNC frame rate per session',
      labelNames: ['session_id'],
    });

    this.vncDroppedFramesTotal = new Counter({
      name: 'vnc_dropped_frames_total',
      help: 'VNC frames dropped due to client backpressure',
      labelNames: ['session_id'],
    });

    this.vncClientBufferBytes = new Gauge({
      name: 'vnc_client_buffer_bytes',
      help: 'WebSocket send buffer size per client',
      labelNames: ['session_id'],
    });

    this.bullmqWaitingJobs = new Gauge({
      name: 'bullmq_waiting_jobs',
      help: 'Number of jobs waiting in each queue',
      labelNames: ['queue'],
    });

    this.bullmqActiveJobs = new Gauge({
      name: 'bullmq_active_jobs',
      help: 'Number of jobs currently being processed',
      labelNames: ['queue'],
    });

    this.bullmqCompletedTotal = new Counter({
      name: 'bullmq_completed_total',
      help: 'Total completed BullMQ jobs',
      labelNames: ['queue'],
    });

    this.bullmqFailedTotal = new Counter({
      name: 'bullmq_failed_total',
      help: 'Total failed BullMQ jobs',
      labelNames: ['queue'],
    });

    this.bullmqJobDuration = new Histogram({
      name: 'bullmq_job_duration_seconds',
      help: 'BullMQ job processing duration',
      labelNames: ['queue', 'job_name'],
      buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 120],
    });

    this.pgPoolActiveConnections = new Gauge({
      name: 'pg_pool_active_connections',
      help: 'Active PostgreSQL connections in the pool',
    });

    this.pgPoolIdleConnections = new Gauge({
      name: 'pg_pool_idle_connections',
      help: 'Idle PostgreSQL connections in the pool',
    });

    this.pgQueryDuration = new Histogram({
      name: 'pg_query_duration_seconds',
      help: 'PostgreSQL query duration in seconds',
      labelNames: ['query_type'],
      buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
    });

    this.wsActiveConnections = new Gauge({
      name: 'ws_active_connections',
      help: 'Active WebSocket connections',
      labelNames: ['namespace'],
    });

    this.wsMessagesTotal = new Counter({
      name: 'ws_messages_total',
      help: 'Total WebSocket messages sent and received',
      labelNames: ['namespace', 'direction'],
    });
  }

  onModuleInit(): void {
    // Collect default Node.js metrics (event loop lag, heap, GC, etc.)
    collectDefaultMetrics({ register, prefix: 'warmwind_' });
  }
}
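The counters defined above are cumulative: they only ever increase, and Prometheus recovers per-second throughput from them with rate(). A toy sketch of that computation over a scrape window (this is not the prom-client API, and real rate() also handles counter resets and extrapolation):

```typescript
// Toy illustration of what rate() computes from a cumulative Counter.
// samples: [timestampSeconds, cumulativeValue] pairs as Prometheus scrapes them.
function simpleRate(samples: [number, number][]): number {
  const [t0, v0] = samples[0];
  const [t1, v1] = samples[samples.length - 1];
  return (v1 - v0) / (t1 - t0); // per-second increase over the window
}

// http_requests_total scraped every 15s: 1200 -> 1230 -> 1275
const reqPerSec = simpleRate([[0, 1200], [15, 1230], [30, 1275]]);
console.log(reqPerSec); // 2.5
```

This is also why a counter reset (pod restart) does not corrupt dashboards: rate() detects the drop and treats it as a new counter, which a naive delta like the one above would not.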

2.2 HTTP Interceptor for Request Metrics

// metrics.interceptor.ts
import {
  Injectable,
  NestInterceptor,
  ExecutionContext,
  CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from './metrics.service';

@Injectable()
export class MetricsInterceptor implements NestInterceptor {
  constructor(private readonly metrics: MetricsService) {}

  intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
    const req = context.switchToHttp().getRequest();
    const method = req.method;
    // Use the route template, not the raw URL -- raw URLs contain IDs and
    // would explode label cardinality
    const route = req.route?.path || 'unmatched';
    const end = this.metrics.httpRequestDuration.startTimer({ method, route });

    return next.handle().pipe(
      tap({
        next: () => {
          const statusCode = context.switchToHttp().getResponse().statusCode;
          end({ status_code: statusCode });
          this.metrics.httpRequestsTotal.inc({
            method,
            route,
            status_code: statusCode,
          });
        },
        error: (err) => {
          const statusCode = err.status || 500;
          end({ status_code: statusCode });
          this.metrics.httpRequestsTotal.inc({
            method,
            route,
            status_code: statusCode,
          });
        },
      }),
    );
  }
}
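One wiring detail the snippet above leaves implicit: the interceptor has to be registered, for example globally via the APP_INTERCEPTOR token. A sketch (file paths are illustrative, not prescribed by the article):

```typescript
// app.module.ts -- register MetricsInterceptor for every route
import { Module } from '@nestjs/common';
import { APP_INTERCEPTOR } from '@nestjs/core';
import { MetricsModule } from './metrics/metrics.module';
import { MetricsInterceptor } from './metrics/metrics.interceptor';

@Module({
  imports: [MetricsModule],
  providers: [
    // Global registration: NestJS instantiates the interceptor with DI,
    // so MetricsService is injected automatically
    { provide: APP_INTERCEPTOR, useClass: MetricsInterceptor },
  ],
})
export class AppModule {}
```

Registering through the provider token (rather than app.useGlobalInterceptors) keeps dependency injection working for the MetricsService constructor argument.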

2.3 Prometheus Scrape Configuration

# prometheus.yml (relevant scrape job)
scrape_configs:
  - job_name: 'warmwind-api'
    scrape_interval: 15s
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+);(.+)
        replacement: $2:$1
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
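For the keep/replace rules above to pick up a pod at all, its Deployment must carry the matching annotations. A sketch (the port is an assumption; use whatever port actually serves /metrics):

```yaml
# deployment.yaml (snippet) -- pod annotations the relabel rules key on
spec:
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/path: "/metrics"
        prometheus.io/port: "3000"
```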
graph LR
    POD1["NestJS Pod 1<br/>/metrics"] --> PROM["Prometheus"]
    POD2["NestJS Pod 2<br/>/metrics"] --> PROM
    POD3["NestJS Pod 3<br/>/metrics"] --> PROM
    WORKER["BullMQ Worker<br/>/metrics"] --> PROM
    PROM -->|"PromQL queries"| GRAFANA["Grafana"]
    PROM -->|"custom.metrics.k8s.io"| ADAPTER["Prometheus Adapter<br/>(for HPA)"]

3. Loki for Log Aggregation

3.1 Structured Logging with Pino

Loki works best with structured JSON logs. Pino is one of the fastest Node.js JSON loggers and integrates cleanly with NestJS via nestjs-pino.

// logger.config.ts
import { LoggerModule } from 'nestjs-pino';
import { randomUUID } from 'crypto';

export const PinoLoggerModule = LoggerModule.forRoot({
  pinoHttp: {
    level: process.env.LOG_LEVEL || 'info',
    transport:
      process.env.NODE_ENV === 'development'
        ? { target: 'pino-pretty', options: { colorize: true } }
        : undefined, // JSON in production

    // Generate a request ID for trace correlation
    genReqId: (req) => req.headers['x-request-id'] || randomUUID(),

    // Redact sensitive fields
    redact: {
      paths: ['req.headers.authorization', 'req.headers.cookie'],
      censor: '[REDACTED]',
    },

    // Custom serializers for Warmwind-specific context
    serializers: {
      req: (req) => ({
        method: req.method,
        url: req.url,
        query: req.query,
        tenantId: req.raw?.tenantId,
      }),
      res: (res) => ({
        statusCode: res.statusCode,
      }),
    },
  },
});

Sample JSON log output:

{
  "level": 30,
  "time": 1710700000000,
  "pid": 1,
  "hostname": "warmwind-api-7b9f4c6d8-x2k4p",
  "reqId": "550e8400-e29b-41d4-a716-446655440000",
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "tenantId": "tenant-acme",
  "sessionId": "agent-abc123",
  "context": "AgentOrchestrator",
  "msg": "Agent pod created successfully",
  "podName": "agent-abc123",
  "resourceTier": "medium",
  "creationTimeMs": 1823
}

3.2 Loki Label Strategy

Labels are Loki's primary index. Keep cardinality low (fewer than 100 unique values per label). High-cardinality data goes into structured metadata or the log line itself.

| Label | Cardinality | Indexed? | Example Values |
|---|---|---|---|
| app | ~5 | Yes (label) | api, worker, agent |
| namespace | ~20 | Yes (label) | warmwind, tenant-acme |
| level | 5 | Yes (label) | debug, info, warn, error, fatal |
| pod | ~200 | Structured metadata | warmwind-api-7b9f4c6d8-x2k4p |
| traceId | unbounded | Structured metadata | 4bf92f3577b34da6a3ce929d0e0e4736 |
| sessionId | ~1000 | Structured metadata | agent-abc123 |
| tenantId | ~50 | Structured metadata | tenant-acme |

Labels are like indexes in PostgreSQL

If you think of Loki as a database, labels are the indexed columns and log lines are the unindexed payload. Just as you would not create a PostgreSQL index on a UUID column with millions of unique values, you should not use high-cardinality values as Loki labels. Use structured metadata for fields like traceId and sessionId -- they are queryable but not indexed, which keeps Loki's memory and storage costs manageable.
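The cost of a bad label choice compounds multiplicatively: every unique combination of label values becomes its own stream. A hypothetical helper makes the arithmetic visible (cardinalities taken from the table above):

```typescript
// Every unique label combination is a separate Loki stream, so the worst-case
// stream count is the product of per-label cardinalities.
function estimateStreamCount(labelCardinalities: Record<string, number>): number {
  return Object.values(labelCardinalities).reduce((acc, n) => acc * n, 1);
}

// Only the indexed labels from the table: a few hundred streams.
const withLowCardinality = estimateStreamCount({ app: 5, namespace: 20, level: 5 });

// Promoting sessionId to a label multiplies that by ~1000.
const withSessionLabel = estimateStreamCount({
  app: 5,
  namespace: 20,
  level: 5,
  sessionId: 1000,
});

console.log(withLowCardinality, withSessionLabel); // 500 500000
```

Three orders of magnitude more streams means three orders of magnitude more index entries and open chunks, which is exactly the memory blow-up the structured-metadata escape hatch avoids.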

3.3 LogQL Queries

LogQL is Loki's query language. It starts with a log stream selector (labels) and chains pipeline stages for filtering, parsing, and aggregation.

# All error logs from the API in the last hour
{app="api", level="error"}

# Errors mentioning OOM, parsed as JSON, filtered by tenant
{app="agent", level="error"}
  | json
  | msg =~ "OOM|out of memory"
  | tenantId = "tenant-acme"

# Count errors per minute, grouped by context (module)
sum by (context) (
  rate({app="api", level="error"} | json [5m])
)

# P99 agent session creation time from structured logs
quantile_over_time(0.99,
  {app="api"} | json | context = "AgentOrchestrator" | msg = "Agent pod created successfully"
  | unwrap creationTimeMs [10m]
) / 1000   # convert ms to seconds

# Find slow database queries (> 1 second)
{app="api"}
  | json
  | context = "TypeOrmLogger"
  | unwrap durationMs
  | durationMs > 1000

# Correlated logs: find all logs for a specific trace
{app=~"api|worker|agent"} | json | traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
graph LR
    QUERY["{app='api', level='error'}"] --> SELECTOR["Stream Selector<br/>(label matchers)"]
    SELECTOR --> FILTER["| json<br/>| msg =~ 'OOM'"]
    FILTER --> AGG["rate(...[5m])<br/>sum by (context)"]
    AGG --> RESULT["Time Series<br/>(for Grafana panel)"]

3.4 Promtail / Grafana Agent Configuration

Logs flow from container stdout to Loki via Promtail (or Grafana Alloy, the successor):

# promtail-config.yaml (deployed as DaemonSet)
server:
  http_listen_port: 3101

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki-gateway.monitoring:3100/loki/api/v1/push

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the annotation
      - source_labels: [__meta_kubernetes_pod_annotation_loki_io_scrape]
        action: keep
        regex: true
      # Set the app label from the pod label
      - source_labels: [__meta_kubernetes_pod_label_app]
        target_label: app
      # Set the namespace label
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
    pipeline_stages:
      # Parse JSON and extract level for indexing
      - json:
          expressions:
            level: level
            traceId: traceId
            sessionId: sessionId
      - labels:
          level:
      # High-cardinality fields go to structured metadata
      - structured_metadata:
          traceId:
          sessionId:
      # Drop debug logs in production to save storage
      - match:
          selector: '{level="debug"}'
          action: drop
          drop_counter_reason: debug_logs_dropped

4. Grafana Dashboards

4.1 What to Monitor for Warmwind

| Panel | Metric / Query | Alert Threshold |
|---|---|---|
| Active Agent Sessions | agent_sessions_active | > 500 (capacity warning) |
| VNC Frame Rate | avg(vnc_frame_rate_fps) | < 5 fps (user experience degradation) |
| API Error Rate | rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) | > 1% |
| API P99 Latency | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | > 2s |
| PostgreSQL Connection Pool | pg_pool_active_connections / (pg_pool_active_connections + pg_pool_idle_connections) | > 80% utilization |
| BullMQ Queue Depth | bullmq_waiting_jobs{queue="agent-sessions"} | > 50 (queue backing up) |
| Memory per Pod | container_memory_working_set_bytes{namespace="warmwind"} | > 80% of limit |
| WebSocket Connections | ws_active_connections | < 10 (may indicate connectivity issue) |
| Dropped VNC Frames | rate(vnc_dropped_frames_total[5m]) | > 10/sec per session |
| BullMQ Job Failure Rate | rate(bullmq_failed_total[5m]) / rate(bullmq_completed_total[5m]) | > 5% |
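Several rows above rely on histogram_quantile(), which estimates a percentile from cumulative bucket counts by linear interpolation inside the bucket where the target rank falls. A simplified sketch of that computation (it ignores the rate() step and some Prometheus edge cases, but shows why percentile accuracy depends on your bucket boundaries):

```typescript
// buckets: cumulative observation counts at each upper bound (le), sorted
// ascending, ending with a +Inf bucket -- the shape Prometheus stores.
interface Bucket { le: number; count: number; }

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets[buckets.length - 1].count;
  const rank = q * total; // the observation rank we want to locate
  let prevLe = 0;
  let prevCount = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      if (b.le === Infinity) return prevLe; // beyond the last finite bound
      const inBucket = b.count - prevCount;
      if (inBucket === 0) return b.le;
      // Linear interpolation between the bucket's lower and upper bound
      return prevLe + ((rank - prevCount) / inBucket) * (b.le - prevLe);
    }
    prevLe = b.le;
    prevCount = b.count;
  }
  return NaN;
}

// 100 requests: 50 under 100ms, 40 more under 500ms, 10 slower
const buckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.5, buckets)); // 0.1
console.log(histogramQuantile(0.9, buckets)); // 0.5
```

Because the estimate interpolates within a bucket, a P99 target of 2s is only meaningful if there are bucket boundaries near 2s -- which is why the histogram definitions in section 2.1 place buckets around the alert thresholds.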

4.2 Dashboard Provisioning Configuration

Grafana dashboards can be provisioned as code via Kubernetes ConfigMaps. The ConfigMap below registers a file-based dashboard provider; dashboard JSON files are then mounted into the path it watches:

# grafana-dashboard-provider.yaml (ConfigMap mounted into Grafana)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboard-provider
  namespace: monitoring
data:
  warmwind.yaml: |
    apiVersion: 1
    providers:
      - name: warmwind
        orgId: 1
        folder: Warmwind
        type: file
        disableDeletion: false
        editable: true
        updateIntervalSeconds: 30
        options:
          path: /var/lib/grafana/dashboards/warmwind
          foldersFromFilesStructure: false

4.3 Dashboard JSON Snippet

A Grafana dashboard covering Warmwind's core metrics (panel list abridged):

{
  "uid": "warmwind-overview",
  "title": "Warmwind -- Service Overview",
  "tags": ["warmwind", "production"],
  "timezone": "utc",
  "refresh": "30s",
  "time": { "from": "now-1h", "to": "now" },
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": { "type": "prometheus" },
        "query": "label_values(http_requests_total, namespace)",
        "current": { "text": "warmwind", "value": "warmwind" },
        "refresh": 2
      },
      {
        "name": "tenant",
        "type": "query",
        "datasource": { "type": "prometheus" },
        "query": "label_values(agent_sessions_active, tenant)",
        "includeAll": true,
        "refresh": 2
      }
    ]
  },
  "panels": [
    {
      "title": "Active Agent Sessions",
      "type": "stat",
      "gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
      "targets": [
        {
          "expr": "sum(agent_sessions_active{namespace=\"$namespace\", tenant=~\"$tenant\"})",
          "legendFormat": "Active Sessions"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 300 },
              { "color": "red", "value": 500 }
            ]
          }
        }
      }
    },
    {
      "title": "API Error Rate (%)",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
      "targets": [
        {
          "expr": "100 * sum(rate(http_requests_total{namespace=\"$namespace\", status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[5m]))",
          "legendFormat": "5xx Error Rate %"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": { "fillOpacity": 15, "lineWidth": 2 },
          "unit": "percent",
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 1 }
            ]
          }
        }
      }
    },
    {
      "title": "API Latency (P50 / P95 / P99)",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
      "targets": [
        {
          "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "P50"
        },
        {
          "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "P95"
        },
        {
          "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
          "legendFormat": "P99"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "s",
          "custom": { "lineWidth": 2 }
        }
      }
    },
    {
      "title": "BullMQ Queue Depth",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
      "targets": [
        {
          "expr": "bullmq_waiting_jobs{namespace=\"$namespace\", queue=\"agent-sessions\"}",
          "legendFormat": "Waiting ({{pod}})"
        },
        {
          "expr": "bullmq_active_jobs{namespace=\"$namespace\", queue=\"agent-sessions\"}",
          "legendFormat": "Active ({{pod}})"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "custom": { "fillOpacity": 20 }
        }
      }
    },
    {
      "title": "PostgreSQL Connection Pool",
      "type": "gauge",
      "gridPos": { "h": 8, "w": 6, "x": 12, "y": 12 },
      "targets": [
        {
          "expr": "100 * sum(pg_pool_active_connections{namespace=\"$namespace\"}) / (sum(pg_pool_active_connections{namespace=\"$namespace\"}) + sum(pg_pool_idle_connections{namespace=\"$namespace\"}))",
          "legendFormat": "Pool Utilization %"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percent",
          "min": 0,
          "max": 100,
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 60 },
              { "color": "red", "value": 80 }
            ]
          }
        }
      }
    },
    {
      "title": "Memory per Pod",
      "type": "timeseries",
      "gridPos": { "h": 8, "w": 6, "x": 18, "y": 12 },
      "targets": [
        {
          "expr": "container_memory_working_set_bytes{namespace=\"$namespace\", container!=\"\", container!=\"POD\"}",
          "legendFormat": "{{pod}} / {{container}}"
        }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "bytes",
          "custom": { "fillOpacity": 10 }
        }
      }
    }
  ]
}

5. Alerting

5.1 Grafana Alerting Rules

Grafana Unified Alerting evaluates PromQL and LogQL expressions on a schedule and fires alerts to contact points (PagerDuty, Slack, email).

# alerting-rules.yaml (provisioned via ConfigMap)
apiVersion: 1
groups:
  - orgId: 1
    name: warmwind-critical
    folder: Warmwind Alerts
    interval: 1m
    rules:
      - uid: high-error-rate
        title: "API Error Rate > 1%"
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: >
                100 * sum(rate(http_requests_total{namespace="warmwind", status_code=~"5.."}[5m]))
                / sum(rate(http_requests_total{namespace="warmwind"}[5m]))
              instant: true
          - refId: B
            datasourceUid: "__expr__"
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [1]  # > 1%
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "API 5xx error rate is {{ $values.B }}%"
          runbook_url: "https://wiki.warmwind.dev/runbooks/api-error-rate"

      - uid: queue-buildup
        title: "BullMQ Queue Depth > 50"
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              expr: "max(bullmq_waiting_jobs{queue=\"agent-sessions\"})"
              instant: true
          - refId: B
            datasourceUid: "__expr__"
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [50]
        for: 3m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "Agent session queue has {{ $values.B }} waiting jobs"

      - uid: agent-pod-oom
        title: "Agent Pod OOM Kill Detected"
        condition: C
        data:
          - refId: A
            datasourceUid: prometheus
            model:
              # kube_pod_container_status_restarts_total carries no "reason"
              # label; join with ..._last_terminated_reason to isolate OOM kills
              expr: >
                increase(kube_pod_container_status_restarts_total{namespace=~"tenant-.*"}[5m]) > 0
                and on(namespace, pod, container)
                kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
              instant: true
          - refId: B
            datasourceUid: "__expr__"
            model:
              type: reduce
              expression: A
              reducer: last
          - refId: C
            datasourceUid: "__expr__"
            model:
              type: threshold
              expression: B
              conditions:
                - evaluator:
                    type: gt
                    params: [0]
        for: 0s
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "Agent pod OOM killed in namespace {{ $labels.namespace }}"

5.2 Contact Points

# contact-points.yaml
apiVersion: 1
contactPoints:
  - orgId: 1
    name: backend-oncall
    receivers:
      - uid: pagerduty-backend
        type: pagerduty
        settings:
          integrationKey: "${PD_INTEGRATION_KEY}"
          severity: "{{ .CommonLabels.severity }}"
          summary: "{{ .CommonAnnotations.summary }}"
      - uid: slack-backend
        type: slack
        settings:
          url: "${SLACK_WEBHOOK_URL}"
          recipient: "#backend-alerts"
          title: "{{ .CommonLabels.alertname }}"
          text: "{{ .CommonAnnotations.summary }}"
graph LR
    PROM["Prometheus"] -->|"PromQL"| EVAL["Grafana Alert<br/>Evaluator"]
    LOKI["Loki"] -->|"LogQL"| EVAL
    EVAL -->|"firing"| ROUTE["Notification<br/>Policy"]
    ROUTE -->|"severity=critical"| PD["PagerDuty<br/>(pages on-call)"]
    ROUTE -->|"severity=warning"| SLACK["Slack<br/>#backend-alerts"]
    ROUTE -->|"all"| EMAIL["Email Digest<br/>(hourly)"]

6. Distributed Tracing with OpenTelemetry

6.1 SDK Setup

OpenTelemetry must be initialized before NestJS imports any modules, so it can monkey-patch the http, Express, pg, and ioredis libraries before application code loads them. (BullMQ has no bundled auto-instrumentation; section 6.3 propagates trace context across the queue manually.)

// tracing.ts -- imported FIRST in main.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'warmwind-api',
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || '0.0.0',
    'deployment.environment': process.env.NODE_ENV || 'development',
    'k8s.namespace.name': process.env.K8S_NAMESPACE || 'warmwind',
    'k8s.pod.name': process.env.K8S_POD_NAME || 'local',
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://tempo:4317',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      // Tempo ingests traces only -- OTLP metrics need a separate target such
      // as an OpenTelemetry Collector (or drop this reader entirely, since
      // Prometheus already scrapes /metrics)
      url: process.env.OTEL_EXPORTER_OTLP_METRICS_ENDPOINT || 'http://otel-collector:4317',
    }),
    exportIntervalMillis: 15000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Fine-tune which instrumentations are active
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true },
      '@opentelemetry/instrumentation-ioredis': { enabled: true },
      '@opentelemetry/instrumentation-graphql': { enabled: true },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', async () => {
  await sdk.shutdown();
});
// main.ts
import './tracing'; // MUST be first import
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);
}
bootstrap();

6.2 Manual Span Creation

Auto-instrumentation covers HTTP, database, and Redis operations. For custom business logic (agent session creation, VNC relay), add manual spans:

import { trace, SpanStatusCode, context } from '@opentelemetry/api';

const tracer = trace.getTracer('warmwind-api', '1.0.0');

@Injectable()
export class AgentOrchestrator {
  async createAgentSession(
    sessionId: string,
    tenantId: string,
    tier: 'small' | 'medium' | 'large',
  ): Promise<void> {
    return tracer.startActiveSpan(
      'agent.session.create',
      {
        attributes: {
          'agent.session.id': sessionId,
          'agent.tenant.id': tenantId,
          'agent.resource.tier': tier,
        },
      },
      async (span) => {
        try {
          // Child span: create PVC (end the span even if the call throws --
          // startActiveSpan does not end spans automatically)
          await tracer.startActiveSpan('k8s.pvc.create', async (pvcSpan) => {
            try {
              await this.createPvc(sessionId);
            } finally {
              pvcSpan.end();
            }
          });

          // Child span: create Pod
          await tracer.startActiveSpan('k8s.pod.create', async (podSpan) => {
            try {
              const pod = await this.createPod(sessionId, tenantId, tier);
              podSpan.setAttribute('k8s.pod.name', pod.metadata!.name!);
            } finally {
              podSpan.end();
            }
          });

          // Child span: wait for Pod readiness
          await tracer.startActiveSpan('k8s.pod.wait_ready', async (waitSpan) => {
            try {
              await this.waitForPodReady(sessionId, 60_000);
            } finally {
              waitSpan.end();
            }
          });

          span.setStatus({ code: SpanStatusCode.OK });
        } catch (err) {
          const error = err as Error;
          span.setStatus({
            code: SpanStatusCode.ERROR,
            message: error.message,
          });
          span.recordException(error);
          throw err;
        } finally {
          span.end();
        }
      },
    );
  }
}

6.3 Trace Propagation Across BullMQ

BullMQ jobs run in a separate worker process. To maintain trace context across the queue boundary, inject the trace context into the job data and extract it in the processor:

import { context, propagation, trace, SpanStatusCode } from '@opentelemetry/api';
import { Injectable } from '@nestjs/common';
import { InjectQueue, Processor, WorkerHost } from '@nestjs/bullmq';
import { Queue, Job } from 'bullmq';

// Producer: inject trace context into job data
@Injectable()
export class AgentJobProducer {
  constructor(@InjectQueue('agent-sessions') private queue: Queue) {}

  async enqueueSessionCreation(sessionId: string, tenantId: string): Promise<void> {
    // Capture current trace context
    const traceContext: Record<string, string> = {};
    propagation.inject(context.active(), traceContext);

    await this.queue.add('create-session', {
      sessionId,
      tenantId,
      _traceContext: traceContext, // propagated through the queue
    });
  }
}

// Consumer: extract trace context and create a linked span
@Processor('agent-sessions')
export class AgentJobProcessor extends WorkerHost {
  private readonly tracer = trace.getTracer('warmwind-worker');

  // AgentOrchestrator is the service from section 6.2
  constructor(private readonly orchestrator: AgentOrchestrator) {
    super();
  }

  async process(job: Job): Promise<void> {
    // Extract the trace context from the job data
    const parentContext = propagation.extract(
      context.active(),
      job.data._traceContext,
    );

    // Create a new span linked to the original trace
    return context.with(parentContext, () => {
      return this.tracer.startActiveSpan(
        'bullmq.process.create-session',
        {
          attributes: {
            'bullmq.job.id': job.id!,
            'bullmq.job.name': job.name,
            'agent.session.id': job.data.sessionId,
          },
        },
        async (span) => {
          try {
            await this.orchestrator.createAgentSession(
              job.data.sessionId,
              job.data.tenantId,
              'medium',
            );
            span.setStatus({ code: SpanStatusCode.OK });
          } catch (err) {
            const error = err as Error;
            span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
            span.recordException(error);
            throw err;
          } finally {
            span.end();
          }
        },
      );
    });
  }
}
graph LR
    HTTP["HTTP Request<br/>traceId: abc..."] -->|"traceparent header"| API["NestJS API<br/>span: http.request"]
    API -->|"inject context<br/>into job.data"| BULL["BullMQ Queue<br/>(Redis)"]
    BULL -->|"extract context<br/>from job.data"| WORKER["BullMQ Worker<br/>span: bullmq.process"]
    WORKER -->|"traceparent header"| K8S["K8s API<br/>span: k8s.pod.create"]
    API --> TEMPO["Tempo"]
    WORKER --> TEMPO

6.4 Correlating Logs with Traces

Inject the trace ID into every log line so you can jump from a Grafana trace view to the corresponding logs in Loki:

// pino-otel-mixin.ts
import { trace, context } from '@opentelemetry/api';

export function otelMixin(): Record<string, string> {
  const span = trace.getSpan(context.active());
  if (!span) return {};

  const spanContext = span.spanContext();
  return {
    traceId: spanContext.traceId,
    spanId: spanContext.spanId,
    traceFlags: spanContext.traceFlags.toString(16),
  };
}

// Usage in logger config:
// pinoHttp: { mixin: otelMixin, ... }

In Grafana, configure a derived field on the Loki data source that extracts trace IDs and links them to Tempo:

Field name: traceId
Regex: "traceId":"([a-f0-9]+)"
Internal link: Tempo data source
Query: ${__value.raw}

This enables one-click navigation: click a trace ID in a Loki log line and jump directly to the trace waterfall in Tempo.
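The derived-field regex only works if it matches exactly what the logger emits. A quick sanity check in plain TypeScript (the log line below is a fabricated example of pino-style output, not real Warmwind logs) confirms the pattern from the Loki config extracts the trace ID:

```typescript
// Sanity-check that the derived-field regex matches pino-style JSON output.
// Same pattern as the Loki derived-field config: "traceId":"([a-f0-9]+)"
const derivedFieldRegex = /"traceId":"([a-f0-9]+)"/;

export function extractTraceId(line: string): string | null {
  const match = derivedFieldRegex.exec(line);
  return match ? match[1] : null;
}

// Fabricated sample line in the shape the otelMixin produces
const logLine =
  '{"level":30,"msg":"pod ready","traceId":"4bf92f3577b34da6a3ce929d0e0e4736","spanId":"00f067aa0ba902b7"}';

console.log(extractTraceId(logLine)); // → "4bf92f3577b34da6a3ce929d0e0e4736"
```

If the logger field name or casing ever changes (e.g. trace_id), the derived field silently stops matching, so a test like this is cheap insurance.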


7. SLOs and Error Budgets

7.1 Defining Agent Session Reliability

An SLO quantifies reliability as a measurable target over a rolling window.

| SLO | Target | Measurement | Error Budget (30d) |
| --- | --- | --- | --- |
| Session creation success rate | 99.5% | agent_sessions_created_total (outcome="failed" vs. total) | 0.5% = ~36 failures per 7,200 sessions |
| Session availability (uptime) | 99.9% | Seconds agent pod was Running / total session duration | 0.1% = ~43 seconds of downtime per 12h session |
| API latency | P99 < 500ms | http_request_duration_seconds | 0.5% of requests may exceed 500ms |
| VNC frame delivery | 99% | Frames delivered / frames produced | 1% frame drop acceptable |
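The error-budget figures above are simple arithmetic: the budget is the complement of the target applied to the event volume in the window. A small helper makes the math explicit (the session counts mirror the table rows; they are illustrative, not production numbers):

```typescript
// Error budget in absolute events = (1 - SLO target) × total events in window
export function errorBudgetEvents(sloTarget: number, totalEvents: number): number {
  return (1 - sloTarget) * totalEvents;
}

// 99.5% target over 7,200 session creations → ~36 allowed failures
const sessionBudget = errorBudgetEvents(0.995, 7_200);

// 99.9% uptime over a 12h session (43,200s) → ~43s of allowed downtime
const downtimeBudgetSeconds = errorBudgetEvents(0.999, 43_200);
```

Keeping the budget in absolute events (failures, seconds) rather than percentages makes incident reviews concrete: "we spent 12 of our 36 failures this week" lands harder than "we are at 99.62%".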

7.2 SLO Recording Rules

Prometheus recording rules pre-compute SLO metrics for efficient dashboard queries:

# slo-recording-rules.yaml
groups:
  - name: warmwind-slos
    interval: 1m
    rules:
      # Session creation success rate (30-day rolling)
      - record: slo:agent_session_creation:success_rate_30d
        expr: >
          1 - (
            sum(increase(agent_sessions_created_total{outcome="failed"}[30d]))
            /
            sum(increase(agent_sessions_created_total[30d]))
          )

      # Remaining error budget (1.0 = full budget, 0.0 = exhausted)
      - record: slo:agent_session_creation:error_budget_remaining
        expr: >
          1 - (
            (1 - slo:agent_session_creation:success_rate_30d)
            /
            (1 - 0.995)
          )

      # API latency SLO: fraction of requests under 500ms
      - record: slo:api_latency:good_ratio_5m
        expr: >
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))

      # Burn rate: how fast are we consuming error budget?
      # > 1.0 means consuming faster than sustainable
      - record: slo:agent_session_creation:burn_rate_1h
        expr: >
          (
            sum(increase(agent_sessions_created_total{outcome="failed"}[1h]))
            /
            sum(increase(agent_sessions_created_total[1h]))
          )
          /
          (1 - 0.995)

      # 6h burn rate: the long window used by multi-window alerts to
      # confirm the burn is sustained, not a transient spike
      - record: slo:agent_session_creation:burn_rate_6h
        expr: >
          (
            sum(increase(agent_sessions_created_total{outcome="failed"}[6h]))
            /
            sum(increase(agent_sessions_created_total[6h]))
          )
          /
          (1 - 0.995)

7.3 Error Budget Alerting

Alert when the burn rate indicates the error budget will be exhausted before the window ends:

# Multi-window, multi-burn-rate alert (Google SRE book pattern)
- alert: AgentSessionErrorBudgetBurn
  expr: >
    slo:agent_session_creation:burn_rate_1h > 14.4
    and
    slo:agent_session_creation:burn_rate_6h > 6
  for: 5m
  labels:
    severity: critical
    team: backend
  annotations:
    summary: "Agent session creation error budget burning at {{ $value }}x sustainable rate"
    description: "At this rate the 30-day error budget will be exhausted in roughly 2 days (30d / 14.4)"
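Why 14.4? A sustained burn rate of 1.0 exhausts the budget exactly at the end of the window, so time-to-exhaustion is simply window / burnRate. The arithmetic behind the threshold (a hypothetical helper, not part of the alerting stack):

```typescript
// Days until the error budget is gone, assuming the burn rate holds steady.
// A burn rate of 1.0 spends the budget exactly over the full window.
export function daysToExhaustion(windowDays: number, burnRate: number): number {
  return windowDays / burnRate;
}

// At the critical threshold of 14.4x, a 30-day budget lasts ~2.08 days --
// short enough to justify paging, long enough to respond before exhaustion.
const daysLeft = daysToExhaustion(30, 14.4);
```

Equivalently, 14.4x means consuming 2% of the monthly budget per hour (14.4 × 1h / 720h), which is the framing the Google SRE workbook uses to derive the same threshold.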
graph LR
    GOOD["Good Events<br/>(successful sessions)"] --> RATIO["Success Ratio<br/>= good / total"]
    BAD["Bad Events<br/>(failed sessions)"] --> RATIO
    RATIO -->|"compare to target"| SLO["SLO: 99.5%"]
    SLO --> BUDGET["Error Budget<br/>= 0.5% of total"]
    BUDGET -->|"burn rate > 14x"| ALERT["Page On-Call"]
    BUDGET -->|"burn rate > 6x"| WARN["Slack Warning"]

SLOs shift the conversation from uptime to customer impact

Instead of debating whether a 2-minute outage at 3 AM matters, SLOs frame reliability in terms of user-facing impact. If your error budget is healthy, you can deploy aggressively. If it is nearly exhausted, you freeze features and focus on reliability. This is the same risk-management framework that Warmwind likely uses to balance feature velocity with platform stability.
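One concrete way to act on the budget is a deploy gate: query the error_budget_remaining recording rule before a release and freeze feature deploys when the budget runs low. A sketch, assuming Node 18+ (built-in fetch); the 0.2 threshold and the Prometheus URL are illustrative assumptions, not Warmwind policy:

```typescript
// Illustrative deploy gate built on the SLO recording rule.
// The 20% threshold is an assumption -- tune it to your risk appetite.
export function deployDecision(budgetRemaining: number): 'deploy' | 'freeze' {
  return budgetRemaining >= 0.2 ? 'deploy' : 'freeze';
}

export async function checkErrorBudget(promUrl: string): Promise<'deploy' | 'freeze'> {
  const query = 'slo:agent_session_creation:error_budget_remaining';
  const res = await fetch(`${promUrl}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json();
  // Prometheus instant-query response shape: data.result[0].value = [ts, "value"]
  const remaining = parseFloat(body.data.result[0].value[1]);
  return deployDecision(remaining);
}
```

Wired into CI, this turns the error budget from a dashboard number into an enforced policy: a green budget means ship, an exhausted one means the pipeline itself says freeze.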


8. Production Observability Stack Deployment

8.1 Full Stack Architecture

graph LR
    subgraph "Application Tier"
        API["NestJS API Pods"]
        WORKER["BullMQ Worker Pods"]
        AGENT["Agent Pods"]
    end
    subgraph "Collection Tier"
        PROM["Prometheus"]
        PROMTAIL["Promtail<br/>(DaemonSet)"]
        COLLECTOR["OTel Collector"]
    end
    subgraph "Storage Tier"
        TSDB["Prometheus TSDB<br/>(metrics)"]
        LOKIST["Loki<br/>(logs + S3)"]
        TEMPOST["Tempo<br/>(traces + S3)"]
    end
    subgraph "Query Tier"
        GRAFANA["Grafana"]
    end
    API -->|"/metrics"| PROM
    WORKER -->|"/metrics"| PROM
    API -->|"stdout JSON"| PROMTAIL
    WORKER -->|"stdout JSON"| PROMTAIL
    AGENT -->|"stdout"| PROMTAIL
    API -->|"OTLP gRPC"| COLLECTOR
    WORKER -->|"OTLP gRPC"| COLLECTOR
    PROM --> TSDB
    PROMTAIL --> LOKIST
    COLLECTOR --> TEMPOST
    TSDB --> GRAFANA
    LOKIST --> GRAFANA
    TEMPOST --> GRAFANA

8.2 Grafana Data Source Provisioning

# grafana-datasources.yaml (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        url: http://prometheus-server.monitoring:9090
        access: proxy
        isDefault: true
        jsonData:
          timeInterval: "15s"
          exemplarTraceIdDestinations:
            - name: traceId
              datasourceUid: tempo

      - name: Loki
        type: loki
        uid: loki
        url: http://loki-gateway.monitoring:3100
        access: proxy
        jsonData:
          derivedFields:
            - name: TraceID
              matcherRegex: '"traceId":"([a-f0-9]+)"'
              url: "$${__value.raw}"
              datasourceUid: tempo
              urlDisplayLabel: "View Trace"

      - name: Tempo
        type: tempo
        uid: tempo
        url: http://tempo.monitoring:3200
        access: proxy
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki
            filterByTraceID: true
            filterBySpanID: false
            tags:
              - key: "service.name"
                value: "app"
          tracesToMetrics:
            datasourceUid: prometheus
            tags:
              - key: "service.name"
                value: "job"
          nodeGraph:
            enabled: true
          serviceMap:
            datasourceUid: prometheus

9. Key Takeaways for the Interview

| Topic | What to Demonstrate |
| --- | --- |
| Three pillars | Metrics (Prometheus), logs (Loki), traces (Tempo) -- what each answers |
| NestJS metrics | prom-client setup, custom counters/histograms/gauges, interceptor pattern |
| Structured logging | Pino JSON, label cardinality, LogQL queries, structured metadata |
| Grafana dashboards | Key panels for Warmwind, provisioning as code, template variables |
| Alerting | Multi-severity rules, PagerDuty/Slack routing, runbook links |
| OpenTelemetry | SDK init before imports, auto-instrumentation, manual spans |
| Trace propagation | W3C traceparent header, BullMQ job data injection, log correlation |
| SLOs | Error budget math, burn rate alerts, recording rules |
| Dashboard JSON | Can provision dashboards as code via ConfigMaps |

References