Observability with Grafana & Loki¶
Warmwind runs hundreds of ephemeral AI agent Pods, streams VNC frames over WebSockets, and processes job queues through BullMQ -- all of which can fail silently without proper observability. This article covers the three pillars (metrics with Prometheus, logs with Loki, traces with OpenTelemetry/Tempo), wiring each into a NestJS backend with real code. It then builds Grafana dashboards for the metrics that matter (active agents, frame rate, API error rate, connection pool health, queue depth, memory per pod), configures alerting rules with PagerDuty/Slack integration, defines SLOs and error budgets for agent session reliability, and includes a full Grafana provisioning configuration.
Glossary
Prometheus -- A pull-based time-series database. It scrapes HTTP /metrics endpoints at configured intervals and stores samples as metric_name{labels} value timestamp.
Loki -- Grafana's log aggregation system. Unlike Elasticsearch, Loki indexes only labels (not full text), making it cheaper to operate. Logs are queried with LogQL.
Tempo -- Grafana's distributed tracing backend. Stores trace spans and correlates them with logs and metrics via trace IDs.
OpenTelemetry (OTel) -- A vendor-neutral observability framework providing APIs and SDKs for metrics, logs, and traces. The standard instrumentation layer for modern backends.
LogQL -- Loki's query language. Combines label matchers ({app="api"}) with pipeline stages (| json | status >= 500).
Counter -- A Prometheus metric type that only goes up (e.g., total HTTP requests). Use rate() to compute per-second values.
Histogram -- A Prometheus metric that samples observations into configurable buckets (e.g., request duration). Enables percentile computation via histogram_quantile().
Gauge -- A Prometheus metric that can go up or down (e.g., active connections, queue depth, memory usage).
Span -- A single unit of work in a distributed trace. Spans have a start time, duration, parent span ID, and key-value attributes.
Trace context -- Metadata (trace ID, span ID, trace flags) propagated across service boundaries via HTTP headers (traceparent) or message metadata.
SLO -- Service Level Objective; a target for a measurable reliability metric (e.g., "99.5% of agent sessions complete without error").
Error budget -- The acceptable amount of unreliability within an SLO period. If your SLO is 99.5%, your error budget is 0.5% of total sessions.
Structured logging -- Emitting log lines as JSON objects with typed fields (level, message, traceId, sessionId) instead of unstructured text. Enables machine parsing and label extraction.
1. The Three Pillars¶
graph LR
APP["NestJS Application"] -->|"/metrics endpoint<br/>prom-client"| PROM["Prometheus<br/>(Metrics)"]
APP -->|"JSON stdout<br/>pino logger"| LOKI["Loki<br/>(Logs)"]
APP -->|"OTLP gRPC<br/>@opentelemetry/sdk"| TEMPO["Tempo<br/>(Traces)"]
PROM --> GRAFANA["Grafana<br/>(Dashboards + Alerts)"]
LOKI --> GRAFANA
TEMPO --> GRAFANA
GRAFANA -->|"alerting rules"| PD["PagerDuty"]
GRAFANA -->|"alerting rules"| SLACK["Slack"]
| Pillar | Tool | What It Answers | Warmwind Example |
|---|---|---|---|
| Metrics | Prometheus | "How much?" / "How fast?" | Active agents: 142, API P99: 45ms, BullMQ waiting: 7 |
| Logs | Loki | "What happened?" | agent-xyz failed: OOM killed at 4096MB RSS |
| Traces | Tempo | "Where did time go?" | HTTP request -> GraphQL resolver -> TypeORM query -> BullMQ job -> Agent Pod creation (2.3s total, 1.8s in K8s API) |
Metrics tell you THAT something is wrong; logs tell you WHAT; traces tell you WHERE
If you have used journalctl to debug systemd services and perf or strace to profile processes, think of Prometheus as the system-wide sar equivalent, Loki as a centralized journalctl with a query language, and Tempo as a distributed strace that follows a request across process boundaries.
2. Prometheus Metrics in NestJS¶
2.1 Setup with prom-client¶
The prom-client library is the standard Prometheus client for Node.js. It exposes a /metrics endpoint that Prometheus scrapes.
// metrics.module.ts
import { Module, Global } from '@nestjs/common';
import { MetricsController } from './metrics.controller';
import { MetricsService } from './metrics.service';
@Global()
@Module({
controllers: [MetricsController],
providers: [MetricsService],
exports: [MetricsService],
})
export class MetricsModule {}
// metrics.controller.ts
import { Controller, Get, Header } from '@nestjs/common';
import { register } from 'prom-client';
@Controller('metrics')
export class MetricsController {
@Get()
@Header('Content-Type', register.contentType)
async getMetrics(): Promise<string> {
return register.metrics();
}
}
// metrics.service.ts
import { Injectable, OnModuleInit } from '@nestjs/common';
import {
Counter,
Histogram,
Gauge,
collectDefaultMetrics,
register,
} from 'prom-client';
@Injectable()
export class MetricsService implements OnModuleInit {
// ── HTTP Metrics ───────────────────────────────────────────
readonly httpRequestsTotal: Counter;
readonly httpRequestDuration: Histogram;
// ── Agent Session Metrics ──────────────────────────────────
readonly activeAgentSessions: Gauge;
readonly agentSessionsCreatedTotal: Counter;
readonly agentSessionDuration: Histogram;
// ── VNC Streaming Metrics ──────────────────────────────────
readonly vncFrameRate: Gauge;
readonly vncDroppedFramesTotal: Counter;
readonly vncClientBufferBytes: Gauge;
// ── BullMQ Metrics ─────────────────────────────────────────
readonly bullmqWaitingJobs: Gauge;
readonly bullmqActiveJobs: Gauge;
readonly bullmqCompletedTotal: Counter;
readonly bullmqFailedTotal: Counter;
readonly bullmqJobDuration: Histogram;
// ── Database Metrics ───────────────────────────────────────
readonly pgPoolActiveConnections: Gauge;
readonly pgPoolIdleConnections: Gauge;
readonly pgQueryDuration: Histogram;
// ── WebSocket Metrics ──────────────────────────────────────
readonly wsActiveConnections: Gauge;
readonly wsMessagesTotal: Counter;
constructor() {
this.httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'route', 'status_code'],
});
this.httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});
this.activeAgentSessions = new Gauge({
name: 'agent_sessions_active',
help: 'Number of currently active agent sessions',
labelNames: ['tenant'],
});
this.agentSessionsCreatedTotal = new Counter({
name: 'agent_sessions_created_total',
help: 'Total agent sessions created',
// 'outcome' (e.g. succeeded/failed) is required by the SLO recording rules in section 7
labelNames: ['tenant', 'resource_tier', 'outcome'],
});
this.agentSessionDuration = new Histogram({
name: 'agent_session_duration_seconds',
help: 'Duration of agent sessions in seconds',
labelNames: ['tenant', 'outcome'],
buckets: [30, 60, 120, 300, 600, 1800, 3600],
});
this.vncFrameRate = new Gauge({
name: 'vnc_frame_rate_fps',
help: 'Current VNC frame rate per session',
labelNames: ['session_id'],
});
this.vncDroppedFramesTotal = new Counter({
name: 'vnc_dropped_frames_total',
help: 'VNC frames dropped due to client backpressure',
labelNames: ['session_id'],
});
this.vncClientBufferBytes = new Gauge({
name: 'vnc_client_buffer_bytes',
help: 'WebSocket send buffer size per client',
labelNames: ['session_id'],
});
this.bullmqWaitingJobs = new Gauge({
name: 'bullmq_waiting_jobs',
help: 'Number of jobs waiting in each queue',
labelNames: ['queue'],
});
this.bullmqActiveJobs = new Gauge({
name: 'bullmq_active_jobs',
help: 'Number of jobs currently being processed',
labelNames: ['queue'],
});
this.bullmqCompletedTotal = new Counter({
name: 'bullmq_completed_total',
help: 'Total completed BullMQ jobs',
labelNames: ['queue'],
});
this.bullmqFailedTotal = new Counter({
name: 'bullmq_failed_total',
help: 'Total failed BullMQ jobs',
labelNames: ['queue'],
});
this.bullmqJobDuration = new Histogram({
name: 'bullmq_job_duration_seconds',
help: 'BullMQ job processing duration',
labelNames: ['queue', 'job_name'],
buckets: [0.1, 0.5, 1, 5, 10, 30, 60, 120],
});
this.pgPoolActiveConnections = new Gauge({
name: 'pg_pool_active_connections',
help: 'Active PostgreSQL connections in the pool',
});
this.pgPoolIdleConnections = new Gauge({
name: 'pg_pool_idle_connections',
help: 'Idle PostgreSQL connections in the pool',
});
this.pgQueryDuration = new Histogram({
name: 'pg_query_duration_seconds',
help: 'PostgreSQL query duration in seconds',
labelNames: ['query_type'],
buckets: [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5],
});
this.wsActiveConnections = new Gauge({
name: 'ws_active_connections',
help: 'Active WebSocket connections',
labelNames: ['namespace'],
});
this.wsMessagesTotal = new Counter({
name: 'ws_messages_total',
help: 'Total WebSocket messages sent and received',
labelNames: ['namespace', 'direction'],
});
}
onModuleInit(): void {
// Collect default Node.js metrics (event loop lag, heap, GC, etc.)
collectDefaultMetrics({ register, prefix: 'warmwind_' });
}
}
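The gauges above need something to feed them. The sketch below mirrors BullMQ job counts into the queue gauges; `queue.getJobCounts` is the real BullMQ API, but the helper, interface, and polling wiring shown in the comment are illustrative:

```typescript
// Mirror BullMQ job counts into the Prometheus gauges defined above.
// QueueGauge is structurally compatible with prom-client's Gauge, so the
// helper can be exercised without a live Redis connection.
interface QueueGauge {
  set(labels: Record<string, string>, value: number): void;
}

export function recordQueueDepth(
  queueName: string,
  counts: { waiting: number; active: number },
  waitingGauge: QueueGauge,
  activeGauge: QueueGauge,
): void {
  waitingGauge.set({ queue: queueName }, counts.waiting);
  activeGauge.set({ queue: queueName }, counts.active);
}

// In a NestJS interval handler you would wire it roughly as:
//   const counts = await queue.getJobCounts('waiting', 'active');
//   recordQueueDepth('agent-sessions',
//     { waiting: counts.waiting, active: counts.active },
//     this.metrics.bullmqWaitingJobs, this.metrics.bullmqActiveJobs);
```

Polling (rather than incrementing on events) keeps the gauges correct even when jobs are added or drained by other processes.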
2.2 HTTP Interceptor for Request Metrics¶
// metrics.interceptor.ts
import {
Injectable,
NestInterceptor,
ExecutionContext,
CallHandler,
} from '@nestjs/common';
import { Observable } from 'rxjs';
import { tap } from 'rxjs/operators';
import { MetricsService } from './metrics.service';
@Injectable()
export class MetricsInterceptor implements NestInterceptor {
constructor(private readonly metrics: MetricsService) {}
intercept(context: ExecutionContext, next: CallHandler): Observable<any> {
const req = context.switchToHttp().getRequest();
const method = req.method;
// Prefer the route template (e.g. /sessions/:id) over the raw URL
// to keep the 'route' label's cardinality bounded
const route = req.route?.path || req.url;
const end = this.metrics.httpRequestDuration.startTimer({ method, route });
return next.handle().pipe(
tap({
next: () => {
const statusCode = context.switchToHttp().getResponse().statusCode;
end({ status_code: statusCode });
this.metrics.httpRequestsTotal.inc({
method,
route,
status_code: statusCode,
});
},
error: (err) => {
const statusCode = err.status || 500;
end({ status_code: statusCode });
this.metrics.httpRequestsTotal.inc({
method,
route,
status_code: statusCode,
});
},
}),
);
}
}
2.3 Prometheus Scrape Configuration¶
# prometheus.yml (relevant scrape job)
scrape_configs:
- job_name: 'warmwind-api'
scrape_interval: 15s
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
action: replace
target_label: __metrics_path__
regex: (.+)
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port, __meta_kubernetes_pod_ip]
action: replace
target_label: __address__
regex: (.+);(.+)
replacement: $2:$1
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
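For a pod to be picked up by the relabeling above, its Deployment template must carry the matching annotations. An illustrative fragment (the port matches the `app.listen(3000)` used later in this article):

```yaml
# Pod-template annotations matched by the relabel_configs above
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/path: "/metrics"
    prometheus.io/port: "3000"
```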
graph LR
POD1["NestJS Pod 1<br/>/metrics"] --> PROM["Prometheus"]
POD2["NestJS Pod 2<br/>/metrics"] --> PROM
POD3["NestJS Pod 3<br/>/metrics"] --> PROM
WORKER["BullMQ Worker<br/>/metrics"] --> PROM
PROM -->|"PromQL queries"| GRAFANA["Grafana"]
PROM -->|"custom.metrics.k8s.io"| ADAPTER["Prometheus Adapter<br/>(for HPA)"]
3. Loki for Log Aggregation¶
3.1 Structured Logging with Pino¶
Loki works best with structured JSON logs. Pino is one of the fastest Node.js JSON loggers and integrates cleanly with NestJS via nestjs-pino.
// logger.config.ts
import { LoggerModule } from 'nestjs-pino';
import { randomUUID } from 'crypto';
export const PinoLoggerModule = LoggerModule.forRoot({
pinoHttp: {
level: process.env.LOG_LEVEL || 'info',
transport:
process.env.NODE_ENV === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined, // JSON in production
// Generate a request ID for trace correlation
genReqId: (req) => req.headers['x-request-id'] || randomUUID(),
// Redact sensitive fields
redact: {
paths: ['req.headers.authorization', 'req.headers.cookie'],
censor: '[REDACTED]',
},
// Custom serializers for Warmwind-specific context
serializers: {
req: (req) => ({
method: req.method,
url: req.url,
query: req.query,
tenantId: req.raw?.tenantId,
}),
res: (res) => ({
statusCode: res.statusCode,
}),
},
},
});
Sample JSON log output:
{
"level": 30,
"time": 1710700000000,
"pid": 1,
"hostname": "warmwind-api-7b9f4c6d8-x2k4p",
"reqId": "550e8400-e29b-41d4-a716-446655440000",
"traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
"spanId": "00f067aa0ba902b7",
"tenantId": "tenant-acme",
"sessionId": "agent-abc123",
"context": "AgentOrchestrator",
"msg": "Agent pod created successfully",
"podName": "agent-abc123",
"resourceTier": "medium",
"creationTimeMs": 1823
}
3.2 Loki Label Strategy¶
Labels are Loki's primary index. Keep cardinality low (fewer than 100 unique values per label). High-cardinality data goes into structured metadata or the log line itself.
| Label | Cardinality | Indexed? | Example Values |
|---|---|---|---|
| app | ~5 | Yes (label) | api, worker, agent |
| namespace | ~20 | Yes (label) | warmwind, tenant-acme |
| level | 5 | Yes (label) | debug, info, warn, error, fatal |
| pod | ~200 | Structured metadata | warmwind-api-7b9f4c6d8-x2k4p |
| traceId | unbounded | Structured metadata | 4bf92f3577b34da6a3ce929d0e0e4736 |
| sessionId | ~1000 | Structured metadata | agent-abc123 |
| tenantId | ~50 | Structured metadata | tenant-acme |
Labels are like indexes in PostgreSQL
If you think of Loki as a database, labels are the indexed columns and log lines are the unindexed payload. Just as you would not create a PostgreSQL index on a UUID column with millions of unique values, you should not use high-cardinality values as Loki labels. Use structured metadata for fields like traceId and sessionId -- they are queryable but not indexed, which keeps Loki's memory and storage costs manageable.
3.3 LogQL Queries¶
LogQL is Loki's query language. It starts with a log stream selector (labels) and chains pipeline stages for filtering, parsing, and aggregation.
# All error logs from the API in the last hour
{app="api", level="error"}
# Errors mentioning OOM, parsed as JSON, filtered by tenant
{app="agent", level="error"}
| json
| msg =~ "OOM|out of memory"
| tenantId = "tenant-acme"
# Count errors per minute, grouped by context (module)
sum by (context) (
rate({app="api", level="error"} | json [5m])
)
# P99 agent session creation time from structured logs
quantile_over_time(0.99,
{app="api"} | json | context = "AgentOrchestrator" | msg = "Agent pod created successfully"
| unwrap creationTimeMs [10m]
) / 1000 # convert ms to seconds
# Find slow database queries (> 1 second)
{app="api"}
| json
| context = "TypeOrmLogger"
| unwrap durationMs
| durationMs > 1000
# Correlated logs: find all logs for a specific trace
{app=~"api|worker|agent"} | json | traceId = "4bf92f3577b34da6a3ce929d0e0e4736"
graph LR
QUERY["{app='api', level='error'}"] --> SELECTOR["Stream Selector<br/>(label matchers)"]
SELECTOR --> FILTER["| json<br/>| msg =~ 'OOM'"]
FILTER --> AGG["rate(...[5m])<br/>sum by (context)"]
AGG --> RESULT["Time Series<br/>(for Grafana panel)"]
3.4 Promtail / Grafana Agent Configuration¶
Logs flow from container stdout to Loki via Promtail (or Grafana Alloy, the successor):
# promtail-config.yaml (deployed as DaemonSet)
server:
http_listen_port: 3101
positions:
filename: /tmp/positions.yaml
clients:
- url: http://loki-gateway.monitoring:3100/loki/api/v1/push
scrape_configs:
- job_name: kubernetes-pods
kubernetes_sd_configs:
- role: pod
relabel_configs:
# Only scrape pods with the annotation
- source_labels: [__meta_kubernetes_pod_annotation_loki_io_scrape]
action: keep
regex: true
# Set the app label from the pod label
- source_labels: [__meta_kubernetes_pod_label_app]
target_label: app
# Set the namespace label
- source_labels: [__meta_kubernetes_namespace]
target_label: namespace
pipeline_stages:
# Parse JSON and extract level for indexing
- json:
expressions:
level: level
traceId: traceId
sessionId: sessionId
- labels:
level:
# High-cardinality fields go to structured metadata
- structured_metadata:
traceId:
sessionId:
# Drop debug logs in production to save storage
- match:
selector: '{level="debug"}'
action: drop
drop_counter_reason: debug_logs_dropped
4. Grafana Dashboards¶
4.1 What to Monitor for Warmwind¶
| Panel | Metric / Query | Alert Threshold |
|---|---|---|
| Active Agent Sessions | agent_sessions_active | > 500 (capacity warning) |
| VNC Frame Rate | avg(vnc_frame_rate_fps) | < 5 fps (user experience degradation) |
| API Error Rate | rate(http_requests_total{status_code=~"5.."}[5m]) / rate(http_requests_total[5m]) | > 1% |
| API P99 Latency | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) | > 2s |
| PostgreSQL Connection Pool | pg_pool_active_connections / (pg_pool_active_connections + pg_pool_idle_connections) | > 80% utilization |
| BullMQ Queue Depth | bullmq_waiting_jobs{queue="agent-sessions"} | > 50 (queue backing up) |
| Memory per Pod | container_memory_working_set_bytes{namespace="warmwind"} | > 80% of limit |
| WebSocket Connections | ws_active_connections | < 10 (may indicate connectivity issue) |
| Dropped VNC Frames | rate(vnc_dropped_frames_total[5m]) | > 10/sec per session |
| BullMQ Job Failure Rate | rate(bullmq_failed_total[5m]) / rate(bullmq_completed_total[5m]) | > 5% |
4.2 Dashboard Provisioning Configuration¶
Grafana dashboards can be provisioned as code via Kubernetes ConfigMaps. The provider below tells Grafana to watch a mounted directory and load every dashboard JSON it finds there:
# grafana-dashboard-provider.yaml (ConfigMap mounted into Grafana)
apiVersion: v1
kind: ConfigMap
metadata:
name: grafana-dashboard-provider
namespace: monitoring
data:
warmwind.yaml: |
apiVersion: 1
providers:
- name: warmwind
orgId: 1
folder: Warmwind
type: file
disableDeletion: false
editable: true
updateIntervalSeconds: 30
options:
path: /var/lib/grafana/dashboards/warmwind
foldersFromFilesStructure: false
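Dashboards and alert rules reference data sources by UID (the alert rules in section 5 use `datasourceUid: prometheus`), so the data sources themselves are also provisioned as code. A sketch; the URLs and UIDs are assumptions for this setup:

```yaml
# grafana-datasources.yaml (UIDs must match those used in dashboards/alerts)
apiVersion: 1
datasources:
  - name: Prometheus
    uid: prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.monitoring:9090
    isDefault: true
  - name: Loki
    uid: loki
    type: loki
    access: proxy
    url: http://loki-gateway.monitoring:3100
  - name: Tempo
    uid: tempo
    type: tempo
    access: proxy
    url: http://tempo.monitoring:3200
```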
4.3 Dashboard JSON Snippet¶
A representative dashboard JSON covering Warmwind's core panels:
{
"uid": "warmwind-overview",
"title": "Warmwind -- Service Overview",
"tags": ["warmwind", "production"],
"timezone": "utc",
"refresh": "30s",
"time": { "from": "now-1h", "to": "now" },
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"datasource": { "type": "prometheus" },
"query": "label_values(http_requests_total, namespace)",
"current": { "text": "warmwind", "value": "warmwind" },
"refresh": 2
},
{
"name": "tenant",
"type": "query",
"datasource": { "type": "prometheus" },
"query": "label_values(agent_sessions_active, tenant)",
"includeAll": true,
"refresh": 2
}
]
},
"panels": [
{
"title": "Active Agent Sessions",
"type": "stat",
"gridPos": { "h": 4, "w": 6, "x": 0, "y": 0 },
"targets": [
{
"expr": "sum(agent_sessions_active{namespace=\"$namespace\", tenant=~\"$tenant\"})",
"legendFormat": "Active Sessions"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 300 },
{ "color": "red", "value": 500 }
]
}
}
}
},
{
"title": "API Error Rate (%)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 4 },
"targets": [
{
"expr": "100 * sum(rate(http_requests_total{namespace=\"$namespace\", status_code=~\"5..\"}[5m])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[5m]))",
"legendFormat": "5xx Error Rate %"
}
],
"fieldConfig": {
"defaults": {
"custom": { "fillOpacity": 15, "lineWidth": 2 },
"unit": "percent",
"thresholds": {
"mode": "absolute",
"steps": [
{ "color": "green", "value": null },
{ "color": "red", "value": 1 }
]
}
}
}
},
{
"title": "API Latency (P50 / P95 / P99)",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 12, "y": 4 },
"targets": [
{
"expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
"legendFormat": "P50"
},
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
"legendFormat": "P95"
},
{
"expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le))",
"legendFormat": "P99"
}
],
"fieldConfig": {
"defaults": {
"unit": "s",
"custom": { "lineWidth": 2 }
}
}
},
{
"title": "BullMQ Queue Depth",
"type": "timeseries",
"gridPos": { "h": 8, "w": 12, "x": 0, "y": 12 },
"targets": [
{
"expr": "bullmq_waiting_jobs{namespace=\"$namespace\", queue=\"agent-sessions\"}",
"legendFormat": "Waiting ({{pod}})"
},
{
"expr": "bullmq_active_jobs{namespace=\"$namespace\", queue=\"agent-sessions\"}",
"legendFormat": "Active ({{pod}})"
}
],
"fieldConfig": {
"defaults": {
"custom": { "fillOpacity": 20 }
}
}
},
{
"title": "PostgreSQL Connection Pool",
"type": "gauge",
"gridPos": { "h": 8, "w": 6, "x": 12, "y": 12 },
"targets": [
{
"expr": "100 * sum(pg_pool_active_connections{namespace=\"$namespace\"}) / (sum(pg_pool_active_connections{namespace=\"$namespace\"}) + sum(pg_pool_idle_connections{namespace=\"$namespace\"}))",
"legendFormat": "Pool Utilization %"
}
],
"fieldConfig": {
"defaults": {
"unit": "percent",
"min": 0,
"max": 100,
"thresholds": {
"steps": [
{ "color": "green", "value": null },
{ "color": "yellow", "value": 60 },
{ "color": "red", "value": 80 }
]
}
}
}
},
{
"title": "Memory per Pod",
"type": "timeseries",
"gridPos": { "h": 8, "w": 6, "x": 18, "y": 12 },
"targets": [
{
"expr": "container_memory_working_set_bytes{namespace=\"$namespace\", container!=\"\", container!=\"POD\"}",
"legendFormat": "{{pod}} / {{container}}"
}
],
"fieldConfig": {
"defaults": {
"unit": "bytes",
"custom": { "fillOpacity": 10 }
}
}
}
]
}
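For the provider path in the previous section to resolve, the dashboard JSON is shipped in a ConfigMap and mounted into the Grafana pod. An illustrative fragment of the Grafana Deployment (the ConfigMap name is an assumption):

```yaml
# Fragment of the Grafana Deployment spec
spec:
  containers:
    - name: grafana
      volumeMounts:
        - name: warmwind-dashboards
          mountPath: /var/lib/grafana/dashboards/warmwind
          readOnly: true
  volumes:
    - name: warmwind-dashboards
      configMap:
        name: grafana-dashboard-warmwind   # holds the dashboard JSON above
```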
5. Alerting¶
5.1 Grafana Alerting Rules¶
Grafana Unified Alerting evaluates PromQL and LogQL expressions on a schedule and fires alerts to contact points (PagerDuty, Slack, email).
# alerting-rules.yaml (provisioned via ConfigMap)
apiVersion: 1
groups:
- orgId: 1
name: warmwind-critical
folder: Warmwind Alerts
interval: 1m
rules:
- uid: high-error-rate
title: "API Error Rate > 1%"
condition: C
data:
- refId: A
datasourceUid: prometheus
model:
expr: >
100 * sum(rate(http_requests_total{namespace="warmwind", status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total{namespace="warmwind"}[5m]))
instant: true
- refId: B
datasourceUid: "__expr__"
model:
type: reduce
expression: A
reducer: last
- refId: C
datasourceUid: "__expr__"
model:
type: threshold
expression: B
conditions:
- evaluator:
type: gt
params: [1] # > 1%
for: 5m
labels:
severity: critical
team: backend
annotations:
summary: "API 5xx error rate is {{ $values.B }}%"
runbook_url: "https://wiki.warmwind.dev/runbooks/api-error-rate"
- uid: queue-buildup
title: "BullMQ Queue Depth > 50"
condition: C
data:
- refId: A
datasourceUid: prometheus
model:
expr: "max(bullmq_waiting_jobs{queue=\"agent-sessions\"})"
instant: true
- refId: B
datasourceUid: "__expr__"
model:
type: reduce
expression: A
reducer: last
- refId: C
datasourceUid: "__expr__"
model:
type: threshold
expression: B
conditions:
- evaluator:
type: gt
params: [50]
for: 3m
labels:
severity: warning
team: backend
annotations:
summary: "Agent session queue has {{ $values.B }} waiting jobs"
- uid: agent-pod-oom
title: "Agent Pod OOM Kill Detected"
condition: C
data:
- refId: A
datasourceUid: prometheus
model:
# kube_pod_container_status_restarts_total carries no "reason" label;
# join with kube_state_metrics' last_terminated_reason to isolate OOM kills
expr: >
increase(kube_pod_container_status_restarts_total{namespace=~"tenant-.*"}[5m])
* on(namespace, pod, container) group_left()
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
instant: true
- refId: B
datasourceUid: "__expr__"
model:
type: reduce
expression: A
reducer: last
- refId: C
datasourceUid: "__expr__"
model:
type: threshold
expression: B
conditions:
- evaluator:
type: gt
params: [0]
for: 0s
labels:
severity: critical
team: platform
annotations:
summary: "Agent pod OOM killed in namespace {{ $labels.namespace }}"
5.2 Contact Points¶
# contact-points.yaml
apiVersion: 1
contactPoints:
- orgId: 1
name: backend-oncall
receivers:
- uid: pagerduty-backend
type: pagerduty
settings:
integrationKey: "${PD_INTEGRATION_KEY}"
severity: "{{ .CommonLabels.severity }}"
summary: "{{ .CommonAnnotations.summary }}"
- uid: slack-backend
type: slack
settings:
url: "${SLACK_WEBHOOK_URL}"
recipient: "#backend-alerts"
title: "{{ .CommonLabels.alertname }}"
text: "{{ .CommonAnnotations.summary }}"
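A notification policy routes alerts to contact points by label. The sketch below assumes an additional Slack-only contact point named slack-warnings exists alongside backend-oncall:

```yaml
# notification-policies.yaml (sketch)
apiVersion: 1
policies:
  - orgId: 1
    receiver: backend-oncall        # default route
    group_by: ["alertname"]
    routes:
      - receiver: backend-oncall    # pages via PagerDuty
        object_matchers:
          - ["severity", "=", "critical"]
      - receiver: slack-warnings    # assumed Slack-only contact point
        object_matchers:
          - ["severity", "=", "warning"]
```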
graph LR
PROM["Prometheus"] -->|"PromQL"| EVAL["Grafana Alert<br/>Evaluator"]
LOKI["Loki"] -->|"LogQL"| EVAL
EVAL -->|"firing"| ROUTE["Notification<br/>Policy"]
ROUTE -->|"severity=critical"| PD["PagerDuty<br/>(pages on-call)"]
ROUTE -->|"severity=warning"| SLACK["Slack<br/>#backend-alerts"]
ROUTE -->|"all"| EMAIL["Email Digest<br/>(hourly)"]
6. Distributed Tracing with OpenTelemetry¶
6.1 SDK Setup¶
OpenTelemetry must be initialized before NestJS imports any other module, so it can monkey-patch the HTTP, Express, pg, and ioredis libraries. BullMQ has no auto-instrumentation in this list; section 6.3 propagates trace context across the queue manually.
// tracing.ts -- imported FIRST in main.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { Resource } from '@opentelemetry/resources';
import {
ATTR_SERVICE_NAME,
ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[ATTR_SERVICE_NAME]: process.env.OTEL_SERVICE_NAME || 'warmwind-api',
[ATTR_SERVICE_VERSION]: process.env.APP_VERSION || '0.0.0',
'deployment.environment': process.env.NODE_ENV || 'development',
'k8s.namespace.name': process.env.K8S_NAMESPACE || 'warmwind',
'k8s.pod.name': process.env.K8S_POD_NAME || 'local',
}),
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://tempo:4317',
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://tempo:4317',
}),
exportIntervalMillis: 15000,
}),
instrumentations: [
getNodeAutoInstrumentations({
// Fine-tune which instrumentations are active
'@opentelemetry/instrumentation-http': { enabled: true },
'@opentelemetry/instrumentation-express': { enabled: true },
'@opentelemetry/instrumentation-pg': { enabled: true },
'@opentelemetry/instrumentation-ioredis': { enabled: true },
'@opentelemetry/instrumentation-graphql': { enabled: true },
}),
],
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', async () => {
await sdk.shutdown(); // flush pending spans before the pod terminates
process.exit(0);
});
// main.ts
import './tracing'; // MUST be first import
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
async function bootstrap() {
const app = await NestFactory.create(AppModule);
await app.listen(3000);
}
bootstrap();
6.2 Manual Span Creation¶
Auto-instrumentation covers HTTP, database, and Redis operations. For custom business logic (agent session creation, VNC relay), add manual spans:
import { Injectable } from '@nestjs/common';
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('warmwind-api', '1.0.0');
@Injectable()
export class AgentOrchestrator {
async createAgentSession(
sessionId: string,
tenantId: string,
tier: 'small' | 'medium' | 'large',
): Promise<void> {
return tracer.startActiveSpan(
'agent.session.create',
{
attributes: {
'agent.session.id': sessionId,
'agent.tenant.id': tenantId,
'agent.resource.tier': tier,
},
},
async (span) => {
try {
// Child span: create PVC
await tracer.startActiveSpan('k8s.pvc.create', async (pvcSpan) => {
await this.createPvc(sessionId);
pvcSpan.end();
});
// Child span: create Pod
await tracer.startActiveSpan('k8s.pod.create', async (podSpan) => {
const pod = await this.createPod(sessionId, tenantId, tier);
podSpan.setAttribute('k8s.pod.name', pod.metadata!.name!);
podSpan.end();
});
// Child span: wait for Pod readiness
await tracer.startActiveSpan('k8s.pod.wait_ready', async (waitSpan) => {
await this.waitForPodReady(sessionId, 60_000);
waitSpan.end();
});
span.setStatus({ code: SpanStatusCode.OK });
} catch (err) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: err.message,
});
span.recordException(err);
throw err;
} finally {
span.end();
}
},
);
}
}
6.3 Trace Propagation Across BullMQ¶
BullMQ jobs run in a separate worker process. To maintain trace context across the queue boundary, inject the trace context into the job data and extract it in the processor:
import { Injectable } from '@nestjs/common';
import { InjectQueue, Processor, WorkerHost } from '@nestjs/bullmq';
import { Job, Queue } from 'bullmq';
import { context, propagation, SpanStatusCode, trace } from '@opentelemetry/api';
// Producer: inject trace context into job data
@Injectable()
export class AgentJobProducer {
constructor(@InjectQueue('agent-sessions') private queue: Queue) {}
async enqueueSessionCreation(sessionId: string, tenantId: string): Promise<void> {
// Capture current trace context
const traceContext: Record<string, string> = {};
propagation.inject(context.active(), traceContext);
await this.queue.add('create-session', {
sessionId,
tenantId,
_traceContext: traceContext, // propagated through the queue
});
}
}
// Consumer: extract trace context and create a linked span
@Processor('agent-sessions')
export class AgentJobProcessor extends WorkerHost {
private readonly tracer = trace.getTracer('warmwind-worker');
constructor(private readonly orchestrator: AgentOrchestrator) {
super();
}
async process(job: Job): Promise<void> {
// Extract the trace context from the job data
const parentContext = propagation.extract(
context.active(),
job.data._traceContext,
);
// Create a new span linked to the original trace
return context.with(parentContext, () => {
return this.tracer.startActiveSpan(
'bullmq.process.create-session',
{
attributes: {
'bullmq.job.id': job.id!,
'bullmq.job.name': job.name,
'agent.session.id': job.data.sessionId,
},
},
async (span) => {
try {
await this.orchestrator.createAgentSession(
job.data.sessionId,
job.data.tenantId,
'medium',
);
span.setStatus({ code: SpanStatusCode.OK });
} catch (err) {
span.setStatus({ code: SpanStatusCode.ERROR, message: err.message });
span.recordException(err);
throw err;
} finally {
span.end();
}
},
);
});
}
}
graph LR
HTTP["HTTP Request<br/>traceId: abc..."] -->|"traceparent header"| API["NestJS API<br/>span: http.request"]
API -->|"inject context<br/>into job.data"| BULL["BullMQ Queue<br/>(Redis)"]
BULL -->|"extract context<br/>from job.data"| WORKER["BullMQ Worker<br/>span: bullmq.process"]
WORKER -->|"traceparent header"| K8S["K8s API<br/>span: k8s.pod.create"]
API --> TEMPO["Tempo"]
WORKER --> TEMPO
6.4 Correlating Logs with Traces¶
Inject the trace ID into every log line so you can jump from a Grafana trace view to the corresponding logs in Loki:
// pino-otel-mixin.ts
import { trace, context } from '@opentelemetry/api';
export function otelMixin(): Record<string, string> {
const span = trace.getSpan(context.active());
if (!span) return {};
const spanContext = span.spanContext();
return {
traceId: spanContext.traceId,
spanId: spanContext.spanId,
traceFlags: spanContext.traceFlags.toString(16).padStart(2, '0'), // W3C flags are two hex digits
};
}
// Usage in logger config:
// pinoHttp: { mixin: otelMixin, ... }
In Grafana, configure a derived field on the Loki data source that extracts trace IDs and links them to Tempo:
Field name: traceId
Regex: "traceId":"([a-f0-9]+)"
Internal link: Tempo data source
Query: ${__value.raw}
This enables one-click navigation: click a trace ID in a Loki log line and jump directly to the trace waterfall in Tempo.
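The same derived field can be provisioned as code on the Loki data source. A sketch, assuming the Tempo data source UID is tempo:

```yaml
# Loki datasource with a derived field linking traceId to Tempo
datasources:
  - name: Loki
    uid: loki
    type: loki
    url: http://loki-gateway.monitoring:3100
    jsonData:
      derivedFields:
        - name: traceId
          matcherRegex: '"traceId":"([a-f0-9]+)"'
          datasourceUid: tempo
          url: "${__value.raw}"   # the captured trace ID becomes the Tempo query
```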
7. SLOs and Error Budgets¶
7.1 Defining Agent Session Reliability¶
An SLO quantifies reliability as a measurable target over a rolling window.
| SLO | Target | Measurement | Error Budget (30d) |
|---|---|---|---|
| Session creation success rate | 99.5% | agent_sessions_created_total vs failed | 0.5% = ~36 failures per 7,200 sessions |
| Session availability (uptime) | 99.9% | Seconds agent pod was Running / total session duration | 0.1% = ~43 seconds of downtime per 12h session |
| API latency P99 | < 500ms | http_request_duration_seconds | 0.5% of requests may exceed 500ms |
| VNC frame delivery | 99% | Frames delivered / frames produced | 1% frame drop acceptable |
7.2 SLO Recording Rules¶
Prometheus recording rules pre-compute SLO metrics for efficient dashboard queries:
```yaml
# slo-recording-rules.yaml
groups:
  - name: warmwind-slos
    interval: 1m
    rules:
      # Session creation success rate (30-day rolling)
      - record: slo:agent_session_creation:success_rate_30d
        expr: >
          1 - (
            sum(increase(agent_sessions_created_total{outcome="failed"}[30d]))
            /
            sum(increase(agent_sessions_created_total[30d]))
          )
      # Remaining error budget (1.0 = full budget, 0.0 = exhausted)
      - record: slo:agent_session_creation:error_budget_remaining
        expr: >
          1 - (
            (1 - slo:agent_session_creation:success_rate_30d)
            /
            (1 - 0.995)
          )
      # API latency SLO: fraction of requests under 500ms
      - record: slo:api_latency:good_ratio_5m
        expr: >
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
          /
          sum(rate(http_request_duration_seconds_count[5m]))
      # Burn rate: how fast are we consuming error budget?
      # > 1.0 means consuming faster than sustainable
      - record: slo:agent_session_creation:burn_rate_1h
        expr: >
          (
            sum(increase(agent_sessions_created_total{outcome="failed"}[1h]))
            /
            sum(increase(agent_sessions_created_total[1h]))
          )
          /
          (1 - 0.995)
      # Same burn rate over a 6h window (used by the multi-window alert below)
      - record: slo:agent_session_creation:burn_rate_6h
        expr: >
          (
            sum(increase(agent_sessions_created_total{outcome="failed"}[6h]))
            /
            sum(increase(agent_sessions_created_total[6h]))
          )
          /
          (1 - 0.995)
```
7.3 Error Budget Alerting¶
Alert when the burn rate indicates the error budget will be exhausted before the window ends:
```yaml
# Multi-window, multi-burn-rate alert (Google SRE book pattern)
- alert: AgentSessionErrorBudgetBurn
  expr: >
    slo:agent_session_creation:burn_rate_1h > 14.4
    and
    slo:agent_session_creation:burn_rate_6h > 6
  for: 5m
  labels:
    severity: critical
    team: backend
  annotations:
    summary: "Agent session creation error budget burning at {{ $value }}x sustainable rate"
    description: "At this rate, the 30-day error budget will be exhausted in less than 3 days"
```
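The 14.4 and 6 thresholds are not arbitrary. Following the SRE Workbook's derivation, each corresponds to a fixed fraction of the 30-day budget being consumed within the alert's lookback window. A sketch of the math:

```typescript
// A burn-rate threshold is chosen so the alert fires when a given fraction of
// the 30-day error budget is consumed within the lookback window:
//   threshold = budgetFractionConsumed * (30d / window)
const HOURS_IN_30D = 30 * 24; // 720

function burnRateThreshold(budgetFractionConsumed: number, windowHours: number): number {
  return budgetFractionConsumed * (HOURS_IN_30D / windowHours);
}

const pageThreshold = burnRateThreshold(0.02, 1);   // 2% of budget gone in 1h -> 14.4x: page
const ticketThreshold = burnRateThreshold(0.05, 6); // 5% of budget gone in 6h -> 6x: warn

// At a sustained burn rate B, a 30-day budget is exhausted in 30 / B days.
const daysToExhaustion = (burnRate: number) => 30 / burnRate;
daysToExhaustion(pageThreshold); // ~2.1 days, hence the "less than 3 days" annotation
```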
graph LR
GOOD["Good Events<br/>(successful sessions)"] --> RATIO["Success Ratio<br/>= good / total"]
BAD["Bad Events<br/>(failed sessions)"] --> RATIO
RATIO -->|"compare to target"| SLO["SLO: 99.5%"]
SLO --> BUDGET["Error Budget<br/>= 0.5% of total"]
BUDGET -->|"burn rate > 14x"| ALERT["Page On-Call"]
BUDGET -->|"burn rate > 6x"| WARN["Slack Warning"]
SLOs shift the conversation from uptime to customer impact
Instead of debating whether a 2-minute outage at 3 AM matters, SLOs frame reliability in terms of user-facing impact. If your error budget is healthy, you can deploy aggressively. If it is nearly exhausted, you freeze features and focus on reliability. This is the same risk-management framework that Warmwind likely uses to balance feature velocity with platform stability.
8. Production Observability Stack Deployment¶
8.1 Full Stack Architecture¶
graph LR
subgraph "Application Tier"
API["NestJS API Pods"]
WORKER["BullMQ Worker Pods"]
AGENT["Agent Pods"]
end
subgraph "Collection Tier"
PROM["Prometheus"]
PROMTAIL["Promtail<br/>(DaemonSet)"]
COLLECTOR["OTel Collector"]
end
subgraph "Storage Tier"
TSDB["Prometheus TSDB<br/>(metrics)"]
LOKIST["Loki<br/>(logs + S3)"]
TEMPOST["Tempo<br/>(traces + S3)"]
end
subgraph "Query Tier"
GRAFANA["Grafana"]
end
API -->|"/metrics"| PROM
WORKER -->|"/metrics"| PROM
API -->|"stdout JSON"| PROMTAIL
WORKER -->|"stdout JSON"| PROMTAIL
AGENT -->|"stdout"| PROMTAIL
API -->|"OTLP gRPC"| COLLECTOR
WORKER -->|"OTLP gRPC"| COLLECTOR
PROM --> TSDB
PROMTAIL --> LOKIST
COLLECTOR --> TEMPOST
TSDB --> GRAFANA
LOKIST --> GRAFANA
TEMPOST --> GRAFANA
8.2 Grafana Data Source Provisioning¶
```yaml
# grafana-datasources.yaml (ConfigMap)
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        uid: prometheus
        url: http://prometheus-server.monitoring:9090
        access: proxy
        isDefault: true
        jsonData:
          timeInterval: "15s"
          exemplarTraceIdDestinations:
            - name: traceId
              datasourceUid: tempo
      - name: Loki
        type: loki
        uid: loki
        url: http://loki-gateway.monitoring:3100
        access: proxy
        jsonData:
          derivedFields:
            - name: TraceID
              matcherRegex: '"traceId":"([a-f0-9]+)"'
              url: "$${__value.raw}"
              datasourceUid: tempo
              urlDisplayLabel: "View Trace"
      - name: Tempo
        type: tempo
        uid: tempo
        url: http://tempo.monitoring:3200
        access: proxy
        jsonData:
          tracesToLogsV2:
            datasourceUid: loki
            filterByTraceID: true
            filterBySpanID: false
            tags:
              - key: "service.name"
                value: "app"
          tracesToMetrics:
            datasourceUid: prometheus
            tags:
              - key: "service.name"
                value: "job"
          nodeGraph:
            enabled: true
          serviceMap:
            datasourceUid: prometheus
```
9. Key Takeaways for the Interview¶
| Topic | What to Demonstrate |
|---|---|
| Three pillars | Metrics (Prometheus), logs (Loki), traces (Tempo) -- what each answers |
| NestJS metrics | prom-client setup, custom counters/histograms/gauges, interceptor pattern |
| Structured logging | Pino JSON, label cardinality, LogQL queries, structured metadata |
| Grafana dashboards | Key panels for Warmwind, provisioning as code, template variables |
| Alerting | Multi-severity rules, PagerDuty/Slack routing, runbook links |
| OpenTelemetry | SDK init before imports, auto-instrumentation, manual spans |
| Trace propagation | W3C traceparent header, BullMQ job data injection, log correlation |
| SLOs | Error budget math, burn rate alerts, recording rules |
| Dashboard JSON | Can provision dashboards as code via ConfigMaps |
References
- Grafana Loki documentation -- grafana.com
- LogQL query best practices -- grafana.com
- Loki structured metadata -- grafana.com
- Grafana dashboard provisioning -- grafana.com
- OpenTelemetry NestJS guide -- SigNoz
- nestjs-otel: OpenTelemetry module for NestJS -- GitHub
- NestJS observability with OpenTelemetry, Prometheus, Jaeger, Grafana -- GitHub
- prom-client: Prometheus client for Node.js -- npm
- NestJS Terminus health checks -- docs.nestjs.com
- Grafana Kubernetes dashboards -- GitHub
- The Site Reliability Workbook: SLOs -- Google SRE