AWS · Prometheus · Istio Dashboard

⚡

Cache Management

LRU in-process metadata cache stats and controls

Size

—

Max

—

PromQL hit ratio →
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))

📊

Simulate Load

Generate synthetic Prometheus metrics to practise PromQL

Requests: 200

Error rate: 5%

Max latency: 500 ms

⚡

Circuit Breaker Management

Real-time state + manual reset. In PromQL: circuit_breaker_state{circuit_breaker_state="open"} == 1

Service	State	Failures	Reset
Loading…

📡

Prometheus Scrape Endpoints

Two endpoints — Istio mesh (port 5000) and dedicated scrape server (port 9100)

App metrics (proxied through Nginx): /api/metrics

Dedicated metrics server (bypasses Istio): :9100/metrics

Frontend container: istio-frontend-deployment-version2-6b475cb8c6-9jjvv

📘

PromQL Reference — Exam Cheatsheet

All queries target metrics exported by this app. Click Copy to paste into Grafana/Prometheus.

Exam Prep

COUNTER — rate / irate / increase

Request Rate (RED — Rate)

Per-second request rate per endpoint over a 5m window. Foundation of the RED method.

sum by (endpoint) (
  rate(http_requests_total[5m])
)
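
As a rough illustration of what rate() computes for a single counter series, the sketch below sums the increases between consecutive samples, treats any drop as a counter reset, and divides by the elapsed time. It deliberately skips the extrapolation Prometheus applies at the window edges; the function name `simple_rate` and the sample values are made up for the example.

```python
def simple_rate(samples):
    """Approximate per-second rate for one counter series.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Simplification: no extrapolation to the window boundaries.
    """
    if len(samples) < 2:
        return None  # at least two samples are needed
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A decrease means the counter was reset; count from zero again.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical samples scraped every 60 s; the counter resets before t=180.
samples = [(0, 100), (60, 160), (120, 220), (180, 30), (240, 90)]
print(simple_rate(samples))  # 0.875
```

The reset handling is why rate() is safe to use across pod restarts, whereas subtracting raw counter values is not.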

Error Rate (RED — Errors)

Fraction of requests returning 5xx. Use as SLI for availability SLO.

sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

Total Uploads in Last Hour

increase() returns the counter's total increase over the range (extrapolated, so the result can be fractional). Use it for capacity planning.

increase(file_uploads_total{status="success"}[1h])

Instant Rate — irate()

irate() uses only the last two samples in the range, so it reacts quickly to spikes but is noisy. The [5m] range only limits how far back it looks for those two samples.

irate(http_requests_total[5m])
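
The difference between an instant rate and a window average can be seen with a small sketch. It assumes one counter series with made-up samples; `simple_irate` is an illustrative name, not a Prometheus API.

```python
def simple_irate(samples):
    """Per-second rate from the last two samples of one counter series."""
    if len(samples) < 2:
        return None
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # A drop in value is treated as a counter reset.
    delta = v_last - v_prev if v_last >= v_prev else v_last
    return delta / (t_last - t_prev)


# Hypothetical samples: steady 1 req/s, then a burst in the final minute.
samples = [(0, 100), (60, 160), (120, 220), (180, 820)]
window_average = (samples[-1][1] - samples[0][1]) / (samples[-1][0] - samples[0][0])

print(simple_irate(samples))  # 10.0
print(window_average)         # 4.0
```

The burst dominates the instant rate but is smoothed out in the window average, which is why irate() suits fast-moving graphs and rate() suits alerts.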

HISTOGRAM — histogram_quantile()

p95 Latency per Endpoint (RED — Duration)

Histogram buckets can be summed across instances before histogram_quantile() is applied. This is the main advantage over a Summary.

histogram_quantile(0.95,
  sum by (le, endpoint) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
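
To show why summing by le works, here is a minimal sketch of the quantile estimate: merge the cumulative bucket rates of two pods, find the bucket containing the target rank, and interpolate linearly inside it. It assumes 0 < q <= 1 and a lowest bucket starting at 0, and it omits edge cases that Prometheus handles; the pod data is invented.

```python
import math
from collections import defaultdict


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by bound,
    with math.inf as the last bound. Assumes 0 < q <= 1.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical per-pod bucket rates, keyed by the le label.
pod_a = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}
pod_b = {0.1: 10, 0.5: 40, 1.0: 96, math.inf: 100}

# Equivalent of: sum by (le) (...)
summed = defaultdict(float)
for pod in (pod_a, pod_b):
    for le, bucket_rate in pod.items():
        summed[le] += bucket_rate

p95 = histogram_quantile(0.95, sorted(summed.items()))
print(round(p95, 3))  # 0.955
```

Because the result is interpolated within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.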

Top 5 Slowest Endpoints (p99)

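topk() ranks series by value, which helps surface latency outliers in dashboards. Aggregate the buckets by (le, endpoint) first so that each result is one endpoint rather than one raw series.

topk(5,
  histogram_quantile(0.99,
    sum by (le, endpoint) (
      rate(http_request_duration_seconds_bucket[5m])
    )
  )
)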

S3 Upload Latency Distribution

p99 latency of S3 put_object calls. Compare the result with the bucket boundaries when choosing SLO thresholds.

histogram_quantile(0.99,
  rate(s3_operation_duration_seconds_bucket
       {operation="put_object"}[5m])
)

Avg Request Duration (via _sum / _count)

Mean latency from the histogram's _sum and _count series. Simpler than a quantile, but it hides tail latency.

rate(http_request_duration_seconds_sum[5m])
  /
rate(http_request_duration_seconds_count[5m])

SUMMARY — client-side quantiles (no aggregation)

p99 via Summary (single instance)

Summary quantiles are pre-computed in the client. They are accurate per pod but cannot be meaningfully aggregated across pods.

request_processing_seconds{quantile="0.99"}

Summary vs Histogram — when to use each

Summary: accurate per-instance quantiles. Histogram: aggregatable — use when you have multiple replicas.

# Summary (per-pod, cannot aggregate)
db_query_processing_seconds{quantile="0.95"}

# Histogram (aggregatable across pods)
histogram_quantile(0.95,
  sum by (le) (rate(
    postgres_query_duration_seconds_bucket[5m]))
)
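
A small numeric sketch of why per-pod quantiles cannot be combined: averaging two pods' p95 values gives a different answer from the p95 of all requests pooled together. The latencies are invented, and the nearest-rank percentile used here is just one simple definition, not what a Summary client computes internally.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (pct * len(ordered) + 99) // 100  # ceil(pct / 100 * n) in integer math
    return ordered[rank - 1]


# Hypothetical latencies in seconds: a busy fast pod and a quiet slow pod.
pod_a = [0.1] * 95 + [0.2] * 5
pod_b = [1.0] * 10

p95_a = nearest_rank_percentile(pod_a, 95)
p95_b = nearest_rank_percentile(pod_b, 95)
averaged = (p95_a + p95_b) / 2
pooled = nearest_rank_percentile(pod_a + pod_b, 95)

print(p95_a, p95_b)        # 0.1 1.0
print(round(averaged, 2))  # 0.55
print(pooled)              # 1.0
```

The averaged figure ignores how much traffic each pod served, which is exactly the information histogram buckets preserve.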

GAUGE — saturation, health, staleness

Dependency Down Alert

A gauge value of 0 means the dependency is unhealthy. Fire the alert immediately (no for: duration needed).

dependency_health == 0

Connection Pool Saturation

USE method — Utilisation > 80% is a saturation signal before errors occur.

db_pool_connections_active
  /
db_pool_connections_max
> 0.8

Stale Upload Detection (time since)

Alert when no successful upload for more than 1 h — Gauge as a timestamp pattern.

time() - last_successful_s3_upload_timestamp_seconds
> 3600

Memory Pressure Prediction — predict_linear()

Predicts if memory usage will exceed total in the next 4 hours based on current trend.

predict_linear(
  system_memory_used_bytes[1h], 4*3600
) > system_memory_total_bytes
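
The idea behind predict_linear() can be sketched as a least-squares line fitted to the samples and extended into the future. This simplified version projects from the last sample's timestamp rather than from the evaluation time, and the memory figures are invented.

```python
def predict_linear(samples, seconds_ahead):
    """Fit v = slope * t + intercept by least squares and extrapolate.

    samples: list of (timestamp_seconds, value) with at least two distinct timestamps.
    Simplification: predicts relative to the last sample, not the query time.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    covariance = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    variance = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = covariance / variance
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Hypothetical 1h of samples: 6 GB used, growing by 100 kB per second.
samples = [(t, 6_000_000_000 + 100_000 * t) for t in range(0, 3601, 600)]
total_bytes = 7_500_000_000

predicted = predict_linear(samples, 4 * 3600)
print(round(predicted / 1e9, 2))  # 7.8
print(predicted > total_bytes)    # True
```

A longer range smooths out short-lived noise at the cost of reacting more slowly to a change in trend.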

SLO / ERROR BUDGET

Error Budget Remaining

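1 = full budget, 0 = exhausted. SLO target = 99.9% → allowed error rate = 0.001. The observed error ratio is divided by the allowed rate, so the result is the share of budget left over the 1h window.

1 - (
  (
    rate(slo_errors_total{slo_name="availability"}[1h])
      /
    rate(slo_requests_total{slo_name="availability"}[1h])
  )
  / (1 - 0.999)
)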
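
The budget arithmetic (1 = full, 0 = exhausted) can be checked with a few numbers. This is a sketch with invented request counts; `error_budget_remaining` is an illustrative helper, not something the app exports.

```python
def error_budget_remaining(errors, requests, slo_target):
    """Share of the error budget still unspent: 1 is untouched, 0 is exhausted."""
    allowed_error_ratio = 1 - slo_target
    observed_error_ratio = errors / requests
    return 1 - observed_error_ratio / allowed_error_ratio


# SLO target of 99.9% over 100,000 requests allows roughly 100 errors.
print(round(error_budget_remaining(0, 100_000, 0.999), 3))    # 1.0
print(round(error_budget_remaining(40, 100_000, 0.999), 3))   # 0.6
print(round(error_budget_remaining(250, 100_000, 0.999), 3))  # -1.5
```

A negative result means the budget is overspent, which is useful to keep visible rather than clamp to zero.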

Multi-Window Burn Rate (fast + slow)

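Burn-rate alert in the style of the Google SRE Workbook. Burn rate is the observed error ratio divided by the allowed error ratio (1 - 0.999). The fast window reacts to sudden spikes and the slow window confirms that the burn is sustained; because the two conditions are joined with and, the alert fires only when both hold.

# Fast window (1h): burn rate > 14x budget
(rate(slo_errors_total[1h])
 / rate(slo_requests_total[1h]))
/ (1 - 0.999) > 14
and
# Slow window (6h): burn rate > 6x budget
(rate(slo_errors_total[6h])
 / rate(slo_requests_total[6h]))
/ (1 - 0.999) > 6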
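
A worked sketch of the burn-rate arithmetic, assuming a 30-day SLO period (the page does not state one) and invented error counts. A burn rate of 1 spends the budget exactly over the SLO period; higher values spend it proportionally faster.

```python
SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # allowed error ratio


def burn_rate(errors, requests):
    """How many times faster than allowed the budget is being spent."""
    return (errors / requests) / ERROR_BUDGET


def should_alert(fast, slow, fast_threshold=14, slow_threshold=6):
    """Both windows must exceed their thresholds, mirroring the AND in the query."""
    return fast > fast_threshold and slow > slow_threshold


def budget_consumed(rate, window_hours, slo_period_hours=30 * 24):
    """Fraction of the whole budget spent at this burn rate over the window.

    The 30-day period is an assumption made for this example.
    """
    return rate * window_hours / slo_period_hours


fast = burn_rate(errors=200, requests=10_000)  # 1h window, 2% errors
slow = burn_rate(errors=480, requests=60_000)  # 6h window, 0.8% errors
print(round(fast, 1), round(slow, 1))          # 20.0 8.0
print(should_alert(fast, slow))                # True

# A short spike that has not moved the 6h window does not alert.
print(should_alert(fast, burn_rate(errors=120, requests=60_000)))  # False

print(round(budget_consumed(14, 1), 4))        # 0.0194
```

The last line shows where the threshold comes from: one hour at 14x spends about 2% of a 30-day budget.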

Cache Hit Ratio

Fraction of cache lookups served from memory. Low hit ratio → DynamoDB pressure.

rate(cache_hits_total{cache_name="metadata"}[5m])
  / (
  rate(cache_hits_total{cache_name="metadata"}[5m])
  + rate(cache_misses_total{cache_name="metadata"}[5m])
)

Circuit Breaker Open Alert + ENUM usage

Enum emits a separate Gauge (0/1) per state. Alert when "open" state = 1.

# Alert: any circuit breaker is open
circuit_breaker_state{
  circuit_breaker_state="open"
} == 1

# Count of trips (Counter)
increase(circuit_breaker_trips_total[1h])

ADVANCED — aggregations, labels, absent

absent() — Dead-Man's-Switch Alert

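Fires when no series matches the selector, for example when the job has vanished from service discovery or an exporter has gone silent. A target that is still configured but failing its scrapes reports up == 0 instead.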

absent(up{job="backend"})

label_replace() — Transform Labels

Rename or extract label values — useful for joining metrics from different exporters.

label_replace(
  http_requests_total,
  "short_endpoint", "$1",
  "endpoint", "^/([^/]+).*"
)
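
The behaviour of label_replace() on one series can be mimicked in a few lines: the regex must match the whole source label value, and on a match the destination label is set from the capture groups. This sketch uses Python's re module, whose syntax differs slightly from the RE2 engine Prometheus uses; the label values are invented.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Return a copy of labels with dst set from src if regex matches fully."""
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)  # no match: series is returned unchanged
    # Convert PromQL-style $1 references to Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return {**labels, dst: match.expand(template)}


matched = label_replace(
    {"__name__": "http_requests_total", "endpoint": "/api/files/123"},
    "short_endpoint", "$1", "endpoint", "^/([^/]+).*",
)
unmatched = label_replace(
    {"__name__": "http_requests_total", "endpoint": "metrics"},
    "short_endpoint", "$1", "endpoint", "^/([^/]+).*",
)

print(matched["short_endpoint"])      # api
print("short_endpoint" in unmatched)  # False
```

Note that a non-matching series passes through untouched rather than being dropped.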

Info Metric Join (group_left)

Join app metadata labels onto a metric — fleet dashboards use this to filter by version.

rate(http_requests_total[5m])
  * on(instance) group_left(version)
app_build_info
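
The many-to-one join can be pictured as a dictionary lookup: every rate series finds the single build-info series with the same instance, copies its version label, and multiplies by its value (1 for an info metric). The series below are invented and `join_group_left` is an illustrative name.

```python
def join_group_left(left, right, on, extra_label):
    """Many-to-one join: copy extra_label from the matching right-hand series."""
    index = {labels[on]: (labels, value) for labels, value in right}
    joined = []
    for labels, value in left:
        match = index.get(labels[on])
        if match is None:
            continue  # left-hand series without a match are dropped
        right_labels, right_value = match
        joined.append(({**labels, extra_label: right_labels[extra_label]}, value * right_value))
    return joined


# Hypothetical rate(http_requests_total[5m]) results.
rates = [
    ({"instance": "10.0.0.1:5000", "endpoint": "/upload"}, 2.0),
    ({"instance": "10.0.0.1:5000", "endpoint": "/search"}, 5.0),
    ({"instance": "10.0.0.2:5000", "endpoint": "/upload"}, 1.5),
]
# Hypothetical app_build_info series; the value is always 1.
build_info = [
    ({"instance": "10.0.0.1:5000", "version": "v1"}, 1.0),
    ({"instance": "10.0.0.2:5000", "version": "v2"}, 1.0),
]

for labels, value in join_group_left(rates, build_info, "instance", "version"):
    print(labels["endpoint"], labels["version"], value)
```

Multiplying by 1 leaves the rate unchanged, so the join only adds the label.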

Request Rate Breakdown by Status Code

sum by (status_code) shows how traffic splits across response codes, which makes error classes easy to compare.

sum by (status_code) (
  rate(http_requests_total[5m])
)
