AWS · Prometheus · Istio Dashboard

S3  ·  DynamoDB  ·  PostgreSQL  ·  Istio  ·  Prometheus

v2.0
Backend API
Not checked yet
Readiness
Not checked yet
App State
Via /stats
Container
Pod hostname
🩺

Dependency Health

Live status of all external dependencies

S3
Unknown
🐘
PostgreSQL
Unknown
DynamoDB
Unknown
🔗

DB Connection Pool

Utilisation, active vs max connections

Active: — Max: —
Utilisation: —% Idle: —
PromQL → db_pool_connections_active / db_pool_connections_max

Circuit Breaker States

CLOSED = healthy · OPEN = blocking calls (pulsing) · HALF_OPEN = recovery probe
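
The three states above can be read as a small state machine. The following is a minimal sketch, not this app's actual breaker: the class name, the `max_failures` and `reset_timeout` parameters, and the injectable `clock` are all assumptions made for illustration.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    """Toy breaker: trips after `max_failures` consecutive failures."""

    def __init__(self, max_failures=3, reset_timeout=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so the timeout can be tested
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        # Once the timeout has elapsed, an OPEN breaker moves to HALF_OPEN
        # and lets probe requests through.
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.reset_timeout:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self):
        # A successful call (including a recovery probe) closes the breaker.
        self.failures = 0
        self.state = State.CLOSED

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many consecutive failures, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.max_failures:
            self.state = State.OPEN
            self.opened_at = self.clock()
```

A caller would check `allow_request()` before each dependency call and report the outcome with `record_success()` or `record_failure()`.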

Service · State · Failures · Action
Loading…

In-Process Cache Stats

LRU metadata cache — hit rate drives performance

Cache Size
items in LRU
Max Size
256
max capacity
Cache Name
LRU instance

Upload File to S3

Stores file in S3 and metadata in DynamoDB

Idle
🔎

Search S3 by Prefix

List all objects whose key starts with the prefix

Idle
📂

Browse S3 Bucket

List, download, or delete objects. Select rows for batch delete.

0 selected
Click Refresh List to load files
🗄

Postgres Live Query

SELECT NOW() + server version from RDS


        
Metrics: postgres_query_duration_seconds_bucket
✏️

Postgres Write Event

INSERT into the events table — triggers write latency metrics


      
🔍

DynamoDB — Fetch File Metadata

Scan by file_name with cache lookup first

First call hits DynamoDB; repeat call is served from in-process LRU cache (check cache hit metrics).
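
The cache-first lookup described above is commonly called the cache-aside pattern. A minimal sketch follows, assuming an `OrderedDict`-based LRU; the class, the `get_or_load` method, and `fake_dynamodb_lookup` are hypothetical stand-ins, and only the capacity of 256 is taken from this page.

```python
from collections import OrderedDict


class LRUCache:
    def __init__(self, max_size=256):
        self.max_size = max_size
        self.items = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_load(self, key, loader):
        if key in self.items:
            self.hits += 1
            self.items.move_to_end(key)      # mark as most recently used
            return self.items[key]
        self.misses += 1
        value = loader(key)                  # e.g. the DynamoDB lookup
        self.items[key] = value
        if len(self.items) > self.max_size:
            self.items.popitem(last=False)   # evict the least recently used
        return value


def fake_dynamodb_lookup(file_name):
    # Hypothetical stand-in for the real DynamoDB call.
    return {"file_name": file_name, "size": 123}


cache = LRUCache(max_size=256)
cache.get_or_load("report.pdf", fake_dynamodb_lookup)  # miss: calls the loader
cache.get_or_load("report.pdf", fake_dynamodb_lookup)  # hit: served from memory
print(cache.hits, cache.misses)
```

The `hits` and `misses` counters play the role of the cache hit and miss metrics mentioned above.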

    
💓

Service Health & Readiness

Liveness (/health) and readiness (/ready) probes

Backend: Unknown
Readiness: Unknown

Cache Management

LRU in-process metadata cache stats and controls

Size
Max
PromQL hit ratio →
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
📊

Simulate Load

Generate synthetic Prometheus metrics to practise PromQL


      

Circuit Breaker Management

Real-time state + manual reset. In PromQL: circuit_breaker_state{circuit_breaker_state="open"} == 1

Service · State · Failures · Reset
Loading…
📡

Prometheus Scrape Endpoints

Two endpoints — Istio mesh (port 5000) and dedicated scrape server (port 9100)

App metrics (proxied through Nginx): /api/metrics
Dedicated metrics server (bypasses Istio): :9100/metrics
Frontend container: istio-frontend-deployment-version2-6b475cb8c6-9jjvv
📘

PromQL Reference — Exam Cheatsheet

All queries target metrics exported by this app. Click Copy to paste into Grafana/Prometheus.

Exam Prep
COUNTER — rate / irate / increase
Request Rate (RED — Rate)
Per-second request rate per endpoint over a 5 m window. Foundation of the RED method.
sum by (endpoint) (
  rate(http_requests_total[5m])
)
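
To make the per-second rate concrete, here is a simplified sketch of what rate() does with raw counter samples. It is an approximation: `simple_rate` is a hypothetical helper that handles counter resets but leaves out the extrapolation to the window boundaries that Prometheus applies.

```python
def simple_rate(samples):
    """Per-second increase of a counter over (timestamp, value) samples.

    Simplified: handles counter resets but skips the extrapolation to the
    window boundaries that Prometheus applies.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. pod restart); count from 0.
        increase += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Counter resets between t=30 and t=45 (160 -> 10).
samples = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(simple_rate(samples))  # 100 requests over 60 s
```

Because the reset is detected, the restart does not show up as a negative rate.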
Error Rate (RED — Errors)
Fraction of requests returning 5xx. Use as SLI for availability SLO.
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
Total Uploads in Last Hour
increase() gives the total increase over a range. The result is extrapolated, so it can be fractional. Use it for capacity planning.
increase(file_uploads_total{status="success"}[1h])
Instant Rate — irate()
irate uses only the last two samples in the range. It reacts quickly to spikes but is too noisy for alerting or long-term trends.
irate(http_requests_total[5m])
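
The difference between a window average and the last-two-samples rate is easiest to see with numbers. A minimal sketch, with `simple_irate` as a hypothetical helper and made-up sample data:

```python
def simple_irate(samples):
    """Rate between the last two (timestamp, value) samples in the window."""
    if len(samples) < 2:
        return None
    (t1, v1), (t2, v2) = samples[-2], samples[-1]
    delta = (v2 - v1) if v2 >= v1 else v2  # treat a drop as a counter reset
    return delta / (t2 - t1)


# Steady 1 req/s, then a burst of 300 requests in the final 15 s.
samples = [(0, 0), (15, 15), (30, 30), (45, 45), (60, 345)]
window_average = (samples[-1][1] - samples[0][1]) / (samples[-1][0] - samples[0][0])
print(window_average)         # 5.75
print(simple_irate(samples))  # 20.0
```

The window average smooths the burst to 5.75 req/s, while the instant rate reports the full 20 req/s of the last interval.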
HISTOGRAM — histogram_quantile()
p95 Latency per Endpoint (RED — Duration)
Histogram buckets can be summed across instances before histogram_quantile() is applied. This is the main advantage over Summary.
histogram_quantile(0.95,
  sum by (le, endpoint) (
    rate(http_request_duration_seconds_bucket[5m])
  )
)
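
The quantile is estimated from cumulative bucket counts by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that idea under simplifying assumptions; `bucket_quantile` and the bucket data are hypothetical, and edge cases that Prometheus handles (such as negative bucket bounds) are left out.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(upper):
                return lower  # quantile lies beyond the highest finite bound
            # Assume observations are spread evenly inside the bucket.
            return lower + (upper - lower) * (rank - prev_count) / (count - prev_count)
        lower, prev_count = upper, count
    return lower


# 100 requests: 50 under 0.1 s, 90 under 0.5 s, 98 under 1 s.
buckets = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
print(bucket_quantile(0.95, buckets))  # about 0.8125
```

The estimate can only be as precise as the bucket layout, which is why bucket boundaries should sit close to the latencies that matter.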
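Top 5 Slowest Endpoints (p99)
topk() ranks series by value, which helps surface latency outliers in dashboards. Buckets are summed by (le, endpoint) first so the ranking is per endpoint rather than per raw series.
topk(5,
  histogram_quantile(0.99,
    sum by (le, endpoint) (
      rate(http_request_duration_seconds_bucket[5m])
    )
  )
)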
S3 Upload Latency Distribution
p99 latency of S3 put_object calls. Compare it with the bucket boundaries to choose realistic SLO thresholds.
histogram_quantile(0.99,
  sum by (le) (
    rate(s3_operation_duration_seconds_bucket
         {operation="put_object"}[5m])
  )
)
Avg Request Duration (via _sum / _count)
Mean latency from the histogram's _sum and _count series. Simpler than a quantile, but it hides tail latency.
rate(http_request_duration_seconds_sum[5m])
  /
rate(http_request_duration_seconds_count[5m])
SUMMARY — client-side quantiles (no aggregation)
p99 via Summary (single instance)
Summary quantiles are pre-computed — accurate per pod but cannot be summed across pods.
request_processing_seconds{quantile="0.99"}
Summary vs Histogram — when to use each
Summary: accurate per-instance quantiles. Histogram: aggregatable — use when you have multiple replicas.
# Summary (per-pod, cannot aggregate)
db_query_processing_seconds{quantile="0.95"}

# Histogram (aggregatable across pods)
histogram_quantile(0.95,
  sum by (le) (rate(
    postgres_query_duration_seconds_bucket[5m]))
)
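
A small numeric experiment shows why per-pod quantiles cannot be combined after the fact. This is an illustrative sketch with made-up latencies; `percentile` is a hypothetical nearest-rank helper, not the estimator a Summary uses internally.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a list of observations."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


pod_a = [0.1] * 100   # busy pod, fast responses
pod_b = [2.0] * 10    # quiet pod, slow responses

per_pod = [percentile(pod_a, 0.95), percentile(pod_b, 0.95)]
averaged = sum(per_pod) / len(per_pod)    # averaging pre-computed quantiles
pooled = percentile(pod_a + pod_b, 0.95)  # quantile over all observations

print(averaged)  # 1.05
print(pooled)    # 2.0
```

Averaging the two per-pod p95 values gives 1.05 s, while the p95 over all requests is 2.0 s. Summing histogram buckets before computing the quantile avoids this mismatch.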
GAUGE — saturation, health, staleness
Dependency Down Alert
A gauge value of 0 means unhealthy. The alert can fire immediately; add a short for: duration only if the health check is prone to flapping.
dependency_health == 0
Connection Pool Saturation
USE method — Utilisation > 80% is a saturation signal before errors occur.
db_pool_connections_active
  /
db_pool_connections_max
> 0.8
Stale Upload Detection (time since)
Alert when no upload has succeeded for more than 1 h. The gauge stores a Unix timestamp, so time() minus its value is the age in seconds.
time() - last_successful_s3_upload_timestamp_seconds
> 3600
Memory Pressure Prediction — predict_linear()
Predicts if memory usage will exceed total in the next 4 hours based on current trend.
predict_linear(
  system_memory_used_bytes[1h], 4*3600
) > system_memory_total_bytes
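
predict_linear() fits a least-squares line through the samples in the range and extrapolates it. The sketch below shows that calculation under stated assumptions: the function here extrapolates from the last sample rather than from the evaluation time, and the memory figures are invented for the example.

```python
def predict_linear(samples, seconds_ahead):
    """Least-squares line through (timestamp, value) samples, extrapolated
    `seconds_ahead` past the last sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


# Memory grows by 10 MiB per minute from 6 GiB, sampled every minute for 1 h.
MIB, GIB = 2**20, 2**30
samples = [(60 * i, 6 * GIB + 10 * i * MIB) for i in range(61)]
total_bytes = 8 * GIB

forecast = predict_linear(samples, 4 * 3600)
print(forecast > total_bytes)  # True: 6 GiB + 3000 MiB exceeds 8 GiB
```

With this trend the forecast crosses the 8 GiB total within four hours, which is the condition the alert expression checks.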
SLO / ERROR BUDGET
Error Budget Remaining
1 = full budget, 0 = exhausted, negative = overspent. SLO target = 99.9% → allowed error rate = 0.001, so the observed error ratio is divided by (1 - 0.999).
1 - (
  (
    rate(slo_errors_total{slo_name="availability"}[1h])
      /
    rate(slo_requests_total{slo_name="availability"}[1h])
  )
  / (1 - 0.999)
)
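
The 1-to-0 scale follows from comparing the observed error ratio with the ratio the SLO allows. A minimal arithmetic sketch, with `error_budget_remaining` as a hypothetical helper and the 99.9% target taken from the text above:

```python
def error_budget_remaining(error_ratio, slo_target=0.999):
    """1.0 = untouched budget, 0.0 = exhausted, negative = overspent."""
    allowed = 1.0 - slo_target          # 0.001 for a 99.9% target
    return 1.0 - error_ratio / allowed


print(round(error_budget_remaining(0.0005), 3))  # 0.5: half the budget used
print(round(error_budget_remaining(0.001), 3))   # 0.0: budget exhausted
print(round(error_budget_remaining(0.002), 3))   # -1.0: twice the budget spent
```

An error ratio equal to the allowed 0.001 uses up the whole budget; anything above it drives the value negative.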
Multi-Window Burn Rate (fast + slow)
Burn-rate alerting in the style of the Google SRE Workbook: the fast window catches sudden spikes, the slow window catches slow burns. The two conditions are joined with or so that either one can fire.
# Fast window (1h): burn rate > 14× budget
(rate(slo_errors_total[1h])
 / rate(slo_requests_total[1h]))
/ (1 - 0.999) > 14

# Slow window (6h): burn rate > 6× budget
or
(rate(slo_errors_total[6h])
 / rate(slo_requests_total[6h]))
/ (1 - 0.999) > 6
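
A burn rate is the observed error ratio expressed as a multiple of the allowed one, which also tells you how long a full budget would last. The sketch below works through that arithmetic; both helpers are hypothetical, and the 30-day SLO window is an assumption not stated on this page.

```python
def burn_rate(error_ratio, slo_target=0.999):
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo_target)


def hours_to_exhaustion(rate, slo_window_days=30):
    """Time until a full budget is gone if the burn rate stays constant.

    The 30-day window is an assumed default.
    """
    return slo_window_days * 24 / rate


rate = burn_rate(0.014)                     # 1.4% of requests failing
print(round(rate, 1))                       # 14.0
print(round(hours_to_exhaustion(rate), 1))  # 51.4
```

At 14× the budget would be gone in roughly two days, which is why the fast threshold is treated as urgent.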
Cache Hit Ratio
Fraction of cache lookups served from memory. Low hit ratio → DynamoDB pressure.
rate(cache_hits_total{cache_name="metadata"}[5m])
  / (
  rate(cache_hits_total{cache_name="metadata"}[5m])
  + rate(cache_misses_total{cache_name="metadata"}[5m])
)
Circuit Breaker Open Alert + ENUM usage
Enum emits a separate Gauge (0/1) per state. Alert when "open" state = 1.
# Alert: any circuit breaker is open
circuit_breaker_state{
  circuit_breaker_state="open"
} == 1

# Count of trips (Counter)
increase(circuit_breaker_trips_total[1h])
ADVANCED — aggregations, labels, absent
absent() — Dead-Man's-Switch Alert
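Fires when no matching series exists, for example when the target has dropped out of service discovery or the exporter has gone silent. A target that is still discovered but failing its scrape reports up == 0 instead.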
absent(up{job="backend"})
label_replace() — Transform Labels
Rename or extract label values — useful for joining metrics from different exporters.
label_replace(
  http_requests_total,
  "short_endpoint", "$1",
  "endpoint", "^/([^/]+).*"
)
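
The regex in the query above keeps only the first path segment. The sketch below mimics that behaviour on a plain dict so the capture group can be tried out. It is an approximation: Prometheus uses RE2 syntax and anchors the pattern at both ends, which `re.fullmatch` imitates here.

```python
import re


def label_replace(labels, dst, replacement, src, regex):
    """Mimic label_replace() on one label set (a plain dict).

    The regex must match the whole source value, hence fullmatch().
    If it does not match, the labels are returned unchanged.
    """
    match = re.fullmatch(regex, labels.get(src, ""))
    if match is None:
        return dict(labels)
    # Convert Prometheus-style $1 references to Python's \g<1> form.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    return {**labels, dst: match.expand(template)}


series = {"endpoint": "/api/metrics", "status_code": "200"}
result = label_replace(series, "short_endpoint", "$1", "endpoint", "^/([^/]+).*")
print(result["short_endpoint"])  # api
```

A value without a leading slash would not match, so no short_endpoint label would be added.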
Info Metric Join (group_left)
Join app metadata labels onto a metric — fleet dashboards use this to filter by version.
rate(http_requests_total[5m])
  * on(instance) group_left(version)
app_build_info
Request Rate Breakdown by Status Code
sum by () shows how traffic splits across status codes, which makes a rise in error responses easy to spot.
sum by (status_code) (
  rate(http_requests_total[5m])
)