S3 · DynamoDB · PostgreSQL · Istio · Prometheus
Live status of all external dependencies
Utilisation, active vs max connections
db_pool_connections_active / db_pool_connections_max
CLOSED = healthy · OPEN = blocking calls (pulsing) · HALF_OPEN = recovery probe
| Service | State | Failures | Action |
|---|---|---|---|
| Loading… | | | |
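The three states listed above can be sketched as a small state machine. This is a minimal illustration, not the app's implementation: the class name, `failure_threshold`, `reset_timeout`, and the injectable `clock` are all hypothetical.

```python
import time
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # healthy: calls pass through
    OPEN = "open"            # calls are blocked until the cool-down elapses
    HALF_OPEN = "half_open"  # recovery probe: trial calls are let through


class CircuitBreaker:
    """Hypothetical breaker; thresholds are illustrative assumptions."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = State.CLOSED
        self.failures = 0
        self.opened_at = 0.0

    def allow_request(self):
        # Once the cool-down has elapsed, move to HALF_OPEN so a probe can run.
        if self.state is State.OPEN and self.clock() - self.opened_at >= self.reset_timeout:
            self.state = State.HALF_OPEN
        return self.state is not State.OPEN

    def record_success(self):
        # A successful call (or probe) closes the breaker and clears the count.
        self.state = State.CLOSED
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed probe, or too many failures while CLOSED, opens the breaker.
        if self.state is State.HALF_OPEN or self.failures >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self.clock()

    def reset(self):
        # Manual reset, comparable to the action offered in the table above.
        self.record_success()
```

Injecting the clock keeps the cool-down logic testable without sleeping.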
LRU metadata cache — hit rate drives performance
Stores file in S3 and metadata in DynamoDB
List all objects whose key starts with the prefix
List, download, or delete objects. Select rows for batch delete.
SELECT NOW() + server version from RDS
postgres_query_duration_seconds_bucket
INSERT into the events table — triggers write latency metrics
Scan by file_name with cache lookup first
Liveness (/health) and readiness (/ready) probes
LRU in-process metadata cache stats and controls
rate(cache_hits_total[5m]) / (rate(cache_hits_total[5m]) + rate(cache_misses_total[5m]))
Generate synthetic Prometheus metrics to practise PromQL
Real-time state + manual reset. In PromQL: circuit_breaker_state{circuit_breaker_state="open"} == 1
| Service | State | Failures | Reset |
|---|---|---|---|
| Loading… | |||

Two endpoints — Istio mesh (port 5000) and dedicated scrape server (port 9100)
/api/metrics
:9100/metrics
istio-frontend-deployment-version2-6b475cb8c6-9jjvv
All queries target metrics exported by this app. Click Copy to paste into Grafana/Prometheus.
sum by (endpoint) ( rate(http_requests_total[5m]) )
sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
increase(file_uploads_total{status="success"}[1h])
irate(http_requests_total[5m])
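To see why `rate()` and `irate()` can disagree over the same 5m window, here is a simplified sketch. It assumes a counter with no resets and ignores the extrapolation to window boundaries that Prometheus's `rate()` performs; the sample values are made up.

```python
def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def simple_irate(samples):
    """Per-second increase between the last two samples only."""
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    return (v1 - v0) / (t1 - t0)


# Hypothetical (timestamp_seconds, counter_value) samples with a late burst.
window = [(0, 100), (60, 160), (120, 220), (180, 280), (240, 340), (300, 1000)]

print(simple_rate(window))   # smoothed over the whole window
print(simple_irate(window))  # reacts to the final burst
```

The smoothed value tends to suit alerting, while the instantaneous one suits zoomed-in graphs of fast-moving counters.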
histogram_quantile(0.95,
sum by (le, endpoint) (
rate(http_request_duration_seconds_bucket[5m])
)
)
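The estimate that `histogram_quantile` produces can be approximated as linear interpolation inside the bucket that contains the target rank. The sketch below is a simplified model for classic (non-native) histograms with `0 < q <= 1`; the bucket bounds and counts are invented for illustration.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper_bound
    and ending with the +Inf bucket. Assumes 0 < q <= 1.
    """
    if len(buckets) < 2 or buckets[-1][1] == 0:
        return math.nan
    rank = q * buckets[-1][1]
    # Find the first bucket whose cumulative count reaches the rank.
    for i, (upper, cumulative) in enumerate(buckets):
        if cumulative >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound, so report that bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Assume observations are spread evenly within the bucket.
    return lower + (upper - lower) * (rank - below) / (cumulative - below)


# Hypothetical per-second bucket rates for one endpoint.
buckets = [(0.1, 10), (0.5, 60), (1.0, 90), (math.inf, 100)]
print(histogram_quantile(0.5, buckets))
print(histogram_quantile(0.95, buckets))
```

Because the result is interpolated, its accuracy depends on how closely the bucket boundaries bracket the latencies of interest.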
topk(5,
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket[5m])
)
)
histogram_quantile(0.99,
rate(s3_operation_duration_seconds_bucket
{operation="put_object"}[5m])
)
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
request_processing_seconds{quantile="0.99"}
# Summary (per-pod, cannot aggregate)
db_query_processing_seconds{quantile="0.95"}
# Histogram (aggregatable across pods)
histogram_quantile(0.95,
sum by (le) (rate(
postgres_query_duration_seconds_bucket[5m]))
)
dependency_health == 0
db_pool_connections_active / db_pool_connections_max > 0.8
time() - last_successful_s3_upload_timestamp_seconds > 3600
predict_linear( system_memory_used_bytes[1h], 4*3600 ) > system_memory_total_bytes
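The idea behind `predict_linear` is a least-squares line fitted to the samples in the range and extended into the future. The following is a sketch of that arithmetic rather than Prometheus's own code; the function signature and the sample data are assumptions.

```python
def predict_linear(samples, now, horizon):
    """Fit a least-squares line to (timestamp, value) samples and
    return its value at now + horizon seconds.

    Needs at least two samples with distinct timestamps.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    return mean_v + slope * (now + horizon - mean_t)


# Hypothetical memory usage growing by 2 bytes per second over one hour.
samples = [(t, 1000 + 2 * t) for t in range(0, 3601, 60)]
print(predict_linear(samples, now=3600, horizon=4 * 3600))
```

In the alert above, the predicted value four hours ahead is compared with `system_memory_total_bytes`.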
1 - (
rate(slo_errors_total{slo_name="availability"}[1h])
/
rate(slo_requests_total{slo_name="availability"}[1h])
)
# Fast window (1h) burn rate > 14× budget
(rate(slo_errors_total[1h]) / rate(slo_requests_total[1h])) / (1 - 0.999) > 14
# Slow window (6h) burn rate > 6×
AND
(rate(slo_errors_total[6h]) / rate(slo_requests_total[6h])) / (1 - 0.999) > 6
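The burn-rate arithmetic in this alert can be checked by hand. This sketch assumes the 99.9% target and the 14× and 6× thresholds from the query; the function names and the example error ratios are hypothetical.

```python
def burn_rate(error_ratio, slo_target=0.999):
    """How many times faster than budgeted the error budget is being spent."""
    return error_ratio / (1.0 - slo_target)


def should_page(fast_error_ratio, slow_error_ratio, slo_target=0.999):
    """Fire only when both the 1h and the 6h windows exceed their thresholds."""
    return (burn_rate(fast_error_ratio, slo_target) > 14
            and burn_rate(slow_error_ratio, slo_target) > 6)


# Hypothetical error ratios: 2% over the last hour, 1% over the last six hours.
print(burn_rate(0.02))          # roughly 20x the budgeted rate
print(should_page(0.02, 0.01))  # both windows are above their thresholds
```

Requiring both windows means a short spike that has already subsided does not page, while a sustained burn does.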
rate(cache_hits_total{cache_name="metadata"}[5m])
/ (
rate(cache_hits_total{cache_name="metadata"}[5m])
+ rate(cache_misses_total{cache_name="metadata"}[5m])
)
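The same ratio, written out as plain arithmetic. This is a sketch with invented rates; returning NaN when there is no traffic mirrors what a 0/0 division yields in PromQL.

```python
import math


def hit_ratio(hit_rate, miss_rate):
    """Fraction of lookups served from the cache, or NaN with no traffic."""
    total = hit_rate + miss_rate
    return hit_rate / total if total > 0 else math.nan


# Hypothetical per-second rates: 90 hits and 10 misses.
print(hit_ratio(90, 10))
```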
# Alert: any circuit breaker is open
circuit_breaker_state{
circuit_breaker_state="open"
} == 1
# Count of trips (Counter)
increase(circuit_breaker_trips_total[1h])
absent(up{job="backend"})
label_replace( http_requests_total, "short_endpoint", "$1", "endpoint", "^/([^/]+).*" )
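To preview what the regex in this `label_replace` call extracts, the match can be reproduced with Python's `re` module. This is only an approximation: Prometheus uses RE2 syntax and anchors the pattern to the whole label value, which `fullmatch` imitates here. The helper name and the sample paths are hypothetical.

```python
import re

# Same pattern as in the query; "$1" there corresponds to group(1) here.
PATTERN = re.compile(r"^/([^/]+).*")


def short_endpoint(endpoint):
    """Return the first path segment, or None when the pattern does not match."""
    match = PATTERN.fullmatch(endpoint)
    return match.group(1) if match else None


for path in ["/api/files/123", "/health", "/"]:
    print(path, "->", short_endpoint(path))
```

When the pattern does not match, as with a bare `/`, `label_replace` returns the series unchanged, so `short_endpoint` is not set.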
rate(http_requests_total[5m]) * on(instance) group_left(version) app_build_info
sum by (status_code) ( rate(http_requests_total[5m]) )