Metrics

Aastro uses OpenTelemetry for instrumentation. Metrics can be exported via two backends:

  • Prometheus — OTel Prometheus exporter exposes a /metrics endpoint for scraping
  • OTLP — pushes metrics to any OpenTelemetry-compatible backend (OTel Collector, Grafana, Datadog, etc.)
Example configuration:

gateway:
  observability:
    metrics:
      enabled: true
      exporter: prometheus # or: otlp
      otlp:
        endpoint: otel-collector:4318
        insecure: true
        interval: 10s

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| metrics.enabled | bool | false | Enable metrics instrumentation |
| metrics.exporter | string | | prometheus or otlp |
| metrics.otlp.endpoint | string | | OTLP HTTP endpoint to push metrics to |
| metrics.otlp.insecure | bool | false | Disable TLS for the OTLP connection |
| metrics.otlp.interval | duration | 60s | How often to push metrics to the OTLP endpoint |
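
For the push-based path, the same block might look like the sketch below. It uses only the fields documented in the table above. The collector hostname is a placeholder, and the nesting under gateway.observability is assumed from the example configuration rather than confirmed.

```yaml
gateway:
  observability:
    metrics:
      enabled: true
      exporter: otlp
      otlp:
        # Placeholder host; 4318 is the conventional OTLP/HTTP port.
        endpoint: otel-collector.example.internal:4318
        # Documented default: leave TLS enabled for the OTLP connection.
        insecure: false
        # Documented default push interval.
        interval: 60s
```

A shorter interval, such as the 10s used in the example configuration, gives fresher dashboards at the cost of more frequent pushes.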
Info: When using exporter: prometheus, the /metrics endpoint is served on the admin port (server.admin_port), not the data port. Prometheus can therefore scrape Aastro over plain HTTP without a client certificate, even when the data port enforces mTLS. The admin port binds to 127.0.0.1 by default; see the Server configuration for details on exposing it to an external scraper.

When using exporter: otlp, no HTTP endpoint is exposed; metrics are pushed on the configured interval.
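
On the Prometheus side, a scrape job for the admin port could look roughly like this. It is a minimal sketch: the hostname and port 9901 are hypothetical stand-ins for wherever server.admin_port is reachable, and the job name and scrape interval are arbitrary choices.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: aastro
    # The admin port serves plain HTTP, so no client certificate is configured.
    scheme: http
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          # Hypothetical address: substitute the host and the value of server.admin_port.
          - "aastro.example.internal:9901"
```

Because the admin port binds to 127.0.0.1 by default, a target like this only works from another host once the port has been exposed as described in the Server configuration.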

Available Metrics


| Metric | Type | Labels | Description |
| --- | --- | --- | --- |
| aastro_requests_total | Counter | route, method, status | Total incoming requests that reached a flow, labeled by final HTTP status |
| aastro_requests_duration_seconds | Histogram | route, method | End-to-end request latency from gateway entry to response write |
| aastro_requests_in_flight | Gauge | | Current number of requests being processed |
| aastro_failed_requests_total | Counter | reason | Requests rejected before reaching a flow (see reasons below) |
| aastro_upstream_requests_total | Counter | route, upstream | Total requests dispatched to each upstream |
| aastro_upstream_errors_total | Counter | route, upstream, kind | Upstream errors broken down by error kind |
| aastro_upstream_latency_seconds | Histogram | route, upstream | Time from upstream request dispatch to response received |
| aastro_upstream_retries_total | Counter | route, upstream | Number of retry attempts per upstream |
| aastro_circuit_breaker_state | Gauge | upstream | Circuit breaker state: 0=closed, 1=open, 2=half-open |

Histogram Buckets


Aastro uses fixed bucket boundaries tuned for typical gateway latencies:

| Metric | Boundaries (seconds) |
| --- | --- |
| aastro_requests_duration_seconds | 0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5 |
| aastro_upstream_latency_seconds | 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5 |
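
Fixed boundaries have two practical consequences in Prometheus. First, histogram_quantile() interpolates linearly inside a bucket and cannot report a value above the largest finite boundary (2.5 s for request duration, 5 s for upstream latency). Second, a latency target that coincides with a boundary can be measured exactly from the bucket counters. The recording rule below sketches the second point for a 250 ms target. The group and rule names are hypothetical, and the le value must match the label exactly as your exporter renders it.

```yaml
# Prometheus rule file (fragment)
groups:
  - name: aastro_latency
    rules:
      # Fraction of requests that completed within 250 ms over the last 5 minutes.
      # le="0.25" is one of the documented boundaries for this histogram.
      - record: aastro:requests_duration_seconds:ratio_le_250ms
        expr: |
          sum(rate(aastro_requests_duration_seconds_bucket{le="0.25"}[5m]))
          /
          sum(rate(aastro_requests_duration_seconds_count[5m]))
```

A target that falls between two boundaries, for example 300 ms, can only be estimated, so it is usually simpler to pick targets from the boundary list.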

Upstream Error Kinds


The kind label on aastro_upstream_errors_total reflects the internal error classification:

| Kind | Description |
| --- | --- |
| timeout | Upstream did not respond within the configured timeout |
| connection | Failed to establish a connection to the upstream (includes TLS handshake failures) |
| bad_status | Upstream returned HTTP 5xx |
| read_error | Connection was closed while reading the response body |
| body_too_large | Response body exceeded max_response_body_size |
| canceled | Request was canceled by the client before a response was received |
| circuit_open | Request was rejected by an open circuit breaker; the upstream was not contacted |
| policy_violation | Response violated upstream policy (allowed_statuses, require_body) |
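
The kind label makes it possible to alert on specific failure modes rather than on upstream errors as a whole. The rules below are a sketch of that idea: the alert names, thresholds, durations, and severity labels are all illustrative assumptions and should be tuned to your own traffic.

```yaml
# Prometheus rule file (fragment)
groups:
  - name: aastro_upstreams
    rules:
      # Fires when a breaker has stayed open (state 1) for a full minute.
      - alert: AastroCircuitBreakerOpen
        expr: aastro_circuit_breaker_state == 1
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker open for upstream {{ $labels.upstream }}"

      # Fires on a sustained timeout rate for one route and upstream pair.
      # The threshold of 1 timeout per second is an arbitrary placeholder.
      - alert: AastroUpstreamTimeouts
        expr: |
          sum by (route, upstream) (
            rate(aastro_upstream_errors_total{kind="timeout"}[5m])
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Timeouts on {{ $labels.route }} to {{ $labels.upstream }}"
```

Since circuit_open errors mean the upstream was never contacted, it can be useful to exclude that kind with kind!="circuit_open" when judging upstream health.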

Failure Reasons


aastro_failed_requests_total tracks requests that never reach a flow:

| Reason | Description |
| --- | --- |
| too_many_requests | Rate limiter rejected the request |
| no_matched_flow | No flow matched the request path or method |
| body_too_large | Request body exceeded the gateway-wide limit (5 MB) |

Grafana


When using exporter: otlp, the recommended setup is:

Aastro → [OTLP HTTP] → OTel Collector → [remote_write] → Prometheus ← Grafana

The OTel Collector receives metrics from Aastro, transforms them, and pushes them to Prometheus via remote_write. Prometheus must be started with --web.enable-remote-write-receiver.
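
A Collector configuration for this pipeline might look like the sketch below. It assumes a Collector distribution that bundles the prometheusremotewrite exporter (the contrib distribution does), and the Prometheus address is a placeholder for your own deployment.

```yaml
# OTel Collector config (fragment)
receivers:
  otlp:
    protocols:
      http:
        # Listen address for the OTLP/HTTP push; it must be reachable at metrics.otlp.endpoint.
        endpoint: 0.0.0.0:4318

processors:
  # Groups data points before export to reduce the number of outgoing requests.
  batch: {}

exporters:
  prometheusremotewrite:
    # Placeholder address; /api/v1/write is the Prometheus remote-write receiver path.
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite]
```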

When using exporter: prometheus, Prometheus scrapes Aastro directly, with no Collector needed. Point the scrape target at the admin port (server.admin_port).


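| Panel | Query |
| --- | --- |
| RPS | rate(aastro_requests_total[1m]) |
| p99 latency | histogram_quantile(0.99, rate(aastro_requests_duration_seconds_bucket[5m])) |
| Error rate | sum(rate(aastro_requests_total{status=~"5.."}[1m])) / sum(rate(aastro_requests_total[1m])) |
| Upstream error rate | rate(aastro_upstream_errors_total[1m]) |
| Circuit breaker open | aastro_circuit_breaker_state == 1 |
| Retry pressure | rate(aastro_upstream_retries_total[5m]) |
| In-flight requests | aastro_requests_in_flight |
| Upstream p95 latency | histogram_quantile(0.95, rate(aastro_upstream_latency_seconds_bucket[5m])) |

The error rate query aggregates both sides with sum(). Without it, PromQL matches series label by label (including status), so each 5xx series would be divided only by itself and the result would always be 1.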
Info: Counter metrics such as aastro_requests_total only increase while the process is running and start again from zero after a restart. Always use rate() or increase() in Grafana queries rather than the raw counter value; both functions account for counter resets.
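
The same advice applies to alerting rules. As a sketch, the rule below alerts on the share of 5xx responses computed from rate(), not from raw counter values. The alert name, the 5% threshold, and the 10 minute hold time are illustrative assumptions.

```yaml
# Prometheus rule file (fragment)
groups:
  - name: aastro_requests
    rules:
      # Share of requests answered with a 5xx status over the last 5 minutes.
      # Both sides are aggregated with sum() so the division is not matched per status label.
      - alert: AastroHighErrorRatio
        expr: |
          sum(rate(aastro_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(aastro_requests_total[5m]))
          > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "More than 5% of Aastro requests are failing with 5xx"
```

When there is no traffic at all, the division produces no usable value and the alert stays silent, which is usually the desired behavior for a ratio alert.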