tempo

Overview

Jsonnet 源码地址:github.com/grafana/tempo

Alerts

告警Alerts配置列表 源文件.

tempo_alerts

TempoRequestLatency

alert: TempoRequestLatency
annotations:
  message: |
    {{ $labels.job }} {{ $labels.route }} is experiencing {{ printf "%.2f" $value }}s 99th percentile latency.    
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoRequestLatency
expr: |
  cluster_namespace_job_route:tempo_request_duration_seconds:99quantile{route!~"metrics|/frontend.Frontend/Process|debug_pprof"} > 3  
for: 15m
labels:
  severity: critical
TempoCompactorUnhealthy

alert: TempoCompactorUnhealthy
annotations:
  message: There are {{ printf "%f" $value }} unhealthy compactor(s).
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorUnhealthy
expr: |
  max by (cluster, namespace) (tempo_ring_members{state="Unhealthy", name="compactor", namespace=~".*"}) > 0  
for: 15m
labels:
  severity: critical
TempoDistributorUnhealthy

alert: TempoDistributorUnhealthy
annotations:
  message: There are {{ printf "%f" $value }} unhealthy distributor(s).
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoDistributorUnhealthy
expr: |
  max by (cluster, namespace) (tempo_ring_members{state="Unhealthy", name="distributor", namespace=~".*"}) > 0  
for: 15m
labels:
  severity: warning
TempoCompactionsFailing

alert: TempoCompactionsFailing
annotations:
  message: Greater than 2 compactions have failed in the past hour.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactionsFailing
expr: |
  sum by (cluster, namespace) (increase(tempodb_compaction_errors_total{}[1h])) > 2 and
  sum by (cluster, namespace) (increase(tempodb_compaction_errors_total{}[5m])) > 0  
for: 5m
labels:
  severity: critical
TempoIngesterFlushesUnhealthy

alert: TempoIngesterFlushesUnhealthy
annotations:
  message: Greater than 2 flush retries have occurred in the past hour.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing
expr: |
  sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[1h])) > 2 and
  sum by (cluster, namespace) (increase(tempo_ingester_failed_flushes_total{}[5m])) > 0  
for: 5m
labels:
  severity: warning
TempoIngesterFlushesFailing

alert: TempoIngesterFlushesFailing
annotations:
  message: Greater than 2 flush retries have failed in the past hour.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterFlushesFailing
expr: |
  sum by (cluster, namespace) (increase(tempo_ingester_flush_failed_retries_total{}[1h])) > 2 and
  sum by (cluster, namespace) (increase(tempo_ingester_flush_failed_retries_total{}[5m])) > 0  
for: 5m
labels:
  severity: critical
TempoPollsFailing

alert: TempoPollsFailing
annotations:
  message: Greater than 2 polls have failed in the past hour.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoPollsFailing
expr: |
  sum by (cluster, namespace) (increase(tempodb_blocklist_poll_errors_total{}[1h])) > 2 and
  sum by (cluster, namespace) (increase(tempodb_blocklist_poll_errors_total{}[5m])) > 0  
labels:
  severity: critical
TempoTenantIndexFailures

alert: TempoTenantIndexFailures
annotations:
  message: Greater than 2 tenant index failures in the past hour.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoTenantIndexFailures
expr: |
  sum by (cluster, namespace) (increase(tempodb_blocklist_tenant_index_errors_total{}[1h])) > 2 and
  sum by (cluster, namespace) (increase(tempodb_blocklist_tenant_index_errors_total{}[5m])) > 0  
labels:
  severity: critical
TempoNoTenantIndexBuilders

alert: TempoNoTenantIndexBuilders
annotations:
  message: No tenant index builders for tenant {{ $labels.tenant }}. Tenant index will quickly become stale.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoNoTenantIndexBuilders
expr: |
  sum by (cluster, namespace, tenant) (tempodb_blocklist_tenant_index_builder{}) == 0 and
  max by (cluster, namespace) (tempodb_blocklist_length{}) > 0  
for: 5m
labels:
  severity: critical
TempoTenantIndexTooOld

alert: TempoTenantIndexTooOld
annotations:
  message: Tenant index age is 600 seconds old for tenant {{ $labels.tenant }}.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoTenantIndexTooOld
expr: |
  max by (cluster, namespace, tenant) (tempodb_blocklist_tenant_index_age_seconds{}) > 600  
for: 5m
labels:
  severity: critical
TempoBlockListRisingQuickly

alert: TempoBlockListRisingQuickly
annotations:
  message: Tempo block list length is up 40 percent over the last 7 days.  Consider scaling compactors.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoBlockListRisingQuickly
expr: |
  avg(tempodb_blocklist_length{namespace=".*"}) / avg(tempodb_blocklist_length{namespace=".*", job=~"$namespace/$component"} offset 7d) > 1.4  
for: 15m
labels:
  severity: critical
TempoBadOverrides

alert: TempoBadOverrides
annotations:
  message: '{{ $labels.job }} failed to reload overrides.'
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoBadOverrides
expr: |
  sum(tempo_runtime_config_last_reload_successful{namespace=~".*"} == 0) by (cluster, namespace, job)  
for: 15m
labels:
  severity: warning
TempoUserConfigurableOverridesReloadFailing

alert: TempoUserConfigurableOverridesReloadFailing
annotations:
  message: Greater than 5 user-configurable overides reloads failed in the past hour.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoTenantIndexFailures
expr: |
  sum by (cluster, namespace) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total{}[1h])) > 5 and
  sum by (cluster, namespace) (increase(tempo_overrides_user_configurable_overrides_reload_failed_total{}[5m])) > 0  
labels:
  severity: critical
TempoProvisioningTooManyWrites

alert: TempoProvisioningTooManyWrites
annotations:
  message: Ingesters in {{ $labels.cluster }}/{{ $labels.namespace }} are receiving more data/second than desired, add more ingesters.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoProvisioningTooManyWrites
expr: |
  avg by (cluster, namespace) (rate(tempo_ingester_bytes_received_total{job=~".+/ingester"}[1m])) / 1024 / 1024 > 30  
for: 15m
labels:
  severity: warning
TempoCompactorsTooManyOutstandingBlocks

alert: TempoCompactorsTooManyOutstandingBlocks
annotations:
  message: There are too many outstanding compaction blocks in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}, increase compactor's CPU or add more compactors.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorsTooManyOutstandingBlocks
expr: |
  sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor", namespace=~".*"}) / ignoring(tenant) group_left count(tempo_build_info{container="compactor", namespace=~".*"}) by (cluster, namespace) > 100  
for: 6h
labels:
  severity: warning
TempoCompactorsTooManyOutstandingBlocks

alert: TempoCompactorsTooManyOutstandingBlocks
annotations:
  message: There are too many outstanding compaction blocks in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}, increase compactor's CPU or add more compactors.
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoCompactorsTooManyOutstandingBlocks
expr: |
  sum by (cluster, namespace, tenant) (tempodb_compaction_outstanding_blocks{container="compactor", namespace=~".*"}) / ignoring(tenant) group_left count(tempo_build_info{container="compactor", namespace=~".*"}) by (cluster, namespace) > 250  
for: 24h
labels:
  severity: critical
TempoIngesterReplayErrors

alert: TempoIngesterReplayErrors
annotations:
  message: Tempo ingester has encountered errors while replaying a block on startup in {{ $labels.cluster }}/{{ $labels.namespace }} for tenant {{ $labels.tenant }}
  runbook_url: https://github.com/grafana/tempo/tree/main/operations/tempo-mixin/runbook.md#TempoIngesterReplayErrors
expr: |
  sum by (cluster, namespace, tenant) (increase(tempo_ingester_replay_errors_total{namespace=~".*"}[5m])) > 0  
for: 5m
labels:
  severity: critical

Recording Rules

指标计算Recording规则配置列表 源文件.

tempo_rules

cluster_namespace_job_route:tempo_request_duration_seconds:99quantile

expr: histogram_quantile(0.99, sum(rate(tempo_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route))
record: cluster_namespace_job_route:tempo_request_duration_seconds:99quantile
cluster_namespace_job_route:tempo_request_duration_seconds:50quantile

expr: histogram_quantile(0.50, sum(rate(tempo_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route))
record: cluster_namespace_job_route:tempo_request_duration_seconds:50quantile
cluster_namespace_job_route:tempo_request_duration_seconds:avg

expr: sum(rate(tempo_request_duration_seconds_sum[1m])) by (cluster, namespace, job, route) / sum(rate(tempo_request_duration_seconds_count[1m])) by (cluster, namespace, job, route)
record: cluster_namespace_job_route:tempo_request_duration_seconds:avg
cluster_namespace_job_route:tempo_request_duration_seconds_bucket:sum_rate

expr: sum(rate(tempo_request_duration_seconds_bucket[1m])) by (le, cluster, namespace, job, route)
record: cluster_namespace_job_route:tempo_request_duration_seconds_bucket:sum_rate
cluster_namespace_job_route:tempo_request_duration_seconds_sum:sum_rate

expr: sum(rate(tempo_request_duration_seconds_sum[1m])) by (cluster, namespace, job, route)
record: cluster_namespace_job_route:tempo_request_duration_seconds_sum:sum_rate
cluster_namespace_job_route:tempo_request_duration_seconds_count:sum_rate

expr: sum(rate(tempo_request_duration_seconds_count[1m])) by (cluster, namespace, job, route)
record: cluster_namespace_job_route:tempo_request_duration_seconds_count:sum_rate

Dashboards

仪表盘配置文件下载地址: