grafana-agent

Overview

Jsonnet source code: github.com/grafana/agent.git

Alerts

List of alert rule configurations (source file).

clustering

ClusterNotConverging

alert: ClusterNotConverging
annotations:
  message: 'Cluster is not converging: nodes report different number of peers in the cluster.'
expr: stddev by (cluster, namespace) (sum without (state) (cluster_node_peers)) != 0
for: 10m

ClusterNodeCountMismatch

alert: ClusterNodeCountMismatch
annotations:
  message: Nodes report different number of peers vs. the count of observed agent metrics. Some agent metrics may be missing or the cluster is in a split brain state.
expr: |
  sum without (state) (cluster_node_peers) !=
  on (cluster, namespace) group_left
  count by (cluster, namespace) (cluster_node_info)
for: 15m

ClusterNodeUnhealthy

alert: ClusterNodeUnhealthy
annotations:
  message: Cluster node is reporting a gossip protocol health score > 0.
expr: |
  cluster_node_gossip_health_score > 0
for: 10m

ClusterNodeNameConflict

alert: ClusterNodeNameConflict
annotations:
  message: A node tried to join the cluster with a name conflicting with an existing peer.
expr: sum by (cluster, namespace) (rate(cluster_node_gossip_received_events_total{event="node_conflict"}[2m])) > 0
for: 10m

ClusterNodeStuckTerminating

alert: ClusterNodeStuckTerminating
annotations:
  message: Cluster node stuck in Terminating state.
expr: sum by (cluster, namespace, instance) (cluster_node_peers{state="terminating"}) > 0
for: 10m

ClusterConfigurationDrift

alert: ClusterConfigurationDrift
annotations:
  message: Cluster nodes are not using the same configuration file.
expr: |
  count without (sha256) (
      max by (cluster, namespace, sha256) (agent_config_hash and on(cluster, namespace) cluster_node_info)
  ) > 1
for: 5m
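
To see how the ClusterConfigurationDrift expression behaves, the rule can be exercised with a promtool unit test. The following is a minimal sketch, assuming the clustering rules above are saved in Prometheus rule-group format in a hypothetical file named clustering_alerts.yaml. The label values (c1, ns, agent-0, agent-1, aaa, bbb) are made up for illustration.

```yaml
# Hypothetical unit test file; run with: promtool test rules <this file>
rule_files:
  - clustering_alerts.yaml   # assumed file holding the clustering rules as a rule group

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Two agents in the same cluster report different config hashes.
      - series: 'agent_config_hash{cluster="c1", namespace="ns", instance="agent-0", sha256="aaa"}'
        values: '1+0x15'
      - series: 'agent_config_hash{cluster="c1", namespace="ns", instance="agent-1", sha256="bbb"}'
        values: '1+0x15'
      # A clustered node must exist for the "and on(cluster, namespace)" filter to match.
      - series: 'cluster_node_info{cluster="c1", namespace="ns", instance="agent-0"}'
        values: '1+0x15'
    alert_rule_test:
      # Two distinct hashes give a count of 2; the 5m "for" window has elapsed by 10m.
      - eval_time: 10m
        alertname: ClusterConfigurationDrift
        exp_alerts:
          - exp_labels:
              cluster: c1
              namespace: ns
            exp_annotations:
              message: Cluster nodes are not using the same configuration file.
```

If both agents reported the same sha256 value, the inner max would yield a single series, the count would be 1, and no alert would be expected.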

agent_controller

SlowComponentEvaluations

alert: SlowComponentEvaluations
annotations:
  message: Flow component evaluations are taking too long.
expr: sum by (cluster, namespace, component_path, component_id) (rate(agent_component_evaluation_slow_seconds[10m])) > 0
for: 15m

UnhealthyComponents

alert: UnhealthyComponents
annotations:
  message: Unhealthy Flow components detected.
expr: sum by (cluster, namespace) (agent_component_controller_running_components{health_type!="healthy"}) > 0
for: 15m
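
The rule bodies on this page are listed without their enclosing group. A minimal sketch of how the two agent_controller rules could be placed in a Prometheus rule file follows, assuming the section heading doubles as the group name. The file name is hypothetical.

```yaml
# agent_controller_alerts.yaml (hypothetical file name)
groups:
  - name: agent_controller   # assumed to match the section heading
    rules:
      - alert: SlowComponentEvaluations
        annotations:
          message: Flow component evaluations are taking too long.
        expr: sum by (cluster, namespace, component_path, component_id) (rate(agent_component_evaluation_slow_seconds[10m])) > 0
        for: 15m
      - alert: UnhealthyComponents
        annotations:
          message: Unhealthy Flow components detected.
        expr: sum by (cluster, namespace) (agent_component_controller_running_components{health_type!="healthy"}) > 0
        for: 15m
```

A file in this shape can be syntax-checked with promtool check rules before it is loaded.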

otelcol

OtelcolReceiverRefusedSpans

alert: OtelcolReceiverRefusedSpans
annotations:
  message: The receiver could not push some spans to the pipeline.
expr: sum(rate(receiver_refused_spans_ratio_total{}[1m])) > 0
for: 5m

OtelcolExporterFailedSpans

alert: OtelcolExporterFailedSpans
annotations:
  message: The exporter failed to send spans to their destination.
expr: sum(rate(exporter_send_failed_spans_ratio_total{}[1m])) > 0
for: 5m
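
Both otelcol expressions use a bare sum(), so the resulting alert carries no labels and cannot indicate which cluster is affected. As a hypothetical variant (not part of the source rules), and assuming the span metrics carry the same cluster and namespace labels used by the other alerts on this page, the aggregation could keep those labels:

```yaml
# Hypothetical variant; assumes cluster and namespace labels exist on the metric.
alert: OtelcolReceiverRefusedSpans
annotations:
  message: The receiver could not push some spans to the pipeline.
expr: sum by (cluster, namespace) (rate(receiver_refused_spans_ratio_total{}[1m])) > 0
for: 5m
```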

Dashboards

Dashboard configuration files are available for download at: