grafana-agent
Overview
Jsonnet 源码地址:github.com/grafana/agent.git
Alerts
告警Alerts配置列表 源文件.
clustering
ClusterNotConverging
alert: ClusterNotConverging
annotations:
message: 'Cluster is not converging: nodes report different number of peers in the cluster.'
expr: stddev by (cluster, namespace) (sum without (state) (cluster_node_peers)) != 0
for: 10m
ClusterNodeCountMismatch
alert: ClusterNodeCountMismatch
annotations:
message: Nodes report different number of peers vs. the count of observed agent metrics. Some agent metrics may be missing or the cluster is in a split brain state.
expr: |
sum without (state) (cluster_node_peers) !=
on (cluster, namespace) group_left
count by (cluster, namespace) (cluster_node_info)
for: 15m
ClusterNodeUnhealthy
alert: ClusterNodeUnhealthy
annotations:
message: Cluster node is reporting a gossip protocol health score > 0.
expr: |
cluster_node_gossip_health_score > 0
for: 10m
ClusterNodeNameConflict
alert: ClusterNodeNameConflict
annotations:
message: A node tried to join the cluster with a name conflicting with an existing peer.
expr: sum by (cluster, namespace) (rate(cluster_node_gossip_received_events_total{event="node_conflict"}[2m])) > 0
for: 10m
ClusterNodeStuckTerminating
alert: ClusterNodeStuckTerminating
annotations:
message: Cluster node stuck in Terminating state.
expr: sum by (cluster, namespace, instance) (cluster_node_peers{state="terminating"}) > 0
for: 10m
ClusterConfigurationDrift
alert: ClusterConfigurationDrift
annotations:
message: Cluster nodes are not using the same configuration file.
expr: |
count without (sha256) (
max by (cluster, namespace, sha256) (agent_config_hash and on(cluster, namespace) cluster_node_info)
) > 1
for: 5m
agent_controller
SlowComponentEvaluations
alert: SlowComponentEvaluations
annotations:
message: Flow component evaluations are taking too long.
expr: sum by (cluster, namespace, component_path, component_id) (rate(agent_component_evaluation_slow_seconds[10m])) > 0
for: 15m
UnhealthyComponents
alert: UnhealthyComponents
annotations:
message: Unhealthy Flow components detected.
expr: sum by (cluster, namespace) (agent_component_controller_running_components{health_type!="healthy"}) > 0
for: 15m
otelcol
OtelcolReceiverRefusedSpans
alert: OtelcolReceiverRefusedSpans
annotations:
message: The receiver could not push some spans to the pipeline.
expr: sum(rate(receiver_refused_spans_ratio_total{}[1m])) > 0
for: 5m
OtelcolExporterFailedSpans
alert: OtelcolExporterFailedSpans
annotations:
message: The exporter failed to send spans to their destination.
expr: sum(rate(exporter_send_failed_spans_ratio_total{}[1m])) > 0
for: 5m
Dashboards
仪表盘配置文件下载地址: