k8s-prometheus

芒果牛奶 2021-02-23 16:03:57
Technical Development  Prometheus  SegmentFault  k8s-prometheus


Prometheus

Based on Kubernetes (k8s)

Collecting data

node-exporter

vi node-exporter-ds.yml

apiVersion: extensions/v1beta1
kind: DaemonSet
metadata:
  name: node-exporter
  labels:
    app: node-exporter
spec:
  template:
    metadata:
      labels:
        app: node-exporter
    spec:
      hostNetwork: true
      containers:
      - image: prom/node-exporter
        name: node-exporter
        ports:
        - containerPort: 9100
        volumeMounts:
        - mountPath: "/etc/localtime"
          name: timezone
      volumes:
      - name: timezone
        hostPath:
          path: /etc/localtime
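Apply the DaemonSet and spot-check one exporter. Since the pod uses hostNetwork, metrics are served on each node's own IP; <node-ip> below is a placeholder for one of your nodes.

kubectl apply -f node-exporter-ds.yml
kubectl get ds node-exporter -o wide
curl http://<node-ip>:9100/metrics | head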

Storage: create a 10Gi PersistentVolume (PV) backed by NFS.

vi prometheus-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: gwj-pv-prometheus
  labels:
    app: gwj-pv
spec:
  capacity:
    storage: 10Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /storage/gwj-prometheus
    server: 10.1.99.1
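The NFS export /storage/gwj-prometheus must already exist on 10.1.99.1. Apply the manifest and confirm the PV shows as Available:

kubectl apply -f prometheus-pv.yaml
kubectl get pv gwj-pv-prometheus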

PersistentVolumeClaim: claim 5Gi against the PV just created.

vi prometheus-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gwj-prometheus-pvc
  namespace: gwj
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 5Gi
  selector:
    matchLabels:
      app: gwj-pv
  storageClassName: slow
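Apply and confirm the claim binds to the PV created above (STATUS should become Bound):

kubectl apply -f prometheus-pvc.yaml
kubectl -n gwj get pvc gwj-prometheus-pvc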

Set up RBAC permissions for Prometheus.

vi prometheus-rbac.yml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: gwj-prometheus-clusterrole
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  namespace: gwj
  name: gwj-prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: gwj-prometheus-rolebinding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: gwj-prometheus-clusterrole
subjects:
- kind: ServiceAccount
  name: gwj-prometheus
  namespace: gwj

kubectl apply -f prometheus-rbac.yml

  clusterrole.rbac.authorization.k8s.io/gwj-prometheus-clusterrole created
  serviceaccount/gwj-prometheus created
  clusterrolebinding.rbac.authorization.k8s.io/gwj-prometheus-rolebinding created
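As a quick sanity check, kubectl can impersonate the ServiceAccount to confirm the binding took effect:

kubectl auth can-i list pods --as=system:serviceaccount:gwj:gwj-prometheus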

Create the Prometheus configuration file as a ConfigMap.

vi prometheus-cm.yml

apiVersion: v1
kind: ConfigMap
metadata:
  name: gwj-prometheus-cm
  namespace: gwj
data:
  prometheus.yml: |
    rule_files:
    - /etc/prometheus/rules.yml
    alerting:
      alertmanagers:
      - static_configs:
        - targets: ["gwj-alertmanger-svc:80"]
    global:
      scrape_interval: 10s
      scrape_timeout: 10s
      evaluation_interval: 10s
    scrape_configs:
    - job_name: 'kubernetes-nodes'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics
      - target_label: __address__
        replacement: kubernetes.default.svc:443
    - job_name: 'kubernetes-node-exporter'
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - source_labels: [__meta_kubernetes_role]
        action: replace
        target_label: kubernetes_role
      - source_labels: [__address__]
        regex: '(.*):10250'
        replacement: '${1}:9100'
        target_label: __address__
    - job_name: 'kubernetes-pods'
      kubernetes_sd_configs:
      - role: pod
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name
    - job_name: 'kubernetes-cadvisor'
      scheme: https
      tls_config:
        ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
      bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
      kubernetes_sd_configs:
      - role: node
      relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)
      - target_label: __address__
        replacement: kubernetes.default.svc:443
      - source_labels: [__meta_kubernetes_node_name]
        regex: (.+)
        target_label: __metrics_path__
        replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    - job_name: 'kubernetes-service-endpoints'
      kubernetes_sd_configs:
      - role: endpoints
      relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      - action: labelmap
        regex: __meta_kubernetes_service_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_service_name]
        action: replace
        target_label: kubernetes_name
  rules.yml: |
    groups:
    - name: kubernetes_rules
      rules:
      - alert: InstanceDown
        expr: up{job="kubernetes-node-exporter"} == 0
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
      - alert: APIHighRequestLatency
        expr: api_http_request_latencies_second{quantile="0.5"} > 1
        for: 10m
        annotations:
          summary: "High request latency on {{ $labels.instance }}"
          description: "{{ $labels.instance }} has a median request latency above 1s (current value: {{ $value }}s)"
      - alert: StatefulSetReplicasMismatch
        annotations:
          summary: "Replicas mismatch"
          description: StatefulSet {{ $labels.namespace }}/{{ $labels.statefulset }} has not matched the expected number of replicas for longer than 3 minutes.
        expr: label_join(kube_statefulset_status_replicas_ready != kube_statefulset_replicas, "instance", "/", "namespace", "statefulset")
        for: 3m
        labels:
          severity: critical
      - alert: PodFrequentlyRestarting
        expr: increase(kube_pod_container_status_restarts_total[1h]) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          description: Pod {{ $labels.namespace }}/{{ $labels.pod }} was restarted {{ $value }} times within the last hour
          summary: Pod is restarting frequently
      - alert: DeploymentReplicasNotUpdated
        expr: ((kube_deployment_status_replicas_updated != kube_deployment_spec_replicas)
          or (kube_deployment_status_replicas_available != kube_deployment_spec_replicas))
          unless (kube_deployment_spec_paused == 1)
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Replicas are not updated and available for deployment {{ $labels.namespace }}/{{ $labels.deployment }}
          summary: Deployment replicas are outdated
      - alert: DaemonSetRolloutStuck
        expr: kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled * 100 < 100
        for: 5m
        labels:
          severity: critical
        annotations:
          description: Only {{ $value }}% of desired pods scheduled and ready for daemonset {{ $labels.namespace }}/{{ $labels.daemonset }}
          summary: DaemonSet is missing pods
      - alert: DaemonSetsNotScheduled
        expr: kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are not scheduled.'
          summary: Daemonsets are not scheduled correctly
      - alert: DaemonSetsMissScheduled
        expr: kube_daemonset_status_number_misscheduled > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          description: '{{ $value }} Pods of DaemonSet {{ $labels.namespace }}/{{ $labels.daemonset }} are running where they are not supposed to run.'
          summary: Daemonsets are not scheduled correctly
      - alert: Node_Boot_Time
        expr: (node_time_seconds - node_boot_time_seconds) <= 150
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} was just rebooted (uptime below 150s)"
      - alert: Available_Percent
        expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes <= 0.2
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} has less than 20% memory available"
      - alert: FD_Used_Percent
        expr: (node_filefd_allocated / node_filefd_maximum) >= 0.8
        for: 15s
        annotations:
          summary: "Instance {{ $labels.instance }} is using more than 80% of file descriptors"

As the ConfigMap above expects (its alerting target), create Alertmanager for alert notifications.

vi alertmanger.yml


kind: Service
apiVersion: v1
metadata:
  name: gwj-alertmanger-svc
  namespace: gwj
spec:
  selector:
    app: gwj-alert-pod
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9093
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gwj-alert-sts
  namespace: gwj
  labels:
    app: gwj-alert-sts
spec:
  replicas: 1
  serviceName: gwj-alertmanger-svc
  selector:
    matchLabels:
      app: gwj-alert-pod
  template:
    metadata:
      labels:
        app: gwj-alert-pod
    spec:
      containers:
      - image: prom/alertmanager:v0.14.0
        name: gwj-alert-pod
        ports:
        - containerPort: 9093
          protocol: TCP
        volumeMounts:
        - mountPath: "/etc/localtime"
          name: timezone
      volumes:
      - name: timezone
        hostPath:
          path: /etc/localtime

kubectl apply -f alertmanger.yml

  service/gwj-alertmanger-svc created

  statefulset.apps/gwj-alert-sts created
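Before the Ingress exists, a port-forward is the quickest way to confirm Alertmanager answers:

kubectl -n gwj port-forward svc/gwj-alertmanger-svc 9093:80

Then open http://localhost:9093 in a browser.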

Create Prometheus itself with a StatefulSet, mounting the pieces prepared above:

  /prometheus - the data directory, from PVC gwj-prometheus-pvc
  /etc/prometheus/ - the config and rules, from ConfigMap gwj-prometheus-cm

vi prometheus-sts.yml


kind: Service
apiVersion: v1
metadata:
  name: gwj-prometheus-svc
  namespace: gwj
  labels:
    app: gwj-prometheus-svc
spec:
  ports:
  - port: 80
    targetPort: 9090
  selector:
    app: gwj-prometheus-pod
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: gwj-prometheus-sts
  namespace: gwj
  labels:
    app: gwj-prometheus-sts
spec:
  replicas: 1
  serviceName: gwj-prometheus-svc
  selector:
    matchLabels:
      app: gwj-prometheus-pod
  template:
    metadata:
      labels:
        app: gwj-prometheus-pod
    spec:
      containers:
      - image: prom/prometheus:v2.9.2
        name: gwj-prometheus-pod
        ports:
        - containerPort: 9090
          protocol: TCP
        volumeMounts:
        - mountPath: "/prometheus"
          name: data
        - mountPath: "/etc/prometheus/"
          name: config-volume
        - mountPath: "/etc/localtime"
          name: timezone
        resources:
          requests:
            cpu: 100m
            memory: 100Mi
          limits:
            cpu: 500m
            memory: 2000Mi
      serviceAccountName: gwj-prometheus
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: gwj-prometheus-pvc
      - name: config-volume
        configMap:
          name: gwj-prometheus-cm
      - name: timezone
        hostPath:
          path: /etc/localtime

kubectl apply -f prometheus-sts.yml

  service/gwj-prometheus-svc created

  statefulset.apps/gwj-prometheus-sts created
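Verify the pod starts and Prometheus loads the configuration from the ConfigMap (the pod is named gwj-prometheus-sts-0 by the StatefulSet naming convention):

kubectl -n gwj get pods -l app=gwj-prometheus-pod
kubectl -n gwj logs gwj-prometheus-sts-0 | grep -i 'loading of configuration'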

Create an Ingress that routes requests to the different Services by hostname.

vi prometheus-ingress.yml


apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: gwj
  annotations:
  name: gwj-ingress-prometheus
spec:
  rules:
  - host: gwj.syncbug.com
    http:
      paths:
        - path: /
          backend:
            serviceName: gwj-prometheus-svc
            servicePort: 80
  - host: gwj-alert.syncbug.com
    http:
      paths:
        - path: /
          backend:
            serviceName: gwj-alertmanger-svc
            servicePort: 80

kubectl apply -f prometheus-ingress.yml

  ingress.extensions/gwj-ingress-prometheus created
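If the hostnames are not in DNS yet, curl's --resolve flag can exercise the Ingress directly; <ingress-ip> is a placeholder for your ingress controller's address:

curl --resolve gwj.syncbug.com:80:<ingress-ip> http://gwj.syncbug.com/graph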

Visit the corresponding domains.

gwj.syncbug.com

Check that the scrape targets are correct:

http://gwj.syncbug.com/targets

Check that the loaded configuration is correct:

http://gwj.syncbug.com/config

gwj-alert.syncbug.com
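The same check works from the command line through the standard Prometheus HTTP API (-g stops curl from interpreting the braces); every node-exporter target should report 1:

curl -g 'http://gwj.syncbug.com/api/v1/query?query=up{job="kubernetes-node-exporter"}'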

===grafana

vi grafana-pv.yaml

apiVersion: v1
kind: PersistentVolume
metadata:
  name: gwj-pv-grafana
  labels:
    app: gwj-pv-gra
spec:
  capacity:
    storage: 2Gi
  volumeMode: Filesystem
  accessModes:
  - ReadWriteMany
  persistentVolumeReclaimPolicy: Recycle
  storageClassName: slow
  mountOptions:
  - hard
  - nfsvers=4.1
  nfs:
    path: /storage/gwj-grafana
    server: 10.1.99.1

vi grafana-pvc.yaml

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: gwj-grafana-pvc
  namespace: gwj
spec:
  accessModes:
  - ReadWriteMany
  volumeMode: Filesystem
  resources:
    requests:
      storage: 1Gi
  selector:
    matchLabels:
      app: gwj-pv-gra
  storageClassName: slow
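As with Prometheus, the NFS export /storage/gwj-grafana must exist on 10.1.99.1 first; then apply and check that the claim binds:

kubectl apply -f grafana-pv.yaml -f grafana-pvc.yaml
kubectl -n gwj get pvc gwj-grafana-pvc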

vi grafana-deployment.yaml

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    name: grafana
  name: grafana
  namespace: gwj
spec:
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: grafana
  template:
    metadata:
      labels:
        app: grafana
      name: grafana
    spec:
      containers:
      - env:
        - name: GF_PATHS_DATA
          value: /var/lib/grafana/
        - name: GF_PATHS_PLUGINS
          value: /var/lib/grafana/plugins
        image: grafana/grafana:6.2.4
        imagePullPolicy: IfNotPresent
        name: grafana
        ports:
        - containerPort: 3000
          name: grafana
          protocol: TCP
        volumeMounts:
        - mountPath: /var/lib/grafana/
          name: data
        - mountPath: /etc/localtime
          name: localtime
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: gwj-grafana-pvc
      - name: localtime
        hostPath:
          path: /etc/localtime
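Apply and wait for the rollout. If the pod crash-loops while writing to /var/lib/grafana, check the NFS export's permissions: the Grafana image runs as a non-root user.

kubectl apply -f grafana-deployment.yaml
kubectl -n gwj rollout status deployment/grafana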

vi grafana-ingress.yaml


apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  namespace: gwj
  annotations:
  name: gwj-ingress-grafana
spec:
  rules:
  - host: gwj-grafana.syncbug.com
    http:
      paths:
        - path: /
          backend:
            serviceName: gwj-grafana-svc
            servicePort: 80
---
kind: Service
apiVersion: v1
metadata:
  name: gwj-grafana-svc
  namespace: gwj
spec:
  selector:
    app: grafana
  ports:
    - protocol: TCP
      port: 80
      targetPort: 3000
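Apply it like the earlier Ingress and, if DNS is not set up yet, test with --resolve against the ingress controller (<ingress-ip> is a placeholder):

kubectl apply -f grafana-ingress.yaml
curl --resolve gwj-grafana.syncbug.com:80:<ingress-ip> http://gwj-grafana.syncbug.com/login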

Open Grafana at gwj-grafana.syncbug.com.

Default credentials: admin / admin

Set the data source URL to http://gwj-prometheus-svc:80

Import a dashboard template.
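Adding the data source can also be scripted through Grafana's HTTP API (POST /api/datasources), using the default admin credentials mentioned above:

curl -X POST http://admin:admin@gwj-grafana.syncbug.com/api/datasources \
  -H 'Content-Type: application/json' \
  -d '{"name":"prometheus","type":"prometheus","url":"http://gwj-prometheus-svc:80","access":"proxy"}'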


Copyright notice
This article was written by [芒果牛奶]. Please keep the original link when reposting. Thanks.
https://segmentfault.com/a/1190000039262732
