Monitoring Kubernetes Cluster Status with Kuberhealthy

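Kuberhealthy is a tool for checking whether a Kubernetes cluster is working properly. By creating custom resources in the cluster, you can easily enable a variety of synthetic test containers.

Kuberhealthy serves a simple JSON status page and a Prometheus metrics endpoint, so that it can be integrated into the alerting solution of your choice.

Installation

Installing with Helm 3:

1. Create the namespace "kuberhealthy" in the target Kubernetes cluster/context: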

kubectl create namespace kuberhealthy

2. Add the kuberhealthy repo to Helm:

helm repo add kuberhealthy https://comcast.github.io/kuberhealthy/helm-repos

3. Depending on your Prometheus setup, install Kuberhealthy with the command that matches your cluster:

If you use the Prometheus Operator:

helm install kuberhealthy kuberhealthy/kuberhealthy --set prometheus.enabled=true,prometheus.enableAlerting=true,prometheus.enableScraping=true,prometheus.serviceMonitor=true

If you use Prometheus but not the Prometheus Operator:

helm install kuberhealthy kuberhealthy/kuberhealthy --set prometheus.enabled=true,prome

If you do not use Prometheus:

helm install kuberhealthy kuberhealthy/kuberhealthy

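Installing directly from YAML files

I am not much of a Helm fan, so here I install from the YAML files instead.

The YAML files are located under https://github.com/Comcast/kuberhealthy/blob/master/deploy.

1. Create the namespace "kuberhealthy" in the target Kubernetes cluster/context: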

kubectl create namespace kuberhealthy

2. Deploy Kuberhealthy. prometheus-operator is already installed in this cluster, so the kuberhealthy-prometheus-operator.yaml file is used:

kubectl apply -f https://raw.githubusercontent.***/Comcast/kuberhealthy/master/deploy/kuberhealthy-prometheus-operator.yaml

[root@master01 kuberhealthy]# kubectl apply -f kuberhealthy-prometheus-operator.yaml
customresourcedefinition.apiextensions.k8s.io/khchecks.comcast.github.io created
customresourcedefinition.apiextensions.k8s.io/khstates.comcast.github.io created
poddisruptionbudget.policy/kuberhealthy-pdb created
configmap/kuberhealthy created
serviceaccount/check-reaper created
serviceaccount/daemonset-khcheck created
serviceaccount/deployment-sa created
serviceaccount/kuberhealthy created
clusterrole.rbac.authorization.k8s.io/kuberhealthy created
clusterrole.rbac.authorization.k8s.io/kuberhealthy-daemonset-khcheck created
clusterrolebinding.rbac.authorization.k8s.io/check-reaper created
clusterrolebinding.rbac.authorization.k8s.io/kuberhealthy-daemonset-khcheck created
clusterrolebinding.rbac.authorization.k8s.io/kuberhealthy created
role.rbac.authorization.k8s.io/ds-admin created
role.rbac.authorization.k8s.io/deployment-service-role created
rolebinding.rbac.authorization.k8s.io/daemonset-khcheck created
rolebinding.rbac.authorization.k8s.io/deployment-check-rb created
service/kuberhealthy created
deployment.apps/kuberhealthy created
cronjob.batch/check-reaper created
kuberhealthycheck.comcast.github.io/daemonset created
kuberhealthycheck.comcast.github.io/deployment created
kuberhealthycheck.comcast.github.io/dns-status-internal created
servicemonitor.monitoring.coreos.com/kuberhealthy created

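After deployment, run kubectl get pods in the kuberhealthy namespace; you should see two Kuberhealthy pods. These are the pods that create, coordinate, and track the test pods. The two Kuberhealthy pods also serve a JSON status page and a /metrics endpoint. Every other pod you see created is a checker pod, designed to run and then shut down once it is done.

Among these, the daemonset-xxxx pods test the DaemonSet controller, the deployment-xxxx pods test the Deployment controller, and the dns-xxxx pods test the CoreDNS service.

Configuring additional checks

Next, you can run kubectl get khchecks. You should see three Kuberhealthy checks installed by default:

daemonset: deploys and tears down a DaemonSet to ensure that all nodes in the cluster are functional.
deployment: creates a Deployment and then triggers a rolling update. It tests that the Deployment is reachable through a Service and then deletes everything. Any problem during this process causes the check to report a failure.
dns-status-internal: validates that internal cluster DNS is functioning as expected.

To enable other checks, see https://github.com/Comcast/kuberhealthy/blob/master/docs/EXTERNAL_CHECKS_REGISTRY.md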

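Viewing status

By default, Kuberhealthy provides a JSON status page and a metrics endpoint for inspecting its state.

The deployment above creates a Service of type LoadBalancer; in a real environment you may need to change this type.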
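As a rough illustration of consuming the JSON status page, here is a minimal Python sketch that lists the failing checks in a status document. It assumes the document has a top-level "OK" flag and a "CheckDetails" map whose entries carry "OK" and "Errors" fields; these field names, the sample data, and the helper name failing_checks are assumptions made for illustration and should be verified against the output of your own deployment.

```python
import json

# Hypothetical sample of a status page response. The field names follow the
# layout assumed in the lead-in; they are not taken from this article.
SAMPLE_STATUS = """
{
  "OK": false,
  "Errors": ["kuberhealthy/deployment: timed out"],
  "CheckDetails": {
    "kuberhealthy/daemonset": {"OK": true, "Errors": []},
    "kuberhealthy/deployment": {"OK": false, "Errors": ["timed out"]},
    "kuberhealthy/dns-status-internal": {"OK": true, "Errors": []}
  }
}
"""


def failing_checks(status):
    """Return {check name: list of errors} for every check that is not OK."""
    details = status.get("CheckDetails", {})
    return {
        name: detail.get("Errors", [])
        for name, detail in details.items()
        if not detail.get("OK", False)
    }


if __name__ == "__main__":
    status = json.loads(SAMPLE_STATUS)
    print("cluster OK:", status.get("OK"))
    for name, errors in failing_checks(status).items():
        print(name, "->", "; ".join(errors))
```

In practice the document would be fetched from the kuberhealthy Service rather than embedded in the script.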

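Accessing metrics

Creating key performance indicators

Availability

We define availability as the K8s cluster control plane being up and functioning as expected. This is measured by our ability to create a deployment, perform a rolling update, and delete the deployment within a set period of time.

We calculate this by measuring the passes and failures of Kuberhealthy's deployment check.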

https://github.com/Comcast/kuberhealthy/tree/master/cmd/deployment-check

Availability = Uptime / (Uptime + Downtime)
Uptime = Number of Deployment Check Passes * Check Run Interval
Downtime = Number of Deployment Check Fails * Check Run Interval
Check Run Interval = how often the check runs (runInterval set in your KuberhealthyCheck Spec)

PromQL Query (Availability % over the past 30 days):

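(1 - (sum(count_over_time(kuberhealthy_check{check="kuberhealthy/deployment", status="0"}[30d])) OR vector(0)) / sum(count_over_time(kuberhealthy_check{check="kuberhealthy/deployment", status="1"}[30d]))) * 100

This query computes 1 - fails / passes as a percentage, which stays close to Uptime / (Uptime + Downtime) as long as failures are rare.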
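To make the arithmetic concrete, here is a small Python sketch of the availability calculation, assuming the formula is read as Uptime / (Uptime + Downtime). The function name, the pass and fail counts, and the 600-second run interval are all made up for illustration.

```python
def availability(passes, fails, run_interval_seconds):
    """Return Uptime / (Uptime + Downtime) as a fraction between 0 and 1."""
    uptime = passes * run_interval_seconds
    downtime = fails * run_interval_seconds
    total = uptime + downtime
    if total == 0:
        raise ValueError("no check runs recorded")
    return uptime / total


if __name__ == "__main__":
    # Hypothetical numbers: 4310 passes and 10 fails, one run every 600 s.
    value = availability(passes=4310, fails=10, run_interval_seconds=600)
    print(f"{value:.4%}")
```

Because the run interval multiplies both uptime and downtime, it cancels out of the ratio, which is why counting passing and failing samples is enough.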

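Utilization

We define utilization as user uptake of the product (k8s) and its resources (pods, services, etc.). This is measured by how many nodes, deployments, statefulsets, persistent volumes, services, pods, and jobs our customers are using. We calculate it by counting the total number of nodes, deployments, statefulsets, persistent volumes, services, pods, and jobs.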
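One way to obtain such totals is to tally object kinds from a JSON listing. The sketch below assumes the input has the shape of a Kubernetes List document (an "items" array whose entries each carry a "kind" field), such as the output of kubectl get with -o json; the sample data and the function name count_by_kind are hypothetical.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed listing; real objects carry many more fields.
SAMPLE_LIST = """
{
  "kind": "List",
  "items": [
    {"kind": "Node", "metadata": {"name": "master01"}},
    {"kind": "Pod", "metadata": {"name": "kuberhealthy-a"}},
    {"kind": "Pod", "metadata": {"name": "kuberhealthy-b"}},
    {"kind": "Service", "metadata": {"name": "kuberhealthy"}}
  ]
}
"""


def count_by_kind(listing):
    """Count the objects in a List-shaped document, grouped by their kind."""
    items = listing.get("items", [])
    return Counter(item.get("kind", "Unknown") for item in items)


if __name__ == "__main__":
    totals = count_by_kind(json.loads(SAMPLE_LIST))
    for kind, total in sorted(totals.items()):
        print(kind, total)
```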

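Duration (Latency)

We define duration as the control plane's capacity and utilization of throughput. We calculate it by capturing the average run duration of the Kuberhealthy deployment check.

PromQL query (average run duration of the deployment check):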

avg(kuberhealthy_check_duration_seconds{check="kuberhealthy/deployment"})

Grafana monitoring

A dashboard.json is provided under https://github.com/Comcast/kuberhealthy/tree/master/deploy/grafana; simply download it and import it into Grafana.


1116 2022-11-02
