一. Taint and Toleration Concepts
Official documentation: https://kubernetes.io/zh/docs/concepts/scheduling-eviction/taint-and-toleration/
Design idea: a Taint is applied to a class of nodes so that Pods which cannot tolerate the taint are not scheduled onto those nodes. A Toleration lets a Pod tolerate the taints configured on a node, so that Pods requiring special configuration can be scheduled onto the tainted, specially configured nodes.
- Taints are applied to nodes.
- Tolerations are applied to Pods.
1. Taint Configuration
Create a taint (a node can have multiple taints):
kubectl taint nodes NODE_NAME TAINT_KEY=TAINT_VALUE:EFFECT
Example: kubectl taint nodes k8s-node01 ssd=true:PreferNoSchedule
Check it: kubectl describe node k8s-node01 | grep Taint (note the capital T)
NoSchedule: new Pods are not scheduled onto the node; Pods already running on the node are not affected.
NoExecute: new Pods are not scheduled onto the node, and Pods already on the node that do not tolerate the taint are evicted immediately (or after their tolerationSeconds).
PreferNoSchedule: try to avoid scheduling Pods onto the node; if no better node is available, the Pod may still be placed there.
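A quick sketch of applying and removing a taint with each effect, reusing the k8s-node01 node and the ssd key from the example above:
kubectl taint nodes k8s-node01 ssd=true:PreferNoSchedule   # avoid the node if possible
kubectl taint nodes k8s-node01 ssd=true:NoSchedule         # block new Pods, keep existing ones
kubectl taint nodes k8s-node01 ssd=true:NoExecute          # block new Pods and evict Pods that do not tolerate it
kubectl taint nodes k8s-node01 ssd=true:NoExecute-         # a trailing "-" removes a taint again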
2. Toleration Configuration
Form 1, exact match: all of the fields must match the taint
tolerations:
- key: "taintKey"
  operator: "Equal"
  value: "taintValue"
  effect: "NoSchedule"
Form 2, partial match: match a key plus the NoSchedule effect
tolerations:
- key: "taintKey"
  operator: "Exists"
  effect: "NoSchedule"
Form 3, broad match (not recommended when the key is a built-in taint): matching the key alone is enough
tolerations:
- key: "taintKey"
  operator: "Exists"
Form 4, match everything (not recommended):
tolerations:
- operator: "Exists"
Stay-time configuration: by default a Pod is evicted after 300 seconds; tolerationSeconds overrides that delay (in the example below the Pod is evicted after 3600 seconds).
tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600
Example:
1. One node (say node01) has only SSD disks, and only Pods that need high-performance storage should be scheduled onto it.
Taint and label the node:
[root@k8s-master01 ~]# kubectl get po -A -owide | grep node01                 # check which Pods are running on node01
[root@k8s-master01 ~]# kubectl taint nodes k8s-node01 ssd:PreferNoSchedule-   # remove the PreferNoSchedule taint
[root@k8s-master01 ~]# kubectl taint nodes k8s-node01 ssd=true:NoExecute      # this evicts the Pods that do not tolerate the taint
[root@k8s-master01 ~]# kubectl taint nodes k8s-node01 ssd=true:NoSchedule     # taint node01
[root@k8s-master01 ~]# kubectl label node k8s-node01 ssd=true                 # label node01 with ssd=true
[root@k8s-master01 ~]# kubectl get node -l ssd                                # list the nodes in the cluster that have the ssd label
[root@k8s-master01 ~]# kubectl describe node k8s-node01 | grep Taint          # check the taints on node01
Configure the Pod (the toleration means the Pod can be scheduled onto node01, not that it necessarily will be):
[root@k8s-master01 ~]# vim tolerations.yaml
apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    ssd: "true"
  tolerations:
  - key: "ssd"
    operator: "Exists"
Because of the NoExecute taint, only calico and kube-proxy are left on node01. Then create the Pod and verify that it is scheduled onto node01.
[root@k8s-master01 ~]# kubectl get pod -A -owide | grep node01
kube-system   calico-node-hrj82   1/1   Running   15   6d13h   192.168.0.103    k8s-node01
kube-system   kube-proxy-mrl9j    1/1   Running   7    6d12h   192.168.0.103    k8s-node01
[root@k8s-master01 ~]# kubectl create -f tolerations.yaml        # create the Pod
pod/nginx created
[root@k8s-master01 ~]# kubectl get pod -A -owide | grep node01   # the Pod has been scheduled onto node01
default       nginx               1/1   Running   0    50s     172.161.125.33   k8s-node01
Delete the Pod, comment out the toleration in the YAML, and redeploy; the Pod no longer schedules, and kubectl describe shows why.
[root@k8s-master01 ~]# vim tolerations.yaml      # comment out the toleration
....
  nodeSelector:
    ssd: "true"
#  tolerations:
#  - key: "ssd"
#    operator: "Exists"
[root@k8s-master01 ~]# kubectl delete -f tolerations.yaml    # delete the old Pod
pod "nginx" deleted
[root@k8s-master01 ~]# kubectl create -f tolerations.yaml    # redeploy
pod/nginx created
[root@k8s-master01 ~]# kubectl get -f tolerations.yaml       # nginx is stuck in Pending
NAME    READY   STATUS    RESTARTS   AGE
nginx   0/1     Pending   0          89s
[root@k8s-master01 ~]# kubectl describe po nginx             # one node has the taint but no toleration; the other four nodes do not match the Pod's node affinity (the nodeSelector)
...
  Warning  FailedScheduling  84s  default-scheduler  0/5 nodes are available: 1 node(s) had taint {ssd: true}, that the pod didn't tolerate, 4 node(s) didn't match Pod's node affinity.
3. Built-in Taints
node.kubernetes.io/not-ready: the node is not ready; equivalent to the node's Ready condition being False.
node.kubernetes.io/unreachable: the node controller cannot reach the node; equivalent to the node's Ready condition being Unknown.
node.kubernetes.io/out-of-disk: the node has run out of disk space.
node.kubernetes.io/memory-pressure: the node is under memory pressure.
node.kubernetes.io/disk-pressure: the node is under disk pressure.
node.kubernetes.io/network-unavailable: the node's network is unavailable.
node.kubernetes.io/unschedulable: the node is unschedulable.
node.cloudprovider.kubernetes.io/uninitialized: when the kubelet is started with an external cloud provider, this taint marks the node as unusable; once a controller in cloud-controller-manager has initialized the node, the kubelet removes the taint.
Tolerate an unhealthy node for 6000 seconds before eviction (the default is 300 seconds):
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute"
  tolerationSeconds: 6000
4. Common taint Commands
Create a taint (a node can have multiple taints):
kubectl taint nodes NODE_NAME TAINT_KEY=TAINT_VALUE:EFFECT
Example: kubectl taint nodes k8s-node01 ssd=true:PreferNoSchedule
View the taints on a node:
kubectl get node k8s-node01 -o go-template --template {{.spec.taints}}
kubectl describe node k8s-node01 | grep Taints -A 10
Delete a taint (similar to deleting a label):
By key: kubectl taint nodes k8s-node01 ssd-
By key + effect: kubectl taint nodes k8s-node01 ssd:PreferNoSchedule-
Modify a taint (same key and effect):
kubectl taint nodes k8s-node01 ssd=true:PreferNoSchedule --overwrite
二. Affinity
Affinity types:
·NodeAffinity: node affinity / anti-affinity
·PodAffinity: Pod affinity
·PodAntiAffinity: Pod anti-affinity
(Figure: Affinity classification)
1. Affinity Scenarios
As shown in the figure below, one application is deployed across 4 nodes; if one of them fails, the other 3 keep the service highly available.
As shown in the figure below, one application is deployed across two zones; if one zone fails (a cut fiber, for example), the other zone keeps the service highly available.
Try to spread the different applications of one project across different nodes, so that the blast radius of a node failure stays small (see the sketch below).
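A minimal sketch of the last scenario, assuming every application of the project carries a hypothetical label project=demo; a preferred podAntiAffinity on that label nudges the scheduler to spread the project's Pods across different hosts:
  # Pod template spec fragment shared by the project's Deployments (sketch)
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: project           # hypothetical label shared by all apps of the project
              operator: In
              values:
              - demo
          topologyKey: kubernetes.io/hostname   # one domain per node, i.e. spread per host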
2. Node Affinity Configuration
[root@k8s-master01 ~]# vim with-node-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-node-affinity
spec:
  affinity:                     # aligned with containers
    nodeAffinity:               # node affinity
      requiredDuringSchedulingIgnoredDuringExecution:    # hard affinity (required)
        nodeSelectorTerms:      # node selector terms; multiple matchExpressions can be configured (any one of them may match)
        - matchExpressions:     # multiple key/value selectors can be configured (all of them must match)
          - key: kubernetes.io/e2e-az-name
            operator: In        # label matching operator (see below)
            values:             # multiple values can be configured (any one of them may match)
            - e2e-az1
            - az-2
      preferredDuringSchedulingIgnoredDuringExecution:   # soft affinity (preferred)
      - weight: 1               # weight of the soft affinity term; higher weight means higher priority, range 1-100
        preference:             # soft affinity term, at the same level as weight; multiple may be configured; matchExpressions works the same way as for hard affinity
          matchExpressions:
          - key: another-node-label-key
            operator: In        # label matching operator (see below)
            values:
            - another-node-label-value
  containers:
  - name: with-node-affinity
    image: nginx
operator: label matching operators
In: the label value equals one of the specified values (key = value)
NotIn: the label value is not any of the specified values (key != value)
Exists: the node has a label with the given key; the values field must not be set
DoesNotExist: the node does not have a label with the given key; the values field must not be set
Gt: the label value is greater than the specified value (numeric comparison)
Lt: the label value is less than the specified value (numeric comparison)
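A small sketch of Exists and Gt in a node affinity term, assuming hypothetical node labels disktype and cpu-cores (cpu-cores holding a number stored as a string):
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: disktype        # hypothetical label: only require that the key exists
            operator: Exists     # no values field is allowed with Exists
          - key: cpu-cores       # hypothetical label whose value is a number, e.g. "32"
            operator: Gt         # schedule only onto nodes where cpu-cores > 16
            values:
            - "16"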
3. Pod Affinity Configuration
[root@k8s-master01 ~]# vim with-pod-affinity.yaml
apiVersion: v1
kind: Pod
metadata:
  name: with-pod-affinity
spec:
  affinity:
    podAffinity:                # Pod affinity
      requiredDuringSchedulingIgnoredDuringExecution:    # hard affinity
      - labelSelector:          # Pod selector; multiple terms can be configured
          matchExpressions:     # multiple key/value selectors can be configured (all of them must match)
          - key: security
            operator: In        # label matching operator
            values:             # multiple values can be configured (any one of them may match)
            - S1
        topologyKey: failure-domain.beta.kubernetes.io/zone   # key of the topology domain, i.e. a node label key; nodes with the same key and value form one domain; can be used to mark different server rooms and regions
    podAntiAffinity:            # Pod anti-affinity
      preferredDuringSchedulingIgnoredDuringExecution:   # soft affinity
      - weight: 100             # weight; higher weight means higher priority, range 1-100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: security
              operator: In
              values:
              - S2
          namespaces:           # which namespaces' Pods to match; empty means the Pod's own namespace
          - default
          topologyKey: failure-domain.beta.kubernetes.io/zone
  containers:
  - name: with-pod-affinity
    image: nginx
4. Spread One Application Across Different Hosts
In the example below there are 5 replicas and a required (hard) anti-affinity. With 3 nodes in the cluster, one Pod starts on each of the 3 nodes and the remaining 2 stay Pending, because no Pod may be co-located with another Pod carrying the app=must-be-diff-nodes label.
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: must-be-diff-nodes
  name: must-be-diff-nodes
  namespace: kube-public
spec:
  replicas: 5                   # number of replicas
  selector:
    matchLabels:
      app: must-be-diff-nodes
  template:
    metadata:
      labels:
        app: must-be-diff-nodes
    spec:
      affinity:
        podAntiAffinity:        # anti-affinity
          requiredDuringSchedulingIgnoredDuringExecution:   # required (hard)
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - must-be-diff-nodes   # label
            topologyKey: kubernetes.io/hostname
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: must-be-diff-nodes
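To verify the spread, assuming the manifest above is saved as must-be-diff-nodes.yaml (Pod names, IPs and node names will differ per cluster):
[root@k8s-master01 ~]# kubectl create -f must-be-diff-nodes.yaml
[root@k8s-master01 ~]# kubectl get pod -n kube-public -l app=must-be-diff-nodes -owide
# Expected on a 3-node cluster: 3 Pods Running on 3 different nodes, the remaining 2 Pending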
4.1 Pin Different Replicas of One Application to Specific Nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-cache
spec:
  selector:
    matchLabels:
      app: store
  replicas: 3
  template:
    metadata:
      labels:
        app: store
    spec:
      nodeSelector:
        app: store
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - store
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: redis-server
        image: redis:3.2-alpine
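Note that the nodeSelector restricts the replicas to nodes labeled app=store, and the anti-affinity then puts each replica on a different one of those nodes, so at least as many nodes as replicas must carry the label. A labeling sketch (the node names here are just examples):
[root@k8s-master01 ~]# kubectl label node k8s-node01 app=store
[root@k8s-master01 ~]# kubectl label node k8s-node02 app=store
[root@k8s-master01 ~]# kubectl label node k8s-master03 app=store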
4.2 Deploy an Application and Its Cache in the Same Domain Where Possible
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-server
spec:
  selector:
    matchLabels:
      app: web-store
  replicas: 3
  template:
    metadata:
      labels:
        app: web-store
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - web-store
            topologyKey: "kubernetes.io/hostname"
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - store
              topologyKey: "kubernetes.io/hostname"
      containers:
      - name: web-app
        image: nginx:1.16-alpine
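The podAffinity term matches Pods labeled app=store (for example the redis-cache Deployment from 4.1), so each web-server replica prefers a node that already runs a cache Pod, while the podAntiAffinity keeps the web-server replicas themselves apart. A quick co-location check (a sketch, run in the namespace where both Deployments live):
[root@k8s-master01 ~]# kubectl get pod -owide -l 'app in (store, web-store)'
# Expected: each node running a redis-server Pod also runs a web-app Pod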
5. Prefer Scheduling onto Higher-spec Servers
In the example below, the Pod prefers nodes labeled ssd=true that do not carry the GPU=true label (soft affinity, weight 100), and may also be placed on nodes labeled type=physical (weight 10).
[root@k8s-master01 ~]# vim nodeAffinitySSD.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: prefer-ssd
  name: prefer-ssd
  namespace: kube-public
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prefer-ssd
  template:
    metadata:
      creationTimestamp: null
      labels:
        app: prefer-ssd
    spec:
      affinity:
        nodeAffinity:           # node affinity
          preferredDuringSchedulingIgnoredDuringExecution:   # soft affinity; use required instead if the Pod must be placed on such a node
          - preference:
              matchExpressions:
              - key: ssd        # ssd label
                operator: In    # must match
                values:
                - "true"
              - key: GPU        # GPU label
                operator: NotIn # must not match
                values:
                - "true"
            weight: 100         # weight
          - preference:
              matchExpressions:
              - key: type       # type=physical label
                operator: In
                values:
                - physical
            weight: 10          # weight
      containers:
      - env:
        - name: TZ
          value: Asia/Shanghai
        - name: LANG
          value: C.UTF-8
        image: nginx
        imagePullPolicy: IfNotPresent
        name: prefer-ssd
Label the nodes
[root@k8s-master01 ~]# kubectl get node --show-labels    # view the current node labels
Label master01 and node01 with ssd=true, and master01 alone with GPU=true:
[root@k8s-master01 ~]# kubectl label node k8s-master01 ssd=true
[root@k8s-master01 ~]# kubectl label node k8s-master01 GPU=true
[root@k8s-master01 ~]# kubectl label node k8s-node01 ssd=true
Label node02 with type=physical:
[root@k8s-master01 ~]# kubectl label node k8s-node02 type=physical
Create the application
[root@k8s-master01 ~]# kubectl create -f nodeAffinitySSD.yaml
[root@k8s-master01 ~]# kubectl get pod -n kube-public -owide    # the Pod was scheduled onto node01
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
prefer-ssd-dcb88b7d9-wdd54   1/1     Running   0          18s   172.161.125.41   k8s-node01
If we remove the ssd label from node01 and recreate the Deployment, the Pod is scheduled onto node02, which carries the type=physical label.
[root@k8s-master01 ~]# kubectl label node k8s-node01 ssd-       # remove the ssd label from node01
node/k8s-node01 labeled
[root@k8s-master01 ~]# kubectl delete -f nodeAffinitySSD.yaml   # delete
deployment.apps "prefer-ssd" deleted
[root@k8s-master01 ~]# kubectl create -f nodeAffinitySSD.yaml   # recreate
deployment.apps/prefer-ssd created
[root@k8s-master01 ~]# kubectl get pod -n kube-public -owide    # check
NAME                         READY   STATUS    RESTARTS   AGE   IP               NODE         NOMINATED NODE   READINESS GATES
prefer-ssd-dcb88b7d9-58rfw   1/1     Running   0          87s   172.171.14.212   k8s-node02
6. Topology Domain (TopologyKey)
topologyKey: the topology domain, mainly used to partition hosts into zones. The partitioning is based on node labels: nodes whose label key or value differ belong to different topology domains.
As shown in the figure below, nodes in the same zone share one label and nodes in different zones get different labels, so that a single-zone failure cannot take the service down because all Pods happened to be scheduled into that one zone.
6.1 Deploy One Application Across Multiple Zones
Following the figure above, logically define 3 region labels and deploy the application's Pods into different regions:
master01,02: region=daxing
master03,node01: region=chaoyang
node02: region=xxx
[root@k8s-master01 ~]# kubectl label node k8s-master01 k8s-master02 region=daxing
[root@k8s-master01 ~]# kubectl label node k8s-node01 k8s-master03 region=chaoyang
[root@k8s-master01 ~]# kubectl label node k8s-node02 region=xxx
Create the YAML with topologyKey set to region; each Pod is then placed in a different region. Because the Pod anti-affinity is required (hard), any replicas beyond the number of regions stay Pending and never start.
[root@k8s-master01 ~]# vim must-be-diff-zone.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: must-be-diff-zone
  name: must-be-diff-zone
  namespace: kube-public
spec:
  replicas: 3
  selector:
    matchLabels:
      app: must-be-diff-zone
  template:
    metadata:
      labels:
        app: must-be-diff-zone
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:   # required (hard) anti-affinity
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - must-be-diff-zone
            topologyKey: region     # the region label configured above
      containers:
      - image: nginx
        imagePullPolicy: IfNotPresent
        name: must-be-diff-zone
Create it and check: the 3 Pods are running on different nodes, one in each region.
[root@k8s-master01 ~]# kubectl create -f must-be-diff-zone.yaml
[root@k8s-master01 ~]# kubectl get pod -n kube-public -owide
NAME                                 READY   STATUS    RESTARTS   AGE     IP               NODE           NOMINATED NODE   READINESS GATES
must-be-diff-zone-755966bd8b-42fft   1/1     Running   0          2m22s   172.171.14.213   k8s-node02
must-be-diff-zone-755966bd8b-fx6cs   1/1     Running   0          2m22s   172.169.92.68    k8s-master02
must-be-diff-zone-755966bd8b-k5d7q   1/1     Running   0          2m22s   172.161.125.42   k8s-node01