How the oom_score_adj of a Pod's processes is determined

When a Node runs out of memory, processes are killed in the order BestEffort → Burstable → Guaranteed. Within the same QoS class, the order is decided by oom_score_adj.

But how is oom_score_adj itself determined? I got curious, so here are my notes.

Prepare a Pod whose memory limit is capped at 128Mi, as follows.

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: test
    image: ubuntu:latest
    command: ["sleep", "3600"]
    resources:
      limits:
        memory: "128Mi"

$ kubectl apply -f pod.yml
pod/test created

Check the cgroup to confirm that the limit is in effect.
First, look up the Pod's uid.

❯ kubectl get pods test -o yaml | grep uid
  uid: 7e3dc6dc-1fc8-11ea-998e-fa163e46e514

Each Pod gets its own cgroup directory, named after its uid.

$ cat /sys/fs/cgroup/memory/kubepods/burstable/pod7e3dc6dc-1fc8-11ea-998e-fa163e46e514/memory.limit_in_bytes
134217728 # 128 MiB
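
As a quick cross-check of the value above, the sketch below converts 128Mi to bytes and rebuilds the path that was read. The helper names mib_to_bytes and burstable_pod_limit_path are hypothetical, and the path layout is simply the one seen on this node (cgroup v1, Burstable Pod); other QoS classes or cgroup drivers may lay things out differently.

```shell
# Hypothetical helpers (not part of kubectl or the kubelet).

# Convert a MiB quantity such as 128 (for "128Mi") to bytes.
mib_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Build the memory.limit_in_bytes path for a Burstable Pod, assuming the
# cgroup v1 layout seen on this node.
burstable_pod_limit_path() {
  echo "/sys/fs/cgroup/memory/kubepods/burstable/pod$1/memory.limit_in_bytes"
}

mib_to_bytes 128
burstable_pod_limit_path 7e3dc6dc-1fc8-11ea-998e-fa163e46e514
```

The first line printed is 134217728, which matches the value read from the cgroup.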

The limit is set correctly.
Next, use stress to allocate more memory than the cgroup limit allows.

root@test:/# stress --vm 1 --vm-bytes 200M
stress: info: [275] dispatching hogs: 0 cpu, 0 io, 1 vm, 0 hdd
stress: FAIL: [275] (415) <-- worker 276 got signal 9
stress: WARN: [275] (417) now reaping child worker processes
stress: FAIL: [275] (451) failed run completed in 1s

syslog shows that the process was killed by the OOM killer.

Dec 16 06:12:28 k8s-node-0001 kernel: [333063.945854] Memory cgroup stats for /kubepods/burstable/pod7e3dc6dc-1fc8-11ea-998e-fa163e46e514/a0affade58ebe557aa2efbe0795a11558198e93a128bf384e7f9774e3a8d32bb: cache:0KB rss:44KB rss_huge:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:44KB inactive_file:0KB active_file:0KB unevictable:0KB
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.945870] Memory cgroup stats for /kubepods/burstable/pod7e3dc6dc-1fc8-11ea-998e-fa163e46e514/64da4ff8457c6389581a447c1728db835d204916b977821eaeb366b028c08fa3: cache:20KB rss:128024KB rss_huge:0KB mapped_file:20KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:128024KB inactive_file:0KB active_file:20KB unevictable:0KB
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.945887] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946073] [15342]     0 15342      256        1       3       2        0          -998 pause
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946079] [15505]     0 15505     1133      195       8       3        0           968 sleep
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946084] [20884]     0 20884     4627      878      16       3        0           968 bash
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946105] [23022]     0 23022     2060      291       9       3        0           968 stress
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946110] [23023]     0 23023    53261    31890      72       3        0           968 stress
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946113] Memory cgroup out of memory: Kill process 23023 (stress) score 1920 or sacrifice child
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.974151] Killed process 23023 (stress) total-vm:213044kB, anon-rss:127304kB, file-rss:256kB

Looking at the oom_score_adj column, every process except pause has 968.
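
Reading that column by eye gets tedious, so here is a rough sketch that pulls the process name and oom_score_adj out of an OOM report. The function name oom_adj_table is made up, and it assumes the last two whitespace-separated fields of each process row are oom_score_adj and name, as in the log above. Process names containing spaces, or unrelated log lines that happen to end in a number and a word, would confuse it.

```shell
# Hypothetical helper: print "name oom_score_adj" for each process row of an
# OOM report read from stdin. Rows are recognised by a numeric
# second-to-last field; the header and summary lines are skipped.
oom_adj_table() {
  awk 'NF >= 2 && $(NF-1) ~ /^-?[0-9]+$/ { print $NF, $(NF-1) }'
}

oom_adj_table <<'EOF'
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.945887] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946073] [15342]     0 15342      256        1       3       2        0          -998 pause
Dec 16 06:12:28 k8s-node-0001 kernel: [333063.946079] [15505]     0 15505     1133      195       8       3        0           968 sleep
EOF
```

For the three sample lines this prints "pause -998" and "sleep 968".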

oom_score_adj is calculated with the formula given at https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/#node-oom-behavior.

	min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

Guaranteed Pods get -998 and BestEffort Pods get 1000; for Burstable Pods the value is determined by the formula above.
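
The three cases can be sketched as a small shell function. This only illustrates the documented formula and is not the kubelet's code: the function name oom_score_adj_for is hypothetical, the constants -998 and 1000 are the values quoted above, and the integer division is an assumption made because shell arithmetic has no floating point.

```shell
# Hypothetical helper: oom_score_adj_for QOS REQUEST_BYTES CAPACITY_BYTES
oom_score_adj_for() {
  case "$1" in
    Guaranteed) echo -998 ;;
    BestEffort) echo 1000 ;;
    Burstable)
      # 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes,
      # using integer division (an assumption of this sketch).
      adj=$(( 1000 - (1000 * $2) / $3 ))
      # Clamp to [2, 999], as in min(max(2, ...), 999).
      if [ "$adj" -lt 2 ]; then adj=2; fi
      if [ "$adj" -gt 999 ]; then adj=999; fi
      echo "$adj"
      ;;
    *)
      echo "unknown QoS class: $1" >&2
      return 1
      ;;
  esac
}

# A Burstable container requesting 1 GiB on a hypothetical 4 GiB node.
oom_score_adj_for Burstable 1073741824 4294967296
```

The example call prints 750: the larger the share of node memory a container requests, the lower its oom_score_adj.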

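This Pod is Burstable. Only a memory limit is set, so the memory request defaults to the same 128Mi. With the request and the node's memory capacity both expressed in KiB:

min(max(2, 1000 - (1000 * 128 * 1024) / 3943872), 999) = 966.7656557

which is roughly 968 (I wonder why there is a slight discrepancy).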
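
To reproduce the arithmetic, the sketch below evaluates the same expression twice: once in floating point with awk, and once with the shell's integer division. Whether the kubelet truncates like this is an assumption, and the function and variable names are made up for illustration.

```shell
request_kib=$((128 * 1024))   # 128Mi expressed in KiB
capacity_kib=3943872          # node memory capacity used in the calculation above

# Floating-point evaluation, as done by hand above.
float_adj() {
  awk -v r="$1" -v c="$2" 'BEGIN { printf "%.4f\n", 1000 - (1000 * r) / c }'
}

# Integer evaluation: the division truncates 33.23... to 33.
int_adj() {
  echo $(( 1000 - (1000 * $1) / $2 ))
}

float_adj "$request_kib" "$capacity_kib"
int_adj "$request_kib" "$capacity_kib"
```

The floating-point result is about 966.77 and the integer result is 967. That is closer to the observed 968 but still does not match, so truncation alone would not account for the difference.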

As a test, I created a Guaranteed Pod, and its processes did indeed get -998.

6551cbe224baca: cache:20KB rss:127996KB rss_huge:0KB mapped_file:20KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:127996KB inactive_file:0KB active_file:20KB unevictable:0KB
Dec 16 07:00:44 k8s-node-0001 kernel: [335959.334042] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Dec 16 07:00:44 k8s-node-0001 kernel: [335959.334431] [10452]     0 10452      256        1       4       2        0          -998 pause
Dec 16 07:00:44 k8s-node-0001 kernel: [335959.334438] [10603]     0 10603     1133      183       8       3        0          -998 sleep
Dec 16 07:00:44 k8s-node-0001 kernel: [335959.334443] [11028]     0 11028     4627      830      14       3        0          -998 bash
Dec 16 07:00:44 k8s-node-0001 kernel: [335959.334460] [13802]     0 13802     2060      283      10       3        0          -998 stress
Dec 16 07:00:44 k8s-node-0001 kernel: [335959.334464] [13803]     0 13803    35341    31873      72       3        0          -998 stress