

Revisiting Linux CPU scheduling

2024-08-03

Disclaimer: This is to the best of my knowledge. It's a complicated topic, there are tons of options, and this post only covers a tiny fraction of it anyway. If you spot mistakes, please tell me.

The effects of autogroup

A surprisingly long time ago, even though it feels like yesterday, automatic process grouping was added to Linux 2.6.38. It was/is basically a hook into setsid(): That function is usually called in situations like opening a new terminal window. On a call to setsid(), the autogroup feature puts the calling process into a new scheduling group.

In other words, each terminal window gets its own scheduling group.

Running 4 processes in one terminal and 1 process in another results in 50% of the CPU time for the 4 guys (i.e., one of those gets 1/2 * 1/4 = 1/8, 12.5%) and 50% for the other single process.
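You can check this yourself: sched(7) documents /proc/[pid]/autogroup, which shows the autogroup a process belongs to. The exact numbers will differ, but two separate terminals should report two different groups, roughly like this:

$ cat /proc/self/autogroup    # first terminal
/autogroup-123 nice 0
$ cat /proc/self/autogroup    # second terminal
/autogroup-124 nice 0

And since all of this hinges on setsid(), the setsid(1) tool from util-linux should also give a command a fresh autogroup of its own, e.g. setsid ./load1.sh, without opening another terminal.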

When you run this, load4.sh, in one terminal:

#!/bin/bash

children=()

# Start four busy loops, all pinned to CPU 0.
for i in $(seq 4)
do
    taskset -c 0 sh -c 'while true; do true; done' &
    children+=($!)
done

if [[ -t 0 ]]
then
    # Interactive: wait for the user, then clean up the loops.
    read -p 'Press ENTER to quit'
    kill "${children[@]}"
else
    # No terminal attached: just exit after two minutes.
    sleep 2m
fi

(The sleep call will become relevant later. Also, all of these examples use taskset -c 0 to pin everything to one CPU to make it easier to understand -- we don't want your 256 core super mega modern CPU to get in the way.)

And this, load1.sh, in another one:

#!/bin/bash

taskset -c 0 sh -c 'while true; do true; done'

Then both get roughly the same amount of CPU time if autogroup is in effect. The single loop of load1.sh gets 50% of CPU time and the other four loops together also get 50%, because the task groups look like this:

[ load4[0] load4[1] load4[2] load4[3]  |  load1 ]
  \--------------- 50% -------------/     \50%/
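To actually watch this split, plain top does the job, or pidstat from the sysstat package, if you have it installed (a quick sketch, nothing specific to this setup):

# one-second samples of per-task CPU usage for all commands whose name contains "sh"
$ pidstat -u 1 -C sh

With autogroup in effect, the loop from load1.sh should hover around 50% and each of the four loops from load4.sh around 12.5%.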

If you want to see the old behaviour, add the kernel parameter noautogroup.

To reproduce all this, use a Linux distribution without systemd, like Void Linux. We'll soon see why.

nice was "neutralized"

One effect of this change was that using nice had surprising results. The manpage sched(7) tells us why:

Under group scheduling, a thread's nice value has an effect  for
scheduling  decisions only relative to other threads in the same
task group.

So, in the example above, running nice -n 19 ./load4.sh in one terminal and just ./load1.sh in another had no effect whatsoever. The nice values were only relevant inside the task group of load4.sh. So, if one of those four loops in load4.sh had a lower nice value, it would only win over the other three loops in the same group. But it wouldn't affect the loop of load1.sh.
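If you actually want to deprioritize a whole terminal while autogrouping is active, sched(7) documents a way: the nice value of the entire autogroup can be changed by writing to /proc/[pid]/autogroup. A sketch, run from inside the terminal that started load4.sh ($$ is the shell's own PID, which sits in the same autogroup as the loops):

$ echo 19 | sudo tee /proc/$$/autogroup

After that, the whole group should lose against the other terminal again.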

The Internet is full of questions asking why nice "doesn't work anymore". This is usually the reason.

To be honest, I had autogroup disabled for a long time, because it didn't really fit my usage pattern very well. Sometimes it's helpful, oftentimes it's just confusing -- because nice no longer did what I expected.

cgroups win over autogroup

I recently had to revisit this topic and, even though autogroup was now enabled, it didn't have any effect anymore. What was going on, did I disable it and forget about it? The kernel parameter noautogroup wasn't set, and /proc/sys/kernel/sched_autogroup_enabled did contain a 1. So why wasn't it working anymore?

Again, sched(7) has the answer:

The  use  of the cgroups(7) CPU controller to place processes in
cgroups other than the root CPU cgroup overrides the  effect  of
autogrouping.

I don't remember assigning any cgroups explicitly, but apparently systemd does that for me now. In every terminal, I get this answer:

$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-2.scope

A more comprehensive overview can be found in /sys/kernel/debug/sched/debug: It lists all runnable tasks and their scheduling group at the end. You might need to mount this first using mount -t debugfs none /sys/kernel/debug. (This used to be /proc/sched_debug but was moved in 2021, which landed in Linux 5.12, I think.)

The answer is different when I log in to another TTY. Arch's systemd uses a unified cgroup hierarchy by default these days, and each process's "cgroup" matches the process's position in the output of systemd-cgls. I put "cgroup" in quotes because there used to be several cgroup hierarchies, and the one used for CPU scheduling didn't necessarily match the one shown by systemd-cgls. So, if I understand correctly, the system no longer distinguishes between "category (A) and (B)" as described in this old blog post by Lennart Poettering.

This means, when I log in on one TTY and do nothing special, all processes spawned from here are put into the same cgroup and thus into the same task group for scheduling. As a result, nice mostly works as expected again.
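This is easy to verify for the example processes. A sketch, relying on pgrep -f to match the full command line of the busy loops:

for pid in $(pgrep -f 'while true'); do
    printf '%6s  ' "$pid"
    cat "/proc/$pid/cgroup"
done

Loops started from the same login session should all print the same cgroup path, while loops from a second session show up under a different scope.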

(I tried to find out when this change hit Arch Linux, but I couldn't find it. I vaguely remember some big discussion about this ... Did I dream that? Who knows.)

-- edit: A reader suggested these two links regarding the general history of cgroups v1 vs. v2:

Back to the example above: These days, nice -n 19 ./load4.sh in one terminal and ./load1.sh in another one results in load1.sh winning again if you use systemd. load4.sh hardly gets any CPU time (it still gets some).

Scheduling is hierarchical

Suppose you run load4.sh as a systemd unit, for example put this in ~/.local/share/systemd/user/load4.service:

[Service]
ExecStart=/bin/sh -c 'nice -n 19 /foo/load4.sh'

And then run:

$ systemctl --user start load4.service
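If systemctl complains that the unit doesn't exist, systemd probably hasn't picked up the new file yet and a reload should fix that:

$ systemctl --user daemon-reload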

And then load1.sh in a regular terminal again.

First of all, nice is irrelevant again: These scripts run in different cgroups.

What do I have to do to make this a "background" unit? I want load4.sh to run with the lowest scheduling policy, all other processes shall win over it (at least those in my regular user session).

This doesn't work:

[Service]
ExecStart=/foo/load4.sh
Nice=19

Neither does this:

[Service]
ExecStart=/foo/load4.sh
CPUWeight=idle

In both cases, if you run load1.sh in a terminal in parallel, it only gets about 50% of the CPU time. I want it to get ~100%.

(CPUQuota=1% would work, but that's an absolute quota: Even when no other processes are running, load4.sh gets very little CPU time.)
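To see where the service actually ended up in the hierarchy, the CGroup line of its status output is enough (no need to dig through the full tree):

$ systemctl --user status load4.service | grep -i cgroup

It lands somewhere below app.slice inside user@1000.service, as the systemd-cgls output below shows -- and that is exactly the problem.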

The manpage systemd.resource-control(5) explains why it doesn't work:

Controllers in the cgroup hierarchy are hierarchical, and resource
control is realized by distributing resource assignments between
siblings in branches of the cgroup hierarchy.

In my case, the two scripts are not siblings in the hierarchy:

CGroup /:
-.slice
|-user.slice
| `-user-1000.slice
|   |-user@1000.service ...
|   | |-app.slice
|   | | |-load4.service
|   | | | |-92656 /bin/bash /home/void/tmp/2024-08-02/load4.sh
|   | | | |-92658 sh -c while true; do true; done
|   | | | |-92659 sh -c while true; do true; done
|   | | | |-92660 sh -c while true; do true; done
|   | | | |-92661 sh -c while true; do true; done
|   | | | `-92662 sleep 2m
...
|   |-session-2.scope
|   | |-92675 /bin/bash ./load1.sh
|   | |-92676 sh -c while true; do true; done
...

If I understand this correctly, then the entirety of the user@1000.service cgroup is scheduled against session-2.scope. Simplified, the scheduling decision goes something like this: below user-1000.slice, the two siblings user@1000.service and session-2.scope each get about 50% of the CPU time; the four loops of load4.sh then have to share the first half among themselves, while load1.sh gets the second half all to itself.

Now ... if I read this systemd documentation about desktop environments correctly, then my regular processes should actually all be below app.slice (probably in one or more services) instead of session-2.scope. The fact that they're not is probably caused by my setup not caring about systemd very much: I log in on an agetty, it launches a shell and in that shell I launch a simple startx-like launcher for X11. None of these programs (except for maybe login) do anything systemd-specific at all.

Anyway, this isn't about my setup. The point is: The location of your process/cgroup in the cgroup hierarchy matters. You can't just put CPUWeight=idle in some .service file and expect it to work.

For system-wide units, you can relatively easily put them into a group that is a sibling of user.slice. That way, you can do systemctl start foo.service and now CPUWeight=idle is in effect as expected. To do so, first set a property on a new slice:

sudo systemctl set-property background.slice CPUWeight=idle

This creates a new file below /etc/systemd/system.control, so this is a persistent setting.
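If you ever want to get rid of that setting again, systemctl revert should remove the generated files:

$ sudo systemctl revert background.slice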

Then you can set the slice on your service, for example /etc/systemd/system/load4.service:

[Service]
ExecStart=/foo/load4.sh
Slice=background.slice

And now systemd-cgls looks like this when you run sudo systemctl start load4 and load1.sh in a terminal:

CGroup /:
-.slice
|-background.slice
| `-load4.service
|   |-98581 /bin/bash /home/void/tmp/2024-08-02/load4.sh
|   |-98586 sh -c while true; do true; done
|   |-98587 sh -c while true; do true; done
|   |-98588 sh -c while true; do true; done
|   |-98589 sh -c while true; do true; done
|   `-98590 sleep 2m
|-user.slice
| `-user-1000.slice
|   `-session-2.scope
|     |-98599 /bin/bash ./load1.sh
|     |-98600 sh -c while true; do true; done
...

The decision tree changes accordingly: the very first split now happens at the root, between background.slice, system.slice, and user.slice -- and because background.slice has CPUWeight=idle, it only gets CPU time that nobody else wants.

If only load4.sh is running, each loop gets 25% CPU time:

[image: load4-only.png]

If load1.sh is running in parallel, it gets essentially all of the CPU time instead:

[image: load4-and-1.png]

background.slice is a sibling of system.slice as well, so it really should lose against all other processes running on the system.

Without background.slice, you would end up scheduling system.slice against user.slice and our service would get up to 50% of CPU time in this scenario, which is not quite what we want.
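By the way, you don't need a unit file at all just to experiment with this. systemd-run can start a transient service directly in that slice (a sketch; it assumes the CPUWeight=idle property from above is already set on background.slice, and the unit name is arbitrary):

# start load4.sh as a transient service inside background.slice
$ sudo systemd-run --slice=background.slice --unit=load4-test /foo/load4.sh

The transient unit should disappear again on its own once the script exits.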

Can you get global nice back?

Can you disable all this cgroup stuff and go back to using global nice values? I think the answer to this is: Only if you don't use systemd and put noautogroup in your kernel parameters. On my Void Linux box, this results in all tasks showing a / in /sys/kernel/debug/sched/debug, so we're back to flat scheduling.

If you're using systemd, you're probably out of luck. cgroups are a core feature of it and now that it uses a unified hierarchy (i.e., no more "some cgroups just for labels/organization, other cgroups for resources"), every process always gets put into a cgroup. (Maybe somehow disabling the CPU cgroup controller could be an option, but it's highly doubtful that it would be a stable, supported setup.)

Conclusion

This is powerful, but also complicated. The sibling thing can be an unexpected obstacle if you just want to say: "That service over there shall be very low-priority."

I could probably make more use of all of this, but then again: A simple traditional nice does the trick for 99% of my use cases and so I'm glad that nice is (mostly) back to its old behavior on my Arch box. Most of what I do happens in my interactive session and this is the area where I can just use nice, so it's (mostly) just fine.

Comments?