2024-08-03
Disclaimer: This is to the best of my knowledge. It's a complicated topic, there are tons of options, and this only covers a tiny fraction of this topic anyway. If you spot mistakes, please tell me.
autogroup
A surprisingly long time ago, even though it feels like yesterday,
automatic process grouping was added to Linux 2.6.38. It was/is
basically a hook into setsid(): That function is usually called in
situations like opening a new terminal window. On a call to setsid(),
the autogroup feature puts the calling process into a new scheduling
group. In other words, each terminal window gets its own scheduling
group.
Running 4 processes in one terminal and 1 process in another results in
50% of the CPU time for the 4 guys (i.e., one of those gets 1/2 * 1/4 =
1/8, 12.5%) and 50% for the other single process.
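If you want to see these groups, the kernel exposes them in /proc; a
quick check, assuming autogroup is compiled in and enabled (the number
will differ, of course):
$ cat /proc/self/autogroup
/autogroup-211 nice 0
Every process started from the same terminal session should report the
same autogroup here.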
When you run this, load4.sh, in one terminal:
#!/bin/bash
children=()
for i in $(seq 4)
do
    taskset -c 0 sh -c 'while true; do true; done' &
    children+=($!)
done
if [[ -t 0 ]]
then
    read -p 'Press ENTER to quit'
    kill "${children[@]}"
else
    sleep 2m
fi
(The sleep
call will become relevant later. Also, all of these
examples use taskset -c 0
to pin everything to one CPU to make it
easier to understand -- we don't want your 256-core super mega modern CPU
to get in the way.)
And this, load1.sh, in another one:
#!/bin/bash
taskset -c 0 sh -c 'while true; do true; done'
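To watch what happens, run something like this in a third terminal
(plain top sorted by CPU usage is enough, any process monitor works):
$ top -o %CPU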
Then both get roughly the same amount of CPU time if autogroup is in
effect. The single loop of load1.sh gets 50% of CPU time and the other
four loops together also get 50%, because the task groups look like
this:
[ load4[0] load4[1] load4[2] load4[3] | load1 ]
\--------------- 50% -------------/ \50%/
If you want to see the old behaviour, add the kernel parameter
noautogroup.
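You don't necessarily have to reboot for that: There is also a runtime
switch (the same file we'll meet again below). Whether flipping it
affects sessions that are already running, I haven't verified, so when
in doubt, use the boot parameter:
$ cat /proc/sys/kernel/sched_autogroup_enabled
1
$ echo 0 | sudo tee /proc/sys/kernel/sched_autogroup_enabled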
To reproduce all this, use a Linux distribution without systemd, like Void Linux. We'll soon see why.
nice was "neutralized"
One effect of this change was that using nice had surprising results.
The manpage sched(7)
tells us why:
Under group scheduling, a thread's nice value has an effect for
scheduling decisions only relative to other threads in the same
task group.
So, in the example above, running nice -n 19 ./load4.sh in one terminal
and just ./load1.sh in another had no effect whatsoever. The nice
values were only relevant inside the task group of load4.sh.
So, if one of those four loops in load4.sh had a lower nice value, it
would only win over the other three loops in the same group. But it
wouldn't affect the loop of load1.sh.
The Internet is full of questions why nice
"doesn't work anymore".
This is usually the reason.
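sched(7) also describes a knob that works with autogrouping instead of
against it: Each autogroup has its own nice value, which can be changed
by writing to /proc/<pid>/autogroup. The PID below is just a
placeholder for any process inside the group you want to deprioritize:
$ echo 19 > /proc/12345/autogroup
That lowers the weight of that whole terminal's group relative to the
other groups -- which is often what people actually wanted from nice in
the first place.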
To be honest, I had autogroup disabled for a long time, because it
didn't really fit my usage pattern very well. Sometimes it's helpful,
oftentimes it's just confusing -- because nice no longer did what I
expected.
cgroups win over autogroup
I recently had to revisit this topic and, even though autogroup was now
enabled, it didn't have any effect anymore. What was going on, did I
disable it and forget about it? It wasn't the kernel parameter
noautogroup, and /proc/sys/kernel/sched_autogroup_enabled did contain a
1. So why wasn't it working anymore?
Again, sched(7)
has the answer:
The use of the cgroups(7) CPU controller to place processes in
cgroups other than the root CPU cgroup overrides the effect of
autogrouping.
I don't remember assigning any cgroups explicitly, but apparently systemd does that for me now. In every terminal, I get this answer:
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-2.scope
A more comprehensive overview can be found in
/sys/kernel/debug/sched/debug: It lists all runnable tasks and their
scheduling group at the end. You might need to mount this first using
mount -t debugfs none /sys/kernel/debug. (This used to be
/proc/sched_debug but has been moved in 2021, which landed in Linux
5.12, I think.)
The answer is different when I log in to another TTY. Arch's systemd
uses a unified cgroup hierarchy by default these days and each
process's "cgroup" matches the process's position in the output of
systemd-cgls. I put "cgroup" in quotes, because there used to be
several cgroup hierarchies and the one used for CPU scheduling didn't
necessarily match the one used in systemd-cgls.
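If you just want to know which unit and cgroup a given shell ended up
in, two commands are handy (the slice path is the one from the cgroup
output above):
$ systemctl status $$ | head -n 3
$ systemd-cgls /user.slice/user-1000.slice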
So, if I understand correctly, the system no longer distinguishes
between "category (A) and (B)" as described in
this old blog post by Lennart Poettering.
This means, when I log in on one TTY and do nothing special, all
processes spawned from here are put into the same cgroup and thus into
the same task group for scheduling. As a result, nice
mostly works
as expected again.
(I tried to find out when this change hit Arch Linux, but I couldn't find it. I vaguely remember some big discussion about this ... Did I dream that? Who knows.)
-- edit: A reader suggested these two links regarding the general history of cgroups v1 vs. v2:
Back to the example above: These days, nice -n 19 ./load4.sh
in one
terminal and ./load1.sh
in another one results in load1.sh
winning
again if you use systemd. load4.sh
hardly gets any CPU time (it still
gets some).
Suppose you run load4.sh as a systemd unit, for example put this in
~/.local/share/systemd/user/load4.service:
[Service]
ExecStart=/bin/sh -c 'nice -n 19 /foo/load4.sh'
And then run:
$ systemctl --user start load4.service
And then load1.sh
in a regular terminal again.
First of all, nice
is irrelevant again: These scripts run in different
cgroups.
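This is easy to verify; the service reports its cgroup directly, and
the shell in the terminal can look at /proc (illustrative output, the
paths match the systemd-cgls listing below):
$ systemctl --user show -p ControlGroup load4.service
ControlGroup=/user.slice/user-1000.slice/user@1000.service/app.slice/load4.service
$ cat /proc/self/cgroup
0::/user.slice/user-1000.slice/session-2.scope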
What do I have to do to make this a "background" unit? I want load4.sh
to run with the lowest scheduling policy, all other processes shall win
over it (at least those in my regular user session).
This doesn't work:
[Service]
ExecStart=/foo/load4.sh
Nice=19
Neither does this:
[Service]
ExecStart=/foo/load4.sh
CPUWeight=idle
In both cases, running load1.sh in a terminal in parallel leaves
load1.sh with only 50% of CPU time. I want it to get ~100%.
(CPUQuota=1%
would work, but that's an absolute quota: Even when no
other processes are running, load4.sh
gets very little CPU time.)
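For reference, that quota variant would be just one more line in the
unit file -- shown only for completeness, it's not what I want:
[Service]
ExecStart=/foo/load4.sh
CPUQuota=1%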
The manpage systemd.resource-control(5)
explains why it doesn't work:
Controllers in the cgroup hierarchy are hierarchical, and resource
control is realized by distributing resource assignments between
siblings in branches of the cgroup hierarchy.
In my case, the two scripts are not siblings in the hierarchy:
CGroup /:
-.slice
|-user.slice
| `-user-1000.slice
| |-user@1000.service ...
| | |-app.slice
| | | |-load4.service
| | | | |-92656 /bin/bash /home/void/tmp/2024-08-02/load4.sh
| | | | |-92658 sh -c while true; do true; done
| | | | |-92659 sh -c while true; do true; done
| | | | |-92660 sh -c while true; do true; done
| | | | |-92661 sh -c while true; do true; done
| | | | `-92662 sleep 2m
...
| |-session-2.scope
| | |-92675 /bin/bash ./load1.sh
| | |-92676 sh -c while true; do true; done
...
If I understand this correctly, then the entirety of the
user@1000.service cgroup is scheduled against session-2.scope. The
scheduling decision should go something like this (simplified):
1. load1.sh and load4.sh want CPU time. Compare their hierarchies.
2. -.slice vs. -.slice: Identical.
3. user.slice vs. user.slice: Identical.
4. user-1000.slice vs. user-1000.slice: Identical.
5. user@1000.service vs. session-2.scope: This is where the trees
   diverge. Neither has something like CPUWeight set, so both get 50%.
Now ... if I read
this systemd documentation about desktop environments
correctly, then my regular processes should actually all be below
app.slice
(probably in one or more services) instead of
session-2.scope
. The fact that they're not is probably caused by my
setup not caring about systemd very much: I log in on an
agetty,
it launches a shell and in that shell I launch a simple
startx-like launcher for X11. None of
these programs (except for maybe login) do anything systemd-specific at
all.
Anyway, this isn't about my setup. The point is: The location of your
process/cgroup in the cgroup hierarchy matters. You can't just put
CPUWeight=idle
in some .service
file and expect it to work.
For system-wide units, you can relatively easily put them into a group
that is a sibling of user.slice
. That way, you can do systemctl start
foo.service
and now CPUWeight=idle
is in effect as expected. To do
so, first set a property on a new slice:
sudo systemctl set-property background.slice CPUWeight=idle
This creates a new file below /etc/systemd/system.control, so this is a
persistent setting.
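If you're curious what exactly ended up there (the precise file layout
below /etc/systemd/system.control may differ between systemd versions):
$ grep -r CPUWeight /etc/systemd/system.control/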
Then you can set the slice on your service, for example
/etc/systemd/system/load4.service:
[Service]
ExecStart=/foo/load4.sh
Slice=background.slice
And now systemd-cgls
looks like this when you run sudo systemctl
start load4
and load1.sh
in a terminal:
CGroup /:
-.slice
|-background.slice
| `-load4.service
| |-98581 /bin/bash /home/void/tmp/2024-08-02/load4.sh
| |-98586 sh -c while true; do true; done
| |-98587 sh -c while true; do true; done
| |-98588 sh -c while true; do true; done
| |-98589 sh -c while true; do true; done
| `-98590 sleep 2m
|-user.slice
| `-user-1000.slice
| `-session-2.scope
| |-98599 /bin/bash ./load1.sh
| |-98600 sh -c while true; do true; done
...
Decision tree:
1. load1.sh and load4.sh want CPU time. Compare their hierarchies.
2. -.slice vs. -.slice: Identical.
3. background.slice vs. user.slice: background.slice has
   CPUWeight=idle, so user.slice wins.
If only load4.sh is running, each loop gets 25% CPU time. If load1.sh
is running in parallel, it gets all the time instead.
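A convenient way to watch this live is systemd-cgtop, which ships with
systemd and shows CPU usage per cgroup:
$ systemd-cgtop --order=cpu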
background.slice
is a sibling of system.slice
as well, so it really
should lose against all other processes running on the system.
Without background.slice, you would end up scheduling system.slice
against user.slice and our service would get up to 50% of CPU time in
this scenario, which is not quite what we want.
nice back?
Can you disable all this cgroup stuff and go back to using global nice
values? I think the answer to this is: Only if you don't use systemd
and put noautogroup in your kernel parameters. On my Void Linux box,
this results in all tasks showing a / in
/sys/kernel/debug/sched/debug, so we're back to flat scheduling.
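To double-check which state you're actually in, look at the kernel
command line and the runtime switch mentioned earlier:
$ cat /proc/cmdline
$ cat /proc/sys/kernel/sched_autogroup_enabled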
If you're using systemd, you're probably out of luck. cgroups are a core feature of it and now that it uses a unified hierarchy (i.e., no more "some cgroups just for labels/organization, other cgroups for resources"), every process always gets put into a cgroup. (Maybe somehow disabling the CPU cgroup controller could be an option, but it's highly doubtful that it would be a stable, supported setup.)
This is powerful, but also complicated. The sibling thing can be an unexpected obstacle if you just want to say: "That service over there shall be very low-priority."
I could probably make more use of all of this, but then again: A simple
traditional nice
does the trick for 99% of my use cases and so I'm
glad that nice
is (mostly) back to its old behavior on my Arch box.
Most of what I do happens in my interactive session and this is the area
where I can just use nice
, so it's (mostly) just fine.