7.10 Real world
The xv6 scheduler implements a simple scheduling policy, which runs each process in turn. This policy is called round robin. Real operating systems implement more sophisticated policies that, for example, allow processes to have priorities. The idea is that a runnable high-priority process will be preferred by the scheduler over a runnable low-priority process. These policies can become complex quickly because there are often competing goals: for example, the operating system might also want to guarantee fairness and high throughput. In addition, complex policies may lead to unintended interactions such as priority inversion and convoys. Priority inversion can happen when a low-priority and high-priority process both use a particular lock, which when acquired by the low-priority process can prevent the high-priority process from making progress. A long convoy of waiting processes can form when many high-priority processes are waiting for a low-priority process that acquires a shared lock; once a convoy has formed it can persist for long time. To avoid these kinds of problems additional mechanisms are necessary in sophisticated schedulers.
sleep
and
wakeup
are a simple and effective synchronization method,
but there are many others.
The first challenge in all of them is to
avoid the “lost wakeups” problem we saw at the
beginning of the chapter.
The original Unix kernel’s
sleep
simply disabled interrupts,
which sufficed because Unix ran on a single-CPU system.
Because xv6 runs on multiprocessors,
it adds an explicit lock to
sleep
.
FreeBSD’s
msleep
takes the same approach.
Plan 9’s
sleep
uses a callback function that runs with the scheduling
lock held just before going to sleep;
the function serves as a last-minute check
of the sleep condition, to avoid lost wakeups.
The Linux kernel’s
sleep
uses an explicit process queue, called a wait queue, instead of
a wait channel; the queue has its own internal lock.
Scanning the entire set of processes in
wakeup
is inefficient. A better solution is to
replace the
chan
in both
sleep
and
wakeup
with a data structure that holds
a list of processes sleeping on that structure,
such as Linux’s wait queue.
Plan 9’s
sleep
and
wakeup
call that structure a rendezvous point.
Many thread libraries refer to the same
structure as a condition variable;
in that context, the operations
sleep
and
wakeup
are called
wait
and
signal
.
All of these mechanisms share the same
flavor: the sleep condition is protected by
some kind of lock dropped atomically during sleep.
The implementation of
wakeup
wakes up all processes that are waiting on a particular channel, and it might be
the case that many processes are waiting for that particular channel. The
operating system will schedule all these processes and they will race to check
the sleep condition. Processes that behave in this way are sometimes called a
thundering herd,
and it is best avoided.
Most condition variables have two primitives for
wakeup
:
signal
,
which wakes up one process, and
broadcast
,
which wakes up all waiting processes.
Semaphores are often used for synchronization. The count typically corresponds to something like the number of bytes available in a pipe buffer or the number of zombie children that a process has. Using an explicit count as part of the abstraction avoids the “lost wakeup” problem: there is an explicit count of the number of wakeups that have occurred. The count also avoids the spurious wakeup and thundering herd problems.
Terminating processes and cleaning them up introduces much complexity in xv6.
In most operating systems it is even more complex, because, for example, the
victim process may be deep inside the kernel sleeping, and unwinding its
stack requires care, since each function on the call stack
may need to do some clean-up. Some languages help out by providing
an exception mechanism, but not C.
Furthermore, there are other events that can cause a sleeping process to be
woken up, even though the event it is waiting for has not happened yet. For
example, when a Unix process is sleeping, another process may send a
signal
to it. In this case, the
process will return from the interrupted system call with the value -1 and with
the error code set to EINTR. The application can check for these values and
decide what to do. Xv6 doesn’t support signals and this complexity doesn’t arise.
Xv6’s support for
kill
is not entirely satisfactory: there are sleep loops
which probably should check for
p->killed
.
A related problem is that, even for
sleep
loops that check
p->killed
,
there is a race between
sleep
and
kill
;
the latter may set
p->killed
and try to wake up the victim just after the victim’s loop
checks
p->killed
but before it calls
sleep
.
If this problem occurs, the victim won’t notice the
p->killed
until the condition it is waiting for occurs. This may be quite a bit later
or even never
(e.g., if the victim is waiting for input from the console, but the user
doesn’t type any input).
A real operating system would find free
proc
structures with an explicit free list
in constant time instead of the linear-time search in
allocproc
;
xv6 uses the linear scan for simplicity.