4.6 Page-fault exceptions
Xv6’s response to exceptions is quite boring: if an exception happens in user space, the kernel kills the faulting process. If an exception happens in the kernel, the kernel panics. Real operating systems often respond in much more interesting ways.
As an example, many kernels use page faults to implement copy-on-write (COW) fork. To explain copy-on-write fork, consider xv6’s fork, described in Chapter 3. fork causes the child’s initial memory content to be the same as the parent’s at the time of the fork. Xv6 implements fork with uvmcopy (kernel/vm.c:313), which allocates physical memory for the child and copies the parent’s memory into it. It would be more efficient if the child and parent could share the parent’s physical memory. A straightforward implementation of this would not work, however, since it would cause the parent and child to disrupt each other’s execution with their writes to the shared stack and heap.
Parent and child can safely share physical memory by
appropriate use of page-table permissions and page faults.
The CPU raises a page-fault exception when a virtual address is used that has no mapping in the page table, or has a mapping whose PTE_V flag is clear, or a mapping whose permission bits (PTE_R, PTE_W, PTE_X, PTE_U) forbid the operation being attempted. RISC-V distinguishes three kinds of page fault: load page faults (caused by load instructions), store page faults (caused by store instructions), and instruction page faults (caused by fetches of instructions to be executed). The scause register indicates the type of the page fault and the stval register contains the address that couldn’t be translated.
The basic plan in COW fork is for the parent and child to initially share all physical pages, but for each to map them read-only (with the PTE_W flag clear).
Parent and child can read from the shared physical memory.
If either writes a given page,
the RISC-V CPU raises a page-fault exception.
The kernel’s trap handler responds by allocating a
new page of physical memory and copying into it
the physical page that the faulted address maps to.
The kernel changes the relevant PTE in the faulting process’s page
table to point to the copy and to allow writes as well as reads,
and then resumes the faulting
process at the instruction that caused the fault. Because the
PTE now allows writes, the re-executed instruction
will execute without a fault. Copy-on-write requires book-keeping
to help decide when physical pages can be freed, since each page can
be referenced by a varying number of page tables depending on the history of
forks, page faults, execs, and exits. This book-keeping allows
an important optimization: if a process incurs a store page
fault and the physical page is only referred to from that process’s
page table, no copy is needed.
Copy-on-write makes fork faster, since fork need not copy memory. Some of the memory will have to be copied later, when written, but it’s often the case that most of the memory never has to be copied. A common example is fork followed by exec: a few pages may be written after the fork, but then the child’s exec releases the bulk of the memory inherited from the parent. Copy-on-write fork eliminates the need to ever copy this memory. Furthermore, COW fork is transparent: no modifications to applications are necessary for them to benefit.
The combination of page tables and page faults opens up a wide range
of interesting possibilities in addition to COW fork. Another
widely-used feature is called lazy allocation, which has
two parts. First, when an application asks for more memory by calling
sbrk, the kernel notes the increase in size, but does not
allocate physical memory and does not create PTEs for the new range of
virtual addresses. Second, on a page fault on one of those new
addresses, the kernel allocates a page of physical memory and maps it
into the page table.
As with COW fork, the kernel can implement lazy allocation transparently to applications.
Since applications often ask for more memory than they need, lazy
allocation is a win: the kernel doesn’t have to do any work at all
for pages that the application never uses.
Furthermore, if the application is
asking to grow the address space by a lot, then sbrk
without lazy allocation is
expensive: if an application asks for a gigabyte of memory,
the kernel has to allocate and zero 262,144 4096-byte pages.
Lazy allocation allows this cost
to be spread over time. On the other hand, lazy allocation
incurs the extra
overhead of page faults, which involve a user/kernel transition.
Operating systems can reduce this cost by allocating a batch of
consecutive pages per
page fault instead of one page and by specializing the kernel
entry/exit code for such page faults.
Yet another widely-used feature that exploits page faults is
demand paging. In exec, xv6 loads all
of an application’s text
and data into memory before starting
the application. Since applications
can be large and reading from disk takes time, this startup cost can
be noticeable to users. To
decrease startup time, a modern kernel doesn’t initially load
the executable file into memory, but just creates the user page table with
all PTEs marked invalid. The kernel starts the program running;
each time the program uses a page for the first time, a page
fault occurs, and in response
the kernel reads the content of the page from disk and
maps it into the user address space. As with COW fork and lazy allocation, the kernel can implement this feature transparently to applications.
The programs running on a computer may need more memory than the computer has RAM. To cope gracefully, the operating system may implement paging to disk. The idea is to store only a fraction of user pages in RAM, and to store the rest on disk in a paging area. The kernel marks PTEs that correspond to memory stored in the paging area (and thus not in RAM) as invalid. If an application tries to use one of the pages that has been paged out to disk, the application will incur a page fault, and the page must be paged in: the kernel trap handler will allocate a page of physical RAM, read the page from disk into the RAM, and modify the relevant PTE to point to the RAM.
What happens if a page needs to be paged in, but there is no free physical RAM? In that case, the kernel must first free a physical page by paging it out or evicting it to the paging area on disk, and marking the PTEs referring to that physical page as invalid. Eviction is expensive, so paging performs best if it’s infrequent: if applications use only a subset of their memory pages and the union of the subsets fits in RAM. This property is often referred to as having good locality of reference. As with many virtual memory techniques, kernels usually implement paging to disk in a way that’s transparent to applications.
Computers often operate with little or no free physical memory, regardless of how much RAM the hardware provides. For example, cloud providers multiplex many customers on a single machine to use their hardware cost-effectively. As another example, users run many applications on smart phones in a small amount of physical memory. In such settings allocating a page may require first evicting an existing page. Thus, when free physical memory is scarce, allocation is expensive.
Lazy allocation and demand paging are particularly advantageous when free memory is scarce and programs actively use only a fraction of their allocated memory. These techniques can also avoid the work wasted when a page is allocated or loaded but either never used or evicted before it can be used.
Other features that combine paging and page-fault exceptions include automatically extending stacks and memory-mapped files, which are files that a program mapped into its address space using the mmap system call so that the program can read and write them using load and store instructions.