8.14 Code: System calls

With the functions that the lower layers provide, the implementation of most system calls is trivial (see (kernel/sysfile.c)). There are a few calls that deserve a closer look.

The functions sys_link and sys_unlink edit directories, creating or removing references to inodes. They are another good example of the power of using transactions. sys_link (kernel/sysfile.c:124) begins by fetching its arguments, two strings old and new (kernel/sysfile.c:129). Assuming old exists and is not a directory (kernel/sysfile.c:133-136), sys_link increments its ip->nlink count. Then sys_link calls nameiparent to find the parent directory and final path element of new (kernel/sysfile.c:149) and creates a new directory entry pointing at old ’s inode (kernel/sysfile.c:152). The new parent directory must exist and be on the same device as the existing inode: inode numbers only have a unique meaning on a single disk. If an error like this occurs, sys_link must go back and decrement ip->nlink.

Transactions simplify the implementation because it requires updating multiple disk blocks, but we don’t have to worry about the order in which we do them. They either will all succeed or none. For example, without transactions, updating ip->nlink before creating a link, would put the file system temporarily in an unsafe state, and a crash in between could result in havoc. With transactions we don’t have to worry about this.

sys_link creates a new name for an existing inode. The function create (kernel/sysfile.c:246) creates a new name for a new inode. It is a generalization of the three file creation system calls: open with the O_CREATE flag makes a new ordinary file, mkdir makes a new directory, and mkdev makes a new device file. Like sys_link, create starts by calling nameiparent to get the inode of the parent directory. It then calls dirlookup to check whether the name already exists (kernel/sysfile.c:256). If the name does exist, create’s behavior depends on which system call it is being used for: open has different semantics from mkdir and mkdev. If create is being used on behalf of open (type == T_FILE) and the name that exists is itself a regular file, then open treats that as a success, so create does too (kernel/sysfile.c:260). Otherwise, it is an error (kernel/sysfile.c:261-262). If the name does not already exist, create now allocates a new inode with ialloc (kernel/sysfile.c:265). If the new inode is a directory, create initializes it with . and .. entries. Finally, now that the data is initialized properly, create can link it into the parent directory (kernel/sysfile.c:278). create, like sys_link, holds two inode locks simultaneously: ip and dp. There is no possibility of deadlock because the inode ip is freshly allocated: no other process in the system will hold ip ’s lock and then try to lock dp.

Using create, it is easy to implement sys_open, sys_mkdir, and sys_mknod. sys_open (kernel/sysfile.c:305) is the most complex, because creating a new file is only a small part of what it can do. If open is passed the O_CREATE flag, it calls create (kernel/sysfile.c:320). Otherwise, it calls namei (kernel/sysfile.c:326). create returns a locked inode, but namei does not, so sys_open must lock the inode itself. This provides a convenient place to check that directories are only opened for reading, not writing. Assuming the inode was obtained one way or the other, sys_open allocates a file and a file descriptor (kernel/sysfile.c:344) and then fills in the file (kernel/sysfile.c:356-361). Note that no other process can access the partially initialized file since it is only in the current process’s table.

Chapter 7 examined the implementation of pipes before we even had a file system. The function sys_pipe connects that implementation to the file system by providing a way to create a pipe pair. Its argument is a pointer to space for two integers, where it will record the two new file descriptors. Then it allocates the pipe and installs the file descriptors.