wonderfly Blog

UTLK - CHAPTER 1 Introduction

Kernel Architecture

Monolithic v.s. microkernels. Like most Unix kernels, Linux is monolithic: single binary in one address space. The major advantage a monolithic kernel has over a microkernel is performance, because there aren’t many inter-process messaging as the latter has. That said, microkernels have a few legitimate pros on its own:

  • Modularized design. Many kernel components are broken down to smaller binaries and inter-process communication requires well designed interfaces
  • Portability. Because the platform dependant portion is stripped down to a minimal (only the main kernel), it’s easier to port a microkernel to a different platform
  • Better memory performance. Unneeded kernel code isn’t loaded into memory, unlike a monolithic one which is always mapped at its entirety.

To achieve some of the above advantages, Linux offers modules:

  • Modules force good modular design. Interface between a module and the main kernel should be well designed.
  • Platform independence.
  • Frugal memory usage. Tiny systems like embedded devices can choose to not load all modules.
  • No performance penalty. Loading and unloading a module is much cheaper than creating and destroying a process like in the microkernel model.

An Overview of the Unix Filesystem

At the heart of the Unix operating system is its file system.

Files and Directories

A file is a sequence of bytes. A file system is a user level abstraction of the data stored on physical devices like a hard disk, because user processes don’t interact with devices directly.

A directory contains information about the files and directories beneath it.

A file can have multiple names, in the form of links. A file is only deleted when its number of links decreases to zero. Links have two major limitations:

  • Links to directories are not allowed. This is to prevent cycles the directory tree.
  • Links can not cross filesystems. There might be multiple of them mounted at the same time, for example from two hard drives, or two partitions on the same drive.

To mitigate those limitations, there exist soft links. They are called so to distinguish from the regular links. Soft links can be created for directories, and can cross file system boundaries. To create a soft link:

ln -s <target> <link>

File Types

  • Regular files, directories, symlinks
  • Block device files, character device files, pipes, sockets

File Descriptor and Inode

Unix distinguishes between the contents of a file, and the metadata of a file. The former is purely a sequence of bytes, while the latter is a data structure that holds information like creation time, owner, access rights, etc. The latter is often called the inode. Each file has its own inode.

Access Rights and File Mode

Owner, group, other, read, write, execute, suid, sgid, sticky. Note: the “sticky” bit on an executable tells the kernel not to release its code block from memory even when the process that executes it has terminated. It has been deprecated.

File-Handling System Calls

Opening a file

  fd = open(path, flag, mode)

The return value, fd, is the File Descriptor, which is an index to an open file object, which contains information about the current read offset (file pointer), pointers to kernel functions that the process can invoke, and etc.

Accessing an opened file

read, lseek.

Closing a file

close.

Renaming and deleting a file

These operations are actually on the directory that the file is in. To delete a file, use unlink(pathname). As can be tell from the function name, this decreases the link of the file by one. It won’t be deleted until the link number drops to zero.

An Overview of Unix Kernels

A quick overview of almost all aspects of a Unix kernel.

The Process/Kernel Model

CPUs have multiple execution states, two of them are “Kernel Mode” and “User Mode”. In Unix, user processes run in “User Mode”, while kernel threads run in “Kernel Mode”. In the majority of time, some user process is running on the CPU while the kernel is largely asleep, except for a few lightweight “kernel threads”. There are a few cases where the main kernel body will be run:

  • When a user process explicitly requests the kernel through a system call.
  • When some exception happens while executing a user process, e.g., page fault.
  • When a hardware device interrupts the CPU through a signal.

Process Implementation

The kernel isn’t a process, but a process manager. The implementation of a process involves the “pause” and “resume” of one. To pause a running process, its states are dumped into a data structure, the process descriptor. To resume, the kernel loads the CPU with those states, thus the process can continue running like it never stopped. The information that is dumped and loaded include:

  • The program counter (PC) and stack pointer (SP) registers
  • The general purpose registers
  • The floating point registers
  • The memory management registers used to keep track of memory accessed by the process

Reentrant Kernels

All Unix kernels are re-entrant. This means a kernel execution path can be suspended, while another takes over the CPU, and then resumed when the latter finishes.

Process Address Space

Every process runs in its private address space. When switched into kernel mode, it runs in its private kernel address space.

There are a couple of exceptions where the same memory region may be shared among processes:

  • A commonly used program’s code section is loaded into memory and shared among all processes that use it. Examples are editors, and shared libraries. When this happens it is usually done automatically by the kernel without the user noticing.
  • Two user processes could explicitly request that a region of memory to be shared. This is enabled for Inter-Process Communication (IPC), and is usually referred to as “shared memory”.

  • User processes could use the mmap() system call to map a file, or a block device (essentially also a file) into its address space. This is usually to facilitate data read or write. The mapped file can be shared with other processes.

Signals and Interprocess Communication

Signals is a way for the kernel to notify user processes of system events: asynchronous events like a terminal interrupt (SIGINT) or synchronous events like failure to access a memory location (SIGSEGV). POSIX defines about 20 signals, two of which are user-definable. User processes can write code to handle the receipt of a signal, either to ignore it, or execute something asynchronously. If none is defined, the kernel has some default actions depending on the type of the signal.

While signals are one-way communication from the kernel to user processes, Unix provides another communication mechanism to allow two user processes to pass information to one another, and that is called Inter-Process Communication or IPC in short. Three main implementations are Shared Memory, Semaphores, and Message Queues. The kernel implements these constructs as IPC resources. Like files, IPC resources are persistent. User processes need to destroy them when unused.

Process Management

Note the distinction between a process and the program it executes. A process can be created with the fork() system call and terminated by _exit() while a program can be loaded into a newly created process by exec().

Zombie processes

When a process is terminated, it is put into the zombie status, until its parent calls the wait4() syscall or something similar, upon which the kernel will release the resources held by the child process.

If the parent process terminates without calling wait4() on its children, the kernel makes the init process parent of those child processes, which periodically calls wait4() on all its children.

Process groups and login sessions

A process group, or a job, is a convenient way to manage multiple processes that represent a single “job” abstraction. For example, $ ls | sort | more.

A login session is a higher level container that has information about all processes, or process groups, started in the same shell login session. Only one group of them can be in the foreground, and multiple can be in the background. The shell internal commands bg and fg can be used to toggle a process group between background and foreground.

Memory Management

Memory management (MM) is the most complex activity in a Unix kernel. A third of this book talks about memory management. The essence of MM is to efficiently share a fixed amount of resource (memory frames) between a large number of concurrent processes.

Virtual memory

The concept of virtual memory makes sharing easier: each process thinks it is given a large, contiguous, private memory chunk to operate on. With the help of a dedicated hardware unit, the Memory Management Unit or MMU, whose sole purpose is to translate virtual memory addresses to physical memory addresses, virtual memory management can be very efficient.

A major problem that virtual memory must solve is fragmentation: though a memory allocation request should only fail when there is not enough free pages left, the kernel is often forced to use contiguous physical memory.

Random access memory usage

Editor’s note: I personally think this section can be better titled as “RAM partition”.

Modern Unix operating systems partition the entire physical memory space into two areas: a few megabytes reserved for the kernel code and static data structures (the kernel image), and the rest that can be used for:

  • To satisfy kernel requests for dynamic memory like buffers, file descriptors, etc.
  • To satisfy user process memory allocation requests, e.g., malloc().
  • Cache for disk contents or other buffered devices.

Kernel Memory Allocator

The Kernel Memory Allocator (KMA) is the subsystem that satisfies memory allocation requests from all parts of the system. A good implementation has to be fast, make efficient use of memory, minimize fragmentation, and work well with other memory management subsystems. There are a few algorithms, and the Linux kernel uses a Slab allocator on top of a buddy system.

Process virtual address space handling

An address space is the set of addresses assigned to a process. When the kernel does this assignment, it assigns a list of address regions such as code region, data region (uninitialized and initialized), shared memory, and the heap. User processes can expand their address space by calling malloc() or brk().

Copy-on-write. As mentioned earlier, a process is created by forking its parent process. To minimize page duplication, the parent’s page frames are assigned to the child process but with read-only permission. When the child first tries to write to one of the pages, it gets duplicated and a new frame is assigned to the child.

Caching

Disks are very slow compared with RAM, and it is not uncommon that a process may read data once read or written by a process that doesn’t exist any more. Caching disk contents in RAM can save a lot of time reading disks.

The sync() syscall exists for explicit flushing of “dirty” pages (cache pages that differ from its on-disk copy) on to disks.

Device Drivers

The last but not least part of a Unix operating system is device drivers. Linux has a good separation between the device driver code and the rest of the kernel code, through a well defined interface. This interface makes it possible for the kernel to access ALL drivers in a uniform way, and allows device manufactures to add new devices without knowing the kernel source code.