wonderfly Blog

Event Loops

Event loops have been something I’ve always wanted to take a deeper look at, since they appear in so many places, from networking servers to system daemons. In this blog I will do a case study of some popular event loop implementations and summarize common patterns.

NGINX

Design

Nginx achieves its high performance and scalability by multiplexing a single process across hundreds of thousands of connections, as opposed to using one process/thread per connection as traditional servers used to do. The reason is that a connection is cheap (a file descriptor plus a small amount of memory) while a process is much more heavyweight. A process is heavier not only in the sense that a process object takes more memory, but also because context switches between processes are costly (TLB flushes, cache pollution).

Nginx’s design choice is to use no more than one (worker) process per CPU core, and to run an event loop in each of these processes. The events are changes happening on the connections assigned to each process (data available for read or write, a timeout, or the connection being closed). They are put into a queue by the operating system. The worker process takes one event at a time, processes it, and moves on to the next. When there is nothing to do, it sleeps and relinquishes the CPU. This way, if there is traffic on most connections, all CPUs can be busy processing packets and thus run at full capacity (the best you can hope for as far as performance tuning goes). Unlike the traditional one-process-per-connection model, with Nginx the CPUs spend most of their time doing meaningful work, as opposed to useless overhead such as context switches.

Note that Nginx assumes that the processing of each event is fast, or at least non-blocking. When that is not the case, for example when a disk read is needed, it uses separate threads to process the event asynchronously, without blocking other events in the queue that are fast to process. Since the operating system kernel usually handles the heavy lifting, like transmitting packets on the wire or reading data from disk into the page cache, by the time the Nginx worker process is notified the data is usually readily available, and processing it is a matter of copying it from one buffer to another.

Implementation

The main loop

Nginx supports a wide variety of operating systems, but in this blog we will only look at Linux. The Linux version of the main loop is implemented by the ngx_worker_process_cycle function in src/os/unix/ngx_process_cycle.c. As we can see, the main body of the function is an infinite for loop, which terminates only when the process itself is terminating. In the loop, other than some edge-condition checking, the guts of it is a call to ngx_process_events_and_timers. Let’s ignore the timer handling part for now. The core of that function is a call to ngx_process_events, which is a macro defined in src/event/ngx_event.h. As we can see, it merely points to the process_events field of the ngx_event_actions external variable.

Nginx supports many operating systems, and it supports even more event notification mechanisms (select, epoll, kqueue, etc.). For an analysis of these various options, see The C10K problem. Nginx refers to these options as “modules”, and has an implementation for each in the src/event/modules subdirectory. Again, for brevity, we will only examine the epoll based solution here: ngx_epoll_module.c.

In this module, the process_events action is defined by the ngx_epoll_process_events function. It calls the epoll_wait system call with a statically assigned epoll fd, ep, and a list of events, event_list, to listen on. If the system call returns a non-negative value, some events have happened; they are stored in the passed-in event_list array, with their types saved in each entry’s events bitmask. The rest of the function processes the events one by one, then returns to the outer main loop, which will invoke another call to ngx_process_events. As we can see, events are processed sequentially, so it is important not to have blocking operations in the middle. As mentioned earlier, separate threads are used for those operations, hidden behind the handler of each event.
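
A minimal sketch of this pattern in plain C may make it concrete (this is an illustration of the epoll idiom, not Nginx’s actual code; the conn struct and its handler field are made up):

/* Sketch of the "wait, then dispatch sequentially" pattern described above. */
#include <errno.h>
#include <stdint.h>
#include <sys/epoll.h>

struct conn {
    int fd;
    void (*handler)(struct conn *c, uint32_t revents);  /* must not block */
};

void process_events(int ep)
{
    struct epoll_event event_list[512];

    for (;;) {
        /* Sleep until at least one registered fd has activity. */
        int n = epoll_wait(ep, event_list, 512, -1);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted by a signal, try again */
            return;         /* real error, let the caller decide */
        }
        for (int i = 0; i < n; i++) {
            /* data.ptr was set when the connection was added with EPOLL_CTL_ADD. */
            struct conn *c = event_list[i].data.ptr;
            c->handler(c, event_list[i].events);
        }
    }
}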

Events

The main event interfaces are declared in src/event/ngx_event.h and implemented in src/event/ngx_event.c. The two most important data structures are ngx_event_s and ngx_event_actions_t. The former defines a shared data structure that works with many different underlying event types (select, epoll, etc.). The latter defines the methods on an event.

Various event types are registered at various places throughout the code base. As one example, the events of the “listening sockets” are registered by ngx_enable_accept_events. It loops through the array of all listening sockets and registers a read event for each of them.

Systemd

Event sources

While Nginx supports many notification mechanisms, sd-event supports many more event types: I/O, timers, signals, child process state changes, and inotify. See the event source type definitions in sd-event.c. All these event sources are represented by a uniform data structure, sd_event_source, which has a big union field covering the eight possible types, one member per event source type. Each event type has its own specific fields, but all of them have an event handler, named sd_event_x_handler_t (where x is the event type).

Systemd is able to handle this many event types thanks to Linux APIs such as timerfd and signalfd, which expose timers and signals as file descriptors that can be multiplexed with epoll, plus waitid for child processes.
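
To illustrate why these APIs make a single epoll loop sufficient, here is a hedged sketch (not systemd code) that turns a periodic timer and SIGTERM into file descriptors and registers both with epoll; error handling is omitted:

/* Sketch: expose a periodic timer and SIGTERM as fds and add both to one
 * epoll instance. Not systemd code; error handling omitted for brevity. */
#include <signal.h>
#include <time.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/timerfd.h>

int setup_loop_fd(void)
{
    int ep = epoll_create1(0);

    /* A timer that fires once per second, delivered as a readable fd. */
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    struct itimerspec spec = { .it_interval = { 1, 0 }, .it_value = { 1, 0 } };
    timerfd_settime(tfd, 0, &spec, NULL);

    /* SIGTERM delivered as readable data instead of an asynchronous handler. */
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGTERM);
    sigprocmask(SIG_BLOCK, &mask, NULL);    /* must block the signal first */
    int sfd = signalfd(-1, &mask, SFD_NONBLOCK);

    struct epoll_event ev = { .events = EPOLLIN };
    ev.data.fd = tfd; epoll_ctl(ep, EPOLL_CTL_ADD, tfd, &ev);
    ev.data.fd = sfd; epoll_ctl(ep, EPOLL_CTL_ADD, sfd, &ev);

    return ep;  /* ready to be passed to epoll_wait() */
}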

Event loop

Event loop objects are of type sd_event, which holds a reference counter, an fd used for epoll, another fd for the watchdog, a few timestamps, a hash map for each event source type, three priority queues (pending, prepare and exit), and lastly a counter for the last iteration. To get a new loop object, sd_event_new can be called. However, since it is recommended that each thread own at most one loop, the helper sd_event_default is preferred: it creates a new loop object only when the current thread doesn’t already have one.

Working with events

An event loop is allocated by sd_event_new. As can be seen from its implementation, three objects are allocated: the loop object of type sd_event, a priority queue for pending events, and an epoll fd. Events can be added to a given loop via one of the sd_event_add_xxx methods, together with a callback function that will be called when an event of that type is dispatched. Added events are put into a priority queue. An event loop can be run for one iteration with sd_event_run, or looped until there are no more events with sd_event_loop. At each iteration an event is taken out of the priority queue and dispatched by calling the callback registered in the corresponding sd_event_add_xxx call.
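
A hedged sketch of that flow using the public sd-event API (a made-up example that watches stdin; the callback name is hypothetical; compile with -lsystemd):

/* Sketch: watch stdin with sd-event and run the loop until EOF. */
#include <stdio.h>
#include <unistd.h>
#include <sys/epoll.h>
#include <systemd/sd-event.h>

static int on_stdin(sd_event_source *s, int fd, uint32_t revents, void *userdata)
{
    char buf[256];
    ssize_t n = read(fd, buf, sizeof(buf));
    if (n <= 0)  /* EOF or error: ask the loop to exit */
        return sd_event_exit(sd_event_source_get_event(s), 0);
    printf("read %zd bytes\n", n);
    return 0;
}

int main(void)
{
    sd_event *loop = NULL;
    sd_event_source *src = NULL;

    sd_event_default(&loop);   /* one default loop per thread */
    sd_event_add_io(loop, &src, STDIN_FILENO, EPOLLIN, on_stdin, NULL);
    sd_event_loop(loop);       /* iterate until sd_event_exit() is called */

    sd_event_source_unref(src);
    sd_event_unref(loop);
    return 0;
}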

Libevent

We’ve so far looked at two event loop implementations specialized for use in a bigger project. Now let’s look at some generic event loop libraries that do the job just fine, so you won’t have to reinvent the wheel if you need an event loop in your project.

libevent is a generic event loop library written in C. As its documentation says, it allows a callback function to be called when a certain event happens on a file descriptor or when a timeout expires. It also supports callbacks due to signals and regular timeouts.

As with nginx, libevent supports most of the available event notification mechanisms mentioned in the C10K problem: /dev/poll, kqueue, select, poll, epoll, and Windows select. Because of its great portability, its user list includes a number of high-profile C/C++ projects: Chromium, Memcached, NTP, and tmux, just to name a few.

Another major contribution of the libevent project is that it produced a number of benchmarks comparing performance across the various event notification mechanisms.

APIs

Event Loop

In order to abstract away the underlying event notification API, libevent provides a struct event_base for the main loop, as opposed to an fd or an fd set as used by select and friends. A new event base can be created by:

struct event_base *event_base_new(void);

Unlike the other event loop implementations mentioned so far, libevent allows the creation of “advanced” loops by using a configuration object, event_config. Each event notification mechanism supported by an event base is called a method, or a backend. To query the supported methods, use

const char **event_get_supported_methods(void);

Like sd-event, libevent also supports event priorities. To set the priority levels in an event base, use:

int event_base_priority_init(struct event_base *base, int n_priorities);

Once the event base is set up and all ready to go, you can run it by

int event_base_loop(struct event_base *base, int flags);

The function above runs the event base until there are no more events registered in it. If you will be adding events from another thread and want to keep the loop running even when there are temporarily no active events, there is the flag EVLOOP_NO_EXIT_ON_EMPTY. When the flag is set, the event base will keep running the loop until somebody calls event_base_loopbreak() or event_base_loopexit(), or an error occurs.

Events

An event in libevent is simply an fd plus flags specifying what to watch on that fd: read, write, timeout, signal, etc.

About event states, this is what the libevent documentation has to say:

Events have similar lifecycles. Once you call a Libevent function to set up
an event and associate it with an event base, it becomes initialized. At
this point, you can add it, which makes it pending in the base. When the event
is pending, if the conditions that would trigger an event occur (e.g., its
file descriptor changes state or its timeout expires), the event becomes
active, and its (user-provided) callback function is run. If the event is
configured persistent, it remains pending. If it is not persistent, it stops
being pending when its callback runs. You can make a pending event
non-pending by deleting it, and you can add a non-pending event to make it
pending again.

As mentioned earlier, libevent supports event priorities. To set the priority of an event, use:

int event_priority_set(struct event *event, int priority);
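
Putting the pieces together, a minimal (hypothetical) libevent program looks roughly like this; the callback names are made up, and it compiles with -levent:

/* Sketch: a libevent loop that watches stdin and exits on SIGINT. */
#include <signal.h>
#include <stdio.h>
#include <event2/event.h>

static void on_read(evutil_socket_t fd, short what, void *arg)
{
    printf("fd %d is readable\n", (int)fd);
}

static void on_sigint(evutil_socket_t sig, short what, void *arg)
{
    event_base_loopbreak((struct event_base *)arg);  /* leave the loop */
}

int main(void)
{
    struct event_base *base = event_base_new();

    /* EV_PERSIST keeps the event pending after each callback run. */
    struct event *rd  = event_new(base, 0 /* stdin */, EV_READ | EV_PERSIST, on_read, NULL);
    struct event *sig = evsignal_new(base, SIGINT, on_sigint, base);

    event_add(rd, NULL);   /* NULL: no timeout */
    event_add(sig, NULL);

    event_base_dispatch(base);  /* same as event_base_loop(base, 0) */

    event_free(rd);
    event_free(sig);
    event_base_free(base);
    return 0;
}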

Utilities

The libevent library also provides a number of helper functions that form a compatibility layer where the underlying operating systems differ: integer types, the socket API, random number generation, etc. It also provides convenient helpers for common networking server patterns such as “listen, accept, callback”, buffered I/O, DNS name resolution, and so on.

Implementation

The main data structure for events is struct event. As you can see, it is focused on networking I/O and doesn’t define as many event source types as sd-event. Its main body contains a callback, an fd (the socket, or the signal number), a bitmask of events (read, write, etc.), and a union of ev_io and ev_signal.

The event loop, event_base, is defined in event-internal.h. Its core is a backend object of type struct eventop, a map from monitored fds to the events on them, and a map from signals to events.

eventop is the uniform interface over the various event notification mechanisms, very much like nginx’s ngx_event_actions. Each method/backend implements the common set of functions in its own file - epoll.c, select.c, etc. For example, this is the epoll based implementation:

const struct eventop epollops = {
  "epoll",                 /* name of the backend */
  epoll_init,              /* set up the backend */
  epoll_nochangelist_add,  /* start watching an fd */
  epoll_nochangelist_del,  /* stop watching an fd */
  epoll_dispatch,          /* wait for events and deliver them */
  epoll_dealloc,           /* tear down the backend */
  1, /* need reinit (after fork) */
  EV_FEATURE_ET|EV_FEATURE_O1|EV_FEATURE_EARLY_CLOSE,  /* supported features */
  0  /* no per-fd extra data */
};

And a new event is added via epoll_nochangelist_add, which calls epoll_ctl with EPOLL_CTL_ADD under the hood.

Libev

Libev claims to be another event loop implementation, loosely modeled after libevent, with better performance, more event source types (child PIDs, file paths, wall-clock based timers), simpler code, and bindings in multiple languages. As with libevent, it supports a variety of event notification mechanisms, including select, epoll, and others.

Libev’s API is documented at http://pod.tst.eu/http://cvs.schmorp.de/libev/ev.pod. Extensive documentation is listed as one of the advantages of libev over libevent. At a glance, its APIs are very similar to libevent’s.
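
For comparison, a minimal (hypothetical) libev program that watches stdin looks like this; it compiles with -lev:

/* Sketch: watch stdin with libev and stop after the first read event. */
#include <stdio.h>
#include <ev.h>

static void stdin_cb(struct ev_loop *loop, ev_io *w, int revents)
{
    printf("stdin is readable\n");
    ev_io_stop(loop, w);          /* stop watching ... */
    ev_break(loop, EVBREAK_ALL);  /* ... and leave the loop */
}

int main(void)
{
    struct ev_loop *loop = EV_DEFAULT;  /* the default loop, like sd_event_default */
    ev_io stdin_watcher;

    ev_io_init(&stdin_watcher, stdin_cb, 0 /* stdin */, EV_READ);
    ev_io_start(loop, &stdin_watcher);

    ev_run(loop, 0);  /* blocks until ev_break() */
    return 0;
}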

On to the implementation. I was shocked when I first opened its source directory - there is only one level of directories! All the code lives in the 10+ files in the root directory. It is indeed a small code base; I suppose it’s designed to be embedded in small applications.

Summary

As you can see from the case studies above, an event loop can be useful when:

  • you have to monitor changes on multiple sources, and
  • processing of each change (event) is fast

The implementation of an event loop would include:

  • One or more event sources that put events in a queue
  • A cycle that dequeues one element at a time and processes it


UTLK - CHAPTER 3 Processes

Processes, Lightweight Processes, and Threads

In this book, the term “process” refers to an instance of a program in execution. From the kernel’s point of view, a process is an entity to which system resources (CPU time, memory, etc.) are allocated.

In older Unix systems, when a process is created, it shares the same code as its parent (the text segment) while having its own data segment, so that changes it makes won’t affect the parent and vice versa. Modern Unix systems need to support multithreaded applications, i.e., multiple execution flows of the same program that share some sections of the data segment. In earlier Linux kernel versions, multithreaded applications were implemented in User Mode, by the POSIX threading libraries (pthread).

In newer Linux versions, Lightweight Processes (LWPs) are used to create threads. LWPs are processes that can share some resources, like open files, the address space, and so on. Whenever one of them modifies a shared resource, the others immediately see the change. A straightforward way to implement multithreaded applications is to associate an LWP with each user thread, and many pthread libraries do exactly that.

Thread Groups. Linux uses thread groups to manage a set of LWPs, which together act as a whole with regard to some system calls like getpid(), kill(), and _exit(). We are going to discuss these at length later, but basically these syscalls all use the thread group leader’s PID as the process ID, treating a group of LWPs as one process.

Process Descriptor

The process descriptor is the data structure that encodes everything the kernel needs to know about a process - whether it is running on a CPU or blocked on some event, its scheduling priority, the address space allocated to it, its open files, etc. In Linux, the type task_struct is defined for process descriptors (figure 3-1). The six data structures on the right hand side refer to specific resources owned by the process and will be covered in future chapters. This chapter focuses on two topics: process state and process parent/child relationships.

Figure 3-1. The Linux Process Descriptor

Process State

This field describes the state of the process - running, stopped, etc. The current implementation defines it as an array of flags (bits), each describing a possible state. The states are mutually exclusive, hence at any given time only one flag is set and all the others are cleared. Possible states (source):

  • TASK_RUNNING. The process is either running on a CPU or waiting to be executed.
  • TASK_INTERRUPTIBLE. The process is suspended (sleeping) until some condition is met - a hardware interrupt (e.g., from the disk controller), a lock being released, a signal being received, etc.
  • TASK_UNINTERRUPTIBLE. Similar to the above, except that delivering a signal to the sleeping process leaves its state unchanged. That is, it can be woken up only when a certain condition is met, say, a hardware interrupt.
  • TASK_STOPPED. Process execution has been stopped, upon receiving SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU.
  • TASK_TRACED. Process execution has been stopped by a debugger - think ptrace.

Apart from these five states, the state field (and the exit_state field) can hold two additional values. As their names suggest, a process can reach one of these states only when it is terminated:

  • EXIT_ZOMBIE. The process is terminated, but its parent hasn’t called wait4() or waitpid() yet. This state is important because, per a Unix design principle, the kernel cannot discard a process’s descriptor until its parent has issued a wait()-like syscall. This will be detailed later in this chapter.
  • EXIT_DEAD. The final state: the process is being removed from the system because its parent has called wait4() or waitpid() for it. The next phase of the process’ life is the descriptor being deleted. So you may ask, “why not just delete the descriptor already?”. Well, having this last state is useful to avoid race conditions due to other threads that execute wait()-like calls on the same process (see Chapter 5).

The value of the state field is usually set with a simple assignment, e.g.:

p->state = TASK_RUNNING;

The kernel also uses the set_task_state and set_current_state macros to set the state of a specified process or the current running process. These macros also ensure atomicity of these instructions. See Chapter 5 for more details on this.
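
As a rough illustration of how these macros are used, here is a simplified sketch of the classic “sleep until a condition holds” idiom (not copied from any particular driver; condition is a placeholder, and real code usually goes through wait queue helpers):

/* Simplified sketch of the canonical kernel wait pattern. */
for (;;) {
    set_current_state(TASK_INTERRUPTIBLE);  /* mark ourselves as sleeping first ... */
    if (condition)                          /* ... then re-check, to avoid lost wakeups */
        break;
    schedule();                             /* give up the CPU until someone wakes us */
}
set_current_state(TASK_RUNNING);            /* condition met, we are runnable again */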

Identifying a Process

Process descriptor pointers. Each execution context that can be scheduled independently must have its own process descriptor; even lightweight processes (threads) have their own task_struct. The strict one-to-one mapping between processes and process descriptors makes the address of the task_struct structure a useful means for the kernel to identify processes. These addresses are referred to as process descriptor pointers.

PIDs. On the other hand, users like to refer to a process by an integer called the Process ID, or PID. To satisfy this requirement the kernel stores the PID in the pid field of the task_struct. PIDs are numbered sequentially and there is an upper limit (determined by /proc/sys/kernel/pid_max). When the kernel reaches this limit, it must start recycling the lower, unused PIDs. If the rate of process/thread creation exceeds the rate of PID recycling, process creation will fail, likely with an EAGAIN (-11) error. The default for /proc/sys/kernel/pid_max is usually 32K (32,768), and it can be raised up to 4,194,303 on a 64-bit system.

pidmap_array. When recycling PID numbers, the kernel needs to know which of the 32K numbers are currently assigned and which are unused. It uses a bitmap for that, called the pidmap_array. On 32-bit systems, a page (4KB) contains 32K bits, which is exactly enough to store the pidmap_array. On 64-bit systems, multiple pages might be needed, depending on the value of /proc/sys/kernel/pid_max. These pages are never released.

Threads. Each lightweight process has its own PID. To satisfy the user application expectation that a multithreaded application should have only one PID, the kernel groups lightweight processes that belong to the same application into a thread group, and makes system calls like getpid() return the PID of the thread group leader. In each LWP’s task_struct, the tgid field stores the PID of the thread group leader; for the thread group leader itself, that value is equal to its own pid.
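
This is easy to observe from user space: in a multithreaded program every thread reports the same getpid() (the tgid) but a different gettid(). A small hedged demo (needs glibc 2.30+ for gettid(); compile with -pthread):

/* Demo: getpid() is shared by the whole thread group, gettid() is per LWP. */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>

static void *worker(void *arg)
{
    printf("thread: pid=%d tid=%d\n", (int)getpid(), (int)gettid());
    return NULL;
}

int main(void)
{
    pthread_t t;
    printf("main:   pid=%d tid=%d\n", (int)getpid(), (int)gettid());  /* equal for the leader */
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);
    return 0;
}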

PID to process descriptor pointer. Many system calls like kill() use PID to denote the affected process. It is thus important to make the process descriptor lookup efficient. As we will see soon enough, Linux has a few tricks to do that.

Relationships Among Processes

UTLK - CHAPTER 2 Memory Addressing

Note: The subtitles in this chapter are reorganized for the ease of my own understanding. I find them more logically ordered this way than they are in the book.

Memory Addresses

This chapter focuses on the Intel 80x86 microprocessor memory addressing techniques, but they should generally apply to most other processors too.

At the highest level, you have a great number of memory cells that can be used to store instructions to be executed by the CPU and data to be read by the CPU. You need to tell the CPU how to access the contents of each cell, so you give them “addresses”, just like every house in America has an address so mail can be delivered. Each memory address typically points to a single byte in memory.

Segmentation

Intel thinks it is a good idea to split a big program into small logical segments, each storing part of the program. For example, there can be a code segment for the compiled instructions of the program, while all global and static variables may reside in a separate segment called the data segment. Similarly, there can be a stack segment designated for a process’ stack at runtime. Applied to the address space of a running process, this splits the entire memory into multiple segments too, each consisting of a chunk of addresses.

Segment Selector

With segmentation, a memory location is addressed by a pair of identifiers: (Segment Selector, Offset). The Segment Selector is a 16-bit number that identifies which of the many segments this address lies in, and the Offset tells the distance from the start of the selected segment.

Segment Descriptor

Other than a base address, to precisely describe a segment, at least its size needs to be known. There are other characteristics of a segment that the kernel finds useful, such as which privilege level is required to access addresses in the segment’s range, whether the segment holds data or code, etc. These details together are stored in a data structure called the Segment Descriptor.

Global Descriptor Table (GDT) and Local Descriptor Table (LDT)

Now that we have a set of segments, each with its descriptor, we need an array to store all these descriptors. This array is called a Descriptor Table. There is usually a Global Descriptor Table or GDT that’s shared by all processes (they use the same segments for addressing), and optionally a per-process table called the Local Descriptor Table.

Both tables are stored in memory themselves. To locate them, their addresses are stored in two registers: gdtr and ldtr, respectively.

Logical Address and Linear Address

In Intel’s design, all addresses used in programs are “Logical Addresses”, which are (Segment Selector, Offset) pairs. But the memory bus doesn’t understand these pairs, so there is a dedicated hardware component that translates logical addresses into Linear Addresses, which are 32-bit unsigned integers ranging from 0x00000000 to 0xffffffff, covering 4GB of memory. This hardware component is called the Segmentation Unit.

Segmentation Unit

The segmentation unit’s sole purpose is to translate logical addresses into linear addresses. It takes a (Segment Selector, Offset) pair, multiplies the selector’s index field by 8 (each descriptor in the Descriptor Table is 8 bytes long) and adds the product to the address in gdtr (or ldtr) to locate the descriptor of the selected segment, which gives the starting (base) address of that segment. Adding the Offset to that base then yields the linear address.
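
In pseudo-C the translation boils down to one table lookup and one addition; here is a simplified model that ignores limit checks, privilege checks and the TI bit that selects between GDT and LDT:

/* Simplified software model of the segmentation unit. */
#include <stdint.h>

struct segment_descriptor {
    uint32_t base;   /* starting linear address of the segment */
    uint32_t limit;  /* size of the segment (checks omitted here) */
    /* ... type, DPL and other attributes omitted ... */
};

uint32_t logical_to_linear(const struct segment_descriptor *gdt,  /* what gdtr points to */
                           uint16_t selector, uint32_t offset)
{
    uint32_t index = selector >> 3;  /* the 13-bit index field of the Segment Selector */
    /* On real hardware each descriptor is 8 bytes, so this is gdtr + index * 8. */
    const struct segment_descriptor *desc = &gdt[index];
    return desc->base + offset;      /* the linear address */
}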

Fast Address Translation

As you can see from the section above, each address translation involves one read of the Descriptor Table (to get the descriptor of the selected segment), which sits in memory. Because the descriptor of a segment rarely changes, translations can be sped up by loading the descriptor (8 bytes) into a dedicated register, so that subsequent translations don’t need to read memory, which is orders of magnitude slower than reading a register.

Segmentation in Linux

Though segmentation - splitting programs into logically related chunks - is more or less enforced by Intel processors, Linux doesn’t use it much. In fact, it defines barely a handful of segments even though the maximum allowed number is 2^13 (13 bits are used for the index in a Segment Selector).

The four main segments Linux uses are: the Kernel Code Segment, Kernel Data Segment, User Code Segment, and User Data Segment. The Segment Selector values of these are defined by the macros __KERNEL_CS, __KERNEL_DS, __USER_CS and __USER_DS. To address the kernel code segment, for example, the kernel simply loads the value defined by __KERNEL_CS into the cs segmentation register.

The descriptors of each segment are as follows:

Segment       Base        G   Limit     S   Type   DPL   D/B   P
user code     0x00000000  1   0xfffff   1   10     3     1     1
user data     0x00000000  1   0xfffff   1   2      3     1     1
kernel code   0x00000000  1   0xfffff   1   10     0     1     1
kernel data   0x00000000  1   0xfffff   1   2      0     1     1

As you can see, all four segments start at address 0x00000000 and span the entire 4GB linear address space. This is not a mistake: in Linux, a logical address always coincides with its linear address (the Offset). Again, Linux is not a big fan of segmentation and puts much more of its design effort into the alternative mechanism - paging - which we will talk about next.

The Linux GDT

As mentioned above, the GDT is an array of Segment Descriptor structs. In Linux, the array is defined as cpu_gdt_table, while its address and size are stored in the cpu_gdt_descr array. It is an array because there can be multiple GDTs: on multiprocessor systems each CPU has its own GDT. These values are used to initialize the gdtr register.

Each GDT is 32 entries long: 18 valid segment descriptors plus 14 null, unused, or reserved entries. The unused entries are placed so that segment descriptors that are usually accessed together end up in the same 32-byte hardware cache line.

The Linux LDTs

These are similar to GDTs and are not commonly used by Linux applications except for those that run Windows applications, e.g., Wine.

Paging in Hardware

Just as the segmentation unit translates logical addresses into linear addresses, another hardware unit, the paging unit, translates linear addresses into physical addresses.

Page, Page Frame and Page Table

For efficiency, linear addresses are grouped into fixed-length chunks called pages, and contiguous linear addresses within a page are mapped to contiguous physical addresses. Physical memory is divided into blocks of the same size, called page frames. Note that while pages are blocks of data, page frames are physical constituents of main memory.

The data structure that maps pages to page frames is called a page table.

Regular Paging

The word “regular” is relative to the extended or more advanced paging techniques that are required for special cases that we will cover later in this chapter. This section is about the fundamental idea of how paging is done in Intel hardware.

As previously mentioned, the paging unit translates pages as the smallest unit, not individual addresses. The default page size on Linux is 4KB (2^12 bytes), so the 20 most significant bits of a 32-bit address uniquely identify a page (the remaining 12 bits locate a byte within the page). A rudimentary way to implement paging would be to have a single table, indexed by those 20 bits, whose entries hold the corresponding physical page frames. However, that implies we would need 2^20 entries, or at least reserved space for that many entries if the implementation is a flat array, even when the process doesn’t use all the addresses in the range. With each entry taking 4 bytes (32 bits), such a table would take 4MB of memory per process. Of course there is room for improvement.

The idea is to split the 20 most significant bits into two 10-bit parts: the first 10 bits index into a top-level Page Directory, whose entries are physical addresses of second-level Page Tables, and the remaining 10 bits index into a particular Page Table. With demand paging (page frames are only assigned when actually requested), second-level Page Tables are added only as the process requests more memory. In the extreme case where the process uses the entire 4GB address space, the total RAM used for the Page Directory and all Page Tables is about 4MB, no more than with the basic scheme above.

This is called 2-level paging and you will see the idea getting generalized into multi-level paging on systems with bigger address spaces, e.g., 64 bit systems.

The physical address of the Page Directory is stored in the cr3 register. From there the paging unit gets the physical address of a Page Table, which in turn gives the physical address of the page frame the page is mapped to. Adding the 12 least significant bits of the linear address (the Offset) then yields the full 32-bit physical address.
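
The bit arithmetic can be written down explicitly; here is a simplified software model (it treats physical memory as a flat array of 32-bit words and ignores the Present flag and access checks; real hardware performs this walk itself, starting from cr3):

/* Simplified model of 32-bit two-level paging. */
#include <stdint.h>

uint32_t translate(const uint32_t *phys_mem,  /* model of RAM, indexed by 32-bit word */
                   uint32_t cr3, uint32_t linear)
{
    uint32_t dir_index   = (linear >> 22) & 0x3ff;  /* bits 31..22: Page Directory index */
    uint32_t table_index = (linear >> 12) & 0x3ff;  /* bits 21..12: Page Table index */
    uint32_t offset      =  linear        & 0xfff;  /* bits 11..0:  offset within the page */

    uint32_t pde = phys_mem[(cr3 >> 2) + dir_index];                /* Page Directory entry */
    uint32_t pte = phys_mem[((pde & ~0xfffu) >> 2) + table_index];  /* Page Table entry */

    return (pte & ~0xfffu) + offset;  /* page frame base + offset = physical address */
}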

Page Directory/Table Data Structure

In Linux, each Page Directory or Page Table entry is a 32-bit integer containing the following fields:

Present flag. If set, the referred-to page or Page Table is in main memory. If it is not set and the entry is used in a translation, the linear address being accessed is saved in the cr2 register and exception 14 (Page Fault) is raised. The kernel relies on page faults to assign additional page frames to a process.

20 most significant bits of the page frame’s physical address. Because of the 4KB page size, the 12 least significant bits of a frame’s physical address are always zero, so they don’t need to be stored.

Accessed flag. Whether or not the referred-to page or page table is accessed. This flag may be used to decide which pages to swap out in case the system is running low on memory.

Dirty flag. Whether or not the referred-to page or page table is written to. Similar use as the accessed flag.

Read/Write flag. Access right of the page frame or page table: read-only or writable.

User/Supervisor flag. Privilege needed to access the page or page table: user process or kernel.

PCD and PWT flags. Controls some hardware cache behaviors. Explained below at “Hardware Cache”.

Page Size flag. Controls the page size, if not 4KB. See “Extended Paging” for details.

Global flag. Controls TLB flush behaviors.
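
For reference, on 32-bit 80x86 these flags occupy the low bits of each entry while the frame address occupies the high 20 bits; a sketch of the usual layout (taken from the Intel conventions, not from the book’s text):

/* Usual bit positions of the 80x86 Page Directory/Table entry flags above. */
#define PTE_PRESENT   (1u << 0)
#define PTE_RW        (1u << 1)   /* Read/Write */
#define PTE_USER      (1u << 2)   /* User/Supervisor */
#define PTE_PWT       (1u << 3)   /* Page Write-Through */
#define PTE_PCD       (1u << 4)   /* Page Cache Disable */
#define PTE_ACCESSED  (1u << 5)
#define PTE_DIRTY     (1u << 6)
#define PTE_PSE       (1u << 7)   /* Page Size (meaningful in directory entries) */
#define PTE_GLOBAL    (1u << 8)
#define PTE_FRAME     0xfffff000u /* 20 most significant bits: page frame address */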

Extended Paging

On newer 80x86 processors, larger pages are supported (4MB instead of 4KB), so more bits are needed for the Offset and fewer for the Page Directory and Page Table indexes. In fact, since the Offset then takes 22 bits, the 10 most significant bits can all be used as an index into the Page Directory, eliminating the need for second-level Page Tables.

Hardware Protection Scheme

Different from the segmentation unit, which distinguishes four privilege levels (0, 1, 2, 3), the paging unit only distinguishes two: User and Supervisor. Furthermore, only two access rights are associated with pages: Read and Write.

Physical Address Extension (PAE) Paging Mechanism

The amount of RAM supported by a processor is limited by the number of address pins connected to the address bus. As servers grew, Intel extended the address pins of their 80x86 processors from 32 to 36, making them able to address up to 2^36 = 64GB of memory. To adapt to this change, Linux tweaked the number of bits used to index the Page Directory and Page Table and used as the Offset, and, when PAE is enabled, added an additional table above the Page Directory - the Page Directory Pointer Table. But the same general idea applies.

Paging for 64-bit Architectures

On 64-bit systems, there are many more address bits. It certainly doesn’t make sense to use all 64 - 12 = 52 most significant bits for page table indexes (assuming the usual 4KB page size); in practice fewer bits are used - a common choice is 48 bits of virtual address, which already gives a 256TB address space - together with multi-level paging. The actual split of bits varies from architecture to architecture.

Hardware Cache

Typical RAM access times are on the order of hundreds of clock cycles. That is a LOT of time compared to instruction execution. To remedy the huge speed mismatch between the CPU and memory, fast hardware cache memories were introduced. Based on the locality principle, data recently fetched from memory, and data adjacent to it, are kept in the cache. Over the years, multiple levels of cache have been added, each with different speed and cost characteristics.

A new unit was introduced along with the hardware cache: the line. It refers to a block of contiguous data in memory or in the cache. Data is always read into or evicted from the cache in multiples of a line. A line is usually a few dozen bytes.

On multiprocessor systems, each CPU has its own hardware cache. Keeping the caches synchronized is important, and it is usually done by dedicated hardware so the OS doesn’t have to care. One interesting feature of some caches, though, is that the hardware allows the cache write policy (Write-Through vs. Write-Back) to be selected by the OS. Linux always selects Write-Back.

Translation Lookaside Buffers (TLB)

Besides the general purpose caches, 80x86 processors include another cache called the Translation Lookaside Buffer (TLB), to speed up linear address translation. When a linear address is used for the first time, the corresponding physical address is computed through page table lookups and stored in the TLB for future reference. Note that because each process has its own page tables, the TLB is flushed on a context switch.

On multiprocessor systems, each CPU has its own TLB. Unlike the hardware caches, two TLBs don’t need to be synchronized, because they are being used by two processes that have different page tables anyway.

Paging in Linux

Linux adopts a common paging model that fits both 32-bit and 64-bit architectures. A four-level paging model is used, and consists of the following:

  • Page Global Directory
  • Page Upper Directory
  • Page Middle Directory
  • Page Table

For 32-bit architectures, two paging levels are sufficient so the Page Upper Directory and Page Middle Directory are eliminated - by saying that they contain zero bits.

Linux’s handling of processes relies heavily on paging. The automatic linear to physical address translation makes a few design objectives feasible:

  • Assign a different physical address space to each process, ensuring isolation.
  • Distinguish pages from page frames. This allows the same page to be written to a page frame, or swapped to disk, and then read back into memory at a different page frame.

Process Switch. As far as paging is concerned, the essence of a process switch is saving the value of cr3 into the old process’ descriptor and loading the physical address of the new process’ Page Global Directory into cr3.

The Linear Address Fields

This and the following section (Page Table Handling) detail a few important macros and functions related to the Page Table data structures. I did not read them line by line and thus don’t have a lot to note down. I can always come back to them when needed.

Page Table Handling

Ditto.

Physical Memory Layout

This is the fun part of memory management. To start managing memory, the kernel first needs to know where all the memory chunks are and how big each one is. Note that not all physical memory is usable by the kernel: some may be reserved by the BIOS, some for memory-mapped I/O devices, for example. In general, the kernel considers the following page frames reserved (they cannot be dynamically assigned or swapped to disk):

  • Those falling in the “unavailable” physical address ranges
  • Those containing the kernel’s own code and data

Kernel address. In general, the Linux kernel is installed in RAM starting from the physical address 0x00100000 - i.e., from the second megabyte. The total number of page frames required depends on the configuration, but typically a Linux kernel image can fit in 3MB of memory.

Why is the kernel not loaded into the first megabyte?

  • Page frame 0 is used by BIOS to store system hardware configurations during its Power-On Self-Test (POST).
  • Physical addresses 0x000a0000 to 0x000fffff are usually reserved for BIOS routines and for mapping the internal memory of ISA graphics cards.
  • Additional frames within the first megabyte may be reserved by specific computer models.

During boot, the kernel queries the BIOS to learn the size of the physical memory and the list of usable physical address ranges. It does this in the machine_specific_memory_setup() function. Right after this, in the setup_memory() function, it establishes the following variables:

Variable name    Description
num_physpages    Page frame number of the highest usable page frame

TODO: Fill the table.

Figure 2-13 shows how the first 3MB of RAM are filled by Linux.

Starting at the physical address 0x00100000 is the symbol _text, which marks the first byte of kernel code. Following it are _etext, _edata, and _end, which denote the end of the kernel code, the end of the initialized data, and the end of the kernel image, respectively. Between _etext and _edata are the initialized data structures, followed by the uninitialized data.

Process Page Tables

A process’ linear address space is divided into two parts:

  • Usable in both User and Kernel Mode: 0x00000000 to 0xbfffffff
  • Usable in Kernel Mode only: 0xc0000000 to 0xffffffff. These are the linear addresses used for kernel code and data; the first 896MB of this range maps physical RAM directly.

Every process has a different Page Global Directory with different entries, but all of them share the entries for the kernel portion of the linear address space (the addresses above 0xc0000000).

Kernel Page Tables

The kernel maintains a set of page tables of its own, rooted at a master kernel Page Global Directory. After system initialization, this set of page tables is never directly used by any process or kernel thread.

Chapter 8 will explain how the kernel ensures that changes to the master kernel Page Global Directory are always reflected in each process’ Page Global Directory. This chapter will next explain how the kernel initializes its own page tables. It is a two-phase activity; note that right after the kernel image is loaded into memory, the CPU is still running in real mode, with paging disabled.

In the first phase the kernel creates a limited address space including kernel’s code and data segments, the initial Page Tables, and 128KB for some dynamic data structures.

In the second phase, the kernel sets up Page Tables for all of the existing RAM.

Provisional kernel Page Tables

A provisional Page Global Directory is initialized during kernel compilation time, while provisional Page Tables are initialized at runtime, by the startup_32() assembly function defined in arch/i386/kernel/head.S.

The provisional Page Global Directory is contained in the swapper_pg_dir variable. Again, this is laid out at compile time and so is known at run time. The provisional Page Tables are stored starting from pg0, right after _end in the kernel image. For simplicity, we will assume the kernel image, the provisional Page Tables, plus the 128KB of dynamic space all fit into the first 8MB of RAM. In order to map 8MB of RAM, two Page Tables are required (each Page Table has 1024 entries, each mapping a 4KB page, so one table covers 4MB).

The objective here is to allow these 8MB of RAM to be addressed both before and after paging is turned on. Therefore, a mapping is needed from both the linear addresses 0x00000000 through 0x007fffff (the first 8MB, with linear addresses identical to the physical ones) and the linear addresses 0xc0000000 through 0xc07fffff (the first 8MB as the kernel will address it once it runs at its final addresses). In other words, during initialization the kernel can address the first 8MB of RAM either by linear addresses identical to the physical ones, or by 8MB worth of linear addresses starting from 0xc0000000.

Identity Mapping. The mapping of linear addresses 0x00000000 through 0x007fffff to the same physical addresses is usually referred to as identity mapping. It is important because most CPUs are pipelined; that is, several instructions issued by the CPU may be in flight in different stages at once. As soon as the MMU/paging unit is turned on, some older instructions fetched using physical addresses may still be in flight, and for them to keep working the first 8MB of RAM must be mapped at those addresses too. StackOverflow answer.

The kernel creates that mapping by filling all the swapper_pg_dir (the provisional Page Global Directory) entries with zeroes, except for entries 0, 1, 0x300 and 0x301; the latter two entries span all linear addresses between 0xc0000000 and 0xc07fffff: The address field of entries 0 and 0x300 is set to the physical address of pg0 (the byte immediately following the _end symbol), while the address field of entries 1 and 0x301 is set to the physical address of the page frame following pg0. Again, two page tables are needed to map 8MB of RAM.

Final kernel Page Table when RAM size is less than 896MB

The final kernel Page Tables should map linear addresses from 0xc0000000 through 0xffffffff to physical addresses starting at 0x0.

Final kernel Page Table when RAM size is between 896MB and 4096MB

Final kernel Page Table when RAM size is more than 4096MB

Handling the Hardware Cache and the TLB

Handling the hardware cache

As mentioned earlier, hardware caches are addressed by cache lines. The L1_CACHE_BYTES macro defines the size of a cache line. To optimize the cache hit rate, the kernel uses the following tricks when laying out its data structures:

  • The most frequently used fields are placed at lower offsets within a data structure, so that they can be cached in the same line (see the sketch after this list). Compare this to ordering fields alphabetically or by type.
  • When allocating a large set of data structures, the kernel tries to store each of them in memory in such a way that all cache lines are used uniformly.
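
A toy illustration of the first trick (a hypothetical structure, not taken from the kernel):

/* Hypothetical example of "hot fields first": the fields touched on every
 * operation sit at the start of the struct, in one cache line, while rarely
 * used bookkeeping comes afterwards. */
struct connection {
    /* hot: read or written on every packet */
    int            fd;
    unsigned int   state;
    unsigned long  bytes_in;
    unsigned long  bytes_out;

    /* cold: only touched at setup or teardown */
    char           peer_name[64];
    long           created_at;
};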

Cache synchronization. As mentioned earlier, on 80x86 microprocessors hardware cache synchronization between CPUs is taken care of by the hardware and is transparent to the kernel.

Handling the TLB

Contrary to that, TLB flushes have to be done by the kernel because it is the kernel that decides when a mapping between a linear address and a physical address is no longer valid.

TLB synchronization between CPUs is done by means of Interprocessor Interrupts: a CPU that is invalidating its TLB sends a signal to the others to force them to do a flush as well.

Generally speaking, a process switch implies a TLB flush. However, there are a few cases in which a TLB flush is not necessary:

  • When the switch is to another process that uses the same set of page tables as the process being switched from.
  • When performing a switch between a user process and a kernel thread. That is because kernel threads do not have their own page tables; rather, they use the set of page tables owned by the user process that was scheduled last for execution on the CPU.

Lazy TLB flushes. We mentioned earlier that TLB synchronization is done through one CPU sending an Interprocessor Interrupt to the others. When a CPU that is running a kernel thread receives such an interrupt, it simply ignores it and skips the requested TLB flush: kernel threads don’t access user processes’ address space (the lower 3GiB), so they won’t reference those TLB entries anyway. This is called lazy TLB mode. However, the kernel remembers that an interrupt has been received, so when it switches back to a user process, it issues the deferred TLB flush.

Possessive Pronouns

Definition

Mine, yours, his, hers, ours, theirs. Note the distinction between a possessive pronoun and a possessive adjective (e.g., my, your, his, her, our, their).

Gender and Number

Possessive pronouns have to agree in both gender and number with the thing owned or possessed, not with the owner. For example, “The girls read their books (the books of theirs)” - Las muchachas leen los libros suyos. Full table:

Singular                                            Plural
mío, míos     mine                                  nuestro, nuestros    ours
mía, mías     mine                                  nuestra, nuestras    ours
tuyo, tuyos   yours                                 vuestro, vuestros    yours
tuya, tuyas   yours                                 vuestra, vuestras    yours
suyo, suyos   his, hers, theirs, (formal) yours     suyo, suyos          theirs, (formal) yours
suya, suyas   his, hers, theirs, (formal) yours     suya, suyas          theirs, (formal) yours

In contrast, these are the possessive adjectives. You may have noticed that the first person plural and second person informal plural have the same forms as adjectives and as pronouns.

Singular                                 Plural
mi, mis     my                           nuestro, nuestros    our
                                         nuestra, nuestras    our
tu, tus     your                         vuestro, vuestros    your
                                         vuestra, vuestras    your
su, sus     his, her, (formal) your      su, sus              their, your

Possessive Pronouns Following ser

It is the same form as in English. For example, “This book is mine” - Este libro es mío, “This table is yours” - Esta mesa es tuya.

Possessive Pronouns Expressing “of mine/yours/his/her/ours/theirs”

In English we like to say “a friend of mine” or “an adventure of yours”. In Spanish, there is no need to insert a de between the object owned and the possessive pronoun. To say “a friend of mine” it is simply un amigo mío.

Possessive Pronouns in Comparisons

When two things are compared against each other, there are two forms that can be used. The first is the “regular comparison” that ranks the two subjects, i.e., “A is better/worse/taller/shorter than B”. In the second form, two subjects are brought up together but no explicit ranking is drawn. For example, “this house is mine, while that one is yours”, or “I like my house, as well as yours”.

Regular Comparisons

more … than más ... que
less … than menos ... que
as … as tan ... como

Examples:

My friend is taller than yours Mi amigo es más alto que el tuyo
Their house is less elegant than yours Su casa es menos elegante que la tuya

Irregular Comparatives

Unlike English, Spanish has only four irregular comparatives. They are:

mejor, mejores    better        mayor, mayores      older
peor, peores      worse         menor, menores      younger

When used with these comparatives, possessive pronouns follow the same rules above.

Comparative Statements

In the second form of comparison possessive pronouns follow the same rules as well. For example:

Their house is dirty but ours is clean Su casa está sucia pero la nuestra está limpia
Her books are in the kitchen and mine are in the dining room Sus libros están en la cocina y los míos están en el comedor

Numbers as Pronouns

Definition

When the noun after a number is omitted, the number serves as a pronoun. For example, “How many books do you have?”, “I have three (books)”. Or, “Which book is your favorite?”, “The third (book)”.

Both cardinal and ordinal numbers can be used as pronouns.

Numbers Below 10

Cardinal Numbers    Ordinal Numbers
uno, una            primero, primera
dos                 segundo, segunda
tres                tercero, tercera
cuatro              cuarto, cuarta
cinco               quinto, quinta
seis                sexto, sexta
siete               séptimo, séptima
ocho                octavo, octava
nueve               noveno, novena
diez                décimo, décima

Notes:

  • Ordinal numbers take gender.
  • When referring to a masculine noun, “first” and “third” drop the trailing o and become primer and tercer, respectively.

Numbers Above 10

Cardinal numbers are formed the same way whether below or above 10, but ordinal numbers above 10 can be formed in two ways:

  • “number-avo”, e.g., onceavo (11th), doceavo (12th) and treceavo (13th)
  • but more commonly, “el number”, e.g., el once (11th), el doce (12th) and el trece (13th)

Cardinal Numbers as Pronouns

Pretty much the same as in English. For example, ¿Cuántos libros tienes?, Tengo tres.

Ordinal Numbers as Pronouns

The only additional thing to note is that ordinal numbers have to agree in gender with the omitted noun. For example, El árbol es el séptimo (the tree is the seventh) and La lámpara es la segunda (the lamp is the second). Note that the definite articles used need to agree in gender too.