Other operating systems have objects which manage the underlying physical pages such as the pmap object in BSD. Linux instead maintains the concept of a three-level page table in the architecture independent code even if the underlying architecture does not support it.
While this is conceptually easy to understand, it also means that the distinction between different types of pages is very blurry and page types are identified by their flags or what lists they exist on rather than the objects they belong to. Architectures that manage their Memory Management Unit (MMU) differently are expected to emulate the three-level page tables. For example, on the x86 without PAE enabled, only two page table levels are available.
Unfortunately, for architectures that do not manage their cache or Translation Lookaside Buffer (TLB) automatically, machine-dependent hooks have to be explicitly left in the code for when the TLB and CPU caches need to be altered and flushed, even if they are null operations on some architectures like the x86. These hooks are discussed further in Section 3.
This chapter will begin by describing how the page table is arranged and what types are used to describe the three separate levels of the page table, followed by how a virtual address is broken up into its component parts for navigating the table.
Once covered, it will be discussed how the lowest level entry, the Page Table Entry (PTE), is represented and what bits are used by the hardware. After that, the macros used for navigating a page table and for setting and checking attributes will be discussed, before talking about how the page table is populated and how pages are allocated and freed for use with page tables.
The initialisation stage is then discussed, which shows how the page tables are initialised during bootstrapping. The page tables are loaded differently depending on the architecture. The page table layout is illustrated in Figure 3. Any given linear address may be broken up into parts to yield offsets within these three page table levels and an offset within the actual page. The SHIFT macros specify the length in bits that are mapped by each level of the page tables, as illustrated in Figure 3.
The MASK values can be ANDed with a linear address to mask out all the upper bits and are frequently used to determine if a linear address is aligned to a given level within the page table. The SIZE macros reveal how many bytes are addressed by each entry at each level. For the calculation of each of the triplets, only SHIFT is important as the other two are calculated based on it. For example, the three macros for the page level on the x86 are shown in the sketch below. Each entry in the three levels of the page table is described by the types pgd_t, pmd_t and pte_t respectively. Even though these are often just unsigned integers, they are defined as structs for two reasons.
The first is for type protection so that they will not be used inappropriately. The second is for features like PAE on the x86 where an additional 4 bits are used for addressing more than 4GiB of memory. Where exactly the protection bits are stored is architecture dependent.
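As an illustration, here is a minimal sketch of these definitions for the x86 without PAE, modelled on the 2.4-era include/asm-i386/page.h; treat it as indicative rather than a verbatim copy of any one kernel version.

    /* The page-level triplet: only PAGE_SHIFT is defined directly;
     * SIZE and MASK are derived from it. */
    #define PAGE_SHIFT      12                      /* 4KiB pages */
    #define PAGE_SIZE       (1UL << PAGE_SHIFT)     /* bytes mapped per PTE */
    #define PAGE_MASK       (~(PAGE_SIZE - 1))      /* masks the in-page offset */

    /* The entry types are wrapped in structs for type protection, even
     * though without PAE each is just an unsigned long. */
    typedef struct { unsigned long pte_low; } pte_t;
    typedef struct { unsigned long pmd; }     pmd_t;
    typedef struct { unsigned long pgd; }     pgd_t;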
For illustration purposes, we will examine the case of an x86 architecture without PAE enabled, but the same principles apply across architectures. A number of the protection and status bits are listed in Table 3. To navigate the page directories, three macros are provided which break up a linear address into its component parts. The remainder of the linear address provided is the offset within the page. The relationship between these fields is illustrated in Figure 3. There are many parts of the VM which are littered with page table walk code and it is important to recognise it. For example, the function follow_page() in mm/memory.c performs such a walk; the sketch below shows its general shape, with the parts unrelated to the page table walk omitted.
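What follows is a minimal sketch modelled on the 2.4-era follow_page() rather than copied from it; error handling is simplified and the surrounding locking is omitted, and the _sketch suffix marks the function name as invented for this example.

    /* Walk the page tables of the current process to find the struct
     * page backing a virtual address; returns NULL if there is no
     * valid mapping. */
    static struct page *follow_page_sketch(unsigned long address)
    {
            pgd_t *pgd;
            pmd_t *pmd;
            pte_t *ptep, pte;

            /* Top level: locate the PGD entry for this address */
            pgd = pgd_offset(current->mm, address);
            if (pgd_none(*pgd) || pgd_bad(*pgd))
                    return NULL;

            /* Middle level: locate the PMD entry */
            pmd = pmd_offset(pgd, address);
            if (pmd_none(*pmd) || pmd_bad(*pmd))
                    return NULL;

            /* Bottom level: locate the PTE itself */
            ptep = pte_offset(pmd, address);
            pte = *ptep;
            if (pte_present(pte))
                    return pte_page(pte);

            return NULL;
    }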
The third set of macros examine and set the permissions of an entry. The permissions determine what a userspace process can and cannot do with a particular page. For example, the kernel page table entries are never readable by a userspace process. The fourth set of macros examine and set the state of an entry. There are only two bits that are important in Linux, the dirty bit and the accessed bit.
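As a hedged illustration, the sketch below shows how the permission and state macros are typically used; the macro names are those of 2.4-era kernels (asm/pgtable.h) but the surrounding code is invented for the example.

    /* Examine and modify the protection and state bits of a PTE. */
    pte_t pte = *ptep;

    if (pte_write(pte))                 /* may userspace write to it? */
            pte = pte_wrprotect(pte);   /* remove write permission */

    if (pte_dirty(pte))                 /* has the page been written to? */
            pte = pte_mkclean(pte);     /* clear the dirty bit */

    if (pte_young(pte))                 /* recently accessed? */
            pte = pte_mkold(pte);       /* clear the accessed bit */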
A further set of functions and macros deals with the mapping of addresses and pages to PTEs and with the setting of the individual entries. This is important when some modification needs to be made to either the PTE protection or the struct page itself.
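To make this concrete, here is a small sketch of establishing and later removing an entry, again using 2.4-era macro names; the protection value and the context are illustrative only.

    /* Combine a struct page with protection bits to form a PTE,
     * install it, and later remove it. */
    pte_t entry = mk_pte(page, PAGE_SHARED);  /* page + protection bits */
    set_pte(ptep, entry);                     /* install the new entry */
    /* ... */
    pte_clear(ptep);                          /* remove the mapping */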
The last set of functions deal with the allocation and freeing of page tables. Page tables, as stated, are physical pages containing an array of entries and the allocation and freeing of physical pages is a relatively expensive operation, both in terms of time and the fact that interrupts are disabled during page allocation.
The allocation and deletion of page tables, at any of the three levels, is a very frequent operation so it is important the operation is as quick as possible. Hence the pages used for the page tables are cached in a number of different lists called quicklists.
Each architecture implements these caches differently but the principles used are the same. For example, not all architectures cache PGDs because the allocation and freeing of them only happens during process creation and exit. As both of these are already very expensive operations, the additional cost of allocating another page is negligible. Architectures implement these three lists in different ways but one method is through the use of a LIFO type structure. Ordinarily, a page table entry contains pointers to other pages containing page tables or data.
While cached, the first element of the list is used to point to the next free page table. During allocation, one page is popped off the list and during free, one is placed as the new head of the list. A count is kept of how many pages are used in the cache. If a page is not available from the cache, a page will be allocated using the physical page allocator (see Chapter 6). Obviously a large number of pages may exist on these caches and so there is a mechanism in place for pruning them; a sketch of the list handling follows.
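The sketch below is modelled on the pte_quicklist handling in 2.4-era x86 kernels; the function and variable names match that era but the code is simplified, not verbatim.

    /* LIFO cache of free page table pages. The first word of each
     * cached page links to the next free page. */
    static unsigned long *pte_quicklist;
    static unsigned long pgtable_cache_size;

    /* Allocation: pop the head of the list, if any */
    static pte_t *pte_alloc_one_fast(void)
    {
            unsigned long *ret = pte_quicklist;
            if (ret != NULL) {
                    pte_quicklist = (unsigned long *)(*ret); /* unlink */
                    ret[0] = 0;                 /* clear the link word */
                    pgtable_cache_size--;
            }
            return (pte_t *)ret;                /* NULL means cache empty */
    }

    /* Free: push the page back as the new head of the list */
    static void pte_free_fast(pte_t *pte)
    {
            *(unsigned long *)pte = (unsigned long)pte_quicklist;
            pte_quicklist = (unsigned long *)pte;
            pgtable_cache_size++;
    }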
Each time the caches grow or shrink, a counter is incremented or decremented, and the counter has a high and low watermark. When the high watermark is reached, entries from the cache will be freed until the cache size returns to the low watermark, as the sketch below illustrates.
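A minimal sketch of the pruning logic; the 2.4 kernel performs this role in check_pgt_cache(), but the watermark values and the helper shown here are illustrative assumptions.

    /* Illustrative watermarks: prune when the cache exceeds the high
     * mark, stopping once it shrinks back to the low mark. */
    static const unsigned long low_mark = 25, high_mark = 50;

    static void prune_pgt_cache(void)   /* hypothetical helper */
    {
            if (pgtable_cache_size <= high_mark)
                    return;
            while (pgtable_cache_size > low_mark)
                    free_page((unsigned long)pte_alloc_one_fast());
    }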
When the system first starts, paging is not enabled as page tables do not magically initialise themselves. Each architecture implements this initialisation differently so only the x86 case will be discussed. The page table initialisation is divided into two phases. The bootstrap phase sets up page tables for just 8MiB so the paging unit can be enabled. The second phase initialises the rest of the page tables. We discuss both of these phases below. The first megabyte is used by some devices for communication with the BIOS and is skipped. The bootstrap code then establishes page table entries for two pages, pg0 and pg1.
This means that when paging is enabled, they will map to the correct pages using either physical or virtual addressing for just the kernel image. Once this mapping has been established, the paging unit is turned on by setting a bit in the cr0 register, and a jump takes place immediately to ensure the Instruction Pointer (EIP) register is correct.
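In C-flavoured pseudocode, the enabling step looks roughly like the sketch below; the real 2.4 code is assembly in arch/i386/kernel/head.S, and the helper names read_cr0() and write_cr0()/write_cr3() follow later kernels rather than 2.4 itself.

    /* Load the physical address of the boot page directory into cr3,
     * then set the PG bit (bit 31) in cr0 to switch the paging unit on. */
    write_cr3((unsigned long)swapper_pg_dir - PAGE_OFFSET);
    write_cr0(read_cr0() | 0x80000000);     /* set the PG bit */
    /* an immediate jump follows so that EIP is fetched through the
     * new virtual mapping */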
The call graph for paging_init(), the function which performs this second phase on the x86, can be seen in Figure 3. If the CPU supports the PGE flag, it also will be set so that the page table entry will be global and visible to all processes. There is a requirement for Linux to have a fast method of mapping virtual addresses to physical addresses and for mapping struct pages to their physical addresses. All architectures achieve this with very similar mechanisms but for illustration purposes, we will only examine the x86 carefully.
As we saw in Section 3., Linux sets up a direct mapping from physical address 0 to the virtual address PAGE_OFFSET at 3GiB on the x86, so any kernel virtual address in this region can be translated to a physical address by simply subtracting PAGE_OFFSET. Next we see how this helps the mapping of struct pages to physical addresses. The kernel image itself is loaded at physical address 1MiB, which translates to the virtual address 0xC0100000. This would imply that the first available memory to use is located at 0xC0100000 but that is not the case.
No macro is available for converting struct pages to physical addresses but at this stage, it should be obvious to see how it could be calculated; a sketch of both conversions is given below.
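A hedged sketch of the two conversions; __pa() and __va() are the real macro names, while page_to_phys_addr() is a hypothetical helper written for this example, since the kernel provides no such macro.

    /* Kernel virtual <-> physical conversion via the direct mapping */
    #define __pa(x) ((unsigned long)(x) - PAGE_OFFSET)            /* virt -> phys */
    #define __va(x) ((void *)((unsigned long)(x) + PAGE_OFFSET))  /* phys -> virt */

    /* struct page -> physical address: mem_map is a simple array, so
     * a page's index in it is its page frame number (PFN).
     * Hypothetical helper for illustration. */
    static unsigned long page_to_phys_addr(struct page *page)
    {
            return (unsigned long)(page - mem_map) << PAGE_SHIFT;
    }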
Initially, when the processor needs to map a virtual address to a physical address, it must traverse the full page directory searching for the PTE of interest. To avoid this considerable overhead, architectures take advantage of the fact that most processes exhibit a locality of reference or, in other words, large numbers of memory references tend to be for a small number of pages. They take advantage of this reference locality by providing a Translation Lookaside Buffer (TLB), which is a small associative memory that caches virtual to physical page table resolutions. Linux assumes that most architectures support some type of TLB, although the architecture-independent code does not care how it works. Instead, architecture-dependent hooks are dispersed throughout the VM code at points where it is known that some hardware with a TLB would need to perform a TLB-related operation.
For example, when the page tables have been updated, such as after a page fault has completed, the processor may need to update the TLB for that virtual address mapping. Not all architectures require these types of operations but because some do, the hooks have to exist. If the architecture does not require the operation to be performed, the function for that TLB operation will be a null operation that is optimised out at compile time.
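As an illustration of such a null hook, the x86 in 2.4-era kernels defines the post-fault hook update_mmu_cache() as a no-op, roughly as below; architectures that must do work after a page table update supply a real body instead.

    /* The x86 needs no extra work after a page fault updates the page
     * tables, so the hook compiles away to nothing. */
    #define update_mmu_cache(vma, address, pte) do { } while (0)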
As another example, when context switching, Linux will avoid loading new page tables using Lazy TLB Flushing, discussed further in Section 4. The second form of cache to manage is the CPU cache: to avoid having to fetch data from main memory for each reference, the CPU will instead cache very small amounts of data in the CPU cache. CPU caches are organised into lines.
Each line is typically quite small, usually 32 bytes, and each line is aligned to its boundary size. In other words, a cache line of 32 bytes will be aligned on a 32 byte address. How addresses are mapped to cache lines varies between architectures but the mappings come under three headings: direct mapping, associative mapping and set associative mapping. Direct mapping is the simplest approach where each block of memory maps to only one possible cache line.
With associative mapping, any block of memory can map to any cache line. Set associative mapping is a hybrid approach where any block of memory can map to any line, but only within a subset of the available lines. Regardless of the mapping scheme, they each have one thing in common: addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs simple tricks to try and maximise cache usage, one of which is sketched below.
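A sketch of one such trick; ____cacheline_aligned is a real 2.4-era attribute (it aligns a field to SMP_CACHE_BYTES), while the structure itself is invented for the example.

    /* Place two frequently and independently updated counters on
     * separate cache lines so that writes to one do not evict the
     * line holding the other. */
    struct hot_counters {
            unsigned long hits   ____cacheline_aligned;
            unsigned long misses ____cacheline_aligned;
    };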
If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high as a reference to cache can typically be performed in less than 10ns, whereas a reference to main memory typically will cost between 100ns and 200ns. The basic objective is then to have as many cache hits and as few cache misses as possible. Just as it does for the TLB, the VM provides architecture-dependent hooks for CPU cache management. The hooks are placed in locations where the virtual to physical mapping changes, such as during a page table update. The CPU cache flushes should always take place first as some CPUs require a virtual to physical mapping to exist when the virtual address is being flushed from the cache.
The three operations that require proper ordering are listed in Table 3. It does not end there though. A second set of interfaces is required to avoid virtual aliasing problems. The problem is that some CPUs select lines based on the virtual address, meaning that one physical address can exist on multiple lines, leading to cache coherency problems.