Virtual Memory


Contents

  1. Background: OS concepts
  2. Virtual memory motivation
  3. Virtual-to-physical address translation
  4. Page tables and page faults
  5. Speeding up translation: the TLB

Background: OS concepts

Any computer will have multiple active processes at any given time. You can run "top" on cwolf to get a list of the processes that are using the most resources at any time, along with the user IDs of the people running them. On your own computer, you can use a tool like Mac's Activity Monitor or Windows's Task Manager to get a list of the current processes. Whatever system you choose, you'll see two things: (1) The processes are run by different users. Even on your own machine, some of them will probably run with root/system/admin privileges, and some with only your user privileges. (2) There are WAY more processes running at any given time than there are processor cores to run them.

From these two observations, we can make one more: these processes must be sharing the underlying hardware, and there must be some mechanism to keep them from interfering with each other or bringing down the whole system.

From the quality time we've spent with the processor datapath and with caches, we have some sense of what happens when a process runs: the PC points to each instruction in turn, the hardware fetches and executes the instructions, and as a consequence it updates the register file and the memory subsystem with new values as our program executes. Hold up, though: what happens when our process is finished? Or what if our process misbehaves or goes on for a really long time -- what happens to all the other processes that need to share the hardware? Finally, if our process is allowed to just read or write any memory location it wants, how can we possibly guarantee any kind of security or correctness when there are multiple running processes?

The fundamental answer is that interactions between user programs and the hardware are mediated by the OS, or more precisely, the OS kernel. It turns out that there's more to the processor than the registers and functionality we've discussed so far. For starters, the hardware supports different privilege levels to distinguish between user programs and the kernel (in practice, it may have more levels than that, but we'll stick with 2 for now). When your computer starts up, the kernel is in charge. When you run a user process, the kernel (conceptually) flips the privilege bit and turns control over to that process. When your process exits, it transfers control back.

Transferring control is a complicated process, and the implementation details can vary. For our purposes, you can think of it as making sure that the privilege bits, PC, and register values of the old process get saved (if applicable), and then everything gets loaded up with the correct values for the new process. There are a number of ways that control can transfer between a user process and the kernel. Some relevant examples:

  1. Your process asks the kernel to do something on its behalf -- open a file, print to the screen, exit -- by making a system call.
  2. Your process does something illegal (divides by zero, touches memory it shouldn't), which raises an exception that drops control back into the kernel.
  3. A timer interrupt fires, the kernel takes over, decides your process has had the processor long enough, and hands control to a different process.

This last part -- switching control from one process to another (really, from one process to the kernel to the other process) -- is called a context switch. People use this term metaphorically: "Ugh, I was deep into debugging my 370 project and then my groupmate had this urgent question about our 351 homework...the context switch was painful!" You can see the analogy: you were keeping a lot of CS 370 information in your head, and you had to flush it all and try to restore the CS 351 information, just like a context switch saves the context of one process and then restores the context of another. And yes, it's a little bit painful in hardware too.
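
To put a slightly more concrete face on "context," here's a rough C sketch of the per-process state the kernel has to save and restore. The field names and the 32-register count are just for illustration (and the real save/restore happens in assembly, with a lot more bookkeeping), but it's the right mental picture.

    #include <stdint.h>

    #define NUM_REGS 32   /* assuming a 32-register machine, for illustration */

    /* Hypothetical saved context: everything needed to pause this process
       now and resume it later as if nothing had happened. */
    struct process_context {
        uint64_t pc;               /* which instruction to resume at            */
        uint64_t regs[NUM_REGS];   /* general-purpose register values           */
        uint8_t  user_mode;        /* the privilege "bit": 1 = user, 0 = kernel */
    };

    /* A context switch, conceptually: save the outgoing process's context,
       then reload the hardware from the incoming process's context. */
    void context_switch(struct process_context *save_old_here,
                        const struct process_context *load_new_from);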

These mechanisms help answer the first two questions we asked at the beginning: what happens when our process is finished? (It restores control to the OS.) What if our process misbehaves or goes on for a really long time? (Control returns to the OS either when our process does something illegal, or when its time is up.) But what they don't answer is the third: when our process does have control of the hardware, if our process is allowed to just read or write any memory location it wants, how can we possibly guarantee any kind of security or correctness when there are multiple running processes? Answering that question is what virtual memory is all about.

Additional resources for this piece (with more detail than you need):

Virtual memory motivation

A couple of good takes -- pick your favorite:

Virtual-to-physical address translation

We can solve these problems by adding a layer of indirection (and thus complexity). We'll let any user process access (almost) the entire 32- or 64-bit address space. If that overpromises the system's actual, physical RAM, that's OK. If multiple processes choose to store their stuff at the exact same addresses, that's OK too. The reason it's OK is that the kernel and hardware team up to spin a comforting web of lies for user processes' benefit. This web of lies is called virtual memory.

The core idea is that the addresses your program refers to -- the addresses from which it fetches instructions and loads or stores data -- are now only virtual addresses. The dynamic duo of the kernel and hardware are responsible for getting them translated to actual, physical addresses before going to the cache hierarchy to get the actual data.

It would be counterproductive, in terms of both storage space and performance, to try and work out separate translations for every single byte of memory. So we tend to do these address translations for chunks of contiguous memory, called pages, which are analogous to cache blocks but much larger (and stored directly in main memory).

For example, let's say you have 32-bit virtual addresses (reasonable) and 24-bit physical addresses (ludicrously small in this day and age). Furthermore, let's say your page size is 256 B. This is how you might picture a particular process's virtual memory:

  Addresses                              | Virtual page number
  ---------------------------------------+---------------------
  0x00000000 to 0x000000ff (0 to 255)    | 0
  0x00000100 to 0x000001ff (256 to 511)  | 1
  0x00000200 to 0x000002ff               | 2
  ...                                    | ...
  0xffffff00 to 0xffffffff               | 16777215 (0xffffff)

When the kernel decides how to translate one of this process's virtual addresses to physical memory, it only has to decide how the virtual page number maps to a physical page number. The remaining bits of the address are the page offset (analogous to a cache block offset), and you just concatenate them onto the physical page number to get the physical address. With 256 B pages, the offset is the low 8 bits of the address, which leaves a 24-bit virtual page number and a 16-bit physical page number in this example.

Imagine that this is how the kernel has mapped the virtual pages for this particular process to physical pages:

  Addresses                              | Virtual page number  | Physical page number
  ---------------------------------------+----------------------+----------------------
  0x00000000 to 0x000000ff (0 to 255)    | 0                    | N/A (you haven't allocated this page)
  0x00000100 to 0x000001ff (256 to 511)  | 1                    | N/A (not currently in physical memory; saved on disk)
  0x00000200 to 0x000002ff               | 2                    | 12
  ...                                    | ...                  | ...
  0xffffff00 to 0xffffffff               | 16777215 (0xffffff)  | 5

Here are some sample translations:

  * Virtual address 0x00000234 is byte 0x34 of virtual page 2, and virtual page 2 maps to physical page 12 (0xc), so the physical address is 0x000c34.
  * Virtual address 0xffffff80 is byte 0x80 of virtual page 0xffffff, which maps to physical page 5, so the physical address is 0x000580.
  * Virtual address 0x00000180 is on virtual page 1, which is currently out on disk. The kernel has to bring that page back into physical memory (and give it a physical page number) before this access can complete.
  * Virtual address 0x00000010 is on virtual page 0, which this process never allocated, so there is no legal translation at all. (This is where segmentation faults come from.)
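
Here's the same arithmetic as a short, runnable C sketch, hard-coding just the rows of the table we can see above. (The lookup_ppn function is purely a stand-in; the real mapping lives in a kernel data structure, which is the subject of the next section.)

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_OFFSET_BITS 8                        /* 256 B pages => 8 offset bits */
    #define PAGE_OFFSET_MASK ((1u << PAGE_OFFSET_BITS) - 1)

    /* The visible rows of the mapping table above; -1 means "no physical page". */
    static int lookup_ppn(uint32_t vpn)
    {
        switch (vpn) {
            case 0x000002: return 12;
            case 0xffffff: return 5;
            default:       return -1;   /* unallocated, or currently out on disk */
        }
    }

    static void translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;   /* upper 24 bits: virtual page number */
        uint32_t offset = vaddr & PAGE_OFFSET_MASK;    /* lower 8 bits: page offset          */
        int      ppn    = lookup_ppn(vpn);             /* 16-bit physical page number        */

        if (ppn < 0)
            printf("0x%08x: no translation yet (kernel must intervene)\n", vaddr);
        else
            printf("0x%08x -> 0x%06x\n", vaddr,
                   ((uint32_t)ppn << PAGE_OFFSET_BITS) | offset);
    }

    int main(void)
    {
        translate(0x00000234);   /* -> 0x000c34                               */
        translate(0xffffff80);   /* -> 0x000580                               */
        translate(0x00000180);   /* page 1: on disk, kernel must bring it in  */
        return 0;
    }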

Here are the design tradeoffs surrounding how big to make a page.

Arguments for big pages:

  * With fewer, bigger pages, you need fewer page table entries, so the bookkeeping is smaller and each entry covers more memory.
  * Programs with good spatial locality get more benefit out of each translation.
  * Disk transfers have an enormous fixed cost, so when you do have to go to disk, it pays to move a big chunk at once.

Arguments for smaller pages:

  * Less wasted space when a process only needs a small piece of a page (the same internal-fragmentation worry as with large cache blocks).
  * Less data to move when a page does have to be brought in from disk or written back out.

These arguments may sound familiar: except for the disk part, these are similar to the pros and cons of large cache blocks. Just to give you an idea of where the sweet spot is, 4 KB is a pretty common page size.

More resources:

Page tables and page faults

The table in the last section is very close in spirit to the data structures the OS actually uses to store these virtual-to-physical page mappings. To get a little more formal, here's the information the OS needs to store for each virtual page:

  * Whether this virtual page has been allocated at all -- is it even legal for the process to touch it?
  * Whether the page is currently sitting in physical memory or has been swapped out to disk.
  * If it's in memory: which physical page number it maps to.
  * If it's on disk: where on disk to find it.
  * Protection information: is this process allowed to read, write, and/or execute this page?

The information in these bullet points makes up a page table entry. In our model of the world, each process has its own page table, with one page table entry per virtual page. Some notes about this:

  * The page table itself lives in main memory, and only the kernel is allowed to read or modify it. That's the heart of the security story: a user process can only ever reach physical pages the kernel has chosen to map for it.
  * With one entry per virtual page, a flat page table gets big fast -- our running example would need 2^24 entries per process. (Real systems use sparser structures to avoid storing all the unused entries, but the flat table is the right mental model for now.)
  * If a process touches a page whose entry says it's not currently in physical memory, the hardware raises a page fault: an exception that hands control to the kernel, which fetches the page from disk, updates the page table, and restarts the faulting instruction.
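
To make that concrete, here's a rough C sketch of a page table entry and a (naively flat) per-process page table. The field names and layout are invented for illustration; real entries pack this information into a single machine word in an architecture-specific format.

    #include <stdint.h>
    #include <stdbool.h>

    /* One page table entry per virtual page (hypothetical layout). */
    struct pte {
        bool     valid;       /* has this virtual page been allocated at all?        */
        bool     present;     /* is it in physical memory right now (vs. on disk)?   */
        bool     readable;    /* protection bits: what is this process               */
        bool     writable;    /*   allowed to do with this page?                     */
        bool     executable;
        uint32_t ppn;         /* physical page number (meaningful only if present)   */
        uint64_t disk_slot;   /* where to find the page on disk, if swapped out      */
    };

    /* A flat page table for our running example: 24-bit virtual page
       numbers means 2^24 entries, kept (conceptually) once per process.
       Yes, this array would be enormous -- that's one of the notes above. */
    #define NUM_VIRTUAL_PAGES (1u << 24)
    struct pte page_table[NUM_VIRTUAL_PAGES];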

More resources:

Speeding up translation: the TLB

From everything I've said so far, virtual memory must be a horrendous drag on performance. For example, a simple load from some address, which used to have an AMAT of a few cycles, now looks like this:

  1. Access the page table in main memory to learn the physical page number (main memory access: ~200 cycles)
  2. If the page isn't in memory right now, get it from disk, fix the page table, and restart (1M+ cycles)
  3. Now that you have the physical address, go to the memory hierarchy and access L1, L2, etc. as needed to find the data (AMAT of a few cycles).

Because of that unskippable first step, we are now adding about 200 cycles to our AMAT of 3-5 cycles. How is that acceptable??

It's not. It's totally not. We circumvent it by adding yet another specialized structure on the processor chip: a translation lookaside buffer or TLB. This TLB stores a handful of virtual-to-physical mappings, and it's highly associative for maximum flexibility.

It turns out that, because of spatial locality, the TLB's hit rate is usually very high, and you don't need to store that many mappings. So now your lookup process is as follows:

  1. Access the TLB to see if it has an entry for this virtual page. If so, great! (This part usually fits within the normal cycle time.)
  2. If not, then access the page table in main memory to learn the physical page number (main memory access: ~200 cycles)
  3. If the page isn't in memory right now, get it from disk, fix the page table, and restart (1M+ cycles)
  4. Now that you have the physical address, go to the memory hierarchy and access L1, L2, etc. as needed to find the data (AMAT of a few cycles).

Now, if the address translation hits in the TLB and the data itself hits in L1, you don't stall at all. That's a best-case scenario, of course, but it's likely to be the common case.

However, you now have two potential sources of stalls: one from the address translation (which can be really painful!), and one for the data itself, which you're used to.
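
To pull the steps together, here's a minimal C sketch of the lookup with a tiny fully associative TLB. Everything here (the sizes, the names, the fake page-table walk) is invented for illustration -- a real TLB does this in hardware, within the normal cycle time -- but it shows where the fast path and the two slow paths live. It covers steps 1-3; step 4 happens with the physical address it returns.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 8     /* still the 256 B pages from the running example       */
    #define TLB_ENTRIES      16    /* a small, highly associative set of recent translations */

    struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stand-ins for steps 2 and 3: a page table walk in main memory
       (~200 cycles) and a page fault serviced from disk (1M+ cycles).
       The "mapping" here is fake -- it just keeps the sketch compilable. */
    static bool     page_is_present(uint32_t vpn)    { (void)vpn; return true; }
    static void     page_fault_handler(uint32_t vpn) { (void)vpn; /* kernel: fetch page, fix page table */ }
    static uint32_t page_table_walk(uint32_t vpn)    { return vpn & 0xffffu; }

    uint32_t translate_with_tlb(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

        /* Step 1 (fast path): check the TLB. Hardware checks all entries
           at once; this loop is the software-sketch equivalent. */
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].ppn << PAGE_OFFSET_BITS) | offset;   /* TLB hit: no stall */
        }

        /* Step 2 (slow): TLB miss, so walk the page table in main memory. */
        if (!page_is_present(vpn))
            page_fault_handler(vpn);   /* Step 3 (very slow): page fault, go to disk */
        uint32_t ppn = page_table_walk(vpn);

        /* Cache the translation for next time (naive eviction: always slot 0;
           a real TLB chooses a victim more sensibly). */
        tlb[0] = (struct tlb_entry){ .valid = true, .vpn = vpn, .ppn = ppn };

        return (ppn << PAGE_OFFSET_BITS) | offset;
    }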

More resources: VM video 10 and video 11 from David Black-Schaffer