Virtual Memory


Contents

  1. Background: OS concepts
  2. Virtual memory motivation
  3. Virtual-to-physical address translation
  4. Page tables and page faults
  5. Speeding up translation: the TLB

Background: OS concepts

Any computer will have multiple active processes at any given time. You can run "top" on cwolf to get a list of the processes that are using the most resources at any time, along with the user IDs of the people running them. On your own computer, you can use a tool like Mac's Activity Monitor or Windows's Task Manager to get a list of the current processes. Whatever system you choose, you'll see two things: (1) The processes are run by different users. Even on your own machine, some of them will probably run with root/system/admin privileges, and some with only your user privileges. (2) There are WAY more processes running at any given time than there are processor cores to run them.

From these two observations, we can make one more: these processes must be sharing the underlying hardware, and there must be some mechanism to keep them from interfering with each other or bringing down the whole system.

From the quality time we've spent with the processor datapath and with caches, we have some sense of what happens when a process runs: the PC points to each instruction in turn, the hardware fetches and executes the instructions, and as a consequence it updates the register file and the memory subsystem with new values as our program executes. Hold up, though: what happens when our process is finished? Or what if our process misbehaves or goes on for a really long time -- what happens to all the other processes that need to share the hardware? Finally, if our process is allowed to just read or write any memory location it wants, how can we possibly guarantee any kind of security or correctness when there are multiple running processes?

The fundamental answer is that interactions between user programs and the hardware are mediated by the OS, or more precisely, the OS kernel. It turns out that there's more to the processor than the registers and functionality we've discussed so far. For starters, the hardware supports different privilege levels to distinguish between user programs and the kernel (in practice, it may have more levels than that, but we'll stick with 2 for now). When your computer starts up, the kernel is in charge. When you run a user process, the kernel (conceptually) flips the privilege bit and turns control over to that process. When your process exits, it transfers control back.

Transferring control is a complicated process, and the implementation details can vary. For our purposes, you can think of it as making sure that the privilege bits, PC, and register values of the old process get saved (if applicable), and then everything gets loaded up with the correct values for the new process. There are a number of ways that control can transfer between a user process and the kernel. Some relevant examples:

  1. Your process asks the kernel to do something on its behalf -- open a file, print to the screen, exit -- by making a system call.
  2. Your process does something illegal (divides by zero, touches memory it shouldn't), which raises an exception that drops control back into the kernel.
  3. A timer interrupt fires, the kernel takes over, decides your process has had the processor long enough, and hands control to a different process.

This last part -- switching control from one process to another (really, from one process to the kernel to the other process) -- is called a context switch. People use this term metaphorically: "Ugh, I was deep into debugging my 370 project and then my groupmate had this urgent question about our 351 homework...the context switch was painful!" You can see the analogy: you were keeping a lot of CS 370 information in your head, and you had to flush it all and try to restore the CS 351 information, just like a context switch saves the context of one process and then restores the context of another. And yes, it's a little bit painful in hardware too.
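
To put a slightly more concrete face on "context," here's a rough C sketch of the per-process state the kernel has to save and restore. The field names and the 32-register count are just for illustration (and the real save/restore happens in assembly, with a lot more bookkeeping), but it's the right mental picture.

    #include <stdint.h>

    #define NUM_REGS 32   /* assuming a 32-register machine, for illustration */

    /* Hypothetical saved context: everything needed to pause this process
       now and resume it later as if nothing had happened. */
    struct process_context {
        uint64_t pc;               /* which instruction to resume at            */
        uint64_t regs[NUM_REGS];   /* general-purpose register values           */
        uint8_t  user_mode;        /* the privilege "bit": 1 = user, 0 = kernel */
    };

    /* A context switch, conceptually: save the outgoing process's context,
       then reload the hardware from the incoming process's context. */
    void context_switch(struct process_context *save_old_here,
                        const struct process_context *load_new_from);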

These mechanisms help answer the first two questions we asked at the beginning: what happens when our process is finished? (It restores control to the OS.) What if our process misbehaves or goes on for a really long time? (Control returns to the OS either when our process does something illegal, or when its time is up.) But what they don't answer is the third: when our process does have control of the hardware, if our process is allowed to just read or write any memory location it wants, how can we possibly guarantee any kind of security or correctness when there are multiple running processes? Answering that question is what virtual memory is all about.

Additional resources for this piece (with more detail than you need):

Virtual memory motivation

A couple of good takes -- pick your favorite:

Virtual-to-physical address translation

We can solve these problems by adding a layer of indirection (and thus complexity). We'll let any user process access (almost) the entire 32- or 64-bit address space. If that overpromises the system's actual, physical RAM, that's OK. If multiple processes choose to store their stuff at the exact same addresses, that's OK too. The reason it's OK is that the kernel and hardware team up to spin a comforting web of lies for user processes' benefit. This web of lies is called virtual memory.

The core idea is that the addresses your program refers to -- the addresses from which it fetches instructions and loads or stores data -- are now only virtual addresses. The dynamic duo of the kernel and hardware are responsible for getting them translated to actual, physical addresses before going to the cache hierarchy to get the actual data.

It would be counterproductive, in terms of both storage space and performance, to try and work out separate translations for every single byte of memory. So we tend to do these address translations for chunks of contiguous memory, called pages, which are analogous to cache blocks but much larger (and stored directly in main memory).

For example, let's say you have 32-bit virtual addresses (reasonable) and 24-bit physical addresses (ludicrously small in this day and age). Furthermore, let's say your page size is 256 B. This is how you might picture a particular process's virtual memory:

  Addresses                              | Virtual page number
  ---------------------------------------+---------------------
  0x00000000 to 0x000000ff (0 to 255)    | 0
  0x00000100 to 0x000001ff (256 to 511)  | 1
  0x00000200 to 0x000002ff               | 2
  ...                                    | ...
  0xffffff00 to 0xffffffff               | 16777215 (0xffffff)

When the kernel decides how to translate one of this process's virtual addresses to physical memory, it only has to decide how the virtual page number maps to a physical page number. The remaining bits of the address are the page offset (analogous to a cache block offset), and you just concatenate them onto the physical page number to get the physical address. With 256 B pages, the offset is the low 8 bits of the address, which leaves a 24-bit virtual page number and a 16-bit physical page number in this example.

Imagine that this is how the kernel has mapped the virtual pages for this particular process to physical pages:

  Addresses                              | Virtual page number  | Physical page number
  ---------------------------------------+----------------------+----------------------
  0x00000000 to 0x000000ff (0 to 255)    | 0                    | N/A (you haven't allocated this page)
  0x00000100 to 0x000001ff (256 to 511)  | 1                    | N/A (not currently in physical memory; saved on disk)
  0x00000200 to 0x000002ff               | 2                    | 12
  ...                                    | ...                  | ...
  0xffffff00 to 0xffffffff               | 16777215 (0xffffff)  | 5

Here are some sample translations:

  * Virtual address 0x00000234 is byte 0x34 of virtual page 2, and virtual page 2 maps to physical page 12 (0xc), so the physical address is 0x000c34.
  * Virtual address 0xffffff80 is byte 0x80 of virtual page 0xffffff, which maps to physical page 5, so the physical address is 0x000580.
  * Virtual address 0x00000180 is on virtual page 1, which is currently out on disk. The kernel has to bring that page back into physical memory (and give it a physical page number) before this access can complete.
  * Virtual address 0x00000010 is on virtual page 0, which this process never allocated, so there is no legal translation at all. (This is where segmentation faults come from.)
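
Here's the same arithmetic as a short, runnable C sketch, hard-coding just the rows of the table we can see above. (The lookup_ppn function is purely a stand-in; the real mapping lives in a kernel data structure, which is the subject of the next section.)

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_OFFSET_BITS 8                        /* 256 B pages => 8 offset bits */
    #define PAGE_OFFSET_MASK ((1u << PAGE_OFFSET_BITS) - 1)

    /* The visible rows of the mapping table above; -1 means "no physical page". */
    static int lookup_ppn(uint32_t vpn)
    {
        switch (vpn) {
            case 0x000002: return 12;
            case 0xffffff: return 5;
            default:       return -1;   /* unallocated, or currently out on disk */
        }
    }

    static void translate(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;   /* upper 24 bits: virtual page number */
        uint32_t offset = vaddr & PAGE_OFFSET_MASK;    /* lower 8 bits: page offset          */
        int      ppn    = lookup_ppn(vpn);             /* 16-bit physical page number        */

        if (ppn < 0)
            printf("0x%08x: no translation yet (kernel must intervene)\n", vaddr);
        else
            printf("0x%08x -> 0x%06x\n", vaddr,
                   ((uint32_t)ppn << PAGE_OFFSET_BITS) | offset);
    }

    int main(void)
    {
        translate(0x00000234);   /* -> 0x000c34                               */
        translate(0xffffff80);   /* -> 0x000580                               */
        translate(0x00000180);   /* page 1: on disk, kernel must bring it in  */
        return 0;
    }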

Here are the design tradeoffs surrounding how big to make a page.

Arguments for big pages:

  * With fewer, bigger pages, you need fewer page table entries, so the bookkeeping is smaller and each entry covers more memory.
  * Programs with good spatial locality get more benefit out of each translation.
  * Disk transfers have an enormous fixed cost, so when you do have to go to disk, it pays to move a big chunk at once.

Arguments for smaller pages:

  * Less wasted space when a process only needs a small piece of a page (the same internal-fragmentation worry as with large cache blocks).
  * Less data to move when a page does have to be brought in from disk or written back out.

These arguments may sound familiar: except for the disk part, these are similar to the pros and cons of large cache blocks. Just to give you an idea of where the sweet spot is, 4 KB is a pretty common page size.

More resources:

Page tables and page faults

The table in the last section is very close in spirit to the data structures the OS actually uses to store these virtual-to-physical page mappings. To get a little more formal, here's the information the OS needs to store for each virtual page:

  * Whether this virtual page has been allocated at all -- is it even legal for the process to touch it?
  * Whether the page is currently sitting in physical memory or has been swapped out to disk.
  * If it's in memory: which physical page number it maps to.
  * If it's on disk: where on disk to find it.
  * Protection information: is this process allowed to read, write, and/or execute this page?

The information in these bullet points makes up a page table entry. In our model of the world, each process has its own page table, with one page table entry per virtual page. Some notes about this:

  * The page table itself lives in main memory, and only the kernel is allowed to read or modify it. That's the heart of the security story: a user process can only ever reach physical pages the kernel has chosen to map for it.
  * With one entry per virtual page, a flat page table gets big fast -- our running example would need 2^24 entries per process. (Real systems use sparser structures to avoid storing all the unused entries, but the flat table is the right mental model for now.)
  * If a process touches a page whose entry says it's not currently in physical memory, the hardware raises a page fault: an exception that hands control to the kernel, which fetches the page from disk, updates the page table, and restarts the faulting instruction.
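
To make that concrete, here's a rough C sketch of a page table entry and a (naively flat) per-process page table. The field names and layout are invented for illustration; real entries pack this information into a single machine word in an architecture-specific format.

    #include <stdint.h>
    #include <stdbool.h>

    /* One page table entry per virtual page (hypothetical layout). */
    struct pte {
        bool     valid;       /* has this virtual page been allocated at all?        */
        bool     present;     /* is it in physical memory right now (vs. on disk)?   */
        bool     readable;    /* protection bits: what is this process               */
        bool     writable;    /*   allowed to do with this page?                     */
        bool     executable;
        uint32_t ppn;         /* physical page number (meaningful only if present)   */
        uint64_t disk_slot;   /* where to find the page on disk, if swapped out      */
    };

    /* A flat page table for our running example: 24-bit virtual page
       numbers means 2^24 entries, kept (conceptually) once per process.
       Yes, this array would be enormous -- that's one of the notes above. */
    #define NUM_VIRTUAL_PAGES (1u << 24)
    struct pte page_table[NUM_VIRTUAL_PAGES];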

More resources:

Speeding up translation: the TLB

From everything I've said so far, virtual memory must be a horrendous drag on performance. For example, a simple load from some address, which used to have an AMAT of a few cycles, now looks like this:

  1. Access the page table in main memory to learn the physical page number (main memory access: ~200 cycles)
  2. If the page isn't in memory right now, get it from disk, fix the page table, and restart (1M+ cycles)
  3. Now that you have the physical address, go to the memory hierarchy and access L1, L2, etc. as needed to find the data (AMAT of a few cycles).

Because of that unskippable first step, we are now adding about 200 cycles to our AMAT of 3-5 cycles. How is that acceptable??

It's not. It's totally not. We circumvent it by adding yet another specialized structure on the processor chip: a translation lookaside buffer or TLB. This TLB stores a handful of virtual-to-physical mappings, and it's highly associative for maximum flexibility.

It turns out that, because of spatial locality, the TLB's hit rate is usually very high, and you don't need to store that many mappings. So now your lookup process is as follows:

  1. Access the TLB to see if it has an entry for this virtual page. If so, great! (This part usually fits within the normal cycle time.)
  2. If not, then access the page table in main memory to learn the physical page number (main memory access: ~200 cycles)
  3. If the page isn't in memory right now, get it from disk, fix the page table, and restart (1M+ cycles)
  4. Now that you have the physical address, go to the memory hierarchy and access L1, L2, etc. as needed to find the data (AMAT of a few cycles).

Now, if the address translation hits in the TLB and the data itself hits in L1, you don't stall at all. That's a best-case scenario, of course, but it's likely to be the common case.

However, you now have two potential sources of stalls: one from the address translation (which can be really painful!), and one for the data itself, which you're used to.
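
To pull the steps together, here's a minimal C sketch of the lookup with a tiny fully associative TLB. Everything here (the sizes, the names, the fake page-table walk) is invented for illustration -- a real TLB does this in hardware, within the normal cycle time -- but it shows where the fast path and the two slow paths live. It covers steps 1-3; step 4 happens with the physical address it returns.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_OFFSET_BITS 8     /* still the 256 B pages from the running example       */
    #define TLB_ENTRIES      16    /* a small, highly associative set of recent translations */

    struct tlb_entry { bool valid; uint32_t vpn; uint32_t ppn; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    /* Stand-ins for steps 2 and 3: a page table walk in main memory
       (~200 cycles) and a page fault serviced from disk (1M+ cycles).
       The "mapping" here is fake -- it just keeps the sketch compilable. */
    static bool     page_is_present(uint32_t vpn)    { (void)vpn; return true; }
    static void     page_fault_handler(uint32_t vpn) { (void)vpn; /* kernel: fetch page, fix page table */ }
    static uint32_t page_table_walk(uint32_t vpn)    { return vpn & 0xffffu; }

    uint32_t translate_with_tlb(uint32_t vaddr)
    {
        uint32_t vpn    = vaddr >> PAGE_OFFSET_BITS;
        uint32_t offset = vaddr & ((1u << PAGE_OFFSET_BITS) - 1);

        /* Step 1 (fast path): check the TLB. Hardware checks all entries
           at once; this loop is the software-sketch equivalent. */
        for (int i = 0; i < TLB_ENTRIES; i++) {
            if (tlb[i].valid && tlb[i].vpn == vpn)
                return (tlb[i].ppn << PAGE_OFFSET_BITS) | offset;   /* TLB hit: no stall */
        }

        /* Step 2 (slow): TLB miss, so walk the page table in main memory. */
        if (!page_is_present(vpn))
            page_fault_handler(vpn);   /* Step 3 (very slow): page fault, go to disk */
        uint32_t ppn = page_table_walk(vpn);

        /* Cache the translation for next time (naive eviction: always slot 0;
           a real TLB chooses a victim more sensibly). */
        tlb[0] = (struct tlb_entry){ .valid = true, .vpn = vpn, .ppn = ppn };

        return (ppn << PAGE_OFFSET_BITS) | offset;
    }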

More resources: VM video 10 and video 11 from David Black-Schaffer