So far, we've traced sequences of memory addresses through caches that handle them as follows, if you'll let me anthropomorphize a little bit:
Hit: If the L1 determines that it has Address XXX's data, it sends that data straight back to the processor, and that's the end of the story.
Miss: If the L1 determines that it doesn't have Address XXX's data, then it starts talking to the next level of the hierarchy (probably an L2 cache). The conversation between the L1 and L2 looks a lot like the conversation between the processor and the L1 we've outlined so far.
But eventually, the data makes its way from some other level of the hierarchy to both the processor that requested it and the L1 cache. The L1 cache then stores the new data, possibly replacing some old data in that cache block, on the hypothesis that temporal locality is king and the new data is more likely to be accessed soon than the old data was.
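If it helps to see that read path written down, here's a minimal sketch in Python. Everything in it (the Cache class, a plain dict standing in for the next level of the hierarchy) is invented for illustration; a real cache is a fixed-size hardware structure with sets and tags, not a dictionary.

    class Cache:
        """Toy model of the read path described above."""
        def __init__(self, capacity, next_level):
            self.blocks = {}              # block address -> data currently cached
            self.capacity = capacity      # how many blocks fit in this cache
            self.next_level = next_level  # e.g. a dict standing in for L2/memory

        def read(self, addr):
            if addr in self.blocks:              # hit: just hand the data back
                return self.blocks[addr]
            data = self.next_level[addr]         # miss: go ask the next level
            if len(self.blocks) >= self.capacity:
                # Make room by unceremoniously evicting some old block
                # (here, simply the oldest one we inserted).
                self.blocks.pop(next(iter(self.blocks)))
            self.blocks[addr] = data             # keep a copy, betting on locality
            return data

    memory = {0: "a", 64: "b", 128: "c"}         # pretend main memory
    l1 = Cache(capacity=2, next_level=memory)
    l1.read(0); l1.read(64); l1.read(128)        # the third read evicts block 0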
Throughout this process, we make some sneaky implicit assumptions that are valid for reads but questionable for writes. We will label them Sneaky Assumptions 1 and 2:
Sneaky assumption 1: It's OK to unceremoniously replace old data in a cache, since we know there is a copy somewhere else further down the hierarchy (main memory, if nowhere else).
Sneaky assumption 2: If the access is a miss, we absolutely need to go get that data from another level of the hierarchy before our program can proceed.
Why these assumptions are valid for reads:
Sneaky assumption 1: Bringing data into the L1 (or L2, or whatever) just means making a copy of the version in main memory. If we lose this copy, we still have the data somewhere.
Sneaky assumption 2: If the request is a load, the processor has asked the memory subsystem for some data. In order to fulfill this request, the memory subsystem absolutely must go chase that data down, wherever it is, and bring it back to the processor.
Why these assumptions are questionable for writes:
Sneaky assumption 1: Let's think about the data that's being replaced (the technical term is evicted) when we bring in the new data. If some of the accesses to the old data were writes, it's at least possible that the version of the old data in our cache is inconsistent with the versions in lower levels of the hierarchy. We would want to be sure that the lower levels know about the changes we made to the data in our cache before just overwriting that block with other stuff.
Sneaky assumption 2: If the request is a store, the processor is just asking the memory subsystem to keep track of something -- it doesn't need any information back from the memory subsystem. So the memory subsystem has a lot more latitude in how to handle write misses than read misses.
In short, cache writes present both challenges and opportunities that reads don't, which opens up a new set of design decisions.
More wild anthropomorphism ahead...
Imagine you're an L1 cache (although this discussion generalizes to other levels as well). The processor sends you a write request for address XXX, whose data you're already storing (a write hit). As requested, you modify the data in the appropriate L1 cache block.
Oh no! Now your version of the data at Address XXX is inconsistent with the version in subsequent levels of the memory hierarchy (L2, L3, main memory...)! Since you care about preserving correctness, you have only two real options:
Option 1: Write-through. You and L2 are soulmates. Inconsistency with L2 is intolerable to you. You feel uncomfortable when you and L2 disagree about important issues like the data at Address XXX. To deal with this discomfort, you immediately tell L2 about this new version of the data.
Option 2: Write-back. You have a more hands-off relationship with L2. Your discussions are on a need-to-know basis. You quietly keep track of the fact that you have modified this block. If you ever need to evict the block, that's when you'll finally tell L2 what's up.
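Here's how those two options look on a write hit, in the same toy Python style. The class names and the dirty set are made up for illustration; the dirty set stands in for per-block dirty bits, and a plain dict again stands in for L2.

    class WriteThroughCache:
        """Option 1: every write hit is immediately propagated to L2."""
        def __init__(self, next_level):
            self.blocks = {}
            self.next_level = next_level

        def write_hit(self, addr, data):
            self.blocks[addr] = data         # update our copy...
            self.next_level[addr] = data     # ...and immediately tell L2 too

    class WriteBackCache:
        """Option 2: L2 only hears about the write when the block is evicted."""
        def __init__(self, next_level):
            self.blocks = {}
            self.dirty = set()               # blocks that disagree with L2
            self.next_level = next_level

        def write_hit(self, addr, data):
            self.blocks[addr] = data         # update our copy only
            self.dirty.add(addr)             # quietly note that L2 is now stale

        def evict(self, addr):
            if addr in self.dirty:           # only now does L2 find out
                self.next_level[addr] = self.blocks[addr]
                self.dirty.discard(addr)
            del self.blocks[addr]

The only real difference is when L2 hears about the store: on every single write (write-through), or only when the block is finally evicted (write-back).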
With write-through, every time you see a store instruction, that means you need to initiate a write to L2. In order to be absolutely sure you're consistent with L2 at all times, you need to wait for this write to complete, which means you need to pay the access time for L2.
What this means is that a write hit actually acts like a miss, since you'll need to access L2 (and possibly other levels too, depending on what L2's write policy is and whether the L2 access is a hit or miss).
This is no fun and a serious drag on performance.
Instead of sitting around until the L2 write has fully completed, you add a little bit of extra storage to L1 called a write buffer. The write buffer's job is to keep track of all the pending updates to L2, so that L1 can move on with its life.
This optimization is possible because those write-through operations don't actually need any information from L2; L1 just needs to be assured that the write will go through.
However, the write buffer is finite -- we're not going to be able to just add more transistors to it if it fills up. If the write buffer does fill up, then L1 actually will have to stall and wait for some writes to go through.
The bottom line: from a performance perspective, we'll treat a write hit to a write-through cache like a read hit, as long as the write buffer has available space. When the write buffer is full, we'll treat it more like a read miss (since we have to wait to hand the data off to the next level of cache).
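A toy cycle-count model makes that bottom line concrete. The latencies and buffer size below are invented numbers, not anything from real hardware.

    from collections import deque

    L1_TIME, L2_TIME = 1, 10     # made-up access times, in cycles
    BUFFER_SIZE = 4              # made-up write buffer capacity

    write_buffer = deque()       # pending updates on their way to L2

    def write_hit_cost(addr, data):
        """Cycles L1 spends on a write hit in a write-through cache."""
        if len(write_buffer) >= BUFFER_SIZE:
            # Buffer full: stall until the oldest pending write drains to L2.
            write_buffer.popleft()
            cost = L1_TIME + L2_TIME         # behaves like a miss
        else:
            cost = L1_TIME                   # behaves like a read hit
        write_buffer.append((addr, data))    # hand the write off and move on
        return cost

    # In real hardware the buffer also drains in the background whenever the
    # L2 port is free; that part isn't modeled here.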
With a write-back cache, on the other hand, as long as we're getting write hits to a particular block, we don't tell L2 anything. Instead, we just set a bit of L1 metadata (the dirty bit -- technical term!) to indicate that this block is now inconsistent with the version in L2.
So everything is fun and games as long as our accesses are hits. The problem is whenever we have a miss -- even if it's a read miss -- and the block that's being replaced is dirty.
Whenever we have a miss to a dirty block and bring in new data, we actually have to make two accesses to L2 (and possibly lower levels): one to write the evicted dirty block back to L2, so its updates aren't lost, and one to fetch the new block we actually missed on.
What this means is that some fraction of our misses -- the ones that overwrite dirty data -- now have this outrageous double miss penalty.
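Here's what that looks like in the toy model: a miss first writes back the dirty victim, then fetches the new block, so it touches L2 twice. The access counter is only there to make the double penalty visible; as before, the names are invented.

    class WriteBackCache:
        def __init__(self, capacity, next_level):
            self.blocks = {}
            self.dirty = set()
            self.capacity = capacity
            self.next_level = next_level
            self.next_level_accesses = 0     # counts trips to L2

        def _install(self, addr):
            """Handle a miss on addr: evict if necessary, then fetch the block."""
            if len(self.blocks) >= self.capacity:
                victim = next(iter(self.blocks))     # pick a block to evict
                if victim in self.dirty:
                    # Trip #1: write the dirty victim back to L2 first,
                    # otherwise its updates would be lost forever.
                    self.next_level[victim] = self.blocks[victim]
                    self.next_level_accesses += 1
                    self.dirty.discard(victim)
                del self.blocks[victim]
            # Trip #2 (or #1, if the victim was clean): fetch the new block.
            self.blocks[addr] = self.next_level[addr]
            self.next_level_accesses += 1

        def read(self, addr):
            if addr not in self.blocks:      # note: even a *read* miss can
                self._install(addr)          # force a write-back of the victim
            return self.blocks[addr]

        def write(self, addr, data):
            if addr not in self.blocks:      # (write-allocate; more on that below)
                self._install(addr)
            self.blocks[addr] = data
            self.dirty.add(addr)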
Again, pretend (without loss of generality) that you're an L1 cache. You get a write request from the processor. Your only obligation to the processor is to make sure that any subsequent read requests to this address see the new value rather than the old one. That's it.
If this write request happens to be a hit, you'll handle it according to your write policy (write-back or write-through), as described above. But what if it's a miss?
As long as someone hears about this data, you're not actually obligated to personally make room for it in L1. You can just pass it to the next level without storing it yourself.
(As a side note, it's also possible to refuse to make room for the new data on a read miss. But that requires you to be pretty smart about which reads you want to cache and which reads you want to send to the processor without storing in L1. That's a very interesting question but beyond the scope of this class.)
So you have two basic choices: make room for the new data on a write miss, or don't.
A write-allocate cache makes room for the new data on a write miss, just like it would on a read miss.
Here's the tricky part: if cache blocks are bigger than the amount of data requested, now you have a dilemma. Do you go ask L2 for the data in the rest of the block (which you don't even need yet!), or not? This leads to yet another design decision: a fetch-on-write cache goes and gets the rest of the block from L2 right away, while a no-fetch-on-write cache skips that fetch and instead has to keep track of which parts of the block actually hold valid data.
In this class, I won't ask you about the sizing or performance of no-fetch-on-write caches. I might ask you conceptual questions about them, though.
No-write-allocate is just what it sounds like! If you have a write miss in a no-write-allocate cache, you simply notify the next level down (similar to a write-through operation). You don't kick anything out of your own cache.
Generally, write-allocate makes more sense for write-back caches and no-write-allocate makes more sense for write-through caches, but the other combinations are possible too.
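To wrap up, here's the write-miss decision in the same toy style. The block is modeled as a list of words this time, so the fetch in the write-allocate case actually has a point: the store only touches one word, but the block holds several. The write-allocate version below is fetch-on-write; a no-fetch-on-write variant would skip that fetch and instead track which words of the block are valid. All the names are invented for illustration.

    BLOCK_WORDS = 16   # made-up block size, in words

    class WriteAllocateCache:
        """Write-back + write-allocate: a write miss brings the block in."""
        def __init__(self, next_level):
            self.blocks = {}                # block number -> list of words
            self.dirty = set()
            self.next_level = next_level    # dict: block number -> list of words

        def write(self, block, offset, value):
            if block not in self.blocks:
                # Miss: fetch the whole block first (fetch-on-write), since
                # the store only overwrites one word of it.
                self.blocks[block] = list(self.next_level[block])
            self.blocks[block][offset] = value   # apply the write locally
            self.dirty.add(block)                # L2's copy is now stale

    class NoWriteAllocateCache:
        """Write-through + no-write-allocate: a write miss bypasses L1."""
        def __init__(self, next_level):
            self.blocks = {}
            self.next_level = next_level

        def write(self, block, offset, value):
            if block in self.blocks:
                self.blocks[block][offset] = value   # hit: update our copy too
            self.next_level[block][offset] = value   # either way, pass it along

Note how the usual pairings fall out naturally here: the write-allocate cache is already tracking dirty blocks (write-back), while the no-write-allocate cache is already forwarding every write (write-through).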