The recent reveal of Meltdown and Spectre reminded me of the time I found a related design bug in the Xbox 360 CPU: a newly added instruction whose mere existence was dangerous.
Back in 2005 I was the Xbox 360 CPU guy. I lived and breathed that chip. I still have a 30-cm CPU wafer on my wall, and a four-foot poster of the CPU’s layout. I spent so much time understanding how that CPU’s pipelines worked that when I was asked to investigate some impossible crashes I was able to intuit how a design bug must be their cause. But first, some background…
The Xbox 360 CPU is a three-core PowerPC chip made by IBM. The three cores sit in three separate quadrants, with the fourth quadrant containing a 1-MB L2 cache. You can see the different components in the picture at right and on my CPU wafer. Each core has a 32-KB instruction cache and a 32-KB data cache.
Trivia: Core 0 was closer to the L2 cache and had measurably lower L2 latencies.
The Xbox 360 CPU had high latencies for everything, with memory latencies being particularly bad. And the 1-MB L2 cache (all that would fit) was pretty small for a three-core CPU. So conserving space in the L2 cache in order to minimize cache misses was important.
CPU caches improve performance due to spatial and temporal locality. Spatial locality means that if you’ve used one byte of data then you’ll probably use other nearby bytes of data soon. Temporal locality means that if you’ve used some memory then you will probably use it again in the near future.
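To put a rough number on spatial locality, here is a minimal sketch that counts how many distinct cache lines a set of byte addresses lands on. The 128-byte line size is the figure quoted later in this post; the `lines_touched` helper and the two access patterns are made up purely for illustration.

```python
LINE_SIZE = 128  # bytes per cache line (figure quoted later in this post)

def lines_touched(addresses, line_size=LINE_SIZE):
    """Return how many distinct cache lines the given byte addresses fall in."""
    return len({addr // line_size for addr in addresses})

# Two hypothetical access patterns, each reading 1024 individual bytes.
sequential = range(0, 1024)                       # adjacent bytes
strided = range(0, 1024 * LINE_SIZE, LINE_SIZE)   # one byte from each line

print(lines_touched(sequential))  # 8
print(lines_touched(strided))     # 1024
```

Both patterns read the same number of bytes, but the sequential one needs only 8 line fills while the strided one needs 1024, which is why data that is used together wants to live together.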
But sometimes temporal locality doesn’t actually happen. If you are processing a large array of data once per frame then it may be trivially provable that it will all be gone from the L2 cache by the time you need it again. You still want that data in the L1 cache so that you can benefit from spatial locality, but having it consume valuable space in the L2 cache just means it will evict other data, perhaps slowing down the other two cores.
Normally this is unavoidable. The memory coherency mechanism of our PowerPC CPU required that all data in the L1 caches also be in the L2 cache. The MESI protocol used for memory coherency requires that when one core writes to a cache line, any other cores with a copy of the same cache line must discard it, and the L2 cache was responsible for keeping track of which L1 caches were caching which addresses.
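As a rough illustration of the invalidate-on-write rule, here is a simplified textbook MESI state machine for one cache line as seen by a single core. It is a sketch, not the exact protocol implemented in the Xbox 360 CPU; the `mesi_next` function and the event names are invented for this example.

```python
def mesi_next(state, event, others_have_copy=False):
    """Next MESI state (M, E, S or I) of one line in one core's cache.

    Simplified textbook rules; bus traffic and write-backs are not modeled.
    """
    if event == "remote_write":
        return "I"                      # another core wrote: discard our copy
    if event == "local_write":
        return "M"                      # we hold the only, modified copy
    if event == "remote_read":
        # A valid copy becomes shared; an invalid one stays invalid.
        return "S" if state in ("M", "E", "S") else "I"
    if event == "local_read":
        if state == "I":                # miss: fill as shared or exclusive
            return "S" if others_have_copy else "E"
        return state                    # hit: no state change
    raise ValueError(f"unknown event: {event}")

state = "I"
for event in ["local_read", "remote_read", "remote_write"]:
    state = mesi_next(state, event)
    print(event, "->", state)
```

The last transition is the one that matters for this story: a write by any other core must knock the line out of our cache, and that only works if something knows our cache holds it.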
But the CPU was for a video game console and performance trumped all, so a new instruction was added: xdcbt. The normal PowerPC dcbt instruction was a typical prefetch instruction. The xdcbt instruction was an extended prefetch instruction that fetched straight from memory to the L1 d-cache, skipping L2. This meant that memory coherency was no longer guaranteed, but hey, we’re video game programmers, we know what we’re doing, it will be fine.
I wrote a widely used Xbox 360 memory copy routine that optionally used xdcbt. Prefetching the source data was crucial for performance. Normally the routine would use dcbt, but pass in the PREFETCH_EX flag and it would prefetch with xdcbt instead. This was not well thought out.
A game developer who was using this function reported weird crashes: heap corruption crashes, but the heap structures in the memory dumps looked normal. After staring at the crash dumps for a while I realized what a mistake I had made.
Memory that is prefetched with xdcbt is toxic. If it is written by another core before being flushed from L1 then two cores have different views of memory and there is no guarantee their views will ever converge. The Xbox 360 cache lines were 128 bytes and my copy routine’s prefetching went right to the end of the source memory, meaning that xdcbt was applied to some cache lines whose latter portions were part of adjacent data structures. Typically this was heap metadata; at least that’s where we saw the crashes. The incoherent core saw stale data (despite careful use of locks) and crashed, but the crash dump wrote out the actual contents of RAM so we couldn’t see what had happened.
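The failure mode can be reproduced in a toy model. The sketch below is an assumption-laden simplification: the `ToyCaches` class, its method names, and the write-through behavior are all invented for illustration and are not how the real hardware is organized. The only idea it tries to capture is that a prefetch which bypasses L2 leaves the L2 unaware of the L1 copy, so a later write from another core invalidates nothing.

```python
class ToyCaches:
    """Toy model: per-core L1 caches plus an L2 that tracks L1 sharers."""

    def __init__(self, num_cores=3):
        self.ram = {}                                  # line -> value
        self.l1 = [dict() for _ in range(num_cores)]   # per-core L1 contents
        self.tracked = {}                              # line -> cores L2 knows about

    def prefetch(self, core, line, bypass_l2=False):
        self.l1[core][line] = self.ram.get(line, 0)
        if not bypass_l2:
            # dcbt-like: the L2 records that this core holds the line.
            self.tracked.setdefault(line, set()).add(core)
        # xdcbt-like (bypass_l2=True): L1 is filled, L2 never hears about it.

    def load(self, core, line):
        if line not in self.l1[core]:
            self.prefetch(core, line)                  # ordinary coherent fill
        return self.l1[core][line]

    def store(self, core, line, value):
        # Invalidate only the copies the L2 knows about.
        for other in self.tracked.get(line, set()) - {core}:
            self.l1[other].pop(line, None)
        self.tracked[line] = {core}
        self.l1[core][line] = value
        self.ram[line] = value                         # write-through for simplicity

def value_seen_by_core0(bypass_l2):
    caches = ToyCaches()
    caches.prefetch(0, 5, bypass_l2=bypass_l2)  # core 0 pulls in line 5 (value 0)
    caches.store(1, 5, 99)                      # core 1 updates the same line
    return caches.load(0, 5)                    # what does core 0 read now?

print(value_seen_by_core0(bypass_l2=False))  # 99
print(value_seen_by_core0(bypass_l2=True))   # 0
```

In the second case RAM holds 99, which is what a crash dump would show, while core 0 keeps reading the stale 0 from its own L1.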
So the only safe way to use xdcbt was to be very careful not to prefetch even a single byte beyond the end of the buffer. I fixed my memory copy routine to avoid prefetching too far, but while waiting for the fix the game developer stopped passing the PREFETCH_EX flag and the crashes went away.
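The arithmetic behind "not even a single byte beyond the end" looks roughly like this. This is a sketch of the idea, not the actual fix that went into the copy routine; both helper names are hypothetical, and clamping at the start of the buffer as well as the end is my own assumption.

```python
LINE_SIZE = 128  # Xbox 360 cache line size, as stated above

def lines_in_range(start, size, line_size=LINE_SIZE):
    """Base addresses of every line touched by any byte of [start, start + size)."""
    if size <= 0:
        return []
    first = start - start % line_size
    last_byte = start + size - 1
    last = last_byte - last_byte % line_size
    return list(range(first, last + line_size, line_size))

def lines_fully_inside(start, size, line_size=LINE_SIZE):
    """Base addresses of lines whose bytes all lie inside the buffer."""
    first = -(-start // line_size) * line_size       # round start up to a line
    end = (start + size) // line_size * line_size    # round end down to a line
    return list(range(first, end, line_size))

# Hypothetical 300-byte source buffer that starts on a line boundary.
src, size = 0x1000, 300
print([hex(a) for a in lines_in_range(src, size)])      # ['0x1000', '0x1080', '0x1100']
print([hex(a) for a in lines_fully_inside(src, size)])  # ['0x1000', '0x1080']
```

The line at 0x1100 holds the last 44 bytes of the buffer plus 84 bytes of whatever comes next in memory, so a naive prefetch loop makes that neighboring data incoherent too. Only the lines in the second list are candidates for an L2-bypassing prefetch; the partial line has to go through the normal path.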
The real bug
So far so normal, right? Cocky game developers play with fire, fly too close to the sun, marry their mothers, and a game console almost misses Christmas.
But we caught it in time, we got away with it, and we were all set to ship the games and the console and go home happy.
And then the same game started crashing again.
The symptoms were identical. Except that the game was no longer using the xdcbt instruction. I could step through the code and see that. We had a serious problem.
I used the ancient debugging technique of staring at my screen with a blank mind, let the CPU pipelines fill my subconscious, and I suddenly realized the problem. A quick email to IBM confirmed my suspicion about a subtle internal CPU detail that I had never thought of before. And it’s the same culprit behind Meltdown and Spectre.
The Xbox 360 CPU is an in-order CPU. It’s pretty simple really, relying on its high frequency (not as high as hoped despite 10 FO4) for performance. But it does have a branch predictor, which its very long pipelines make necessary. Here’s a publicly shared CPU pipeline diagram I made (my cycle-accurate version is NDA only, but looky here) that shows all of the pipelines:
You can see the branch predictor, and you can see that the pipelines are very long (wide on the diagram), plenty long enough for mispredicted instructions to get up to speed, even with in-order processing.
So the branch predictor makes a prediction and the predicted instructions are fetched, decoded, and executed, but not retired until the prediction is known to be correct. Sound familiar? The realization I had (it was new to me at the time) was what it meant to speculatively execute a prefetch. The latencies were long, so it was important to get the prefetch transaction on the bus as soon as possible, and once a prefetch had been initiated there was no way to cancel it. So a speculatively executed xdcbt was identical to a real xdcbt! (A speculatively executed load instruction was just a prefetch, FWIW.)
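Here is a deliberately tiny model of that realization: one conditional branch guarding one prefetch, where the "taken" path is the one containing the prefetch. It is nowhere near cycle-accurate, and `run_branch` and its two result lists are invented for illustration. The point is only that squashing a mispredicted path rolls back architectural state, not a bus transaction that has already been issued.

```python
def run_branch(predicted_taken, actually_taken):
    """Model a branch whose taken path contains a single prefetch.

    Returns (retired, bus_requests): the instructions that architecturally
    executed, and the transactions that reached the memory bus.
    """
    retired = []
    bus_requests = []

    # The front end follows the prediction before the branch resolves.
    if predicted_taken:
        bus_requests.append("prefetch")      # issued early, cannot be recalled

    # The branch resolves. Wrong-path instructions are squashed and never
    # retire, but nothing removes the request already sent to the bus.
    if actually_taken:
        if not predicted_taken:
            bus_requests.append("prefetch")  # issued late, on the correct path
        retired.append("prefetch")

    return retired, bus_requests

print(run_branch(predicted_taken=True, actually_taken=False))
# ([], ['prefetch'])
```

In the mispredicted case nothing retires, so a debugger stepping through the code never sees the prefetch execute, yet the memory system has already acted on it.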
And that was the problem: the branch predictor would sometimes cause xdcbt instructions to be speculatively executed, and that was just as bad as really executing them. One of my coworkers (thanks Tracy!) suggested a clever test to verify this: replace every xdcbt in the game with a breakpoint. This achieved two things:
- The breakpoints were not hit, thus proving that the game was not executing xdcbt instructions.
- The crashes went away.
I knew that would be the result and yet it was still amazing. All these years later, and even after reading about Meltdown, it’s still nerdy cool to see solid proof that instructions that were not executed were causing crashes.
The branch predictor realization made it clear that this instruction was too dangerous to have anywhere in the code segment of any game, because controlling when an instruction might be speculatively executed is too difficult. The branch predictor for indirect branches could, theoretically, predict any address, so there was no “safe place” to put an xdcbt instruction. And if speculatively executed, it would happily do an extended prefetch of whatever memory the specified registers happened to randomly contain. It was possible to reduce the risk, but not eliminate it, and it just wasn’t worth it. While Xbox 360 architecture discussions continue to mention the instruction, I doubt that any games ever shipped with it.
I mentioned this once during a job interview (“describe the toughest bug you’ve had to investigate”) and the interviewer’s reaction was “yeah, we hit something similar on the Alpha processor”. The more things change…
Thanks to Michael for some editing.