
Mark Walton
While AMD’s new Ryzen processors offer impressive performance for workloads like software compilation, media encoding, 3D rendering, and indeed anything that can take advantage of their 8 cores and 16 concurrent threads, certain aspects of their gaming performance were uneven.
Ryzen still performs very well in games, especially for those who like to stream their gameplay to Twitch, but not consistently. Some games that were expected to perform well on Ryzen didn’t. Testers also noted some tricky interactions with both power management and Ryzen’s simultaneous multithreading (SMT), with certain titles showing unexpected performance swings depending on whether these features were enabled. There was widespread hope that a combination of game patches and perhaps even OS changes would improve Ryzen’s gaming performance, or at least make it more consistent.
Over the past few weeks, a number of game patches designed to address certain Ryzen issues have been released. AMD has also released guidance for game developers on how best to use the processor, as well as a new power management profile for Windows 10. Together, these offer some insight into the complexities of developing game software for modern processors, and into what performance gains gamers can hope to see.
Patches make everything better
The first major Ryzen patch was for Ashes of the Singularity. Ryzen’s performance in Ashes was perhaps one of the more surprising findings in the initial benchmarking. The game has been widely used as a sort of showcase for the benefits of DirectX 12 and the multithreaded scaling it enables. We spoke to the developers of the game, and they told us that the engine automatically splits the work it needs to do across multiple cores.
Overall, the Ryzen 1800X performed at about the same level as Intel’s Broadwell-E 6900K. Both parts are 8-core, 16-thread chips, and while Broadwell-E has a modest per-cycle advantage on most workloads, Ryzen’s higher clock speed is enough to make up for that shortfall. But in Ashes of the Singularity under DirectX 12, the 6900K’s average frame rates were about 25 percent better than the AMD chip’s.
In late March, Oxide/Stardock released a Ryzen performance update for Ashes, and it goes a long way toward closing that gap. PC Perspective tested the update: depending on graphics settings and memory clock speeds, Ryzen’s average frame rate increased by 17 to 31 percent. The 1800X still lags behind the 6900K, but the gap is now about 9 percent, or even less with overclocked memory (but we’ll talk more about memory later).

It’s not entirely clear what Oxide and Stardock changed in the patch (we’ve asked but are still waiting for an answer), but there is credible speculation that two (possibly intertwined) issues are at play, both related to how data is loaded from and stored to memory.
Out-of-order execution is a complicated thing
While much has been made of Ryzen’s cache layout, especially its large, split level 3 cache, the Ashes changes are believed not to concern the cache, but rather the processor’s load and store queues. The processor does not simply read and write directly to and from cache or memory. Instead, reads (loads) and writes (stores) are buffered. This is because the processor executes instructions speculatively and out of order, but the results of that execution, the actual reads and writes to memory, must occur in the order the program specifies, and speculative writes that should never happen must be canceled. The buffers are where all this is managed.
For example, branch prediction means that the processor can start executing a series of instructions without being sure whether it should have skipped them instead. When those instructions write to memory, the writes are placed in the store buffer. If the processor then determines that the branch predictor was correct, the stores can be retired and the data written to memory. But if it finds that the branch predictor was wrong, and the instructions should never have been executed at all, it can invalidate the stores in the store buffer and cancel the writes to memory before any other core can see them.
The processor can also use the store buffer to satisfy load requests: if a store in the buffer must come before a load from the same address, the buffered store can be used to provide the value that would otherwise be read from memory, a process called store forwarding.
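To make the idea concrete, here is a minimal C sketch (our illustration, not code from any of the parties involved) of the kind of sequence where store forwarding applies:

    /* A store immediately followed by a load from the same address.
       The load can be satisfied straight from the store buffer
       (store forwarding) instead of waiting for the write to reach
       the cache. */
    int square_and_read(int *slot, int value)
    {
        *slot = value * value;   /* write lands in the store buffer    */
        return *slot;            /* load is forwarded from that buffer */
    }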
Managing these buffers and their interactions with out-of-order execution is complex. The processor must ensure that, for example, writes to the same location are handled correctly and that the writes appear in the correct order.
Certain sequences of instructions can cause performance problems. Intel’s optimization guides contain tables showing which combinations can be forwarded and which cannot; the exact results depend not only on the chip’s architecture, but also on the size of the store and the memory addresses used. The patterns are not always simple. For example, with a 32-byte store, a 4-byte load can be forwarded if the load’s memory address divided by 32 has a remainder of 0 to 4, 8 to 12, 16 to 20, or 24 to 28. But if the remainder is 5 to 7, 13 to 15, 21 to 23, or 29 to 31, the store will not be forwarded.
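As a quick check of the arithmetic, that rule can be encoded in a few lines of C (a worked illustration of this one table entry, not a general forwarding predicate; the real rules vary by processor):

    #include <stdbool.h>

    /* For a 32-byte store, can a 4-byte load at this address be
       forwarded? Forwarding succeeds only when the remainder modulo
       32 is 0-4, 8-12, 16-20, or 24-28. */
    bool can_forward_4byte_load(unsigned long address)
    {
        unsigned remainder = address % 32;
        return (remainder % 8) <= 4;   /* 5-7 within each 8-byte chunk fails */
    }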
Optimizing compilers are expected to know rules like store forwarding and to strive to produce code that follows them as closely as possible. When they get it wrong, the result can be poor performance. Sometimes this is unavoidable, but often the compiler has several options for generating equivalent code and has to figure out which one is best.
Reportedly, the Visual C++ 2015 compiler could produce sequences of two stores followed by a load in such a way that the store queue stalls and writes to memory are blocked until the processor can drain its pending instructions. Visual C++ 2017 has a new optimizer, which apparently avoids the bad instruction ordering, and hence the performance problem.
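We don’t know the exact instruction sequence the old compiler emitted, but a classic shape that defeats store forwarding, two narrow stores followed by one wider load spanning both, looks something like this in C (a hypothetical sketch, not the actual compiler output):

    #include <stdint.h>
    #include <string.h>

    uint32_t two_stores_then_wide_load(void)
    {
        uint16_t halves[2];
        uint32_t whole;

        halves[0] = 0x1234;          /* first 2-byte store             */
        halves[1] = 0x5678;          /* second 2-byte store            */
        memcpy(&whole, halves, 4);   /* 4-byte load spanning both: no
                                        single buffered store can
                                        supply it, so the load waits
                                        for the stores to drain        */
        return whole;
    }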
Bypassing the cache: sometimes it’s good, sometimes it’s terrible

The other suggestion about the Ashes patch concerns the use of a set of instructions called non-temporal instructions. These are load and store instructions designed to bypass the cache.
For most data, the cache is a great thing, because the cache is so much faster than main memory. But sometimes the programmer knows that after reading from or writing to a certain memory address, the data will not be used again any time soon, so there is no point in caching it. In fact, caching that data would be a waste of cache space; it just means evicting something else from the cache, something else that might actually be needed.
The non-temporal instructions let the processor store data to main memory, bypassing the cache on the way out. They also have a number of other properties. Non-temporal stores are write-combining: multiple writes to the same 64-byte memory line are combined into a single 64-byte write. They are write-collapsing: multiple writes to the same byte are collapsed into a single write (so that only the last value written is ever visible to other processors). They are also weakly ordered: non-temporal writes do not interact with the normal store and load buffers, and thus may appear in memory in an order that does not match the program’s write order.
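For concreteness, here is a minimal sketch of a non-temporal fill using the standard SSE2 streaming-store intrinsic (illustrative; we don’t know how the game itself uses these instructions):

    #include <emmintrin.h>
    #include <stddef.h>

    /* Zero a buffer with non-temporal stores, bypassing the cache.
       Assumes dst is 16-byte aligned and bytes is a multiple of 16. */
    void stream_zero(void *dst, size_t bytes)
    {
        __m128i zero = _mm_setzero_si128();
        char *p = (char *)dst;

        for (size_t i = 0; i < bytes; i += 16)
            _mm_stream_si128((__m128i *)(p + i), zero);   /* 16-byte NT store */

        _mm_sfence();   /* the stores are weakly ordered; fence before
                           other threads are allowed to read the data */
    }

Each group of four consecutive 16-byte stores fills one 64-byte cache line, so the write-combining buffers can flush whole lines at a time.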
When used correctly, they can provide some of the fastest possible writes to memory, 64 bytes at a time, without disturbing valuable cached data. But used incorrectly, performance can fall off a cliff. Non-temporal writes are buffered in buffers the size of a cache line, and there are only a limited number of these buffers. If a program tries to perform non-temporal writes to too many different cache lines at once, the processor ends up having to perform partial cache line writes instead of nice big 64-byte writes, and performance drops. If a program mixes regular and non-temporal stores on the same cache line, performance drops. If a program mixes loads with non-temporal stores to the same cache line, performance drops. And although the writes are meant to be collapsed, on at least some processors, writing a byte on a cache line more than once degrades performance.
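The first two of those failure modes are easy to stumble into. A sketch of the anti-pattern, mixing an ordinary store with a non-temporal store on the same cache line (again our illustration, not anything from the game):

    #include <emmintrin.h>

    /* Anti-pattern: a regular cached store and a non-temporal store
       land on the same 64-byte line, forcing partial-line writes on
       some processors (and worse behavior on others). */
    void mixed_stores(char *line)   /* line assumed 64-byte aligned */
    {
        line[0] = 1;                                 /* regular, cached store */
        _mm_stream_si128((__m128i *)(line + 16),     /* non-temporal store to */
                         _mm_setzero_si128());       /* the same cache line   */
        _mm_sfence();
    }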
The nature of the performance hit also varies by processor. A penalty that is negligible on one chip may be significant on another; some processors appear to flush their caches entirely when non-temporal instructions are used “poorly,” which, unsurprisingly, saps performance. The belief is that Ashes of the Singularity did something with non-temporal instructions that was harmless, or perhaps even desirable, on other chips, but particularly detrimental on Ryzen. The performance update changes how the instructions are used to avoid the problem.
Most of the non-temporal instructions are stores, but there is also a non-temporal load instruction and a set of prefetch instructions with a non-temporal variant. These are meant to load data into some cache levels without loading it into others. The precise meaning of the prefetch instructions varies from processor to processor; they are at best hints rather than well-defined instructions. Placing prefetch instructions correctly is difficult: issue one too early, and the prefetched data will have been thrown out of the cache by the time it’s needed; too late, and the data still won’t be loaded when it’s needed. Worse, prefetched data can displace data from the cache that would actually have been used, decreasing performance.
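A typical use of the non-temporal prefetch hint looks like this (illustrative; as noted above, picking the right prefetch distance is the hard part, and the hint may mean something different on every processor):

    #include <xmmintrin.h>
    #include <stddef.h>

    /* Sum an array that will be read exactly once: hint the processor
       to fetch data ahead of time without letting it displace more
       valuable cache contents. */
    long sum_read_once(const long *data, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)   /* prefetch a fixed distance ahead */
                _mm_prefetch((const char *)&data[i + 16], _MM_HINT_NTA);
            total += data[i];
        }
        return total;
    }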