Feature Story

Exploring how Cache Coherency Accelerates Heterogeneous Compute

by Neil Parris

The Heterogeneous System Architecture (HSA) Foundation is a not-for-profit consortium of SoC IP vendors, OEMs, academia, SoC vendors, OSVs and ISVs whose goal is to make it easier for software developers to take advantage of all the advanced processing hardware on a modern SoC. The CPU and GPU on a typical applications processor occupy a significant proportion of die area, and applying these resources efficiently across multiple applications can improve the end user experience. Done right, efficiency can be gained in power, performance, programmability and portability.

This blog focuses on some of the hardware innovations and changes that are relevant to shared virtual memory and cache coherency, which are components of the HSA hardware specification.

What is Shared Virtual Memory?

Traditional memory systems defined separate memory for CPU and GPU. In the case of PCs, the GPU may have completely separate discrete memory chips on different boards. In these systems, any application that wants to share data between CPU and GPU will need to copy it from CPU memory to graphics memory at a significant cost of latency and power.

Mobile systems have had a unified memory system for many years where all processors can access the same physical memory. However, even though this is physically possible, the software APIs and memory management hardware and software may not allow this. Graphics buffers may still be defined separately from other memory regions and data sharing may still require an expensive copy of data between buffers.

Shared virtual memory (SVM) allows processors to see the same view of memory; specifically, the same virtual address on the CPU and GPU will point to the same physical memory location. With this architecture, an application only needs to pass a pointer between processors that are sharing data.

There are multiple ways to implement SVM, and the processors do not have to share the exact same page table. The only requirement is that a buffer to be shared between processors must appear in the page tables of both memory management units (MMUs). Once that is in place, sharing data is as simple as passing a pointer.
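As a rough illustration of that requirement, the sketch below models two independent page tables as Python dictionaries. It is a toy model only: the page size, the page numbers and the helper names (`translate`, `gpu_can_use`) are assumptions made for this example and do not come from any HSA or MMU specification.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def translate(page_table, vaddr):
    """Translate a virtual address using a {virtual page: physical page} map."""
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    if vpage not in page_table:
        raise LookupError(f"page fault at {vaddr:#x}")
    return page_table[vpage] * PAGE_SIZE + offset


# Two separate page tables, one per MMU. Only the shared buffer
# (virtual page 0x10) is mapped in both. All numbers are made up.
cpu_page_table = {0x10: 0x80, 0x11: 0x93}
gpu_page_table = {0x10: 0x80}

shared_ptr = 0x10 * PAGE_SIZE + 0x24  # pointer into the shared buffer
private_ptr = 0x11 * PAGE_SIZE        # CPU-only data, not mapped for the GPU


def gpu_can_use(ptr):
    """True if the pointer resolves to the same physical address on both sides."""
    try:
        return translate(gpu_page_table, ptr) == translate(cpu_page_table, ptr)
    except LookupError:
        return False
```

In this model `gpu_can_use(shared_ptr)` is true while `gpu_can_use(private_ptr)` is false, which mirrors the point that only buffers present in both page tables can be shared by pointer.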

So What is Cache Coherency?

Let's go back to basics: what does coherency mean? Coherency is about ensuring that all processors, or bus masters, in the system see the same data. For example, if a processor creates a data structure in its local cache and then passes it to a GPU, both the processor and the GPU must see the same data. If the new data is still held only in the processor's cache and the GPU reads from external DDR, the GPU will read old, stale data.

There are three mechanisms to maintain coherency:

- Disabling caching is the simplest mechanism but may cost significant processor performance. To get the highest performance, processors are pipelined to run at high frequency and access caches which offer very low latency. Caching data that is accessed multiple times increases performance significantly and reduces DRAM accesses and power. Marking data as "non-cached" could impact performance and power, and in practice it is not used unless there is no other choice.
- Software managed coherency is the traditional solution to the data sharing problem. Here the software, usually device drivers, must clean dirty data from caches and invalidate old data to enable sharing with other processors or masters in the system. This takes processor cycles, bus bandwidth, and power.
- Hardware managed coherency offers an alternative that simplifies software. With this solution any cached data marked 'shared' will always be up to date, automatically. All processors and bus masters in that sharing domain see the exact same value.
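The stale-data problem, and what hardware managed coherency changes, can be sketched with a very small model. This is an illustration under simplifying assumptions, not a description of any real cache: the dictionaries `dram` and `cpu_cache` and the function names are hypothetical, and a real snoop happens in the interconnect rather than in software.

```python
dram = {0x1000: "old"}  # external memory contents
cpu_cache = {}          # write-back cache: address -> (value, dirty flag)


def cpu_write(addr, value):
    """A cached write stays in the CPU cache and is only marked dirty."""
    cpu_cache[addr] = (value, True)


def gpu_read_non_coherent(addr):
    """Without coherency the GPU goes straight to DRAM."""
    return dram[addr]


def gpu_read_coherent(addr):
    """With hardware coherency the CPU cache is snooped before DRAM."""
    if addr in cpu_cache:
        return cpu_cache[addr][0]
    return dram[addr]


cpu_write(0x1000, "new")
```

After the write, `gpu_read_non_coherent(0x1000)` still returns the old value from DRAM, whereas `gpu_read_coherent(0x1000)` returns the value held in the CPU cache.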

Challenges with Software Coherency

A cache stores external memory contents close to the processor to reduce the latency and power of accesses. On-chip memory accesses are significantly lower power than external DRAM accesses.

Software managed coherency manages cache contents with two key mechanisms:

Cache Cleaning:
- If any data stored in a cache is modified, it is marked as 'dirty' and must be written back to DRAM at some point in the future. The process of cleaning will force dirty data to be written to external memory. There are two ways to do this: 1) clean the whole cache which would impact all applications, or 2) clean specific addresses one by one. Both are very expensive in CPU cycles.
- With modern multi-core systems this cache cleaning must happen on all cores.
Cache Invalidation:
- If a processor has a local copy of data but an external agent updates main memory, then the cache contents are out of date, or 'stale'. Before reading this data the processor must remove the stale data from its caches. This is known as 'invalidation' (a cache line is marked invalid).
- An example is a region of memory used as a shared buffer for network traffic which may be updated by a network interface DMA hardware; a processor wishing to access this data must invalidate any old stale copy before reading the new data.
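The two maintenance operations above can be sketched with a minimal write-back cache model. The class `WriteBackCache` and the addresses and values are assumptions for illustration; real cache maintenance is done with architecture-specific instructions, usually inside device drivers.

```python
class WriteBackCache:
    """Toy write-back cache in front of a shared DRAM dictionary."""

    def __init__(self, dram):
        self.dram = dram
        self.lines = {}  # address -> [value, dirty flag]

    def read(self, addr):
        # On a miss, fill the line from DRAM; on a hit, return the cached copy.
        if addr not in self.lines:
            self.lines[addr] = [self.dram[addr], False]
        return self.lines[addr][0]

    def write(self, addr, value):
        # Writes stay in the cache and mark the line dirty.
        self.lines[addr] = [value, True]

    def clean(self, addr):
        # Cleaning writes a dirty line back to DRAM.
        line = self.lines.get(addr)
        if line and line[1]:
            self.dram[addr] = line[0]
            line[1] = False

    def invalidate(self, addr):
        # Invalidation drops the local copy so the next read refetches it.
        self.lines.pop(addr, None)


dram = {0x2000: 0}
cpu = WriteBackCache(dram)

cpu.read(0x2000)           # CPU caches the old buffer contents
dram[0x2000] = 42          # a DMA engine writes new data directly to DRAM
stale = cpu.read(0x2000)   # CPU still sees its cached copy
cpu.invalidate(0x2000)
fresh = cpu.read(0x2000)   # refetched from DRAM after invalidation

cpu.write(0x2000, 7)       # CPU produces data for another master
before_clean = dram[0x2000]
cpu.clean(0x2000)
after_clean = dram[0x2000]
```

Forgetting either step in this model reproduces the failure modes described in the text: a missing invalidate leaves the CPU reading stale data, and a missing clean leaves the other master reading stale DRAM.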

Complexity of Software Coherency

"We would like to connect more devices with hardware coherency to simplify software and accelerate product schedules"

"50% of debug time is spent on SW coherency issues as these are difficult to find and pinpoint"

Quotes from a system architect at an application processor vendor.

Software coherency is hard to debug because the cache cleaning and invalidation must be done at the right time. If done too often, it wastes power and CPU effort. If done too infrequently, it results in stale data which may cause unpredictable application behavior, if not a crash. Debugging this is extremely difficult because it shows up only as occasional data corruption.

Looking specifically at CPU and GPU sharing, this software cache maintenance is difficult to optimize, and applications on these systems will try to avoid sharing data because of the cost and complexity. One middleware vendor using GPU compute with software coherency noted that its developers spent around 30% of their time architecting, implementing and debugging the data sharing, including breaking down image data into sub-frames and carefully timing the mapping and unmapping functions.

When sharing is used with software coherency, the size of the task running on the GPU must be large enough to make it worthwhile, taking into account the cost of software coherency.
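A back-of-the-envelope way to reason about this threshold is sketched below. The model and every number in it are hypothetical: it assumes the GPU runs the task a fixed factor faster than the CPU and that cache maintenance adds a fixed overhead per offload, which is a simplification of real systems.

```python
def gpu_total_time_us(cpu_time_us, gpu_speedup, maintenance_us):
    """Time to run on the GPU, including software coherency overhead."""
    return cpu_time_us / gpu_speedup + maintenance_us


def offload_is_worthwhile(cpu_time_us, gpu_speedup, maintenance_us):
    """Offload only pays off if the GPU path, overhead included, is faster."""
    return gpu_total_time_us(cpu_time_us, gpu_speedup, maintenance_us) < cpu_time_us


def break_even_cpu_time_us(gpu_speedup, maintenance_us):
    """Smallest CPU task time at which offloading starts to win (speedup > 1)."""
    return maintenance_us / (1 - 1 / gpu_speedup)


# Hypothetical figures: 4x GPU speedup, 200 us of cache maintenance per offload.
small_task = offload_is_worthwhile(100, 4, 200)    # 25 + 200 us vs 100 us
large_task = offload_is_worthwhile(1000, 4, 200)   # 250 + 200 us vs 1000 us
threshold = break_even_cpu_time_us(4, 200)
```

With these made-up figures the small task is not worth offloading and the large one is. Reducing the maintenance overhead, which is what hardware coherency aims to do, lowers the break-even point so that smaller tasks become worth sharing.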

Hardware Coherency Requires an Advanced Bus Protocol

Extending hardware coherency to the system requires a coherent bus protocol, and in 2011 ARM® released the AMBA® 4 ACE specification which introduces the "AXI Coherency Extensions" on top of the popular AXI protocol. The full ACE interface allows hardware coherency between processor clusters and allows an SMP operating system to extend to more cores.

With the example of two clusters, any shared access to memory can 'snoop' into the other cluster's caches to see if the data is already on chip; if not, it is fetched from external memory (DDR). In mobile, this has enabled the big.LITTLE™ processing model which improves performance and power efficiency by utilizing the right core to suit the size of the task.

The AMBA 4 ACE-Lite interface is designed for IO (or one-way) coherent system masters like DMA engines, network interfaces and accelerators. These devices may not have any caches of their own, but they can read shared data from the ACE processors. Alternatively, they may have caches but these would still need to be cleaned and invalidated by software.

While hardware coherency may add some complexity to the interconnect and processors, it massively simplifies the software and enables applications that would not be possible with software coherency such as big.LITTLE processing.

Adding Hardware Coherency to the GPU

While processor clusters have implemented cache coherency protocols for many years, this is a new area for GPUs. As applications look to share more data between CPU and GPU, hardware cache coherency ensures this can be done at a low cost in power and latency, making data sharing more power efficient and higher performance than any software managed mechanism. Most importantly, it makes it easy for the software developer to share data.

There are two ways a GPU could be connected with hardware coherency:

- IO coherency (also known as one-way coherency) using ACE-Lite, where the GPU can read from CPU caches. Examples include the ARM Mali™-T600, 700 and 800 series GPUs.
- Full coherency using full ACE, where CPU and GPU can see each other's caches.

The Powerful Combination of SVM and Hardware Coherency

The following diagrams summarize what we've learned so far and also describe coarse-grained and fine-grained shared virtual memory. These charts approximate elapsed time on the horizontal axis, and address space on the vertical axis.

The above chart shows traditional memory systems where software coherency required data to be cleaned from caches and copied between processors to 'share' the data. In addition to cache cleaning, the target cache would also need to invalidate any old data before reading new data from DRAM. This is time consuming and power hungry, and it limits the applications that can take advantage of heterogeneous processing.

With shared virtual memory the CPU and GPU can now share physical memory and operate on the same virtual address, which eliminates the copy. If we have an IO coherent GPU, in other words one-way coherent where GPU can read CPU caches, then we remove the need to clean data from CPU caches. However, because this is one-way, the CPU cannot see the GPU caches. This means the GPU caches must be cleaned with cache maintenance operations after processing completes. This 'coarse-grain' SVM means the processors must take turns accessing the shared buffer.

Finally, if we enable a fully coherent memory system then both CPU and GPU can see exactly the same data at all times, and we can use 'fine-grained' SVM. This means both processors can access the same buffer at the same time instead of taking turns. Handshaking between processors uses cross-device atomics. By removing all of the cache maintenance overheads we can get the best overall performance.

Connecting Hardware with Software: Compute APIs

At this point it's useful to map these hardware technologies to the software APIs. Compute APIs like OpenCL 2.0 can take full advantage of SVM and hardware coherency, and can run on HSA platforms. Not all OpenCL 2.0 implementations are the same; there are a number of optional features that can be enabled if the hardware supports them. These features can also be mapped to the HSA profiles, base profile and full profile, as shown in the table below.

OpenCL Feature	Shared Virtual Memory	Fully Coherent Memory	HSA Profile
Fine Grained Buffer	Required, buffer level	Required, fully coherent	Base Profile
Fine Grained System	Required, full memory	Required, fully coherent	Full Profile
Coarse Grain	Required, buffer level	Not required (legacy, software or IO coherency)

HSA always requires hardware coherency, and with the base profile the scope of shared virtual memory can be limited to the shared buffers. This means only the shared buffers would appear in both CPU and GPU page tables, not the full system memory. This may be easier and lower cost to implement in hardware.

Full coherency is required for fine grain, and this enables both CPU and GPU to work on different addresses within the same data buffer at the same time.

Full coherency also allows the use of atomic operations, which allow processors to work on the same address within the same buffer. Atomic operations allow synchronization between threads, much like in a multi-core CPU. Atomics are optional for OpenCL but required for HSA.
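The pattern of two processors working in the same buffer at the same time, coordinated through an atomic on a single shared address, can be sketched with ordinary threads. This is only an analogy: the two Python threads stand in for the CPU and GPU, and the `AtomicCounter` class emulates an atomic fetch-and-add with a lock because Python has no hardware atomics. It is not OpenCL or HSA code.

```python
import threading


class AtomicCounter:
    """Emulates an atomic fetch-and-add on one shared location using a lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount=1):
        with self._lock:
            old = self._value
            self._value += amount
            return old


def worker(name, buffer, next_index, owners):
    """Claim elements one at a time until the shared buffer is exhausted."""
    while True:
        i = next_index.fetch_add()  # both workers hit the same counter
        if i >= len(buffer):
            return
        buffer[i] *= 2              # each element is claimed exactly once
        owners[i] = name


buffer = list(range(1000))
owners = [None] * len(buffer)
next_index = AtomicCounter()

threads = [
    threading.Thread(target=worker, args=(name, buffer, next_index, owners))
    for name in ("cpu", "gpu")
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because every index is handed out exactly once by the counter, each element is doubled exactly once regardless of how the two workers interleave. How the work splits between them varies from run to run.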

For coarse grain, if hardware coherency is not present then it would need to use software managed coherency including cache maintenance operations, or optionally IO coherency for the GPU.
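One way to read the table and the paragraphs above is as a simple lookup from hardware capabilities to the OpenCL SVM feature and HSA profile. The function below is a sketch of that reading only; the name `classify_svm` and its parameters are made up for this example and are not part of the OpenCL or HSA APIs.

```python
def classify_svm(svm_scope, fully_coherent):
    """Map capabilities to (OpenCL SVM feature, HSA profile or None).

    svm_scope: "buffer" if only shared buffers appear in both page tables,
               "system" if the full system memory is shared.
    fully_coherent: True if CPU and GPU are fully hardware coherent.
    """
    if svm_scope not in ("buffer", "system"):
        raise ValueError(f"unknown SVM scope: {svm_scope!r}")
    if not fully_coherent:
        # Software managed or IO coherency: the table lists no HSA profile.
        return ("Coarse Grain", None)
    if svm_scope == "system":
        return ("Fine Grained System", "Full Profile")
    return ("Fine Grained Buffer", "Base Profile")
```

For example, `classify_svm("buffer", True)` gives the fine grained buffer feature with the base profile, while any configuration without full coherency falls back to coarse grain with no HSA profile.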

Hardware Requirements for Cache Coherency and Shared Virtual Memory

The hardware required to implement these technologies already exists today in the form of fully coherent processors and cache coherent interconnects. The interconnect is responsible for connecting processors, peripherals and memory together on the system on chip (SoC). The AMD Kaveri APU already has fully coherent memory between the CPU and GPU. ARM offers IP such as the CoreLink™ CCI-550 Cache Coherent Interconnect, Cortex®-A72 processor and the Mali Mimir GPU, which together support the full coherency and shared virtual memory techniques described above.

Interconnect innovations, such as snoop filters, are essential to support scaling to higher performance memory systems. The snoop filter acts as a directory of processor cache contents and allows any memory access to be targeted directly to the processor that holds that data. More detail on this can be found on the ARM community pages.
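The directory idea behind a snoop filter can be sketched as a map from cache line address to the set of caches that hold that line. This is a conceptual model with hypothetical names (`SnoopFilter`, `record_fill`, `record_evict`, `snoop_targets`); it does not describe how any particular interconnect implements its filter.

```python
class SnoopFilter:
    """Toy directory: cache line address -> set of caches holding the line."""

    def __init__(self):
        self.directory = {}

    def record_fill(self, cache, addr):
        """Note that a cache has allocated the line."""
        self.directory.setdefault(addr, set()).add(cache)

    def record_evict(self, cache, addr):
        """Note that a cache has dropped the line."""
        holders = self.directory.get(addr)
        if holders:
            holders.discard(cache)
            if not holders:
                del self.directory[addr]

    def snoop_targets(self, requester, addr):
        """Caches that must be snooped for this access (never the requester)."""
        return self.directory.get(addr, set()) - {requester}


sf = SnoopFilter()
sf.record_fill("cpu_cluster0", 0x40)
sf.record_fill("gpu", 0x40)

targets_hit = sf.snoop_targets("gpu", 0x40)    # only cpu_cluster0 holds it too
targets_miss = sf.snoop_targets("gpu", 0x80)   # nobody holds it: go to memory

sf.record_evict("cpu_cluster0", 0x40)
targets_after_evict = sf.snoop_targets("gpu", 0x40)
```

The benefit shown here is that an access to a line nobody else holds produces an empty target set, so no cache needs to be snooped and the request can go directly to memory.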

Cache Coherency Brings Heterogeneous Compute One Step Closer

HSA, with full coherency and shared virtual memory, is all about delivering new, enhanced user experiences through advances in computing architectures that bring improvements across key areas:

- performance
- power efficiency
- reduced software complexity

Application developers now have access to the complete compute potential of an SoC, where workloads can be moved seamlessly between computing devices, enabling right-sized computing for the given task.
