Feature Story

AMD, ARM, Imagination, Samsung Alliance Publish Official Shared GPU-CPU Blueprints

HSA Foundation rolls out 1.0 spec to grab devs' attention

An effort to tightly knit together graphics chips, processors and other hardware to boost things like video search on your desktop has taken a step forward.

The HSA Foundation today officially published version 1.0 of its Heterogeneous System Architecture specification, which (if we were being flippant) describes how GPUs, DSPs and CPUs can share the same physical memory and pass pointers to one another. (A provisional 1.0 version went live in August 2014.)

In other words, general compute, graphics and digital-signal processors can each directly address the same banks of physical RAM in a cache-coherent manner and work on data at the same time, without having to go over tedious external buses and loosely coupled interconnects. A GPU and CPU can work on the same bits of memory in an application in a multi-threaded way, for example, which is supposed to increase performance and shave energy use. The spec refers to GPUs and DSPs as "kernel agents."
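To make the shared-memory idea concrete, here is a minimal Python sketch, not HSA code: two ordinary threads stand in for a CPU and a kernel agent. The function names (double_in_place, run_shared, run_copied) are hypothetical and exist only for this illustration. The first model hands both workers a view onto one buffer; the second copies the agent's half out and back, as a discrete device would.

```python
import threading

def double_in_place(view, start, stop):
    """Double each byte in view[start:stop], wrapping at 256."""
    for i in range(start, stop):
        view[i] = (view[i] * 2) % 256

def run_shared(data):
    """Shared-memory model: both workers get a view onto the same buffer."""
    mid = len(data) // 2
    with memoryview(data) as shared:   # a reference to data, not a copy
        cpu = threading.Thread(target=double_in_place, args=(shared, 0, mid))
        agent = threading.Thread(target=double_in_place,
                                 args=(shared, mid, len(data)))
        cpu.start()
        agent.start()
        cpu.join()
        agent.join()
    return 0                           # bytes copied between the workers

def run_copied(data):
    """Discrete-device model: the agent's half is copied out and back."""
    mid = len(data) // 2
    staging = bytearray(data[mid:])    # copy to stand-in "device memory"
    double_in_place(staging, 0, len(staging))
    double_in_place(data, 0, mid)      # CPU works on its half locally
    data[mid:] = staging               # copy the results back
    return 2 * len(staging)            # bytes moved across the stand-in "bus"
```

Both functions leave the buffer in the same final state; the difference is the return value, which counts how many bytes had to be shuffled between the two workers.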

There's more to the spec than that, of course: the blueprints support 64-bit and 32-bit models, and lay out virtual memory mappings, memory coherency, message passing, and more. The specification sets out a programming model, and requirements for the hardware.

What's crucial is that this is a standard: system-on-chips in the embedded and handheld world will map on-die GPUs, DSPs and other peripherals into their physical memory maps, although where device registers appear in memory, and the organization of the RAM, will vary wildly across SoCs.

The HSA 1.0 spec is supposed to wrangle all that under one standard definition, making it much easier for programmers to produce portable code, and it's encouraging that big names are members of the foundation: AMD, ARM, Imagination Technologies, MediaTek, Qualcomm, Samsung, and others. Imagination is the biz behind the PowerVR GPUs in various Apple iPhones and iPads; ARM designs the CPU cores in the vast majority of handheld things; Samsung is everywhere, and, well, you get the picture.

It's encouraging because it means the standard has a chance of being adopted in a good number of devices and computers, potentially reaching a critical mass so that software developers can justify building HSA-compliant games and tools knowing there's enough of a userbase out there to take advantage of the technology.

One major stumbling point here is the absence of Intel and Nvidia from all of this, which is why AMD and its pals are steering their shared-memory architecture through the gaps in the Invidia empire: smartphones, tablets, future consoles, and so on.

There's no word on exactly what will support HSA this or next year, apart from AMD's 28nm Carrizo: this stuffs four Excavator x86 cores and eight Radeon GPU cores into a processor package aimed at touchscreen laptops, and is HSA 1.0 compliant. It won't ship until later this year. We're told meetings between HSA members focus on technical issues, and must avoid discussing product announcements and rollouts, or else the foundation will start to look like a cartel. Having said that, seeing HSA 1.0-compliant stuff on the shelves next year is likely.

The specification's programmers' reference manual starts out with:

The HSA system architecture defines a consistent base for building portable applications that access the power and performance benefits of the dedicated kernel agents. Many of these kernel agents, including GPUs and DSPs, are capable and flexible processors that have been extended with special hardware for accelerating parallel code. Historically these devices have been difficult to program due to a need for specialized or proprietary programming languages. HSA aims to bring the benefits of these kernel agents to mainstream programming languages using similar or identical syntax to that which is provided for programming multi-core CPUs.

We're told software developers can use C, C++, Fortran, Java, and Python to write HSA-compliant applications: the code is compiled down into an intermediate language called HSAIL, which is then converted by a finalizer into a binary executable for a particular hardware target. According to the documentation, "different implementations can choose to invoke the finalizer at various times: statically at the same time the application is built, when the application is installed, when it is loaded, or even during execution."
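The "finalize at various times" idea can be modelled with a toy lifecycle in Python. This is a sketch under stated assumptions, not the real toolchain: FinalizeAt, App, finalize and run_lifecycle are hypothetical names, the "HSAIL" is just a string, and the target name "gpu-x" is made up. It only shows that the same portable code can be turned into target-specific code at whichever stage the implementation chooses.

```python
from enum import Enum

class FinalizeAt(Enum):
    """Stages at which an implementation might run the finalizer."""
    BUILD = "build"
    INSTALL = "install"
    LOAD = "load"
    EXECUTION = "execution"

LIFECYCLE = [FinalizeAt.BUILD, FinalizeAt.INSTALL,
             FinalizeAt.LOAD, FinalizeAt.EXECUTION]

def finalize(hsail, target):
    """Toy finalizer: tag the portable code with a (made-up) target name."""
    return f"{target}:{hsail}"

class App:
    def __init__(self, hsail, policy):
        self.hsail = hsail      # portable intermediate code (a string here)
        self.policy = policy    # the stage at which to finalize
        self.binary = None      # target-specific code, once finalized
        self.log = []

    def advance(self, stage, target):
        """Pass through one lifecycle stage, finalizing if it is due."""
        self.log.append(stage.value)
        if stage is self.policy and self.binary is None:
            self.binary = finalize(self.hsail, target)
            self.log.append(f"finalized for {target}")

def run_lifecycle(policy, target="gpu-x"):
    """Walk an app through every stage under the given finalize policy."""
    app = App("kernel.hsail", policy)
    for stage in LIFECYCLE:
        app.advance(stage, target)
    return app
```

Calling run_lifecycle(FinalizeAt.BUILD) finalizes immediately, while run_lifecycle(FinalizeAt.EXECUTION) defers the work until the code is about to run; the portable input is identical in both cases.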

Software can call hsa_agent_iterate_regions() to find out which bits of memory are available to code running on the CPU and, say, a GPU. Then the software can call hsa_memory_allocate() to allocate blocks of shared memory in those regions. Example code following this pattern finds the region of an agent, allocates some shared buffer space, and stores a message signal in that buffer.
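The region lookup and allocation flow described above can be modelled in a few lines of Python. To be clear, this is a simulation, not the HSA runtime: the real hsa_agent_iterate_regions() and hsa_memory_allocate() are C calls whose exact signatures should be checked against the spec. Everything below (Region, Agent, agent_iterate_regions, memory_allocate, find_shared_region, post_signal, and the region names and sizes) is a hypothetical stand-in that only mirrors the general shape of a callback-driven region walk followed by an allocation.

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    size: int
    shared: bool        # visible to both the CPU and the agent?
    used: int = 0

@dataclass
class Agent:
    name: str
    regions: list = field(default_factory=list)

def agent_iterate_regions(agent, callback):
    """Call callback(region) for each region; stop early if it returns True."""
    for region in agent.regions:
        if callback(region):
            break

def memory_allocate(region, size):
    """Reserve size bytes in region and return a zeroed buffer."""
    if region.used + size > region.size:
        raise MemoryError(f"region {region.name} is full")
    region.used += size
    return bytearray(size)

def find_shared_region(agent):
    """Return the first region usable by both CPU and agent, or None."""
    found = []
    def pick(region):
        if region.shared:
            found.append(region)
            return True     # stop iterating: we have what we need
        return False
    agent_iterate_regions(agent, pick)
    return found[0] if found else None

def post_signal(buffer, value):
    """Store a one-byte message signal at the start of the buffer."""
    buffer[0] = value

# A made-up agent with one private and one shared region.
gpu = Agent("gpu0", [Region("private", 1024, shared=False),
                     Region("coherent", 4096, shared=True)])
region = find_shared_region(gpu)
buf = memory_allocate(region, 64)
post_signal(buf, 1)
```

In this model the buffer is an ordinary bytearray; on real hardware the point of allocating from a shared region is that the CPU and the kernel agent would both see the stored signal without a copy.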

Mapping GPUs and CPUs into the same space ... AMD's view of HSA

The specifications' writers hope the design will encourage efficient use of GPUs and CPUs, without having to repeatedly copy chunks of data over slow buses, for example. The reduction in complexity is also supposed to make batteries last a little longer.

Diagram of the CPU-GPU PCIe architecture

Picture of a CPU-GPU with separate pools of memory

Differences between today's systems and HSA devices ... Top: the classic PC-style design with GPUs on the other side of the PCIe bus. Middle: integrated silicon that keeps memory for the GPUs and CPUs separate or not coherent. Bottom: an HSA-compliant device that ensures memory remains coherent and available equally to all processor cores.

Phil Rogers, president of the HSA Foundation and an AMD fellow, gave The Register some examples of where the architecture can be used: one being video search, where software can perform image or facial recognition on video files, and categorize them so they can be found quickly from a database of keywords, filenames and time references.

"Some people have a lot of video on their PCs, and the video they’ve recorded is opaque, in terms of search, so giving people a way to navigate archives of home video would be very powerful," he said.

"On smartphones and tablets, video chat is very popular, but there are limitations, as a lot of data has to be copied between the GPU. With HSA, it's possible to do video chat running at lower power, maintaining a longer battery life. A server could handle more simultaneous clients without running out of memory."

"It's actually more of an evolutionary path than a revolution," Rogers added, referring to the fact that system-on-chips today already map GPU cores and CPU cores into the same physical address space, forcing them to communicate by memory anyway.

"We've created an architecture for putting everything together in the right way. Rather than stick to the legacy architecture of separate GPUs, processors and DSPs, we've stood back and thought, what it would look like if we started from scratch today."

Rogers said Linux kernel patches to allow the operating system to support HSA 1.0 have already been submitted and are available from kernel.org.

The 1.0 specification will be unveiled today (Monday) at 4.30pm (Pacific Time) in San Jose, California, ahead of the GPU Technology Conference starting on Tuesday.
