Feature Story

HSA 1.1 Specification Launched: Extending HSA to More Vendors & Processor Types

For the last few years now we have been keeping tabs on the development of the Heterogeneous System Architecture, a set of technical standards and an associated instruction language designed to allow efficient heterogeneous compute. Originally envisioned by AMD, HSA has been the cornerstone of their efforts to develop a fully functional ecosystem for heterogeneous hardware and software. Define a standard, make it easy(ier) for developers to create software around it, and, if all goes well, AMD's big bet on GPU technology made almost 10 years ago will pay off.

However, while AMD was the birthplace of what would become HSA, the standard as a whole has been about more than just one company, which is why HSA has been under development by the HSA Foundation since 2012. The Foundation is composed of many members, including a number of CPU/GPU design heavyweights such as ARM, Qualcomm, Imagination, and Samsung, all of whom are contributing to the development of HSA and, we expect, are at least giving themselves the option to leverage it in the future.

With all of that said, while HSA itself is a group project, in practice the first iteration of the standard was AMD-focused. This was due not only to AMD's early involvement, but also to the fact that the standard needed hardware to be developed against as a proof of concept, which was AMD's Carrizo APU. As a result the HSA 1.0 specification did not offer as much flexibility with other architectures as the rest of the Foundation would have liked. This is something that they have been working behind the scenes to address, and today the Foundation is finally taking its next step with the publication of the HSA 1.1 standard.

The big (though not sole) focus of the 1.1 standard then is extending it to work better with non-AMD hardware. This includes not only SoCs and other integrated devices from other vendors (e.g. ARM), but also additional classes of accelerators such as DSPs. The HSA Foundation wants to make the HSA standard truly heterogeneous for more than just AMD APUs, and for more than just CPU + GPU combinations. The 1.1 standard, in turn, is their effort to form a more perfect standard for heterogeneous compute.

What's on the table for HSA 1.1 then is not a radical departure from HSA 1.0, but an extension and a further tightening of the standard to meet the goals of the Foundation. On the hardware side this is fully backwards compatible, meaning it will work on 1.0 hardware such as Carrizo, while setting up the updated standard to work on additional hardware. This is especially the case with additional classes of accelerators, as 1.0 was primarily focused on abstracting away the GPU (per AMD's APU needs).

As the first version of the true multi-vendor specification then, 1.1 is designed to let vendors mix and match HSA-capable blocks in an effective manner. The Foundation itself has a heavy mobile SoC presence (be it integrators or IP providers), and it's easy to imagine how someone like MediaTek would want to ensure that they can easily make HSA-capable SoCs using both Imagination and ARM GPUs, or how Qualcomm may want to use HSA in the future in their Hexagon DSPs. To get there, the 1.1 specification makes transparent a number of system-level issues. Information about shared page tables, signals, queues, and more is now better exposed through the updated standard, which plays a big part in bringing about multi-vendor support.

The 1.1 specification also addresses some other issues that were felt at one point or another with HSA. The memory model now has a formal definition (using cat/HERD, for those few of you who know those tools). Along those lines there is additional memory functionality such as support for non-temporal memory accesses, which specifically comes into play when you need to tell the cache that an item need not be kept around after it's used (good for one-time-use items). And on the signal side of matters, HSA code can now wait on multiple signals, an improvement over 1.0 where code could only wait on a single signal.
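To make the idea of waiting on multiple signals concrete, here is a minimal sketch in plain Python. It is a toy model built on the standard `threading` module, not the HSA runtime API: the `Signal` and `SignalGroup` classes and the `wait_any` method are hypothetical names chosen for illustration, and the assumption is simply that a signal is a shared integer that a waiter compares against an expected value.

```python
import threading


class Signal:
    """Toy stand-in for a signal: a shared integer that waiters can observe.

    Hypothetical class for illustration only; not part of any HSA runtime.
    """

    def __init__(self, value, cond):
        self.value = value
        self._cond = cond  # condition variable shared by the whole group

    def store(self, value):
        # Update the value and wake every thread waiting on the group.
        with self._cond:
            self.value = value
            self._cond.notify_all()


class SignalGroup:
    """Hypothetical group that lets one waiter block on several signals."""

    def __init__(self, initial_values):
        self._cond = threading.Condition()
        self.signals = [Signal(v, self._cond) for v in initial_values]

    def _first_match(self, expected):
        # Index of the first signal whose value equals its expected value.
        for index, (signal, want) in enumerate(zip(self.signals, expected)):
            if signal.value == want:
                return index
        return None

    def wait_any(self, expected, timeout=None):
        """Block until any signal reaches its expected value.

        Returns the index of the satisfied signal, or None on timeout.
        """
        with self._cond:
            self._cond.wait_for(
                lambda: self._first_match(expected) is not None,
                timeout=timeout,
            )
            return self._first_match(expected)


if __name__ == "__main__":
    group = SignalGroup([1, 1, 1])
    # Pretend the third "device" finishes its work a little later.
    worker = threading.Timer(0.05, group.signals[2].store, args=(0,))
    worker.start()
    print("completed signal index:", group.wait_any([0, 0, 0], timeout=5))
    worker.join()
```

With a single-signal wait, the caller above would have to pick one signal to block on or poll all three in turn; a group wait lets it sleep until whichever one completes first.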

Finally, 1.1 also includes updates to help optimize performance and make the API play nicer with other code. A new profiling API has been introduced, which exposes hardware-level counters and other information to allow better profiling of HSA program execution; that data can then be fed back into profile-guided optimizations. Meanwhile the HSA finalizer has been reworked so that its internal representation isn't quite so obscure, with the new, more standard-looking representation making the finalizer more suitable for linking to other tools.

Yet despite all of these updates, the change should be completely transparent to application developers. Because HSA is ultimately a means of abstracting away the work and the quirks necessary to make heterogeneous compute work well, application developers won't see these changes. The ones who do see them are the hardware developers, who write the associated runtimes that actually compile HSAIL intermediate code down to device code. And even then, we're told that updating a 1.0 implementation to 1.1 is not especially painful, particularly when compared to writing an implementation to begin with. AMD for their part already has a 1.1 implementation up and running, which, logically enough, is being used as the basis of their Radeon Open Compute Platform (ROCm), where ROCm adds the additional discrete GPU functionality AMD specifically needs.

That said, the existence of ROCm and other platforms does bring up the struggle the HSA Foundation is facing on adoption. While ROCm is HSA-based, at the same time AMD is doing an end-run around HSAIL, preferring to compile direct-to-ISA. This still utilizes the HSA Runtime, and as a result benefits from and validates the basic HSA strategy, but it's an example of how the HSAIL aspect of the standard has struggled.

Similarly, earlier this week we saw ARM pass on embracing HSAIL as well for the heterogeneous aspects of their Mali-G71 GPU, following the same train of thought as AMD. The G71 GPU is for all intents and purposes an HSA 1.1 GPU, following the HSA hardware specification to implement heterogeneous computing in a common and sane way, but the company is not utilizing HSAIL.

ARM for their part is following an OpenCL-centric software strategy for exposing the heterogeneous aspects of their hardware to developers. The interesting thing about the ARM implementation is that they have gone above and beyond the basic aspects of OpenCL 2.0, offering finer-grained sync than OpenCL requires at a minimum, the kind of sync that would otherwise be better suited for HSA. As a result part of the HSA Foundation's efforts are focused on showcasing the additional benefits of HSA over OpenCL 2.0, to entice hardware manufacturers and developers into using HSA.

Ultimately, in our discussion with Foundation president John Glossner (who replaced Phil Rogers), he remained bullish on HSAIL despite these setbacks. It's his hope that as finalizers continue to improve, there will be less reason for companies like AMD to bypass HSAIL as they do now. And in the meantime, further success with HSA and HSAIL in general should help to encourage more hardware vendors to adopt HSAIL.

Finally, to pull one more slide out of the HSA deck as a "this is cool" subject, Continuum Analytics' Numba Python compiler has recently added direct HSA support. This makes it a lot easier for developers writing code in Python to add HSA-compliant vectorization to their programs. Python is a widely used language, and while even "automatic" parallelization isn't the true holy grail of no-effort program parallelization, it does get HSA one step closer, at least in this case.
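For readers unfamiliar with the programming model behind that kind of vectorization, the sketch below shows its general shape: the developer writes a scalar function and a decorator lifts it to operate element-wise over whole sequences. This is a pure-Python stand-in so that it runs without Numba installed; the `vectorize` decorator here is a hypothetical illustration and makes no claim about Numba's actual API or about how its HSA target compiles code.

```python
from functools import wraps


def vectorize(scalar_fn):
    """Hypothetical stand-in for a compiler-provided decorator.

    Lifts a function of scalars into one that maps over equal-length
    sequences. A real compiler would generate device code for the scalar
    body; this version simply loops in the interpreter.
    """

    @wraps(scalar_fn)
    def elementwise(*columns):
        lengths = {len(column) for column in columns}
        if len(lengths) != 1:
            raise ValueError("all inputs must have the same length")
        # Apply the scalar body once per element position.
        return [scalar_fn(*args) for args in zip(*columns)]

    return elementwise


@vectorize
def multiply_add(a, x, y):
    # The developer only writes the per-element arithmetic.
    return a * x + y


if __name__ == "__main__":
    print(multiply_add([2, 2, 2], [1, 2, 3], [10, 20, 30]))
```

The appeal of this style is that the per-element function contains no loop and no device-specific code, which leaves the decorator free to decide where and how the work actually runs.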

Wrapping things up, today's launch of the HSA 1.1 specification, despite the minor version number increment, should in the long run prove to be a significant event for the HSA Foundation. By finally getting the specification to the point where it can more readily support multi-vendor hardware, the ecosystem as a whole will have the opportunity to evolve into a true multi-vendor ecosystem, the ultimate purpose of AMD spinning their work off into the HSA Foundation to begin with. The hard part as an outside observer will be waiting; while the specification was ratified back in April, there is still going to be some lag between the ratification and when additional hardware will be ready to support it.
