Feature Story

How Can We Get to Truly Immersive VR and AR?

For decades, the promise of truly immersive virtual reality (VR) and augmented reality (AR) has seemed tantalizingly close, yet with each new technology introduction it has remained just out of reach. The good news is we are getting ever closer. Yet for AR and VR to be truly immersive, all of our senses must believe the experience is real.

Creating believable VR and AR experiences depends on how accurately and consistently engineers can reproduce the elements that together make up our perception of reality, starting with an understanding of human physiology and neuroscience. We must understand the multisensory signals vital to perceiving 3D structures in the real world, then mimic them using technologies within headsets.

Achieving technology-based reality

VR devices occlude the user’s vision, presenting a simulated environment where sensory stimuli provide sensations of presence and interactions with virtual objects. AR devices overlay virtual objects on the physical environment, with sensory cues providing consistency between physical and augmented elements. 3D AR systems, also known as mixed-reality devices, blend real-world elements into a virtual environment.

Each configuration has unique requirements, but common developments driving these systems forward include real-time 3D sensing and tracking, powerful and energy-efficient computational processing, high-fidelity graphics rendering and displays, immersive audio, machine-learning and AI algorithms, intuitive human interfaces and novel applications.

An immersive visual experience

With innovative graphics and display technologies, we can render higher-fidelity digital objects and pack more pixels into smaller areas with greater clarity and illumination than ever before, but there’s more to do. It’s not only about rendering lifelike images but doing so with a wide-enough field of view (FOV) on small, near-eye displays with the required visual cues.

Today’s high-resolution smartphone displays render 500+ pixels per inch (PPI). But for immersive headset visuals, measuring PPI isn’t good enough. Pixels per degree (PPD) of the visual field covered by the display is a more relevant metric.

At the point of central vision, the typical human eye has an angular resolution of about 1/60 of a degree, which corresponds to 60 PPD. Each eye has a horizontal FOV of about 160° and a vertical FOV of about 175°. The two eyes work together for stereoscopic depth perception over an FOV about 120° wide and about 135° high. To provide visual acuity of 60 PPD, we therefore need approximately 100 megapixels (MP) for each eye and about 60 MP for stereo vision. Compare this with a state-of-the-art mainstream VR headset display today at approximately 3.5 MP.
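The megapixel figures above follow from simple arithmetic. The sketch below reproduces them under one simplifying assumption that is ours, not the article's: pixel density is uniform per degree in both directions, with no correction for lens or spherical geometry. The function name `pixels_needed` is hypothetical.

```python
def pixels_needed(h_fov_deg: float, v_fov_deg: float, ppd: float = 60.0) -> float:
    """Megapixels needed to cover a field of view at a uniform
    angular density of `ppd` pixels per degree (simplifying assumption)."""
    h_pixels = h_fov_deg * ppd
    v_pixels = v_fov_deg * ppd
    return h_pixels * v_pixels / 1e6


# FOV figures quoted in the text above.
per_eye_mp = pixels_needed(160, 175)  # 9,600 x 10,500 pixels
stereo_mp = pixels_needed(120, 135)   # 7,200 x 8,100 pixels

print(f"Per eye: {per_eye_mp:.1f} MP")
print(f"Stereo region: {stereo_mp:.1f} MP")
```

Under that assumption the results come to roughly 100.8 MP per eye and 58.3 MP for the stereo region, in line with the rounded figures in the text.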

Because manufacturing technology won’t yet support this pixel density, designers must make tradeoffs in rendering the salient parts of visual scenes in high resolution, based on an understanding of how the human visual system works.

Eye tracking and foveated rendering

High human visual acuity is limited to a very small visual field, about ±1° around the optical axis of the eye, centered on the fovea. This means vision is sharpest in the center and blurrier around the edges. Using real-time sensors to track a user’s gaze, we can render a higher number of polygons in the central gaze area, concentrating computing power there, and sharply reduce the graphical fidelity (polygon density) elsewhere. This foveated rendering can significantly reduce the graphics workload and associated power consumption.
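As a rough illustration of the idea, the sketch below picks a rendering-resolution scale for a pixel from its angular distance to the tracked gaze direction. The tier boundaries and scale factors in `TIERS` are hypothetical values chosen for illustration (the inner tier is deliberately wider than ±1° to leave a margin for tracking error); a real system would tune them to its display, optics and eye tracker.

```python
import math

# Hypothetical tiers: (max eccentricity in degrees, fraction of full resolution).
TIERS = [(5.0, 1.0), (15.0, 0.5), (math.inf, 0.25)]


def eccentricity_deg(pixel_dir, gaze_dir):
    """Angle in degrees between a pixel's view direction and the gaze
    direction. Both are 3D vectors and need not be normalized."""
    dot = sum(p * g for p, g in zip(pixel_dir, gaze_dir))
    norm = math.hypot(*pixel_dir) * math.hypot(*gaze_dir)
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding
    return math.degrees(math.acos(cos_angle))


def resolution_scale(pixel_dir, gaze_dir):
    """Return the resolution fraction of the first tier that contains
    the pixel's eccentricity."""
    ecc = eccentricity_deg(pixel_dir, gaze_dir)
    for max_ecc, scale in TIERS:
        if ecc <= max_ecc:
            return scale


gaze = (0.0, 0.0, 1.0)  # user looking straight ahead
ten_deg = (math.sin(math.radians(10)), 0.0, math.cos(math.radians(10)))
print(resolution_scale(gaze, gaze))             # pixel at the gaze point
print(resolution_scale(ten_deg, gaze))          # pixel 10 degrees off-gaze
print(resolution_scale((1.0, 0.0, 0.0), gaze))  # pixel in the far periphery
```

Discrete tiers are only one option; a continuous falloff that follows the photoreceptor-density curve would be another reasonable design.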

The human eye has a high density of cone photoreceptors on the fovea, resulting in high visual acuity at central vision. Photoreceptor density drops significantly at the periphery, leading to lower visual acuity. (Source: E. Bruce Goldstein, “Sensation and Perception”)

Researchers around the world are studying foveated rendering, and device designers are exploring multi-display configurations, in which a high-resolution display covers the foveal vision and displays with lower pixel counts cover peripheral vision. Future display architectures will enable dynamic real-time projection of higher-resolution visual content in and around the gaze direction.

Accommodation and convergence mismatch

Another key concern is ensuring oculomotor cue consistency to correct for eye accommodation and convergence mismatch. Humans view the world stereoscopically, with their two eyes converging on an object. Through accommodation, each eye’s lens changes shape to focus light originating at different depths. In natural viewing, the distance at which the two eyes converge is the same as the distance to which each eye accommodates.

In today’s commercial VR and AR headsets, there is a mismatch between convergence and accommodation distances. Real-world light is modified through reflections and refractions from various sources at varying distances. In a headset, all light is generated by one source at one distance. As the eyes converge on virtual objects at different apparent depths, their lenses must stay focused on the fixed-distance light emanating from the display. The resulting mismatch between the two distances varies with the object’s apparent depth and often causes eye fatigue or disorientation.
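The size of this mismatch is commonly expressed in diopters (inverse meters). The sketch below is a minimal illustration, assuming a headset with a fixed focal distance of 2 m and an interpupillary distance of 63 mm; both values, and the function names, are assumptions made for illustration rather than figures from any specific device.

```python
import math


def vergence_angle_deg(distance_m, ipd_m=0.063):
    """Angle between the two eyes' lines of sight when fixating a point
    straight ahead at `distance_m`. The 63 mm IPD is an assumed value."""
    return math.degrees(2 * math.atan((ipd_m / 2) / distance_m))


def va_mismatch_diopters(virtual_distance_m, focal_distance_m):
    """Difference between where the eyes converge (the virtual object)
    and where they must focus (the display's focal plane), in diopters."""
    return abs(1 / virtual_distance_m - 1 / focal_distance_m)


FOCAL_DISTANCE_M = 2.0  # assumed fixed focal distance of the headset optics

for d in (0.5, 2.0, 10.0):
    print(
        f"object at {d:>4} m: vergence {vergence_angle_deg(d):.2f} deg, "
        f"mismatch {va_mismatch_diopters(d, FOCAL_DISTANCE_M):.2f} D"
    )
```

With these assumed values the mismatch vanishes only when the virtual object sits at the focal plane and grows fastest for nearby objects, which is why close-up interactions tend to be the hardest case.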

Convergence-accommodation mismatch for a 3D display (Source: Martin Banks)

Various approaches are being explored, such as dynamically movable optics and focus-tunable liquid crystal lenses that can change focal length as voltage is adjusted.

3D spatial audio

For true immersion, the AR/VR audio experience must correspond and coordinate with the visual experience so that the location of a sound perfectly aligns with what the user sees. In the real world, most people can close their eyes and still identify the approximate location of a sound. This is based on the brain perceiving and translating the differences in “time of arrival” and intensity of the sound at the two ears. This happens immediately and automatically in the real world, but in VR headsets, 3D spatial audio must be programmed and processed.
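To give a sense of the time-of-arrival cue, the sketch below estimates the difference in arrival time between the two ears using Woodworth's classic rigid-sphere approximation. This is a simplified textbook model, not the method of any particular headset; the head radius (8.75 cm) and speed of sound (343 m/s) are assumed typical values, and intensity differences are not modeled.

```python
import math


def itd_seconds(azimuth_deg, head_radius_m=0.0875, speed_of_sound=343.0):
    """Interaural time difference for a distant source, using Woodworth's
    spherical-head approximation: ITD = (r / c) * (theta + sin(theta)).

    `azimuth_deg` is measured from straight ahead, positive to the right,
    and must lie within [-90, 90]. The sign of the result follows the
    sign of the azimuth.
    """
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth must be between -90 and 90 degrees")
    theta = math.radians(abs(azimuth_deg))
    itd = (head_radius_m / speed_of_sound) * (theta + math.sin(theta))
    return math.copysign(itd, azimuth_deg)


for az in (0, 30, 90):
    print(f"azimuth {az:>2} deg: ITD = {itd_seconds(az) * 1e6:.0f} microseconds")
```

Even for a source directly to one side, the difference under this model stays well below a millisecond, which hints at how fine the timing cues are that spatial audio processing has to reproduce.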

The challenge is that each person experiences sound signals differently, with the signal spectrum modified based on factors including head and ear size, shape and mass. This is known as the head-related transfer function (HRTF), something that today’s technologies aim to approximate. Ongoing research to personalize this function will enable headset users to perceive sounds emanating from virtual objects with correct spatial cues.

Low-latency inside-out tracking

Tracking a user’s head movement in real time is a clear necessity in VR/AR. At all times, systems must be able to determine the position of the headset within 3D space relative to other objects. They must do so with high accuracy and low latency, so that the visual and aural information presented matches the user’s head position and orientation and updates rapidly as the user moves.

Until recently, VR headsets tracked head movements through “outside-in” tracking methods, using external sensors that users placed around their environment. Today, however, “inside-out” tracking uses simultaneous localization and mapping (SLAM) and visual-inertial odometry, which combine computer vision with finely tuned motion sensors, to track movement from within the headset.

With “inside-out” tracking, modern headsets can precisely track the user’s movements in real time using built-in sensors. (Source: Meta)

An ongoing challenge, however, is achieving low motion-to-photon latency: the delay between the onset of a user’s motion and the emission of photons from the last pixel of the corresponding image frame in the display. In other words, it’s the total time taken by sensor data acquisition and processing, interfaces, graphical computations, image rendering and display updates.

In the real world, we track our head movement based on changes in the visual field determined from our sight as well as motion information detected by our vestibular sensory system. Long latencies in a VR headset can cause a visual-vestibular mismatch, resulting in disorientation and dizziness. Today’s systems can typically achieve motion-to-photon latencies of 20 to 40 ms, but perceptually seamless experiences require this to be less than 10 ms.
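Because motion-to-photon latency is a sum over pipeline stages, it can help to think of it as a budget. The sketch below adds up per-stage delays and compares the total with the 10 ms target mentioned above. The stage names follow the list in the text, but every per-stage number in `STAGES_MS` is a hypothetical placeholder, not a measurement from any real headset.

```python
# Hypothetical per-stage latencies in milliseconds (illustrative only).
STAGES_MS = {
    "sensor acquisition and processing": 2.0,
    "interface transfer": 1.0,
    "graphics computation and rendering": 11.0,
    "display update": 8.0,
}

TARGET_MS = 10.0  # perceptual target cited in the text


def motion_to_photon_ms(stages):
    """Total latency, assuming the stages run strictly one after another."""
    return sum(stages.values())


def over_budget_ms(stages, target_ms=TARGET_MS):
    """How far the total exceeds the target; 0.0 if within budget."""
    return max(0.0, motion_to_photon_ms(stages) - target_ms)


total = motion_to_photon_ms(STAGES_MS)
print(f"Total: {total:.1f} ms, over budget by {over_budget_ms(STAGES_MS):.1f} ms")
for name, ms in sorted(STAGES_MS.items(), key=lambda kv: kv[1], reverse=True):
    print(f"  {name}: {ms:.1f} ms ({ms / total:.0%} of total)")
```

The strictly sequential model is itself a simplification: real pipelines overlap stages and may apply late corrections before display, so a breakdown like this is best read as a way to see where the largest share of the delay sits.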

Human inputs and interactions

The immersive experience also requires that users can interact realistically with virtual objects. They must be able to reach out and grab an object, and it must respond in real time following the laws of physics.

Today’s state-of-the-art headsets let users select objects with basic hand gestures, and as computer-vision technology continues to improve with rapid progress in AI, future headsets will include richer gesture-control features.

Next-generation devices will also offer multimodal interactions, where eye-tracking technology will allow users to make selections by focusing their gaze on virtual objects, then activate or manipulate them with hand gestures. Soon, as AI technology continues to develop and local low-latency processing becomes a reality, headsets will also have real-time voice recognition.

Advances in computer vision and AI technology enable natural user interactions using gestures, eye gaze and voice commands. (Source: David Cardinal)

Looking ahead

Today, we can experience some mainstream VR and promising industrial AR applications, but they aren’t fully immersive. While the path ahead won’t be short, with billions of dollars of investment in related technologies, the potential is almost limitless. For example, McKinsey estimates the metaverse may generate $4 trillion to $5 trillion in value by 2030.

By persistently attacking technical hurdles, we will be able to reproduce lifelike experiences through technology, ultimately diminishing the differences between the real world and the virtual world as we experience them.

You can learn more about such developments and see the latest AR and VR products at Display Week 2023.

—Achin Bhowmik is president of the Society for Information Display, as well as the CTO and executive VP of engineering at Starkey.
