ARCore and ARKit
The augmented reality frameworks ARCore and ARKit, released respectively by Google and Apple, have popularized the use of augmented reality. They rely on the computing power and the 3D scanning capabilities of the latest high-end mobile devices to provide a smooth experience. The main technological breakthrough is their SLAM (Simultaneous Localization and Mapping) algorithms, which reconstruct the 3D environment while simultaneously computing the position and orientation of the device.
They pave the way for the next generation of augmented reality glasses. These devices will be costly because they will require high embedded computing power, a high-resolution display and expensive optical components. So people will buy them only if enough experiences are available. Thanks to the current mainstream augmented reality frameworks, the applications for AR glasses will already be developed and tested.
But their main limitation is portability. Developing an augmented reality application is very expensive: it requires an intricate user experience and 3D assets, and testing can be difficult. ARCore and ARKit are platform specific, so an application developed for one framework has to be rewritten for the other. Furthermore, the user has to install an application to access the augmented reality experience, so it cannot be shared simply by clicking on a single link.
The future: WebXR
Until now, only 8th Wall has built an augmented reality framework working fully in the browser, without depending on native augmented reality engines like ARKit or ARCore. Their work is absolutely remarkable, and their demos run smoothly even on Android smartphones too old for ARCore. We just hope they will implement the WebXR interface soon.
As WebXR is a web standard, only web-based libraries will be able to work with it. We cannot use CoreML, CUDA or other very powerful but native technologies. WebGL is the only way to access GPU hardware acceleration, both for computing and for rendering.
Augmented reality consists in overlaying virtual content on top of reality. The more the application understands the surrounding world, the more we can narrow the gap between the real and the virtual:
- If we understand the 3D geometry of the room (SLAM), we can position 3D objects at the right place,
- If we understand the lighting of the scene, we can render them coherently,
- If we recognize people's faces, we can display custom information above each face…
In particular, object detection and tracking allow a deep understanding of the scene. We can imagine these scenarios:
- Place a virtual avatar on a chair, because we have detected that this object is a chair,
- Replace all the cars in the street with mammoths,
- Increase the size of road signs as a driving aid…
There are different kinds of object detection. We enumerate them from the easiest to the hardest.
QR code detection
QR codes are flat patterns designed to be easily detectable and decodable. There are many efficient libraries to read them. But they have major drawbacks:
- They are flat, so they cannot fit on curved surfaces,
- They have to be printed on the object you want to detect,
- They are ugly.
Image detection
Unlike a QR code, an image can be beautiful and displayed in harmony with the environment. Image recognition algorithms are quite effective: for a given image, they compute a signature which should be robust to lighting conditions and geometric transformations. The SIFT (Scale-Invariant Feature Transform) algorithm is perhaps the most famous. These algorithms are often included in augmented reality frameworks. But they still have their limitations:
- They require a flat surface on the object to detect; otherwise they may work only from a specific viewing angle,
- They are not able to generalize: you can detect the painted portrait of a specific person, but you cannot detect all painted portraits, whoever is painted.
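To illustrate the kind of invariance these signatures aim for, here is a toy sketch: normalizing an image patch to zero mean and unit norm makes its descriptor insensitive to brightness and contrast changes. This is only a simplified stand-in for the normalization used by real descriptors like SIFT, not the actual algorithm:

```python
import numpy as np

def descriptor(patch):
    """Toy descriptor: zero-mean, unit-norm flattening of a patch.
    This makes it invariant to affine lighting changes (brightness and
    contrast) -- a simplified version of what real descriptors do."""
    p = patch.astype(np.float64).ravel()
    p -= p.mean()
    norm = np.linalg.norm(p)
    return p / norm if norm > 0 else p

rng = np.random.default_rng(0)
patch = rng.uniform(0, 255, size=(8, 8))

# The same patch under different lighting: more contrast (x1.5), brighter (+40)
relit = 1.5 * patch + 40

d1, d2 = descriptor(patch), descriptor(relit)
print(np.allclose(d1, d2))  # True: the descriptor is unchanged
```

Geometric robustness (scale, rotation) is harder and is exactly what SIFT adds on top of this kind of photometric normalization.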
3D Object detection
This is the hardest kind of detection, and this is where our technology can be helpful. The goal is to detect any 3D object with a specific level of generalization. The difficulty of the task depends on the chosen level of generalization. For instance, detecting all vehicles is quite hard because the level of generalization is very high: there is a big difference between a motorbike, a truck and a car. Detecting all cars is easier. But detecting a specific model of car may be difficult, especially if other car models have a very similar body shape. In the latter case we face a high level of specialization.
Some object detection algorithms rely on 3D scanning to compare the scanned data with a reference mesh. But this approach requires 3D scanning capabilities embedded in the device, and a 3D model of the reference object. And it won't work with deformable objects or objects with an inherent variability (plants, for example).
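The scan-versus-reference idea can be sketched as follows. This is a deliberately minimal illustration assuming both the scan and the reference are point clouds already expressed in the same coordinate frame (a real pipeline would first align them, e.g. with ICP); the shapes and threshold are made up:

```python
import numpy as np

def mean_nn_distance(scan, reference):
    """Mean distance from each scanned point to its nearest reference point."""
    # Full pairwise distance matrix (N_scan x N_ref); fine for small toy clouds
    d = np.linalg.norm(scan[:, None, :] - reference[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Reference object sampled as a point cloud: points on a unit sphere
rng = np.random.default_rng(1)
ref = rng.normal(size=(500, 3))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)

# A noisy scan of the same object, and a scan of a different shape
# (a sphere of radius 0.5, i.e. clearly different geometry)
scan_same = ref[:200] + rng.normal(scale=0.01, size=(200, 3))
scan_other = 0.5 * ref[:200]

THRESHOLD = 0.05  # arbitrary acceptance threshold for this toy example
print(mean_nn_distance(scan_same, ref) < THRESHOLD)   # True: object recognized
print(mean_nn_distance(scan_other, ref) < THRESHOLD)  # False: rejected
```

Note how brittle this is for the cases mentioned above: a deformed or naturally variable object would fail the distance test even though it is the "same" object, which is one motivation for the learning-based approach below.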
So we bet on deep learning instead:
- First, we train a neural network to detect a specific 3D object,
- Then, we load the pre-trained network into the final augmented reality application.
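The two steps above can be sketched with a toy stand-in: a tiny logistic-regression "detector" trained with NumPy, exported to JSON as one would ship a pre-trained model file, then reloaded for inference. This only illustrates the train-then-load workflow; it is not the JeelizAR pipeline, and the data and names here are invented:

```python
import json
import numpy as np

rng = np.random.default_rng(42)

# --- Training phase (offline): learn a tiny linear "detector" on toy
# 2-D features, standing in for training a real detection network.
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(np.float64)  # toy "object present" label

w, b = np.zeros(2), 0.0
for _ in range(500):  # plain gradient descent on the logistic loss
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    grad = p - y
    w -= 0.1 * (X.T @ grad) / len(y)
    b -= 0.1 * grad.mean()

# Export the trained weights, like shipping a pre-trained model file
model_file = json.dumps({"w": w.tolist(), "b": b})

# --- Application phase: load the pre-trained model and run detection
model = json.loads(model_file)
w2, b2 = np.array(model["w"]), model["b"]

def detect(features):
    """Return True when the loaded model flags the object as present."""
    return 1.0 / (1.0 + np.exp(-(features @ w2 + b2))) > 0.5

print(detect(np.array([2.0, 2.0])))    # True: clearly "object"
print(detect(np.array([-2.0, -2.0])))  # False: clearly background
```

The same split applies in the real setting: the expensive training happens offline, and the application only performs cheap inference with frozen weights.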
As a demonstration, we have trained a neural network to detect mugs. We load it in the final application and play a 3D animation when a mug is detected. The application is based on JeelizAR:
We have released JeelizAR, a library for augmented reality, available on GitHub at github.com/jeeliz/jeelizAR. We regularly add new neural network models to the repository, and we offer neural network design and training as a service.