Why we are faster

We do deep learning in the browser with our own JavaScript/WebGL engine.
We are often asked: why not use TensorFlow.js? It also runs in the browser with JavaScript/WebGL, its developer community is huge, and it is made by Google.
TensorFlow.js is a great tool, but it is still too slow to run in real time in the browser on a video input for a real use case.

The goal of TensorFlow.js is to bring TensorFlow models to the web. These models have not been developed specifically for the web. The framework was built with a top-down approach: JavaScript/WebGL is just a means to execute the model in the browser. It answers the question:

How can I run or train a deep learning model in the browser?

On the contrary, we started from what we have in a web development environment: the JavaScript/WebGL workflow. Deep learning algorithms are massively parallelisable, so they fit a GPU architecture well. WebGL is the only API that gives low-level access to the GPU in the web browser. But it is primarily designed for 2D and 3D rendering and texture operations, and not all deep learning operations can run fast with WebGL. So we worked out how to adapt deep learning algorithms and the structure of the neural network to the low-end GPUs we can access through the WebGL API. Our deep learning engine answers the question:
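The fit between deep learning and the GPU can be sketched in plain JavaScript. A WebGL fragment shader computes every output pixel independently from a small neighborhood of input pixels, which is exactly the access pattern of a convolution. This CPU sketch (illustrative only, not our engine's code) mimics that per-pixel independence:

```javascript
// CPU sketch of what a WebGL fragment shader does for a convolution:
// each output pixel depends only on its input neighborhood, so all
// pixels can be computed in parallel on the GPU.
function convolve2D(input, width, height, kernel, kSize) {
  const half = Math.floor(kSize / 2);
  const output = new Float32Array(width * height);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0;
      for (let ky = 0; ky < kSize; ky++) {
        for (let kx = 0; kx < kSize; kx++) {
          // clamp to the border, like WebGL's CLAMP_TO_EDGE texture wrapping
          const sx = Math.min(width - 1, Math.max(0, x + kx - half));
          const sy = Math.min(height - 1, Math.max(0, y + ky - half));
          sum += input[sy * width + sx] * kernel[ky * kSize + kx];
        }
      }
      output[y * width + x] = sum; // independent per pixel -> parallelisable
    }
  }
  return output;
}

// 3x3 box blur on a 4x4 constant image:
const img = new Float32Array(16).fill(1);
const box = new Float32Array(9).fill(1 / 9);
const out = convolve2D(img, 4, 4, box, 3);
// a constant image stays constant (≈ 1) under a normalized kernel
```

On the GPU, the two outer loops disappear: the shader runs once per output pixel, all pixels at the same time.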

How can I do deep learning in the browser as efficiently as possible?

A deep learning model is like an engine: it should fit its execution environment to be useful and efficient. At Jeeliz, it is as if we built the engine and the chassis at the same time.

Why client-side processing matters

If we want to make the web smarter, we need to integrate machine-learning-based algorithms into web applications. Currently, the state of the art in machine learning is set by deep learning. Our web application collects data from the user's web browser. Then it either processes the data directly in the web browser (client side) or sends it to a server, which returns the result later (server side).

Server-side processing is often easier because it does not depend on the user's configuration, and we can use almost unlimited computing power. If the data is lightweight and the application does not require very low latency, we should process the data server side. For example, a conversational agent should be implemented server side: the data consists of text chunks, and the acceptable delay is around 1 second.

But if we want to analyze a video stream, server-side processing is not the best solution. The data is heavy, so it requires a large bandwidth and powerful servers. The best approach is to process it client side. But since the user may have a cheap device, we need a very efficient deep learning engine. That is why we developed the Jeeliz deep learning technology. We focus mainly on video applications because this is where client-side processing is almost mandatory.
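A quick back-of-envelope calculation shows why streaming video to a server scales badly. The figures below are illustrative assumptions (720p at 30 fps, a 100:1 compression ratio), not measurements:

```javascript
// Streaming a 720p30 webcam feed to a server vs. processing it locally.
// All numbers here are illustrative assumptions.
const width = 1280, height = 720, fps = 30;

// 24-bit RGB, uncompressed:
const rawBitsPerSecond = width * height * 3 * 8 * fps;
const rawMbps = rawBitsPerSecond / 1e6;
console.log(rawMbps.toFixed(0) + ' Mbps uncompressed'); // prints "664 Mbps uncompressed"

// Even assuming an aggressive 100:1 video compression, each user still
// uploads several Mbps continuously, and the server must decode and
// process every stream in real time.
const compressedMbps = rawMbps / 100;
console.log(compressedMbps.toFixed(1) + ' Mbps compressed'); // prints "6.6 Mbps compressed"
```

Multiply that per-user cost by thousands of concurrent users and the client-side approach, where each user's own GPU does the work, becomes very attractive.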

Some competitors still bet on server-side processing, arguing that the generalization of 5G networks and cheap cloud GPU computing will make it viable. But:

  • Even with 5G there will be network cuts, if the user enters a metal elevator for example. And 5G is only very fast at short range, so it will mostly cover urban areas.
  • Virtual reality is coming with higher display resolutions: 4K, then 8K, are the next standards. As video resolutions and framerates increase, streaming them will require ever more bandwidth…
  • It is true that cloud GPU prices are decreasing while their power increases. But the same trend holds for laptop and mobile GPUs.
  • Using the user's GPU will always be free, while GPUs in the cloud will never be.

Why do everything server side while client devices are becoming more and more powerful?

Why speed matters

Currently, the client-side deep learning demonstrations and applications relying on engines other than Jeeliz have no commercial use. We can enumerate:

  • toy demonstrations learning on video,
  • weird drawing and music generation experiments,
  • image processing demos using GAN models. Such a GAN model weighs more than 100 MB and takes more than 200 ms to process one image on a gaming laptop; it would be better to run it server side,
  • chatbots, which would run better server side.

They cannot process the video stream effectively enough for the most interesting and valuable use cases. It is quite frustrating, because if we look at the state of the art of deep learning video processing we see many incredible applications such as deepfakes, style transfer or face generation. But most of them:

  • Don’t run in real time, or require a $5,000 graphics card with CUDA,
  • Run in a controlled environment, using the video from a good-quality external webcam and a well-lit scene,
  • Use a model with tens of convolutional layers, sometimes weighing gigabytes, far too heavy to load in a web browser.

Our deep learning technology allows us to get closer to the state of the art in the browser. We were the first company to offer a commercial product based on deep learning in the web browser, with our glasses virtual try-on application in 2016. Since then we have explored other avenues, such as:

  • Expression recognition, similar to Apple's Animoji application
  • Object recognition and tracking for augmented reality
  • Pupillometry

We keep improving our deep learning engine and our models for these use cases. And at the same time we keep exploring new possibilities.

Our first application: a virtual mirror to try sunglasses using the webcam.

An integrated environment

For real-time video analysis, the developer requiring a deep-learning-based solution is not a deep learning specialist. They need an integrated API more than a deep learning API. If they want to detect and track a face in the webcam video feed to build a funny face filter, they can use Jeeliz FaceFilter without any machine learning knowledge.
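To give an idea of what "integrated" means, here is a hypothetical usage sketch. The parameter names (`canvasId`, `NNCPath`, `callbackReady`, `callbackTrack`) follow the public jeelizFaceFilter repository, but check the current documentation before relying on them:

```javascript
// Illustrative sketch of initializing Jeeliz FaceFilter.
// Assumes the library script is loaded in the page; names follow the
// public jeelizFaceFilter repository and may differ in your version.
const faceFilterConfig = {
  canvasId: 'jeeFaceFilterCanvas',   // the <canvas> element to render into
  NNCPath: './neuralNets/',          // path to the neural network model (illustrative)
  callbackReady: function (errCode, spec) {
    if (errCode) {
      console.log('FaceFilter error:', errCode);
      return;
    }
    // engine is ready: start the render loop here
  },
  callbackTrack: function (detectState) {
    // called every frame with the raw detection state, e.g.:
    // detectState.detected - face detection probability
    // detectState.x, detectState.y - face position in the viewport
  }
};

// Guarded so the sketch is inert outside a browser page with the library:
if (typeof JEEFACEFILTERAPI !== 'undefined') {
  JEEFACEFILTERAPI.init(faceFilterConfig);
}
```

The developer only writes the two callbacks; the webcam access, the neural network and the detection loop stay hidden.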

So we propose solutions which hide the complexity of deep learning. These three aspects are hidden from the final user:

  • The complexity of the training: to avoid overfitting, we train our models exclusively with 3D generators. We use a powerful internal training application to monitor the training of network models and to change hyperparameters easily, without having to write, test and maintain hundreds of Python scripts,
  • The structure of the neural network: a final user who is not a deep learning expert may not be able to guess the best structure for the neural network. In particular, the balance between the accuracy of the result and the computational complexity is hard to determine,
  • The implementation of the trained neural network: the outputs of a neural network are always noisy, so we need to filter them. We also need to get the video stream, and to move the detection window over the full video frames until an object is detected, etc.
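To illustrate the filtering point: a simple and common way to smooth noisy per-frame outputs is an exponential moving average. This is an illustrative sketch of the idea, not our actual implementation:

```javascript
// Exponential moving average: a common way to smooth the noisy
// per-frame outputs of a neural network before displaying them.
// k in (0, 1]: lower k = smoother but laggier output.
function createSmoother(k) {
  let state = null;
  return function smooth(value) {
    state = (state === null) ? value : k * value + (1 - k) * state;
    return state;
  };
}

// Example: a detected face rotation angle jittering around 10 degrees:
const smoothAngle = createSmoother(0.3);
const noisy = [10.4, 9.7, 10.2, 9.9, 10.1];
const smoothed = noisy.map(smoothAngle);
// the smoothed sequence varies much less frame to frame than the raw one
```

In a real pipeline each tracked quantity (position, rotation, expression coefficients…) would get its own smoother, tuned between responsiveness and stability.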

TensorFlow is like a factory. A state-of-the-art factory of course, not like this one. You can build all kinds of cars, good ones as well as bad ones. But most users just want a car ready to hit the road. At Jeeliz, we don't make the factory accessible to the user: we deliver a well-built car, fitted for its environment.