Deep Render recently introduced the first AI codec into FFmpeg and VLC. This marks a significant step for AI codecs, making them readily available in tools widely used in the compression industry. Deep Render will continue to push the frontier of AI codecs by further improving compression performance, providing support for more hardware platforms, and improving feature diversity.
However, AI codecs hold far more promise than the initial 40% compression efficiency gain they have already delivered. The opportunity I'll be highlighting in this blog is called specialisation. If you would like to brush up on AI codecs before reading, I can recommend my primer course here.
Since AI codecs are learnt from data, we can be creative with the data they're learnt from to effect change in the model's behaviour. Specialisation is a method developed at Deep Render that limits the training data to a subset of all the data in the world. For example, it is possible to train an AI codec only on sequences of dogs. This makes the codec extremely good at compressing footage of dogs while degrading its performance on everything else.
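To make this concrete, here is a minimal, self-contained sketch of what specialisation looks like in a generic PyTorch-style training loop: the only change from ordinary training is that the data is filtered down to a single visual domain. The toy codec and synthetic clips below are illustrative stand-ins, not Deep Render's actual model or training code.

```python
import torch
import torch.nn as nn

class ToyCodec(nn.Module):
    """Stand-in for an AI codec: a tiny encoder/decoder pair."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Conv2d(3, 8, kernel_size=4, stride=4)           # "compress"
        self.decoder = nn.ConvTranspose2d(8, 3, kernel_size=4, stride=4)  # reconstruct

    def forward(self, x):
        latent = self.encoder(x)
        return latent, self.decoder(latent)

# Pretend corpus of (clip, label) pairs. Specialisation = keep one label only.
corpus = [(torch.rand(3, 64, 64), label) for label in ["dog", "cat", "car"] * 32]
dog_clips = [clip for clip, label in corpus if label == "dog"]

codec = ToyCodec()
optimiser = torch.optim.Adam(codec.parameters(), lr=1e-3)

for clip in dog_clips:
    batch = clip.unsqueeze(0)                 # (1, 3, 64, 64)
    latent, recon = codec(batch)
    # Toy rate-distortion proxy: latent magnitude (rate) + MSE (distortion).
    loss = latent.abs().mean() + nn.functional.mse_loss(recon, batch)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```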
This is a contrived example, but this ability to control the training data has significant real-world applications. Specialisation can already provide a 30% efficiency gain on top of the 40% gain that Deep Render's codec already provides, leading to a total of ~50% compression efficiency gain. When you pair specialisation with the fact that AI codecs are essentially software-based, running on the general-purpose neural processing units (NPUs) widely available in modern devices, the future of codecs starts to look interesting.
To set the scene, film titles are encoded once and streamed tens to hundreds of millions of times. At a high level, the process works as follows: a film title is encoded, and the bitstream is stored on many local servers around the world. When a user wants to view the title, it's fetched from a nearby server, decoded and displayed on the screen.
With AI codecs and their ability to specialise, we can train our state-of-the-art (SOTA) AI codec on a single title, by limiting the training data to sequences from that title alone, and realise an additional 30% compression efficiency gain, leading to a total of 50% bitrate reduction. Once this model is trained, we use it to compress the title's sequences and create bitstreams. The total model size is anywhere between 300KB and 2MB, which means it's small enough to stream over the network.
With this in mind, the new content delivery pipeline would look as follows: an AI codec is specialised on a film title, giving us a title-specific codec. The encoder of this specialised model is used to create bitstreams. When a user wants to view the title, we first stream the specialised decoder and then the bitstreams. On the client end, we use the streamed decoder to decode the arriving bitstreams. Since AI codecs are software-based, the streamed decoder can be initialised on the NPU effortlessly. This gives rise to per-title codecs, which are streamable and capable of realising an additional 30% on top of the already excellent compression efficiency provided by AI codecs. In this specialised-codec world, we would have one codec per film title, but since film popularity follows a power law, we can reap most of the benefits by applying this only to the top titles.
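A rough sketch of what the client side of this pipeline could look like is below. The CDN layout, the NpuDecoder wrapper and the segment naming are assumptions made purely for illustration; they do not describe an existing API.

```python
import urllib.request

CDN = "https://cdn.example.com"  # placeholder, not a real endpoint

def fetch(path: str) -> bytes:
    """Download a small object (decoder weights or a bitstream segment)."""
    with urllib.request.urlopen(f"{CDN}/{path}") as resp:
        return resp.read()

class NpuDecoder:
    """Stand-in for a specialised AI decoder initialised on the device NPU."""
    def __init__(self, weights: bytes):
        self.weights = weights  # in practice: hand the weights to the NPU runtime

    def decode(self, bitstream: bytes):
        ...  # device/runtime specific; would return decoded frames

def play_title(title_id: str, num_segments: int):
    # Step 1: stream the small (~300KB-2MB) title-specific decoder, once.
    decoder = NpuDecoder(fetch(f"titles/{title_id}/decoder.bin"))
    # Step 2: stream the bitstream segments produced by the matching encoder
    # and decode them locally with the specialised decoder.
    for i in range(num_segments):
        yield decoder.decode(fetch(f"titles/{title_id}/segments/{i}.bin"))
```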
As an extension to this, we can also utilise the method described above on film chunks, as Netflix does with their dynamic optimiser, to achieve a per-chunk codec. This will unlock even more gains since we’re further restricting the data domain on which our model is trained. This, of course, has to be traded off with having to stream a codec per chunk, which could offset any savings.
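As a back-of-the-envelope check on that trade-off, with numbers that are entirely made up for illustration, a per-chunk decoder only pays for itself once the bits it saves exceed its own size:

```python
# Illustrative break-even check for per-chunk codecs; all numbers are made up.
decoder_size_mb = 1.0      # assumed size of one streamed chunk decoder, in megabytes
chunk_bitrate_mbps = 4.0   # assumed bitrate of the chunk without specialisation
extra_gain = 0.30          # additional reduction from per-chunk specialisation

# The decoder cost is fixed per chunk, while the bits saved grow with chunk
# length, so very short chunks can cost more to serve than they save.
break_even_s = decoder_size_mb * 8 / (chunk_bitrate_mbps * extra_gain)
print(f"Chunks shorter than ~{break_even_s:.0f}s cost more than they save.")
```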
Similar to the example above, we can also restrict the training data of an AI codec to gameplay from a specific game. This would result in per-title codecs for games; for example, we could have a codec specialised to Call of Duty used to encode only Call of Duty gameplay and generate the respective bitstreams to be sent across the network.
For cloud gaming platforms, when a user requests a specific game, the platform would first stream the game-specific AI decoder and then the bitstreams generated by the matching encoder. This would significantly reduce the bits needed for cloud gaming, providing an additional 30% gain and a total bitrate reduction of 50%.
Additionally, games often have visually distinct maps and regions. For example, you can choose to play Call of Duty on different terrains, which essentially means the game loads and renders a particular set of textures. We can generate a codec per map or terrain, which lets us specialise our codec even further and reap additional compression efficiency gains. You can extend this concept to any visually distinct region of a game, such as different sections of a given map.
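In practice, the client would need to pick the right decoder as the player moves between maps. A sketch of that selection step is below, reusing the hypothetical fetch and NpuDecoder helpers from the earlier sketch; the map names and manifest layout are invented for illustration.

```python
from functools import lru_cache

# Hypothetical manifest: map or terrain -> path of its specialised decoder on the CDN.
DECODER_MANIFEST = {
    "urban":  "cod/decoders/urban.bin",
    "desert": "cod/decoders/desert.bin",
    "arctic": "cod/decoders/arctic.bin",
}

@lru_cache(maxsize=4)  # keep the last few decoders warm across map changes
def decoder_for_map(map_name: str) -> "NpuDecoder":
    # `fetch` and `NpuDecoder` are the hypothetical helpers from the earlier sketch.
    path = DECODER_MANIFEST.get(map_name, "cod/decoders/generic.bin")
    return NpuDecoder(fetch(path))
```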
We can also apply specialisation in video conferencing to achieve further bitrate reductions.
Imagine the following addition to your favourite video conferencing app. When you download and set up the app, it asks you to record a 10-second clip of yourself repeating a few sentences. This recorded clip is used to create an AI codec specialised to you, enabling high-quality, smooth video conferencing. Would users want this?
How can a video conferencing provider achieve these additional gains? They would take the 10-second clip and apply domain randomisation to create more data samples. Next, they would use this data to train an AI codec specialised to the person. Once this is complete, a video conferencing call would look as follows. At the start of a call, say a one-to-one call, each user streams their personalised AI decoder to the other. They then use their personalised AI encoder to encode their webcam stream and send the bitstream to the other user, who decodes it with the AI decoder specialised to the person who sent it.
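One plausible shape for the data-expansion step is sketched below, using generic image augmentations to turn the short enrolment clip into many varied training samples. The specific transforms, and whether they resemble what a real provider would use, are assumptions for illustration only.

```python
import torch
from torchvision import transforms

# Generic augmentations as a stand-in for domain randomisation: vary framing,
# lighting and focus so 10 seconds of video yields many distinct training samples.
augment = transforms.Compose([
    transforms.RandomResizedCrop(256, scale=(0.7, 1.0)),                   # framing
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3),  # lighting
    transforms.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),              # webcam blur
])

def expand_clip(frames: torch.Tensor, copies: int = 20) -> torch.Tensor:
    """frames: (N, 3, H, W) tensor decoded from the 10-second enrolment clip."""
    samples = [augment(frame) for _ in range(copies) for frame in frames]
    return torch.stack(samples)
```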
This has the potential to improve compression performance significantly. Currently, Deep Render provides a 70% improvement over AV1 on talking-head content, and a per-person specialised codec would add another 30%: a 70% reduction leaves 30% of the AV1 bitrate, and a further 30% cut leaves roughly 21%, i.e. about a 5x improvement in video conferencing compared to AV1, a remarkable feat.
As evident from the previous sections, the paradigm change behind AI codecs and specialisation affords enormous creative freedom and compression efficiency gains. To date, Deep Render has verified that specialisation can give up to 30% efficiency gains, but we don't yet know the limit. We have, however, not implemented any of the above pipelines in production. Inevitably, there will be challenges when productionising, and some trade-offs will have to be made, but I think it will be worth it. I firmly believe that AI codecs are the future of compression and that specialised AI codecs are the future of AI codecs.