Part of the journey to build the best video codec involves diving deep into the competition. AOMedia Video 1 (AV1), created by the Alliance for Open Media (AOM), is regarded as one of the best codecs available today, and unlike comparable codecs, AV1 is royalty-free and open-source.
If AV1 is so good, why are H.264/AVC and H.265/HEVC much more common? Well, the pre-existing encoder for AV1 called libaom was too slow for most use cases. Scalable Video Technology AV1 (SVT-AV1) drastically reduced encoding time compared to libaom, making AV1 a candidate for use in real-time streaming. With the SVT-AV1 encoder, the AV1 codec is an enticing codec choice for many in the video industry going forward.
With that in mind, this blog explores the performance of SVT-AV1 and walks through its capabilities in different use cases. We will match encoding settings to different use cases, show some benefits and drawbacks of the different choices we can make, and explain some video codec benchmarking techniques.
Within SVT-AV1 there are two main modes: low-delay and random access. Each mode is designed for a specific use case.
Low-delay mode is used for many applications like video conferencing, cloud gaming, and screen casting. In low-delay mode, the sender encodes videos frame by frame at a rate faster than the playback speed (e.g. >30 frames per second for a 30FPS video stream). Generally, the encoder only uses previous frames to encode the next. This limits the video quality for a given bitrate, but makes real-time encoding possible, as the recipient can receive each frame in order and immediately display it once decoded.
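To make the real-time constraint concrete, here is a minimal sketch of the per-frame time budget. The function names are hypothetical, and the check compares only an average encode time against the budget; it ignores network and decoding latency.

```python
def frame_budget_ms(playback_fps: float) -> float:
    """Time available to encode one frame if the encoder must keep up with playback."""
    if playback_fps <= 0:
        raise ValueError("playback_fps must be positive")
    return 1000.0 / playback_fps


def keeps_up(encode_time_ms: float, playback_fps: float) -> bool:
    """True when the average per-frame encode time fits inside the frame budget."""
    return encode_time_ms < frame_budget_ms(playback_fps)


if __name__ == "__main__":
    print(round(frame_budget_ms(30), 1))  # roughly 33.3 ms per frame at 30 FPS
    print(keeps_up(25.0, 30))             # 25 ms per frame fits the budget
    print(keeps_up(40.0, 30))             # 40 ms per frame does not
```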
Random access mode is best suited for the Video-on-Demand setting, where videos only need to be encoded once. This means encoding time can be arbitrarily long and the goal is to get the best resulting video for a given amount of compression. Random access mode references both previous and future frames during encoding, resulting in a lower bitrate. However, this also means that random access mode can’t be used in a low-latency live streaming setting because the future frames are unknown.
In the still image shown above, this parrot is in the middle of abruptly moving its head. In low-delay mode, the encoder only has access to the previous location of the head, resulting in a worse image with some pixels out of place to the left of the beak. On the other hand, in random access mode, the location of the bird’s head in the next frame can be used to make the motion look smoother, removing these artefacts.
Distortion refers to the difference between the compressed video and the original video. There are many different visual quality metrics, but one of the most strongly correlated with human perception is VMAF, which includes both frame-wise and temporal (between-frame) components. VMAF ranges from 0 to 100, with 100 being the highest visual quality. In addition to VMAF, the frame-wise peak signal-to-noise ratio (PSNR) computed in YUV colour space is also commonly used. It is derived from the mean squared error between the distorted and the original image.
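As a rough illustration of how PSNR follows from the mean squared error, the sketch below computes it for a single plane of samples (for example, the Y plane of one frame) given as flat lists of integers. How the Y, U and V planes are combined into one PSNR YUV score is a convention that varies between tools, so this sketch handles one plane only; the sample values are made up.

```python
import math


def mse(reference, distorted):
    """Mean squared error between two equally sized sequences of samples."""
    if len(reference) != len(distorted) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    return sum((r - d) ** 2 for r, d in zip(reference, distorted)) / len(reference)


def psnr(reference, distorted, bit_depth=8):
    """PSNR in dB: 10 * log10(peak^2 / MSE), where peak is the largest sample value."""
    peak = (1 << bit_depth) - 1  # 255 for 8-bit, 1023 for 10-bit
    error = mse(reference, distorted)
    if error == 0:
        return math.inf  # identical inputs have no distortion
    return 10.0 * math.log10(peak * peak / error)


if __name__ == "__main__":
    original = [100, 120, 140, 160]
    compressed = [101, 119, 142, 158]  # MSE is (1 + 1 + 4 + 4) / 4 = 2.5
    print(round(psnr(original, compressed), 2))  # about 44.15 dB
```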
In an ideal world, every video would have the maximum visual quality, but we also care about the size of the resulting video file. The file size measurement we care about is bitrate, which is the number of bits of information per second in the video. For a given visual quality (e.g. a VMAF score of 85), the best codec will have the lowest bitrate.
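Bitrate can be estimated from the size of the encoded file and the duration of the video. The helper below is a minimal sketch with a hypothetical name; it treats the whole file as video data, so any container overhead is counted as well.

```python
def bitrate_kbps(file_size_bytes: int, duration_seconds: float) -> float:
    """Average bitrate in kilobits per second (1 kbit = 1000 bits)."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    bits = file_size_bytes * 8
    return bits / duration_seconds / 1000


if __name__ == "__main__":
    # A 1.5 MB file holding 10 seconds of video averages 1200 kbps.
    print(bitrate_kbps(1_500_000, 10))
```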
Rate-distortion curves plot the measured distortion of a video against its bitrate and are used to understand codec performance. To obtain a representative sample of quality levels across bitrates, a rate control mode is applied. In this case, we use a fixed quantization parameter (QP), which fixes the amount of compression applied to each subsection of every frame in the video and so allows a fair comparison across modes. Each point on these rate-distortion curves corresponds to a chosen QP. More sophisticated rate control modes, such as constant quality, average bitrate and constant bitrate, are more commonly used in practice, but fixed QP is very useful in video encoding research because it is the simplest form of rate-distortion control.
These rate-distortion curves are generated by encoding the videos from the MCL-JCV dataset with a set of 7 QP values, computing the quality metrics for each video, and averaging them across the 30 videos. On the rate-distortion curves shown above, it’s clear that the random access mode achieves a lower bitrate at the same quality level. Thus, when encoding times are unrestricted, random access mode yields higher compression efficiency than low-delay mode.
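The averaging step described above can be sketched as follows. The tuple layout and function name are assumptions made for illustration, and the numbers in the usage example are invented, not measurements from MCL-JCV.

```python
from collections import defaultdict
from statistics import mean


def average_rd_points(results):
    """Average bitrate and quality per QP across videos.

    `results` is an iterable of (video_name, qp, bitrate_kbps, quality) tuples.
    Returns a list of (qp, mean_bitrate, mean_quality), sorted by QP, where
    each entry is one point on the averaged rate-distortion curve.
    """
    by_qp = defaultdict(list)
    for _video, qp, bitrate, quality in results:
        by_qp[qp].append((bitrate, quality))

    curve = []
    for qp in sorted(by_qp):
        bitrates, qualities = zip(*by_qp[qp])
        curve.append((qp, mean(bitrates), mean(qualities)))
    return curve


if __name__ == "__main__":
    # Illustrative values only: two videos, two QP values.
    measurements = [
        ("video_a", 20, 4000.0, 95.0),
        ("video_b", 20, 2000.0, 93.0),
        ("video_a", 40, 1000.0, 80.0),
        ("video_b", 40, 500.0, 76.0),
    ]
    for point in average_rd_points(measurements):
        print(point)
```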
SVT-AV1 can be run through FFmpeg or through its standalone encoder and decoder, SvtAv1EncApp and SvtAv1DecApp. In addition to choosing between modes, several preset options are also available which enable or disable the features used during encoding. Lower preset numbers enable more time-consuming features, such as using a larger range of macroblock sizes, using non-square macroblocks, and performing global motion compensation. These features decrease the overall bitrate but maintain a higher visual quality through smarter compression.
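As a sketch of the FFmpeg route, the helper below builds (but does not run) a command line that selects a preset. It assumes an FFmpeg build that includes the libsvtav1 encoder and uses its `-preset` and `-crf` options; the function name and file paths are hypothetical, and the valid value ranges depend on the FFmpeg and SVT-AV1 versions installed. Note that `-crf` selects a constant-quality mode, which is not the fixed-QP configuration used for the rate-distortion curves in this post.

```python
import shlex


def svtav1_ffmpeg_command(input_path, output_path, preset, crf):
    """Build an FFmpeg argument list for an SVT-AV1 encode (nothing is executed)."""
    return [
        "ffmpeg",
        "-i", input_path,
        "-c:v", "libsvtav1",     # requires FFmpeg built with libsvtav1
        "-preset", str(preset),  # lower number: slower, more coding tools
        "-crf", str(crf),        # constant-quality target, not fixed QP
        output_path,
    ]


if __name__ == "__main__":
    slow = svtav1_ffmpeg_command("input.y4m", "preset5.mkv", preset=5, crf=35)
    fast = svtav1_ffmpeg_command("input.y4m", "preset11.mkv", preset=11, crf=35)
    print(shlex.join(slow))
    print(shlex.join(fast))
```

The resulting list can be passed to `subprocess.run` once the flags have been checked against the local FFmpeg build.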
Above, we show a visual comparison of two different presets. The image encoded with preset 11 has less detail and some artefacts near Lady Blue Bird’s face and tail, while the image encoded with preset 5 has no clarity issues. Of the two presets, preset 5 has a higher compression efficiency but a longer encoding time, as it uses more encoding features. In addition to choosing between modes, it’s important to choose a preset that is fast enough for the use case.
In addition to finding visual differences, we compare the VMAF and PSNR YUV rate-distortion curves across the different presets in random access mode. Here we can see that for both metrics, the curve shifted furthest to the left and closest to the top is the slowest option, preset 5, meaning it has the highest quality and the lowest bitrate across the same set of QP values. As expected, the fastest option, preset 11, has the lowest quality-to-bitrate ratio of the displayed presets.
Encoding in 10-bit offers several advantages over encoding in 8-bit. 10-bit increases the visual quality by increasing the available range of colours, making gradients smoother, and reducing banding artefacts. In terms of compression efficiency, using 10-bit encoding reduces quantization and rounding errors. At a small cost in file size and decoding efficiency, 10-bit is generally known to perform better than 8-bit in terms of both objective metrics and subjective quality.
Above is an example of banding artefacts, which may appear when the range of colours available to the codec is not sufficient to represent a smooth gradient. The left gradient represents what might happen with 8-bit encodes, where there are clear rings of colours. The right gradient shows how 10-bit encodes can allow for a smooth change from light to dark.
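The effect of bit depth on a gradient can be illustrated by counting how many distinct code values are available across a narrow brightness range. This is a simplified sketch that models only the rounding of a normalised intensity to full-range integer codes; it does not model the codec, and the gradient endpoints and width are arbitrary choices.

```python
def quantize(value: float, bit_depth: int) -> int:
    """Map a normalised intensity in [0, 1] to the nearest integer code value."""
    max_code = (1 << bit_depth) - 1  # 255 for 8-bit, 1023 for 10-bit
    return round(value * max_code)


def distinct_levels(start: float, stop: float, width: int, bit_depth: int) -> int:
    """Count the distinct code values used by a linear gradient `width` pixels wide."""
    codes = {
        quantize(start + (stop - start) * x / (width - 1), bit_depth)
        for x in range(width)
    }
    return len(codes)


if __name__ == "__main__":
    # A subtle gradient spanning 5% of the brightness range over 1920 pixels.
    eight = distinct_levels(0.40, 0.45, 1920, bit_depth=8)
    ten = distinct_levels(0.40, 0.45, 1920, bit_depth=10)
    print(eight, ten)  # 10-bit offers roughly four times as many steps
```

Fewer available steps means each step covers a wider band of pixels, which is what shows up as visible rings in the 8-bit gradient.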
In addition to visual quality, increasing the bit depth can improve objective quality.
According to VMAF and PSNR YUV, random access mode performs better in 10-bit than 8-bit, matching the expected relationship that 10-bit improves visual quality. Unexpectedly, for low-delay mode, the PSNR YUV and VMAF plots suggest that 8-bit is better than 10-bit.
At the end of the day, the subjective visual quality of the videos is the most important and these videos must be visually inspected to understand this result. Objective metrics are necessary to standardise the measurement of compression quality between codecs, but they must be used in conjunction with subjective quality scores.
Throughout this blog, we explored some of the different modes and features available in SVT-AV1. We showed that varying the codec parameters can cause major shifts in compression ratios, visual quality, and encoding times, which helps show why it’s crucial to align the codec parameters with different use cases. Through common benchmarking techniques like analysing rate-distortion curves and visually inspecting the videos, we learned about the performance of SVT-AV1 in a variety of use cases. Now that we’ve assessed the performance of SVT-AV1 for different scenarios, we can make fair comparisons to other codecs.
Thanks for reading!