If you've ever worked with digital colour before, you've probably noticed that computers typically represent colours as a combination of three intensities: red, green, and blue (RGB). Not only does this reflect how screens actually display colours, it's a decent model for how the human eye physically perceives them as well. Our retinas contain three kinds of cone cell that respond to different wavelengths of light - long (e.g. red), medium (green), and short (blue) wavelengths - and it's through these “tristimulus values” that our brains create the psychological phenomenon of colour. Therefore, it's pretty sensible that we digitally store colour information in this way.
Clockwise from top left: a colour photograph, its resulting R channel, G channel, and B channel.
However, representations that work well in one context may suffer in another; RGB values may be handy for computer display, but they don’t tell you much about a colour. For example, consider the RGB representation for pure white: (255, 255, 255) - this is computer science, so colour components (“channels”) are often given whole-number values from 0 to 255. Under our model, this tells us that white is the most red that any colour could be, the most green, and the most blue possible. But we know that white could easily be more red - it could be pink! What’s going on here?
The crux of the issue is that our physical perception of colour does not align well with our mental experience of it. Trying to accurately model how the mind perceives something is a terrifically complex task, but in colour science a different triplet of channels has emerged to describe how we experience colour - one for its lightness, one for how blue or yellow it is, and one for how red or green it is. We'll refer to the first channel as describing “luma” and the latter two together as “chroma” (the actual content of the colour). While this description seems arbitrary at first, it accurately models how people interpret colours in context and corresponds extremely well to how people with colour-deficient vision see the world - each kind of colour deficiency can be represented by reducing the range of one or both chroma channels. So when taking a human perspective on colour, we really need to be using a luma-chroma system.
YUV is one such family of luma-chroma colour systems (often called “colour spaces”) wherein the first channel represents luma and the second and third represent chroma. “Y” represents luma for historical reasons, and “U” and “V” were selected because they weren't being used for any other colour space. YUV was originally a single colour space used in TV signal transmission, but nowadays the term is bandied about to refer to a number of related spaces. With that in mind, we’ll mostly be discussing “YUV” as if it’s just one system, as all its versions follow the same logic.
Clockwise from top left: the original RGB image, its resulting Y (luma) channel, U channel, and V channel. For this version of YUV, we used the BT.601 standard.
As a luma-chroma space, YUV is more useful for representing colour information in certain situations and has found several applications in the field of image compression. While this is partly because separating luma from chroma makes images inherently easier to compress, a more important reason is that the human visual system doesn't care about chroma all that much. Most perception-relevant image information is actually reconstructed from luma, which is why black-and-white images are just as understandable as those in colour despite missing two thirds of their channels! As a result, YUV image data is usually stored in what's referred to as “420” format. In YUV420, luma is stored at full resolution and each chroma channel is stored at half resolution in both width and height, meaning that there are four times as many luma values stored as there are values for either chroma channel. The process of capturing half-resolution chroma channels is called “chroma subsampling”.
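To make chroma subsampling concrete, here is a minimal sketch that halves a chroma plane in both dimensions by averaging each 2x2 block of values. Averaging is just one simple choice made for illustration - real encoders may use different filters and sample positions - and the function name `subsample_420` is our own.

```python
def subsample_420(plane):
    """Halve a chroma plane in width and height by averaging each 2x2 block.

    `plane` is a list of equal-length rows; even width and height are
    assumed to keep the sketch short.
    """
    height, width = len(plane), len(plane[0])
    return [
        [
            (plane[y][x] + plane[y][x + 1]
             + plane[y + 1][x] + plane[y + 1][x + 1]) / 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]


# A 4x4 chroma plane becomes 2x2: one stored value per 2x2 block of pixels.
u_plane = [
    [10, 20, 30, 40],
    [10, 20, 30, 40],
    [50, 50, 90, 90],
    [70, 70, 90, 90],
]
u_small = subsample_420(u_plane)
```

The luma plane is left untouched, so a 4x4 image keeps 16 luma values but only 4 values for each chroma channel.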
420 format on its own halves the size of an image file compared with storing three full-resolution channels, and this can be further reduced by compressing the luma and subsampled-chroma channels. This is what JPEG typically does, converting RGB images to YUV420 and compressing each channel individually. Due to the perceptual qualities of the YUV colour space, compression artefacts in YUV are generally less noticeable to human observers than those produced by compressing images in RGB space. The human visual system is also roughly six times more sensitive to changes in luma than to changes in chroma, so the chroma channels can be compressed to an even greater degree without a noticeable loss of image quality.
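The halving is easy to check by counting stored values. The sketch below does this for a 1920x1080 frame, assuming one value per channel sample and ignoring any file headers; `values_stored` is a hypothetical helper written for this example.

```python
def values_stored(width, height, chroma_divisor):
    """Count stored channel values for one image.

    `chroma_divisor` is 1 for full-resolution chroma and 2 for 420,
    where each chroma plane is halved in both width and height.
    """
    luma = width * height
    chroma = 2 * (width // chroma_divisor) * (height // chroma_divisor)
    return luma + chroma


full = values_stored(1920, 1080, 1)        # three full-resolution channels
subsampled = values_stored(1920, 1080, 2)  # YUV420
ratio = subsampled / full
```

Each chroma plane shrinks to a quarter of its original size, so the total goes from 3 values per pixel to 1 + 1/4 + 1/4 = 1.5 values per pixel.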
The catch is that there are several different versions of YUV that a particular camera manufacturer could be referring to, and you have to use the same version for converting into and out of YUV space. In general, we convert between colour spaces using geometric matrix transformations. The conversion matrices for the YUV spaces all have the same mathematical structure, but their exact values differ by the "luma coefficients" - numbers that describe how to form the luma channel from an RGB colour - that their standard specifies. For example, BT.601 uses the luma coefficients (0.299, 0.587, 0.114), whereas BT.709 uses (0.2126, 0.7152, 0.0722). These look like small differences, but a mismatch in the conversions can cause an image to veer sickly green or yellow.
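As a rough illustration, the sketch below builds the conversion directly from a set of luma coefficients and shows what happens when the two directions disagree. It assumes the full-range form with both chroma channels centred on 128 (the form used by JFIF-style JPEG files); other YUV variants scale and offset the channels differently. The function names are our own, and values are left unrounded and unclipped for clarity.

```python
BT601 = (0.299, 0.587, 0.114)
BT709 = (0.2126, 0.7152, 0.0722)


def rgb_to_yuv(rgb, coeffs):
    """Convert (R, G, B) in 0..255 to unrounded (Y, U, V).

    Assumes the full-range form with chroma centred on 128.
    """
    kr, kg, kb = coeffs
    r, g, b = rgb
    y = kr * r + kg * g + kb * b           # luma: weighted sum of R, G, B
    u = 128 + (b - y) / (2 * (1 - kb))     # scaled blue-minus-luma difference
    v = 128 + (r - y) / (2 * (1 - kr))     # scaled red-minus-luma difference
    return (y, u, v)


def yuv_to_rgb(yuv, coeffs):
    """Invert rgb_to_yuv for the same luma coefficients (no clipping)."""
    kr, kg, kb = coeffs
    y, u, v = yuv
    r = y + 2 * (1 - kr) * (v - 128)
    b = y + 2 * (1 - kb) * (u - 128)
    g = (y - kr * r - kb * b) / kg         # solve the luma equation for G
    return (r, g, b)


original = (200, 100, 50)
encoded = rgb_to_yuv(original, BT601)
matched = yuv_to_rgb(encoded, BT601)     # same standard both ways
mismatched = yuv_to_rgb(encoded, BT709)  # decoded with the wrong standard
```

With matching coefficients the original colour comes back (up to floating-point error), whereas decoding BT.601 data with BT.709 coefficients shifts every channel by several levels. Written out, each function is equivalent to multiplying by a 3x3 matrix and adding an offset.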
The other catch is that YUV can represent more colours than RGB can, in a sense. Strictly speaking it's the same number - each space digitally stores colours as a triplet of whole numbers ranging from 0 to 255, so they have exactly the same number of possible combinations. However, although every colour in RGB space has a corresponding colour in YUV space (rounding any decimals as needed), around three-quarters of all possible colours in YUV don't have an adequate representation in RGB. Let's call the cube of allowable colours in a space its “gamut”. When the RGB gamut is transformed into YUV space it is stretched, shifted, and shrunk to roughly a quarter of its original volume, meaning that the entire RGB gamut fits easily within the YUV gamut. Equivalently, the YUV gamut is roughly four times larger than the RGB gamut when both are transformed back into RGB space, so there's just no way that the majority of YUV colours can fall within the RGB gamut. Therefore, they do not have a valid representation in RGB.
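The “roughly a quarter” figure can be estimated numerically. The sketch below samples a coarse grid of YUV triplets, converts each to RGB, and counts how many land inside the 0 to 255 cube. It assumes BT.601 luma coefficients and the full-range form with chroma centred on 128; the helper names are our own. For that assumed form, the determinant of the RGB-to-YUV matrix works out to Kg / (4 (1 - Kr)(1 - Kb)), which gives the same ratio analytically.

```python
BT601 = (0.299, 0.587, 0.114)


def yuv_to_rgb(yuv, coeffs=BT601):
    """Convert (Y, U, V) to unclipped (R, G, B); chroma is centred on 128."""
    kr, kg, kb = coeffs
    y, u, v = yuv
    r = y + 2 * (1 - kr) * (v - 128)
    b = y + 2 * (1 - kb) * (u - 128)
    g = (y - kr * r - kb * b) / kg
    return (r, g, b)


def in_rgb_gamut(rgb):
    """True if every channel lies within the 0..255 cube."""
    return all(0 <= c <= 255 for c in rgb)


def in_gamut_fraction(step=8, coeffs=BT601):
    """Estimate the share of YUV triplets that decode to a valid RGB colour."""
    samples = range(step // 2, 256, step)  # coarse grid to keep this fast
    hits = sum(
        in_rgb_gamut(yuv_to_rgb((y, u, v), coeffs))
        for y in samples for u in samples for v in samples
    )
    return hits / len(samples) ** 3


def volume_ratio(coeffs=BT601):
    """Volume scaling of the RGB cube under the assumed RGB-to-YUV matrix."""
    kr, kg, kb = coeffs
    return kg / (4 * (1 - kr) * (1 - kb))


estimate = in_gamut_fraction()
analytic = volume_ratio()
```

Both numbers come out a little under a quarter, which is where the “around three-quarters of YUV colours have no RGB equivalent” figure comes from.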
This causes an asymmetry in the colour conversions. Because all colour information we store must lie within the appropriate gamut to be meaningful, out-of-gamut channel values are clipped to be in-gamut. RGB colours don’t have to be clipped when they are transformed into YUV space because they are all in-gamut, and of course do not have to be clipped when they are transformed back into RGB space. Most YUV colours however do have to be clipped when they are converted to RGB, and once that information is gone there's no way to recover it. Depending on the set-up, this may never become a serious problem. Cameras don't capture images in YUV directly - they capture them in RGB, and then convert them into YUV at some point in their image processing pipeline - and so may never produce an out-of-gamut colour. With that being said, the exact details of this process will change from use case to use case, so your mileage may vary.
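The asymmetry shows up clearly if we run a round trip in each direction. This sketch again assumes BT.601 coefficients and the full-range form with chroma centred on 128, and it ignores rounding to whole numbers so that the effect of clipping stands out on its own. The helper names and the example colours are our own.

```python
KR, KG, KB = 0.299, 0.587, 0.114  # BT.601 luma coefficients


def rgb_to_yuv(rgb):
    """Convert (R, G, B) to unrounded (Y, U, V); chroma centred on 128."""
    r, g, b = rgb
    y = KR * r + KG * g + KB * b
    return (y, 128 + (b - y) / (2 * (1 - KB)), 128 + (r - y) / (2 * (1 - KR)))


def yuv_to_rgb(yuv):
    """Convert (Y, U, V) to unclipped (R, G, B)."""
    y, u, v = yuv
    r = y + 2 * (1 - KR) * (v - 128)
    b = y + 2 * (1 - KB) * (u - 128)
    return (r, (y - KR * r - KB * b) / KG, b)


def clip(triplet):
    """Clamp every channel into the storable 0..255 range."""
    return tuple(min(255.0, max(0.0, c)) for c in triplet)


# RGB -> YUV -> RGB: the colour stays in-gamut throughout, so nothing is lost.
rgb = (200, 100, 50)
rgb_back = yuv_to_rgb(rgb_to_yuv(rgb))

# YUV -> RGB -> YUV: this triplet decodes outside the RGB cube and is clipped.
yuv = (128, 255, 255)
rgb_clipped = clip(yuv_to_rgb(yuv))
yuv_back = rgb_to_yuv(rgb_clipped)
```

The first round trip returns the starting colour, while the second comes back as a noticeably different YUV triplet because the clipped RGB values no longer describe the original colour.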
Overall, YUV is a family of very similar spaces that seek to represent colour in a more perceptually-relevant way, and thereby reduce the amount of information needed to store an image. Due to these qualities the space is used all over the place in digital image processing, although not everyone can agree on exactly what version of the space they're using. Here at Deep Render our compression pipeline runs in YUV natively, exploiting the mixed-resolution nature of 420 data in a runtime-friendly manner. In this way, we are the first to produce a real-time runnable AI codec that can work efficiently with YUV420 data, unlocking countless applications with real-world impact.
All the best, and thanks for reading!