Stacking: 90% of mobile cameras' success

True computational photography began with stacking — a method of combining several photos on top of each other. It's not a big deal for a smartphone to shoot a dozen pics in half a second. There are no slow mechanical parts in its camera: the aperture is fixed, and there is an electronic shutter instead of a "moving curtain". The processor simply tells the sensor how many microseconds it should catch the wild photons, then reads the result.

Technically, the phone can shoot photos at video speed, and it can shoot video at photo resolution, but it's all throttled by the speed of the bus and processor. That's why there is always a software limitation.

Stacking has been with us for a while. Even the founders' fathers used Photoshop 7.0 plugins to assemble crazy-sharpened HDR photos or 18000x600 pixel panoramas, and… no one ever figured out what to do with them next. Good wild times.

Now, as grown-ups, we call it "epsilon photography", which means changing one of the camera parameters (exposure, focus, or position) and putting images together to get something that couldn't be captured in one shot. Although, in practice, we call it stacking. Nowadays, 90% of all mobile camera innovations are based on it.

There's a thing many people don't care about, but it's crucial for understanding all of mobile photography: a modern smartphone camera starts taking photos as soon as you open it. Which is logical, since it has to show the image on screen somehow. But in addition to that, it saves high-resolution images to its cyclic buffer and keeps them for a couple more seconds. No, not only for the NSA.

When you tap the "take a photo" button, the photo has actually already been taken, and the camera is just using the last picture from the buffer.

That's how any mobile camera works today. At least the top ones. Buffering enables not only zero shutter lag, which photographers begged for for so long, but even a negative one. When you press the button, the smartphone looks into the past, unloads the last 5-10 photos from the buffer, and furiously starts analyzing and combining them. No need to wait for the phone to snap shots for HDR or night mode — it simply picks them up from the buffer, and the user won't even realize.
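As a rough illustration, here's a minimal sketch of such a cyclic buffer in Python. Every name here is hypothetical; this is not any vendor's actual camera API, just the shape of the idea.

```python
from collections import deque

# A toy model of the camera's cyclic frame buffer. The sensor keeps pushing
# full-resolution frames; only the most recent ones are retained, so tapping
# the shutter can effectively look into the past.
BUFFER_SIZE = 10

frame_buffer = deque(maxlen=BUFFER_SIZE)   # the oldest frame is dropped automatically

def on_sensor_frame(frame):
    # Called continuously while the camera app is open.
    frame_buffer.append(frame)

def on_shutter_tap():
    # The "photo" has already been taken: just hand the buffered frames
    # over to the HDR / night-mode pipeline for merging.
    return list(frame_buffer)
```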

In fact, that's how Live Photo is implemented on iPhones, and HTC had it back in 2013 under the strange name Zoe.

Exposure Stacking: HDR and brightness control

The old but still hot debate is whether camera sensors can capture the entire brightness range available to our eyes. Some say no, since the eye can see up to 25 f-stops while even a top full-frame sensor can be stretched to a maximum of 14. Others call the comparison incorrect, because the eye is assisted by the brain, which automatically adjusts the pupils and completes the image with its neural networks, so the instantaneous dynamic range of the eye is actually no more than 10-14 f-stops. Too hard. Let's leave these disputes to the scientists.

The fact remains: taking pictures of friends against a bright sky without HDR, with any mobile camera, you get either a natural sky and dark faces, or natural faces and a completely burned-out sky.


The solution was found a long time ago — expand the brightness range using the HDR (high dynamic range) process. When we can't capture a wide range of brightness in one shot, we can do it in three steps (or more). We shoot several pictures with different exposures — a "normal" one, a brighter one, and a darker one. Then we fill in the shadows using the bright photo and restore the blown-out spots from the dark one.

The last thing that needs to be solved here is automatic bracketing: how far do we shift the exposure of each photo so as not to overdo it? These days, though, any second-year tech student can do it with a few Python libraries.
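For the curious, here is roughly what that looks like with OpenCV in Python. This is a minimal sketch of exposure stacking, not what phones actually run, and the file names are made up.

```python
import cv2
import numpy as np

# Three handheld shots of the same scene: underexposed, normal, overexposed.
paths = ["dark.jpg", "normal.jpg", "bright.jpg"]   # hypothetical file names
images = [cv2.imread(p) for p in paths]

# Roughly align the frames first, since they were shot handheld.
cv2.createAlignMTB().process(images, images)

# Mertens exposure fusion merges the bracket without needing exposure metadata:
# highlights come from the dark frame, shadows from the bright one.
fused = cv2.createMergeMertens().process(images)   # float image in [0, 1]
cv2.imwrite("hdr_fused.jpg", np.clip(fused * 255, 0, 255).astype(np.uint8))
```

If the exposure times are known, OpenCV's Debevec merge plus a tone-mapping step is the more classical route; exposure fusion just skips the intermediate HDR radiance map.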

The latest phones turn on HDR mode automatically when a simple algorithm inside their cameras detects you're shooting a high contrast scene. Some, like the Google Pixel, use a different strategy combining multiple frames of the same exposure - one chosen to avoid overexposing bright portions of the scene - to reduce noise and even increase detail. More on that later.

The main disadvantage of HDR with exposure bracketing is its incredible uselessness in poor lighting. Even under a home lamp, the images come out so dark that not even a machine can align and stack them. This is because the 'brightest' exposure would be too long for handheld capture, leading to motion blur, while the shorter exposures just don't collect enough light to yield a decent photo. To solve the problem, Google announced a different approach to HDR in a Nexus smartphone back in 2013, one based on time stacking.

Time Stacking: Long exposure and time lapse

Time stacking lets you get a long-exposure look from a series of short shots. This approach was pioneered by the folks who liked to photograph star trails in the night sky. Even with a tripod, it was impossible to shoot such pictures by opening the shutter once for two hours: you had to calculate all the settings beforehand, and the slightest shaking would spoil the whole shot. So they decided to split the process into intervals of a few minutes and stack the pictures together later in Photoshop.

These star patterns are always glued together from a series of photos. That makes it easier to control exposure.

Thus, the camera never actually shoots with a long exposure; we simulate the effect by combining several consecutive shots, each with the same exposure. For smartphone photography, these shots have exposure times short enough to stay sharp handheld. Smartphones have had plenty of apps using this trick for a long time, but now almost every manufacturer has added it to the standard camera tools.
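For star trails, the "glueing" itself can be as simple as a lighten blend: for each pixel, keep the brightest value seen across the series (averaging the frames instead gives the smooth, silky long-exposure look). A minimal sketch, assuming a folder of already-captured frames with made-up names:

```python
import glob
import cv2
import numpy as np

# Time stacking for star trails: many short frames instead of one two-hour exposure.
frame_paths = sorted(glob.glob("stars_*.jpg"))     # hypothetical file names

trails = cv2.imread(frame_paths[0]).astype(np.float32)
for path in frame_paths[1:]:
    frame = cv2.imread(path).astype(np.float32)
    trails = np.maximum(trails, frame)             # the brightest pixel wins, so
                                                   # moving stars leave trails
cv2.imwrite("star_trails.jpg", trails.astype(np.uint8))
```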

A long exposure made from an iPhone Live Photo in 3 clicks.

Let's get back to Google and its night-time HDR. It turned out that using time stacking you could create a decent composite image in the dark. The technology first appeared in the Nexus 5 and was called HDR+. It is still so popular that it was even praised in the latest Pixel presentation.

HDR+ works quite simply: once the camera detects that you're shooting in the dark, it takes the last 8-15 RAW photos out of the buffer and stacks them on top of each other. This way, the algorithm collects more information about the dark areas of the shot to minimize the noise. Technically, noise is reduced throughout the entire image, but the benefit is most noticeable in the shadows and midtones.

Imagine that you have no idea what a capybara looks like, so you decide to ask five people about it. Their stories would be roughly the same, but each would mention some unique detail, so you'd gather more information than by asking only one person. The same happens with the pixels of a photo. More information — more clarity and less noise.

Combining images captured from the same point gives the same fake long-exposure effect as in the star example above. The exposures of dozens of pictures add up, and errors in one picture are averaged out by the others. Imagine how many times you would have to slam the shutter of your DSLR to achieve this.
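Here's a toy illustration of why this works. It is not Google's pipeline, and a real implementation would also align the frames first; the file name is made up. We simulate several noisy readouts of one scene and average them.

```python
import cv2
import numpy as np

clean = cv2.imread("scene.jpg").astype(np.float32)            # hypothetical file

# Pretend each buffered frame is the same scene plus random sensor noise.
rng = np.random.default_rng(0)
frames = [clean + rng.normal(0, 25, clean.shape) for _ in range(9)]

single = np.clip(frames[0], 0, 255).astype(np.uint8)          # one short exposure
stacked = np.clip(np.mean(frames, axis=0), 0, 255).astype(np.uint8)

# Averaging N frames cuts random noise by roughly sqrt(N): here about 3x cleaner.
cv2.imwrite("single_frame.jpg", single)
cv2.imwrite("stacked_frames.jpg", stacked)
```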

Pixel ad that glorifies HDR+ and Night Sight.

But HDR+ doesn't only work in low light. It turns out that time stacking, or 'burst photography' as it's more commonly known, also improves image quality across the board, from daytime to night-time to high-contrast scenes. Put simply, the more total light captured, the better the final result. In fact, capturing high-contrast daylight (or dusk) scenes suffers from a challenge similar to night-time capture: when you choose an exposure that keeps bright areas from clipping, the darker portions are made up of too few photons for a clean result. By capturing and merging multiple exposures, you reduce the amount of noise throughout the image. This is precisely what all Pixel phones do all the time. That's right: HDR+ is no longer a mode, it's the default capture mode of the camera.

Only one thing is left to solve: the automatic color cast. Daylight scenes can usually be color-corrected with traditional white-balancing algorithms, but shots taken in the dark tend to have a broken color balance (yellowish or greenish), since it's hard to judge the dominant light source, and this often has to be fixed manually. In earlier versions of HDR+, the issue was handled with a simple auto-toning fix, à la Instagram filters. Later, they brought a neural network to the rescue.

That's how Night Sight was born — the "night photography" technology in the Pixel 2, 3, and later. The description says "machine learning techniques built on top of HDR+ that make Night Sight work." In fact, it's just a fancy name for a neural network plus all the HDR+ post-processing steps. The neural network was trained on a dataset of "before" and "after" night-time photos taken with Pixel cameras and hand-corrected for pleasing color. The 'learning-based white balance' Google developed worked so well that it is now used to color-correct images across the board on Pixel phones, not just night-time shots.

Also, Night Sight calculates the motion vectors of the objects in the shot to choose an exposure time that yields non-blurred results. Below are some links to a more in-depth look at Night Sight written by Marc Levoy, and a link to Google's announcement of its HDR+ dataset, which it made public to "enable the community to concentrate on comparing results... [an] approach intrinsically more efficient than expecting researchers to configure and run competing techniques themselves, or to implement them from scratch if the code is proprietary."

Motion Stacking: Panorama, super-zoom and noise control

Panoramas have always been a favorite kids' toy. World history knows no case of a sausage-shaped photo being interesting to anyone but its author. Still, it's worth talking about, because that's how stacking entered many people's lives.

The very first useful application of panoramas is making super-resolution photos. By combining multiple slightly shifted images, you can get an image with a much higher resolution than the camera provides — a photo of hundreds of gigapixels, which is very useful if you need to print it on a house-sized billboard.

Another and more interesting approach is called Pixel Shifting. Some mirrorless cameras like Olympus started supporting it as early as 2015, and recent iterations of it from the likes of Panasonic even provide sophisticated motion correction.

Smartphones have succeeded here for a hilarious reason: when you take a picture, your hands shake. This "problem" became the basis for implementing native super-resolution on smartphones.

To understand how it works, we need to remember how any camera sensor is built. Each pixel (photodiode) can capture only the intensity of light, i.e., the number of photons that break through. It cannot measure the color (wavelength). To get an RGB image, we had to hack around this and cover the whole sensor with a grid of multicolored glass. Its most popular implementation is called the Bayer filter and is used in most sensors today.

It turns out that each pixel of the sensor catches only the R, G or B component, because the rest of the photons are mercilessly reflected by the Bayer filter. The missing components are later computed by averaging nearby pixels.

Made by analogy with the human eye, the Bayer filter has more green cells than the others. Thus, out of 50 million pixels on a sensor, about 25 million capture only (!) green light, while 12.5 million capture red and another 12.5 million blue. The colors each pixel did not capture are interpolated from neighboring pixels. This process is called debayering or demosaicing, and it's the fat and funny kludge that holds everything together.

In fact, each sensor has its own tricky and (of course) patented demosaicing algorithm, but in this story we don't care.
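Just to make the idea concrete, here is a toy sketch assuming an RGGB pattern and OpenCV's built-in bilinear demosaicing (the file name is made up): build a Bayer mosaic from an RGB image, then interpolate the missing channels back.

```python
import cv2
import numpy as np

rgb = cv2.imread("scene.jpg")                       # hypothetical file, BGR channel order
h, w = rgb.shape[:2]

# Simulate the raw mosaic: every photosite keeps only one color component.
bayer = np.zeros((h, w), dtype=np.uint8)
bayer[0::2, 0::2] = rgb[0::2, 0::2, 2]              # R on even rows, even columns
bayer[0::2, 1::2] = rgb[0::2, 1::2, 1]              # G
bayer[1::2, 0::2] = rgb[1::2, 0::2, 1]              # G (twice as many green cells)
bayer[1::2, 1::2] = rgb[1::2, 1::2, 0]              # B

# Demosaicing: interpolate the two missing components at every pixel from neighbors.
demosaiced = cv2.cvtColor(bayer, cv2.COLOR_BayerBG2BGR)
cv2.imwrite("demosaiced.jpg", demosaiced)
```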

Other types of sensors (such as Foveon) didn't get that popular. Some rare smartphone manufacturers like Huawei though tried to play with non-Bayer filters to improve noise and dynamic range, to varying degrees of success.

Thanks to the Bayer filter, we lose a ton of photons, especially in the dark. Hence the idea of Pixel Shifting: shift the sensor by one pixel up-down-left-right to catch them all and capture red, green and blue information at each pixel. The photo doesn't become 4 times larger, as you might think; it just means demosaicing is no longer necessary, since all color information is captured at every pixel location.
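A toy sketch of the merging step, assuming an RGGB mosaic and four raw captures shifted by exactly one pixel (the offsets, function, and input data are made up for illustration; no real camera exposes them like this):

```python
import numpy as np

OFFSETS = [(0, 0), (0, 1), (1, 0), (1, 1)]          # assumed one-pixel sensor shifts

def pixel_shift_merge(shots):
    """shots[k] is an HxW raw mosaic captured at OFFSETS[k]; returns an HxWx3 RGB image."""
    h, w = shots[0].shape
    yy, xx = np.mgrid[0:h, 0:w]
    rgb = np.zeros((h, w, 3), dtype=np.float32)
    for (dy, dx), shot in zip(OFFSETS, shots):
        row_even = (yy + dy) % 2 == 0               # which Bayer cell saw this pixel
        col_even = (xx + dx) % 2 == 0
        rgb[..., 0][row_even & col_even] = shot[row_even & col_even]      # red sample
        rgb[..., 2][~row_even & ~col_even] = shot[~row_even & ~col_even]  # blue sample
        green = row_even ^ col_even
        rgb[..., 1][green] += shot[green] / 2.0     # green is seen twice, so average
    return rgb                                      # full color everywhere, no demosaicing
```

Across the four shifts, every pixel is sampled once through red, once through blue and twice through green, so nothing has to be interpolated.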

Our shaking hands make Pixel Shifting natural for mobile photography, and that's how it's implemented in the latest versions of the Google Pixel. You notice it when zooming on a Pixel phone; this zooming is called Super Res Zoom (yes, I also enjoy the harsh naming). Chinese manufacturers have already copied it to their phones, although it's worse than the original.

Stacking slightly shifted photos lets us collect more information about every pixel, reducing noise, sharpening the image, and raising the resolution without increasing the physical number of megapixels on the sensor. Google's Pixel and some other modern Android phones do it automatically when zooming, without their users even realizing. It's so effective that Google uses it in the Pixel's Night Sight mode as well, regardless of zoom, often yielding more detailed results than similar competitors in any situation (contrary to popular belief, Night Sight can be used beneficially even during the day).

Focus Stacking: DoF and refocus in post-production

The method came from macro photography, where depth of field has always been a problem. To keep the entire object in focus, you have to take several shots while moving the focus back and forth, then combine them into one sharp shot in Photoshop. The same method is often used by landscape photographers to make the foreground and background sharp as a shark.

Focus stacking in macro. DoF is too small and you can't shoot it in one go.
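Before it reached phones, the desktop version of the trick looked roughly like this (a minimal sketch, assuming already-aligned, focus-bracketed shots with made-up file names): measure local sharpness in each frame and, for every pixel, keep the value from the sharpest one.

```python
import cv2
import numpy as np

def focus_stack(paths):
    frames = [cv2.imread(p) for p in paths]
    sharpness = []
    for f in frames:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        lap = np.abs(cv2.Laplacian(gray, cv2.CV_32F))       # strong edges = in focus
        sharpness.append(cv2.GaussianBlur(lap, (9, 9), 0))  # smooth the sharpness map
    best = np.argmax(np.stack(sharpness), axis=0)           # sharpest frame per pixel
    stack = np.stack(frames)                                # shape (N, H, W, 3)
    yy, xx = np.mgrid[0:best.shape[0], 0:best.shape[1]]
    return stack[best, yy, xx]                              # pick each pixel from its winner

result = focus_stack(["near.jpg", "mid.jpg", "far.jpg"])    # hypothetical file names
cv2.imwrite("focus_stacked.jpg", result)
```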

Of course, it all migrated to smartphones, though with no hype. Nokia released the Lumia 1020 with a "Refocus App" in 2013, and Samsung did the same in 2014 with the Galaxy S5 and its "Selective Focus". Both used the same approach: quickly take 3 photos — one in focus, one with the focus shifted forward, and one shifted back. The camera then aligned the images and let you choose one of them, which was presented as "real" focus control in post-production.

There was no further processing, as even this simple hack was enough to hammer another nail into the coffin of Lytro and its analogs, which used honest refocusing. We'll talk about them next, by the way.

Continue to Part II: Computational Sensors and Optics