Editor's note: This is the second article in a three-part series by guest contributor Vasily Zubarev.

You can visit Vasily's website, where he also demystifies other complex subjects. If you find this article useful, we encourage you to give him a small donation so that he can write about other interesting topics.

The article has been lightly edited for clarity and to reflect a handful of industry updates since it first appeared on the author's own website.

Computational Sensor: Plenoptic and Light Fields

Well, our sensors are crap. We've simply gotten used to them and try to do our best with what we have. Their design hasn't changed much since the beginning of time. Only the technical process has improved: we reduced the distance between pixels, fought read noise, increased readout speeds, and added dedicated pixels for phase-detection autofocus. But even if we take the most expensive camera and try to photograph a running cat in indoor light, the cat will win.

We've been trying to invent a better sensor for a long time. You can find a lot of research in this field by searching for "computational sensor" or "non-Bayer sensor". Even the Pixel Shifting example can be seen as an attempt to improve sensors with calculations.

The most promising stories of the last twenty years, though, come to us from plenoptic cameras.

To calm your sense of impending boring math, I'll throw in an insider's note: the latest Google Pixel camera is a little bit plenoptic. With only two pixels under each microlens, it's still enough to calculate a fair optical depth map without a second camera like everyone else.

Plenoptics is a powerful weapon that hasn't fired yet.

Plenoptic Camera

Invented in 1994, and first assembled at Stanford in 2004. The first consumer product, Lytro, was released in 2012. The VR industry is now actively experimenting with similar technologies.

A plenoptic camera differs from a normal one by only one modification: its sensor is covered with a grid of lenses, each of which covers several real pixels. Something like this:

If we place the grid and sensor at the right distance, we'll see sharp pixel clusters containing mini-versions of the original image on the final RAW image.

Obviously, if you take only the central pixel from each cluster and build the image from those alone, it won't differ from one taken with a standard camera. Yes, we lose a bit of resolution, but we'll just ask Sony to stuff more megapixels into the next sensor.

That's where the fun part begins. If you take another pixel from each cluster and build the image again, you get another standard photo, only as if it were taken with a camera shifted by one pixel. Thus, with 10x10-pixel clusters, we get 100 images of the scene from "slightly" different angles.
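As an illustration, here's a minimal NumPy sketch of that reshuffling. The function name and the tidy square-cluster layout are my assumptions; real plenoptic RAWs use hexagonal lens grids and need careful calibration.

```python
import numpy as np

def subaperture_views(raw, cluster):
    """Split a plenoptic RAW into cluster x cluster shifted views.

    Assumes an idealized layout: each cluster x cluster block of
    sensor pixels sits exactly under one microlens. Returns an array
    of shape (cluster, cluster, H, W), where views[u, v] is the
    image built from pixel (u, v) of every cluster.
    """
    H, W = raw.shape[0] // cluster, raw.shape[1] // cluster
    blocks = raw.reshape(H, cluster, W, cluster)
    # Pull the intra-cluster coordinates (u, v) out front.
    return blocks.transpose(1, 3, 0, 2)

# Toy RAW: 3x3-pixel clusters on a 30x30 "sensor" -> nine 10x10 views.
raw = np.arange(30 * 30, dtype=float).reshape(30, 30)
views = subaperture_views(raw, cluster=3)
print(views.shape)  # (3, 3, 10, 10)
```

With real 10x10 clusters, the same indexing yields the hundred slightly shifted photos described above.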

The bigger the cluster, the more images we have, but the lower the resolution. In a world of smartphones with 41-megapixel sensors everything has a limit, although we can sacrifice a little resolution. We have to keep the balance.

Alright, we've got a plenoptic camera. What can we do with it?

Fair refocusing

The feature everyone was buzzing about in articles covering Lytro was the ability to adjust focus after the shot was taken. "Fair" means we don't use any deblurring algorithms; we use only the available pixels, picking or averaging them in the right order.

A RAW photo taken with a plenoptic camera looks weird. To get the usual sharp JPEG out of it, you have to assemble it first. The result will vary depending on how we select the pixels from the RAW.

The farther a cluster is from the point where the original ray struck, the more defocused that ray is. Because optics. To get an image with shifted focus, we only need to choose pixels at the desired distance from the original, either closer or farther.

The picture should be read from right to left as we are sort of restoring the image, knowing the pixels on the sensor. We get a sharp original image on top, and below we calculate what was behind it. That is, we shift the focus computationally.
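The pixel picking can be made concrete with a toy "shift-and-sum" refocus over the sub-aperture views. All names here are mine, and np.roll is a crude stand-in for the sub-pixel interpolation a real implementation would use.

```python
import numpy as np

def refocus(views, alpha):
    """Shift-and-sum refocusing.

    views: (U, V, H, W) sub-aperture views from one plenoptic RAW.
    alpha: refocus parameter; 0 keeps the captured focal plane,
    other values move it by shifting each view in proportion to its
    angular offset from the central view before averaging.
    """
    U, V, H, W = views.shape
    cu, cv = (U - 1) / 2, (V - 1) / 2
    out = np.zeros((H, W))
    for u in range(U):
        for v in range(V):
            du = int(round(alpha * (u - cu)))
            dv = int(round(alpha * (v - cv)))
            # Crude integer shift instead of sub-pixel interpolation.
            out += np.roll(views[u, v], (du, dv), axis=(0, 1))
    return out / (U * V)

# With alpha = 0 this degenerates to a plain average of all views.
views = np.random.default_rng(1).random((3, 3, 8, 8))
original_plane = refocus(views, alpha=0.0)
```

Changing alpha slides the synthetic focal plane back and forth, exactly the "choose pixels at another distance" trick described above.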

The process of shifting the focus forward is a bit more complicated, since we have fewer pixels in those parts of the clusters. At first, Lytro's developers didn't even want to let the user focus manually because of this; the camera decided by itself in software. Users didn't like that, so the feature was added in later versions as a "creative mode", but with a very limited refocus range for exactly that reason.

Depth Map and 3D using a single lens

One of the simplest operations in plenoptics is getting a depth map. You just need to take two different images and calculate how the objects shift between them. The bigger the shift, the closer the object is to the camera.
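A crude sketch of that calculation, using plain patch matching between two views. A real pipeline works at sub-pixel precision and regularizes the result; every name here is made up for illustration.

```python
import numpy as np

def disparity_map(left, right, patch=3, max_d=4):
    """Toy depth map from two horizontally shifted views.

    For each pixel, compare a small patch of `left` with patches of
    `right` shifted by 0..max_d pixels and keep the best match (sum
    of absolute differences). Bigger disparity = closer object.
    """
    H, W = left.shape
    r = patch // 2
    depth = np.zeros((H, W), dtype=int)
    for y in range(r, H - r):
        for x in range(r + max_d, W - r):
            ref = left[y - r:y + r + 1, x - r:x + r + 1]
            errs = [np.abs(ref - right[y - r:y + r + 1,
                                       x - d - r:x - d + r + 1]).sum()
                    for d in range(max_d + 1)]
            depth[y, x] = int(np.argmin(errs))
    return depth

# Fake stereo pair: the second view is the first shifted by 2 pixels.
rng = np.random.default_rng(0)
left = rng.random((20, 20))
right = np.roll(left, -2, axis=1)
depth = disparity_map(left, right)
```

On this fake pair, the recovered disparity is a constant 2 away from the image borders, matching the shift we applied.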

Google recently bought and shut down Lytro, but used its technology for VR and... the Pixel's camera. Starting with the Pixel 2, the camera became "a little bit" plenoptic, though with only two pixels per cluster. As a result, Google doesn't need to install a second camera like all the other cool kids; it can calculate a depth map from a single photo.

Images seen by the top and bottom sub-pixels of the Google Pixel camera. The right one is animated for clarity. Source: Google
The depth map is additionally processed with neural networks to make the background blur more even. Source: Google

The depth map is built from two shots shifted by one sub-pixel. This is enough to calculate a rudimentary depth map and separate the foreground from the background, blurring the latter with some fashionable bokeh. The result of this layering is then smoothed and "improved" by neural networks that are trained to improve depth maps (rather than to estimate depth from scratch, as many people think).

The trick is that we got plenoptics in smartphones almost for free. We already put lenses on these tiny sensors to increase the luminous flux at least somehow. Some Google patents suggest that future Pixel phones may go further and cover four photodiodes with one lens.

Slicing layers and objects

You don't see your nose because your brain combines a final image from both of your eyes. Close one eye, and you will see a huge Egyptian pyramid at the edge.

The same effect can be achieved with a plenoptic camera. By assembling shifted images from pixels of different clusters, we can look at the object from several points, just as our eyes do. That gives us two cool opportunities. First, we can estimate the approximate distance to objects, which lets us easily separate the foreground from the background, as in real life. Second, if an object is small, we can remove it from the photo completely, since we can effectively look around it. Like a nose. Just clone it out. Optically, for real, with no Photoshop.

Using this, we can cut out trees between the camera and the object or remove the falling confetti, as in the video below.

"Optical" stabilization with no optics

From a plenoptic RAW, you can make a hundred photos, each shifted by a few pixels, across the entire sensor area. Accordingly, we have a tube the diameter of the lens within which we can move the shooting point freely, thereby offsetting the shake of the image.

Technically, stabilization is still optical, because we don't have to calculate anything; we just select pixels in the right places. On the other hand, any plenoptic camera sacrifices megapixels for its plenoptic capabilities, and any digital stabilizer works the same way. It's nice to have as a bonus, but using plenoptics for stabilization alone is costly.
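The "select, don't compute" idea can be sketched in a few lines. The gyroscope input and the function name are invented for illustration; the point is that stabilization reduces to indexing, not math.

```python
import numpy as np

def stabilized_frame(views, shake_u, shake_v):
    """Stabilize by picking a different sub-aperture view.

    views: (U, V, H, W) sub-aperture views of one plenoptic frame.
    shake_u, shake_v: measured camera shake in view-grid steps
    (e.g. from a gyroscope). We move the virtual viewpoint against
    the shake; nothing is recalculated, we only choose other pixels.
    """
    U, V = views.shape[:2]
    cu, cv = U // 2, V // 2
    # Clamp to the 'tube' of viewpoints the lens diameter allows.
    u = int(np.clip(cu - round(shake_u), 0, U - 1))
    v = int(np.clip(cv - round(shake_v), 0, V - 1))
    return views[u, v]

views = np.arange(3 * 3 * 4 * 4, dtype=float).reshape(3, 3, 4, 4)
steady = stabilized_frame(views, shake_u=1, shake_v=0)
```

The clamping is where the trade-off shows: once the shake exceeds the width of the viewpoint tube, there are no more pixels left to select.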

The larger the sensor and lens, the bigger window for movement. The more camera capabilities, the more ozone holes from supplying this circus with electricity and cooling. Yeah, technology!

Fighting the Bayer filter

The Bayer filter is still necessary even with a plenoptic camera; we haven't come up with any other way of getting a color digital image. But with a plenoptic RAW, we can average the color not only over a group of nearby pixels, as in classic demosaicing, but also over dozens of copies of it in neighboring clusters.

Some articles call this "computable super-resolution", but I would question that. In fact, we first reduce the real resolution of the sensor dozens of times, only to proudly restore it again. You'd have to try hard to sell that to someone.

But technically it's still more interesting than shaking the sensor in a pixel shifting spasm.

Computational aperture (bokeh)

Those who like to shoot bokeh hearts will be thrilled. Since we know how to control refocus, we can go further and take some pixels from the defocused image and others from the normal one. That way we can get an aperture of any shape. Yay! (No)
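One way to sketch the idea: treat the grid of sub-aperture views as points on the aperture, and average only the views inside whatever shape you like. A plus sign stands in for the heart here, and all names are illustrative.

```python
import numpy as np

def synthetic_aperture(views, mask):
    """Average only the sub-aperture views the mask lets through.

    views: (U, V, H, W); each view (u, v) samples one point on the
    main-lens aperture. mask: (U, V) boolean array drawing the
    aperture shape. Out-of-focus highlights take on that shape.
    """
    return views[mask].mean(axis=0)  # (N, H, W) -> (H, W)

views = np.arange(5 * 5 * 2 * 2, dtype=float).reshape(5, 5, 2, 2)
mask = np.zeros((5, 5), dtype=bool)
mask[2, :] = True   # horizontal bar of the plus
mask[:, 2] = True   # vertical bar
shaped = synthetic_aperture(views, mask)
```

In-focus regions look the same under any mask; only the blur circles of defocused highlights inherit the drawn shape.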

Many more tricks for video

So as not to stray too far from the photo topic, everyone who's interested should check out the links above and below. They contain about half a dozen other interesting applications of plenoptic cameras.

Light Field: More than a photo, less than VR

Usually, the explanation of plenoptics starts with light fields. And yes, from a science perspective, a plenoptic camera captures a light field, not just a photo. The name comes from the Latin plenus, "full": it collects all the information about the rays of light. Just like a plenary session of Parliament.

Let's get to the bottom of this to understand what a light field is and why we need it.

Traditional photos are two-dimensional. When a ray hits the sensor, the corresponding pixel in the photo simply records its intensity. The camera doesn't care where the ray came from, whether it fell in accidentally from the side or was reflected off another object. The photo captures only the point where the ray intersects the surface of the sensor. So it's kinda 2D.

Light field images are similar, but with a new component — the origin and angle of each ray. The microlens array in front of the sensor is calibrated such that each lens samples a certain portion of the aperture of the main lens, and each pixel behind each lens samples a certain set of ray angles. And since light rays emanating from an object with different angles fall across different pixels on a light field camera's sensor, you can build an understanding of all the different incoming angles of light rays from this object. This means the camera effectively captures the ray vectors in 3D space. Like calculating the lighting of a video game, but the other way around — we're trying to catch the scene, not create it. The light field is the set of all the light rays in our scene — capturing both the intensity and angular information about each ray.

There are a lot of mathematical models of light fields. Here's one of the most representative.
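For reference, a common way to write it down (my notation, following the usual two-plane parameterization) indexes every ray by the point (u, v) where it crosses the main-lens plane and the point (s, t) where it lands on the sensor plane, so the whole light field is a 4D function. An ordinary photo is what remains after the angular axes are integrated away over the aperture:

```latex
L = L(u, v, s, t)
\qquad
E(s, t) \;\propto\; \iint_{\text{aperture}} L(u, v, s, t)\, \mathrm{d}u\, \mathrm{d}v
```

Refocusing and viewpoint shifts are then just different slices and shears of this 4D function taken before the integral.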

The light field is essentially a visual model of the space around the camera, and we can mathematically compute any photo within that space. Point of view, depth of field, aperture: all of these are computable. However, you can only reposition the point of view so much, as determined by the entrance pupil of the main lens. The freedom with which you can change the viewpoint depends on the breadth of perspectives you've captured, which is necessarily limited.

I love to draw an analogy with a city here. Photography is like your favorite path from your home to the bar, which you remember by heart, while the light field is a map of the whole town. Using the map, you can calculate any route from point A to point B. In the same way, knowing the light field, we can calculate any photo.

For an ordinary photo it's overkill, I agree. But here comes VR, where light fields are one of the most promising areas of development.

Having a light field model of an object or a room allows you to view it from multiple perspectives, with motion parallax and other depth cues such as realistic changes in texture and lighting as you move your head. You can even travel through the space, albeit to a limited degree. It feels like virtual reality, but there's no longer any need to build a 3D model of the room. We "simply" capture all the rays inside it and compute many different pictures from within that volume. Simply, yeah. That's what we're struggling with.

Vasily Zubarev is a Berlin-based Python developer and a hobbyist photographer and blogger. To see more of his work, visit his website or follow him on Instagram and Twitter.