Are high-end camera processors better than PC processors?

If a minimalist Linux machine was built on a fairly recent laptop, and the fastest available image processing application was run, I don't think it will process anything close to 20 images per second.
Much depends on how the GPU is used. Even without tight optimization for the GPU onboard, FastRawViewer processes 10 to 20 Nikon Z 9 raw images per second on a fairly recent MacBook Pro, and that's with raw histogram calculation and 4K display of converted images.
And at this point, the PCI Express bandwidth comes into play - the more you can do on-GPU without transferring between GPU and CPU, the better. CPU-GPU transfers are quite time consuming.
Agreed about CPUs. PCIe is doable, for example GPUDirect Storage.
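A rough way to put numbers on that transfer cost: the sketch below (a hedged example, assuming PyTorch and a CUDA-capable GPU, with a made-up stand-in frame of roughly 45 MP) times a host-to-device copy of one raw-sized buffer against a simple operation on data already resident on the GPU. The absolute numbers depend entirely on the machine; the point is only that every round trip over PCIe has a real price.

```python
# Minimal sketch (assumes PyTorch and a CUDA GPU): compare the cost of pushing
# a raw-frame-sized buffer over PCIe with the cost of doing arithmetic on data
# that already lives on the GPU.
import time
import torch

frame = (torch.rand(5504, 8256) * 4095).to(torch.int16)   # ~45 MP stand-in raw frame

def timed(fn, repeats=20):
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / repeats

copy_s = timed(lambda: frame.to("cuda"))                      # host -> device over PCIe
gpu = frame.to("cuda").float()
math_s = timed(lambda: (gpu * 0.5 + 64.0).clamp_(0, 4095.0))  # stays on the GPU

print(f"host->device copy : {copy_s * 1e3:.2f} ms per frame")
print(f"on-GPU arithmetic : {math_s * 1e3:.2f} ms per frame")
```

On a typical PCIe link the copy alone takes a few milliseconds per frame, which adds up quickly at 10-20 frames per second if intermediate buffers keep bouncing back and forth.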
 
Seeing how many images from stacked sensors are being processed per second in modern cameras, I am guessing that these processors are more powerful for graphics processing than the mainstream PC processors. Is that accurate?

The buffer is generally the limitation; otherwise, the stacked-sensor mirrorless cameras are capturing, processing, and saving RAW + JPEG at an insane speed. Sony's A1, the latest Fuji X-H2S, Canon's R3, Nikon's Z9, the Olympus OM-1, etc., come to mind.

They also run the camera's OS (whatever that is) and support Wi-Fi, HDMI, Ethernet, Bluetooth, external recording, and a variety of video quality settings.

If a minimalist Linux machine was built on a fairly recent laptop, and the fastest available image processing application was run, I don't think it will process anything close to 20 images per second.

Am I missing something?

Thanks
It is hard to make an apples-to-apples comparison. Cameras and smartphones tend to contain a system-on-a-chip that includes generic compute cores (e.g. ARM) and specialized hardware (an image signal processor, a machine-learning accelerator). For tasks that are known at the time of silicon manufacture, these specialized compute devices can be highly efficient (in terms of battery, heat, area and unit cost).

A PC tends to have 4-16 fairly generic and fast cores, and fewer specialized components. Granted, a dedicated GPU may do semi-specialized tasks at great speed, but PCs typically don't include an ISP and, until recently, no ML accelerator.

If you need just the kind of processing that your camera/smartphone does today, its specialized hardware may be able to do that with less battery drain, less heat, and (in some ways) at lower cost than a generic computer doing the exact same task. But if you want to evolve that algorithm over the years, or do more complex work, the PC may scale somewhat "proportionately" (i.e. an algorithm that contains twice as many multiplies may run on the order of half as fast), while the same increase in complexity in your camera may be practically impossible.
Yup. The big limitation here is storage bandwidth: feeding the entire stack of images from a burst requires a very high-bandwidth interface.

You just don't see these kinds of interfaces going from one device to another.

For example, Jim Kasson measured a readout on one of the earlier A7R bodies (the R2 or R3, I can't remember which) that was consistent with over 1 GPixel/second at 12+ bits/pixel (I don't remember the exact figure). That's 12 Gbit/s, higher than many USB interfaces can hit, and it came from the aging BIONZ X in cameras that only had a USB 2.0 interface externally. Note that the moment you hit the scaling/demosaicing engine you run into a 500-600 MPixel/s bottleneck in the BIONZ X, and then even narrower bottlenecks by the time you reach storage. Newer cameras have been clocked at much higher transfer rates.

If you could get that raw sensor bandwidth straight to a PC, you'd have some interesting options open - but that's the realm of very expensive dedicated industrial cameras that have the sensor interfacing to a minimal bridge chip and then high-lane-count PCI-Express.
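For a sense of scale, here is the same arithmetic written out (using the round numbers quoted above rather than measured values):

```python
# Back-of-the-envelope readout bandwidth from the round numbers quoted above.
pixels_per_second = 1e9      # ~1 GPixel/s sensor readout
bits_per_pixel = 12

gbit_s = pixels_per_second * bits_per_pixel / 1e9
print(f"sensor readout: {gbit_s:.0f} Gbit/s (~{gbit_s / 8:.1f} GB/s)")
print("for comparison: USB 2.0 = 0.48 Gbit/s, USB 3.2 Gen 1 = 5 Gbit/s")
```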
By computer data-pipe standards that's not all that much bandwidth: SSDs over PCIe are comfortably in the multiple-GB/s range, and you can get 40 Gbit/s over copper Ethernet cable at up to 30 m.

You can also see this in the RAM bandwidth available to the system: the Z9 is only around 30 GB/s (the theoretical peak from 3.7 GT/s RAM, 32 bits per package, and 2 packages), while 3-4 years ago typical PCs had similar-speed RAM plumbed to a bus twice as wide. If the workload scales to GPU processing, the data deluge gets even more ridiculous: >1 TB/s is common on high-end GPUs.

It wouldn't surprise me if a stacked sensor just showed up as a PCIe device. It's a widely used standard for ultra-high bandwidth over a PCB, and if the ADCs are in the stack then you're just shuffling bits around; no need to reinvent the wheel when Sony Semiconductor probably has a standard PCIe interface design ready to go for a reasonable licensing fee.
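The ~30 GB/s figure follows directly from the quoted parameters; a quick check:

```python
# Theoretical peak DRAM bandwidth from the parameters quoted above.
transfer_rate = 3.7e9    # transfers per second (3.7 GT/s)
bus_width_bits = 32      # per package
packages = 2

gb_s = transfer_rate * bus_width_bits * packages / 8 / 1e9
print(f"peak DRAM bandwidth: {gb_s:.1f} GB/s")   # ~29.6 GB/s
```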
 
Hi Iliah, if you could share, how much of the imaging pipeline were you able to offload onto the GPU inside FRV? How difficult was the effort for each of the major elements?
We have been told for a decade or more that GPUs offer tremendous speedups over CPUs.

My (admittedly limited) experience has been that for _some_ tasks, GPUs are vastly more efficient than CPUs. If you want to perform a floating-point multiply of large matrices, gigantic FFTs, or other "HPC/scientific-compute single-precision float tasks that can be somewhat trivially parallelized", then GPUs are the obvious choice. Further, if you do such standard operations, you can probably use a pre-built library that someone else has hand-tuned for a particular GPU.
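As a concrete illustration of the "use someone else's hand-tuned library" point, NumPy calls like the ones below dispatch to an optimized BLAS/FFT backend, and libraries such as CuPy expose a near-identical interface backed by GPU kernels. The sizes are arbitrary; this is only a sketch.

```python
# Sketch: lean on pre-built, hand-tuned kernels rather than writing your own.
# NumPy hands these calls to an optimized BLAS / FFT implementation.
import numpy as np

a = np.random.rand(2048, 2048).astype(np.float32)
b = np.random.rand(2048, 2048).astype(np.float32)

product = a @ b                 # single-precision matrix multiply (GEMM)
spectrum = np.fft.rfft2(a)      # large 2-D FFT

print(product.shape, spectrum.shape)
```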

However, if your code is less inherently parallel, or if it contains integer math (or double-precision float), branching, "bit fiddling" and the like, GPUs may not be such an obvious choice.

Any software (commercial or open) needs to trade developer effort/willingness against functionality and speed for one or many users. In my experience, given an optimization expert and possibly an algorithm/domain expert, most code can be made to run 2x to 10x faster on a given set of hardware with little or no reduction in quality. The problem is, if that costs 3 or 12 months of full-time effort, it might not be worth it. If the number of affected users is small, they could just buy new hardware and be done with it. If the number of users is large, there is still the question of whether the speedup is worth enough to them.

-h
Agreed. It applies more generally to parallelizing algorithms and is not specific to GPUs (i.e., multi-threading and OpenMP). It takes a lot of expertise, effort, and time, which most companies think is better spent on coding new features.
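For what it's worth, the coarse-grained flavour of this (one image per worker, no shared state) is often the cheapest win. A minimal Python sketch, where `develop_one` is a hypothetical placeholder for whatever per-file raw processing is actually done:

```python
# Minimal sketch of coarse-grained parallelism: one raw file per worker process.
# develop_one() is a hypothetical placeholder, not a real API.
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

def develop_one(path: Path) -> str:
    # ...decode, demosaic, tone-map and write the output here...
    return f"processed {path.name}"

if __name__ == "__main__":
    files = sorted(Path("raw_burst").glob("*.NEF"))
    with ProcessPoolExecutor() as pool:        # defaults to one worker per CPU core
        for result in pool.map(develop_one, files):
            print(result)
```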
Looking at the relic versions of LibRaw many are using: new features for old cameras ;)
 
If a minimalist Linux machine was built on a fairly recent laptop, and the fastest available image processing application was run, I don't think it will process anything close to 20 images per second.
Much depends on how the GPU is used. Even without tight optimization for the GPU onboard, FastRawViewer processes 10 to 20 Nikon Z 9 raw images per second on a fairly recent MacBook Pro, and that's with raw histogram calculation and 4K display of converted images.
If you want to consider a clean ingest-to-render use of a GPU in a raw processor, take a look at vkdt:


Written by the original author of darktable, it loads the raw image straight into a GPU buffer, conducts all operations in-situ with shaders, and displaying is simply a matter of designating that buffer for display. I just installed a GeForce 1050 GPU specifically to run vkdt, and believe me, it does not disappoint...
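The principle being described (upload once, keep every intermediate stage on the GPU, and only ship a small result back) can be sketched outside Vulkan as well. Below is a rough PyTorch analogue, not vkdt's actual shader graph; the black level, gamma, and sizes are invented for illustration.

```python
# Rough sketch of a GPU-resident pipeline: one full-size upload, all stages on
# the device, and only a small preview ever crosses the bus again. (vkdt goes
# further and hands the GPU buffer straight to the display.)
import torch
import torch.nn.functional as F

raw = (torch.rand(1, 1, 5504, 8256) * 4095).to(torch.int16)   # stand-in decoded raw data

x = raw.to("cuda").float()                        # the only full-size PCIe transfer
x = (x - 64.0).clamp_(min=0) / (4095.0 - 64.0)    # black level + normalisation (made-up values)
x = x.pow(1.0 / 2.2)                              # display gamma
preview = F.avg_pool2d(x, kernel_size=4)          # on-GPU downscale to a screen-sized image
shown = preview.cpu()                             # only the small preview comes back
print(shown.shape)
```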
 
Agreed. It applies more generally to parallelizing algorithms and is not specific to GPUs (i.e., multi-threading and OpenMP). It takes a lot of expertise, effort, and time, which most companies think is better spent on coding new features.
In the commercial space, I think it may generally be easier to find developers who are interested in doing web development than ones who eat AVX-512 instructions for breakfast. In the open-source world, things may be the other way around.

There is also the point that a domain or algorithms expert can invest in understanding and improving a specific function in a more general way, while a purely optimization-focused person would typically take what is already present and make it 2-10x faster for a more or less restricted set of platforms (e.g. Intel CPUs with AVX-512 units, Nvidia 2021-era GPUs, etc.).

The most potential seems to lie in understanding the problem to solve, the current state-of-the-art algorithms for solving it, and what a given platform can do for the relevant number-crunching. Then you can do funky stuff like dropping precision to 8 bits in selected routines, because it barely affects visible artifacts, because it lets you process 2x or 4x as many elements per unit time using the platform's SIMD instructions, because those routines turn out to be a significant fraction of the total compute time, and because that compute time is something your users are actually frustrated by.
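A crude way to see the "narrower elements, more throughput" effect without leaving Python (NumPy's element-wise kernels are SIMD-vectorized internally; the exact ratio varies by CPU and operation, so treat it purely as an illustration):

```python
# Crude illustration: the same element count as uint8 vs float32.
# Narrower elements mean more lanes per SIMD register and less memory traffic.
import time
import numpy as np

n = 50_000_000
a8 = np.random.randint(0, 128, n, dtype=np.uint8)
b8 = np.random.randint(0, 128, n, dtype=np.uint8)
a32 = np.random.rand(n).astype(np.float32)
b32 = np.random.rand(n).astype(np.float32)

def timed(fn, repeats=10):
    t0 = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - t0) / repeats

print(f"uint8 add  : {timed(lambda: a8 + b8) * 1e3:.1f} ms")
print(f"float32 add: {timed(lambda: a32 + b32) * 1e3:.1f} ms")
```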

-h
 
It wouldn't surprise me if a stacked sensor just showed up as a PCIe device. It's a widely used standard for ultra-high bandwidth over a PCB, and if the ADCs are in the stack then you're just shuffling bits around; no need to reinvent the wheel when Sony Semiconductor probably has a standard PCIe interface design ready to go for a reasonable licensing fee.
Image sensors typically use proprietary LVDS links to offload their data.
 
Image sensors typically use proprietary LVDS links to offload their data.
Sony has been using SLVS-EC lately.

To get it into a computer you'd have to bridge to a few lanes of PCIe, and then run an external PCIe cable (possible, but not common) from that bridge into a PCI-Express or M.2 slot.
 
it loads the raw image straight into a GPU buffer, conducts all operations in-situ with shaders, and displaying is simply a matter of designating that buffer for display.
That's about what FastRawViewer has been doing for many years now ;)
Interesting. I tried FRV a couple of years ago; not a GPU in the house back then, but it was still pretty fast without one. Does it detect the GPU's presence and shift gears?
 
Demosaicking on GPU is switched on automatically on the first launch of FastRawViewer if all of the following are true:
• Fast GPU (Intel Iris, Nvidia GTX, AMD/ATI R7/R9/RX/Vega)
• CPU supports the instruction for converting data to 16-bit floating point (Intel Ivy Bridge and newer)
• Video mode is set to OpenGL or DirectX 11 (those will be set automatically on the first launch if the fast GPU is detected)

You can control the gears manually in Preferences -> GPU processing.
 
