Why are RAW files so big?

Actually, you can consider a RAW file to be some form of lossy
compression. There are even different algorithms to reproduce the
original image. One is called Canon Digital Photo Professional,
another Adobe Camera RAW, ...
I know you're kidding, but that makes no sense at all. There is no original image represented by the raw data. The image is a creation of the converter. If you're thinking of the scene as an original image, you're missing the essence of photography, in my opinion.

j
 
In the latest copy of American Photo Mag they have a RAW vs JPEG
article that is extremely informative. I'm by no means an expert,
but as I understand it, the term compression is a bit of a
misnomer. The camera "compresses" files by comparing pixels on a
bevy of criteria, and tossing ones that are similar or identical to
others around them, thereby eliminating redundancy.
That's a really poor description of how lossless (or lossy for that matter) compression works.
Well if
there's a way to do that without "losing" info, I'd love to see it.
It could be done that way losslessly, but there are much better approaches. Among the oldest of digital image coding techniques, DPCM, combined with an entropy coder like Huffman, works pretty well as a lossless image compressor.
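
For the curious, here's a minimal sketch of that idea in Python -- just DPCM differences fed into a Huffman code-length calculation, not any camera's actual format:

```python
# Minimal sketch of DPCM + Huffman as a lossless image coder (illustrative only).
import heapq
from collections import Counter

def dpcm_encode(row):
    """Replace each sample by its difference from the previous one."""
    prev, out = 0, []
    for v in row:
        out.append(v - prev)
        prev = v
    return out

def dpcm_decode(diffs):
    prev, out = 0, []
    for d in diffs:
        prev += d
        out.append(prev)
    return out

def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over the symbols."""
    freq = Counter(symbols)
    if len(freq) == 1:                       # degenerate case: only one distinct symbol
        return {next(iter(freq)): 1}
    heap = [(n, i, [s]) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freq}
    tie = len(heap)
    while len(heap) > 1:
        n1, _, group1 = heapq.heappop(heap)
        n2, _, group2 = heapq.heappop(heap)
        for s in group1 + group2:            # every merge adds one bit to each member
            lengths[s] += 1
        heapq.heappush(heap, (n1 + n2, tie, group1 + group2))
        tie += 1
    return lengths

row = [100, 101, 103, 103, 102, 180, 181, 181, 180, 179]   # one row of 8-bit samples
diffs = dpcm_encode(row)
lengths = huffman_code_lengths(diffs)
bits = sum(lengths[d] for d in diffs)
print(diffs)                                 # mostly small differences near 0
print(bits, "bits coded vs", len(row) * 8, "bits stored raw")
assert dpcm_decode(diffs) == row             # lossless round trip
```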

j
 
Erik Magnuson wrote:
[snip]
It's likely that a significant chunk of the size diff is due to
the different preview image sizes. I would bet that the remaining
differences are also due to different amounts/compression of
metadata (e.g. maker note, etc.) I don't think that the actual raw
data compression differs very much based on what I've seen about
lossless jpeg.
Perhaps. I just don't know the reasons.

We are told that DNG conversion preserves all the CR2 EXIF Makernote (into DNGPrivateData), and the DNG conversion itself adds some metadata.

But it appears that Canon have one of the better compression schemes compared to other camera makers.
 
BZIP2 is a great compressor, but even among general-purpose compressors it isn't really state of the art any more. But the real key is that ZLIB, BZIP2, LZMA, etc. are all general-purpose compressors -- they are not application specific. In this case there is an image to be compressed, which makes a huge difference: correlation between color components, non-8-bit pixels, spatial relations, etc.

JBIG on halftoned data is a great example of this: because it understands dither patterns, it can give a lot better compression than most other methods on that type of data. It's quite capable of beating BZIP2 on halftoned or FAX data, because it was made to do that one task very well.

In a RAW file, we're not even storing the full color data like one would with a normal image -- we're storing the values of the Bayer-filtered pixels. So one could use this information as well, which of course bzip etc. aren't going to know.
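
Just to put rough numbers on that, here's a back-of-the-envelope sketch. The RGGB layout and the dimensions below are assumptions (roughly 1Ds-Mark-II-like), purely for illustration:

```python
# Sketch: a Bayer (RGGB) mosaic records one color sample per photosite,
# so a raw frame holds a third of the samples of the demosaiced RGB image.
# Dimensions and bit depth are illustrative, not exact camera specs.
def bayer_channel(row, col):
    """Which color a photosite records in an assumed RGGB layout."""
    if row % 2 == 0:
        return "R" if col % 2 == 0 else "G"
    return "G" if col % 2 == 0 else "B"

print([[bayer_channel(r, c) for c in range(4)] for r in range(2)])  # RGRG / GBGB

width, height, bits_per_sample = 4992, 3328, 12
raw_bits = width * height * bits_per_sample        # one sample per pixel, uncompressed
rgb16_bits = width * height * 3 * 16               # 16-bit RGB TIFF after conversion
print(round(raw_bits / 8 / 1e6), "MB of raw samples before any compression")
print(round(rgb16_bits / 8 / 1e6), "MB as a 16-bit RGB TIFF")
```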

But the poster did have a point, that at least the output of the CR2 file doesn't have any really obvious statistically redundant data. It isn't like it's storing raw pixels -- it has done a decent job of lossless compression.
 
Most converters will apply certain settings and curves when a file
is identified as a CR2 or NEF or whatnot, and even more specifically
based on what camera model produced it. Do the converters pull the
camera model out of Exif for this, or partly from the file
extension name itself?
ACR uses primarily (probably only) the EXIF data to choose its calibration. I tried renaming some files, changing the extensions, and as long as I chose an existing extension, the conversion still worked! (For example, I renamed a Canon raw as a PEF.)

But if I chose a new extension, such as CR3, it didn't recognise the file as something that ACR could handle.

(You can try this yourself with your converter to see what happens. They may not all be the same).
 
I would recommend that if you really care, get the source for dcraw.c from:

http://www.cybercom.net/~dcoffin/dcraw/

where you can see the source code to decode the RAW format from 197 different cameras, all in one relatively small C file. This includes Canon's CRW, CR2, Adobe DNG, Nikon NEF, and others.

Camera makers offer JPEG files for general use. This will satisfy many people, and is typically small because of the lossy compression used (and JPEG is actually quite good at what it does). More sophisticated cameras offer more choices in the amount of compression used. For those people who want more than JPEG, there is RAW format, which pretty much means the verbatim output from the sensor, which implies lossless compression. I don't think there is a big demand for intermediate forms between high-quality JPEG and lossless raw data (I'm not saying I don't want it or that there isn't any demand, but given RAW is already there, there isn't a big enough need for, say, lossy 12-bit to warrant adding it).

If cameras supported JPEG2000 creatively we could see more of a spectrum from the 8-bit lossy like JPEG (but with higher quality for the same size) up through 10-16-bit lossless data, all using one algorithm. But JPEG2000 seems to still be impractical for real-time use at the bandwidths and costs of this application.

Some manufacturers are providing very poor solutions, like the above-mentioned Sony that doesn't even compress the output at all for some reason. It looks from dcraw that Nikon's compressed RAW format uses LJPEG. Canon used to use their own quite simple, but reasonably effective, method, then switched to LJPEG. For their DNG, Adobe also chose LJPEG (lossless JPEG -- a completely different and unrelated algorithm from the lossy JPEG most people know of).

Lossless JPEG has the big advantage of simplicity versus some other options (JPEG2000, CALIC, etc.). It does surprisingly well, and I don't see any other practical algorithms that would be a big step forward (but I don't make cameras). Some things could be done to improve it a little, but probably not worth making something non-standard.

As a user of a few different Canon cameras, I would like to say that I love having the decent sized JPEG preview image in the CR2 files. With the old .TIF raw files like the 1Ds makes, I always do RAW+JPEG because the thumbnail is unusably small.
 
Hi

There seems to be some confusion.

Lossless compression simply means that data can be compressed and decompressed and the original restored exactly. Lossless compression works because there is 'redundant' information in the data. In real life you can compress 'January 30' to 'Jan 30' with no loss. Note you can't compress to 'Jn 30' because that loses so much information that two reconstructions become possible: 'January' and 'June'.

A typical lossless encoding technique is run-length encoding. If you have an image and there is a run (sequence or string) of pixels with the same value, you can replace them by (value, runlength). This works well because areas of sky etc. have the same color. You can always reconstruct the original from such data.
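
A toy version of that (value, runlength) idea in Python, just to make it concrete:

```python
# Minimal run-length coder: store (value, run_length) pairs instead of raw pixels.
def rle_encode(pixels):
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1                 # extend the current run
        else:
            runs.append([p, 1])              # start a new run
    return runs

def rle_decode(runs):
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out

sky = [200] * 12 + [201] * 5 + [90, 91, 90]
encoded = rle_encode(sky)
print(encoded)                        # [[200, 12], [201, 5], [90, 1], [91, 1], [90, 1]]
assert rle_decode(encoded) == sky     # exact reconstruction, i.e. lossless
```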

Another technique is to look for sequences of data that appear over and over again and replace them by a short symbol. ZIP does this - in English you get 'the' everywhere and it can be replaced by a short code.
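
You can see this effect with Python's built-in zlib module, which uses the same DEFLATE algorithm that ZIP uses:

```python
# DEFLATE (zlib/ZIP) replaces repeated byte sequences with short back-references.
import zlib

text = b"the cat sat on the mat and the dog saw the cat " * 50
packed = zlib.compress(text, 9)
print(len(text), "bytes ->", len(packed), "bytes")
assert zlib.decompress(packed) == text        # lossless
```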

Lossless encoding throws away information and you can't get back to the original. Lossless encoding exploits human senses. In MP3 music you can throw away weak sounds when you have a dominant loud sound because your ear will not notice it. Jpeg exploits similar behavior in the eye (i.e., brain).

Jpeg works by converting pixel images into an array of cosine coefficients. These coefficients reflect the amount of 'rate of change' or fine detail in the pictures. When you choose a jpeg setting, you are actually telling the system which coefficients to throw away. Once gone, they are gone.
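
Here is a stripped-down sketch of that step. Real JPEG works on 8x8 blocks in two dimensions and quantises the coefficients rather than simply zeroing them, so treat this as an illustration of the principle only:

```python
# One-dimensional DCT demo: transform a row of pixels, drop the small
# high-frequency coefficients, transform back. The dropped detail is gone for good.
import math

def dct(x):                                   # unnormalised DCT-II
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

def idct(X):                                  # matching inverse (DCT-III form)
    N = len(X)
    return [X[0] / N + 2 / N * sum(X[k] * math.cos(math.pi / N * (n + 0.5) * k)
                                   for k in range(1, N))
            for n in range(N)]

block = [52, 55, 61, 66, 70, 61, 64, 73]      # one row of pixel values
coeffs = dct(block)
kept = [c if abs(c) > 20 else 0 for c in coeffs]   # the "quality setting": drop small terms
approx = idct(kept)
print([round(v) for v in approx])             # close to block, but not identical
```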

Some have suggested that lossless compression can always be improved and that a way of doing it is just waiting to be found. Sadly, not so. For example, you cannot compress an image made up of random dots. That's an absolute restriction (if someone does manage it, I will put it alongside my perpetual motion machine).

Perhaps in the future compression will work differently. Instead of sending, say, the pixels (image) that describe someone's hair, they will simply say (hair, color, extent, texture, in_the_style_of) and the actual hair will be drawn locally in the software that generates the image.

Cheers

Alan
 
AlanClements wrote:
[snip]
Perhaps in the future compression will work differently. Instead of
sending, say, the pixels (image) that describe someone's hair, they
will simply say (hair, color, extent, texture, in_the_style_of) and
the actual hair will be drawn locally in the software that generates
the image.
Or say "Eiffel Tower" and the software will fetch a stock image.
 
ACR uses primarily (probably only) the EXIF data to choose its
calibration.
Well you have the right idea, but the wrong details. You can't use Exif because it's not stored in the same place for every raw format! So you can't even find the Exif until you have at least some idea which camera.

If it works like dcraw, it first looks for a characteristic pattern in the first few dozen bytes. That's usually enough to tell it where to look for more details. If the format is a TIFF variant, it can then parse out "Make" and "Model" tags. For others, it has to look at format-specific encodings.
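
Something along those lines, heavily simplified (real tools check many more signatures; the CR2 check below follows the documented TIFF-based layout of that format, and the file name is just a placeholder):

```python
# Simplified dcraw-style sniffing: look at the first few bytes, then decide
# how to parse the rest of the file.
def sniff(path):
    with open(path, "rb") as f:
        head = f.read(16)
    if head[:4] in (b"II*\x00", b"MM\x00*"):   # TIFF byte-order marker
        if head[8:10] == b"CR":                 # Canon CR2 puts "CR" at offset 8
            return "Canon CR2 (TIFF-based)"
        return "some TIFF variant -- parse the IFDs for Make/Model tags"
    if head[:2] == b"\xff\xd8":                 # JPEG SOI marker
        return "plain JPEG"
    return "unknown -- needs a format-specific check"

print(sniff("IMG_0001.CR2"))                    # hypothetical file name
```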
But if I chose a new extension, such as CR3, it didn't recognise
the file as something that ACR could handle.
That's probably because it's designed to process entire directories and skip non-raw files.

--
Erik
 
The current processors in cameras lack the functionality to create ZIP files. Compressing to JPEG is a very different process. Such a function could be added to future camera processor chips.

For now, you could zip a few RAW files on your computer and see what the savings actually are, AND unzip them to see if the process works.
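
For example, a few lines of Python will do it (the file name is just a placeholder):

```python
# Zip one raw file, compare sizes, and check the round trip is exact.
import os, zipfile

src = "IMG_0001.CR2"                            # hypothetical raw file
with zipfile.ZipFile("test.zip", "w", zipfile.ZIP_DEFLATED) as z:
    z.write(src)
print(os.path.getsize(src), "bytes before,",
      os.path.getsize("test.zip"), "bytes after zipping")

with zipfile.ZipFile("test.zip") as z:
    assert z.read(src) == open(src, "rb").read()   # unzipped copy matches exactly
```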

Jerry
 
ACR uses primarily (probably only) the EXIF data to choose its
calibration.
[snip]
If it works like dcraw, it first looks for a characteristic pattern
in the first few dozen bytes. That's usually enough to tell it
where to look for more details. If the format is a TIFF variant, it
can then parse out "Make" and "Model" tags. For others, it has to
look at format-specific encodings.
OK. Thanks for that.
But if I chose a new extension, such as CR3, it didn't recognise
the file as something that ACR could handle.
That's probably because it's designed to process entire directories
and skip non-raw files.
It is a bit more complicated. Bridge won't put ACR in its right-click menu for a "CR3" genuine raw file. But neither will it for a "CR2" text file. So the right-click menu appears to do a broad filter by file type, then a tuned filter by content. Perhaps it asks ACR to handle the file because the file type is something ACR should recognise, then when ACR fails to understand it, Bridge remembers that? The file type for CR2 reads "PHOTOSHOP.CAMERARAWFILECANON2.9" on my PC.

Not that it matters much, except out of curiosity!
 
You're so right. My 1Ds Mark II CR2 files vary around just

If I save a 16-bit TIF file to the hard drive, it's 96 megabytes!

I store files in RAW to save space!

I don't use JPEG because I care about the future of my photographs. And it's a snap to generate jpegs whenever I might need them.

--
Eric

Ernest Hemingway's writing reminds me of the farting of an old horse. - E.B. White
 
Lossless encoding throws away information and you can't get back to
the original. Lossless encoding exploits human senses. In MP3 music
you can throw away weak sounds when you have a dominant loud sound
because your ear will not notice it. Jpeg exploits similar behavior
in the eye (i.e., brain).
Whoops, I'm sure you meant lossy here.

One method to handle lossless compression is to apply lossy compression, then store the error term. The better your lossy compression, the closer it is to the original, which means the easier it is to store the differences. This can be done very explicitly, where either the coder or the decoder can decide what it has so far is "good enough" and just stop, or implicitly, which is how most lossless coders work internally -- for each pixel they create a prediction and then just code up the error term. If they predicted well then the error term has lots of runs of "0" in it, or is at least seriously skewed toward 0. This compresses very well.
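
A toy example of the 'lossy pass plus stored error term' version, with a deliberately crude quantiser standing in for the lossy coder:

```python
# Lossy approximation + stored residual = lossless, as described above.
def lossy(pixels, step=16):
    """A crude lossy coder: snap each sample to the nearest multiple of step."""
    return [step * round(p / step) for p in pixels]

pixels = [100, 102, 107, 115, 130, 131, 129, 128]
approx = lossy(pixels)
residual = [p - a for p, a in zip(pixels, approx)]
print(approx)        # the lossy version
print(residual)      # small numbers near 0 -- cheap to code
reconstructed = [a + r for a, r in zip(approx, residual)]
assert reconstructed == pixels      # lossless once the error term is kept
```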

Compression is all about playing with statistics. You try to use fewer bits for things that are common, with the sacrifice being that some things have to now take more bits. But if they're very unlikely, then you usually win. With any method you can find some input that just doesn't match at all what it was meant for. You can also play around with big dictionaries, like the 'Eiffel Tower' scheme someone mentioned. Another is to make sure both the encoder and decoder have access to the contents of the Library of Congress. Then you send an ISBN number and the decoder spits out the entire contents of the book. This is great if this is your normal communication, but an unpublished book wouldn't work well unless you make it a lossy scheme that gives you the contents of what it thinks is a similar book.
 
Whoops, read the code wrong -- dcraw is using one of its ljpeg functions while reading Nikon compressed files, but they're not LJPEG compressed -- it's just reusing some code to read variable-length-coded difference values. It's then using these differences as errors from a prediction based on adjacent pixels (not the LJPEG predictions or quantizations). My mistake.

As others have pointed out, the NEF problem seems to be that it runs the 12-bit output through a quantization curve with only 683 values. Copying the link from the other post: http://www.majid.info/mylos/weblog/2004/05/02-1.html
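
To see why a 683-entry curve over 12-bit data has to be lossy, here's a toy illustration. The square-root-ish shape below is just a placeholder, not Nikon's actual table:

```python
# 4096 input levels cannot map one-to-one onto 683 output codes (pigeonhole),
# so some distinct sensor values must collapse onto the same code.
levels_in, levels_out = 4096, 683
curve = [round((v / (levels_in - 1)) ** 0.5 * (levels_out - 1)) for v in range(levels_in)]
print(len(set(curve)), "distinct codes used for", levels_in, "input levels")
same = [v for v in range(levels_in) if curve[v] == curve[4000]]
print(len(same), "different 12-bit values all map to code", curve[4000])
```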
 
