Here's Alvy Ray Smith's paper for those who haven't seen it:
https://alvyray.com/Memos/CG/Microsoft/6_pixel.pdf
I'm going to push back just a bit. At the level of the physical structure of our sampling device - the image sensor in our cameras - the pixel *is* a rectangle, not a dimensionless point. (If it were a dimensionless point, it wouldn't be able to record anything.)
That's a good viewpoint for folks to hear, but I'm gonna haggle a bit about the terminology used...
First off, the thing that does the sampling or capture is called a *sensel*, and yeah, they are often square. Then again, the microlens typically sitting over each one sort of isn't. What's more, even in a monochrome sensor there are light-insensitive gaps between sensels (the percentage of active area is called the *fill factor*), and there is usually an AA (antialias) filter that tries to bring that sampling process closer to meeting Nyquist constraints... In other words, even the sampling process is complicated.
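To make that concrete, here's a toy 1-D sketch of the difference between point sampling a scene at sensel centers and integrating over each sensel's active area. The numbers (`fill_factor = 0.5`, 32 sensels, an 8-cycle test signal) are illustrative assumptions, not any real sensor's spec:

```python
import numpy as np

# Toy illustration: a sensel is not a point sampler. Each one integrates
# light over its active area (fill factor < 1), which acts as a small box
# filter on the scene before the sample is taken.

def scene(x):
    """A test signal with fine detail (8 cycles across the unit scene)."""
    return 0.5 + 0.5 * np.sin(2 * np.pi * 8 * x)

pitch = 1 / 32          # sensel spacing: 32 sensels across the scene
fill_factor = 0.5       # fraction of each pitch that is light-sensitive

centers = (np.arange(32) + 0.5) * pitch

# Point sampling: the scene's value exactly at each sensel center.
point_samples = scene(centers)

# Sensel sampling: average the scene over each sensel's active area.
half_width = 0.5 * fill_factor * pitch
sensel_samples = np.array([
    np.mean(scene(np.linspace(c - half_width, c + half_width, 101)))
    for c in centers
])
```

Averaging over the active area slightly attenuates fine detail relative to point sampling (the box filter's sinc rolloff), which is one small part of why the capture side is complicated even before the AA filter enters the picture.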
Jim's assertion that a *pixel* refers to the sampled value is certainly not wrong, but the term is ambiguous because it is also used to describe the mechanism that renders a sampled value.
Display pixels or output pixels can have a range of strange properties, especially in color displays. For example, each color LCD pixel typically consists of Red, Green, and Blue stripes within a square area -- and the stripes are typically vertical so that the finer horizontal structures in text can be rendered more precisely by *subpixel rendering* (e.g., what Microsoft falsely claimed to have invented when they introduced "ClearType"). However, a pixel in a JPEG file is arguably even stranger, undergoing colorspace transformations and frequency-domain compression across samples; remember that JPEG compression is based on human perception, so output to a JPEG file really is a rendering process.
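To make the "JPEG output is rendering" point concrete, here's a rough Python sketch of the first two steps a pixel undergoes on its way into a JPEG file: the JFIF RGB-to-YCbCr transform and the 8x8 block DCT. Quantization and entropy coding are omitted, and the ramp input is just an illustration:

```python
import numpy as np

# Two early stages of JPEG encoding: colorspace transform, then block DCT.
# YCbCr coefficients follow the JFIF (ITU-T T.871) full-range convention.

def rgb_to_ycbcr(rgb):
    """Full-range RGB -> YCbCr, per JFIF."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = -0.168736 * r - 0.331264 * g + 0.5 * b + 128.0
    cr =  0.5 * r - 0.418688 * g - 0.081312 * b + 128.0
    return np.stack([y, cb, cr], axis=-1)

def dct8x8(block):
    """Naive orthonormal 8x8 type-II DCT, as JPEG applies to each block."""
    n = 8
    freq = np.arange(n)[:, None]
    samp = np.arange(n)[None, :]
    m = np.cos((2 * samp + 1) * freq * np.pi / (2 * n))
    m *= np.where(freq == 0, np.sqrt(1 / n), np.sqrt(2 / n))
    return m @ block @ m.T

# A horizontal luma ramp: after the DCT, all the energy sits in the top
# row of coefficients. The stored values describe frequencies across the
# whole block, not individual little squares.
ramp = np.tile(np.arange(8, dtype=float) * 16.0, (8, 1))
coeffs = dct8x8(ramp - 128.0)  # JPEG level-shifts samples by 128 first
```

Once samples have been mixed across an 8x8 block like this, "the pixel at (x, y)" no longer exists as an independent value in the file, which is exactly why writing a JPEG is better thought of as a rendering step.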
Personally, I usually use *sensel* for the measuring-device samples, and *pixel* for the unit of output render processing. When I'm talking about the data -- which is what us computer types would call elements of a *matrix* or an *array* -- I use any of several terms that more specifically describe how the data has been processed. Minimally-processed samples constitute *raw data*. However, lots of modern systems perform extensive processing of samples, for example using spatio-temporal interpolation to create a *scene appearance model* (or, less formally, *image data*). That is my preferred term for data that has been significantly transformed, but not yet rendered for viewing. Scene appearance models can be created by processes as complicated as sub-pixel alignment and weighted averaging of multiple raw captures (stitching), or as simple as CFA (color filter array) demosaicking. Certainly, correcting "bad pixel" samples, color bias, and lens issues like vignetting and distortion are not 1:1 mappings from sensel samples...
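As an illustration of the simplest of those processes, here's a toy bilinear CFA demosaic of an RGGB Bayer mosaic. Real pipelines are far more sophisticated (edge-directed interpolation and so on), and the helper names here are mine, not any library's:

```python
import numpy as np

def conv_same(img, k):
    """3x3 'same' convolution with zero padding (no SciPy dependency)."""
    p = np.pad(img, 1)
    out = np.zeros(img.shape, dtype=float)
    for dy in range(3):
        for dx in range(3):
            out += k[dy, dx] * p[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def demosaic_bilinear(mosaic):
    """Toy bilinear demosaic of an RGGB Bayer mosaic -> (h, w, 3) image.

    Each sensel recorded only one color; the two missing channels at each
    site are interpolated from same-color neighbors (normalized convolution).
    """
    h, w = mosaic.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # RGGB layout: R at (even, even), G on the checkerboard, B at (odd, odd).
    masks = {
        "r": (ys % 2 == 0) & (xs % 2 == 0),
        "g": (ys % 2) != (xs % 2),
        "b": (ys % 2 == 1) & (xs % 2 == 1),
    }
    kernel = np.array([[0.25, 0.5, 0.25],
                       [0.5,  1.0, 0.5 ],
                       [0.25, 0.5, 0.25]])
    out = np.zeros((h, w, 3))
    for ch, name in enumerate("rgb"):
        plane = np.where(masks[name], mosaic, 0.0)
        weight = masks[name].astype(float)
        # Divide interpolated values by interpolated coverage so each
        # channel is estimated only from its own sensel sites.
        out[..., ch] = conv_same(plane, kernel) / conv_same(weight, kernel)
    return out

# Sanity check: a flat gray scene reconstructs to flat gray in all channels.
flat = demosaic_bilinear(np.full((8, 8), 100.0))
```

Even this simplest case makes the point: every output value is a weighted blend of several sensel samples, so the mapping from sensels to image data was never 1:1 to begin with.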
In sum, I would describe the flow as: