21 Seeing

Published

November 6, 2025

Last updated: November 6, 2025

21.1 Seeing overview

The science of human vision—spanning behavior, perception, and neuroscience—is fascinating, and many aspects are important for image‑systems design. This is the topic of my earlier book, Foundations of Vision (Wandell (1995)), which I plan to update after this book matures.

The following chapters in this section cover topics in human vision that are immediately relevant to image systems engineering. The topics include the limits of spatial, temporal, and wavelength encoding by the eye, along with a variety of practical, computable image quality metrics. These metrics are widely used by engineers, and the principles behind them are well-established. We will also explore how to improve upon these tools.

Before we tackle those topics, in this chapter I take a few pages to explore the question of what it means to see. Although it is a bit beyond the scope of this volume, the question of what we experience is intrinsically interesting, and understanding the relationship between the light field (loosely, the directional distribution of light in space) and our experience will surely help us engineer better image systems.

As mentioned earlier (Section 1.3), Hermann von Helmholtz was among the first to argue that we do not perceive the world directly. Instead, what we see is an inference based on the portion of the light field our eyes encode. When Helmholtz proposed this, he challenged the prevailing belief that introspection alone was a sufficient scientific method for understanding visual interpretation. He argued that we are largely unaware of the processes leading to perception and described them as “unconscious inferences.” This phrase was not a mechanistic explanation, but a counterargument to the idea that we can understand vision simply by reflecting on our own perceptions.

The key takeaway is that our percepts are essentially educated guesses—probabilistic estimates under uncertainty—about the state of the world. Here, inference contrasts with strict logical deduction. How do we understand the principles behind these guesses? A crucial source of insight comes from visual illusions. In Helmholtz’s view, normal vision and illusions rely on the same inferential machinery. Cases where our visual guesses are incorrect are particularly instructive, as they help reveal the principles that also guide our correct perceptions. This section uses illusions to illustrate this point, so we do not let technical details obscure the Big Picture.

We do not yet have a fully predictive, end-to-end model that maps the incident light field to visual experience, but we do have some hints. Each illusion presented here demonstrates a principle the visual system uses to infer the state of the world. I leave the philosophical debate about whether these are true illusions or simply compelling demonstrations to others¹. I have selected them for what they reveal about the principles embedded in our visual inferences. For a more extensive collection, I highly recommend the work of Akiyoshi Kitaoka and the remarkable entries in the Illusion of the Year contest.

21.2 The world is three-dimensional

The two tabletops in this picture are the same shape and size in the 2D image plane. Yet one appears to be long and narrow, and the other appears to be short and wide. Why the strong difference?

Your visual system uses the cues in the drawing to interpret the image as a projection of a 3D scene. What you experience accounts for the depth and slant (foreshortening) suggested by the table legs, the implied perspective, and the contour orientations. The inferences that help you perceive true 3D structure from the two-dimensional retinal image drive the illusion.

If we could walk around the actual (imagined) 3D scene the image depicts, the two physical tabletops producing those projections would almost certainly differ. The exact 2D match of the tabletops in the image plane is an artifact of viewpoint, created by Shepard’s careful drawing. It is an accidental agreement, and thus it is not something you would normally notice. What you consciously experience (what you see) is a more plausible 3D interpretation, not the accidental 2D shapes in the image and on your retina.

Figure 21.1: R.N. Shepard’s illusion: Turning the Tables.

Shepard’s illusion reminds us: for everyday vision, the 2D retinal shape is not what we see. We see an inference about 3D structure that is derived from the image.
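The role of foreshortening can be made concrete with a small calculation. The sketch below assumes an idealized pinhole camera looking along a flat ground plane; these are assumptions made for illustration, not a model of Shepard’s drawing, and the function names and numbers (eye height, distances) are hypothetical. It shows that an edge receding in depth produces a much shorter image than an equal edge running left to right, so an image edge of a given length implies a much longer physical edge when it is interpreted as receding.

```python
import numpy as np

def project(point, focal=1.0):
    """Pinhole projection of a 3D point (X, Y, Z) onto the image plane."""
    X, Y, Z = point
    return np.array([focal * X / Z, focal * Y / Z])

def image_length(p, q, focal=1.0):
    """Length in the image of the segment joining 3D points p and q."""
    return float(np.linalg.norm(project(p, focal) - project(q, focal)))

def depth_extent_matching(image_len, eye_height, near, focal=1.0):
    """Physical length of a ground-plane edge that starts at depth `near`,
    recedes straight away from the viewer, and has image length `image_len`.
    Solves focal * eye_height * (1/near - 1/(near + L)) = image_len for L."""
    inv_far = 1.0 / near - image_len / (focal * eye_height)
    if inv_far <= 0:
        raise ValueError("no finite edge produces an image this long")
    return 1.0 / inv_far - near

# Hypothetical viewing geometry (units are arbitrary, e.g. metres).
eye_height = 1.5   # camera height above the ground plane
near = 4.0         # distance to the nearest point of each edge

# A 1-unit edge running left to right at depth `near`.
across = image_length((-0.5, -eye_height, near), (0.5, -eye_height, near))

# A 1-unit edge running straight away from the viewer.
in_depth = image_length((0.0, -eye_height, near), (0.0, -eye_height, near + 1.0))

# Physical length a receding edge needs in order to match the image
# length of the left-to-right edge.
needed = depth_extent_matching(across, eye_height, near)

print(across, in_depth, needed)
```

With these hypothetical values the receding edge would have to be several times longer than the left-to-right edge to produce the same image length, which is the direction of the bias in the tabletop demonstration.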

21.3 Expectations of likely shapes

Richard Gregory’s ‘Hollow Mask’ illusion is a powerful extension of the “Turning the Tables” demonstration. In this illusion the mask is genuinely three-dimensional, having the same shape as a face. The illusion arises because, when you see the inside of the mask, its three-dimensional shape is quite unlike that of a typical face: the mask is hollow. Yet the mask appears to have the three-dimensional structure one would expect of a real face, with the nose nearest you. You do see a three-dimensional shape, but it is the likely (and wrong) shape.

Equally interesting is how we see the motion of the rotating mask. Because the interpretation of depth is wrong, your brain must make a second wrong interpretation to explain the stimulus on your retina: the mask appears to rotate opposite to its true direction.

Figure 21.2: R.L. Gregory’s ‘hollow mask’ demonstration.

The Hollow Mask Illusion informs us that we are deeply biased about the three-dimensional shape of certain objects. Apparently, we expect to see noses closer than foreheads. Given the chance, we force that interpretation and we then reinterpret other features (the motion) to make them consistent.
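One common way to formalize a strong expectation of this kind is Bayes' rule, in which a prior belief about shape is combined with how well each shape explains the image. The toy calculation below is only a sketch of that idea: the two hypotheses, the prior, and the likelihood values are hypothetical numbers chosen for illustration, not measurements and not a claim about the mechanism the visual system uses.

```python
def posterior(prior, likelihood):
    """Combine prior and likelihood with Bayes' rule and normalize."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: value / total for h, value in unnormalized.items()}

# Hypothetical prior: faces are almost always convex (nose toward the viewer).
prior = {"convex": 0.99, "hollow": 0.01}

# Hypothetical likelihoods: the shading and motion cues in the image
# actually favor the hollow interpretation by a factor of four.
likelihood = {"convex": 0.2, "hollow": 0.8}

post = posterior(prior, likelihood)
print(post)
```

Even though the image evidence in this toy example favors the hollow shape, the strong prior leaves the convex interpretation far more probable, which parallels what the demonstration suggests.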

21.4 Regional comparisons

These three demonstrations show that even a seemingly simple judgment—how light or dark a square appears—is based on context, not just the raw pixel (retinal) intensity.

Kitaoka: Squares that have the same physical (image) intensity look different because each is compared against its immediate surround. Local contrast (center vs. surround) is weighted more heavily than absolute level.

21.4.1 Kitaoka

Figure 21.3: The A square appears darker than the B square, but they are the same. Source: Akiyoshi Kitaoka

Adelson (Checkered Shadow): This adds a reason. The visual system tries to estimate surface reflectance rather than raw intensity. Image intensity at a point can be thought of (simplistically) as $I(x) = E(x)\,R(x)$, where $E(x)$ is illumination (which can vary sharply because of shadows) and $R(x)$ is surface reflectance (which is more stable). Because $E(x)$ is unreliable, the system discounts it by using local ratios and edges. Under a smooth shadow, neighboring points $x_1$ and $x_2$ share nearly the same $E(x)$, so $I(x_1)/I(x_2) \approx R(x_1)/R(x_2)$. That makes local contrast a better cue to $R$ than the absolute intensity.
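The product model is easy to check numerically. The sketch below builds a hypothetical one-dimensional scene (the checker reflectances, the linear illumination falloff, and the function name `local_ratios` are all assumptions for illustration) and compares the ratio of neighboring intensities with the ratio of neighboring reflectances.

```python
import numpy as np

def local_ratios(values):
    """Ratio of each sample to its left-hand neighbor."""
    values = np.asarray(values, dtype=float)
    return values[1:] / values[:-1]

# Hypothetical scene sampled at 200 positions.
x = np.linspace(0.0, 1.0, 200)

# Reflectance R(x): alternating dark (0.3) and light (0.6) checks.
reflectance = np.where((x * 10).astype(int) % 2 == 0, 0.3, 0.6)

# Illumination E(x): a smooth falloff, standing in for a soft shadow.
illumination = 1.0 - 0.7 * x

# Product model: I(x) = E(x) R(x).
intensity = illumination * reflectance

ratio_I = local_ratios(intensity)
ratio_R = local_ratios(reflectance)

# Largest relative disagreement between intensity and reflectance ratios.
max_rel_err = np.max(np.abs(ratio_I / ratio_R - 1.0))
print(max_rel_err)

# Absolute intensity, in contrast, varies widely within one reflectance.
light_checks = intensity[reflectance == 0.6]
print(light_checks.min(), light_checks.max())
```

In this example the neighboring ratios track the reflectance ratios to within a couple of percent, while the absolute intensity of the light checks changes by a factor of about three across the shadow.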

21.4.2 Adelson

Lotto and Purves: A photorealistic version of the same principle. Two floor tiles (one apparently in bright light, one in shadow) have identical image intensities, yet you experience them as different materials (light vs. dark) because your visual system attributes the intensity difference to illumination, preserving a stable reflectance interpretation.

21.4.3 Lotto and Purves

Figure 21.5: Lotto and Purves. Source: Cite their book here.

Key idea: You do not directly “see” intensity; you infer surface properties. Context (contrast relations, shadows, surrounds) reshapes the percept to favor a plausible, illumination-invariant reflectance estimate.

21.5 Edges matter

A single local edge can change how you see a wide region. In the Craik–O’Brien–Cornsweet (Cornsweet) illusion the two large adjoining fields have exactly the same physical intensity, yet one looks lighter and the other darker. The only physical difference lies in a narrow luminance gradient (a “dipole”) straddling the shared boundary. Your visual system gives that gradient disproportionate weight and then “fills in” a brightness (or surface reflectance) level across the full interior of each region.
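A stimulus of this kind is simple to synthesize, which makes it easy to verify that the two fields really do match. The sketch below is one possible construction, not the one used for the figures in this chapter: the exponential shape of the dipole, the parameter values, and the function names are assumptions for illustration.

```python
import numpy as np

def cornsweet_profile(n=512, base=0.5, amplitude=0.1, width=20.0):
    """1D luminance profile: two equal fields plus a narrow edge dipole."""
    # Center the axis between two samples so no pixel sits exactly on the edge.
    x = np.arange(n) - (n - 1) / 2.0
    # The deviation from `base` decays exponentially away from the edge,
    # negative on the left side and positive on the right side.
    dipole = np.sign(x) * amplitude * np.exp(-np.abs(x) / width)
    return base + dipole

def cornsweet_image(rows=256, **kwargs):
    """Stack the 1D profile into a 2D image (rows x columns)."""
    return np.tile(cornsweet_profile(**kwargs), (rows, 1))

profile = cornsweet_profile()
image = cornsweet_image()

# Difference between the far left and far right of the two fields.
far_field_difference = abs(profile[0] - profile[-1])

# Luminance jump across the central edge.
mid = len(profile) // 2
edge_step = profile[mid] - profile[mid - 1]

print(far_field_difference, edge_step, image.shape)
```

Away from the boundary the two fields agree to within a tiny fraction of the edge step, so any large brightness difference a viewer reports across the fields must come from the dipole. Reducing `width` narrows the dipole.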

21.5.1 Cornsweet-Craik

Say some words.

Figure 21.6: Cornsweet–Craik–O’Brien illusion. A narrower dipole better conveys the effect. Image by Lord Belbury, CC BY-SA 4.0, via Wikimedia Commons.

21.5.2 Winawer-Horiguchi

Jon Winawer and Hiroshi Horiguchi converted the Cornsweet-Craik image into a dynamic, fun demonstration. Each segment, as it appears, looks darker than the one before it. Yet even though the new segments appear darker and darker, no segment ever becomes very dark.

Figure: Winawer and Horiguchi’s dynamic version of the Cornsweet–Craik illusion. Each new segment appears darker than the previous one; all segments have the same luminance.


Shepard’s tone

Jon and Hiroshi’s illusion was inspired by Roger Shepard’s well-known Ascending Tone Illusion. In that case, the pitch seems to rise continuously, yet it never actually gets higher.

The Stanford Chorale kindly sang the illusion for a project in a Sensation and Perception class I taught some 35 years ago. Fun.

21.5.3 Dakin-Bex

The Dakin–Bex variant makes the spatial spreading especially clear: many broad areas all share the same intensity, but they appear different solely because of the contrast polarity and magnitude at their edges. Move closer or farther so that the regions take up more or less of your visual field; the effect persists, showing it is not a tiny, fovea-only phenomenon.

Figure 21.7: Dakin and Bex. The forehead and the hair have the same intensity. The many light-dark edges (dipoles) give an impression of lightness that spreads across the image.

Intuition: Early spatial filtering emphasizes luminance changes (edges, gradients) and attenuates uniform interiors. The visual system then infers a smooth or piecewise-smooth surface reflectance that is most consistent with the edge signals. A small, signed edge contrast can therefore shift the perceived baseline (offset) over a large area until another edge provides a counteracting constraint.

Key points:

- Perceived brightness depends more on edge structure than on absolute interior intensity.
- Local contrast can propagate its influence globally (“filling-in”).
- These illusions warn that large uniform regions are perceptually underconstrained without their borders.
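The intuition about early spatial filtering can be sketched numerically. The difference-of-Gaussians (center-surround) filter used here is a common textbook stand-in for early spatial filtering, chosen for illustration rather than taken from this chapter; the function names and the sigma values are hypothetical. The example shows that the filter output is essentially zero inside uniform regions, nonzero near the edge, and unchanged when a constant offset is added to the whole signal.

```python
import numpy as np

def gaussian_kernel(sigma, radius):
    """Unit-sum Gaussian sampled at integer offsets -radius..radius."""
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    return kernel / kernel.sum()

def dog_response(signal, sigma_center=2.0, sigma_surround=6.0):
    """Filter a 1D signal with a center-minus-surround (DoG) kernel."""
    radius = int(4 * sigma_surround)
    # Both Gaussians sum to one, so the difference sums to zero
    # and uniform regions produce no response.
    kernel = (gaussian_kernel(sigma_center, radius)
              - gaussian_kernel(sigma_surround, radius))
    # Replicate the end values so the output has the input's length.
    padded = np.pad(np.asarray(signal, dtype=float), radius, mode="edge")
    return np.convolve(padded, kernel, mode="valid")

# A step edge between two uniform fields.
step = np.concatenate([np.full(100, 0.4), np.full(100, 0.6)])

response = dog_response(step)
shifted_response = dog_response(step + 0.3)  # same edge, brighter overall

print(abs(response[20]))             # deep inside the left field
print(np.max(np.abs(response)))      # near the edge
print(np.allclose(response, shifted_response, atol=1e-12))
```

Because the filtered signal carries no information about the overall level of a uniform interior, anything that later stages conclude about that level has to be inferred from the responses at the borders.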

21.6 2nd order contrast?

Looking for a fundamental 2nd order contrast illusion.

21.7 Motion, pattern, and color combine

Not sure

21.8 Seeing without light

Something about Craik, Newton, Brindley personal experiments? Entoptic phenomena.

¹ Yes, I have had people I respect challenge whether some of these are really illusions or not.