Addressing the Rotation and Scaling Problem
in Template-Based Image Recognition Systems
© 2004 by Steven K. Roberts
Nomadic Research Labs
In the early 1980s, I found myself spending a lot of time at AI
conferences in journalist mode, and became particularly enamored with
image recognition systems. This fledgling pursuit was
sufficiently captivating to propel me into the study of human vision,
and I was fortunate to spend time with many of the luminaries in the
field (including a year working in VER research, developing tools for
presenting averageable image changes on the retina through a cloudy
medium with no net luminance change, studying the layers of feature
abstraction implemented in the brain, and reviewing then-prominent
texts on the subject and spending time with their authors).
In the midst of all this, I came across an intriguing paper by (if I
recall correctly) Dr. I. J. Good of the NYU Medical Center, in which he
reported the laborious sectioning and transmission electron-microscopy
of the entire visual pathway from retina to primary visual
cortex. In the paper was an understated but startling
observation: there is apparently a coordinate transformation from
a logarithmic polar system (at the retina) to a cartesian system (at
the cortex), implemented entirely in the hardwiring of the optic
nerve. This struck me as rather bizarre, and I assumed that it
supported our column-interleaved focusing system so colorfully
described by E. W. Kent in his engaging book, The Brains of Men and Machines.
But still it nagged at me, and not being
much of a mathematician I didn’t immediately grasp the significance.
Finally, out of curiosity, I wrote a coordinate transformation program
on my primitive CP/M system of the era, and presented it with a simple
“face” graphic to see what happened. This is shown in Figure 1
(the arrow is just a reference point, for reasons that will become
clear in a moment).
Figure 1: Basic face and transformation
As you can see, the circle, being a constant distance from the origin,
becomes a vertical line. The eye dots are uninteresting, since
they are just dots. And the mouth line (my drawing tools, like my
artistic sense, were unsophisticated) becomes a little logarithmic
curve.
That last bit is interesting, and I stared at it a while, wondering
what I had done wrong.
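The shapes in Figure 1 follow from the mapping itself. The sketch below writes it out under some assumptions that the original program may not have shared: natural logarithms, the face centered on the origin, log-radius on the horizontal axis and angle on the vertical axis, and the mouth drawn as a short horizontal segment a distance c below the center. The symbols u, v, R and c are introduced here for illustration only.

```latex
% Log-polar mapping of an image point (x, y) with r > 0
\[
  u = \ln r = \ln\sqrt{x^2 + y^2}, \qquad v = \theta = \operatorname{atan2}(y, x)
\]
% Circle of radius R about the origin: u is constant, so the image is a vertical line
\[
  r = R \;\Rightarrow\; u = \ln R
\]
% Horizontal segment at y = -c (c > 0), below the origin: r \sin\theta = -c
\[
  u(\theta) = \ln c - \ln\lvert \sin\theta \rvert, \qquad -\pi < \theta < 0
\]
```

Over the small range of angles spanned by a short mouth, the last expression traces a shallow curve whose smallest u falls at θ = -π/2, which is consistent with the little logarithmic curve in the figure.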
To aid in debugging, I rotated the input image and almost fell off my
chair. The output is identical, merely translated
vertically (wrapping around my primitive display space; hence the arrow
for reference).
Figure 2: Rotated face
With growing excitement, I tried scaling the input image, and the
result is shown below: again, the identical output, merely
translated horizontally.
Figure 3: Scaled face
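Both observations can be checked numerically with a few points instead of a drawing. This is a minimal sketch in Python with NumPy, not a reconstruction of the original CP/M program; the function names `to_log_polar` and `rotate`, the sample points, and the chosen angle and scale factor are all made up for illustration. The points are picked so that the rotation does not carry any of them across the ±π wrap-around.

```python
import numpy as np

def to_log_polar(points):
    """Map (x, y) points to (log r, theta). The origin itself is excluded."""
    x, y = points[:, 0], points[:, 1]
    u = np.log(np.hypot(x, y))   # horizontal axis: log of distance from origin
    v = np.arctan2(y, x)         # vertical axis: angle in radians
    return np.column_stack([u, v])

def rotate(points, phi):
    """Rotate points counterclockwise about the origin by phi radians."""
    c, s = np.cos(phi), np.sin(phi)
    rotation = np.array([[c, -s], [s, c]])
    return points @ rotation.T

# Hypothetical sample points standing in for features of the face.
pts = np.array([[1.0, 0.5], [2.0, 1.0], [0.5, 2.0], [-1.0, 1.5]])
phi = 0.3      # rotation angle in radians (assumed)
scale = 2.5    # scale factor (assumed)

base = to_log_polar(pts)
rot_shift = to_log_polar(rotate(pts, phi)) - base   # change caused by rotation
scl_shift = to_log_polar(pts * scale) - base        # change caused by scaling

print(rot_shift)  # every row: no change in log r, same shift in theta
print(scl_shift)  # every row: same shift in log r, no change in theta
```

Every point moves by the same amount in the transformed space, so the pattern as a whole is translated rather than distorted: by phi along the angle axis for rotation, and by the logarithm of the scale factor along the log-radius axis for scaling.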
In the human system, this neatly solves the problem of varying angles
and distances rendering a scene unrecognizable (a very real problem in
computer-based systems). The cells of the retina are organized in a
logarithmic polar grid that is mapped onto a rectangular array in the
cortex, converting rotation of an object in real space into a
corresponding vertical translation, and expansion or contraction into a
horizontal translation. Thus, the tilting of your friend's face
produces no pattern change in at least this aspect of the image
presented to the brain (note that real-world lateral translation would
create problems, but we automatically direct our receptors and visual
attention to the subject under analysis, effectively centering it in the
macular region).
Doing this, whether with number-crunching or some kind of optical
trickery, would appear to solve one of the most restrictive problems in
image processing, replacing it with the much more addressable challenge
of keeping the object of interest centered in the frame of reference
(presumably using centroids or other relatively simple tricks). If
the template objects are all stored as pre-transformed images, and the
input is subjected to the same treatment, then spatial FFT-based
correlation should have a much less difficult time dealing with
rotation and scaling, at least within a 2-D visual space.
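The correlation step might look something like the following. It is a rough sketch under stated assumptions rather than a tested recognition pipeline: the function name `circular_shift_by_fft` is hypothetical, a random array stands in for a pre-transformed template (rows as angle, columns as log-radius), and the input is simulated by shifting that array. The FFT treats both axes as circular, which matches the wrap-around of the angle axis but is only a simplification for the log-radius axis.

```python
import numpy as np

def circular_shift_by_fft(template, image):
    """Estimate the (row, col) circular shift that maps template onto image.

    Uses the cross-correlation theorem: the inverse FFT of
    F(image) * conj(F(template)) peaks at the relative shift.
    """
    f_template = np.fft.fft2(template)
    f_image = np.fft.fft2(image)
    corr = np.fft.ifft2(f_image * np.conj(f_template)).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    return tuple(int(p) for p in peak)

# Stand-in for a stored, pre-transformed template (synthetic data, not a real image).
rng = np.random.default_rng(0)
template = rng.random((32, 64))

# Simulated input: the same pattern shifted 5 rows (a rotation) and 12 columns (a scaling).
image = np.roll(template, shift=(5, 12), axis=(0, 1))

shift = circular_shift_by_fft(template, image)
print(shift)
```

The location of the correlation peak gives the translation between template and input in the transformed space, which corresponds to the rotation and scale difference in the original image. A real system would also need to handle noise, partial matches, and the fact that scaled content falls off the edge of the log-radius axis instead of wrapping.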
Obviously there’s a lot more to image recognition than simple
coordinate transformation and cross-correlation, but I’m reasonably
convinced that this should be investigated as a preprocessing step—or
as a parallel path alongside existing techniques to provide a
rotation-invariant input to existing algorithms.
For More Information
Over 20 years have passed since I played with this and made those
simple images (which I used in my Prentice-Hall textbook, available from
Amazon at these links: Creative Design With Microcomputers or the
hardbound version with a different title, Industrial Design With
Microcomputers).
Naturally, other people have thought about the same thing: a bit of
searching turns up this document on log-polar mapping with attention
to its value in data compression and attention focus, and this
article about global and local symmetry of the primary visual
cortex is particularly enchanting.