

Addressing the Rotation and Scaling Problem
in Template-Based Image Recognition Systems

© 2004 by Steven K. Roberts
Nomadic Research Labs



In the early 1980s, I spent a lot of time at AI conferences in journalist mode and became particularly enamored with image recognition systems.  This fledgling pursuit was captivating enough to propel me into the study of human vision, and I was fortunate to spend time with many of the luminaries in the field.  That included a year working in VER research, developing tools for presenting averageable image changes on the retina through a cloudy medium with no net luminance change, studying the layers of feature abstraction implemented in the brain, reviewing then-prominent texts on the subject, and spending time with their authors.

In the midst of all this, I came across an intriguing paper by (if I recall correctly) Dr. I. J. Good of the NYU Medical Center, in which he reported the laborious sectioning and transmission electron microscopy of the entire visual pathway from retina to primary visual cortex.  In the paper was an understated but startling observation:  there is apparently a coordinate transformation from a logarithmic polar system (at the retina) to a Cartesian system (at the cortex), implemented entirely in the hardwiring of the optic nerve.  This struck me as rather bizarre, and I assumed that it supported our column-interleaved focusing system so colorfully described by E. W. Kent in his engaging book, The Brains of Men and Machines.  But still it nagged at me, and not being much of a mathematician, I didn’t immediately grasp the significance.

Finally, out of curiosity, I wrote a coordinate transformation program on my primitive CP/M system of the era, and presented it with a simple “face” graphic to see what happened.  This is shown in Figure 1 (the arrow is just a reference point, for reasons that will become clear in a moment).

 

Figure 1:  Basic face and transformation


As you can see, the circle, being a constant distance from the origin, becomes a vertical line.  The eye dots are uninteresting, since they are just dots.  And the mouth line (my drawing tools, like my artistic sense, were unsophisticated) becomes a little logarithmic curve.  That last bit is interesting, and I stared at it a while, wondering what I had done wrong.
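A minimal sketch of the kind of mapping described above, written in modern Python rather than the original CP/M program (which is not reproduced here). The function name `to_log_polar`, the use of the natural logarithm, and the sample coordinates are assumptions for illustration; the origin is taken to be the center of the face.

```python
import math

def to_log_polar(x, y):
    """Map a Cartesian point (origin at the image center) to (log r, theta).

    Hypothetical helper: horizontal output axis is log r, vertical is theta.
    """
    r = math.hypot(x, y)
    if r == 0:
        raise ValueError("the origin has no log-polar image")
    return math.log(r), math.atan2(y, x) % (2 * math.pi)

# Points on a circle of radius 5 all share one horizontal coordinate,
# so the circle plots as a vertical line.
circle = [(5 * math.cos(t), 5 * math.sin(t)) for t in (0.0, 1.0, 2.0, 3.0)]
circle_u = [to_log_polar(x, y)[0] for x, y in circle]

# A straight horizontal "mouth" at y = -2 has varying r and theta,
# so its image bends into a curve instead of staying a line.
mouth = [to_log_polar(x, -2.0) for x in (-1.5, -0.5, 0.5, 1.5)]
```

Every entry of `circle_u` equals `log(5)`, while the `mouth` points land at differing horizontal positions, which is the behavior visible in Figure 1.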

To aid in debugging, I rotated the input image and almost fell off my chair.  The output is identical, merely translated vertically (wrapping around my primitive display space; hence the arrow for reference).  


 
Figure 2:  Rotated face


With growing excitement, I tried scaling the input image, and the result is shown below:  again, the identical output, merely translated horizontally.
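The behavior in Figures 2 and 3 follows directly from the mapping. The symbols below are introduced here for illustration and assume the horizontal output axis is u = ln r and the vertical axis is v = θ; the exact axis scaling of the original program is not given.

```latex
\begin{aligned}
(x, y) &\mapsto (u, v) = (\ln r,\ \theta),
  \qquad r = \sqrt{x^2 + y^2},\quad \theta = \operatorname{atan2}(y, x) \\
\text{rotation by } \phi:\quad (r, \theta) &\to (r,\ \theta + \phi)
  \;\Rightarrow\; (u, v) \to \bigl(u,\ (v + \phi) \bmod 2\pi\bigr) \\
\text{scaling by } s > 0:\quad (r, \theta) &\to (s\,r,\ \theta)
  \;\Rightarrow\; (u, v) \to (u + \ln s,\ v)
\end{aligned}
```

Rotation therefore adds a constant to the vertical coordinate (wrapping at 2π, which is the wrap-around seen on the display), and scaling adds the constant ln s to the horizontal coordinate.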
 


Figure 3:  Scaled face


In the human system, this neatly solves the problem of varying angles and distances rendering a scene unrecognizable (a very real problem in computer-based systems).  The cells of the retina are organized in a logarithmic polar grid that is mapped onto a rectangular array in the cortex, converting rotation of an object in real space into a corresponding vertical translation, and expansion or contraction into a horizontal translation.  Thus, the tilting of your friend's face produces no pattern change in at least this aspect of the image presented to the brain.  (Note that real-world lateral translation would create problems, but we automatically direct our receptors and visual attention to the subject under analysis, effectively centering it in the macular region.)
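As a rough software analogue of a log-polar sampling grid feeding a rectangular array, the sketch below resamples a raster image with NumPy. The names `log_polar_resample` and `ring_image`, the grid sizes, and the nearest-neighbour lookup are all assumptions chosen for brevity; this is not a model of the retina, only an illustration of the geometry.

```python
import numpy as np

def log_polar_resample(image, n_r=64, n_theta=64, r_min=1.0):
    """Sample a 2-D array on a log-polar grid centered on the image.

    Rows of the result index angle (vertical axis); columns index
    log-radius (horizontal axis).
    """
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cx, cy)
    # Radii spaced evenly in log r; angles spaced evenly around the circle.
    radii = np.exp(np.linspace(np.log(r_min), np.log(r_max), n_r))
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    rr, tt = np.meshgrid(radii, thetas)  # both shaped (n_theta, n_r)
    # Nearest-neighbour lookup of each grid point in the source image.
    xs = np.clip(np.rint(cx + rr * np.cos(tt)).astype(int), 0, w - 1)
    ys = np.clip(np.rint(cy + rr * np.sin(tt)).astype(int), 0, h - 1)
    return image[ys, xs]

def ring_image(radius, size=129, half_width=2.0):
    """Hypothetical test pattern: a ring of ones centered in a square array."""
    c = (size - 1) / 2.0
    y, x = np.mgrid[0:size, 0:size]
    dist = np.hypot(x - c, y - c)
    return (np.abs(dist - radius) <= half_width).astype(float)

small = log_polar_resample(ring_image(20))
large = log_polar_resample(ring_image(40))
# Columns filled from top to bottom: each ring has become a vertical line.
small_cols = np.flatnonzero(small.all(axis=0))
large_cols = np.flatnonzero(large.all(axis=0))
```

Doubling the ring's radius moves the filled columns to the right without changing their vertical extent, which is the horizontal translation described above.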

Doing this, whether with number-crunching or some kind of optical trickery, would appear to solve one of the most restrictive problems in image processing, replacing it with the much more addressable challenge of keeping the object of interest centered in the frame of reference (presumably using centroids or other relatively simple tricks).  If the template objects are all stored as pre-transformed images, and the input is subjected to the same treatment, then spatial FFT-based correlation should have a much less difficult time dealing with rotation and scaling, at least within a 2-D visual space.
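One way the correlation step might look, as a sketch only: if a template and an input have both been log-polar transformed, a rotation or scaling of the input shows up as a shift, and FFT-based cross-correlation can locate that shift. The function name `circular_shift_between` and the random stand-in data are assumptions. The correlation here is circular on both axes, which is exact for the angle axis but only an approximation for the log-radius axis, since real images do not wrap in radius.

```python
import numpy as np

def circular_shift_between(template, observed):
    """Return (d_row, d_col) such that rolling `template` by that amount
    best matches `observed`, using FFT-based circular cross-correlation."""
    f_t = np.fft.fft2(template)
    f_o = np.fft.fft2(observed)
    # Correlation theorem: multiply one spectrum by the conjugate of the other.
    corr = np.fft.ifft2(f_o * np.conj(f_t)).real
    d_row, d_col = np.unravel_index(np.argmax(corr), corr.shape)
    return int(d_row), int(d_col)

rng = np.random.default_rng(0)
template = rng.random((32, 32))  # stand-in for a pre-transformed template
# Simulate a "rotation" (5 rows) and a "scaling" (3 columns) as pure shifts.
observed = np.roll(template, (5, 3), axis=(0, 1))
shift = circular_shift_between(template, observed)
```

With an angle axis of `n_theta` rows, a row shift of `d_row` would correspond to a rotation of `d_row * 2π / n_theta`, and a column shift of `d_col` to a scale factor of `exp(d_col * Δ)`, where `Δ` is the log-radius spacing between columns. The height of the correlation peak could then serve as the match score against each stored template.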

Obviously there’s a lot more to image recognition than simple coordinate transformation and cross-correlation, but I’m reasonably convinced that this should be investigated as a preprocessing step, or as a parallel path alongside existing techniques, to provide a rotation-invariant input to recognition algorithms.

For More Information

Over 20 years have passed since I played with this and made those simple images (which I used in my Prentice-Hall textbook, available from Amazon at these links: Creative Design With Microcomputers or the hardbound version with a different title, Industrial Design With Microcomputers). 

Naturally other people have thought about the same thing:  a bit of searching turns up this document on log-polar mapping with attention to its value in data compression and attention focus, and this article about global and local symmetry of the primary visual cortex is particularly enchanting.


