Automatic Scale Selection as a Pre-Processing Stage
for Interpreting the Visual World

Tony Lindeberg
Department of Numerical Analysis and Computing Science
KTH, S-100 44 Stockholm, Sweden
tony@nada.kth.se, http://www.nada.kth.se/~tony
Abstract:
This paper reviews a systematic methodology for formulating
mechanisms for automatic scale selection when performing feature detection.
An important property of the proposed approach is that the notion of
scale is already included in the definition of image features.
Computer vision algorithms for interpreting image data usually
involve a feature detection step.
The need for performing early feature detection is usually motivated
by the desire of condensing the rich intensity pattern to a more
compact representation for further processing.
If a proper abstraction of shape primitives can be computed,
certain invariance properties can also be expected with respect to
changes in view direction and illumination variations.
The earliest works in this direction were concerned with
edge detection (Prewitt, 1970; Roberts, 1965).
While edge detection may at first appear to be a rather simple task,
it was empirically observed that it can be very hard to extract
edge descriptors reliably.
Usually, this was explained as a noise sensitivity that could be reduced
by presmoothing the image data before applying the edge detector
(Torre & Poggio, 1986).
Later, a deeper understanding was developed that these difficulties
originate from a more fundamental aspect of image structure,
namely that real-world objects
(in contrast to idealized mathematical entities such as points and lines)
usually consist of different types of structures at different
scales (Koenderink, 1984; Witkin, 1983).
Motivated by the multi-scale nature of real-world images,
multi-scale representations such as pyramids (Burt & Adelson, 1983)
and scale-space representation (Koenderink, 1984; Lindeberg, 1994; Witkin, 1983)
were constructed.
Theories were also formed concerning what types of image
features should be extracted from any scale level in a multi-scale
representation (Florack et al., 1992; Florack, 1997; Koenderink & van Doorn, 1992; Lindeberg, 1994).
The most common way of applying multi-scale representations in practice
has been to select one or a few scale levels in advance,
and then extract image features at each scale level more or less independently.
This approach can be sufficient under simplified conditions,
where only a few natural scale levels are involved and
provided that the image features are stable over large ranges of scales.
Typically, this is the case when extracting edges of man-made objects
viewed under controlled imaging conditions.
In other cases, however, there may be a need for adapting scale levels
individually to each image feature, or even to adapt the scale levels
along an extended image feature, such as a connected edge.
Typically, this occurs when detecting ridges (which turn out to
be much more scale sensitive than edges) and when applying an
edge detector to a diffuse edge for which the degree of diffuseness
varies along the edge.
To handle these effects in general cases, we argue that it is natural
to complement feature detection modules by explicit mechanisms for
automatic scale selection,
so as to automatically adapt the scale levels to the image features
under study.
The purpose of this article is to present such a framework for automatic
scale selection,
which is generally applicable to a rich variety of image features,
and has been successfully tested by integration with other visual modules.
For references to the original sources,
see (Lindeberg, 1998a,b,1999) and the references therein.
An attractive property of the proposed scale selection mechanism is that
in addition to automatic tuning of the scale parameter,
it induces the computation of natural abstractions (groupings) of image shape.
In this respect, the proposed methodology constitutes a natural preprocessing
stage for subsequent interpretation of visual scenes.
To demonstrate the need for an automatic scale selection mechanism,
let us consider the problems of detecting edges and ridges, respectively,
from image data.
Figure 1 shows two images,
from which scale-space representations
have been computed by convolution with Gaussian kernels,
i.e., given an image f, its scale-space representation L is

L(\cdot;\, t) = g(\cdot;\, t) * f   (1)

where g denotes the Gaussian kernel

g(x, y;\, t) = \frac{1}{2\pi t} \, e^{-(x^2 + y^2)/(2t)}   (2)

and the variance t of this kernel is referred to as the scale parameter.
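To make the construction concrete, the scale-space smoothing in equations (1) and (2) can be sketched in a few lines of Python, here using SciPy's Gaussian filtering (the test image, noise level and scale values are arbitrary illustrative choices, not taken from the paper):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def scale_space(f, t):
    """Scale-space representation L(.; t) = g(.; t) * f, where the
    variance t of the Gaussian kernel is the scale parameter."""
    return gaussian_filter(f, sigma=np.sqrt(t))

# Illustrative test image: a sharp step contaminated by fine-scale noise.
rng = np.random.default_rng(0)
f = np.zeros((64, 64))
f[:, 32:] = 1.0
f += 0.1 * rng.standard_normal(f.shape)

L_fine = scale_space(f, t=1.0)     # fine scale: noise largely survives
L_coarse = scale_space(f, t=16.0)  # coarse scale: noise is suppressed
```

Increasing t suppresses fine-scale structure while the step itself survives, which is the behaviour exploited by the feature detectors discussed below.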
At each scale level, edges are defined from points at which the gradient
magnitude assumes a local maximum in the gradient direction
(Canny, 1986; Korn, 1988).
In terms of local directional derivatives, where \partial_v denotes
a directional derivative in the gradient direction, this edge definition
can be written

\{ L_{vv} = 0, \quad L_{vvv} < 0 \}   (3)
Such edges at three scales are shown in the left column
in figure 1.
As can be seen, sharp edge structures corresponding to object boundaries
give rise to edge curves at both fine and coarse scales.
At fine scales, the localization of object edges is better,
while the number of spurious edge responses is larger.
Coarser scales are on the other hand necessary to capture the
shadow edge, while the localization of e.g. the finger
tip is poor at coarse scales.
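The first-order part of this edge definition can be sketched as follows: the gradient magnitude is computed from Gaussian derivative filters at a tunable scale (illustrative Python assuming SciPy; the non-maximum-suppression conditions of equation (3) are omitted for brevity, and the step image is a hypothetical example):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_magnitude(f, t):
    """First-order edge strength |grad L| at scale t, computed with
    Gaussian derivative filters of standard deviation sigma = sqrt(t)."""
    s = np.sqrt(t)
    Lx = gaussian_filter(f, s, order=(0, 1))  # derivative along columns
    Ly = gaussian_filter(f, s, order=(1, 0))  # derivative along rows
    return np.hypot(Lx, Ly)

# A vertical step edge between columns 31 and 32.
f = np.zeros((64, 64))
f[:, 32:] = 1.0

Lv = gradient_magnitude(f, t=4.0)
col = int(np.argmax(Lv[32]))  # strongest response along one image row
```

The response peaks at the step position for any scale; what changes with t is how much surrounding noise also produces strong responses, which is precisely the trade-off seen in figure 1.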
Figure 1:
Edges and bright ridges detected at three different scale levels.

The right column in figure 1 shows corresponding
results of multi-scale ridge extraction.
A (bright) ridge point is defined as a point where the intensity assumes
a local maximum in the main eigendirection of the Hessian matrix
(Haralick, 1983; Koenderink & van Doorn, 1994).
In terms of a local (p, q)-system aligned with the eigendirections of
the Hessian matrix (so that the mixed directional derivative L_{pq} is zero),
this ridge definition can be written

\{ L_p = 0, \quad L_{pp} < 0, \quad |L_{pp}| \geq |L_{qq}| \}   (4)

while in terms of a local (u, v)-system with the v-direction parallel
to the gradient direction and the u-direction perpendicular, the ridge
definition assumes the form

\{ L_{uv} = 0, \quad L_{uu}^2 - L_{vv}^2 \geq 0, \quad L_{uu} < 0 \}   (5)
As can be seen, the types of ridge curves that are obtained are
strongly scale dependent.
At fine scales, the ridge detector mainly responds to spurious
noise structures.
At coarser scales, it gives rise to ridge curves corresponding to the
fingers, and at a yet coarser scale to a ridge curve corresponding to
the arm as a whole.
Notably, these ridge descriptors are much more sensitive to the
choice of scale levels than the edge features in
figure 1(a).
In particular, no single scale level is appropriate for describing
the dominant ridge structures in this image.
The experimental results in figure 1
emphasize the need for adapting the scale levels for feature
detection to the local image structures.
How should such an adaptation be performed without a priori
information about what image information is important?
The subject of this section is to give an intuitive motivation
of how size estimation can be performed, by studying
the evolution properties over scales of scale-normalized derivatives.
The basic idea is as follows: At any scale level t, we define a normalized
derivative operator by multiplying each spatial derivative operator
\partial_x by the scale parameter t raised to \gamma/2,
where \gamma is a (so far free) parameter:

\partial_\xi = t^{\gamma/2} \, \partial_x   (6)
Then, we propose that automatic scale selection can be performed by
detecting the scales at which normalized differential entities
assume local maxima with respect to scale.
Intuitively, this approach corresponds to selecting
the scales at which the operator response is strongest.
For a sine wave

f(x) = \sin(\omega_0 x)   (7)

the scale-space representation is given by

L(x;\, t) = e^{-\omega_0^2 t/2} \sin(\omega_0 x)   (8)

and the amplitude of the m-th order normalized derivative operator is

L_{\xi^m,\max}(t) = t^{m\gamma/2} \, \omega_0^m \, e^{-\omega_0^2 t/2}   (9)

This function assumes a unique maximum over scales at

t_{\max} = \frac{\gamma \, m}{\omega_0^2}   (10)

implying that the corresponding value \sqrt{t_{\max}}
is proportional to the wavelength \lambda_0 = 2\pi/\omega_0
of the signal.
In other words, the wavelength of the signal can be detected from
the maximum over scales in the scale-space signature of the signal
(see figure 2).
In this respect, the scale selection approach has similar properties
to a local Fourier analysis, with the difference that there is no
need for explicitly determining a window size for computing the
Fourier transform.
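This closed-form prediction is easy to check numerically: scan a range of scales, measure the amplitude of the normalized first derivative at each, and locate the maximum (a minimal sketch assuming SciPy; the frequency, signal length and scale grid are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

gamma = 1.0
omega0 = 0.5  # angular frequency of the (hypothetical) test signal
x = np.arange(4096)
f = np.sin(omega0 * x)

ts = np.linspace(0.5, 40.0, 400)  # scan of scale levels t = sigma^2
amp = []
for t in ts:
    Lx = gaussian_filter1d(f, sigma=np.sqrt(t), order=1)
    # amplitude of the gamma-normalized first derivative at this scale,
    # measured away from the signal boundaries
    amp.append(t ** (gamma / 2) * np.abs(Lx[1000:3000]).max())

t_hat = ts[int(np.argmax(amp))]
# theory (eq. (10)): t_max = gamma * m / omega0**2 = 4.0 here (m = 1)
```

The numerically selected scale agrees with equation (10), so the maximum over scales indeed encodes the wavelength of the signal.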
Figure 2:
The amplitude of first-order normalized derivatives as function of
scale for sinusoidal input signals of different frequency.
If a local maximum over scales in the normalized differential expression
is detected at the position x_0 and the
scale t_0 in the scale-space representation
of a signal f,
then for a signal f'
rescaled by a scaling factor s
such that

f'(x') = f(x) \quad \text{with} \quad x' = s\,x   (11)

the corresponding local maximum over scales is assumed at

(x_0';\; t_0') = (s\,x_0;\; s^2 t_0)   (12)
This property shows that the selected scales follow any size variations
in the image data, and this property holds for all homogeneous
polynomial differential invariants
(see (Lindeberg, 1998b)).
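The covariance property can also be verified numerically with the sine-wave example: stretching the pattern by a factor s = 2 should multiply the selected scale by s² = 4 (a sketch assuming SciPy; the frequencies, signal length and scale grid are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def selected_scale(omega, ts):
    """Scale maximizing the normalized (gamma = 1) first-derivative
    amplitude of sin(omega * x) over the candidate scales ts."""
    x = np.arange(8192)
    f = np.sin(omega * x)
    amp = [np.sqrt(t) * np.abs(gaussian_filter1d(f, np.sqrt(t), order=1)[2000:6000]).max()
           for t in ts]
    return ts[int(np.argmax(amp))]

ts = np.linspace(0.5, 100.0, 1000)
t1 = selected_scale(0.5, ts)   # original signal
t2 = selected_scale(0.25, ts)  # the same pattern stretched by s = 2
# scale covariance (eq. (12)): t2 should be close to s**2 * t1
```

This is the discrete counterpart of the statement that the selected scale levels commute with size variations in the image domain.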
In view of the above-mentioned scale invariance result,
one may ask the following.
Imagine that we adopt the idea of performing local scale selection
by local maximization of some sort of normalized derivatives
(not yet specified).
Moreover, let us impose the requirement that the scale levels
selected by this scale selection mechanism should commute
with size variations in the image domain according to
equations (11) and (12).
Then, what types of scale normalizations are possible?
Interestingly, it can then be shown that the form of the
normalized derivative operator (6)
arises by necessity,
i.e., with the free parameter \gamma it spans
all possible reasonable scale normalizations (see (Lindeberg, 1998b) for a proof).
The idea is that the normalized scale-space derivatives will be used as a
basis for expressing a large class of image operations,
formulated in terms of normalized differential entities.
Equivalently, such derivatives can be computed by applying normalized
Gaussian derivative operators

\partial_{\xi^m} g(\cdot;\, t) = t^{m\gamma/2} \, \partial_{x^m} g(\cdot;\, t)   (13)

to the original D-dimensional image.
It is straightforward to show that the L_p-norm of such a
normalized (one-dimensional) Gaussian derivative kernel is

\| \partial_{\xi^m} g(\cdot;\, t) \|_p = t^{\frac{1}{2}\,(m(\gamma - 1) + \frac{1}{p} - 1)} \, \| \partial_{\xi^m} g(\cdot;\, 1) \|_p   (14)

which means that the normalized derivative concept
can be interpreted as a normalization to constant L_p-norm
over scales, with p given by

\frac{1}{p} = 1 + m\,(1 - \gamma)   (15)

The special case \gamma = 1 corresponds to L_1-normalization
for all orders m.
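For γ = 1, this L1-normalization is easy to check numerically by sampling the normalized derivative kernel as the response to a discrete impulse (a sketch assuming SciPy; the signal length and the scale values are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def normalized_deriv_l1(t, m=1, gamma=1.0):
    """L1-norm of the gamma-normalized m-th order Gaussian derivative
    kernel, obtained by filtering a discrete impulse."""
    impulse = np.zeros(4097)
    impulse[2048] = 1.0
    k = gaussian_filter1d(impulse, sigma=np.sqrt(t), order=m)
    return t ** (m * gamma / 2) * np.abs(k).sum()

norms = [normalized_deriv_l1(t) for t in (4.0, 16.0, 64.0)]
# for gamma = 1 and m = 1 the L1-norm is scale independent
# (analytically 2/sqrt(2*pi) in the continuous case)
```

Up to discretization error, the three norms coincide, illustrating the constant-norm interpretation of the γ-normalized derivatives.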
Another interesting interpretation can be made with respect to
image data f
having self-similar power spectra of the form

S_f(\omega) = |\hat{f}(\omega)|^2 = |\omega|^{-2\alpha}   (16)

Let us consider the following class of energy measures,
measuring the amount of information in the m-th order
normalized Gaussian derivatives:

E_{m,\mathrm{norm}}(t) = \int_{x \in \mathbb{R}^2} \sum_{i+j=m} \binom{m}{i} \, t^{m\gamma} \, L_{x^i y^j}^2(x;\, t) \, dx   (17)

In the two-dimensional case, this class
includes the following differential energy measures:

E_{1,\mathrm{norm}}(t) = \int t^{\gamma} (L_x^2 + L_y^2) \, dx   (18)

E_{2,\mathrm{norm}}(t) = \int t^{2\gamma} (L_{xx}^2 + 2 L_{xy}^2 + L_{yy}^2) \, dx   (19)

E_{3,\mathrm{norm}}(t) = \int t^{3\gamma} (L_{xxx}^2 + 3 L_{xxy}^2 + 3 L_{xyy}^2 + L_{yyy}^2) \, dx   (20)

E_{4,\mathrm{norm}}(t) = \int t^{4\gamma} (L_{xxxx}^2 + 4 L_{xxxy}^2 + 6 L_{xxyy}^2 + 4 L_{xyyy}^2 + L_{yyyy}^2) \, dx   (21)

It can be shown that the variation over scales of these
energy measures is given by

E_{m,\mathrm{norm}}(t) \sim t^{m(\gamma - 1) + \alpha - 1}   (22)

and this expression is scale independent if and only if

\alpha = 1 + m\,(1 - \gamma)   (23)

Hence, the normalized derivative model is neutral
with respect to power spectra of the form

S_f(\omega) = |\omega|^{-2\,(1 + m(1 - \gamma))}   (24)

which for \gamma = 1 reduces to S_f(\omega) = |\omega|^{-2}.
Empirical studies on natural images often
show a qualitative behaviour similar to this (Field, 1987).
The results presented so far apply generally to a large
class of image descriptors formulated in terms of differential
entities derived from a multiscale representation.
The idea is that the differential entity used for automatic
scale selection, together with its associated
normalization parameter \gamma,
should be determined for the task at hand.
In this section, we shall present several examples of how this scale
selection mechanism can be expressed in practice for various types
of feature detectors.
Let us first turn to the problem of edge detection,
using the differential definition of edges
expressed in equation (3).
A natural measure of edge strength that can be associated
with this edge definition is given by the normalized gradient magnitude
(squared)

G_{\gamma\text{-norm}} L = t^{\gamma} L_v^2   (25)

If we apply the edge definition (3) at all
scales, we will sweep out an edge surface in scale-space.
On this edge surface, we can define a scale-space edge
as a curve where the edge strength measure assumes a local maximum
over scales

\partial_t (G_{\gamma\text{-norm}} L) = 0, \quad \partial_{tt} (G_{\gamma\text{-norm}} L) < 0   (26)

To determine the normalization parameter \gamma, we can consider
an idealized edge model in the form of a diffuse step edge

f(x_1, x_2) = \int_{x_1' = -\infty}^{x_1} g(x_1';\, t_0) \, dx_1'   (27)

It is straightforward to show that the edge strength measure is maximized at

t_{\max} = \frac{\gamma}{1 - \gamma} \, t_0   (28)

If we require that this maximum is assumed at t_{\max} = t_0,
implying that we use a similar derivative filter for
detecting the edge as the shape of the differentiated
edge, then we obtain \gamma = 1/2.
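The γ = 1/2 result can be verified numerically on the diffuse step edge model: for an edge of diffuseness t0, the maximum over scales of the normalized edge strength t^γ L_v² should occur at t = t0 (a sketch assuming SciPy; the value of t0 and the scale grid are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d
from scipy.special import erf

gamma = 0.5
t0 = 16.0  # diffuseness (variance) of the idealized step edge
x = np.arange(-256, 257, dtype=float)
f = 0.5 * (1.0 + erf(x / np.sqrt(2 * t0)))  # step pre-smoothed to variance t0

ts = np.linspace(1.0, 64.0, 400)
strength = []
for t in ts:
    Lx = gaussian_filter1d(f, sigma=np.sqrt(t), order=1)
    # gamma-normalized edge strength t^gamma * Lv^2 at the edge center
    strength.append(t ** gamma * Lx[256] ** 2)

t_hat = ts[int(np.argmax(strength))]
# theory (eq. (28)): maximum at t = (gamma/(1-gamma)) * t0 = t0 for gamma = 1/2
```

The selected scale tracks the diffuseness of the edge, which is exactly the behaviour visible for the shadow edges in figure 4.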
Figure 3:
The result of edge detection with automatic scale selection,
based on local maxima over scales of the first-order edge strength
measure G_{\gamma\text{-norm}} L with \gamma = 1/2.
The middle column shows all the scale-space edges, whereas the right
column shows the 100 edge curves having the highest significance values.

Figure 4:
Three-dimensional view of the 10 most significant
scale-space edges extracted from the arm image.
From the vertical dimension, representing the selected scale
measured in dimension length
(in units of \sigma = \sqrt{t}),
it can be seen how coarse scales are selected for the diffuse
edge structures (due to illumination effects)
and that finer scales are selected for the sharp edge structures
(the object boundaries).

Figure 3 shows the results of detecting
edges from two images in this way.
The middle column shows all scale-space edges that satisfy the definition
(26), while the right column shows the result of
selecting the most significant edges, by computing a significance measure
as the integrated normalized edge strength measure along each connected
edge curve \Gamma:

\int_{(x;\, t) \in \Gamma} \sqrt{G_{\gamma\text{-norm}} L(x;\, t)} \; ds   (29)
Figure 4 shows a threedimensional view
of the 10 most significant scalespace edges from the hand image,
with the selected scales illustrated by the height over the image plane.
Observe that fine scales are selected for the edges corresponding to
object boundaries.
This result is consistent with the empirical finding that rather
fine scales are usually appropriate for extracting object edges.
For the shadow edges on the other hand,
successively coarser scales are selected with increasing
degree of diffuseness, in agreement with the analysis of
the idealized edge model in (28).
Let us next turn to the problem of ridge detection,
and sweep out a ridge surface in scale-space by applying the
ridge definition (4) at all scales.
Then, given the following ridge strength measure

A_{\gamma\text{-norm}} L = t^{2\gamma} (L_{pp} - L_{qq})^2 = t^{2\gamma} \left( (L_{xx} - L_{yy})^2 + 4 L_{xy}^2 \right)   (30)

which is the square difference between the eigenvalues
L_{pp} and L_{qq} of
the normalized Hessian matrix, let us define a
scale-space ridge as a curve on the ridge surface
where the normalized ridge strength measure assumes
local maxima with respect to scale:

\partial_t (A_{\gamma\text{-norm}} L) = 0, \quad \partial_{tt} (A_{\gamma\text{-norm}} L) < 0   (31)
To determine the normalization parameter \gamma, let us consider
a Gaussian ridge

f(x_1, x_2) = g(x_1;\, t_0)   (32)

The maximum over scales in A_{\gamma\text{-norm}} L
is assumed at

t_{\max} = \frac{2\gamma}{3 - 2\gamma} \, t_0   (33)

and by requiring this scale value to be equal to t_0
(implying that a similar rotationally aligned Gaussian derivative filter
is used for detecting the ridge as the shape of the second derivative
of the Gaussian ridge) we obtain \gamma = 3/4.
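The γ = 3/4 result can be checked numerically on the Gaussian ridge model: for a ridge profile of width t0, the normalized ridge strength at the ridge center should be maximized at t = t0 (a sketch assuming SciPy; since the ideal ridge is translation invariant along its length, a one-dimensional cross-section suffices; t0 and the scale grid are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

gamma = 0.75
t0 = 16.0  # width (variance) of the Gaussian ridge profile
x = np.arange(-256, 257, dtype=float)
f = np.exp(-x**2 / (2 * t0)) / np.sqrt(2 * np.pi * t0)  # cross-section g(x; t0)

ts = np.linspace(1.0, 64.0, 400)
strength = []
for t in ts:
    Lxx = gaussian_filter1d(f, sigma=np.sqrt(t), order=2)
    # gamma-normalized ridge strength (t^gamma * Lpp)^2 at the ridge center
    strength.append((t ** gamma * Lxx[256]) ** 2)

t_hat = ts[int(np.argmax(strength))]
# theory (eq. (33)): maximum at t = 2*gamma*t0/(3 - 2*gamma) = t0 for gamma = 3/4
```

The selected scale thus directly measures the width of the ridge, which is what makes the back-projection in figure 6 meaningful.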
Figure 6:
Alternative illustration of
the five strongest scale-space ridges extracted from
the image of the arm in
figure 5.
Each ridge is back-projected onto
a dark copy of the original image as the
union of a set of circles centered on the ridge curve,
with radius proportional to the
selected scale at that point.
Figure 5:
The 100 and 10 strongest bright ridges, respectively, extracted using
scale selection based on local maxima over scales of
A_{\gamma\text{-norm}} L (with \gamma = 3/4).

Figure 5
shows the result of applying such a ridge detector
to an image of an arm and an aerial image of a suburb, respectively.
The ridges have been ranked on significance,
by integrating the normalized ridge strength
measure along each connected ridge curve \Gamma:

\int_{(x;\, t) \in \Gamma} \sqrt{A_{\gamma\text{-norm}} L(x;\, t)} \; ds   (34)
Observe that descriptors corresponding to the roads are selected
from the aerial image.
Moreover, for the arm image, a coarsescale descriptor is extracted for
the arm as a whole, whereas the individual fingers
give rise to ridge curves at finer scales.
The Laplacian operator \nabla^2 L = L_{xx} + L_{yy}
is a commonly used entity for blob detection,
since it gives a strong response at the
center of blob-like image structures.
To formulate a blob detector with automatic scale selection,
we can consider the points in scale-space at which the
square of the normalized Laplacian

(\nabla^2_{\mathrm{norm}} L)^2 = t^2 (L_{xx} + L_{yy})^2   (35)

assumes maxima with respect to space and scale.
Such points are referred to as scale-space maxima
of (\nabla^2_{\mathrm{norm}} L)^2.
Figure 8:
Three-dimensional view of the 150 strongest scale-space
maxima of the square of the normalized Laplacian of the Gaussian,
computed from the sunflower image.
Figure 7:
Blob detection by detection of scale-space maxima of the
normalized Laplacian operator:
(a) Original image.
(b) Circles representing the 250
scale-space maxima of (\nabla^2_{\mathrm{norm}} L)^2 having the
strongest normalized response.
(c) Circles overlaid on the image.

For a Gaussian blob model defined by

f(x_1, x_2) = g(x_1, x_2;\, t_0)   (36)

it can be shown that the selected scale at the
center of the blob is given by

\hat{t} = t_0   (37)

Hence, the selected scale directly reflects the width t_0
of the Gaussian blob.
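This prediction can be checked numerically: for a Gaussian blob of variance t0, the squared normalized Laplacian response at the blob center should peak at t = t0 (a sketch assuming SciPy; t0, the grid size and the scale scan are arbitrary choices):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

t0 = 9.0  # variance of the (hypothetical) Gaussian blob
y, x = np.mgrid[-64:65, -64:65].astype(float)
f = np.exp(-(x**2 + y**2) / (2 * t0)) / (2 * np.pi * t0)

ts = np.linspace(1.0, 36.0, 350)
resp = []
for t in ts:
    s = np.sqrt(t)
    lap = gaussian_filter(f, s, order=(2, 0)) + gaussian_filter(f, s, order=(0, 2))
    # square of the normalized Laplacian, (t * (Lxx + Lyy))^2, at the blob center
    resp.append((t * lap[64, 64]) ** 2)

t_hat = ts[int(np.argmax(resp))]
# theory (eq. (37)): the selected scale equals the blob variance t0
```

A full blob detector then searches for such maxima jointly over space and scale, rather than only at a known blob center as in this sketch.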
Figures 7-8
show the result of applying this blob detector to an image of a sunflower field.
In figure 7,
each blob feature detected as a scale-space maximum
is illustrated by a circle,
with its radius proportional to the selected scale.
Figure 8 shows a three-dimensional illustration
of the same data set, by marking the scale-space extrema by
spheres in scale-space.
Observe how well the size variations in the image are
captured by this structurally very simple operation.
A commonly used technique for detecting junction candidates in
grey-level images is to detect extrema in the curvature of level curves
multiplied by the gradient magnitude raised to some power
(Kitchen & Rosenfeld, 1982; Koenderink & Richards, 1988).
A special choice is to multiply the level curve
curvature by the gradient magnitude raised to
the power of three.
This leads to the differential invariant
\tilde{\kappa} = L_v^2 L_{uu},
with the corresponding normalized expression

\tilde{\kappa}_{\mathrm{norm}} = t^{2\gamma} \, L_v^2 L_{uu}   (38)
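Writing the invariant out in Cartesian derivatives, L_v^2 L_{uu} = L_x^2 L_{yy} - 2 L_x L_y L_{xy} + L_y^2 L_{xx}, a corner-strength computation at one scale can be sketched as follows (illustrative Python assuming SciPy; the synthetic corner image and the scale value are arbitrary choices, and the search over scales is omitted):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def corner_strength(f, t, gamma=1.0):
    """Gamma-normalized rescaled level-curve curvature
    t^(2*gamma) * (Lx^2*Lyy - 2*Lx*Ly*Lxy + Ly^2*Lxx)."""
    s = np.sqrt(t)
    Lx  = gaussian_filter(f, s, order=(0, 1))
    Ly  = gaussian_filter(f, s, order=(1, 0))
    Lxx = gaussian_filter(f, s, order=(0, 2))
    Lxy = gaussian_filter(f, s, order=(1, 1))
    Lyy = gaussian_filter(f, s, order=(2, 0))
    kappa = Lx**2 * Lyy - 2 * Lx * Ly * Lxy + Ly**2 * Lxx
    return t ** (2 * gamma) * kappa

# Synthetic L-shaped corner at (row, col) = (20, 20).
f = np.zeros((64, 64))
f[20:, 20:] = 1.0
K = np.abs(corner_strength(f, t=4.0))
i, j = np.unravel_index(int(np.argmax(K)), K.shape)
```

On straight edge segments L_{uu} is close to zero, so the measure responds selectively near the corner point, while the full detector additionally maximizes the squared normalized response over scales.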
Figure 9 shows the result of detecting
scale-space extrema from an image with corner structures at
multiple scales.
Observe that a coarse-scale response is obtained
for the large-scale corner structure as a whole,
whereas the superimposed corner structures
of smaller size give rise to scale-space maxima at finer scales
(see figure 10
for results on real-world data).
Figure 9:
Three-dimensional view of scale-space maxima of
\tilde{\kappa}_{\mathrm{norm}}^2 computed for a large-scale
corner with superimposed corner structures at finer
scales.

We argue that a scale selection mechanism is an essential tool
whenever our aim is to automatically interpret the image data that arise from
observations of a dynamic world.
For example, if we are tracking features in the image domain,
then it is essential that the scale levels are adapted to
the size variation that may occur over time.
Figure 10 shows a
comparison between a feature tracker with automatic scale selection
(Bretzner & Lindeberg, 1998a) and a corresponding feature tracker operating at
fixed scales.
(Both feature trackers are based on corner detection from local maxima
of the corner strength measure (38),
followed by a localization stage (Lindeberg, 1998b) and a multicue
verification (Bretzner & Lindeberg, 1998a).)
As can be seen from the results in
figure 10,
three out of the ten features
are lost by the fixed-scale feature tracker,
compared to the adaptive-scale tracker.
Figure 10:
Comparison between feature tracker with automatic scale
selection and a feature tracker operating at fixed scale.
The left column shows a set of corner features in the
initial frame, and the right column gives a snapshot
after 65 frames.
Initial frame with 14 detected corners 
Tracked features with automatic scale selection 

Tracked features using fixed scales 


A brief explanation of this phenomenon is that if we use a
standard algorithm for feature detection at a fixed scale,
followed by hypothesis evaluation using a fixed-size correlation window,
then the feature tracker will after a few frames fail to
detect some of the features.
The reason is simply that the corner
feature no longer exists at the predetermined scale.
In practice, this usually occurs for blunt corners.
An attractive property of a feature detector with automatic scale
selection is that it allows us to capture less distinct features
than those that occur on man-made objects.
Specifically, we have demonstrated how it makes it possible to
capture features associated with human actions.
Figure 11 illustrates one idea we have been working on in the
area of visually guided human-computer interaction.
The idea is to have a camera that monitors the motion of a
human hand. At each frame, blob and ridge features are extracted
corresponding to the fingers and the fingertips.
Assuming rigidity, the motion of the image features allows
us to estimate the three-dimensional rotation of the
hand (Bretzner & Lindeberg, 1998b).
These motion estimates can in turn be used for controlling other
computerized equipment;
thus serving as a ``3D hand mouse'' (Lindeberg & Bretzner, 1998).
Figure 11:
Illustration of the concept of a ``3D hand mouse''.
The idea is to monitor the motion of a human hand
(here, via a set of tracked image features) and
to use estimates of the hand motion for controlling
other computerized equipment (here, the visualization
of a cube).
Controlling hand motion 
Detected ridges and blobs 
Controlled object 

We have presented a general framework for automatic scale selection
as well as examples of how this scale selection mechanism can be
integrated with other feature modules.
The experiments demonstrate how abstractions of the
image data can be computed in a conceptually very simple way,
by analysing the behaviour of image features over scales
(sometimes referred to as ``deep structure'').
For applications in other areas as well as related works, see
(Lindeberg, 1998a,b,1999) and (Almansa & Lindeberg, 1999; Wiltschi et al., 1998).

Almansa, A. & Lindeberg, T. (1999), Fingerprint enhancement by shape
adaptation of scale-space operators with automatic scale selection,
Technical Report ISRN KTH/NA/P-99/01-SE, Dept. of Numerical Analysis
and Computing Science, KTH, Stockholm, Sweden. (Submitted).

Bretzner, L. & Lindeberg, T. (1998a), `Feature tracking with automatic
selection of spatial scales', Computer Vision and Image Understanding
71(3), 385-392.

Bretzner, L. & Lindeberg, T. (1998b), Use your hand as a 3-D mouse, or,
relative orientation from extended sequences of sparse point and line
correspondences using the affine trifocal tensor, in H. Burkhardt &
B. Neumann, eds, `Proc. 5th European Conference on Computer Vision',
Vol. 1406 of Lecture Notes in Computer Science, Springer Verlag, Berlin,
Freiburg, Germany, pp. 141-157.

Burt, P. J. & Adelson, E. H. (1983), `The Laplacian pyramid as a compact
image code', IEEE Trans. Communications 31(4), 532-540.

Canny, J. (1986), `A computational approach to edge detection', IEEE
Trans. Pattern Analysis and Machine Intell. 8(6), 679-698.

Field, D. J. (1987), `Relations between the statistics of natural images
and the response properties of cortical cells', J. of the Optical
Society of America 4, 2379-2394.

Florack, L. M. J. (1997), Image Structure, Series in Mathematical
Imaging and Vision, Kluwer Academic Publishers, Dordrecht, Netherlands.

Florack, L. M. J., ter Haar Romeny, B. M., Koenderink, J. J. &
Viergever, M. A. (1992), `Scale and the differential structure of
images', Image and Vision Computing 10(6), 376-388.

Haralick, R. M. (1983), `Ridges and valleys in digital images', Computer
Vision, Graphics, and Image Processing 22, 28-38.

Kitchen, L. & Rosenfeld, A. (1982), `Gray-level corner detection',
Pattern Recognition Letters 1(2), 95-102.

Koenderink, J. J. (1984), `The structure of images', Biological
Cybernetics 50, 363-370.

Koenderink, J. J. & Richards, W. (1988), `Two-dimensional curvature
operators', J. of the Optical Society of America 5(7), 1136-1141.

Koenderink, J. J. & van Doorn, A. J. (1992), `Generic neighborhood
operators', IEEE Trans. Pattern Analysis and Machine Intell. 14(6),
597-605.

Koenderink, J. J. & van Doorn, A. J. (1994), `Two-plus-one-dimensional
differential geometry', Pattern Recognition Letters 15(5), 439-444.

Korn, A. F. (1988), `Toward a symbolic representation of intensity
changes in images', IEEE Trans. Pattern Analysis and Machine Intell.
10(5), 610-625.

Lindeberg, T. (1994), Scale-Space Theory in Computer Vision, The Kluwer
International Series in Engineering and Computer Science, Kluwer
Academic Publishers, Dordrecht, Netherlands.

Lindeberg, T. (1998a), `Edge detection and ridge detection with
automatic scale selection', Int. J. of Computer Vision 30(2), 117-154.

Lindeberg, T. (1998b), `Feature detection with automatic scale
selection', Int. J. of Computer Vision 30(2), 77-116.

Lindeberg, T. (1999), Principles for automatic scale selection, in
B. Jähne et al., eds, `Handbook on Computer Vision and Applications',
Academic Press, Boston, USA, pp. 239-274.

Lindeberg, T. & Bretzner, L. (1998), Method and arrangement for
transferring information through motion detection, and use of the
arrangement (in Swedish). Patent pending.

Prewitt, J. M. S. (1970), Object enhancement and extraction, in
A. Rosenfeld & B. S. Lipkin, eds, `Picture Processing and
Psychophysics', Academic Press, New York, pp. 75-149.

Roberts, L. G. (1965), Machine perception of three-dimensional solids,
in J. T. Tippett et al., eds, `Optical and Electro-Optical Information
Processing', MIT Press, Cambridge, Massachusetts, pp. 159-197.

Sporring, J., Nielsen, M., Florack, L. & Johansen, P., eds (1996),
Gaussian Scale-Space Theory: Proc. PhD School on Scale-Space Theory,
Series in Mathematical Imaging and Vision, Kluwer Academic Publishers,
Copenhagen, Denmark.

ter Haar Romeny, B., Florack, L., Koenderink, J. J. & Viergever, M.,
eds (1997), Scale-Space Theory in Computer Vision: Proc. First Int.
Conf. Scale-Space'97, Lecture Notes in Computer Science, Springer
Verlag, New York, Utrecht, Netherlands.

Torre, V. & Poggio, T. A. (1986), `On edge detection', IEEE Trans.
Pattern Analysis and Machine Intell. 8(2), 147-163.

Wiltschi, K., Pinz, A. & Lindeberg, T. (1998), An automatic assessment
scheme for steel quality inspection, Technical Report ISRN
KTH/NA/P-98/20-SE, Dept. of Numerical Analysis and Computing Science,
KTH, Stockholm, Sweden.

Witkin, A. P. (1983), Scale-space filtering, in `Proc. 8th Int. Joint
Conf. Art. Intell.', Karlsruhe, West Germany, pp. 1019-1022.
Tony Lindeberg, 1999-09-16