Miroslaw Bober, Francoise Preteux and Y­
M Kim
Checked by:
Miroslaw Bober (corresponding author)
Visual Information Laboratory
Mitsubishi Electric Information Technology Centre Europe
20 Frederick Sanger Road
The Surrey Research Park
Guildford, Surrey GU2 5YD, United Kingdom
(+44) 1483 885818
(+44) 1483 579107
[email protected]
Prof. F. PRETEUX Institut National des Telecommunications 9, Rue Charles Fourier 91011 Evry Cedex FRANCE Phone : (+33) Fax : (+33) email:
[email protected]­ Whoi­Yul Yura Kim
Associate Professor
School of Electrical and Computer Engineering
Hanyang University
Tel : +82­2­2290­0351
Fax : +82­2­2292­6316
Mob : 011­9783­2353
E­mail:[email protected]
Miroslaw Bober, Francoise Preteux and Y­M Kim
1 Introduction
This chapter is concerned with techniques for describing and matching shape features of 2­dimensional and 3­dimensional objects. The Visual Part of the MPEG­7 standard defines three descriptors with different properties: the contour­based shape, the region­based shape and the 3D Shape spectrum descriptors. The theory underpinning each of them, their respective properties and application domains are explained here in some detail. We also present examples of applications demonstrating the ability and applicability of the MPEG­7 shape descriptors. Object shape features are very powerful when used in similarity search and retrieval. This is because the shape of objects is usually strongly linked to object functionality and identity. Humans can recognize characteristic objects solely from their shapes – an illustration that shape often carries semantic information. This property distinguishes shape from some other elementary visual features, such as colour or texture. A large body of research has been devoted to shape­based recognition, retrieval and indexing [1].
Many applications, including those concerned with visual objects retrieval or indexing, are likely to use shape features. For example, an intelligent video security system may use shape techniques to determine the identity of an intruder. Shape descriptors can be used in e­commerce, where it is difficult to use a text index to specify the required shape of the object, and query­by­example is simpler and faster. In fact, any application targeting visual scene interpretation and object recognition cannot do it efficiently without powerful shape tools. Naturally, the best retrieval performance can be achieved by using a combination of visual (and possibly other) feature descriptors, but that subject is beyond the scope of this chapter. In the following section, an overview of the MPEG­7 shape descriptors is presented, and their functionalities, properties and application domains are analysed. Section 3 describes the performance tests and experimental results obtained in MPEG­7 Core Experiments, including the databases used in the tests. The region­based shape descriptor, contour­based shape descriptor and 3­D Shape Descriptors are presented in subsequent Sections 4,5 and 6. Section 7 deals with a combination of the Multi View and contour shape descriptors. Example applications are outlined in Section 8 and conclusions are drawn in Section 9. 2 Overview of Shape Descriptors The notion of object shape and shape similarity, while intuitively clear, may have many meanings. Most of the real­world objects are 3­dimensional and a 3D­shape descriptor has been developed by MPEG [2,22] and is presented in Section 6. However, in the image and video world, 2D projections of the real­world 3D objects onto image plane are visible and often their exact 3D shapes are unknown. To cover this important domain, MPEG­7 provides contour­based shape and region­based shape tools. 3
Figure 1 Example of contour­based and region­based region similarity
Figure 1 shows that even in the 2D case, there can be two notions of similarity: based on the object contour and based on the distribution of all pixels within contour. Objects in the first row have similar spatial distribution of pixels and are therefore similar according to a region­based criterion. However, they clearly have different outline contours. When contour­based similarity is concerned, objects shown in each column are similar. Posing a query with the object located in the first row and second column will result in retrieved object from the first row (if region­based similarity is concerned) or second column (if contour­based similarity is concerned). MPEG­7 supports both notions of similarity having region­based and contour­based shape descriptors.
2.2 Region­based shape descriptor
Figure 2 Example of shapes where region­based shape is applicable
The region­based shape descriptor belongs to the broad class of shape analysis techniques based on moments . It uses a complex 2D Angular Radial Transformation (ART) , defined on a unit disk in polar coordinates. The region­based shape descriptor expresses pixel distribution within a 2D object region; it can describe complex objects consisting of multiple disconnected regions as well as simple objects with or without holes (Figure 2). Some important features of this descriptor are: • It gives a compact and efficient way of describing properties of multiple disjoint regions simultaneously,
• It can cope with errors in segmentation where an object is split into disconnected sub­regions, provided that the information which sub­regions contribute to the object is available and used during the descriptor extraction, • The descriptor is robust to segmentation noise, for example salt­and­paper noise. The ART descriptor is considered in detail in Section 4. 2.3 Contour­based shape descriptor
Figure 3 Examples of shapes where contour­based shape is applicable
The contour shape descriptor is based on the Curvature Scale Space representation of the contour . Objects for which characteristic shape features are contained in their contour are described efficiently by the contour shape descriptor – see Figure 3. Some important features of this descriptor are: • It emulates well the shape similarity perception of the human visual system and has good shape generalisation properties. 4
It is robust to significant non­rigid deformations (e.g. the outline of a running person or walking beetle) and to perspective transformations
• It is a compact and its size adjusts depending on the contour complexity.
Note that, as shown before, objects may have similar region shape properties and still exhibit different contour properties (and vice versa). If a complex object consists of multiple disjoint regions, each of region component contours can be described separately, using the contour­based descriptor and the MPEG­7 description scheme. The descriptor, its properties and example applications are considered in detail in Section 5.
2.4 MultipleView Descriptor for Shape Sometimes a three­dimensional object model may not be known, but many applications require more then just 2D information. Let’s take e­shopping as an example ­ the customer may imagine the required product, say a flower vase, but there is no actual 3D object from which a 3D descriptor could be extracted. Such a mental picture consists of a set of 2D views, which when combined describe the 3D properties of the object. To cater for such situations, a combination of a Multiple­View descriptor and Contour­ shape or Region­shape descriptors can be used. Section 7 provides more details and experimental results. 3. Selection and testing of the MPEG­7 descriptors. All shape descriptors proposed for the MPEG standard were rigorously tested in Core Experiments [3], where retrieval performance, descriptor size, matching complexity and other properties were examined. Based on the test results, the best techniques for the contour­shape, region­shape and 3D shape descriptors were selected. A good descriptor captures characteristic shape features in a concise manner, which should render itself easily to fast search and browsing. It is not required that the original shape can be reconstructed from the descriptor, and therefore a significant compactness of the representation can be achieved. Naturally, it should be invariant to scaling, translation and rotation. Furthermore, as one is frequently dealing with digital (sampled) images, scaling of objects to a relatively small size (as compared to the sampling grid) may result in significant distortions in their shape. It is therefore critical that the shape representation should be robust to such effects. Other types of distortions often present result from the imaging conditions, say a 2D shape may undergo a perspective transformation as a results of change in the viewing angle. Some objects are flexible and the representation should be able to cope with non­
rigid deformations. Finally, the descriptors should cope with distortions typical for object segmentation and extraction processes.
To help the reader understand the properties of the shape descriptors, we will present a short overview of the Core Experiments and the databases used in each category. 3.1 Testing Contour­based shape descriptors. Testing of the contour­based descriptors was performed in Core Experiment 1 (CE1) on a database of over 3400 images. Four experiments, referred to as A­1, A­2, B and C were conducted. Robustness to rotation and scaling (CE1­A1, CE1­A2), similarity­based retrieval and shape generalisation properties (CE1­B) and robustness to non­rigid deformation (CE1­C) were tested. The CE­1 test database consists of three main components:
• Classified Shapes: 1400 shapes of various objects classified into 70 different classes (20 examples in each class). • Geometrically Transformed Shapes: Rotated and scaled versions of the first image in each class of the Classified Shapes Database.
Non­rigid motion shapes: 200 frames taken from the “Bream sequence” depicting swimming fish and about 1100 shapes of fish and other marine creatures were included in this database. 3.1.1 Classified Shapes Database for similarity­based retrieval
Since the notion of shape similarity is subjective and it may be sometimes difficult to precisely rank similarity, a new way of assessing performance was developed. The idea is construct the database in such a way that the shapes create “natural” clusters of similar shapes. The similarity between shapes within the cluster is greater that the similarity of shapes between clusters. Figure 4. Example of shape classes in the contour­based shape test set
This is illustrated on Figure 4, where cluster A contains shapes depicting horses (side view), cluster B contains shapes of camels, and cluster C shapes of cattle. While there is a different degree of variability of shapes in each cluster (see the position of tail or legs), the degree of similarity between objects within each class is greater then the similarity between two objects from two different classes. This property simplifies the assessment of the retrieval performance: A query with a shape from a cluster A should ideally return all remaining shapes from cluster A at top ranks before shapes from any other clusters are retrieved.
This part of the database consists of 1400 shapes in 70 clusters (classes) of 20 shapes each. We have examples of natural objects with significant shape variability, such as beetles (Figure 5, top row); manually­drawn objects (second row), objects extracted from cartoons with lesser amount of shape variability (third row) and drawn objects targeting specifically region­based similarity (bottom row). Each of the 1400 images is used as a query in turn and the number of correctly retrieved shapes in the top 40 is calculated. (An object is considered to be correctly retrieved if it comes from the same class as the shape used for query).
Figure 5. Example shape clusters from the database for similarity­based retrieval
3.1.2 Geometrically Transformed Shapes
All the techniques tested supported rotational and scaling invariance, however it was imperative to test the robustness of each approach to the effects such as changes to the shape due to image re­sampling (rotation & scaling), quantisation of the representation, and the overall behaviour of the software implementations. For this reason, two additional databases were constructed to test scaling invariance (referred to as A­1) and to test rotational­invariance (referred to as A­2). One image from each of the 70 classes was used to derive digitally scaled versions for the A­1 dataset (scaling factors: reduction by factors: 0.3 0.25, 0.2 and 0.1 and enlargement by factor 2). The A­2 dataset contains digitally rotated versions of the reference images (digital rotation by angles 9, 36, 45, 90 and 150 degrees). Thus there are 70 clusters in each of the database, every cluster containing 7 images (the reference image and 6 rotated or scaled versions), 420 images in total. The query is posed using every image in the database, and the number of correct matches in the top 7 shapes is calculated for each query. The overall performance for each transformation is calculated by dividing the total number of correct retrievals by 2940 (The maximum number of correct retrieval for 420 queries). In this experiment, in order to be counted as a correct match the shape must be retrieved in the top N matches, where N is the number of shapes from the same cluster (e.g. N=7). This is a more stringent requirement that in the similarity­
based retrieval experiment, where the correct matches are computed in the 2N top retrievals. The scale­invariance test is the more challenging of the two, due to the fact that some images are severely distorted when reduced by factors of 0.2 or 0.1. This was intended – the objective is to determine the robustness to various degrees of distortions. Figure 6 shows examples where scaling by a factor of 0.1 caused severe distortions, to a degree where even a human may find it difficult to determine a true shape. Figure 6 Shapes of child, rat and bottle (left box) and their scaled­down and re­sampled versions (right box)
3.1.3 Non­rigid motion shapes
The main characteristic features of object shape are usually retained under some non­rigid deformations. Many applications will rely on this property and therefore it is important that good shape descriptors can still cope with such deformations. In this experiment, 200 frames showing a swimming fish (often known in MPEG community as the Bream fish) are combined with a database of 1100 shapes of marine creatures, including many images of other fish. The query is posed with the first image of the Bream fish and the number of instances of other Bream fish in the top 200 retrievals is computed. The shape of the fish is deformed due to swimming and also the fish turns causing more dramatic changes to its shape. Figure 7 shows the query shape bream­000 (left box), three shapes of the bream fish under non­rigid deformations (middle box) and some other fish present in the database (right box).
Figure 7 Example of shapes used to test robustness to non­rigid deformations
3.2 Testing Region­based shape descriptors.
Region­based techniques were tested in the Core Experiment number 2 (CE­2). A database of trademarks is used to evaluate performance on complex shapes consisting of multiple disjoint regions, possibly with holes. The test database consists of two main components: classified trademarks shapes and transformed shapes.
3.2.1 Classified Shapes: The S8 database contained 2800 trademark shapes: 678 objects are classified into 10 different classes (variable number of examples in each class) and 2122 remaining trademarks are un­clustered. Examples of shapes in two classes are shown in Figure 8.
Figure 8 Example of shape classes in the region­based shape test set
3.2.2 Geometrically Transformed Shapes: Four tests are performed to test robustness to: digital rotation and scaling (CE2­A1, A2) analogue rotation and scaling (CE2­A3), and perspective transformation (CE2­A4). • Digital rotation and scaling CE2­A1, A2: 20 randomly selected trademarks were digitally scaled by factors of 3, 1.5, 0.5 and 0.33 and rotated by 9, 36, 45, 90, 120, and 150 degrees. After merging with the original images A­1 contains 100 images and A­2 contains 140 images. The query is performed using all shapes in each class.
• Analogue rotation and scaling CE2­A3. Capturing a video sequence with 20 different trademarks printed on a board and subsequently digitising certain images from the sequences, created the database A­3. Scaling and rotation is effected by changes in the imaging conditions such as camera­to­object distance and relative position. Both, analogue and digital cameras were used. • Perspective transformation is tested using dataset CE2­A4. This dataset was created by capturing video sequences with 20 different trademarks printed on a rigid board. For each reference shape 10 images exhibiting different degree of perspective distortions, were selected (200 shapes in total) and combined with the reference database S8.
Figure 9 Examples of shapes under perspective transformation used in tests
3.3 Testing 3D shape descriptors. Construction a database for testing 3D descriptors was perhaps the most difficult task due to lack of sufficiently large publicly available dataset. Some 1290 models were collected by combining the 3D library used for MPEG­4 experiments, 3D café dataset ( and models of letters generated by one of the participants. The participants divided the dataset into 15 broad categories, such as 4­limbs, cars, aerodynamic­shapes, trees, balloons, letter­a, letter­b etc. As in the previous experiments all classified models were used as queries and the Bull’s Eye score was used. In the following sections we will introduce each shape descriptor, namely the region­based shape descriptor, the contour­based shape descriptor, and the 3D shape descriptor.
4. The Region­based shape descriptor
In this section, we present the MPEG­7 region­based shape descriptor, which can handle objects consisting of single or multiple regions. Three region­based shape descriptors were proposed to MPEG­7: a Multi­Layer Eigen Vector Descriptor (MLEV) , a descriptor based on Zernike moments [28], and a descriptor based on Angular Radial Transform (ART) . The technologies competed through the process of Core Experiments, which concluded that the ART descriptor offers the best overall performance for region­based similarity. The strength of the region­based shape techniques is that they perform well for complex shapes that consist of several disjoint regions, such as trademarks or logos, emblems, clipart and characters. The contour­based shape descriptors, presented in the next section, may not be suitable for such complex objects. One of the reasons is that the contour on an object can be changed drastically if there is a small crack­like opening or an object touching neighbouring objects. For example, the shapes shown in Figure 10 (a) and (c) are very similar according to human perception. Their contours or boundaries, as shown in subfigures (b) and (d), however, are very different depending on whether or not the inner star touches the outer circle. 1
(a) (b) (c) (d)
Figure 10 Visually similar images; (a) An example consists of a touching star and a circle, (b) The contours of (a), (c) Similar example with non­touching star and a circle (d) The contours of (c)
Naturally, contour­based shape descriptors are not appropriate for describing shapes consisting of several disjoint regions; for example the contour­based shape descriptors cannot be used to differentiate the shapes shown in Figure 11.
Figure 11 Example of complex shapes, consisting of multiple regions
The region­based shape descriptor uses the Angular Radial Transform. 4.1 The Definition of ART
ART is the orthogonal unitary transform defined on a unit disk that consists of the complete orthonormal sinusoidal basis functions in polar coordinates. The ART coefficients are defined by:
Fnm = Vnm ( ρ ,θ ) , f ( ρ ,θ ) = ∫
∫ V ( ρ ,θ ), f ( ρ ,θ ) ρdρdθ (1)
where Fnm is an ART coefficient of order n and m, f ( ρ, θ) is an image function in polar coordinates, and Vnm ( ρ, θ) is the ART basis function that are separable along the angular and radial directions, i.e.,
Vnm ( r ,q ) = Am ( q ) Rn ( r ) (2)
In order to achieve rotation invariance, an exponential function is used for the angular basis function,
Am ( q ) =
exp ( jmq ) 2p
The radial basis function is defined by a cosine function,
Rn ( r ) =
2 cos ( p n r
n 0
Real parts and imaginary parts of ART basis functions are shown in Figure 12.
m =0 m=0
n =0
n= 4 n=5
( a) Real part n=3
n= 1
n= 4
(b) Imaginary part
n= 7
Figure 12 Real and Imaginary parts of ART basis functions
Rotation Invariance of the magnitudes of ART
The magnitudes of the ART coefficients are rotation invariant. Let the image f ( r ,q ) be the rotated image of f ( r ,q ) by an angle α about its origin,
f a ( r ,q ) = f ( r , a + q ) (6)
ART of the rotated image are then given as
( r ,q ) f a ( r ,q ) r d r dq (7)
= Fnm exp ( - jma ) (8)
Here, the magnitude of the ART of the rotated image and that of the reference is the same, i.e.,
= Fnm (9)
4.2 The ART Descriptor
The ART descriptor is defined as a set of normalized magnitudes of ART coefficients. For scale normalization, ART coefficients are divided by the magnitude of ART coefficient of order n=0, m=0. The ART coefficient of order n=0, m=0 is not used as a descriptor element because it is constant after scale normalization. To keep the descriptor size to minimum, quantisation is applied to each coefficient using four bits per coefficient. 4.3 Similarity Measure
The distance between two shapes with ART descriptor is calculated by summing of the absolute differences of each order of descriptor elements.
Dissimilarity =
M d [ i ] - M q [ i ] (10)
Here, the subscript d and q represent image in the database and query image, respectively. M is the array of ART descriptor values.
4.4 Experimental results
Table 1 (middle row) presents the results obtained in the Core Experiment 2 using the ART descriptor implementation from the MPEG­7 XM (eXperimental Model) version 5.0. Good results are obtained in tests of the rotational and scaling invariance. The performance is somewhat degraded under a perspective transformation test (part CE2­A4), and a possible solution to improve it is suggested in the next section. Finally, the tests of similarity based retrieval (CE2­B) show a score of over 66%, which is high considering the complexity of the test database. CE2­A1
Overall 1
Table 1 CE2 Core Experiment Results for the Region Shape descriptor
Figure 13 Example similarity retrieval results for a trademark database
Figure 13 shows an example of the retrieval results, for a query image 3075, for which the performance is close to the average obtained in part B. There are 22 shapes in this cluster. According to the manual clustering, shapes numbered 4,5,7,8,9,13 and 16 are considered members of a different cluster (therefore counted as errors), however a large degree of similarity is still visible between them and the query image. Results for the Core Experiment CE­1 are presented in Table 2. It can be seen from parts A­1 and A­2 that the descriptor offers very good rotational and scaling invariance. As mentioned earlier, the region­
based approach is more robust to deformations related to the extreme scaling – in CE1­A1 it outperforms the contour shape descriptor by 7%. In similarity based retrieval (part B), the region­shape descriptor is on average about 15% worse compared to the contour­shape descriptor. A detailed class­
by­class analysis shows that out of 70 shape classes, the region­shape descriptor significantly outperforms the contour­based descriptor in 8 cases, while the contour­shape descriptor significantly outperforms the region­shape descriptor in 42 cases. More details on Core Experiment CE­1 are provided in section CE1­A1
Table 2 CE1 Core Experiment Results for the Region Shape Descriptor
Average performance in CE2 = { (A1 + A2 + A3 + A4)/4 + B }/2
Overall result in CE1 is defined as: CE1 = { (A­1 + A­2)/2 + B + C }/3
4.5 Compensation for the perspective transformations
Perspective transformations of the shape frequently occur in real world applications. It is desirable that shape descriptors have the capability to retrieve shapes distorted by perspective transformations. However, most region­based shape descriptors perform poorly in such situations. As shown in Table 1, the retrieval performance in the CE2­A4 experiment (involving perspective transformations) is relatively low compared with the performance under rotation and scaling. In this section, a method to improve the retrieval performance in the presence of perspective transformation is discussed.
J. G. Leu proposed a compaction algorithm to obtain the skew invariance . The objective of the compaction algorithm is to normalize an image via a sequence of two linear transformations so that the covariance matrix of the compacted image becomes a scaled identical matrix. So, the compacted image by the algorithm becomes invariant to translation, scaling and skew. As the result of the compaction algorithm, a triangle becomes an equilateral triangle, a rectangle becomes a square etc. Although the compaction algorithm can normalize the skewness caused by perspective transformation, it is not desirable in all applications. The similarity distance between two images can be defined as follows when the normalization for perspective transformation is taken into account,
Dissimilarity = min( DO , DC ) (11)
Here DO and DC are the similarity distances from the query image to the original image set and compacted image set, respectively.
Table 1 compares the retrieval performances of the ART descriptor with and without the perspective normalization (first and second row respectively). The retrieval performance has been drastically improved when the normalization for perspective transformation is needed (CE2­A3), while those of other experiments remain almost the same.
5 Contour Shape Descriptor
In this section, the MPEG­7 contour shape descriptor is presented: we explain the underlying theory, discuss its properties and overview some experimental results. The contour shape descriptor is based on the Curvature Scale­Space (CSS) representation of the contour . The CSS representation has been successfully used for search and retrieval in the past, but it has been further extended and optimised during the MPEG development phase. Six different techniques were compared in Core Experiments: Fourier­based techniques, shape representations based on Zernike moments and ART, turning angles and wavelet­based techniques. It was concluded that the CSS based descriptor offered the best overall performance for contour­based similarity. A detailed overview of the CSS representation and its use in the MPEG­7 standard is presented in [4]. 5.1 The Curvature scale­space representation The concept of CSS representation is intuitively easy. It is based on an observation that when comparing shapes, humans tend to decompose shape contours into concave and convex sections. The overall similarity is assessed based on similarity of the “corresponding” sections: e.g. how prominent they are, their length relative to the contour length and their position and order on the contour. The CSS representation also “decomposes” the contour into convex and concave sections by determining the inflection points (e.g. ponts where curvature is zero). This is done in a multiresolution fashion, where the contour is analysed at various scales, obtained by a smoothing process. Figure 14 shows shape of a fish (left) and its CSS image (right). There are 3 peaks in the CSS image, (marked as A, B, and C) signifying that the fish contour has 3 concave parts (marked correspondingly 12
on the contour). The CSS image also shows that the concave segments A and C are approximately of similar perceptual importance (because the peak heights are similar), while concavity B is less prominent. We can also deduce that concavity B is located in­between concavities A and C and find the relative distances between the inflection points. Figure 14 Example of the CSS image representation for a fish contour
Extraction of the CSS representation involves calculation of the curvature of a (usually digital) contour, while progressively smoothing the contour. Figure 15 shows the shape evolution during filtering (left column), the corresponding curvature functions (middle column) and the CSS image (right column). The curvature of a contour is defined as the derivative of the tangent angle to the contour. Let us assume that the contour C is represented by a parametric vector equation:
c(u ) = ( x(u ), y (u )) , where x, y are the coordinates of the pixels belonging to the contour and u is an arbitrary parameter. For a closed, planar contour, where u is the normalised arc length parameter, the curvature can be calculated as:
κ (u ) = x (u ) y (u ) − x(u ) y (u )
where are the first and second order derivatives of the x with respect to the parameter u. The smoothing is performed iteratively and for each realisation of the contour (e.g. after each iteration), the zero crossings of the curvature function are computed. The filtering process continues until a convex contour results, that is until there are no more zero­crossings of the curvature function. Figure 15 shows three contours C1,C2,C3 after increasing amount of smoothing has been applied and their corresponding curvature functions (middle column). The zero crossings of the curvature function are marked on each contour. Contour C3 is convex and therefore there are no zero crossings in its curvature function. Each zero crossing has a two dimensional vector associated with it (s,l), where s is the amount of the smoothing applied and l the position of the zero crossing on the contour. l is expressed as the distance along the contour from an arbitrarily chosen point on the contour to the inflection point, while circulating the contour clockwise. The CSS image is obtained by plotting all the zero crossing points on a 2­dimensional plane, where vertical axis corresponds to the amount of filtering and horizontal axis corresponds to the position on the contour. Thus each contour C1,C2,C3 defines a single horizontal line in the CSS image, shown in Figure 15 (left).
Figure 15 Extraction of the CSS image for the fish contour of Fig. 14
5.2 Descriptor Syntax and Extraction
The descriptor consists of the eccentricity and circularity values of the original and filtered contour (each quantised to 6 bits), the index indicating the number of peaks in the CSS image (6 bits), the magnitude (height) of the largest peak (7 bits) and the x and y positions on the remaining peaks (9 bits per peak). The average size of the descriptor is 112 bits per contour. Extraction of the circularity and eccentricity
The circularity of the contour is calculated using the formula (12). It is uniformly quantized to 6 bits in the range [12­110]. If the circularity value calculated is above 110, the value is clipped to 110.
perimeter 2
The calculation of eccentricity is as follows:
eccentricity =
where: i02
∑ ( yk
k =1
i20 + i02 + i20 2 + i02 2 − 2i20 i02 + 4i112
i20 + i02 − i20 2 + i02 2 − 2i20 i02 + 4i112
k =1
k =1
y c ) 2 , i11 = ∑ ( x k − x c )( y k − y c ) , i 20 = ∑ ( x k − x c ) 2 , M is the number of points inside the contour shape, and (xc,yc) is the center of mass of the shape. eccentricity is uniformly quantised to 6 bits in the range [1­10]. If the eccentricity value calculated is above 10, the value is clipped to 10.
Extraction of the peak parameters
The CSS image does not have to be explicitly extracted, we only need the position of the peaks. Fast methods of extraction are described in . Here we outline a simple but slower procedure for ease of understanding. Firstly, N equi­distant points are selected on the contour, starting from an arbitrary point and following the contour clockwise. The x­coordinates and y­coordinates of the selected N points are grouped into two series X, Y. The smoothing of the contour is performed by repetitive application of a low­pass filter with the kernel (0.25,0.5,0.25) to X and Y coordinates of the selected N contour points. As a result of the smoothing, the contour evolves and its concave parts gradually flatten­out, until the contour becomes convex (no zero crossings in the curvature), at which point the iterations terminate. For each smoothed contour, the zero­crossings of its curvature function are computed. The CSS image is constructed by marking each zero­crossing at an appriopriate position. 14
The CSS image horizontal­coordinates correspond to the indices of the contour points selected to represent the contour (1,…N), and CSS image vertical­coordinates correspond to the amount of filtering applied, defined as the number of passes of the filter. So if the contour obtained after k=20 iterations has a curvature zero crossing at location corresponding to index I=60, the pixel (60,20) will be black. Once all zero crossings were marked, the locations of the prominent peaks (x_css, y_css) in the CSS image are extracted. The location values are suitably normalised so that they are invariant to the number of sampling points and the size of the contours. The peaks are ordered based on decreasing values y_css, transformed using a non­linear transformation and quantised. The details of the quantisation are explained in .
5.3 Properties of the contour shape descriptor
The contour shape descriptor has some very useful properties. Firstly, the representation is invariant to orientation, scale and mirroring of the object contour. It has also been shown to be robust to noise in the contour . Secondly, unlike the region shape descriptor, it retains the local information of the input shape (e.g. the order and neighbouring information about the features of the shape). This is because every contour concavity and convexity has its own corresponding contour on the CSS image. Local information plays important role in retrieval, for example one can determine the number of concave and convex elements on the contour just by counting the number of the so­called peaks (maxima) in the CSS image.
The shape properties of contours are important for retrieval of semantically similar objects [1]. The descriptor is also very efficient in applications where high variability in the shape is expected, due to, for example, deformations in the object (rigid or non­rigid) or perspective deformations. It supports search for shapes that are semantically similar for humans, even when significant intra­class variability exists, such as in the images of Beetles shown in Figure 5.
Finally, many objects have distinguishing features in properties of their contour while having similar region properties and can efficiently differentiate between such shapes.
All the above properties can be clearly seen when analysing the experimental results, described in the following section. 5.4 Experimental Results
The results obtained in the Core Experiment 1 are summarised in Table 3, using the MPEG­7 XM implementation version 5.0. It can be seen that very good results are obtained in the rotational and scaling invariance tests. However, the real strength of the contour shape descriptor is in its performance in similarity­based retrieval (part B), where the average result is above 79%. CSS­XM
Table 3 CE1 Core Experiment Results for the Contour Shape Descriptor
Figure 16 Example similarity­based retrieval results for the contour­shape descriptor
Figure 16 shows example results for a similarity­based retrieval, for the butterfly­1 query image. It can be seen that the majority of butterfly images are retrieved correctly despite the significant variability in the shape of the wings, the size and spread of the antennae and whether or not the corpus of the body is protruding at the back. Images 16,19 and 20 are errors, but still some degree of similarity exists.
A class­by­class analysis of performance of the shape descriptors in similarity­based retrieval (CE1­B) illustrates very well the differences between the region­based and contour­based notion of similarity. Let us consider the shape classes where relative performance of one of the descriptors is better by 50% or more, compared to the other descriptor. This means that one of the descriptors retrieved 50% more correct shapes for the given class, which is a very significant difference in performance. The contour shape descriptor outperforms the region shape by 50% or more for 17 classes ­ Figure 17 shows example shapes. Figure 17 Example of shapes where contour descriptor significantly outperforms region descriptor
The fist class includes images of a running person, where significant non­rigid deformations occur. The contour –based technique is much more robust to such deformations. Fork (second object in Figure 17) is a good example of a shape where characteristic features are conveyed in the contour. The region­
based approach tends to confuse it with other elongated objects that expand at one end (e.g. Spoon, Hammer). Similarly, for Butterfly some characteristic features are in its antennae, which have a relatively small area and thus are not sufficiently emphasized by the region­based approach. Finally, the classes of Beetles, Horses and Elephants are examples where the natural variability in shapes is much better captured by the contour­based approach. The four classes where the region based approach performs better are shown in Figure 18. The first, second and fourth shapes represent classes where the same “basic convex shape” such as Circle (Device­9), Triangle (Device­4) and Pentagon (Device­6), has been altered by cutting out a variable number of long and narrow openings (Refer to Figure 5, bottom row, to see some shapes in the “pentagon” class). All such shapes have a similar distribution of pixels within their regions, but obviously very different contours, because the number and shape of the openings differs within each class. Finally, the third shape, a square with rounded edges, is a convex shape and therefore it contour carries very little characteristic information (no peaks in the CSS image). The contour­based technique tends to confuse it with other convex objects, for example with the Circle. 16
Figure 18 Classes where the region shape descriptor significantly outperforms the contour shape descriptor
6. 3­D Shape Descriptor
The 3D Shape Spectrum Descriptor (3D SSD) captures characteristic shape features of 3D objects by exploiting some local geometrical attributes of objects’ surfaces defined by their 3D mesh models. The Shape Spectrum has been previously used for indexing and retrieval of 2D images and 3D range data .
6.1 Definition of the 3D Shape Descriptor
Before defining the shape spectrum, the concept of the shape index should be introduced. The shape index was first introduced by Koenderink and is defined as a function of the two principal curvatures. 1
Let p be a point on a regular 3D surface S and let k p and k p denote the principal curvatures 1
associated with point p ( k p ≥ k p ). The shape index at point p, denoted by Ip, is defined as: .
The shape index is a local geometrical attribute of a 3D surface, expressed as the angular coordinate of a polar representation of the principal curvature vector. The shape index ranges in the interval [0,1] and is not defined for planar surfaces. The shape index provides a scale for representing salient elementary shapes such as convex, concave, rut, ridge and saddle (Figure 19), and is invariant with respect to scale and Euclidean transforms. Title: fig3.eps
Creator: MATLAB, The Mathworks, Inc.
CreationDate: 02/02/ 1 13:01:36
Title: fig4.eps
Creator: MATLAB, The Mathworks, Inc.
CreationDate: 02/02/ 1 13:01:51
a. Spherical cup (0.0)
b. Rut (0.25)
Title: fig5.eps
Creator: MATLAB, The Mathworks, Inc.
CreationDate: 02/02/ 1 13:02:02
c. Minimal Saddle (0.5)
Title: fig1.eps
Creator: MATLAB, The Mathworks, Inc.
CreationDate: 02/02/ 1 13:01:07
Title: fig2.eps
Creator: MATLAB, The Mathworks, Inc.
CreationDate: 02/02/ 1 13:01:19
d. Ridge (0.75)
e. Spherical cap (1.0)
Figure 19 Elementary shapes and their corresponding shape index (in brackets).
It may be recalled that the principal curvatures are defined as the eigenvalues of the Weingarten map (W) given by the following expression: W = I −1 II , (16)
where I and II denote respectively the first and the second fundamental differential forms. Considering a Cartesian coordinate system and a Monge parametrization S = ( x , y , z ) = ( x , y , f ( x , y )) of the surface, with f ­C2 differentiable, the fundamental differential forms I and II can be expressed as symmetric and positive semi­definite matrices, and are given by the following equations:  S ,S
 x x
I =
 S x ,S y
S x ,S y   1 + f x2
 =
S y ,S y   f x f y
 S ,N
 xx
II = 
 S xy , N
S xy , N 
 f xx
S yy , N 
1 + f x2 + f y2  f xy
fx fy 
 1 + f y2 
and f xy 
 , f yy 
where N denotes the normal vector to the surface at the considered surface point p, and Sxy is the standard Monge notation for the partial derivatives of the surface S with respect to variables x and y. Here, notations used correspond to : ∂S
∂ 2S
∂ 2S
∂ 2S
Sx =
, S y =
, S xx = 2 , S xy =
, S yy = 2 .
The shape spectrum of the 3D mesh is defined as the distribution of the shape index calculated over the entire mesh. We note that the above definitions refer to surfaces in the continuous space ℜ 3 . The extraction of the 3D SSD from discrete, polyhedral 3D meshes is described in detail in the following section. 6.2 Extraction of the 3D SSD Descriptor Many 3D models are expressed in VRML format, which is primarily intended for graphics purposes. Before applying differential geometry­based analysis to such data, it is necessary to transform it into orientable meshes, represented by a minimal number of connected components, without degenerated or overlapped polygons. To obtain the results presented here, a preliminary regularizing filtering of the 3D mesh models has been applied. Related approaches for repairing 3D mesh models to form manifold surfaces and solid regions are reported in . In addition, in order to obtain smoother representations, necessary when considering surface derivatives, a subdivision algorithm based on Loop's subdivision scheme was applied.
6.2.1 Estimation of the Principal Curvatures Estimating the principal curvatures is the key step of the 3D SSD extraction. Indeed, the 3D SSD performance strongly depends on the accuracy of estimates and the robustness of the estimation technique used for the triangulation of the 3D model. Our approach is based upon a second­degree polynomial surface fitting. The principal curvatures are associated with each face of the mesh. First, the mean normal vector to ~
face fi of the mesh, denoted by N fi , is defined as the weighted average of the normal vectors of the 0­
adjacent faces of the considered face fi (Equation 18). Two faces of the mesh are said to be 0­adjacent if and only if they share a common vertex. ~
N fi =
wk N f k
wk N f k
{ }
f k ∈F0 f i
{ }
f k ∈F0 f i
Here, F0 {
} denotes the set of 0­adjacent faces of fi, N f k denotes the normal vector to face fk, wk is the weighting coefficient associated with the 0­adjacent face fk, and ⋅ the L2 norm of a vector in ℜ 3 . Each weighting coefficient wk is set to equal the area of the face fk, in order to take into account some irregularities in the mesh face distribution. A local Cartesian coordinate system (x, y, z) is defined such that its origin coincides with the gravity ~
center of the face fi and the z­axis is associated with the previously defined mean normal vector N fi . {
} iN=1 denotes the cloud of points made of the centroids of the considered face f and Let W= ( x i , y i ,z i )
its 0­adjacent faces, with coordinates expressed in the local coordinate system. The parametric surface approximation is achieved by fitting a quadric surface through the set of points W.
Let S = ( x , y , z ) = ( x , y , f a ( x , y )) denote a second degree polynomial surface, expressed as: f a ( x , y ) = a 0 x 2 + a1 y 2 + a 2 xy + a 3 x + a 4 y + a 5 , t
where ai's are real coefficients. By denoting a = (a 0 a1 a 2 a 3 a 4 a 5 ) and b( x , y ) = ( x y xy x y 1) , Equation 6 can be re­written more compactly by using standard matrix notations: 18
The parameter vector a = (a 0 a1 a 2 a 3 a 4 a 5 ) is determined by applying a least square fit procedure. N
Given the data points denoted by { ( x i , y i ,z i )} i =1 , with associated weights { wi } i =1 , the parameter vector â corresponding to the optimal (in the weighted mean square error sense) fit is expressed as: N
a =  ∑ wi b( x i , y i )b t ( x i , y i )
 i =1
 ∑ wi z i b( x i , y i ) = arg min ∑ wi ( z i − f a ( x i , y i )) 2 .  i =1
a ∈ℜ 6 i =1
Note that the representation of the gravity centers { ( x i , yi ,z i )} i =1 in local coordinates guarantees the invariance of the approximation with respect to Euclidean transforms. Once the parametric approximation is computed, the principal curvatures k1 and k2 can be easily determined as the eigenvalues of the Weingarten map W, since the first and second differential forms I and II at (x, y) = (0, 0) take simple expressions, given in Equation 22.
 a0 a2 
 1 + a 32 a 3 a 4 
II =
I =
 a3a4 1 + a4 
1 + a 2 + a 2  a 2 a1 
Note, that it is also possible to consider the set Fn { f i } of the n­adjacent faces (n>0). Such a set is recursively defined from F0 { f i } by considering successively wider neighborhoods. In this case, the computational complexity increases proportionally to the number of n­adjacent faces without providing a significant improvement of the principal curvature final estimation. 6.2.2 Computation of the 3D Shape Spectrum The 3D SSD of a 3D mesh M is defined as the histogram of the shape index values, calculated over the entire mesh. The histogram is represented on a Nbins number of bins, uniformly quantizing the range of N
the 3D shape index values (the [0, 1] interval) into a set of Nbins intervals, { ∆ k } k =1
, where N
 k −1
−1 
k 
∆k = 
, 1 .  , for k = 1, N bins − 1 and ∆ N bins =  bins
 N bins N bins 
 N bins
The relative (with respect to the total area of the mesh) area of each face of the mesh of shape index belonging to the interval ∆ k is added to the kth component of the 3D SSD. Two additional components, denoted by PlanarSurface and SingularSurfaces, and representing the relative surfaces of planar and border patches, respectively, are introduced in the 3D SSD. The planar surfaces must be separately considered since the 3D shape index is not defined for planar patches. Let ka denote the L2 norm of the curvature vector (k1, k2): k a = k 12 + k 22 . The degree of curvedness of the surface is defined as C = Area(F0 {f i }) * k a . Each face with curvedness C less than a threshold T, will be considered as part of a planar patch and its relative area will increment the PlanarSurface component of the 3D SSD. Unreliable estimates of the principal curvatures may occur in the case of border faces with a number of 0­order adjacent faces smaller than a pre­defined number Nmin. Hence, the border faces are not taken into account when computing the shape index and their relative area is stored into the SingularSurfaces component. 6.2.3 Example similarity measure
The similarity measure is L1 or L2 norm computed on vectors formed by the SSD histograms. Such similarity measures may, or not, take into account the non­planar surfaces or the singular surfaces, depending on the target application.
The 3D SSD has been evaluated within the MPEG­7 Core Experiments (CE’s). The results presented in next section are those obtained and reported within the framework of the MPEG­7 3D Shape CE over 5 successive MPEG meetings (October 1999 ­ July 2000). 6.3 Experimental Results
The 3D SSD has been evaluated on a data set consisting of 1300 3D meshes. To facilitate objective evaluation, a subset consisting of 228 models has been selected and categorized into 15 distinct categories, based on the shape similarity as perceived by the participants in the CE.
The L1­based norm was used and the singular and planar histogram bins have not been taken into account. and the 3D SSD has been re­normalized such that N bins
∑ SSD(i ) = 1 .
i =1
Figures 20 and 21 show some examples of similarity retrieval corresponding to queries from the "Cars" (2.a) "Aerodynamic" (2.b) "4­Limbs" (3.a and 3.b) and "Trees" (3.c) categories. The first 6 retrieved results are presented according to a decreasing similarity order, from left to right and from top to bottom. In the upper left cell the query is presented. Figures 21.a and b show that nice results are obtained for articulated objects in the "4­Limbs" category, where elements in various postures are retrieved (the humanoid with arms and legs spread­out in the second row of Figure 21.a, the sitting dog in Figure 21.b). For objective evaluation of the proposed 3D SSD in terms of retrieval performances, the Bull’s­Eye Percentage (BEP) score is computed. The BEP is defined for each query as the percentage of correct matches (i.e. items within the same category as the query) among the 2Q top retrieved results, where Q denotes the size of the considered query category. After computing the BEP scores for each categorized item, an average BEP is computed for each category. The global mean score of 85% demonstrates the good discriminatory properties of the 3D SSS. (a)
Figure 20 Similarity retrieval results for queries within the "Cars" (a) and "Aerodynamics" (b) categories
Figure 21 Similarity retrieval results for queries within the "4­Limbs" (a, b) and "Trees" (c) categories.
In order to decrease as much as possible the complexity of the 3D SSD, the effects of uniformly quantizing the 3D SSD and of decreasing the number of histogram bins of the representation, with respect to the similarity retrieval related performances were studied. Experimental results show that a 12 bits for the quantization values yields hit rates approaching the floating point representation performances. More compact representations, down to 8 ­ 7 bits are also possible with 3% degradation of the mean retrieval rate. Concerning the number of bins of the representations, good results are obtain in a range varying from 100 down to 10 histogram bins, with a degradation of the mean retrieval rate of only 5%. Thus, a compact representation of the 3D shape spectrum descriptor can be used while ensuring optimal retrieval rates. 7 Combining a Multiple View and Contour Shape Descriptors
The Multiple­View Descriptor has been developed to combine 2D descriptors representing a visual feature of a 3D object seen from different view angles [10]. The descriptor forms a complete 3D view­
based representation of the object. Any 2D visual descriptor, such as for example contour­shape, region­shape, colour or texture can be used. The Multiple­View Descriptor supports integration of the 2D descriptors used in the image plane to describe features of the 3D (real world) objects. Experiments with the Multiple­View Descriptor and the Contour­based Shape Descriptor have shown good performance in multi­view description of 3D shapes . 8 Example Applications of the MPEG­7 Shape Descriptors
We will now consider two example applications of MPEG­7 shape descriptors: a cartoon search engine and a trademark search system. (MZB Note: insert short para on the trademark search system)
Cartoon Search Engine Figure 22 shows an example of a Web­based cartoon clip search application, which uses MPEG­7 visual descriptors. Short cartoon stories are stored in MPEG­4 format on a server. The characters present in each key­frame are segmented and their contour shape and dominant colour descriptors extracted and stored as the MPEG­7 descriptors. The user selects a required cartoon character from the group of images in the upper part of the display. Alternatively, the visual properties of the searched character can be drawn/selected with a mouse. A search is performed on the descriptions stored on the 22
servers and key frames from the video clips containing similar characters are displayed to the user (lower row of images). The user can then select the required clip, which is transmitted over the network and played back. Combination of the dominant colour and contour shape descriptor gives robust performance, even in the case of partial occlusion, as can be seen in two of the retrieved key frames in the Figure 22. Figure 22 GUI of the Shape and Colour Cartoon Search Engine
9 Conclusions
A coherent set of shape descriptors has been developed as an international collaborative effort and now forms a key part of the MPEG­7 visual specification. Extensive tests shown that the descriptors offer state­of­the­art retrieval performance, and have excellent generalisation properties, similar to that of the human visual system. High compactness and expression efficiency combined with fast extraction and matching should satisfy even the most demanding real­time application. We expect to see a wide range of novel devices and applications, based on the MPEG­7 shape description technology in the near future. Some very early examples were presented here, but it appears that there is much more to come. ACKNOWLEDGEMENTS
The authors of this chapter would like to thank all participants in the MPEG “shape experimentation group”, especially W­S Kim, K. Muller, J Ohm, J. Cooper, T. Zaharia, H­K Kim, J.D. Kim, for their technical contributions towards development of the MPEG shape descriptors.
[1] Shimon Ullman, “High Level Vision”, The MIT Press, 1997 23
[2] Text of ISO/IEC 15938­3/FCD Information technology – Multimedia content description interface – Part 3 Visual, ISO/IEC JTC1/SC29/WG11/N4062, March 2001 (Singapore)
[3] S. Jeannin and M. Bober, “Description of Core Experiment for Motion/Shape,” ISO/IEC N2690, Seoul, Mar. 1999.
[4] F. Mokhtarian, M. Bober, “The Curvature Scale Space Representation: Theory, Applications, and MPEG­7 Standardization”, Kluwer Academic Publishers, to appear in 2001
[5] J.J. Koenderink, A.J. van Doorn, “Surface shape and curvature scales”, Image and Vision Computing, 10 (8), 1992.
[6] Whoi­Yul Kim, Young­Sung Kim,“A new region­based shape descriptor ”, ISO/IEC MPEG99/M5472, Maui, Dec. 99. [7] Cho­Chuak Ten, Roland T. Chin, “On Image Analysis by the Methods of Moments”, IEEE Transactions on attern Analysis and Machine Intelligence, Vol. 10, No. 4, July 1988, pp. 496­513.
[8] M. Bober, – “Shape descriptor based on Curvature Scale Space”, MPEG­7 proposal P320, Lancaster, UK, February 1999.
[9] Jong Deuk Kim and Hae­Kwang Kim, – “Shape Descriptor based on Multi­Layer Eigen Vector”, MPEG­7 proposal P517, Lancaster, UK, February 1999.
[10] F. Mokhtarian, A.K. Mackworth, “A theory of multiscale, curvature­based shape representation for planar curves” , IEEE Trans. On Pattern Analysis and Machine Intelligence, Vol. 14, No 8, August 1992, pp. 789­805.
[11] J. G. Leu, “Shape Normalization Through Compacting,” Pattern Recognition Letters, vol. 10, pp. 243­250, 1989.
[12] K. Müller, J­R Ohm, J. Cooper , M Bober, “ Results of 2D/3D Shape Core Experiments MS­4 ”, ISO/IEC MPEG99/M6190, Beijing, China, July 2000.
[13] C. Nastar, "The Image Shape Spectrum for Image Retrieval", INRIA Technical Report RR­3206, July 1997.
[14] C. Dorai, A.K. Jain, "Shape Spectrum­Based View Grouping and matching of 3D Free­Form Objects", IEEE Trans. on PAMI, Vol. 19, No. 10, pp. 1139­1145, 1997. [15] J. Koenderink, "Solid shape", The MIT Press, Cambridge, Massachusetts, 1990. [16] M. Do Carmo, "Differential geometry of curves and surfaces", Prentice­Hall Inc., Englewood Cliffs, New Jersey, 1976. [17] M. Spivak, "A comprehensive introduction to differential geometry", 2nd ed., Houston: Publish or Perish, 1979. [18] P. Hoogvorst, "Filtering of the 3D mesh models", ISO/IEC JTC1/SC29/WG11, MPEG00/ M6101, Geneva, June 2000. [19] A. Gueziec, G. Taubin, F. Lazarus, W. Horn,"Converting Sets of Polygons to Manifold Surfaces by Cutting and Stitching", IEEE Visualization'98, 1998. [20] T.M. Murali, Thomas A. Funkhouser, "Consistent solid and boundary representations", Computer Graphics (1997 SIGGRAPH Symposium on Interactive 3D Graphics), pp. 155­162, March 1997,. [21] G. Butlin, C. Stops, "CAD data repair", Proc. of the 5th Int. Meshing Roundtable, October 1996. [22] G. Barequet, S. Kumar, "Repairing CAD models", Proc. of IEEE Visualization'97, October 1997. [23] C. T. Loop, "Smooth subdivision surfaces based on triangles", Master's Thesis, Dep. of Mathematics, Univ. of Utah, August 1987. [24] T.Zaharia, F. Preteux, M. Preda, "3D Shape spectrum descriptor", ISO/IEC JTC1/SC29/WG11, MPEG99/M5242 , Melbourne, Australia, October 1999. [25] T. Zaharia, F. Preteux, "3D Shape descriptors: Results and performance evaluation" ", ISO/IEC JTC1/SC29/WG11, MPEG99/ M5592, Maui, December 1999. [26] T. Zaharia, F. Preteux, "Crosscheck of the 3D shape spectrum descriptor", ISO/IEC JTC1/SC29/WG11, MPEG00/ M5917 , Noordwijkerhout, The Netherlands, March 2000. [27] T. Zaharia, F. Preteux, "3D Shape Core Experiment: The influence of mesh representation", ISO/IEC JTC1/SC29/WG11, MPEG00/ M6103, Geneva, June 2000. 24
[28] T. Zaharia, F. Preteux, "3D Shape Core Experiment: Semantic versus geometric categorization of 3D mesh models", ISO/IEC JTC1/SC29/WG11, MPEG00/M6104, Geneva, Switzerland, June 2000. [29] T. Zaharia, F. Preteux, "The influence of the quantization step on the 3D shape spectrum descriptor performances", ISO/IEC JTC1/SC29/WG11, MPEG00/M6316, Beijing, China, July 2000
[30] W.S. Kim and W. Y Kim, « Content­Based Trademark Retrieval System using visually salient features », IEEE Conference on Computer Vision and Pattern Recognition, 1997, pages 307­312