Chapter 1
Introduction

1.1 Computer Vision

The study of computer vision is a research field of artificial intelligence in computer science. Its main task is to understand scenes from their projected images by imitating the human vision process. Although vision is an innate property and a natural process for human beings, it has proved to be extremely difficult to replicate its power in automated vision.

To achieve the goal of machine perception, certain characteristic values have to be recognized or recovered from the camera input. These values can vary widely; past projects have used, for instance, texture, shape and color. The two-dimensional, projected images that computers analyze lose information about the depth and three-dimensional structure of the scene. Because even with a complete description it is extremely difficult to separate all the different factors that determine the appearance of an object (e.g., illumination, environment and texture properties), vision problems always require a priori knowledge about the surrounding world. This knowledge base can contain widely ranging information about the nature of the tasks and interests of the agent.

In computer vision, research approaches can be divided into two major groups. One works with "low-level" computations while the other attacks "high-level" problems. The term "low-level vision" refers to the early stages of image processing. At this stage, noise is expected to be eliminated and information is gathered about the input by analyzing grayscale measures, intensity values and gradient operators. The major aims of these operations include edge- or region-finding, boundary-drawing and other basic image processing tasks. Apart from the raw data, these operators have little or no knowledge about the nature of the input that could guide their work. "High-level vision" processes, on the other hand, make extensive use of any predefined information about the task and/or the input. They assume perfect output from the lower-level processes and justify the correctness of their hypotheses by referring to the results of those computations. Using the additional data source allows for highly task-specific applications. The focus of this strategy is much more limited than that of the general low-level attempts, which do not assume anything about the input.

1.1.1 Brief History

Historically, research studies used a sequential bottom-up approach in their computer vision experiments. In these cases, the algorithm started with the analysis of the raw input image. Then a hypothesis was formulated about the image objects by recovering particular features. Roberts [Roberts, 1965] carried out pioneering work in the so-called "blocks world". That artificial environment consisted of only regular geometrical shapes. His algorithm transformed the input image into line drawings. Then it interpreted these lines and differentiated among the pre-specified three-dimensional bodies.

Other machine vision approaches applied segmentation to compare the input to the pre-determined models. Segmentation is one of the first steps of low-level image processing. It divides the input into distinguishable units based upon a collection of common features. One of the earliest region analysis algorithms was introduced by Brice and Fennema [Brice, 1970]. Instead of searching for and emphasizing the differences between picture elements (pixels), their code collected these elements together based upon an identical characteristic: grayscale intensity. After the elementary regions consisting of identical pixels were identified, two heuristic operators were introduced. These attempted to merge some of the neighboring regions based upon the assumption that their grayscale differences could be attributed to noise or uneven lighting conditions. Then the enlarged regions that formed the output of these procedures were associated with various objects and surfaces in the environment. This algorithm could be effectively used in navigation and object recognition tasks.
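The first stage of such a region analysis can be sketched as follows. This is only an illustrative sketch, not the code of Brice and Fennema: the function name label_regions and the parameter tol are hypothetical, and the tolerance-based merging is a crude stand-in for their two heuristic operators, which are not specified here. With tol=0 the function collects 4-connected pixels of identical grayscale intensity into elementary regions; a small positive tol lets neighboring pixels with slightly different intensities join the same region.

```python
from collections import deque


def label_regions(image, tol=0):
    """Label 4-connected regions of a grayscale image (list of rows).

    Two neighbouring pixels join the same region when their intensities
    differ by at most `tol` (hypothetical parameter; tol=0 yields the
    elementary regions of identical pixels).
    Returns (labels, region_count).
    """
    rows, cols = len(image), len(image[0])
    labels = [[-1] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != -1:
                continue  # pixel already belongs to a region
            # Start a new region and grow it with a breadth-first search.
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and labels[ny][nx] == -1
                            and abs(image[ny][nx] - image[y][x]) <= tol):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label


# Made-up 3x3 image: the pixel of value 12 is treated as noise when tol=2.
image = [[10, 10, 50],
         [10, 12, 50],
         [90, 90, 50]]
labels, count = label_regions(image)
merged_labels, merged_count = label_regions(image, tol=2)
```

On this made-up input, the strict grouping isolates the single pixel of value 12 as its own region, while the tolerant grouping absorbs it into the neighboring region of value 10, which mirrors the assumption that small grayscale differences stem from noise or uneven lighting.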

Without any additional information, region analysis algorithms can "underdivide" or "overdivide" image areas belonging to a surface. In the first case, some boundary properties are not recognized and regions of two or more different types are merged together. In the second case, however, several image regions represent a single surface in the environment. These "failures" can result from distortions introduced by shadows, occlusion and varying illumination. To overcome problems of this nature, Guzman [Guzman, 1968] introduced heuristics to obtain segmentations of two-dimensional images that were meaningful in three dimensions as well. He characterized line-junctions and vertices and connected image regions that could belong to the same object. Applying the additional information gathered through these heuristics, he was able to recognize objects in a simple scene. Tenenbaum and Barrow applied object-to-object relations (e.g., next to, under, above) to achieve the same purpose [Tenenbaum, 1976].

During the early seventies it was physics and geometry that were most commonly applied in computer vision tasks. According to these new approaches essential scene information could be obtained from shape information, such as convexity, concavity and smoothness. These characteristics were recovered by calculating image gradient values. Other features that were also thoroughly examined were texture and shading [Horn, 1970]. Barrow and Tenenbaum [Barrow, 1978] worked on recovering reflectance and orientation information from the input image data.

Place recognition is grouped into the area of image understanding or scene analysis. The task of scene analysis is to build and reconstruct a complex symbolic representation of the input from a simple description [Horn, 1986]. For instance, given a set of two-dimensional line drawings, the scene analysis algorithm would recover a three-dimensional description of the set by identifying geometrical properties of objects represented on the input image. The most widely recovered features have been geometrical ones, but texture, special markings and color have also played an important role. The new information recovered from the input can then be utilized in applications such as navigation, obstacle avoidance and planning.

1.1.2 The Role of Colors

Color plays an essential role in the human visual process. It seems to have some significant advantages as a surface property compared to geometrical features. A red object, for example, remains red to the observer even if he/she moves away from the scene or examines the object from another view. The same is not true for the previously mentioned geometrical properties. On a two-dimensional projection, for example, the shape of an object can change considerably under different viewing conditions. Hence a lengthy and detailed description of the scene is required to account for all the possible arrangements. Although shadowing effects and uneven illumination often distort the appearance of colors, they still prove to be essential and useful perceptual measures for people.

1.2 Description of the Susan B. Project

In 1990, Mount Holyoke College began an Artificial Intelligence project centered around a mobile robot that the students named "Susan B." Susan B. is equipped with a TV camera and she is also connected to a network of PC machines. The Susan B. Project addresses a wide area of Artificial Intelligence. Besides computer vision, for example, studies have been continuously carried out in the field of natural language processing, automated reasoning and navigation. The programs written to enrich the capabilities of Susan B. are collected in a software package called the VLSys system.

In the past years, the majority of the Susan B. research projects have focused on problems in computer vision. Numerous series of experiments attacked the problem of place recognition. For instance, one project investigated stereo vision, another applied triangulation and template matching and a third dealt with robot navigation relying on the black baseboard strip of the walls. These studies investigated black-and-white images and focused on gradient calculations, finding landmarks in the environment and gathering depth information from matching points.

Although none of these studies relied directly on color properties, the VLSys system contains various types of representations related to color sensing. For color input images it initially captures and stores color in RGB (Red-Green-Blue) measurements. After a preliminary preprocessing mechanism, it also obtains a representation based on color-intensity.

1.3 Place Recognition Using Color Region Analysis

Our research project on place recognition using color region analysis is now one of the building blocks of the VLSys project. It addresses the problem of identifying surfaces on an image based upon their color properties and also attempts to determine the location of the represented scene. More specifically, the goal of this identification process is to enable Susan B. to answer the well-known computer vision question "Where am I?" The response to that question can provide indispensable information to an agent, for example, when planning a route or when verifying its current location. Throughout the localization procedure the input color images are compared to an environment model that describes the actual scenes in the "real-world" setting of the agent. Figure 1 demonstrates how this thesis study could be incorporated into the already implemented navigational features of the Susan B. Project. The shaded areas refer to tasks that have already been realized and introduced in [Fennema, 1991], and the star indicates the problem investigated in our study.
 

Figure 1. The decomposition of the navigation system that attacked the "Go Fetch" problem. The new capability investigated is: Where am I? (Reproduced courtesy of Claude Fennema)

The project that is the focus of this thesis study investigates the efficiency of a characterizing feature that has not been examined extensively as a primary search feature. To solve the proposed place-recognition task, it is suggested that color be used as a new analysis attribute and that it be applied to assist in determining the location of the agent/video that produced the input color image.

The color measures used in the color and scene identification process are surface reflectance values. Edwin Land, in a series of color constancy experiments, demonstrated that human color perception does not depend on the amount or composition of radiant energy reaching the eye and that it is hardly modified by the surfaces immediately surrounding the examined one. He concluded that a unique triplet of lightness values determines each color, and he introduced a mathematical algorithm that was expected to produce the surface reflectance values of the perceived regions. To prove these results, he carried out his experiments in strictly controlled environments. The nature of illumination was predetermined and the analyzed scenes consisted merely of matte, rectangular, colored papers called color Mondrians.

Even though Land first introduced the "Retinex Theory" and his findings about color forty years ago, it is difficult to find any computer vision research application that uses color as a primary search cue. Recent computer vision reports (see their detailed description in Chapter 2) did indicate that this property can be successfully applied as a secondary or tertiary operator; however, it has not yet been widely examined as a sole descriptor of surfaces and faces. The question immediately arose: if color is truly such an important object feature, why is it "neglected" in machine vision studies? This is a fascinating problem and it helped to define the principal goals of our research.

We were motivated by Land’s studies to give color a higher priority as a surface characterizing descriptor for three-dimensional objects and to use its power in the scene identification task. This task could not only provide an answer for the "Where am I?" problem, but could also confirm the effectiveness of Land’s theory and reflectance algorithm in a more natural environment setting. In an office environment, for instance, the spatial and the spectral distribution of illumination are generally not controlled, the surface colors are not pure like the ones on the standardized Munsell chips and the analyzed objects are in three dimensions instead of two. The extra dimension in the experiments often leads to unwanted, color-distorting shadows and reflections appearing on the input images. These phenomena are not accounted for in the original algorithm. That complexity also allowed for testing the magnitude of the "simultaneous contrast" effect, which denotes the changes in color appearance caused by variation in the surface reflectance functions of the surrounding objects.

The mechanism that carries out the scene identification procedure relies on a pre-specified environment model. This is represented as a semantic network that describes the three-dimensional entities (locales) constituting Susan B.’s current surroundings [Fennema, 1990]. The hierarchical structure among the locales reveals information about the relative location of these space volumes. The model also stores essential details about the geometrical properties and other characteristic features (e.g., color) of surfaces that are smaller building blocks of the environment. More specifically, the current model of the agent represents a portion of Clapp Laboratory, one of the science buildings of Mount Holyoke College.

1.4 Organization of Thesis

Chapter 2 contains a brief theoretical description of the significance of color and human color sensation. It introduces and compares various color specification systems that could be utilized in color identification processes. It also reviews current research that applied color mainly as an auxiliary feature. The environment model is also introduced. There is a detailed description of the new identification features and their role in the three-dimensional localization assignment.

Chapters 3, 4 and 5 present the experimental work that has been carried out in order to complete the identification goals proposed above. Chapter 3 deals with color normalization. First, it introduces a linearization algorithm. It corrects for camera distortions via a calibration procedure. Then the improved measurements are used in color classification experiments to test how efficiently the various color spaces can quantify the human visual experience. Inspired by Land’s color constancy algorithm, reflectance evaluators are calculated to eliminate the effect of uneven lighting conditions. These evaluators produce values that are relative to the brightest region on the input. A standard gray card (which is calibrated to reflect 18% of the incident light) is then used to determine the absolute reflectances. A set of surface colors, which describes all the existing regions in the environment model, is also defined.
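The step from relative to absolute reflectance can be illustrated with a small worked example. The sketch below is not the procedure of Chapter 3; it assumes a single color channel, already linearized mean intensities per region, and hypothetical names (relative_reflectance, absolute_reflectance, the region key "gray_card"). Each region is first expressed relative to the brightest region, and the result is then rescaled so that the gray card maps to its known reflectance of 0.18.

```python
GRAY_CARD_REFLECTANCE = 0.18  # standard gray card reflects 18% of incident light


def relative_reflectance(region_means):
    """Scale each region's mean intensity by the brightest region (result in (0, 1])."""
    brightest = max(region_means.values())
    return {name: value / brightest for name, value in region_means.items()}


def absolute_reflectance(region_means, gray_card="gray_card"):
    """Rescale relative values so the gray card region maps to 0.18.

    `gray_card` is the (assumed) key of the region covering the card.
    """
    relative = relative_reflectance(region_means)
    scale = GRAY_CARD_REFLECTANCE / relative[gray_card]
    return {name: value * scale for name, value in relative.items()}


# Made-up linearized mean intensities for three regions of one channel.
means = {"wall": 200.0, "door": 100.0, "gray_card": 50.0}
relative = relative_reflectance(means)   # wall 1.0, door 0.5, gray_card 0.25
absolute = absolute_reflectance(means)   # wall 0.72, door 0.36, gray_card 0.18
```

In this made-up case the wall is the brightest region, so its relative value is 1.0; anchoring the gray card at 0.18 then gives the wall an absolute reflectance of 0.72. For color images the same scaling would be applied to each channel separately, which is again an assumption of this sketch.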

Chapter 4 outlines the nature and efficiency of the color identification procedure. As the linearization procedure assumes a particular lighting condition, the new operator is meant to implicitly find the presumed lighting conditions on the current input. With a set of transformations it modifies the input reflectance values to match each of them with a corresponding descriptor located in the environment model. The most important evaluation methods are described and the best transforms are evaluated in detail.

The final stage of the algorithm, the scene identification, is specified in Chapter 5. It is the procedure by which the location of the agent is identified, given the reflectance information from the input image. A bottom-up strategy is specified that compares the input surface colors to the ones located in the locale model, and applies search heuristics to confirm the confidence of the scene analysis. The possible advantages of a top-down search mechanism are also introduced.

Chapter 6 summarizes the achievements of our research project and describes several intriguing ideas that could be investigated in the future.

Appendices A, B, C, D and E contain the majority of the programming code written for the experiments in color and scene identification.