In Chapter 5, the place recognition algorithm is presented. It is responsible for searching the environment model for the surfaces that could be, with the highest confidence value, the ones captured on the input. Possessing a set of reflectance values assigned to the examined regions on the input image, the bottom up algorithm first traverses the locale semantic net and gathers (statistical) information about regions and faces. It examines which of them could possibly satisfy the requirement of appearing on the image. To narrow down the pool of all possible matches, implicit knowledge is also introduced from the environment model. This is described in the form of a semantic net that contains information about the general location of surfaces in the vertical direction compared to each other. The search then produces a list of possible faces that could appear on the image. It is the face with the highest confidence measure that is eventually to be selected. The idea of top-down localization is also introduced and the description of difficulties encountered during the experiments.
5.1 Scene Identification Procedure
The location search procedure in the environment model is a very complex task. On the one hand, a lot of the faces in the environment model are identical in color. They consist of three main regions: wall, upper and lower baseboard surfaces. It is only their position and shape in the environment model that differentiates between them. On the other hand, another source of difficulty roots in the fact that the input images do not necessarily represent single faces in the environment model. It can happen, especially with smaller-sized neighboring surfaces (e.g.: firehose, doors), that the input contains regions from two or even more number of faces. Such an example is depicted on Figure 25, which displays a possible input image with two faces on it. That is one of the reasons why the algorithm cannot be a simple "find the face with the maximum number of identified regions relative to its total number of regions". A very general face could contain matches for all its existing surfaces, but if its neighbors do not contain any of the remaining ones identified on the input, the choice of this face should probably be eliminated or given lower priority. In order to lower the difficulty level of that process, a good strategy is required to gather all the available information from the input images and the locale model.

5.2 The Locale Node Tree
Throughout the color identification experiments, two major types of false assignments were encountered. One of them misidentified the color of a region, but the guess was close to the correct value (for example: metal-brown of the elevator and the deep brown of the main entrance doors were mistaken). The other also provided with a close estimate of the color, but because of a shadow or reflection effect the region’s color was distorted even on the unprocessed input image. The latter mistake is easy to eliminate for human observers. Assuming that the color camera is in upright position when the images are taken, it is impossible to obtain an image where, for instance, the floor region is located above all the other regions. It was found that the most rigid rules that could be introduced to imitate that type of human reasoning are about the vertical direction. As an example, the ceiling region should reside above all other regions and a sofa should not have a wall region below it. It is significantly more difficult to express identical rules in the horizontal direction because of varying object sizes, various possible projection effects on the input and the nature of the segmentation process. In the latter, good region representatives of some of the major surfaces are identified. However, the dimensions are not complete. It is possible that in strongly varying lighting conditions only a minor section of a bigger surface is recognized. To compare the location of this region with the other surfaces can lead to false estimates. In Figure 26, if the door region color is correctly identified, but the bulletin board gets the light-timber assignment (which corresponds to the color of the bench) instead of the cork, the arrangement would be accepted. However, this match should not be allowed, as a bench should always stay on the floor. The above observation led to the idea of introducing a set of constraints that could efficiently guide the localization search.

5.2.1 Tree Structure
A subset of the knowledge implicitly embedded in Susan B.’s environment model was incorporated into a net structure called the locale node tree [Figure 27]. The tree represents some general restrictions considering the location of the environment surfaces in the vertical direction. The nodes in the tree represent objects or surfaces that can be found in Susan B.’s model. The arrows connecting the nodes refer to "above" or "below" relationships. Each surface that is connected to other nodes via the arrows has to follow the rules. For example, if there are several regions on the input and one of them is identified to be the floor, the latter is required to be located below all the others. If this rule is not met by the input entities, then the color assignment most probably made a mistake. The most suspicious region might be dropped in the following calculations or the original image could be traversed for additional data. The current project assigns confidence values to the regions after being compared to all the others. The sum of these evaluators should be able to indicate whether the current environment model can justify the set of color matches for the input image.

The locale node tree does not imply any knowledge about the horizontal vicinity of surfaces. As it was already mentioned, describing limitations in this direction is not feasible because the nature of input images and the segmentation algorithm allow for only a good approximation about the detected faces. Any kind of a regulation set would have to be extremely complex to convey true information on these relationships.
In [Tenenbaum, 1976], Tenenbaum and Barrow addressed a similar region localization problem. Their relaxation algorithm was to interpret region assignments. In their study, the input image was correctly partitioned into individual surfaces and the task was to find the appropriate label for each. To achieve this goal, they used constraints formed about the objects in the environment. The four restrictions introduced by the authors were: "within", "beside", "above" and "size". These rules were sequentially applied to the regions and eventually a discrete label was assigned to all surfaces.
The locale node tree could not take an advantage of the exact same reasoning. As it was discussed above, the segmentation process is not capable of exactly determining the composition of the input image in terms of regions. The four specific operators could not have been able to convey reliable information about the partially recovered regions.
5.2.2 Bottom-Up Scene Identification
The scene identification algorithm that took advantage of the information stored in the locale node tree followed a bottom up approach. It started the search mechanism using the simplest elements of the algorithm, the input image regions and attempted to collect sufficient amount of information about them to determine a highly confident guess about the location of the agent. This operation relied on the reflectance triplets found by the color identification procedure. By traversing the environment model of the agent, it first collected all the regions that might possibly be represented on the input. In other words, all faces that had regions of at least one of the recognized colors were collected together. It should be one or rather the combination of some of these faces that appeared on the input. Besides the name of the identified region and its parent locale, different other statistical information (number of matches of the same color for the face, ratio of the number of matched regions to the total number of regions belonging to the face, etc.) was recorded as well. This data was expected to aid the confidence assignment of the search in the final stages. Following the semantic net traversal, the face statistics information was checked. All the faces with one or more regions identified were grouped together and the node tree was visited. The tree structure was used to strengthen or weaken the probability of a region belonging to a particular face by analyzing its actual location on the input image and its location stored in the environment model. First, all the individual faces were analyzed separately. If a face possessed more than one identified region, the mutual relationship (in the vertical direction) of all was tested. For instance, the node tree specifies that the floor region must be located below all other surfaces. If on the image the olive colored region that was classified as part of the floor based upon its color was located on the top, the confidence level of that match was reduced. The application of the node tree followed a parallel probabilistic model. The confidence measures were registered independently of the ones that might have already been assigned to the other image regions previously.
It was very difficult, throughout this algorithm, to accurately estimate the location of the image regions on the input. The only available information about the regions was the coordinates of their "bounding" rectangles. In other words, the four known indicators measured, in a 480 x 640 coordinate system, the location of the maximum and minimum row and column that were taken by any pixel of a single region. In this way, judgements about the actual size, orientation and the nature of the represented surface could only be very general. For instance, Figure 28, illustrates two regions "embracing" each other. It is true that region A is (partially) located above region B. However, its thin segment curling around region B suggests that it contains the smaller region rather than just bordering it from above. Such a complex relationship could not be recovered by the implemented "above or below" conditions.

Some of the relationships in the node tree model were tested indirectly. The comparison of the model and image data attempted to find a contradiction between these two descriptions. For example, if according to the model, region C had to lie on the same level as region D, then it was tested whether image region C was located immediately above or below region D. The definition of the "above" relationship is illustrated on Figure 29. The four white boxes illustrate all the possible locations that would be accepted as lying "above" the gray. If the two regions, C and D, were not related in any of the demonstrated ways, the positive confidence measure increases. However, if either region C or D were above the other, the negative "confidence" indicator would be augmented.

In the latest version of the scene identification code, the confidence-assigning algorithm differentiated among three indicators: positive, negative and neutral. The positive confidence level of a reference assignment was increased if the locale node tree could verify the location of a region compared to the others. (The supposed wall was above the supposed floor and sofa regions.) The negative confidence level grew if the indicated relationship was apparently violated. (The olive colored region, that only describes the floor in the current model, was above any other image region.) The neutral evaluator was augmented in case no definite statement could be formed about the legitimacy of the region locations. (The regions lied next to each other or there was no vertical path from one to the other in the locale node tree.)
After gaining information from the locale node tree there was an additional feature of the collected faces that was tested for. We checked whether a face had any neighboring surfaces residing in the same locale that also had some probability to match the scene on the input image. That step was important as the majority of the input images displayed several faces at the same time. Some of them were only partially visible, and some of them might have been occluded. In the environment model, though, there was a clear distinction between which region belonged to which face. Therefore, if a face with identified regions had a neighbor with matches the positive confidence value of both of these faces was increased.
5.2.3 Ideas for Future Study
The neighbor finding feature of the algorithm is still under heavy examination. It is currently studied whether searching for secondary and tertiary neighbors would add any useful information to the system. It is also examined whether it is enough to find neighbors on the face’s own locale level or some additional parents should be considered as well. This question is interesting, as objects and furniture that can be moved around easily in the office setting are each represented in separate locales. That convention allows for easy modification of the model. Looking for neighbors only on the same locale level would prevent the system to identify any of the current neighbors of these objects.
After the color and locale identification operations are completed, the exact level of the confidence values has to be calculated. The positive, negative and neutral evaluators should be merged together with the color identification confidence indicators. The final outcome should be able to signal how reliable the transformations and matches are and how certain is the localization guess that the system is making.
Below a certain threshold the algorithm will not provide an answer. Low confidence values can suggest the presence of yet-unseen surfaces. As these could not have been registered by the agent formerly, they cannot be recognized.
5.2.4 Top-Down Scene Identification
There is another, very powerful approach that could be implemented to carry out the localization task after the color identification procedure. In contrast with the above-described bottom up, it would utilize a top down approach. Instead of collecting all the surfaces in the environment model that possess certain color properties and then formulating confidence values that would indicate which sub-locale or face is most likely to be the searched location, the new approach would begin its investigation on one of the highest hierarchical levels. In each locale, it would impose the question whether the agent could be located there. To test the probability for that happening, the algorithm could systematically compare the known surfaces from the model to the ones found on the input. After registering some confidence value to their correspondence, it would continue the comparison with the rest of the locales. At the end, the locale with the highest confidence value would be selected.
One definite advantage of this method would be that the neighboring relations between regions could be more easily recognized than in the previous case. By directly inquiring about the existence of a particular surface color on the input, the search could be guided from regions that are more distinguishable (sofa, bookshelf, etc.) to faces that are very common in the environment (wall, baseboard, etc.). Implementing this approach could get very sophisticated. By knowing the exact parameters for the robot and the three-dimensional environment settings, the recognized image regions could be projected to the model in various angles and from various points. That option might facilitate testing the relationships between the location of individual regions. In case of the locale node tree, only the vertical relationships could be described. And even these restrictions had to be fairly general. As the output of the region analysis method is not a complete description of the surfaces found on the image, more severe rules could not be implemented. However, by being able to use projections, the location of surfaces might be more precisely described. The actual coding of this approach is a future task of the project.
Introducing the top-down scene identification algorithm would also allow for generalizing the localization goals of the system. Currently, with the bottom-up implementation, it is the most characteristic (couple of) face(s) that are searched for. The top-down approach could address the problem from the locale level. By directly questioning the probability of each locale whether it could represent the scene displayed on the input, it would investigate which is the smallest locale in which the input image regions can be identified. That task might be slightly easier with the given data-structures than the aims of the bottom-up strategy.