Computing Representations for Bound and Unbound 3-D Object Matching1

Gary C.-W. Shyi

National Chung-Cheng University, Taiwan, R. O. C.

Robert L. Goldstone John E. Hummel

Indiana University University of California, Los Angeles

Chang-ming Lin

National Chung-Cheng University, Taiwan, R. O. C.

 

 

_________________________________

1The study reported here was supported by a grant from the National Science Council (NSC87-2413-H194-017-G5), Taiwan, Republic of China, awarded to the first author, and National Science Foundation grant SBR-9409232 and a Cattell sabbatical award to the second author. Portions of the results were presented at the 5th Annual Workshop on Object Perception and Memory (OPAM’97), on November 20th, 1997, Philadelphia, PA and at the 2nd International Conference on Cognitive and Neural Systems, on May 28, 1998, Boston. Correspondence regarding this article should be addressed to Gary C.-W. Shyi, Department of Psychology, National Chung-Cheng University, Chiayi, Taiwan 621, R. O. C. or via internet to psycws@ccunix.ccu.edu.tw (GS).

Abstract

Five experiments examined the nature of object representation. Participants made same-different judgments between two multipart 3-D objects according to rules under which either both the object parts and their spatial relations had to be considered (role-relevant, RR) or only the object parts (role-irrelevant, RI). Results indicate that it was easiest to judge two identical and orientationally aligned objects according to either rule, followed by judging objects that shared identical parts located in different positions according to the RI rule. It was most difficult to judge the latter according to the RR rule when they were misaligned by rotation. These findings support the hypothesis that object representations at the image level, the part level, or the full structural description level may be computed and used for making same-different judgments. The implications of our findings for object recognition in general and the role of spatial attention in particular are discussed.

There is a growing debate over whether object recognition is best understood in terms of representations based on structural descriptions or representations resulting from image-based normalization processes. According to the structural description view, the construction of an object representation entails parsing the object into components and binding the components in accordance with their spatial relations (Biederman, 1987; Hummel & Biederman, 1992; Hummel & Stankiewicz, 1996; Palmer, 1978, 1999). According to theories based on normalization, objects are represented in view-specific formats such that their recognition often requires a normalization process (e.g., alignment, mental rotation, or interpolation) that discounts or undoes the transformations embodied in a particular point of view. The transformed representation of the object can then be matched to a model of the object stored in long-term memory (e.g., Edelman & Bülthoff, 1992; Tarr, 1995; Tarr & Pinker, 1989; Tarr & Bülthoff, 1995; Tarr, Williams, Hayward, & Gauthier, 1998; Ullman, 1989, 1996; for a recent review, see Tarr & Bülthoff, 1998). In contrast to the structural description approach, normalization-based theories put more emphasis on the process of normalization than on the nature of the representations (Biederman & Gerhardstein, 1995). It is likely that both kinds of representation (i.e., view-specific and structural description) may be computed and used in the course of object recognition, and the more challenging issue is under what conditions each type of representation is computed and used (Tarr & Bülthoff, 1995; Ullman, 1989, 1996).

For example, Hummel and Stankiewicz (1996) have recently proposed a model of object recognition that represents shape in a hybrid fashion (i.e., as structural descriptions augmented with holistic "views"). Structural descriptions are generated by decomposing an object’s image into parts through synchronizing the outputs of local feature detectors (lines, vertices, etc.) responding to the same part, and desynchronizing detectors responding to separate parts (following Hummel & Biederman, 1992). The synchrony serves to dynamically bind features into parts and parts to their relations. However, synchrony -- and more importantly, asynchrony -- takes time to establish, so initially all of an object’s features will fire together, whether they belong to the same part or not. The model’s hybrid representation of shape permits it to recognize objects even in the event of such binding errors. Although the explicit representation of parts and their relations (by a collection of units called the Independent Geon Array, or IGA) is hindered by improper synchronization of parts, the holistic representation (a collection of units called the Substructure Matrix, or SSM) does not depend on synchrony for binding, and consequently it can drive recognition even in the event of binding errors. However, the SSM is more view-sensitive than the IGA, and so it can only support recognition in familiar views. The model as a whole can recognize objects in familiar views without attention and very quickly (i.e., before parts-based asynchrony is established) via the SSM. Decomposing an object into its parts and explicitly representing the relations among those parts takes more time because dynamic binding requires visual attention. Hummel and his colleagues (Hummel & Stankiewicz, 1996; Stankiewicz, Hummel, & Cooper, 1998) have collected evidence supporting the distinction between the two types of representation.
Stankiewicz et al., for example, showed that the SSM may be used to prime the naming of a previously ignored object presented in the same view as before, whereas the IGA may be used to facilitate naming a previously attended object when it is presented in a different (i.e., reflected) view. It should be noted that the SSM is not a strictly retinotopic, image-like representation in that it can support rapid recognition of objects that are translated in spatial location or transformed in scale (Hummel & Stankiewicz, 1996; Stankiewicz et al., 1998).

The results reported by Shyi and Cheng (1996) are also consistent with Hummel and Stankiewicz’s proposal insofar as the IGA is concerned. Using a modified illusory conjunction paradigm (Prinzmetal, Presti, & Posner, 1986), participants in their study were asked to identify whether a target geon or geon assembly was present in a subsequent two-item display. When a geon was the target, the subsequent display had one of the following three configurations: (a) in the illusory-yes condition, the display included geons that combined the attributes of the target geon (e.g., the edge of the cross section, the size of the cross section, the shape of the elongated axis, and the symmetry of the cross section; see Biederman, 1987) in new arrangements, (b) in the illusory-no condition, exchanging the attributes of the displayed geons could not erroneously form the target geon, and (c) in the identical condition, one of the displayed geons was identical to the target geon. When a geon assembly was the target, the target as well as the subsequently displayed items comprised a vertically oriented geon connected to a horizontally oriented geon in an above-below spatial relation. The display presented following the target assembly consisted of two geon assemblies in one of three configurations analogous to those used when a single geon was the target. Illusory conjunctions were estimated by the difference in errors committed in the illusory-yes and illusory-no conditions. Reliable conjunction errors were found for both the single-geon and the geon-assembly conditions.

In essence, participants in Shyi and Cheng’s study were asked to discriminate between objects, either single geons or geon assemblies, that shared a significant number of attributes, parts, or spatial relations. In such a situation, accurate discrimination among objects requires full identification of each object, and full identification most likely depends upon a complete structural description -- the description of the components of an object and their relations -- which geon theory has proposed to be the basis for object recognition (Biederman, 1987; Hummel & Biederman, 1992). The short exposure time that Shyi and Cheng allowed for viewing the displayed items severely compromised the computation of a complete structural description (i.e., the IGA) and, as a result, led to impoverished extraction of part attributes, poorly articulated specification of the spatial relations among parts, or both. In terms of Hummel and Stankiewicz’s model, the errors they obtained, including the conjunction errors, were a consequence of insufficient or incomplete binding due to limited time.

The goal of the present study was to shed further light on the kinds of representation that may be computed and used in the course of object recognition. In particular, we attempted to empirically identify conditions that would implicate the use of different kinds of representation when participants were asked to make same-different judgments between pairs of multipart 3-D objects. Our hypothesis is that 3-D objects can be matched on the basis of one of three levels of representation. The first level involves the output from analyzing the 2-D images. This output need not be a literal copy or template of the image; rather, it can be a view-specific representation (Lawson & Humphreys, 1996), or a representation akin to Hummel and Stankiewicz’s (1996) SSM or to Tarr’s notion of multiple-view representations (Tarr, 1995; Tarr & Pinker, 1989, 1990; see also Tarr & Bülthoff, 1998). The second is a level where representations of the parts or components comprising a multipart object are computed (Biederman, 1987; Biederman & Cooper, 1991; Hummel & Biederman, 1992; Marr & Nishihara, 1978). Note that this level of processing only yields the constituent components or parts of an object; its outputs do not include the spatial relations that each part or component has relative to other parts or components. The third level of processing has representations of the full structural description as its output: not only are the parts or components identified, but so are the spatial relations that each part bears to the other parts. The three levels of output do not necessarily imply a fixed sequence of processing. Rather, the three levels might be part of a cascaded neural network (Hummel & Biederman, 1992; Hummel & Stankiewicz, 1996; Rumelhart & McClelland, 1982). 
Nonetheless, this hypothesis does imply an ordering of the relative difficulty and efficiency with which each level of representation can be computed, with the 2-D image level being the easiest and quickest, the full structural description level being the most difficult and time-consuming, and the part level lying in between. The experiments reported here were designed to identify conditions under which matching two multipart 3-D objects may rely upon representations computed at each of these three levels.

Our approach to this question borrows from a model, called SIAM (for Similarity, Interactive Activation, and Mapping), recently developed by Goldstone and his colleagues (Goldstone, 1994; Goldstone & Medin, 1994; Medin, Goldstone, & Gentner, 1993) to account for the process of similarity comparison. There are a number of important features of SIAM that suggest a natural link between the model and the one proposed by Hummel and Stankiewicz (1996) (i.e., JIM2). According to SIAM, for example, when structured scenes are compared, the parts of one scene must be aligned, or placed in correspondence, with parts of the other scene. Correspondences can include both the proper bindings between features (e.g., the shape and color within a part) and the binding of a part to the spatial role it has relative to other parts. Another important feature of the SIAM model is its emphasis on the time course of comparison, as an attempt to provide a unified account of the disparate results obtained from a variety of similarity tasks used in the past. Specifically, SIAM predicts qualitative shifts in the comparison process in that local matches between features strongly influence similarity early on; with time, however, global consistency among correspondences becomes more important (Goldstone, 1994; Goldstone & Medin, 1994). Consistent with this prediction, Goldstone and Medin (1994) found that when participants were given a short deadline to make same/different judgments, locally determined, improperly bound matching features exerted a larger influence on judgments than they did when participants were given a longer period of time to respond.

The emphasis on time course in SIAM is consistent with the time required for dynamic binding, and thus structural descriptions, to be established. For the purpose of object recognition, the JIM2 model would suggest that image-like, view-specific representations are computed and used either when there is limited time or when the object to be recognized is shown from a highly familiar point of view. In contrast, representations of the complete structural description are computed and used when object recognition requires full identification of an object’s parts and the spatial relations among them. The computation of a full structural description requires more time than the construction of a viewpoint-specific image. For the purpose of similarity comparisons, matches between local features would play a greater role when there is limited time available (as is often the case with same-different judgment tasks), whereas global consistency among corresponding features would become increasingly influential as the available time increases (as is often the case in similarity rating tasks).

In the five experiments reported here, we examined participants’ performance in matching multipart 3-D objects where a target object may or may not match a subsequently or simultaneously displayed sample object. The judgments were made when the relative spatial locations between the object parts were either relevant or irrelevant (Goldstone, 1994; Goldstone & Medin, 1994; Proctor & Healy, 1985), and when target and displayed objects were presented in either the same or different orientations in the pictorial-frontal plane.

Experiment 1

Proctor and Healy (1985) asked participants to make same-different judgments between two 3-letter strings in accordance with one of two rules. Under the order-relevant rule, the strings were classified as same when the letters in each string were identical and occupied the same locations. Under the order-irrelevant rule, the strings were classified as same as long as the letters in each string were identical, regardless of whether they occupied the same locations. Proctor and Healy found that it was much easier for participants to make judgments based on the order-relevant rule than on the order-irrelevant rule, both in terms of reaction time and error rate. The difference was most evident in cases where an identical set of letters was rearranged in the two strings (e.g., ABC vs. CBA). Whereas participants could easily respond different according to the order-relevant rule, they could not as easily respond same according to the order-irrelevant rule. Goldstone and Medin (1994) reported a similar finding. Participants in their study were asked to make same-different judgments, according to either a role-relevant rule or a role-irrelevant rule, to a set of colored squares forming a cross shape. These rules were analogous to those used by Proctor and Healy (1985); for instance, according to the role-relevant rule, for the colored patterns to be judged same, the same-colored squares had to occupy the same locations within the cross shape. Like Proctor and Healy, Goldstone and Medin found that, on average, the error rate and response latency for "different" judgments were substantially lower for the role-relevant group than for the role-irrelevant group.

Such findings are hard to reconcile with the predominant view that visual processing begins by breaking objects down into features, and that initially features are processed in an unbound manner (Treisman, 1988, 1993; Wolfe, 1998). One possible reason that performance under the role-relevant (RR) rule is better than under the role-irrelevant (RI) rule is that participants may be able to respond as though they are sensitive to binding because they are using an image-like representation. This has limitations, however. If the objects were misaligned with respect to orientation, for instance, then image-like representations would not be as dependable as when the objects were aligned. The main goal of Experiment 1, therefore, was to identify conditions under which role-relevant judgments are easy to make and those under which role-irrelevant judgments are easy to make. Another goal of Experiment 1 was to see whether the time course of matching 3-D objects resembles that obtained with 2-D patterns, in that featural information plays a larger role during the initial stage of the matching process, whereas relational information becomes more important as the correspondences between objects become more developed over time (Goldstone, 1994; Goldstone & Medin, 1994).

Method

Participants. One hundred and nineteen undergraduate students from Indiana University served as participants in order to fulfill a course requirement. Fifty-eight and 61 participants were in the role-irrelevant and role-relevant groups, respectively.

Materials. Objects were designed using three-dimensional rendering software (Infini-D R2.6, Specular International) on a Power Macintosh computer. Each object consisted of three components -- a large cut-off cone and two smaller geometric shapes attached to the upper left and lower right portions of the cone. Sample stimuli are shown in Figure 1. The smaller components were separated by approximately 140 degrees in depth. The smaller shapes were chosen from the following set: cone, rectangular prism, wedge, and cylinder. These shapes were created so as to vary on two orthogonal geon aspects -- curvature and taper.

 

----------------------------------

Insert Figure 1 about here

----------------------------------

A set of 12 objects was created by combining every possible pair of small shapes under the constraint that no shape was paired with itself (see Figure 1). Thus, letting the letters A, B, C, and D represent the four small shapes and letting the order of the letters indicate whether a shape is on the left or right side of the object, the following objects were constructed: AB, AC, AD, BA, BC, BD, CA, CB, CD, DA, DB, and DC. This set of 12 objects was replicated at three different orientations: 0, 120, and 240 degrees. All rotations were performed in the plane of the screen, as shown in the lower panel of Figure 2.
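The stimulus set just described can be sketched in a few lines; this is our own illustrative encoding (the shape labels A-D and the orientation values come from the text, but the variable names and data layout are assumptions):

```python
from itertools import permutations

SHAPES = ["A", "B", "C", "D"]    # cone, rectangular prism, wedge, cylinder
ORIENTATIONS = [0, 120, 240]     # degrees of in-plane rotation

# Every ordered pair of distinct small shapes; position in the string
# encodes the left vs. right attachment site on the large cone.
objects = ["".join(p) for p in permutations(SHAPES, 2)]
assert len(objects) == 12        # AB, AC, AD, BA, ..., DC

# Replicating the 12 objects at three orientations yields 36 stimuli.
stimuli = [(obj, rot) for obj in objects for rot in ORIENTATIONS]
assert len(stimuli) == 36
```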

--------------------------------

Insert Figure 2 about here

--------------------------------

Objects appeared on a black background and the screen itself was black. Each object was 6.35 cm wide by 4.32 cm tall before rotation, subtending a visual angle of about 7.85° × 5.36° at a viewing distance of approximately 46 cm.

Procedure. On each trial, participants were shown two objects and were instructed to respond as to whether the two objects were the same or different. The role-irrelevant and role-relevant groups were given different instructions for deciding whether two objects were the same. For the role-relevant group, for two objects to be the same, they had to possess the same shapes, and these shapes had to be in the same positions within the two objects. Thus, the two objects under the 2 MIPs column in Figure 2 are the same for the role-relevant group, but the objects in the 2 MOPs column are different because they have the same shapes but in different positions. If two objects were identical but differently rotated, participants were instructed to respond "same." For the role-irrelevant group, two objects were the same as long as they had all the same shapes, regardless of the positions of these shapes. Thus, participants in this group should respond "same" when presented with the objects related by 2 MOPs, but the objects related by 1 MIP would be different because they do not have all of their shapes in common.

There were four different types of relation between the two compared objects, each of which is represented in Figure 2. These four trial types are: 2 MIPs (Matches In Place), 2 MOPs (Matches Out of Place), 1 MIP, and 1 MOP. On 2 MIPs trials, the two objects were identical, although one may have been rotated with respect to the other. On 2 MOPs trials, the objects had the same shapes, but the positions of the two small shapes were swapped. On 1 MIP trials, the objects shared one shape (in addition to the large common base) and this shape occupied the same position within the objects prior to any rotations. On 1 MOP trials, the objects shared one shape but this shape occupied different positions in the two objects. The correct responses for the role-irrelevant and -relevant groups were identical for all trial types except for the 2 MOPs trials, which were properly called "same" for the role-irrelevant group and "different" for the role-relevant group. When instantiating each of the four trial types, the standard object was chosen randomly from the set of 12 objects, and then was altered in one of the four ways. When the object could be transformed in more than one way to implement a particular trial type, the transformed object was selected at random.
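The mapping from trial type to correct response under the two rules can be summarized as a small lookup; the function name and trial-type codes below are our own hypothetical encoding of the rules stated above:

```python
def correct_response(trial_type: str, group: str) -> str:
    """Return the correct answer for a trial type under each group's rule."""
    if trial_type == "2MIP":    # identical objects (up to rotation)
        return "same"           # "same" for both groups
    if trial_type == "2MOP":    # same shapes, but positions swapped
        return "same" if group == "role-irrelevant" else "different"
    # 1 MIP and 1 MOP: only one small shape in common -> "different" for both
    return "different"
```

As the text notes, 2 MOPs is the only trial type on which the two groups' correct answers diverge.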

On each trial, a standard object was presented at a random location on the left half of a Macintosh IIci computer screen for 1.7 seconds. The object was then removed, and after a 16.67-msec interstimulus interval a transformed version of the standard object was displayed at a random location on the right side of the screen for one of three deadlines: 300, 600, or 900 msec. Participants responded "same" by pressing the "s" key and "different" by pressing the "d" key. Participants were required to make their "same" or "different" response within 500 ms of the offset of the second object. Thus, the offset of the second object can be viewed as a cue for participants to respond, although participants could also respond prior to the offset of the second object. Participants received immediate feedback as to whether their response was correct and were shown the correct response. If participants responded more than 500 ms after the offset of the second object, the word "Overtime" was displayed.

Participants saw 288 trials in all, consisting of four blocks of 72 trials. Each block consisted of a set of 12 trials repeated six times by factorially combining two levels of orientation (same or different) and three deadline levels. Within the set of 12 trials, there were several repetitions of each of the four trial types. In order to give the role-irrelevant and role-relevant groups an equal number of same and different trials, the numbers of 2 MIPs, 2 MOPs, 1 MIP, and 1 MOP trials were 6, 2, 2, and 2 for the role-relevant group and 3, 3, 3, and 3 for the role-irrelevant group. In addition to producing 50% "same" trials for each group, this method of assigning frequencies also makes each of the trial types that produce the same response for a group equally likely. The ordering of all trials was randomized.
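The trial-frequency scheme above can be checked directly; the dictionary layout and trial-type labels are our own encoding of the counts given in the text:

```python
# Trials per 12-trial set for each group.
freqs = {
    "role-relevant":   {"2MIP": 6, "2MOP": 2, "1MIP": 2, "1MOP": 2},
    "role-irrelevant": {"2MIP": 3, "2MOP": 3, "1MIP": 3, "1MOP": 3},
}
# Trial types that count as "same" under each group's rule.
same_types = {"role-relevant": {"2MIP"}, "role-irrelevant": {"2MIP", "2MOP"}}

for group, counts in freqs.items():
    total = sum(counts.values())
    same = sum(n for t, n in counts.items() if t in same_types[group])
    assert total == 12            # one 12-trial set
    assert same / total == 0.5    # 50% "same" trials for each group

# 12-trial set x 6 repetitions (2 orientations x 3 deadlines) = 72 per block;
# 4 blocks = 288 trials in all.
assert 12 * 6 == 72 and 72 * 4 == 288
```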

The rotation of the standard object was randomly selected from 0, 120, and 240 degrees. On "same orientation" trials, the identical amount of rotation was assigned to the transformed object. On "different orientation" trials, one of the two rotations not given to the standard object was randomly selected. The correct response to a comparison was determined solely by the trial type and not the orientation difference between the compared objects.
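The orientation-assignment procedure just described amounts to the following sketch; the function and variable names are our own illustration, not the authors' code:

```python
import random

def assign_rotations(same_orientation: bool) -> tuple:
    """Pick rotations (in degrees) for the standard and transformed objects."""
    orientations = [0, 120, 240]
    standard = random.choice(orientations)       # random standard rotation
    if same_orientation:
        transformed = standard                   # identical amount of rotation
    else:
        others = [o for o in orientations if o != standard]
        transformed = random.choice(others)      # one of the two other rotations
    return standard, transformed
```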

The interval between trials was 1800 msec. Participants were given detailed instructions on the criteria for making "same" and "different" responses, including several pictorial examples. The experiment required about 40 minutes to complete. Participants received rest breaks after every block of trials, during which they were told their average response time and percentage of correct responses.

Design. The dependent variable of principal interest was the percentage of correct responses (excluding overtime responses). The main independent variables were deadline (3 levels), trial type (2 MIPs, 2 MOPs, 1 MIP, or 1 MOP), orientation (same or different), and instructions (role-irrelevant or role-relevant). All but the last variable were manipulated within participants, giving rise to a 3 X 4 X 2 X 2 factorial design.

Results and Discussion

Overall analyses. Only responses made prior to the expiration of the deadline were entered into the data analyses. Overtime responses were excluded, accounting for an average data loss of 7.85% (7.63% in the RI group and 8.05% in the RR group). The mean accuracy for each condition was first submitted to a 2 (group: RR vs. RI) X 4 (block) X 4 (type) X 2 (orientation: same or different) X 3 (deadline: 300, 600, 900) mixed analysis of variance (ANOVA). The main effect of group was reliable, F(1, 117) = 8.66, MSE = .15, p < .0001. As shown in Figure 3, overall accuracy was higher for the role-irrelevant (RI) group (M = .70) than for the role-relevant (RR) group (M = .61). This finding fails to replicate the previous findings of Proctor and Healy (1985) and Goldstone and Medin (1994) and indicates that the general trend for RR judgments to be easier than RI judgments did not generalize to connected 3-D objects.

--------------------------------

Insert Figure 3 about here

--------------------------------

The main effect of orientation and its interaction with group were both reliable, F(1, 117) = 19.60, MSE = .08, p < .0001, and F(1, 56) = 17.10, MSE = .08, p = .0001, respectively. The main effect of block was also reliable, F(3, 351) = 62.42, MSE = .19, p < .0001. The main effect of type and its interaction with group were both reliable, F(3, 351) = 43.84, MSE = .22, p < .0001, and F(3, 351) = 26.51, MSE = .22, p < .0001, respectively. The two-way interaction of orientation and type as well as the three-way interaction of orientation, type, and group were reliable, F(3, 351) = 12.98, MSE = .09, p < .0001, and F(3, 351) = 5.93, MSE = .09, p = .0006, respectively. The interaction of block and type was also reliable, F(9, 1093) = 3.33, MSE = .09, p = .005. Finally, the main effect of deadline, F(2, 234) = 77.21, MSE = .11, p < .0001, and a number of higher-order interactions involving deadline were reliable, including the three-way interaction of group, type, and deadline, F(6, 702) = 2.28, MSE = .07, p = .035, the three-way interaction of type, orientation, and deadline, F(6, 702) = 2.67, MSE = .07, p = .014, and the four-way interaction of block, group, type, and deadline, F(18, 2106) = 1.85, MSE = .07, p = .016. None of the other interaction effects were reliable, F’s < 1 or p’s > .17. Given that quite a number of the reliable results involve an interaction with the group factor, and given the purpose of the experiment, further analyses were performed separately for the RR and RI groups in order to reveal more clearly the nature of the foregoing findings.

The mean accuracy for each condition was therefore submitted to a 4 (block) X 4 (type) X 2 (orientation: same or different) X 3 (deadline: 300, 600, 900) repeated-measures analysis of variance (ANOVA), separately for participants in the role-relevant (RR) condition and those in the role-irrelevant (RI) condition. The results are reported for each group below.

Role-irrelevant group. The main effect of block was reliable, F(3, 171) = 51.61, MSE = .13, p < .0001, reflecting a general trend of improvement as participants became more experienced with the task (the mean accuracy was .60, .71, .73, and .76 for blocks 1 to 4, respectively). The main effect of type was also reliable, F(3, 171) = 24.64, MSE = .13, p < .0001. As shown in the left panel of Figure 3, participants performed best when the to-be-matched stimuli were identical to the standard (i.e., the 2 MIPs trials) (M = .77), followed by stimuli with two matches that were out of place (2 MOPs) (M = .71), and worst with those of 1 MIP (M = .66) or 1 MOP (M = .67). The contrasts between the 2 MIPs trials and the other three were all reliable, F(1, 57) = 32.12, p < .0001, F(1, 57) = 49.77, p < .0001, and F(1, 57) = 47.40, p < .0001, for 2 MOPs, 1 MIP, and 1 MOP, respectively. Likewise, the contrasts between 2 MOPs and 1 MIP and between 2 MOPs and 1 MOP were reliable, F(1, 57) = 8.59, p = .0049, and F(1, 57) = 5.10, p = .028, respectively. However, the difference between 1 MIP and 1 MOP was not reliable, F(1, 57) = 1.87, p = .18. The interaction between block and type was only marginally reliable, F(9, 513) = 1.73, p = .08.

The main effect of orientation was not reliable, F < 1; its interaction with type was, however, F(3, 173) = 8.93, MSE = .073, p < .0001. As shown in the left panel of Figure 4, when the to-be-matched stimuli assumed the same orientation as the standard, the mean accuracy was .79, .69, .64, and .69 for 2 MIPs, 2 MOPs, 1 MIP, and 1 MOP, respectively. The differences among the four types were reliable, F(3, 171) = 26.68, MSE = .008, p < .0001. Further analyses revealed that all pairwise comparisons were reliable except that between 2 MOPs and 1 MOP: F(1, 57) = 53.81, p < .0001 for 2 MIPs vs. 2 MOPs, F(1, 57) = 56.55, p < .0001 for 2 MIPs vs. 1 MIP, F(1, 57) = 40.58, p < .0001 for 2 MIPs vs. 1 MOP, F(1, 57) = 4.77, p = .033 for 2 MOPs vs. 1 MIP, F < 1 for 2 MOPs vs. 1 MOP, and finally, F(1, 57) = 11.80, p = .0011 for 1 MIP vs. 1 MOP.

A different picture emerges when the to-be-matched stimuli assumed different orientations than the standard. The mean accuracy was .74, .73, .68, and .66 for 2 MIPs, 2 MOPs, 1 MIP, and 1 MOP, respectively. The difference among the four types was again significant, F(3, 171) = 11.48, MSE = .008, p < .0001. However, further analyses revealed that the difference between 2 MIPs and 2 MOPs was not reliable, F(1, 57) = 1.08, p = .30, nor was the difference between 1 MIP and 1 MOP, F(1, 57) = 1.91, p = .17. The differences appear to lie in whether the number of matches was one or two. That is, all the contrasts between one and two matches were reliable: F(1, 57) = 13.69, p = .0005 for 2 MIPs vs. 1 MIP, F(1, 57) = 27.2, p < .0001 for 2 MIPs vs. 1 MOP, F(1, 57) = 6.76, p = .012 for 2 MOPs vs. 1 MIP, and finally, F(1, 57) = 14.14, p = .0004 for 2 MOPs vs. 1 MOP.

---------------------------------

Insert Figure 4 about here

---------------------------------

Looking at the data from a different angle, we noted that for both 2 MIPs and 1 MOP stimuli, participants performed better when the objects were presented in the same orientation than in different orientations (see Figure 4). The opposite pattern -- worse performance in the same orientation than in different orientations -- was found for 2 MOPs and 1 MIP stimuli.

Finally, the main effect of deadline, as expected, was highly reliable, F(2, 114) = 50.03, MSE = .003, p < .0001, reflecting the fact that the more time participants had to make judgments, the more accurate they were (M’s = .65, .71, and .75 for deadlines of 300, 600, and 900 ms, respectively). All pairwise comparisons among the performances under the three deadlines were highly reliable.

No other main effects or interactions were found reliable, F’s < 1.67 or p’s > .13.

Role-relevant group. The pattern of results was more complicated for participants in the role-relevant condition. As with the RI group, the main effect of block was reliable, F(3, 180) = 20.85, MSE = .24, p < .0001, again reflecting the general trend of performance improvement across blocks. The mean accuracy was .53, .63, .65, and .65 from blocks 1 to 4. The main effect of type was highly reliable, F(3, 180) = 40.86, MSE = .31, p < .0001, as was its interaction with block, F(9, 540) = 2.62, MSE = .10, p = .0057. As shown in the right panel of Figure 3, participants' performance was best when the to-be-matched stimuli were identical to the standards (2 MIPs) (M = .70), followed by those that had one match out of place (1 MOP) (M = .69) and those that had one match in place (1 MIP) (M = .59), and was worst with those that had two matches out of place (2 MOPs) (M = .49). The differences between 2 MIPs and 2 MOPs and between 2 MIPs and 1 MIP were reliable, F(1, 60) = 60.54, p < .0001, and F(1, 60) = 35.74, p < .0001, respectively. However, the difference between 2 MIPs and 1 MOP was only marginally reliable, F(1, 60) = 3.59, p = .06. The accuracy for 1 MIP was reliably greater than that for 2 MOPs, F(1, 60) = 17.22, p = .0001; likewise, the accuracy for 1 MOP was greater than that for 2 MOPs, F(1, 60) = 64.58, p = .015. Finally, the accuracy for 1 MOP was higher than that for 1 MIP, F(1, 60) = 28.09, p < .0001.

The interaction between block and type, as depicted in Figure 5, reflects the general trend that participants’ performance quickly reached asymptote during the second block of trials, a trend found for all four types of stimuli. The interaction was driven mainly by the absolute differences in performance among the four types of stimuli.

---------------------------------

Insert Figure 5 about here

---------------------------------

The main effect of orientation was highly reliable, F(1, 60) = 28.15, MSE = .11, p < .0001, indicating that participants in general performed better with matched stimuli having the same orientation as the standard (M = .64) than with those having a different orientation (M = .59). However, the interaction of orientation and type, shown in Figure 4, reveals that this was the case for stimuli of 2 MIPs (M’s = .75 and .66 for the same and different orientation, respectively), 2 MOPs (M’s = .53 and .45 for the same and different orientation, respectively), and 1 MOP (M’s = .69 and .66 for the same and different orientation, respectively), but it was not the case for those of 1 MIP (M’s = .58 and .61 for the same and different orientation, respectively).

As for the participants in the role-irrelevant group, the main effect of deadline was highly reliable, F(2, 120) = 29.02, MSE = .004, p < .0001, reflecting the trend that as the deadline increased, participants' accuracy increased. The mean accuracy for the three deadlines was .58, .61, and .66, respectively, from shortest to longest; pairwise comparisons among them were all significant: F(1, 60) = 10.97, p = .0016 for deadlines of 300 and 600 ms, F(1, 60) = 47.24, p < .0001 for deadlines of 300 and 900 ms, and F(1, 60) = 24.54, p < .0001 for deadlines of 600 and 900 ms. The two-way interaction of deadline and type, as well as the three-way interaction of deadline, type, and block, were found reliable, F(6, 360) = 2.35, MSE = .078, p = .031, and F(18, 1080) = 2.04, MSE = .085, p = .0062. As can be seen in the four panels of Figure 6, although participants' performance increased as a function of deadline, the pattern of increment varied depending on the type of stimuli as well as on how far into the experiment participants were. In particular, participants seemed to experience a great deal of difficulty when they tried to make judgments with stimuli that shared two features with the standard but whose shared features were not in the same spatial locations as their counterparts in the standard (i.e., 2 MOPs). With those stimuli, participants performed essentially at chance, and this chance-level performance did not change much even when participants had more time to respond (see, in particular, blocks 1, 3, and 4), which contributed to the higher-order interactions.

---------------------------------

Insert Figure 6 about here

---------------------------------

Discussion

In Experiment 1, we created a number of conditions that may require computing different kinds of representations to support participants’ performance. In particular, we sought to determine whether judgments that can be based on images are easier to make than those that require representations of parts, which in turn are easier than those that require full structural descriptions. In what follows, we discuss in turn the main findings from the RI and RR groups to evaluate the foregoing hypothesis. Most importantly, our goal is to provide an account that reconciles our findings with those of previous studies, which have primarily used 2-D patterns as stimuli (Goldstone, 1994; Goldstone & Medin, 1994; Proctor & Healy, 1985).

Role-irrelevant group. Participants in the RI group in general performed better than those in the RR group. Moreover, the relative rankings in performance among the four types of stimulus objects differed considerably between the two groups. For the RI group, participants were able to maintain a relatively high level of accuracy, with the number of matches serving as the dividing line for performance--the more matches there were, the better the performance (see Figure 3), regardless of whether or not the matches were in place. It is, however, interesting to note that there was a reliable interaction between type of object and orientation for the RI group. An inspection of the left panel of Figure 4 reveals that it was easier for participants to respond "same" to 2 MIPs objects presented in the same orientation as the standard than to those presented in a different orientation. This finding is easy to understand in that 2 MIPs objects are actually identical to the standard. When presented in the same orientation, the standard and the corresponding 2 MIPs object can be compared and matched on the basis of 2-D images. That is, participants can adequately compare the stimuli without (the output of) processing that goes beyond encoding the 2-D images. The reason that participants’ performance in that condition fell below a near-perfect ceiling may have to do with the fact that the to-be-compared object was presented after the offset of the standard, and the nonoptimal performance may reflect a limitation due to memory loss (and location uncertainty). In contrast, participants may not be able to make their judgments solely on the basis of 2-D images when objects are presented in different orientations. Rather, the processing may have to proceed to the level where the parts comprising the objects are parsed and identified.
Since a "same" response only requires the compared objects to have the same parts, the RI participants can make their judgments at the moment the objects were parsed into constituent components. The additional processing other than encoding the 2-D images explains why 2 MIPs presented in different orientations were more difficult to judge.

The pattern for judging the 2 MOPs objects was reversed. It was easier for the RI participants to respond "same" to 2 MOPs objects when they were presented in an orientation different from that of the standard than when they were presented in the same orientation. The inferior performance with the same orientation trials may be explained in terms of a competition between the image-level processing and the part-level processing. At the image-level, presenting a 2 MOP object in the same orientation as the standard may signal or prime a "different" response, which is not accurate for the 2 MOPs objects. The RI participants may have to withhold that answer until the result of part-level processing is available, which presumably would strongly signal a "same" response. On the other hand, when 2 MOPs were presented in a different orientation, participants may have to forego the output of the image-level processing with the realization that such outputs are not dependable for objects presented in apparently different orientations. They have to wait for the result of more advanced or complete processing, namely the part-level processing, before they can respond. Note that for RI participants, there really were no conditions where their responses would have to rely upon full structural descriptions of the presented objects. At most, the RI participants would need outputs from the part-level processing to be able to identify the components of each object. It should also be noted that part-level processing does not entail binding attributes other than those within the boundary of an individual part or geon (i.e., the edge and size of cross-section, symmetry of cross section, and the main axis, see Biederman, 1987, 1995). For the RI participants the actual role that each part plays in relation to other parts within the same object is, by definition, irrelevant. 
The fact that 2 MIPs and 2 MOPs presented in different orientations resulted in about the same level of performance suggests that participants’ judgments in those conditions were based on the same level of outputs.

There was a slight disadvantage in judging the 1 MIP objects relative to the 1 MOP objects when both were presented in the same orientation as the standard. The disadvantage disappeared, however, when they were shown in orientations different from that of the standard. The slight disadvantage may be accounted for by assuming, again, that participants relied upon 2-D images for their judgments. At the level of the 2-D image, objects that were 1 MIP to each other are more similar than those that were 1 MOP to each other; therefore it was harder to respond "different" to 1 MIP displays than to 1 MOP displays. When judging either the 1 MIP or the 1 MOP objects presented in a different orientation, participants would refrain from responding solely on the basis of the 2-D image and had to wait for the output of the part-level processing. Unlike the part-level output for either the 2 MIPs or 2 MOPs objects, however, the outputs for both 1 MIP and 1 MOP objects entailed a shared attached part and a nonshared attached part. Such outputs presumably would have reduced the overall similarity between compared objects, and yet were not strong enough to signal a clear "different" response, leading to a lower level of performance. It is interesting to note that RI participants' performances for 1 MIP and 1 MOP objects were at almost identical levels. Taken together with the fact that their performances for 2 MIPs and 2 MOPs objects were also at about the same level, these findings suggest that the judgments were made based on representations computed at the part level. The fact that performance levels for the 2 MIPs and 2 MOPs objects were superior to those for the 1 MIP and 1 MOP objects is also consistent with the proposal that those judgments were made based on the part-level outputs. In summary, participants’ performance according to a RI rule appears largely accounted for by appealing to representations computed either at the 2-D image level or at the part level.
Whether or not the to-be-matched objects were displayed in the same orientation as the standard serves as a cue for deciding which level of representation should be used for making a judgment.

Role-relevant group. For RR participants, it was as easy for them to respond "same" to 2 MIPs as to respond "different" to 1 MOP, and they evidently experienced great difficulty in responding "different" to both 2 MOPs and 1 MIP objects (see Figure 3). Furthermore, as can be seen from the right panel of Figure 4, in judging the 2 MIPs objects presented in either the same or different orientation than the standard, the RR participants exhibited the same pattern of performance as the RI participants, albeit at a somewhat lower absolute level of performance. It is tempting to think that the same explanation can be applied here, namely, that 2 MIPs were judged primarily based on 2-D images. Although this may actually be the case for 2 MIPs presented in the same orientation, it is not the primary account for those presented in the different orientations. Note that for RR participants, the objects not only had to have the same parts but also the parts had to play the same spatial roles in order to receive a "same" response. Therefore, the lower performance associated with 2 MIPs objects shown in different orientations reflects a combination of processing that includes both the parsing of the object into its constituent parts and apprehending the spatial roles that each part played in relation to the base cone. This latter processing may also have existed for judging 2 MIPs presented in the same orientation, which explains why RR participants exhibited slightly worse performance than their RI counterparts. One implication from the foregoing argument is that the spatial binding relationship (or the part-location conjunction) between parts of an object can be computed with relatively high efficiency (Saiki & Hummel, 1996, 1998). 
As discussed shortly, however, whether or not such efficient binding would bring benefits or costs to participants’ actual performance may depend upon whether the parts shared by the compared objects also share the same spatial roles in each host object.

The RR participants exhibited the same pattern, and about the same level, of performance in judging 1 MOP objects as in judging 2 MIPs objects, suggesting that it was relatively easy for them to say "different" to 1 MOP objects. Consider the case when the standard and a 1 MOP object were shown in the same orientation. In this case, participants could mostly rely upon representations computed at the 2-D image level to make their judgments, because the 2-D images of the standard and a 1 MOP object would look rather dissimilar. On the other hand, when the standard and the 1 MOP objects were shown in different orientations, participants may have had to withhold their response until the outputs at the part level were available, as we suggested earlier. That is, the responses were postponed until the objects were parsed into constituent parts.

Even when the to-be-matched object was presented in the same orientation as the standard, the RR participants were uncertain whether or not they should respond "different" to 1 MIP and 2 MOPs objects. It is interesting to note that while RR participants performed at about the same level as RI participants in judging 1 MOP displays presented in the same orientation, RR participants performed at a lower level than their RI counterparts in judging 1 MIP displays. The RR participants in general were more concerned with the binding between a part and its spatial role. The correct binding partly exhibited in the 1 MIP displays may have increased the overall similarity more than in the 1 MOP displays, and therefore it was harder for participants to respond "different" to 1 MIP.

The most interesting result probably is the fact that RR participants had a very hard time judging 2 MOPs objects; in particular, they were severely disturbed by 2 MOPs presented in an orientation different from that of the standard. What could have caused the essentially chance-level performance of the RR participants? It is puzzling to note that the 2 MOPs objects, presented in the same orientation as the standard, also exhibited the lowest level of performance among the four types. This result suggests that participants' responses were not merely based on the 2-D images; otherwise, given the dissimilarity in 2-D images, participants should have performed relatively well with 2 MOPs objects. Note that for RR participants, there were a number of occasions where computation of full structural descriptions was required for correct responses, such as when the 2 MIPs and 2 MOPs objects were rotated out of alignment with the standard. Participants may have had no choice but to exercise extra caution in making their judgments, especially when the outputs at the part level signaled that the to-be-matched object shared the same parts with the standard. The difficulty in judging the 2 MOPs can then be accounted for by the additional requirement of computing full structural descriptions, compounded by confusions resulting from memory loss. Given the RR rule, the 2 MOPs objects, in a sense, represent an illusory part-location conjunction of the standard (Saiki & Hummel, 1996). There has been ample evidence suggesting that such a condition would require the aid of attention for proper resolution (or conversely, that the lack of attention would exacerbate erroneous conjunctions) (Treisman, 1988, 1993; Treisman & Gelade, 1980; Treisman & Schmidt, 1982; Prinzmetal, Presti, & Posner, 1986; Shyi & Cheng, 1996).
The fact that participants' performance with 2 MOPs objects presented in a different orientation than the standard was even worse than that presented in the same orientation suggests that the difficulty in judging 2 MOPs objects was further aggravated by the need to undo the rotation.

In a recent study, Saiki and Hummel (1996) demonstrated that the conjunction of a part and its location in relation to other parts within the same object plays a more dominant role than the conjunction of features within a part (e.g., its shape and color) in learning object categories. Categories of objects distinguishable in terms of part-location conjunctions were learned better and faster than those distinguishable in terms of, say, part-color conjunctions. Saiki and Hummel attributed the superior categorization based on part-location conjunctions to participants' sensitivity to such configural information, although it was not clear from where such sensitivity arose. It also was not clear whether such sensitivity would aid category learning in a cost-free manner, without consuming much, if any, processing resources, or whether the sensitivity came at a price in processing resources. Our data, and those of Shyi and Cheng (1996), seem to suggest that the binding of a part and its relative location, for connected 3-D objects at least, can be computed with high efficiency. Such efficiency is a double-edged sword, however: if the part-location conjunctions of the to-be-matched object were identical to those in the standard object, they would speed up the matching process and aid the accuracy of performance. Conversely, if the part-location conjunctions mismatched those in the standard object, then additional effort (attention) would be required to alleviate the impact of such binding.

In summary, the results of the RR group suggest that for those participants, judgments were often made relying upon representations computed at levels higher than 2-D images, namely the part level or the level of structural descriptions. Full structural descriptions are difficult to derive accurately within a limited amount of time or without the aid of attention.

Comparing role-irrelevant and role-relevant groups. Given the complexity of our results, it would be helpful to highlight a number of findings by comparing participants' performance between the role-irrelevant and the role-relevant groups, and to point out how those results can be accounted for by the trichotomy of representations proposed here. First, for both RR and RI groups, participants were most accurate judging 2 MIPs of the same orientation, strongly suggesting that they relied on 2-D image representations for judging those displays. Second, the difference between 1 MOP and 1 MIP displays was larger for RR than for RI, in both the same and different orientations. For RR participants, the spatial role of each part had to be considered (i.e., part-level representations) when objects to be compared were shown in different orientations. For RI participants, in contrast, 2-D images could generally be used, leading to accurate performance. Third, while there was only a general influence of deadline on the overall performance of RI participants, its influence was more complex for RR participants. In particular, RR participants were not able to fully utilize the available time to construct the structural descriptions needed for judging 2 MOPs displayed in different orientations. In summary, the results for the RR and RI same-orientation conditions provide evidence for image-level representations, those for the RI different-orientation condition provide evidence for part-level representations, and those for the RR different-orientation condition provide evidence for full structural descriptions.

Experiment 2

In Experiment 1, we found that same-different judgments between 3-D multipart objects were more accurate when the spatial role that each part played was irrelevant than when it was relevant. This finding is opposite to those obtained in previous studies where 2-D patterns were used (Goldstone, 1994; Goldstone & Medin, 1994; Proctor & Healy, 1985). In addition to the difference in stimulus materials used, there are a number of other differences that may have contributed to the inconsistency. In Proctor and Healy’s study, for instance, the letter strings receiving "same" responses in the order-relevant condition each had three identical letters occupying identical locations within each string. As such, they can be coded as 3 MIPs objects. The letter strings in the order-relevant condition receiving "different" responses were mostly (83%) a rearrangement of the letter strings receiving "same" responses, and the rearrangement involved swapping the positions of two letters (e.g., ABC vs. BAC) while the third remained unchanged. In this way, these letter strings can be coded as 2 MOPs/1 MIP. A small portion of the letter strings (17%) receiving a "different" response involved a replacement of only one of the three letters in each string while the other two letters remained unchanged. As such, these letter strings can be coded as 2 MIPs. The letter strings were all presented in the same orientation. Half of the letter strings received "same" responses in the order-irrelevant condition, and most of them (83%) were rearranged pairs as in the order-relevant condition, and thus can be coded as 2 MOPs/1 MIP; only a small portion were 3 MIPs. The other half were replacement string pairs (i.e., 2 MIPs) that would be classified "different." As in the order-relevant condition, all letter strings were presented in the same orientation.

It may not be surprising that the performance of Proctor and Healy’s participants in the RR condition should be better than those in the RI condition, according to the hypothesis we proposed for the present study. There was a much higher percentage of "same" responses in the RR condition that could be accurately based on 2-D images than in the RI condition. Furthermore, there was a much higher percentage of "different" responses in the RI condition that could not be made until part-level processing had been completed (i.e., after the constituent letters were segmented). The former may have contributed to better performance in RR than in RI, whereas the latter may have contributed to worse performance in the RI than in the RR condition. As such, given the materials used, participants could often use a fast image-based process with the RR, but not RI, instructions. A similar account can be applied to Goldstone and Medin’s (1994) results.

In Experiment 2 we sought to establish whether it would be possible, with 3-D connected objects, to make the bound matching easier than the unbound matching, as previous studies have demonstrated. To this end, simultaneous rather than sequential presentation was used in Experiment 2 in order to reduce the adverse impact that memory loss may have on computing representations at higher levels. Furthermore, a set of longer deadlines was used in the hope that the additional time would help elevate performance in the RR condition.

Method

Participants. Fifty-eight undergraduate students from Indiana University served as participants in order to fulfill a course requirement. Twenty-eight and thirty participants were in the role-irrelevant and role-relevant groups, respectively.

Procedure. The materials were identical to those used in Experiment 1. The procedure was also identical except for the differences described here. Rather than presenting the two objects to be compared successively as in Experiment 1, the two objects were presented at the same time, one randomly positioned on the left side of the screen and the other randomly positioned on the right. The stimuli were displayed for one of three deadlines: 900, 1400, or 1900 msec. Responses were considered overtime unless they were made while the objects were displayed or within 500 msec of the offset of the objects.

In Experiment 1, same and different orientation trials were intermixed. In Experiment 2, the experiment was divided into two stages for same and different orientation trials. The order of these two stages was randomized. Within each stage, there were 12 (the base set of trials) X 3 (deadlines) X 4 (blocks) = 144 trials.

Design. The design of the experiment was identical to Experiment 1. The main dependent variable was response accuracy and the main independent variables were deadline (3 levels), trial type (4 levels), orientation (2 levels), block (4 levels), and instructions (2 levels). These independent variables were factorially combined in a within-participant (except for the between-participants factor of instructions) fashion.

Results

Overall analyses. As in Experiment 1, overtime responses were not included in the analyses, which accounted for an average data loss of 7.92%. The mean accuracy for each condition was submitted to a 2 (group: RR vs. RI) X 4 (block) X 4 (type) X 2 (orientation: same or different) X 3 (deadline: 900, 1400, 1900) mixed analysis of variance (ANOVA). The main effect of group was not reliable, F < 1. As can be seen in Figure 7, participants in the RI group (M = .73) performed as well as those in the RR group (M = .73). The main effect of orientation and its interaction with group were both reliable, F(1, 56) = 61.71, MSE = .206, p < .0001, and F(1, 56) = 47.03, MSE = .206, p < .0001, respectively. The main effect of block was also reliable, F(3, 168) = 31.31, MSE = .08, p < .0001. The main effect of type and its interaction with group were reliable, F(3, 168) = 52.49, MSE = .169, p < .0001, and F(3, 168) = 12.45, MSE = .169, p < .0001, respectively. The two-way interaction of orientation and type as well as the three-way interaction of orientation, type, and group were reliable, F(3, 168) = 9.71, MSE = .117, p < .0001, and F(3, 168) = 42.29, MSE = .117, p < .0001, respectively. The interaction of block and type was also reliable, F(9, 504) = 2.04, MSE = .077, p = .034. Finally, the main effect of deadline and the three-way interaction of orientation, type, and deadline were reliable, F(2, 112) = 37.86, MSE = .076, p < .0001, and F(6, 336) = 2.40, MSE = .067, p = .028. No other main effects or interactions were found reliable, F's < 1 or p's > .11. Given that quite a large number of the reliable results involved an interaction with the group factor, and given the purpose of the experiment, we again felt justified in performing further analyses separately for the RR and RI groups to reveal the nature of the above findings.

---------------------------------

Insert Figure 7 about here

---------------------------------

Role-irrelevant group. The main effect of block was reliable, F(3, 81) = 12.91, MSE = .08, p < .0001, reflecting a general trend of improvement as subjects became more experienced in the task (the mean rate of accuracy was .67, .74, .75, .76 for blocks 1 to 4 respectively). The main effect of type was also found reliable, F(3, 81) = 29.37, MSE = .18, p < .0001. Further analyses reveal that, as shown in the left panel of Figure 7, subjects performed best when the to-be-matched stimuli were identical to each other (i.e., the 2 MIPs) (M = .86), followed by stimuli with 2 matched features that were out of place (2 MOPs) (M = .72), and worst with those either of 1 MIP (M = .66) or of 1 MOP (M = .68). The contrasts between the 2 MIPs and the other three were all reliable, F(1, 27) = 75.58, p < .0001, F(1, 27) = 51.59, p < .0001, and F(1, 27) = 54.70, p < .0001, for 2 MOPs, 1 MIP and 1 MOP, respectively. The difference between the 2 MOPs and 1 MIP, however, was marginally reliable, F(1, 27) = 4.17, p = .05, and the difference between 2 MOPs and 1 MOP failed to reach significance, F(1, 27) = 2.28, p = .14. Finally, the difference between 1 MIP and 1 MOP also failed to be reliable, F(1, 27) = 1.51, p = .23. The interaction between block and type was reliable, F(9, 243) = 2.92, p = .0026. As shown in Figure 8, participants' performance on the 2 MIPs trials hit a ceiling early on (i.e., first block), while performance on the other three types of stimuli did not reach asymptote until the second block.

---------------------------------

Insert Figure 8 about here

---------------------------------

As in Experiment 1, the main effect of orientation was not reliable, F < 1. Its interaction with type, however, was reliable, F(3, 81) = 35.35, MSE = .075, p < .0001. As shown in the left panel of Figure 9, participants did better at judging 2 MIPs figures when the objects were presented in the same orientation (M = .92) than when they were presented in different orientations (M = .80), F(1, 27) = 40.63, p < .0001. A reverse pattern was found, however, when subjects were judging 2 MOPs figures, in that they did better when the figures were presented in different orientations (M = .80) than when they were presented in the same orientation (M = .64), F(1, 27) = 37.14, p < .0001. RI subjects did equally poorly judging 1 MIP figures, regardless of whether they were shown in the same or different orientations (both M's = .66), F < 1, and they again did better judging 1 MOP figures when they were shown in the same orientation (M = .73) than in different orientations (M = .64), F(1, 27) = 17.16, p = .0003. The pattern of interaction between orientation and type remained more or less the same across the four blocks of trials; the patterns were not identical, however, yielding a reliable three-way interaction of orientation, type, and block for the RI group, F(9, 243) = 2.12, MSE = .064, p = .028, as shown in Figure 10.

--------------------------------------------

Insert Figures 9 and 10 about here

-------------------------------------------

Finally, again as in Experiment 1, the main effect of deadline was reliable, F(2, 54) = 26.81, MSE = .002, p < .0001, reflecting the fact that participants made their judgments more accurately as they had more time to make them (M’s = .68, .75, and .76 for deadlines of 900 ms, 1400 ms, and 1900 ms, respectively). Post hoc comparisons between the performances under the three deadlines reveal that participants’ performance asymptoted at the deadline of 1400 ms; performance increased significantly from the 900 ms to the 1400 ms deadline, F(1, 27) = 47.44, p < .0001, but failed to increase further from 1400 ms to 1900 ms, F < 1.

No other main effects or interactions were found reliable, F’s < 1 or p’s > .10.

Role-relevant group. The pattern of results was less complicated for subjects in the role-relevant condition. As for the RI group, the main effect of block was reliable, F(3, 87) = 19.02, MSE = .003, p < .0001, reflecting, again, the general trend of performance improvement across blocks, from Block 1 to 2 in particular. The mean accuracies were .66, .73, .75, and .77 for Blocks 1 to 4. The main effect of type was reliable, F(3, 87) = 35.91, MSE = .007, p < .0001. As shown in the right panel of Figure 7, participants' performance was best with 2 MIPs (M = .82), followed by 1 MOP (M = .79), and their performance was equally poor for 2 MOPs (M = .65) and 1 MIP (M = .65). The main effect of orientation was highly reliable, F(1, 29) = 89.56, MSE = .26, p < .0001, indicating that participants in general performed better with objects presented in the same orientation (M = .82) than in different orientations (M = .64). The interaction between orientation and type was also reliable, F(3, 87) = 22.12, MSE = .16, p < .0001. As shown in the right panel of Figure 9, the superior performance for the same orientation held for all types of objects except 1 MIP, in which case orientation did not have a discernible impact on participants' performance: F(1, 29) = 52.50, p < .001, for 2 MIPs; F(1, 29) = 70.03, p < .0001, for 2 MOPs; F(1, 29) = 1.33, p = .26, for 1 MIP; and F(1, 29) = 22.88, p < .0001, for 1 MOP, respectively.

Finally the main effect of deadline was highly reliable, F(2, 58) = 13.64, MSE = .037, p < .0001, reflecting the trend that as the deadline increased, participants’ accuracy increased. The mean accuracies for the 3 deadlines, from shortest to longest, were .69, .75, and .74. As with the RI group, RR participants’ performance asymptoted at the deadline of 1400 ms; their performance increased significantly from 900 ms to 1400 ms, F(1, 29) = 27.74, p < .0001, but failed to increase further from 1400 ms to 1900 ms, F < 1.

No other main effects or interactions were reliable, F’s < 1 or p’s > .09.

Cross-group comparisons. Participants in the RR and RI groups made identical responses (responding "same" or "different") to figures with 2 MIPs, 1 MIP, and 1 MOP, which allows us to compare performance across the two groups. Analyses revealed that RR participants performed marginally worse than their RI counterparts with 2 MIPs figures (M’s = .82 vs. .86), F(1, 56) = 3.92, MSE = .006, p = .053. The two groups performed equally poorly with 1 MIP figures (M’s = .65 vs. .66), F < 1, and finally, the RR group performed better than the RI group with 1 MOP figures (M’s = .79 vs. .68), F(1, 56) = 12.09, MSE = .014, p = .001.

Discussion

The results of Experiment 2 again confirm our proposal that three levels of representation may be computed and used for matching multipart 3-D objects. Specifically, participants in the RR group can rely upon image-based representations for judging objects presented in the same orientation, but they need full structural descriptions for judging objects presented in different orientations. This conjecture is supported by the following set of findings. First, misalignment in orientation considerably hurt performance on 2 MOPs displays for the RR group: although the right parts are present, a prolonged process of identifying parts and binding them to their spatial roles is required before responding "different." In contrast, misalignment actually helped the RI group, because participants no longer needed to be tied to the image-based strategy, which would lead to a "different" response when RI participants should say "same." Second, the difference between MIPs and MOPs (e.g., 1 MIP vs. 1 MOP) mattered far more for the RR group than for the RI group. Third, performance in judging 2 MOPs presented in the same orientation was actually better for the RR group than for the RI group, because the RR group could use the image-based representation to respond "different" whereas the RI group could not use it to respond "same." Finally, the difference between MIPs and MOPs (e.g., 1 MIP vs. 1 MOP) shrank when orientations were misaligned, suggesting that binding is no longer provided by image-based representations.

We lengthened the deadlines for responses and blocked the presentation of stimuli by orientation in Experiment 2 in the hope that those measures would help to elevate the overall performance of the RR participants. It is interesting to note that those measures had differential impacts on the RI and RR participants. For the RR participants, longer deadlines and blocked presentations helped, in comparison to Experiment 1, to elevate performance for all four types of stimuli, with the largest increment occurring for the 2 MOPs displays, followed by 2 MIPs, 1 MOP, and 1 MIP, respectively. As a consequence, the RR participants performed at essentially the same level as the RI participants. For the RI participants, lengthening the response deadlines and blocking the presentation of orientation had relatively little impact on overall performance. Except for some improvement in judging 2 MIPs, RI participants’ performance with the other types of stimuli remained at the same level as in Experiment 1.

When we further compared the interactions between type and orientation for the RI and RR groups across the two experiments, we noted that the patterns were very similar. For the RI group, for example, the pattern of interaction between stimulus type and orientation was essentially identical, except that the difference between same and different orientations was enlarged for 2 MIPs and for 2 MOPs (see Figures 3 and 7). The enlarged difference was due to higher performance with 2 MIPs presented in the same orientation and higher performance with 2 MOPs presented in different orientations. For the RR group, in addition to an overall improvement in performance, there was an obvious elevation for 2 MIPs in either the same or different orientations, for 2 MOPs presented in the same orientation, and for 1 MOP presented in the same orientation. In contrast, performance for most stimuli presented in different orientations did not show substantial improvement.

It is interesting to speculate on the implications of the foregoing comparisons for our proposal of three levels of representation. First, for the RI participants, when the stimuli to be compared were presented simultaneously in the same orientation, image-based processing should be facilitated. The facilitation associated with the image comparison process was apparently helpful for judging 2 MIPs, because those judgments can be made solely on the basis of the outputs at the image level. This facilitation resulted in an obvious elevation of RI participants’ performance with 2 MIPs. In contrast, the image-level output was not as helpful when the RI participants were judging the 2 MOPs displays, because the image level would signal or prime a "different" response, which would compete against the correct "same" response. As a consequence, the difference in judging 2 MIPs and 2 MOPs was exaggerated by longer deadlines and blocked presentation. The difference in judging 2 MIPs and 2 MOPs presented in different orientations was diminished by the same measures. Note, however, that this was also the case when shorter deadlines and mixed presentations of orientations were used in Experiment 1. In summary, then, the measures we took in Experiment 2 had the effect of enlarging the performance differences between some stimuli while diminishing those between others.

For the RR participants, the impact of longer deadlines and blocked orientation presentation was different. Compared to Experiment 1, participants’ performances on all types of stimuli presented in the same orientation were enhanced to various degrees. The largest enhancement occurred for 2 MOPs, followed by 2 MIPs, 1 MOP, and 1 MIP. This pattern of results suggests that with longer deadlines, participants were better able to use their image-level outputs when objects were shown in the same orientations. The image-level representations are more dependable when evaluating 2 MIPs for "same" responses, and 2 MOPs and 1 MOP for "different" responses, than when evaluating 1 MIP for "different" responses. The inherent conflict in judging 1 MIP stimuli, caused by the contradiction between shared and nonshared parts, continued to have an impact even when there was more time for comparison.

A number of factors may have contributed to a relatively large elevation in accuracy for judging 2 MOPs. First, the blocked presentation of orientation could have offered participants a clear cue that all stimuli were presented at the same orientation half the time. Under that condition, participants could rely upon the image-level representations without reservation. Additionally, unlike 1 MIP stimuli, 2 MOPs stimuli would usually give rise to "different" responses when comparing images. Therefore it is not surprising that 2 MOPs should enjoy a large increment in performance. The ordering among the four types of objects remained the same when they were presented at orientations different from each other. Overall, participants’ performances with 2 MIPs and 1 MOP enjoyed a greater increase than their performances with either 1 MIP or 2 MOPs. Participants’ performances with 1 MIP and 2 MOPs remained essentially at the same level as in Experiment 1. Apparently longer deadlines and blocked presentation did not ease the difficulty in judging those stimuli.

In conclusion, by lengthening deadlines and blocking the presentation of orientation, we were able to raise the performance of the RR participants to a level that was comparable to that of the RI participants. Note, however, this is still not the same as the finding from previous studies (e.g., Proctor & Healy, 1985; Goldstone & Medin, 1994) which had demonstrated that, in general, judgments based on the RR rule were easier to make than those based on the RI rule. One of the culprits responsible for the discrepancy was the severe difficulty that RR participants had in judging 2 MOPs stimuli presented in disparate orientations. This is a condition that clearly requires the computation of a full structural description in which the components of a multipart object and the spatial role that each part plays are specified. The computation of full structural descriptions apparently was disrupted by the angular displacement between the objects.

Experiment 3

In Experiments 1 and 2 we employed deadlines of fixed duration in an attempt to determine the time needed for performance in the RR group to match that in the RI group. In Experiment 3, rather than using fixed durations, the time participants had to process the displayed stimuli varied depending upon the success or failure of their responses. That is, we used a titration procedure, so that the stimulus deadline on any given trial was contingent upon participants’ performance on previous trials. By keeping overall performance at a given level, we were able to locate the duration required for the RR participants to yield performance equivalent to that of the RI participants.

Method

Participants. Eighty-five undergraduate students from Indiana University served as participants in order to fulfill a course requirement. Forty-three and forty-two participants were in the role-irrelevant and role-relevant groups, respectively.

Procedure. The materials were identical to those used in Experiments 1 and 2. The individual trials were identical to those in Experiment 2. The main difference between Experiments 2 and 3 was the method for determining the trial deadline. Rather than using a fixed deadline, the trial deadline varied from trial to trial as a function of participants’ accuracy. The trial duration started at 1400 msec for all participants. Whenever they made three correct responses (not necessarily consecutively), the trial duration was decreased by 25 msec. Whenever they made an incorrect response, the trial duration was increased by 25 msec. By this technique, the durations were modified online so as to increase or decrease the task’s difficulty. Assuming that practice effects are negligible and that insufficient time is the only reason for incorrect responses, after many trials the trial duration will be customized to the individual participant, producing an accuracy of about 75%. Even if these assumptions are invalid, the final trial deadline reflects participants’ facility with a particular type of display.
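The adjustment rule amounts to a non-consecutive 3-down/1-up staircase, and its 75% target follows from a simple equilibrium argument: the deadline steps down once per three correct responses and up once per error, so it stabilizes where p/3 = 1 − p, that is, p = .75. The following simulation sketch illustrates this (it is our own illustration, not the original experimental code; the logistic observer model and all of its parameter values are assumptions):

```python
import math
import random

def simulate_staircase(n_trials=20000, start=1400.0, step=25.0,
                       threshold=1400.0, slope=0.01, seed=1):
    """Non-consecutive 3-down/1-up staircase of the kind described above.

    The deadline decreases 25 ms after every third correct response
    (consecutive or not) and increases 25 ms after each error.  At
    equilibrium the down-rate matches the up-rate, p/3 = 1 - p, so the
    deadline settles where the observer's accuracy p equals .75.
    """
    rng = random.Random(seed)
    deadline, n_correct = start, 0
    for _ in range(n_trials):
        # Hypothetical observer: accuracy rises with the deadline, from
        # .5 (guessing) toward 1.0, crossing .75 exactly at `threshold`.
        p = 0.5 + 0.5 / (1.0 + math.exp(-slope * (deadline - threshold)))
        if rng.random() < p:              # correct response
            n_correct += 1
            if n_correct == 3:            # third correct: make task harder
                deadline -= step
                n_correct = 0
        else:                             # error: make task easier
            deadline += step
    return deadline
```

Run over many trials, the final deadline hovers near the duration at which this simulated observer is 75% accurate, which is why the final deadline can serve as a measure of a participant's facility with a given type of display.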

As with Experiment 2, the experiment was broken down into two stages, one for same orientation trials and one for different orientation trials. Within these stages, the set of four trial types was randomly intermixed. Participants saw 96 trials (the basic set of 12 trials repeated 8 times) in each of the two stages.

Participants were given 12 practice trials in an effort to increase their response accuracy. During these 12 trials, participants received explanations on all of their error trials.

Design. The primary dependent variable was the final trial duration achieved by a participant. This value reflects the duration required for a participant to obtain an accuracy of roughly 75% by the end of a stage of the experiment. A separate final deadline was obtained for the same and different orientation stages of the experiment. The independent variables were orientation (2 levels, manipulated within each participant) and instructions (manipulated across participants).

Results and Discussion

The titration procedure controlled for the overall performance of the RR and RI groups. Hence the data analysis entails a single 2 (group: RI vs. RR) X 2 (orientation: same vs. different) mixed analysis of variance. Note also that our method did not strictly follow the up-down procedure introduced by Levitt (1971). By Levitt’s procedure, three consecutive correct responses are required for the deadline to be decreased; here we allowed a decrement in exposure duration whenever three correct responses, consecutive or not, occurred. However, when we restricted the analyses to changes in exposure duration that resulted from consecutive correct responses, essentially the same results were found as when the analyses were not so limited. We report only the findings from the latter below.

As shown in Figure 11, the main effect of group failed to reach significance, F < 1, indicating that as RR and RI participants approached the same level of performance, the durations they needed to process the stimulus displays were about the same (M = 1369.34 ms for the RR group and M = 1385.40 ms for the RI group). The main effect of orientation and its interaction with group were both reliable, F(1, 83) = 53.94, MSE = 10174.41, p < .0001, and F(1, 83) = 38.88, p < .0001, respectively. It is not surprising that, in general, stimuli shown in the same orientations (M = 1321.21 ms) required less time to judge than those shown in different orientations (M = 1433.72 ms). Further analyses of the interaction showed that, when judging stimuli presented in the same orientation, RR participants required less time to reach the criterion performance (M = 1264.27 ms) than did RI participants (M = 1376.82 ms), F(1, 83) = 13.54, p < .0001. When judging stimuli presented in different orientations, however, RR participants needed more time (M = 1474.40 ms) than RI participants (M = 1393.98 ms) to reach criterion performance, F(1, 83) = 5.46, p = .022. These findings are quite consistent with the previous experiments, where we found evidence that it was more difficult for RI participants, relative to RR participants, to deal with stimulus objects presented in the same orientation, because only the RR participants could take advantage of an image-based processing strategy. For example, because displays with 2 MOPs required a "same" response from RI participants, they could not simply use overall object identity even in the same orientation condition. In contrast, it was more difficult, sometimes extremely troublesome, for RR participants to deal with stimulus objects presented in different orientations. Simply identifying matching object parts suffices for RI participants to respond correctly, but RR participants could not achieve high levels of accuracy without constructing complete structural descriptions in which object parts are identified and bound to their locations within their host object.

-------------------------------------

Insert Figure 11 about here

-------------------------------------

In summary, then, the results of Experiment 3 corroborate those obtained in the first two experiments, and as such lend further support to our hypothesis that three levels of representations may be computed to meet the requirement of different combinations of types of stimuli and judgment rule.

Experiment 4

The three-level representation hypothesis examined in Experiments 1 and 2 assumes not only that full structural descriptions are more difficult to compute, but also that they require more time. The dependent measure in those experiments was the accuracy with which participants made their same-different judgments, so whether the computations of the various levels of representation are differentially time-consuming is not entirely clear. The goal of Experiment 4 was to provide evidence bearing on this assumption about time consumption: we used response latency, rather than accuracy, as the main dependent measure. Using response latency as the dependent measure requires that the majority of participants’ responses be accurate, so that judgment speed can be unambiguously interpreted. We therefore increased the exposure time of the stimulus displays to 3 sec to ensure a relatively high level of accuracy.

Method

Participants. Twenty-four college students from the National Chung-Cheng University participated in the experiment to fulfill a course requirement. The participants were randomly assigned to two groups, half to the RR group and the other half to the RI group. All participants had normal or corrected-to-normal vision.

Procedure. The same materials used in the previous experiments were used in Experiment 4. On each trial, a pair of objects was shown simultaneously on the screen for 3 sec, following the offset of a centrally positioned fixation point ("+"). Participants were told to respond as quickly as they could, without sacrificing accuracy, as to whether the displayed objects were the same or different. Upon completion of a trial, the fixation point reappeared on the screen, signaling the start of the next trial. As in Experiment 2, the trials were divided into two blocks, each containing 144 trials resulting from combinations of stimulus type and angle of rotation in the picture plane. As before, in one block the compared objects were always presented in the same orientation, randomly selected from 0°, 120°, and 240°, and in the other block they were always presented in different orientations. The order of the two blocks was randomly determined for each participant, such that half received the same-orientation block first and half received the different-orientation block first.

An IBM-AT compatible PC, equipped with a NEC 3FG Multisync color monitor, was used for stimulus presentation and response recording.

Design. As in Experiment 2, a mixed design was adopted in Experiment 4. The rule of judgment (RI vs. RR) was a between-participants factor, and orientation and type of stimulus display were within-participants factors.

Results

The response latencies for each condition were first submitted to a 2 (group: RR vs. RI) X 2 (orientation) X 4 (type of stimulus display) mixed analysis of variance. On average, it took RR participants 1509.42 ms to make their judgments and RI participants 1611.13 ms to make theirs; the difference failed to reach significance, F(1, 22) = 1.53, p > .23. The main effect of orientation was reliable, F(1, 22) = 100.61, MSE = 43800.5, p < .0001; participants were faster in making judgments when stimulus objects were displayed in the same orientations (M = 1408.77 ms) than when displayed in different orientations (M = 1711.78 ms). Finally, participants made judgments with different speeds as a function of display type, F(3, 66) = 5.95, MSE = 13879.6, p < .005. Further analyses showed that participants needed more time to judge 2 MOPs (M = 1622.32 ms) than to judge the other three types of displays (M’s = 1539.09 ms for 2 MIPs, 1541.45 ms for 1 MIP, and 1542.22 ms for 1 MOP, respectively). There were no reliable differences among the latter three.

The interaction between group and orientation was reliable, F(1, 22) = 64.58, MSE = 43800.5, p < .005. Likewise, the interactions between orientation and stimulus type and between group and stimulus type were reliable, F(3, 66) = 5.61, MSE = 7592.70, p = .002, and F(3, 66) = 7.00, MSE = 13879.6, p < .0001, respectively. Finally, the three-way interaction of group, type, and orientation was reliable, F(3, 66) = 7.69, MSE = 7592.7, p < .0001. Given the significant two-way and three-way interactions, we broke down the analyses by group.

Role-irrelevant group. For the RI participants, as in Experiments 1 and 2, the main effect of type was reliable, F(3, 33) = 10.45, MSE = 11042.3, p < .001. As shown in the left panel of Figure 12, it took more time for participants to judge 2 MOPs displays (M = 1687.38 ms) than to judge 1 MOP (M = 1622.64 ms), 1 MIP (M = 1615.33 ms), and 2 MIPs (M = 1519.14 ms) displays. The main effect of orientation was also reliable, F(1, 11) = 7.74, MSE = 11806, p < .05. Stimulus objects presented in the same orientations required less time to judge (M = 1581.00 ms) than those presented in different orientations (M = 1641.25 ms). Finally, the interaction between type and orientation was reliable, F(3, 33) = 13.05, MSE = 4729.62, p < .001. As shown in the left panel of Figure 13, RI participants showed strong differences in the time taken to judge the four types of displays when the objects were shown in the same orientation, but a seemingly convergent pattern when the objects were shown in different orientations. Juxtaposing this pattern of results with the response accuracy data, shown in the top half of Table 1, we conclude that RI participants experienced great difficulty, both in terms of speed and accuracy, in making positive responses to 2 MOPs shown in the same orientation. In contrast, they had little difficulty making positive responses to 2 MIPs shown in the same orientation.

---------------------------------------------

Insert Figures 12 and 13 about here

---------------------------------------------

Role-relevant group. For the RR participants, as for the RI participants, the main effect of type was reliable, F(3, 33) = 3.85, MSE = 16717, p = .018. As shown in the right panel of Figure 12, RR participants were about equally quick in judging 1 MIP (M = 1467.55 ms) and 1 MOP (M = 1461.80 ms) displays, and about equally slow in judging 2 MIPs (M = 1551.03 ms) and 2 MOPs (M = 1557.25 ms) displays. The main effect of orientation was reliable as well, F(1, 11) = 93.31, MSE = 75795, p < .001, reflecting the fact that RR participants were much faster in judging objects displayed in the same orientations (M = 1235.53 ms) than in judging those displayed in disparate orientations (M = 1782.30 ms). Finally, the interaction between type and orientation was reliable, F(3, 33) = 3.57, MSE = 10455.8, p = .02. As shown in the right panel of Figure 13, while RR participants consistently responded faster to objects displayed in the same rather than different orientations, the difference between the same and different orientations was not identical across the four types of display. The impact of orientation was greater for 2 MIPs and 2 MOPs than for 1 MIP and 1 MOP.

Discussion

The goal of Experiment 4 was to collect response latency evidence bearing on the assumptions collateral to the three-level representation hypothesis examined in the present studies. Those assumptions, added as the hypothesis was elaborated to account for the results of Experiments 1 to 3, concern the speed with which each level of representation may be computed. We proposed that it should be relatively easy, and hence consume little time, to compute representations at the image level. This assumption was borne out by the finding that RR participants were invariably faster at judging all four types of display when objects were shown in the same rather than different orientations. This should not be surprising, given that for RR participants outputs at the image level are a sufficient source of information for making "same" and "different" judgments. In contrast, judging objects presented in different orientations was a different and more difficult task for RR participants. As we suggested in Experiments 1 and 2, RR participants could rely upon neither the representations computed at the image level nor those computed at the part level (as a result of parsing) when the compared objects were shown in different orientations. Rather, they needed to compute the full structural description of each object to ascertain whether the displayed objects shared the same parts and whether those parts played the same spatial roles. Alternatively, participants might judge a misaligned pair of objects not by computing structural descriptions, but by mentally rotating the two objects into alignment and then comparing the outputs at the image level. If this were the case, we should have witnessed an increment in response latency between the same and different orientations, and the increment should be of approximately the same magnitude for all types of display. Judging from the right panel of Figure 13, this appears to be largely, but not completely, the case. Participants may well have mentally rotated the objects into alignment and then compared image-level representations for 2 MOPs, 1 MIP, and 1 MOP displays. But this appears not to hold for 2 MIPs, for which the increment between the same and different orientations was much greater than for the rest of the displays. The resolution of this issue awaits further evidence.

Our assumption regarding the time requirements for computing various levels of representation was also borne out by the findings from RI participants. RI participants were very fast in judging 2 MIPs shown in the same orientation. In contrast, they were very slow in judging 2 MOPs presented in the same orientations. As we elaborated in Experiment 1, when presented in the same orientations, image resemblance may be a reliable sign for positive responses (e.g., 2 MIPs); however, dissimilarity in images is not a reliable sign for negative responses (e.g., 2 MOPs). Despite the high image-based dissimilarity associated with 2 MOPs, they should receive a "same" rather than a "different" response, and that positive response comes with the cost of computing representations at a higher level than the raw image (i.e., part-level). In general, any pair of objects presented in the same orientation that does not indicate a strong match at the image level would have to wait for the output at the part level. This appears to be the reason that response latency for 2 MOPs, 1 MIP and 1 MOP were all greater than that for 2 MIPs.

It is also interesting to note the disparate patterns of results associated with the four types of display presented in the same versus different orientations for the RI group. In particular, the convergence of response latencies for the four types of display in different orientations suggests that the objects were dealt with in a similar manner. Most likely, processing of the misaligned objects, regardless of the type of display, had to forgo the outputs from the image level and rely consistently upon representations computed at the level of part parsing.

Another implication stemming from the results from RI participants is that the conflict caused by the output responses from different levels of representation was much stronger in 2 MOPs displays presented in the same orientation than in other types of display. It is conceivable that while the image-level representation may strongly suggest a negative response for 2 MOPs, the RI participants would have to wait for the output at the part level to override the signal from image-level to produce a positive response.

In summary, the results of Experiment 4 offer support for our assumptions regarding the time course and temporal differences that were reflected in participants’ performance using accuracy as a dependent measure. These findings, taken together with those in the previous experiments, lend further credibility to the postulation of a three-level representation. Our hypothesis may suggest a way of reconciling disputes regarding internal representations for object recognition.

Experiment 5

A curious yet consistent finding throughout the four experiments reported thus far was that RR participants experienced tremendous difficulty, both in terms of response accuracy (Experiments 1 and 2) and response latency (Experiment 4), in arriving at the correct answer for 2 MOPs displayed in disparate orientations. We have suspected, and have collected evidence for, the assumption that for RR participants to make judgments about misaligned objects requires the computation of full structural descriptions, especially when the objects have identical parts (i.e., 2 MIPs and 2 MOPs). If that is the case, why should RR participants suffer more when judging 2 MOPs than when judging 2 MIPs? After all, structural descriptions should give rise to the same quality of representation for both types of display.

One possible answer may lie in the obligatory nature of binding caused by the connectedness between parts of an object (Saiki & Hummel, 1998; Palmer & Rock, 1994). Recall that the multipart objects used in the present study were all connected objects rendered with 3-D graphics software. Each object was depicted such that every part was physically and explicitly connected to at least one other part of the same object (cf. Saiki & Hummel, 1996, 1998). The fact that those objects were clearly connected may have had the inadvertent effect of making the computation of structural descriptions far more difficult for them than for, say, the 2-D patterns or unconnected letter strings used in previous studies (e.g., Goldstone & Medin, 1994; Proctor & Healy, 1984). In order to achieve a full structural description, the visual system must first decompose the whole object into its constituent components (Biederman, 1987; Hummel & Biederman, 1992; Marr & Nishihara, 1978). According to both Marr (1982) and Biederman (1987), parsing can be done by segmenting an object at paired and matched deep concavities on its bounding contour (see also Hoffman & Richards, 1985). The objects we used readily conform to such a scheme and can be parsed into three different parts (a truncated cone and two smaller components). Even so, computing a full structural description is a difficult and most likely attention-demanding task (Treisman, 1988, 1993; Biederman, 1987; Hummel & Biederman, 1992; Hummel & Stankiewicz, 1996, 1998; Shyi & Cheng, 1996). Parsing (object segmentation) does not come for free, and neither does the subsequent structural description. Note that parsing may be an effortful act precisely because it asks the visual system to break the displayed object into constituents, which runs against the fact that the object is connected. Furthermore, the result of parsing may have to be maintained and stored temporarily, perhaps in visuospatial working memory (Jonides & Smith, 1997; Shah & Miyake, 1996), before the visual system can continue the computation of the structural description. In Experiment 5, we set out to explore this hypothesis by comparing participants' performance with displayed objects that were either connected or disconnected. In particular, we wished to know whether the RR participants' poor performance with 2 MOPs would diminish when the parts of the displayed objects were no longer physically connected.

Method

Participants. Fifty-seven college students from the National Chung-Cheng University participated in the experiment to fulfill a course requirement. Twenty-eight participants were randomly assigned to the connected group; 13 and 15 of them were further assigned to the RI and RR conditions, respectively. The remaining 29 participants were randomly assigned to the disconnected group; of them, 13 were assigned to the RI condition and 16 to the RR condition. All had normal or corrected-to-normal vision.

Materials and apparatus. For the connected group, the stimulus materials were identical to those used in the previous experiments. For the disconnected group, the stimulus objects were the same as those used for the connected group except that the parts of each object were disconnected from one another (see Figure 14 for an illustration). The parts in the disconnected objects retained the same spatial relationships as those in the connected objects. The disconnections were created by moving a part outward along its main axis away from the base cone. As a result, each disconnected part had on average a gap of .3 cm, subtending a visual angle of about .29 degrees, between its nearest contour and that of the base cone.
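The reported gap size and visual angle are mutually consistent under a viewing distance of roughly 59 cm; that distance is not stated in this passage and is inferred here purely for illustration. The standard visual-angle formula can be checked directly:

```python
import math

def visual_angle_deg(size_cm, distance_cm):
    """Visual angle (in degrees) subtended by an object of `size_cm`
    viewed from `distance_cm`: 2 * atan(size / (2 * distance))."""
    return math.degrees(2.0 * math.atan(size_cm / (2.0 * distance_cm)))

# A .3 cm gap viewed from ~59 cm (an assumed distance) subtends about
# .29 degrees, matching the values reported for the disconnected stimuli.
gap_deg = visual_angle_deg(0.3, 59.0)
```

As a sanity check on the formula, an object of 1 cm at 57.3 cm subtends almost exactly 1 degree, the familiar rule of thumb.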

-----------------------------------

Insert Figure 14 about here

-----------------------------------

The same apparatus for stimulus presentation and response recording used in Experiment 4 was used here.

Design and procedure. As in Experiment 4, a mixed design was adopted in Experiment 5. The rule of judgment (RI vs. RR) and connectedness (connected vs. disconnected) were between-participant factors, while deadline, orientation, and type of stimulus display were within-participant factors. The procedure was almost identical to that of Experiment 2 except for the following changes: First, we lengthened the exposure times to 1 s and 2 s so that participants' performance with 2 MOPs could be above chance level. Second, participants were given a deadline of 500 ms to respond after the offset of the stimulus display. At stimulus offset, a beep was delivered to alert participants to give their responses. In all, participants had either 1500 ms or 2500 ms to respond before reaching the deadline. As before, the word "Overtime" appeared on the display screen if a participant failed to respond within the time limit. There were 288 trials for both the connected and the disconnected group.

Results and Discussion

The connected group was basically a replication of Experiment 2 with lengthened deadlines; the more interesting results come from the disconnected group and its comparison with the connected group. In what follows, we first report the findings of an overall analysis, followed by those for the connected and disconnected groups respectively.

The overall analysis. The mean accuracy for each condition was first submitted to a 2 (connectedness) X 2 (group) X 2 (deadline) X 2 (orientation) X 4 (type of display) mixed ANOVA, with the first two variables as between-participants factors and the remaining three as within-participants factors. The main effect of connectedness was reliable, F(1, 53) = 7.57, MSE = .047, p = .008, indicating that in general participants in the disconnected group (M = .92) performed somewhat better than those in the connected group (M = .88). The main effect of group was not reliable, however, F(1, 53) = 1.58, p = .21, indicating that RR participants (M = .91) performed at about the same level as their RI counterparts (M = .89). The interaction between connectedness and group also failed to be reliable, F < 1.

As in the previous experiments, the main effect of deadline was reliable, F(1, 53) = 111.82, MSE = .014, p < .0001. Participants did much better with the longer deadline (2 s; M = .94) than with the shorter deadline (1 s; M = .86). The main effect of orientation was also reliable, F(1, 53) = 12.64, MSE = .015, p = .0008, indicating that, as before, participants were better at judging objects presented in the same orientation (M = .92) than in different orientations (M = .89). Finally, the main effect of type was reliable, F(3, 159) = 19.17, MSE = .020, p < .0001; participants performed best with 2 MIPs (M = .94), followed by 1 MOP (M = .93), 1 MIP (M = .90), and worst with 2 MOPs (M = .85).

The 2-way interaction of group and orientation was reliable, F(1, 53) = 26.10, MSE = .015, p < .0001. Orientation did not matter much for the RI participants (M's = .89 and .90 for same and different orientations, respectively), but it did matter for the RR participants, in that the same orientation led to better performance (M = .95) than different orientations did (M = .88). The interaction of deadline and group was only marginally reliable, F(1, 53) = 3.85, p = .055, in that the impact of deadline for RI participants (M's = .84 and .94 for short and long deadlines, respectively) tended to be slightly larger than for the RR participants (M's = .88 and .95, respectively). The interaction between type and connectedness was also reliable, leading to two different patterns for judging the four types of displays between the connected and disconnected groups. As shown in Figure 15, participants in the connected group exhibited a pattern very similar to that found in Experiment 2--they did best judging 2 MIPs, followed by 1 MOP, 1 MIP, and did worst judging 2 MOPs. Although the disconnected group exhibited a similar pattern in terms of the order of absolute performance, the differences among the four types of displays were greatly reduced. Comparing across the two levels of connectedness, it seems that participants' performance with 2 MOPs benefited most from disconnecting the attached parts from the base cone, while their performance with other types of display either remained the same or changed only slightly. The interaction between type and group was also reliable, F(3, 159) = 5.25, MSE = .020, p = .0018. As shown in Figure 16, there was no overall difference between RR and RI; however, the order among the four types of display was not identical. For RI, performance was best with 2 MIPs (M = .94), followed by 1 MOP (M = .89) and 1 MIP (M = .89), and worst with 2 MOPs (M = .86). For RR, participants performed best with 1 MOP (M = .96), followed by 2 MIPs (M = .93), 1 MIP (M = .91), and worst with 2 MOPs (M = .84).

-------------------------------------------

Insert Figures 15 & 16 about here

-------------------------------------------

The two-way interaction between deadline and type was reliable, F(3, 159) = 5.08, MSE = .008, p = .0022. As shown in Figure 17, lengthening the deadline from 1 to 2 seconds helped 2 MOPs more than other types of displays, replicating what was found in Experiment 2. Finally, the interaction between orientation and type was also reliable, F(3, 159) = 3.38, MSE = .013, p = .012. As shown in Figure 18, participants' performance with 2 MIPs was hurt most when objects were presented out of alignment, whereas the impact of misalignment was minimal for other types of displays.

--------------------------------------------

Insert Figures 17 & 18 about here

--------------------------------------------

A number of higher order interactions were also reliable, including the 3-way interaction of deadline, type, and connectedness, F(3, 159) = 3.29, MSE = .008, p = .022, that of deadline, type, and group, F(3, 159) = 3.34, MSE = .008, p = .021, and that of orientation, type, and group, F(3, 159) = 20.93, MSE = .013, p < .0001, and finally the 4-way interaction of connectedness, orientation, type, and group, F(3, 159) = 6.47, MSE = .013, p = .0004. Given the last set of higher order interactions, and to illustrate how participants performed with connected and disconnected objects, we felt justified in further breaking down our analyses in terms of connectedness as follows.

The connected group. The mean response accuracy for each condition was first submitted to a 2 (group) X 2 (deadline) X 2 (orientation) X 4 (type of display) mixed ANOVA. The main effect of group was not reliable, F < 1, indicating that RR participants (M = .89) performed as well as RI participants (M = .88). As before, the main effect of orientation was reliable, F(1, 29) = 9.05, MSE = .016, p = .005, indicating that participants were more accurate judging objects presented in the same orientation (M = .90) than in different orientations (M = .87). The main effect of deadline was also reliable, F(1, 29) = 72.66, reflecting the fact that, as in Experiments 1 and 2, with more time available participants performed better (M = .83 for 1 s and M = .93 for 2 s, respectively).

With regard to interaction effects, only the two-way interaction between group and orientation and that between deadline and type of display were reliable, F(1, 29) = 9.05, MSE = .016, p = .005, and F(3, 87) = 7.52, MSE = .001, p < .001, respectively. Finally, the three-way interaction between group, orientation, and type of display was reliable, F(3, 87) = 19.68, MSE = .017, p < .001. None of the other interactions reached significance, F's < 1 or p's > .15. Given that most of the interaction effects involved group (rule of judgment), we further broke down the analyses for the RI and the RR conditions separately.

For the RI condition, the main effect of orientation failed to reach significance, F(1, 15) = 1.52, p = .24. The interaction between orientation and type of display was reliable, however, F(3, 45) = 8.01, MSE = .017, p < .001. Further analyses, as shown in the upper left panel of Figure 19, revealed that the patterns for the four types of display presented in the same or different orientations were not identical. Again, response accuracy for objects shown in the same orientation exhibited a divergent pattern, while accuracy for objects presented in different orientations exhibited a convergent pattern. More specifically, while the same orientation increased accuracy in judging 2 MIPs, it decreased accuracy in judging 2 MOPs.

-----------------------------------

Insert Figure 19 about here

-----------------------------------

For the RR condition, objects presented in the same orientation were more accurately judged than those presented in different orientations, F(1, 14) = 30.78, MSE = .015, p < .001. Furthermore, the interaction between orientation and type of display was reliable, F(3, 42) = 12.48, MSE = .016, p < .001. As shown in the lower left panel of Figure 19, RR participants performed better judging 2 MIPs and 2 MOPs presented in the same orientation than in different orientations, presumably because structural descriptions are required for the different-orientation displays.

In summary, the results from the connected group exhibit a strikingly similar pattern to that obtained in Experiment 2. The lengthened deadlines used here resulted in an elevation of overall performance, but the patterns of results with respect to rule of judgment, orientation, and type of display remained virtually identical. These findings, together with those from the earlier experiments, speak to the robustness of the basic results and lend strong support to the three-level representation hypothesis proposed in the present study.

The disconnected group. The response accuracy for each condition was submitted to the same mixed ANOVA as before. As for the connected group, the main effect of deadline was reliable, F(1, 30) = 60.90, MSE = .032, p < .001, in that with more time available, participants made more accurate judgments (M = .89 for 1 s, and M = .96 for 2 s, respectively). Furthermore, participants made slightly more accurate judgments with objects presented in the same orientation (M = .94) than in different orientations (M = .91), F(1, 30) = 5.88, MSE = .013, p = .022. The main effect of type of display was also reliable, F(3, 90) = 3.22, MSE = .013, p = .026, in that participants were more accurate in judging 2 MIPs than 2 MOPs, t(30) = 3.31, p = .002, but no reliable differences existed among other paired comparisons. Finally, as for the connected group, the overall accuracy for the RR condition (M = .93) was virtually identical to that for the RI condition (M = .92), F(1, 30) = 1.69, p > .2. (See the two panels on the right in Figure 19.)

With regard to interaction effects, we found that the two-way interaction between rule of judgment and orientation was reliable, F(1, 30) = 9.60, MSE = .013, p = .004, as were those between rule of judgment and type of display, and between orientation and type of display, F(1, 30) = 5.82, MSE = .013, p = .002, and F(3, 90) = 6.64, MSE = .008, p < .001, respectively. Finally the 3-way interaction involving rule of judgment, orientation, and type of display reached significance, F(3, 90) = 4.19, MSE = .008, p = .008. The 3-way interaction involving rule, deadline, and type of display was only marginally significant, F(3, 90) = 2.63, p = .055. No other interaction effects were reliable, F's < 1 or p's > .08. As with the connected group, we break down further analyses in terms of rule of judgment.

For the RI condition, as in the connected group, the main effect of orientation was not reliable, F < 1; participants judged objects presented in the same orientation (M = .91) as accurately as those presented in different orientations (M = .92). The interaction between orientation and type of display was reliable, however, F(3, 45) = 11.06, MSE = .005, p < .001. As shown in the upper right panel of Figure 19, having the same orientation was beneficial for judging 2 MIPs, F(1, 15) = 12.53, p = .003. However, unlike the connected group, it made no difference whether 2 MOPs displays were judged with objects in the same or different orientations, F(1, 15) = 3.03, p = .10. The interaction between deadline and type of display was not reliable, F(3, 45) = 1.65, p = .19.

For the RR condition, the main effect of orientation and its interaction with type of display were both reliable, F(1, 15) = 13.12, MSE = .015, p = .003, and F(3, 45) = 3.16, MSE = .012, p = .034. Further analyses reveal that, as shown in the lower right panel of Figure 19, RR participants made reliably more accurate judgments for objects presented in the same rather than different orientations for 2 MIPs displays, F(1, 15) = 28.94, p < .001, but not for other types of display, including 2 MOPs.

Additional analyses. To highlight the contrast between connected and disconnected pairs of objects, we performed a few additional analyses for RI and RR participants separately. We will, however, restrict our report to effects not already revealed in the analyses described thus far. First, for the RI participants, the two-way interaction of orientation and type of display and the three-way interaction of orientation, type, and connectedness were both reliable, F(3, 72) = 12.05, MSE = .012, p < .0001, and F(3, 72) = 2.99, MSE = .012, p = .037, respectively. As shown in the upper panel of Figure 19, RI participants' performance with connected objects revealed a pattern highly similar to that in Experiment 2 (see the left panel of Figure 7), in particular a large difference between 2 MIPs and 2 MOPs in the same orientation and a much smaller difference between them in different orientations. With disconnected objects, shown in the upper right panel of the same figure, the differences among the various types of display were greatly reduced, especially that between 2 MIPs and 2 MOPs in the same orientation.

For the RR participants, the main effect of connectedness was not reliable, F(1, 29) = 2.96, p > .09, indicating that although RR subjects did slightly better judging disconnected objects (M = .94) than judging connected objects (M = .91), the difference was not reliable. As in the previous experiments, the main effects of orientation, type of display, and their interaction were all reliable, F(1, 29) = 29.71, MSE = .005, p < .0001, F(3, 87) = 10.18, MSE = .007, p < .0001, and F(3, 87) = 6.05, MSE = .004, p = .0009, respectively. Most notably, shown in the lower right panel of Figure 19, for the disconnected group, there was a diminished difference between 2 MIPs and 2 MOPs presented at different orientations.

In summary, the performance of the disconnected group, albeit exhibiting some similarity, was different from that of the connected group. Most importantly, the discrepancy between the two groups converges on the interaction between orientation and type of display across the two rules of judgment (RI vs. RR). For the connected group, the interaction was reliable for both RI and RR, and follow-up comparisons demonstrated that alignment of orientation (a) helped 2 MIPs but hurt 2 MOPs for RI, and (b) helped both 2 MIPs and 2 MOPs for RR. In particular, performance by the RR group on 2 MOPs displays was seriously compromised by misalignment. For the disconnected group, again the interaction between orientation and type of display was reliable for both RI and RR. Unlike their counterparts in the connected group, however, alignment in orientation helped 2 MIPs but did not hurt 2 MOPs for RI, and misalignment hurt 2 MIPs but not 2 MOPs for RR. These findings together replicated those reported in the earlier experiments. Furthermore, they lend support to the hypothesis that the reason RR participants performed so poorly with 2 MOPs presented in different orientations may have to do with the processing demand imposed by parsing connected objects into constituent components and the subsequent computation of full structural descriptions.

General Discussion

The goal of the present study was to provide evidence for the proposal of three tiers of representations to account for the recognition and discrimination of multipart 3D objects. Our hypothesis was examined in the context of a same-different matching task in which judgments were made about whether or not object parts, and their spatial relationships, were the same between pairs of objects presented in same or different orientations. Specifically, we proposed that 2-D images are easier and quicker to compute than part-based representations, which in turn are easier and quicker to compute than full structural descriptions. In what follows, we will first give a brief summary of the main findings from the series of experiments, followed by a discussion of the implications of our findings for a number of theoretical issues.

A Summary of Main Findings

In Experiments 1 and 2, we consistently found that participants performed best judging objects that not only shared their constituent components but whose components also occupied the same locations within each object (i.e., 2 MIPs). This was particularly so when 2 MIPs objects were shown in the same orientation, regardless of whether the judgments were based on the role-irrelevant (RI) or the role-relevant (RR) rule. This condition, we think, conforms to a situation where only representations at the 2-D image level were computed and used. The visual system is apparently rather efficient at constructing this level of representation and, as a result, was very good at performing judgments based on image-based representations.

Another consistent finding from the first two experiments has to do with performance with objects sharing their constituent components but having these components occupy different locations within each object (i.e., 2 MOPs). The RI participants did poorly when 2 MOPs were presented in the same rather than different orientations. The RR participants, in contrast, performed much better when 2 MOPs were shown in the same rather than different orientations. These findings together suggest that RI participants relied upon part-level representations--parsing an object into a collection of its constituent components--for judging 2 MOPs, whichever orientation they were presented in. Furthermore, the findings suggest that RR participants relied upon image-based representations for judging 2 MOPs presented in the same orientation and full structural descriptions for judging 2 MOPs shown in different orientations.

These findings were replicated in Experiment 4, in which response latency was used as a dependent measure, and in Experiment 5, in which the exposure times (deadlines) were lengthened, demonstrating the robustness of the findings and lending support to our three-level representation hypothesis. Most interestingly, we also found in Experiment 5 that when the connection between the parts of objects was removed, the performance of RI and RR participants with 2 MOPs was no longer hurt by misalignment of orientation. This final set of findings suggests that grouping the different parts of a connected object is a highly automatic and obligatory process, such that parsing the object into components, or segregating them from one another, requires an effortful mental act.

Viewpoint Constancy and Representations Underlying Object Recognition

One of the key problems that the visual system faces, and has to solve in order for us to see the world around us, concerns variation in retinal images. These variations are caused by differences in factors affecting the image formation process, such as illumination, the spectral composition of the light source and the surface reflectance of an object, and viewing distance, among others. For almost two decades, researchers have paid close attention to yet another image variation that the visual system has to deal with: the difference in projected shape in retinal images caused by disparate viewpoints of observation. Note that most, if not all, researchers have assumed that shape is by far the dominant route through which we recognize objects (Biederman, 1987; Biederman & Ju, 1988; Ullman, 1989, 1996). The visual system must have some mechanism to deal with shape variations arising from the same object without mistakenly registering rotated images of the same object as indexing different objects. For any familiar object, say the mug sitting beside the computer in your office, we have countless occasions to observe it from different points of view. While the images projected on the retina are not all the same, variations in projected images hardly prevent us from successfully recognizing that they point to the same mug. How can we provide an adequate and sufficient account for a feat that the visual system achieves so seamlessly?

There have been two general classes of answer to this question. One class, represented by the work of Marr and his colleagues two decades ago, is that the visual system will, after going through a series of processing stages, filter out variations in the projected image and arrive at a 3-D object model as the basis for object recognition. This 3-D object model has the desirable property of tolerating variations caused by viewpoint; put another way, the disparate projected images would always lead to the computation and activation of an identical 3-D object model, as long as they arise from the same object. Object recognition is achieved by matching the constructed 3-D model to one that has been stored in long-term memory. This proposal was later elaborated and expanded by Biederman and his colleagues in the RBC theory (Biederman, 1987) and subsequently the JIM model (Hummel & Biederman, 1992). The other answer to the question, best represented by the work of Tarr and his colleagues, is that the visual system does not have a centralized, single representation for all images arising from the same object. Rather, the system may encode and store numerous projected images of the same object corresponding to variations in viewpoint. Object recognition is achieved simply by matching the projected image, after some normalization process, with one that was stored in memory some time ago.

Evidence has accrued to support both positions. For the structural-description theories, espoused by Marr, Biederman, and Hummel, evidence has come from naming and priming studies suggesting that (a) there exists an intermediate-level representation corresponding to the constituent parts comprising an object (e.g., Biederman, 1987; Biederman & Cooper, 1991), (b) parsing is made at matched deep concavities of an object’s bounding contour (Biederman, 1987), and (c) a single structural description is all that is needed for object recognition as long as its construction is not obscured by drastic transformations that would lead to a different set of parts or components (Biederman & Gerhardstein, 1993, 1995). For the normalization theories, espoused by Tarr, Edelman, and Bülthoff, among others, evidence has come from studies in which participants learned the shapes of novel objects from a limited set of viewpoints and were then tested with the objects presented at either the previously learned, and hence familiar, viewpoints or novel viewpoints. The results from those studies (e.g., Tarr, 1995; Tarr & Pinker, 1989) suggest that participants have to go through some process of normalization (mental rotation, axis alignment, or interpolation between learned views) before they can provide an accurate response.

Most recently, researchers have gradually come to the conclusion that both kinds of representation, view-invariant and view-specific, may be needed in the course of object recognition, since it has not proved possible to conclusively refute the evidence for either position (Hummel & Stankiewicz, 1996; Tarr & Bülthoff, 1995, 1998). The more challenging issue is to specify the conditions under which each kind of representation may be computed and used, and this was the idea that motivated the present study.

We proposed that three levels of representation (i.e., 2-D images, part-level representations, and full structural descriptions) may be computed and used for the recognition of multipart 3-D objects. The results of the present study not only provide strong support for our hypothesis but also specify the conditions under which each level of representation may be needed. For instance, our image-level representation is quite consistent with the characterization of Tarr’s multiple-view theory, in that when objects are presented in the same orientation, fast image-level representations are used. This appears to be the case for both RI and RR participants when judging 2 MIPs. When judging 2 MOPs presented in the same orientation, on the other hand, RI participants may have relied upon a part-level representation because image-level representations could not provide an accurate answer. Finally, when judging 2 MIPs and 2 MOPs presented in different orientations, RR participants may have relied upon full structural descriptions. Alternatively, when objects are misaligned, a transformation like mental rotation may be executed to bring the objects into some form of coherence (Tarr, 1995; Tarr & Pinker, 1989). However, evidence is still needed to provide unequivocal support for such a claim.

The nature of image-level representations. It is not completely clear, however, what exactly constitutes an image-level representation. Tarr and his colleagues, for example, deny that each "view" of an object in their theory of multiple views corresponds to a template in a very literal sense (Tarr & Bülthoff, 1995, 1998). Yet they have so far failed to specify what kind of representation it is, if not a template. Interestingly, a number of neurophysiological studies have demonstrated that there are neurons in the inferior temporal lobe that are differentially sensitive to the profile of a face or an object viewed from different angles (e.g., Logothetis & Sheinberg, 1996; Perrett, Oram, Hietanen, & Benson, 1994). However, the tuning curves of these neurons are never very sharp, in that each neuron can respond to images projected from a range of viewing angles, even though its responses tend to peak at a specific point of view. Analogously, there are a handful of different cone types in the retina, each sensitive to a different range of the wavelength spectrum. These different wavelength-sensitive cones are sufficient to give rise to people’s tremendous capacity for color vision. We can see this kind of coarse coding throughout the various streams of the visual system registering different aspects of a stimulus, such as color, lightness, depth, and motion (Desimone & Duncan, 1995). It would not be surprising if there are only a handful of view-sensitive neurons, which nonetheless can give rise to our capability to recognize an object from a variety of viewpoints.
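The coarse-coding idea can be illustrated with a toy population code: a handful of broadly tuned "view" units, each peaking at a different viewing angle, jointly encode any intermediate view, which can then be recovered by a population-vector readout. The preferred angles, tuning width, and Gaussian tuning function below are illustrative assumptions, not parameters from any cited model.

```python
import math

PREFERRED = [0.0, 90.0, 180.0, 270.0]  # preferred views (deg); illustrative choice
SIGMA = 60.0  # broad tuning width (deg), per the coarse-coding argument

def response(preferred: float, view: float) -> float:
    """Gaussian tuning on the circle: graded response away from the peak view."""
    d = (view - preferred + 180.0) % 360.0 - 180.0  # signed angular difference
    return math.exp(-(d * d) / (2 * SIGMA * SIGMA))

def decode(view: float) -> float:
    """Population-vector readout: sum unit vectors weighted by each unit's response."""
    x = sum(response(p, view) * math.cos(math.radians(p)) for p in PREFERRED)
    y = sum(response(p, view) * math.sin(math.radians(p)) for p in PREFERRED)
    return math.degrees(math.atan2(y, x)) % 360.0
```

Even a view midway between two units' peaks (e.g., 45 degrees) is recovered accurately, because the broad tuning curves overlap; this is the sense in which a few view-sensitive neurons could support recognition across many viewpoints.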

For now, we tend to agree with Hummel and Stankiewicz’s (1996) proposal that the image level of representation is akin to the SSM they proposed, and can be used to mediate the priming effect for same-orientation objects presented at different locations in the frontoparallel plane or with transformations in size (Stankiewicz et al., 1998; Treisman & DeSchepper, 1996; see also Lawson & Humphreys, 1996). Our results and those from other researchers, however, are far from settling the issue, and its answer will have to await further studies.

The part-level representation and full structural description. Biederman and his colleagues have provided some evidence suggesting the existence of an intermediate level of representation that lies between the physical image and the full identity of an object (Biederman, 1987; Biederman & Cooper, 1991). Our results are also consistent with such a proposal. For example, in almost all our experiments we repeatedly observed that for RI participants it was always easier to judge 2 MOPs presented in different orientations rather than the same orientation. We suggest that the poorer performance with 2 MOPs at the same orientation may reflect that processing has to move up to the part-based level, because of the unreliability of the representation at the image level. This account presupposes the existence of part-level representation, which we think is reasonable in that neither an image-level representation nor an (over-computed) structural description, nor their combination, would suffice.

Finally, we think that it is also reasonable, and even necessary, to assume the existence of a representation implicating full structural descriptions. This is a level of representation that embodies a precise and complete specification of the parts comprising an object and the spatial role that each part has relative to other parts of the same object (or even relative to parts of another object) (Saiki & Hummel, 1996, 1998). The major piece of evidence we gathered to support the notion of full structural descriptions is the joint finding that RI participants were better at judging 2 MOPs with different orientations than with the same orientation, whereas the opposite was true for the RR participants. Furthermore, RR participants consistently exhibited poor performance with 2 MOPs at different orientations despite various manipulations attempting to elevate their performance. These findings suggest that (a) a level of representation different from that used by RI participants to judge 2 MOPs was required by RR participants to judge the same type of objects, and (b) this different level of representation may well be a structural description that carries information regarding both the parts and the spatial relations among them, information that meets the demands the task instructions imposed on the RR participants.

The foregoing characterization of the need for structural descriptions raises the general issue of exactly when structural descriptions are needed for the purpose of object recognition. In the past, theorists who held a favorable view of structural description theories have pointed out advantages that structural descriptions may have over template-matching or feature-list models of object recognition. However, it is not clear why the visual system would need to construct a full structural description in every instance of recognition. In particular, in terms of the multiple-view theories, it seems a gross waste of mental resources for the visual system to compute a structural description merely to identify a single, isolated object, since representations created by mechanisms at a lower, less costly level would probably suffice. We would propose, instead, that a structural description may only be computed when the visual system has to discriminate more than one multipart object, and the discrimination cannot be made on the basis of the constituent components but requires knowledge of the spatial role that each component has within its host object.

The Role of Spatial Attention in Object Recognition

By now, it seems quite clear that attention is needed for object recognition especially when recognition requires the construction of a full structural description of the object. It should be noted that in this series of experiments we did not venture into systematic manipulations of attention either by short exposure times to create illusory conjunctions (Shyi & Cheng, 1996) or by spatial cueing to direct allocation of attention (Shyi & Chen, 1998). Nonetheless it is quite clear that attention is needed in the various conditions that were used here. Just looking at the exposure time that was required to yield a reasonable level of performance, there can be little doubt that attention plays a role in deciding whether the displayed objects were identical or not.

The more interesting question, however, is not so much whether the construction of full structural descriptions requires attention, but why. The answer that Hummel and Stankiewicz (1996) provided, along with that offered by Hummel and Biederman (1992), was that dynamic binding among attributes (including relative spatial location) is error prone and requires time to develop. The main function of attention is therefore to foster correct binding between attributes belonging to the same geon or, one level up, between geons belonging to the same object; that is, to prevent accidental synchrony of firing between attributes of different geons or geons of different objects. Our results, especially those from Experiment 5, suggest yet another functional role for attention: to actively parse a connected object into its constituent parts and to maintain the segregation of those components. It is somewhat odd that neither the neural network model of Hummel and Biederman (1992) (i.e., JIM) nor that of Hummel and Stankiewicz (1996) (i.e., JIM2) has an explicit way to deal with objects whose parts are connected to one another. Synchrony among units representing attributes of the same geon (or geons of the same object) and asynchrony among units that do not belong together speak only to the co-occurrence (i.e., binding) of the attributes or geons; they imply little about whether the parts are connected, and if so, in what way. The nodes in those models may, through some form of cohort activation, collectively represent the detection of a particular geon, or the fact that one geon is above or below another (or any other spatial relation between them), but it is not clear how, in either model, the nodes would represent the state of geons being connected.
(Note that this may not be as serious a problem for detecting the existence of a particular geon, because according to Biederman’s characterization, the attributes of geons are topological properties, rather than literal physical descriptions, of a visual stimulus. As such, no issue of connectedness arises at the individual-geon level.) In that sense, it is not clear how JIM and JIM2 would deal with the fact that each part is connected to at least one other part in a cohesive, bounded object.
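The limitation just described can be made concrete in a few lines. The sketch below is our own simplification of binding by synchrony, not JIM's actual architecture; each active unit carries a phase tag, and shared phase encodes "same geon." Recovering the phase groups yields the bindings, but nothing in the representation states whether the two geons are physically connected, which is the point at issue. The attribute labels are hypothetical.

```python
# Each active unit is an (attribute, phase) pair. Units firing in the
# same phase are bound together (same geon); distinct phases keep the
# two geons' attributes from being accidentally conjoined.
units = [
    ("curved-axis", 1), ("tapered", 1),          # geon A's attributes, phase 1
    ("straight-axis", 2), ("parallel-sides", 2),  # geon B's attributes, phase 2
]

def bound_groups(units):
    """Recover the geon-level bindings implied by phase synchrony."""
    groups = {}
    for attribute, phase in units:
        groups.setdefault(phase, set()).add(attribute)
    return groups

geons = bound_groups(units)
# Two geons are recovered from the phase structure, but the representation
# carries no fact about whether geon A and geon B are connected, or how.
print(len(geons))  # 2
```

Representing connectedness would require an additional explicit relation (or unit) beyond the phase structure itself, which is the extension neither JIM nor JIM2 provides.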

We suggest that the visual system needs attention in order to maintain a clear and stable description of the structural relationships, and the nature of the connections, among an object's various parts. Note that keeping a structural description of a connected object may be very unnatural, and hence effortful, because it entails decomposing what is originally a connected whole. With the objects we used and the task the RR participants were instructed to perform, there were numerous occasions on which they had to maintain a very specific and precise description of each object's structure. Consider, for example, judging whether two objects that had 2 MOPs in common were the same. When one object was rotated out of alignment with the other, the RR participants appeared to have difficulty responding "different," suggesting that maintaining the integrity of a structural description is an effortful, attention-demanding act, especially for an object undergoing some form of transformation.

References

Biederman, I. (1987). Recognition-by-components: A theory of human image understanding. Psychological Review, 94, 115-147.

Biederman, I. (1995). Visual object recognition. In S. M. Kosslyn & D. N. Osherson (Eds.), An invitation to cognitive science (vol. 2): Visual cognition (pp. 121-166). Cambridge, MA: MIT Press.

Biederman, I., & Cooper, E. E. (1991). Priming contour-deleted images: Evidence for intermediate representations in visual object recognition. Cognitive Psychology, 23, 393-419.

Biederman, I., & Gerhardstein, P. C. (1993). Recognizing depth-rotated objects: Evidence and conditions for three-dimensional viewpoint invariance. Journal of Experimental Psychology: Human Perception and Performance, 19, 1162-1182.

Biederman, I., & Gerhardstein, P. C. (1995). Viewpoint-dependent mechanisms in visual object recognition: Reply to Tarr and Bülthoff (1995). Journal of Experimental Psychology: Human Perception and Performance, 21, 1506-1514.

Biederman, I., & Ju, G. (1988). Surface versus edge-based determinants of visual recognition. Cognitive Psychology, 20, 38-64.

Desimone, R., & Duncan, J. (1995). Neural mechanisms of selective visual attention. Annual Review of Neuroscience, 18, 193-222.

Edelman, S., & Bülthoff, H. H. (1992). Orientation dependence in the recognition of familiar and novel views of 3D objects. Vision Research, 32, 2385-2400.

Goldstone, R. L. (1994). Similarity, interactive activation, and mapping. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 3-28.

Goldstone, R. L., & Medin, D. L. (1994). Time course of comparison. Journal of Experimental Psychology: Learning, Memory, and Cognition, 20, 29-50.

Hummel, J. E., & Biederman, I. (1992). Dynamic binding in a neural network for shape recognition. Psychological Review, 99, 480-517.

Hummel, J. E., & Stankiewicz, B. J. (1996). An architecture for rapid, hierarchical structural description. In T. Inui & J. McClelland (Eds.), Attention and performance XVI (pp. 93-121). Cambridge, MA: MIT Press.

Jonides, J., & Smith, E. (1997). The architecture of working memory. In M. D. Rugg (Ed.), Cognitive neuroscience (pp. 243-276). Cambridge, MA: MIT Press.

Lawson, R., & Humphreys, G. W. (1996). View specificity in object processing: Evidence from picture matching. Journal of Experimental Psychology: Human Perception & Performance, 22, 395-416.

Levitt, H. (1971). Transformed up-down methods in psychoacoustics. Journal of the Acoustical Society of America, 49, 467-477.

Logothetis, N. K., & Sheinberg, D. L. (1996). Recognition and representation of visual objects in primates: Psychophysics and physiology. In R. Llinás & P. S. Churchland (Eds.), The mind-brain continuum (pp. 147-172). Cambridge, MA: MIT Press.

Marr, D., & Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London, Series B, 200, 269-294.

Marr, D. (1982). Vision. San Francisco: Freeman.

Medin, D. L., Goldstone, R. L., & Gentner, D. (1993). Respects for similarity. Psychological Review, 100, 254-278.

Palmer, S. E. (1978). Fundamental aspects of cognitive representation. In E. Rosch & B. Lloyd (Eds.), Cognition and categorization (pp. 261-304). Hillsdale, NJ: Erlbaum.

Palmer, S. E. (1999). Vision science: Photons to phenomenology. Cambridge, MA: MIT Press.

Palmer, S. E., & Rock, I. (1994). Rethinking perceptual organization: The role of uniform connectedness. Psychonomic Bulletin & Review, 1, 29-55.

Perrett, D. I., Oram, M. W., Hietanen, J. K., & Benson, P. J. (1994). Issues of representation in object vision. In M. J. Farah & G. Ratcliff (Eds.), The neuropsychology of high-level vision (pp. 33-62). Hillsdale, NJ: Erlbaum.

Prinzmetal, W., Presti, D., & Posner, M. I. (1986). Does attention affect visual feature integration? Journal of Experimental Psychology: Human Perception and Performance, 12, 361-369.

Proctor, R. W., & Healy, A. F. (1985). Order-relevant and order-irrelevant decision rules in multiletter matching. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11, 519-537.

Rumelhart, D. E., & McClelland, J. L. (1982). An interactive activation model of context effects in letter perception: Part 2. The contextual enhancement effect and some tests and extensions of the model. Psychological Review, 89, 60-94.

Saiki, J., & Hummel, J. E. (1996). Attribute conjunctions and the part configuration advantage in object category learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 22, 1002-1019.

Saiki, J., & Hummel, J. E. (1998). Connectedness and the integration of parts with relations in shape perception. Journal of Experimental Psychology: Human Perception and Performance, 24, 227-251.

Shah, P., & Miyake, A. (1996). The separability of working memory resources for spatial thinking and language processing: An individual difference approach. Journal of Experimental Psychology: General, 125, 4-27.

Shyi, G. C.-W., & Chen, S.-W. (1998). Spatial precueing does not reduce illusory conjunctions in perceiving 3D objects. Paper presented at the 39th Annual Meeting of the Psychonomic Society, Nov. 18-22, Dallas, Texas.

Shyi, G. C.-W., & Cheng, S.-K. (1996). The role of spatial attention in visual object recognition. Paper presented at the XXVI International Congress of Psychology, August 16-22, Montreal, Canada.

Stankiewicz, B. J., Hummel, J. E., & Cooper, E. E. (1998). The role of attention in priming for left-right reflections of object images: Evidence for a dual representation of object shape. Journal of Experimental Psychology: Human Perception and Performance, 24, 732-744.

Tarr, M. J. (1995). Rotating objects to recognize them: A case study on the role of viewpoint dependency in the recognition of three-dimensional objects. Psychonomic Bulletin & Review, 2(1), 55-82.

Tarr, M. J., & Bülthoff, H. H. (1995). Is human object recognition better described by geon structural description or by multiple views? Comments on Biederman and Gerhardstein (1993). Journal of Experimental Psychology: Human Perception & Performance, 21, 1494-1505.

Tarr, M. J., & Bülthoff, H. H. (1998). Image-based object recognition in man, monkey, and machine. In M. J. Tarr & H. H. Bülthoff (Eds.), Object recognition in man, monkey, and machine (pp. 1-20). Cambridge, MA: MIT Press.

Tarr, M. J., & Pinker, S. (1989). Mental rotation and orientation-dependence in shape recognition. Cognitive Psychology, 21, 233-282.

Tarr, M. J., Williams, P., Hayward, W. G., & Gauthier, I. (1998). Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience, 1, 275-277.

Treisman, A. (1988). Features and objects: The fourteenth Bartlett memorial lecture. Quarterly Journal of Experimental Psychology, 40A, 201-237.

Treisman, A. (1993). The perception of features and objects. In A. Baddeley, & L. Weiskrantz (Eds.), Attention: Selection, awareness, and control (pp. 5-35). Oxford, England: Oxford University Press.

Treisman, A. M., & DeSchepper, B. (1996). Object tokens, attention and visual memory. In T. Inui & J. McClelland (Eds.), Attention and performance XVI (pp. 15-46). Cambridge, MA: MIT Press.

Treisman, A. M., & Gelade, G. (1980). A feature-integration theory of attention. Cognitive Psychology, 12, 97-136.

Treisman, A. M., & Schmidt, H. (1982). Illusory conjunctions in perception of objects. Cognitive Psychology, 14, 107-141.

Ullman, S. (1989). Aligning pictorial descriptions: An approach to object recognition. Cognition, 32, 193-254.

Ullman, S. (1996). High-level vision. Cambridge, MA: MIT Press.

Wolfe, J. M. (1998). Visual search. In H. Pashler (Ed.), Attention (pp. 13-73). East Sussex, UK: Psychology Press.

Table 1

The Mean Response Accuracy as a Function of Rule of Judgment, Orientation, and Type of Display in Experiment 4 (N = 24)

__________________________________________________________________________

                                    Type of Display
                    _______________________________________________
Group  Orientation    2 MIPs      2 MOPs      1 MIP       1 MOP
__________________________________________________________________________

RI     SAME            0.98        0.75        0.89        0.92
       DIFF            0.94        0.94        0.93        0.93
__________________________________________________________________________

RR     SAME            0.98        0.87        0.86        0.94
       DIFF            0.89        0.67        0.87        0.94
__________________________________________________________________________


Figure Captions

Figure 1. The set of 12 stimulus objects used in Experiments 1 to 5. Note that all objects comprised a truncated cone with two small parts attached to its upper-left and lower-right sides.

Figure 2. Examples illustrating the relations between a standard object (top row of each panel) and a sample object (bottom row of each panel) when the objects were presented in either the same (upper panel) or different (lower panel) orientations.

Figure 3. The mean response accuracy as a function of group (RI vs. RR) and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 1 (N = 119).

Figure 4. The mean accuracy as a function of group (RI vs. RR), type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) and orientation in Experiment 1 (N = 119).

Figure 5. The mean response accuracy as a function of type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) and block for the RR group in Experiment 1 (N = 119).

Figure 6. The mean response accuracy as a function of type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP), deadline (300, 600, and 900 ms) and block for the RR group in Experiment 1 (N = 119).

Figure 7. The mean response accuracy as a function of group (RI vs. RR) and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 2 (N = 58).

Figure 8. The mean response accuracy as a function of type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) and block for the RI group in Experiment 2 (N = 58).

Figure 9. The mean response accuracy as a function of group (RI vs. RR), type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) and orientation in Experiment 2 (N = 58).

Figure 10. The mean response accuracy as a function of type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP), orientation and block in Experiment 2 (N = 58).

Figure 11. The mean exposure deadline for RI and RR participants to reach an equivalent level of performance when displays were presented in either the same or different orientations in Experiment 3 (N = 85).

Figure 12. The mean response latency (ms) as a function of group (RI vs. RR) and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 4 (N = 24).

Figure 13. The mean response latency (ms) as a function of group (RI vs. RR), type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) and orientation in Experiment 4 (N = 24).

Figure 14. An illustration of a connected object and its disconnected counterpart used in Experiment 5.

Figure 15. The mean response accuracy as a function of connectedness and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 5 (N = 57).

Figure 16. The mean response accuracy as a function of rule of judgment (RI vs. RR) and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 5 (N = 57).

Figure 17. The mean response accuracy as a function of deadline (1 s vs. 2 s) and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 5 (N = 57).

Figure 18. The mean accuracy as a function of orientation and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) in Experiment 5 (N = 57).

Figure 19. The mean response accuracy as a function of rule of judgment (RI vs. RR), orientation, and type of stimulus relations (2 MIPs, 2 MOPs, 1 MIP, and 1 MOP) for the connected group (two panels on the left) and the disconnected group (two panels on the right) in Experiment 5 (N = 57).

[Figures 1-19 appear here, one per page. Figures 6, 9, 10, and 19 are multipanel figures (panels 6.1-6.4, 9.1-9.2, 10.1-10.4, and 19.1-19.4). See the Figure Captions above.]