comparison thesis/cortex.org @ 447:284316604be0

minor changes from Dylan.
author Robert McIntyre <rlm@mit.edu>
date Tue, 25 Mar 2014 11:30:15 -0400
parents 3e91585b2a1c
children af13fc73e851
* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a system
for designing and simulating creatures with rich senses, which you
may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous

...

is happening in a video and being completely lost in a sea of
incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider, for example, the problem of determining what is happening
in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as ``drinking''. And rightly so -- it is a very
hard problem! What features can you describe in terms of low-level
functions of pixels that can even begin to describe at a high level
what is happening here?

Or suppose that you are building a program that recognizes chairs.
How could you ``see'' the chair in figure \ref{invisible-chair} and
figure \ref{hidden-chair}?

#+caption: When you look at this, do you think ``chair''? I certainly do.
#+name: invisible-chair
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

...

Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chairs show us that we are strongly triggered by cues
relating to the position of human bodies, and that we can determine
the overall physical configuration of a human body even if much of
that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to

...

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts:

- Free/Guided Play (Training) :: The creature moves around and
     experiences the world through its unique perspective. Many
     otherwise complicated actions are easily described in the
     language of a full suite of body-centered, rich senses. For
     example, drinking is the feeling of water sliding down your
     throat, and cooling your insides. It's often accompanied by
     bringing your hand close to your face, or bringing your face
     close to water. Sitting down is the feeling of bending your
     knees, activating your quadriceps, then feeling a surface with
     your bottom and relaxing your legs. These body-centered action
     descriptions can be either learned or hard-coded (a toy sketch
     of a hard-coded description follows this list).
- Alignment (Posture imitation) :: When trying to interpret a video
     or image, the creature takes a model of itself and aligns it
     with whatever it sees. This alignment can even cross species,
     as when humans try to align themselves with things like
     ponies, dogs, or other humans with a different body type.
- Empathy (Sensory extrapolation) :: The alignment triggers
     associations with sensory data from prior experiences. For
     example, the alignment itself easily maps to proprioceptive
     data. Any sounds or obvious skin contact in the video can, to a
     lesser extent, trigger previous experience. Segments of
     previous experiences are stitched together to form a coherent
     and complete sensory portrait of the scene.
- Recognition (Classification) :: With the scene described in terms
     of first-person sensory events, the creature can now run its
     action-identification programs on this synthesized sensory
     data, just as it would if it were actually experiencing the
     scene first-hand. If previous experience has been accurately
     retrieved, and if it is analogous enough to the scene, then
     the creature will correctly identify the action in the scene.
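
To make the flavor of a hard-coded, body-centered action description
concrete, here is a minimal Clojure sketch of a single-frame
``drinking'' predicate. The sensory map and its keys
(=:tongue-wetness=, =:throat-flow=) are hypothetical stand-ins
invented for this illustration; they are not the actual =CORTEX=
sense interface.

#+begin_src clojure
;; Hypothetical sketch only: a hard-coded, body-centered description
;; of "drinking" as a predicate over one frame of imagined sensory
;; data.  `experience` is assumed to map made-up sense channels to
;; readings in [0,1].
(defn drinking?
  "True when the imagined frame feels like drinking: water on the
   tongue and liquid flowing down the throat."
  [experience]
  (and (> (:tongue-wetness experience 0.0) 0.5)
       (> (:throat-flow    experience 0.0) 0.5)))

(drinking? {:tongue-wetness 0.9, :throat-flow 0.8}) ;; => true
(drinking? {:tongue-wetness 0.1, :throat-flow 0.0}) ;; => false
#+end_src

A learned description would replace the hand-chosen thresholds with
something fit from free-play experience, but the input would stay the
same: body-centered sensory data rather than pixels.
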
For example, I think humans are able to label the cat video as
``drinking'' because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as drinking.
So, the label of the action is not really in the pixels of the
image, but is found clearly in a simulation inspired by those
pixels. An imaginative system, having been trained on drinking and
non-drinking examples and learning that the most important
component of drinking is the feeling of water sliding down one's
throat, would analyze a video of a cat drinking in the following
manner:

1. Create a physical model of the video by putting a ``fuzzy''
   model of its own body in place of the cat. Possibly also create
   a simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close-up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the

...

#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-intro-white.png]]

#+caption: =EMPATH= recognized and classified each of these poses by
#+caption: inferring the complete sensory experience from
#+caption: proprioceptive data.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-poses.png]]

One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered viewpoint. You are less tied to
the particulars of any visual representation of the actions. If you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach
such as TODO:REFERENCE. If these systems learn about running as
viewed from the side, they will not automatically be able to
recognize running from any other viewpoint.
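
As a rough illustration of this factoring, here is a hypothetical
Clojure sketch in which the aligner and the recognizer are simply two
functions composed together. The function names and data shapes are
invented for this example and are not =EMPATH='s actual interface.

#+begin_src clojure
;; Hypothetical sketch of the aligner/recognizer factoring.  An
;; aligner maps (video, body-model) -> a sequence of aligned poses;
;; a recognizer maps that pose sequence -> an action label.
(defn empathic-recognize
  "Compose an aligner and a recognizer into an action labeler."
  [aligner recognizer body-model video]
  (-> video
      (aligner body-model)
      (recognizer)))

;; Toy stand-ins: "videos" are already sequences of joint angles,
;; the aligner is the identity, and the recognizer sums total bend.
(def toy-aligner (fn [video _body-model] video))
(def toy-recognizer
  (fn [poses]
    (if (> (reduce + (map #(reduce + %) poses)) 5.0)
      :curling
      :resting)))

(empathic-recognize toy-aligner toy-recognizer :worm-model
                    [[0.5 0.6 0.7] [0.8 0.9 1.0] [1.1 1.2 1.3]])
;; => :curling
#+end_src

The point of the factoring is that only the aligner ever has to look
at pixels; everything the recognizer sees is already body-centered.
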
Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers a
massive boost in descriptive capability. Consider how difficult it
would be to compose a set of HOG filters to describe the action of
a simple worm-creature ``curling'' so that its head touches its
tail, and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}):

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality

...

empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of sensations
that creature is feeling. My empathy algorithm involves multiple
phases. First is free-play, where the creature moves around and gains
sensory experience. From this experience I construct a representation
of the creature's sensory state space, which I call \Phi-space. Using
\Phi-space, I construct an efficient function for enriching the
limited data that comes from observing another creature with a full
complement of imagined sensory data based on previous experience. I
can then use the imagined sensory data to recognize what the observed
creature is doing and feeling, using straightforward embodied action
predicates. This is all demonstrated using a simple worm-like
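
To make these phases a little more concrete, here is a minimal
Clojure sketch of one way \Phi-space and the enrichment step could be
represented: \Phi-space as the raw record of free-play experience,
and enrichment as a nearest-neighbor lookup keyed on proprioception.
This is an illustrative assumption, not the actual implementation
described later in the thesis.

#+begin_src clojure
;; Hypothetical sketch.  Phi-space is taken to be the sequence of
;; complete sensory frames recorded during free play; each frame maps
;; a sense to its reading, with :proprioception always present.
(def phi-space
  [{:proprioception [0.0 0.0 0.0] :touch 0.0 :muscle 0.1}
   {:proprioception [0.4 0.5 0.4] :touch 0.2 :muscle 0.6}
   {:proprioception [0.9 1.0 0.9] :touch 0.8 :muscle 0.9}])

(defn- distance
  "Euclidean distance between two proprioceptive vectors."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn enrich
  "Given only the proprioception inferred from an alignment, return
   the nearest complete sensory frame from phi-space."
  [phi-space proprio]
  (apply min-key #(distance proprio (:proprioception %)) phi-space))

(enrich phi-space [0.85 0.95 0.9])
;; => {:proprioception [0.9 1.0 0.9], :touch 0.8, :muscle 0.9}
#+end_src

A real enrichment function also has to keep successive frames
consistent with one another (the stitching together of experience
segments described above), which this per-frame lookup ignores.
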
...

* COMMENT names for cortex
- bioland

# An anatomical joke:
# - Training
# - Skeletal imitation
# - Sensory fleshing-out
# - Classification