comparison thesis/cortex.org @ 447:284316604be0
minor changes from Dylan.
author | Robert McIntyre <rlm@mit.edu> |
---|---|
date | Tue, 25 Mar 2014 11:30:15 -0400 |
parents | 3e91585b2a1c |
children | af13fc73e851 |
446:3e91585b2a1c | 447:284316604be0 |
---|---|
8 * Empathy and Embodiment as problem solving strategies | 8 * Empathy and Embodiment as problem solving strategies |
9 | 9 |
10 By the end of this thesis, you will have seen a novel approach to | 10 By the end of this thesis, you will have seen a novel approach to |
11 interpreting video using embodiment and empathy. You will have also | 11 interpreting video using embodiment and empathy. You will have also |
12 seen one way to efficiently implement empathy for embodied | 12 seen one way to efficiently implement empathy for embodied |
13 creatures. Finally, you will become familiar with =CORTEX=, a | 13 creatures. Finally, you will become familiar with =CORTEX=, a system |
14 system for designing and simulating creatures with rich senses, | 14 for designing and simulating creatures with rich senses, which you |
15 which you may choose to use in your own research. | 15 may choose to use in your own research. |
16 | 16 |
17 This is the core vision of my thesis: That one of the important ways | 17 This is the core vision of my thesis: That one of the important ways |
18 in which we understand others is by imagining ourselves in their | 18 in which we understand others is by imagining ourselves in their |
19 position and empathically feeling experiences relative to our own | 19 position and empathically feeling experiences relative to our own |
20 bodies. By understanding events in terms of our own previous | 20 bodies. By understanding events in terms of our own previous |
24 is happening in a video and being completely lost in a sea of | 24 is happening in a video and being completely lost in a sea of |
25 incomprehensible color and movement. | 25 incomprehensible color and movement. |
26 | 26 |
27 ** Recognizing actions in video is extremely difficult | 27 ** Recognizing actions in video is extremely difficult |
28 | 28 |
29 Consider for example the problem of determining what is happening in | 29 Consider for example the problem of determining what is happening |
30 a video of which this is one frame: | 30 in a video of which this is one frame: |
31 | 31 |
32 #+caption: A cat drinking some water. Identifying this action is | 32 #+caption: A cat drinking some water. Identifying this action is |
33 #+caption: beyond the state of the art for computers. | 33 #+caption: beyond the state of the art for computers. |
34 #+ATTR_LaTeX: :width 7cm | 34 #+ATTR_LaTeX: :width 7cm |
35 [[./images/cat-drinking.jpg]] | 35 [[./images/cat-drinking.jpg]] |
36 | 36 |
37 It is currently impossible for any computer program to reliably | 37 It is currently impossible for any computer program to reliably |
38 label such a video as "drinking". And rightly so -- it is a very | 38 label such a video as ``drinking''. And rightly so -- it is a very |
39 hard problem! What features can you describe in terms of low level | 39 hard problem! What features can you describe in terms of low level |
40 functions of pixels that can even begin to describe at a high level | 40 functions of pixels that can even begin to describe at a high level |
41 what is happening here? | 41 what is happening here? |
42 | 42 |
43 Or suppose that you are building a program that recognizes | 43 Or suppose that you are building a program that recognizes chairs. |
44 chairs. How could you ``see'' the chair in figure | 44 How could you ``see'' the chair in figure \ref{invisible-chair} and |
45 \ref{invisible-chair} and figure \ref{hidden-chair}? | 45 figure \ref{hidden-chair}? |
46 | 46 |
47 #+caption: When you look at this, do you think ``chair''? I certainly do. | 47 #+caption: When you look at this, do you think ``chair''? I certainly do. |
48 #+name: invisible-chair | 48 #+name: invisible-chair |
49 #+ATTR_LaTeX: :width 10cm | 49 #+ATTR_LaTeX: :width 10cm |
50 [[./images/invisible-chair.png]] | 50 [[./images/invisible-chair.png]] |
67 | 67 |
68 Each of these examples tells us something about what might be going | 68 Each of these examples tells us something about what might be going |
69 on in our minds as we easily solve these recognition problems. | 69 on in our minds as we easily solve these recognition problems. |
70 | 70 |
71 The hidden chairs show us that we are strongly triggered by cues | 71 The hidden chairs show us that we are strongly triggered by cues |
72 relating to the position of human bodies, and that we can | 72 relating to the position of human bodies, and that we can determine |
73 determine the overall physical configuration of a human body even | 73 the overall physical configuration of a human body even if much of |
74 if much of that body is occluded. | 74 that body is occluded. |
75 | 75 |
76 The picture of the girl pushing against the wall tells us that we | 76 The picture of the girl pushing against the wall tells us that we |
77 have common sense knowledge about the kinetics of our own bodies. | 77 have common sense knowledge about the kinetics of our own bodies. |
78 We know well how our muscles would have to work to maintain us in | 78 We know well how our muscles would have to work to maintain us in |
79 most positions, and we can easily project this self-knowledge to | 79 most positions, and we can easily project this self-knowledge to |
83 | 83 |
84 I propose a system that can express the types of recognition | 84 I propose a system that can express the types of recognition |
85 problems above in a form amenable to computation. It is split into | 85 problems above in a form amenable to computation. It is split into |
86 four parts: | 86 four parts: |
87 | 87 |
88 - Free/Guided Play :: The creature moves around and experiences the | 88 - Free/Guided Play (Training) :: The creature moves around and |
89 world through its unique perspective. Many otherwise | 89 experiences the world through its unique perspective. Many |
90 complicated actions are easily described in the language of a | 90 otherwise complicated actions are easily described in the |
91 full suite of body-centered, rich senses. For example, | 91 language of a full suite of body-centered, rich senses. For |
92 drinking is the feeling of water sliding down your throat, and | 92 example, drinking is the feeling of water sliding down your |
93 cooling your insides. It's often accompanied by bringing your | 93 throat, and cooling your insides. It's often accompanied by |
94 hand close to your face, or bringing your face close to | 94 bringing your hand close to your face, or bringing your face |
95 water. Sitting down is the feeling of bending your knees, | 95 close to water. Sitting down is the feeling of bending your |
96 activating your quadriceps, then feeling a surface with your | 96 knees, activating your quadriceps, then feeling a surface with |
97 bottom and relaxing your legs. These body-centered action | 97 your bottom and relaxing your legs. These body-centered action |
98 descriptions can be either learned or hard coded. | 98 descriptions can be either learned or hard coded. |
99 - Alignment :: When trying to interpret a video or image, the | 99 - Alignment (Posture imitation) :: When trying to interpret a video |
100 creature takes a model of itself and aligns it with | 100 or image, the creature takes a model of itself and aligns it |
101 whatever it sees. This can be a rather loose | 101 with whatever it sees. This alignment can even cross species, |
102 alignment that can cross species, as when humans try | 102 as when humans try to align themselves with things like |
103 to align themselves with things like ponies, dogs, | 103 ponies, dogs, or other humans with a different body type. |
104 or other humans with a different body type. | 104 - Empathy (Sensory extrapolation) :: The alignment triggers |
105 - Empathy :: The alignment triggers the memories of previous | 105 associations with sensory data from prior experiences. For |
106 experience. For example, the alignment itself easily | 106 example, the alignment itself easily maps to proprioceptive |
107 maps to proprioceptive data. Any sounds or obvious | 107 data. Any sounds or obvious skin contact in the video can to a |
108 skin contact in the video can to a lesser extent | 108 lesser extent trigger previous experience. Segments of |
109 trigger previous experience. The creature's previous | 108 lesser extent trigger previous experience. Segments of |
110 experience is chained together in short bursts to | 110 and complete sensory portrait of the scene. |
111 coherently describe the new scene. | 111 - Recognition (Classification) :: With the scene described in terms |
112 - Recognition :: With the scene now described in terms of past | 112 of first person sensory events, the creature can now run its |
113 experience, the creature can now run its | 113 action-identification programs on this synthesized sensory |
114 action-identification programs on this synthesized | 114 data, just as it would if it were actually experiencing the |
115 sensory data, just as it would if it were actually | 115 scene first-hand. If previous experience has been accurately |
116 experiencing the scene first-hand. If previous | 116 retrieved, and if it is analogous enough to the scene, then |
117 experience has been accurately retrieved, and if | 117 the creature will correctly identify the action in the scene. |
118 it is analogous enough to the scene, then the | 118 |
119 creature will correctly identify the action in the | |
120 scene. | |
121 | |
122 | |
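To make the four phases above concrete, here is a minimal, runnable
sketch of how they might fit together. Everything in it is a toy
stand-in: the data and the names =free-play=, =align=, =empathize=,
and =recognize= are hypothetical placeholders, not actual =CORTEX= or
=EMPATH= code. The sketch only shows the shape of the pipeline, not a
real implementation.

#+caption: A toy, hypothetical sketch of the four-phase pipeline.
#+begin_src clojure
;; Toy stand-ins for the four phases; none of this is CORTEX code.

(defn free-play
  "Phase 1: a library of full sensory snapshots gathered while the
   creature moves around, each keyed by its posture."
  []
  [{:posture [0.0 0.1] :touch :none  :label :resting}
   {:posture [0.9 0.8] :touch :tail  :label :curling}
   {:posture [0.4 0.5] :touch :floor :label :wiggling}])

(defn align
  "Phase 2: recover a posture estimate from an observed frame.
   Here the 'frame' is already just a posture vector."
  [frame]
  frame)

(defn empathize
  "Phase 3: retrieve the remembered snapshot whose posture is closest
   to the observed one, filling in the senses we cannot see."
  [experience posture]
  (apply min-key
         (fn [entry]
           (reduce + (map (fn [a b] (Math/abs (double (- a b))))
                          (:posture entry) posture)))
         experience))

(defn recognize
  "Phase 4: run a body-centered action predicate on the imagined
   sensory data; here, just read off the stored label."
  [imagined]
  (:label imagined))

(defn empathic-recognition [frame]
  (->> frame align (empathize (free-play)) recognize))

(empathic-recognition [0.85 0.75]) ;; => :curling
#+end_src
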
123 For example, I think humans are able to label the cat video as | 119 For example, I think humans are able to label the cat video as |
124 "drinking" because they imagine /themselves/ as the cat, and | 120 ``drinking'' because they imagine /themselves/ as the cat, and |
125 imagine putting their face up against a stream of water and | 121 imagine putting their face up against a stream of water and |
126 sticking out their tongue. In that imagined world, they can feel | 122 sticking out their tongue. In that imagined world, they can feel |
127 the cool water hitting their tongue, and feel the water entering | 123 the cool water hitting their tongue, and feel the water entering |
128 their body, and are able to recognize that /feeling/ as | 124 their body, and are able to recognize that /feeling/ as drinking. |
129 drinking. So, the label of the action is not really in the pixels | 125 So, the label of the action is not really in the pixels of the |
130 of the image, but is found clearly in a simulation inspired by | 126 image, but is found clearly in a simulation inspired by those |
131 those pixels. An imaginative system, having been trained on | 127 pixels. An imaginative system, having been trained on drinking and |
132 drinking and non-drinking examples and learning that the most | 128 non-drinking examples and learning that the most important |
133 important component of drinking is the feeling of water sliding | 129 component of drinking is the feeling of water sliding down one's |
134 down one's throat, would analyze a video of a cat drinking in the | 130 throat, would analyze a video of a cat drinking in the following |
135 following manner: | 131 manner: |
136 | 132 |
137 1. Create a physical model of the video by putting a "fuzzy" model | 133 1. Create a physical model of the video by putting a ``fuzzy'' |
138 of its own body in place of the cat. Possibly also create a | 134 model of its own body in place of the cat. Possibly also create |
139 simulation of the stream of water. | 135 a simulation of the stream of water. |
140 | 136 |
141 2. Play out this simulated scene and generate imagined sensory | 137 2. Play out this simulated scene and generate imagined sensory |
142 experience. This will include relevant muscle contractions, a | 138 experience. This will include relevant muscle contractions, a |
143 close up view of the stream from the cat's perspective, and most | 139 close up view of the stream from the cat's perspective, and most |
144 importantly, the imagined feeling of water entering the | 140 importantly, the imagined feeling of water entering the |
182 #+caption: curling, wiggling, and resting. | 178 #+caption: curling, wiggling, and resting. |
183 #+name: worm-intro | 179 #+name: worm-intro |
184 #+ATTR_LaTeX: :width 15cm | 180 #+ATTR_LaTeX: :width 15cm |
185 [[./images/worm-intro-white.png]] | 181 [[./images/worm-intro-white.png]] |
186 | 182 |
187 #+caption: The actions of a worm in a video can be recognized by | 183 #+caption: =EMPATH= recognized and classified each of these poses by |
188 #+caption: proprioceptive data and sensory predicates by filling | 184 #+caption: inferring the complete sensory experience from |
189 #+caption: in the missing sensory detail with previous experience. | 185 #+caption: proprioceptive data. |
190 #+name: worm-recognition-intro | 186 #+name: worm-recognition-intro |
191 #+ATTR_LaTeX: :width 15cm | 187 #+ATTR_LaTeX: :width 15cm |
192 [[./images/worm-poses.png]] | 188 [[./images/worm-poses.png]] |
193 | |
194 | 189 |
195 One powerful advantage of empathic problem solving is that it | 190 One powerful advantage of empathic problem solving is that it |
196 factors the action recognition problem into two easier problems. To | 191 factors the action recognition problem into two easier problems. To |
197 use empathy, you need an /aligner/, which takes the video and a | 192 use empathy, you need an /aligner/, which takes the video and a |
198 model of your body, and aligns the model with the video. Then, you | 193 model of your body, and aligns the model with the video. Then, you |
199 need a /recognizer/, which uses the aligned model to interpret the | 194 need a /recognizer/, which uses the aligned model to interpret the |
200 action. The power in this method lies in the fact that you describe | 195 action. The power in this method lies in the fact that you describe |
201 all actions from a body-centered, rich viewpoint. This way, if you | 196 all actions from a body-centered viewpoint. You are less tied to |
197 the particulars of any visual representation of the actions. If you | |
202 teach the system what ``running'' is, and you have a good enough | 198 teach the system what ``running'' is, and you have a good enough |
203 aligner, the system will from then on be able to recognize running | 199 aligner, the system will from then on be able to recognize running |
204 from any point of view, even strange points of view like above or | 200 from any point of view, even strange points of view like above or |
205 underneath the runner. This is in contrast to action recognition | 201 underneath the runner. This is in contrast to action recognition |
206 schemes that try to identify actions using a non-embodied approach | 202 schemes that try to identify actions using a non-embodied approach |
207 such as TODO:REFERENCE. If these systems learn about running as viewed | 203 such as TODO:REFERENCE. If these systems learn about running as |
208 from the side, they will not automatically be able to recognize | 204 viewed from the side, they will not automatically be able to |
209 running from any other viewpoint. | 205 recognize running from any other viewpoint. |
210 | 206 |
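Below is a minimal sketch of this factoring into an aligner and a
recognizer. The function and data names are hypothetical placeholders,
not =EMPATH= code; the point it illustrates is that only the aligner
ever sees the raw video, so changing the camera viewpoint never
touches the recognizer.

#+caption: A hypothetical sketch of the aligner/recognizer factoring.
#+begin_src clojure
;; Only the aligner sees the raw video; the recognizer works purely
;; on body-centered data, so it never needs to know the viewpoint.

(defn make-action-recognizer
  "Compose an aligner (video -> body-centered description) with a
   recognizer (body-centered description -> action label)."
  [aligner recognizer]
  (fn [video] (-> video aligner recognizer)))

;; Toy stand-ins: a 'video' is a map carrying a viewpoint tag and a
;; posture value; a real aligner would have to recover the posture.
(defn toy-aligner    [video]   (:posture video))
(defn toy-recognizer [posture] (if (> posture 0.5) :running :walking))

(def recognize-video (make-action-recognizer toy-aligner toy-recognizer))

(recognize-video {:viewpoint :side  :posture 0.9}) ;; => :running
(recognize-video {:viewpoint :above :posture 0.9}) ;; => :running
#+end_src
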
211 Another powerful advantage is that using the language of multiple | 207 Another powerful advantage is that using the language of multiple |
212 body-centered rich senses to describe body-centered actions offers a | 208 body-centered rich senses to describe body-centered actions offers a |
213 massive boost in descriptive capability. Consider how difficult it | 209 massive boost in descriptive capability. Consider how difficult it |
214 would be to compose a set of HOG filters to describe the action of | 210 would be to compose a set of HOG filters to describe the action of |
215 a simple worm-creature "curling" so that its head touches its tail, | 211 a simple worm-creature ``curling'' so that its head touches its |
216 and then behold the simplicity of describing this action in a | 212 tail, and then behold the simplicity of describing this action in a |
217 language designed for the task (listing \ref{grand-circle-intro}): | 213 language designed for the task (listing \ref{grand-circle-intro}): |
218 | 214 |
219 #+caption: Body-centered actions are best expressed in a body-centered | 215 #+caption: Body-centered actions are best expressed in a body-centered |
220 #+caption: language. This code detects when the worm has curled into a | 216 #+caption: language. This code detects when the worm has curled into a |
221 #+caption: full circle. Imagine how you would replicate this functionality | 217 #+caption: full circle. Imagine how you would replicate this functionality |
291 empathy, using =CORTEX= as a base. Empathy in this context is the | 287 empathy, using =CORTEX= as a base. Empathy in this context is the |
292 ability to observe another creature and infer what sorts of sensations | 288 ability to observe another creature and infer what sorts of sensations |
293 that creature is feeling. My empathy algorithm involves multiple | 289 that creature is feeling. My empathy algorithm involves multiple |
294 phases. First is free-play, where the creature moves around and gains | 290 phases. First is free-play, where the creature moves around and gains |
295 sensory experience. From this experience I construct a representation | 291 sensory experience. From this experience I construct a representation |
296 of the creature's sensory state space, which I call \phi-space. Using | 292 of the creature's sensory state space, which I call \Phi-space. Using |
297 \phi-space, I construct an efficient function for enriching the | 293 \Phi-space, I construct an efficient function for enriching the |
298 limited data that comes from observing another creature with a full | 294 limited data that comes from observing another creature with a full |
299 complement of imagined sensory data based on previous experience. I | 295 complement of imagined sensory data based on previous experience. I |
300 can then use the imagined sensory data to recognize what the observed | 296 can then use the imagined sensory data to recognize what the observed |
301 creature is doing and feeling, using straightforward embodied action | 297 creature is doing and feeling, using straightforward embodied action |
302 predicates. This is all demonstrated using a simple worm-like | 298 predicates. This is all demonstrated using a simple worm-like |
312 | 308 |
313 | 309 |
314 | 310 |
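As a rough illustration of the \Phi-space idea described above, here
is one way the enrichment function could be organized: index every
experienced sensory snapshot by a quantized version of its
proprioceptive data, then enrich an observed posture by looking up
its bin. The binning scheme and all of the names here
(=build-phi-space=, =enrich=, =quantize=) are assumptions made for
this sketch, not necessarily how the real system works.

#+caption: A hypothetical sketch of \Phi-space indexing and enrichment.
#+begin_src clojure
;; Hypothetical Phi-space lookup; the binning scheme and names below
;; are assumptions for illustration, not EMPATH code.

(defn quantize
  "Discretize a vector of joint angles into coarse bins."
  [posture]
  (mapv #(Math/round (* 10.0 (double %))) posture))

(defn build-phi-space
  "Index every experienced sensory snapshot by its quantized posture,
   so that enrichment is a simple map lookup."
  [experience]
  (group-by (comp quantize :posture) experience))

(defn enrich
  "Map an observed posture to the full remembered snapshots in its
   bin (empty if nothing similar was ever experienced)."
  [phi-space observed-posture]
  (get phi-space (quantize observed-posture) []))

;; Usage with toy data:
(def phi
  (build-phi-space
   [{:posture [0.31 0.60] :touch :tail :muscles [1 0]}
    {:posture [0.02 0.05] :touch :none :muscles [0 0]}]))

(enrich phi [0.29 0.61])
;; => [{:posture [0.31 0.60], :touch :tail, :muscles [1 0]}]
#+end_src
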
315 * COMMENT names for cortex | 311 * COMMENT names for cortex |
316 - bioland | 312 - bioland |
313 |
314 |
315 |
316 |
317 # An anatomical joke: |
318 # - Training |
319 # - Skeletal imitation |
320 # - Sensory fleshing-out |
321 # - Classification |