comparison thesis/cortex.org @ 441:c20de2267d39
completing first third of first chapter.

author | Robert McIntyre <rlm@mit.edu>
date | Mon, 24 Mar 2014 20:59:35 -0400
parents | b01c070b03d4
children | eaf8c591372b

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a
system for designing and simulating creatures with rich senses,
which you may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous
corporeal experience, we greatly constrain the possibilities of what
would otherwise be an unwieldy exponential search. This extra
constraint can be the difference between easily understanding what
is happening in a video and being completely lost in a sea of
incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider for example the problem of determining what is happening in
a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". And rightly so -- it is a very
hard problem! What features can you describe in terms of low level
functions of pixels that can even begin to describe at a high level
what is happening here?

Or suppose that you are building a program that recognizes
chairs. How could you ``see'' the chair in figure
\ref{invisible-chair} and figure \ref{hidden-chair}?

#+caption: When you look at this, do you think ``chair''? I certainly do.
#+name: invisible-chair
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any computer program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference between
how the girl's /muscles/ are working in figure \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chairs show us that we are strongly triggered by cues
relating to the position of human bodies, and that we can
determine the overall physical configuration of a human body even
if much of that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to
imagined positions triggered by images of the human body.

** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts (a code sketch of the full pipeline follows this list):

- Free/Guided Play :: The creature moves around and experiences the
     world through its unique perspective. Many otherwise
     complicated actions are easily described in the language of a
     full suite of body-centered, rich senses. For example,
     drinking is the feeling of water sliding down your throat, and
     cooling your insides. It's often accompanied by bringing your
     hand close to your face, or bringing your face close to
     water. Sitting down is the feeling of bending your knees,
     activating your quadriceps, then feeling a surface with your
     bottom and relaxing your legs. These body-centered action
     descriptions can be either learned or hard coded.
- Alignment :: When trying to interpret a video or image, the
     creature takes a model of itself and aligns it with
     whatever it sees. This can be a rather loose
     alignment that can cross species, as when humans try
     to align themselves with things like ponies, dogs,
     or other humans with a different body type.
- Empathy :: The alignment triggers the memories of previous
     experience. For example, the alignment itself easily
     maps to proprioceptive data. Any sounds or obvious
     skin contact in the video can to a lesser extent
     trigger previous experience. The creature's previous
     experience is chained together in short bursts to
     coherently describe the new scene.
- Recognition :: With the scene now described in terms of past
     experience, the creature can now run its
     action-identification programs on this synthesized
     sensory data, just as it would if it were actually
     experiencing the scene first-hand. If previous
     experience has been accurately retrieved, and if
     it is analogous enough to the scene, then the
     creature will correctly identify the action in the
     scene.

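To make that factorization concrete, here is a minimal sketch of how
the last three stages might compose, assuming free play has already
produced the embodied experience that =empathize= draws on. Every
name in this listing is a hypothetical illustration, not part of
=EMPATH= itself:

#+begin_listing clojure
#+begin_src clojure
;; Hypothetical sketch only: the stage functions are passed in as
;; arguments, since their real implementations are the subject of
;; the rest of this thesis. `action-predicates` is a sequence of
;; [name predicate] pairs.
(defn interpret-video
  "Recognize the action in `video` by aligning `body-model` to it,
   synthesizing imagined sensory experience, and running ordinary
   action predicates on the result."
  [align empathize action-predicates body-model video]
  (let [alignment (align body-model video)  ; Alignment stage
        imagined  (empathize alignment)]    ; Empathy stage
    ;; Recognition stage: the name of the first action predicate
    ;; satisfied by the imagined sensory experience, if any.
    (some (fn [[action-name predicate?]]
            (when (predicate? imagined) action-name))
          action-predicates)))
#+end_src
#+end_listing
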
For example, I think humans are able to label the cat video as
"drinking" because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and learning that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

1. Create a physical model of the video by putting a "fuzzy" model
   of its own body in place of the cat. Possibly also create a
   simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the
   mouth. The imagined sensory experience can come both from a
   simulation of the event and from pattern-matching against
   previous, similar embodied experience.

3. The action is now easily identified as drinking by the sense of
   taste alone. The other senses (such as the tongue moving in and
   out) help to give plausibility to the simulated action. Note that
   the sense of vision, while critical in creating the simulation,
   is not critical for identifying the action from the simulation.

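If imagined sensory experience is represented in the same form as
first-hand experience, this last step reduces to an ordinary sensory
predicate in the style of those shown later in this section. As a
purely hypothetical illustration (the =:taste= representation below
is invented for this sketch):

#+begin_listing clojure
#+begin_src clojure
;; Hypothetical sketch: assumes each experience frame carries a
;; :taste map with a :water channel in [0.0, 1.0]. This
;; representation is invented for illustration only.
(defn drinking?
  "Is the creature drinking, judged by the sense of taste alone?"
  [experiences]
  (< 0.7 (-> experiences peek :taste :water)))
#+end_src
#+end_listing
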
For the chair examples, the process is even easier:

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related
   sensory experience associated with that particular proprioceptive
   feeling (see the sketch after this list).

4. Retrieve the feeling of your bottom resting on a surface and
   your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?=
   sensory predicate, so you (and the entity in the image) must be
   sitting.

6. There must be a chair-like object since you are sitting.

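A minimal sketch of the lookup in steps 2--5, assuming that
remembered experience was indexed by quantized proprioceptive data
during free play. Both function names and the binning scheme here
are inventions for this illustration:

#+begin_listing clojure
#+begin_src clojure
;; Hypothetical sketch: `experience-index` is assumed to be a map
;; from binned proprioceptive data (vectors of joint angles) to
;; full sensory frames, built up during free play. The binning
;; resolution is an arbitrary choice.
(defn bin-proprioception
  "Quantize joint angles so that similar poses share a key."
  [joint-angles]
  (mapv #(Math/round (* 10.0 %)) joint-angles))

(defn recall-experience
  "Retrieve the sensory frame remembered for a given pose, if any."
  [experience-index joint-angles]
  (get experience-index (bin-proprioception joint-angles)))
#+end_src
#+end_listing

Under these assumptions, =(sitting? (recall-experience
experience-index aligned-pose))= would complete steps 4 and 5.
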
Empathy offers yet another alternative to the age-old AI
representation question: ``What is a chair?'' --- A chair is the
feeling of sitting.

My program, =EMPATH=, uses this empathic problem solving technique
to interpret the actions of a simple, worm-like creature.

#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

#+caption: This sensory predicate detects when the worm is resting on the
#+caption: ground.
#+name: resting-intro
#+begin_listing clojure
#+begin_src clojure
(defn resting?
  "Is the worm resting on the ground?"
  [experiences]
  (every?
   (fn [touch-data]
     (< 0.9 (contact worm-segment-bottom touch-data)))
   (:touch (peek experiences))))
#+end_src
#+end_listing
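
The =contact= function used here is not defined in this excerpt; it
appears to measure how much of a named skin region is touching
something, returning a value between 0.0 and 1.0. A plausible
reconstruction, inferred only from the way it is called, might be:

#+begin_listing clojure
#+begin_src clojure
;; Plausible reconstruction, inferred from usage only: average the
;; activation of the touch receptors belonging to `region`, so that
;; 1.0 means the whole region is firmly in contact.
(defn contact
  "Average touch activation over the receptor indices in `region`."
  [region touch-data]
  (/ (reduce + (map #(nth touch-data %) region))
     (count region)))
#+end_src
#+end_listing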

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing

#+caption: Even complicated actions such as ``wiggling'' are fairly simple
#+caption: to describe with a rich enough language.
#+name: wiggling-intro
#+begin_listing clojure
#+begin_src clojure
(defn wiggling?
  "Is the worm wiggling?"
  [experiences]
  (let [analysis-interval 0x40]
    (when (> (count experiences) analysis-interval)
      (let [a-flex 3
            a-ex   2
            muscle-activity
            (map :muscle (vector:last-n experiences analysis-interval))
            base-activity
            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
        (= 2
           (first
            (max-indexed
             (map #(Math/abs %)
                  (take 20 (fft base-activity))))))))))
#+end_src
#+end_listing
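
=wiggling?= leans on three helpers that are not shown in this
excerpt: =fft=, =vector:last-n=, and =max-indexed=. The FFT is not
reconstructed here, but plausible implementations of the other two,
inferred from how they are called above, might look like:

#+begin_listing clojure
#+begin_src clojure
;; Plausible reconstructions, inferred from usage only.
(defn vector:last-n
  "Return the last n elements of the vector v."
  [v n]
  (subvec v (- (count v) n)))

(defn max-indexed
  "Return [index value] for the maximum element of s, so that
   `first` of the result is the index of the peak."
  [s]
  (apply max-key second (map-indexed vector s)))
#+end_src
#+end_listing

Read this way, the predicate checks for a dominant peak at index 2
of the FFT of the last 64 (=0x40=) frames of flexor-minus-extensor
muscle activity: wiggling is periodic motion at a particular
frequency.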

#+caption: The actions of a worm in a video can be recognized by
#+caption: combining proprioceptive data with sensory predicates and
#+caption: filling in the missing sensory detail with previous experience.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered, rich viewpoint. This way, if you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach
such as TODO:REFERENCE. If these systems learn about running as viewed
from the side, they will not automatically be able to recognize
running from any other viewpoint.

Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers
a massive boost in descriptive capability. Consider how difficult
it would be to compose a set of HOG filters to describe the action
of a simple worm-creature "curling" so that its head touches its
tail, and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}).

** =CORTEX= is a toolkit for building sensate creatures

Hand integration demo

...

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions
  - Built =CORTEX=, a comprehensive platform for embodied AI
    experiments. =CORTEX= has many new features lacking in other
    systems, such as sound, and makes it easy to model and create
    new creatures.