comparison thesis/cortex.org @ 441:c20de2267d39

completing first third of first chapter.
author Robert McIntyre <rlm@mit.edu>
date Mon, 24 Mar 2014 20:59:35 -0400
parents b01c070b03d4
children eaf8c591372b
440:b01c070b03d4 441:c20de2267d39
8 * Empathy and Embodiment as problem solving strategies 8 * Empathy and Embodiment as problem solving strategies
9 9
10 By the end of this thesis, you will have seen a novel approach to 10 By the end of this thesis, you will have seen a novel approach to
11 interpreting video using embodiment and empathy. You will have also 11 interpreting video using embodiment and empathy. You will have also
12 seen one way to efficiently implement empathy for embodied 12 seen one way to efficiently implement empathy for embodied
13 creatures. 13 creatures. Finally, you will become familiar with =CORTEX=, a
14 14 system for designing and simulating creatures with rich senses,
15 The core vision of this thesis is that one of the important ways in 15 which you may choose to use in your own research.
16 which we understand others is by imagining ourselves in their 16
17 posistion and empathicaly feeling experiences based on our own past 17 This is the core vision of my thesis: That one of the important ways
18 experiences and imagination. 18 in which we understand others is by imagining ourselves in their
19 19 position and emphatically feeling experiences relative to our own
20 By understanding events in terms of our own previous corperal 20 bodies. By understanding events in terms of our own previous
21 experience, we greatly constrain the possibilities of what would 21 corporeal experience, we greatly constrain the possibilities of what
22 otherwise be an unweidly exponential search. This extra constraint 22 would otherwise be an unwieldy exponential search. This extra
23 can be the difference between easily understanding what is happening 23 constraint can be the difference between easily understanding what
24 in a video and being completely lost in a sea of incomprehensible 24 is happening in a video and being completely lost in a sea of
25 color and movement. 25 incomprehensible color and movement.
26 26
27 ** Recognizing actions in video is extremely difficult 27 ** Recognizing actions in video is extremely difficult
28 28
29 Consider for example the problem of determining what is happening in 29 Consider for example the problem of determining what is happening in
30 a video of which this is one frame: 30 a video of which this is one frame:
31 31
32 #+caption: A cat drinking some water. Identifying this action is 32 #+caption: A cat drinking some water. Identifying this action is
33 #+caption: beyond the state of the art for computers. 33 #+caption: beyond the state of the art for computers.
34 #+ATTR_LaTeX: :width 7cm 34 #+ATTR_LaTeX: :width 7cm
35 [[./images/cat-drinking.jpg]] 35 [[./images/cat-drinking.jpg]]
36 36
37 It is currently impossible for any computer program to reliably 37 It is currently impossible for any computer program to reliably
38 label such a video as "drinking". And rightly so -- it is a very 38 label such a video as "drinking". And rightly so -- it is a very
39 hard problem! What features can you describe in terms of low level 39 hard problem! What features can you describe in terms of low level
40 functions of pixels that can even begin to describe what is 40 functions of pixels that can even begin to describe at a high level
41 happening here? 41 what is happening here?
42 42
43 Or suppose that you are building a program that recognizes 43 Or suppose that you are building a program that recognizes
44 chairs. How could you ``see'' the chair in the following pictures? 44 chairs. How could you ``see'' the chair in figure
45 45 \ref{invisible-chair} and figure \ref{hidden-chair}?
46 #+caption: When you look at this, do you think ``chair''? I certainly do. 46
47 #+ATTR_LaTeX: :width 10cm 47 #+caption: When you look at this, do you think ``chair''? I certainly do.
48 [[./images/invisible-chair.png]] 48 #+name: invisible-chair
49 49 #+ATTR_LaTeX: :width 10cm
50 #+caption: The chair in this image is quite obvious to humans, but I 50 [[./images/invisible-chair.png]]
51 #+caption: doubt that any computer program can find it. 51
52 #+ATTR_LaTeX: :width 10cm 52 #+caption: The chair in this image is quite obvious to humans, but I
53 [[./images/fat-person-sitting-at-desk.jpg]] 53 #+caption: doubt that any computer program can find it.
54 54 #+name: hidden-chair
55 Finally, how is it that you can easily tell the difference between 55 #+ATTR_LaTeX: :width 10cm
56 how the girl's /muscles/ are working in \ref{girl}? 59 how the girl's /muscles/ are working in figure \ref{girl}?
57 57
58 #+caption: The mysterious ``common sense'' appears here as you are able 58 Finally, how is it that you can easily tell the difference between
59 #+caption: to ``see'' the difference in how the girl's arm muscles 59 how the girls /muscles/ are working in figure \ref{girl}?
60 #+caption: are activated differently in the two images. 60
61 #+name: girl 61 #+caption: The mysterious ``common sense'' appears here as you are able
62 #+ATTR_LaTeX: :width 10cm 62 #+caption: to discern the difference in how the girl's arm muscles
63 [[./images/wall-push.png]] 63 #+caption: are activated between the two images.
64 64 #+name: girl
65 65 #+ATTR_LaTeX: :width 10cm
66 These problems are difficult because the language of pixels is far 66 [[./images/wall-push.png]]
67 removed from what we would consider to be an acceptable description 67
68 of the events in these images. In order to process them, we must 68 Each of these examples tells us something about what might be going
69 raise the images into some higher level of abstraction where their 69 on in our minds as we easily solve these recognition problems.
70 descriptions become more similar to how we would describe them in 70
71 English. The question is, how can we raise 71 The hidden chairs show us that we are strongly triggered by cues
72 72 relating to the position of human bodies, and that we can
73 73 determine the overall physical configuration of a human body even
74 I think humans are able to label such video as "drinking" because 74 if much of that body is occluded.
75 they imagine /themselves/ as the cat, and imagine putting their face 75
76 up against a stream of water and sticking out their tongue. In that 76 The picture of the girl pushing against the wall tells us that we
77 imagined world, they can feel the cool water hitting their tongue, 77 have common sense knowledge about the kinetics of our own bodies.
78 and feel the water entering their body, and are able to recognize 78 We know well how our muscles would have to work to maintain us in
79 that /feeling/ as drinking. So, the label of the action is not 79 most positions, and we can easily project this self-knowledge to
80 really in the pixels of the image, but is found clearly in a 80 imagined positions triggered by images of the human body.
81 simulation inspired by those pixels. An imaginative system, having 81
82 been trained on drinking and non-drinking examples and learning that 82 ** =EMPATH= neatly solves recognition problems
83 the most important component of drinking is the feeling of water 83
84 sliding down one's throat, would analyze a video of a cat drinking 84 I propose a system that can express the types of recognition
85 in the following manner: 85 problems above in a form amenable to computation. It is split into
86 86 four parts (sketched in code after the list):
87 - Create a physical model of the video by putting a "fuzzy" model 87
88 of its own body in place of the cat. Also, create a simulation of 88 - Free/Guided Play :: The creature moves around and experiences the
89 the stream of water. 89 world through its unique perspective. Many otherwise
90 90 complicated actions are easily described in the language of a
91 - Play out this simulated scene and generate imagined sensory 91 full suite of body-centered, rich senses. For example,
92 experience. This will include relevant muscle contractions, a 92 drinking is the feeling of water sliding down your throat, and
93 close up view of the stream from the cat's perspective, and most 93 cooling your insides. It's often accompanied by bringing your
94 importantly, the imagined feeling of water entering the mouth. 94 hand close to your face, or bringing your face close to
95 95 water. Sitting down is the feeling of bending your knees,
96 - The action is now easily identified as drinking by the sense of 96 activating your quadriceps, then feeling a surface with your
97 taste alone. The other senses (such as the tongue moving in and 97 bottom and relaxing your legs. These body-centered action
98 out) help to give plausibility to the simulated action. Note that 98 descriptions can be either learned or hard coded.
99 the sense of vision, while critical in creating the simulation, 99 - Alignment :: When trying to interpret a video or image, the
100 is not critical for identifying the action from the simulation. 100 creature takes a model of itself and aligns it with
101 101 whatever it sees. This can be a rather loose
102 cat drinking, mimes, leaning, common sense 102 alignment that can cross species, as when humans try
103 103 to align themselves with things like ponies, dogs,
104 ** =EMPATH= neatly solves recognition problems 104 or other humans with a different body type.
105 105 - Empathy :: The alignment triggers the memories of previous
106 factorization , right language, etc 106 experience. For example, the alignment itself easily
107 107 maps to proprioceptive data. Any sounds or obvious
108 a new possibility for the question ``what is a chair?'' -- it's the 108 skin contact in the video can, to a lesser extent,
109 feeling of your butt on something and your knees bent, with your 109 trigger previous experience. The creature's previous
110 back muscles and legs relaxed. 110 experience is chained together in short bursts to
111 coherently describe the new scene.
112 - Recognition :: With the scene now described in terms of past
113 experience, the creature can now run its
114 action-identification programs on this synthesized
115 sensory data, just as it would if it were actually
116 experiencing the scene first-hand. If previous
117 experience has been accurately retrieved, and if
118 it is analogous enough to the scene, then the
119 creature will correctly identify the action in the
120 scene.
121
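A minimal sketch of how these four stages might fit together is
given below. None of the names come from =EMPATH= itself: the
aligner, the empathizer, and the action predicates are assumed to be
supplied from elsewhere, and the Free/Guided Play stage is assumed
to have already produced the past experience that the empathizer
draws on. Only the shape of the composition is the point.

#+caption: A hypothetical sketch of the empathic pipeline described
#+caption: above; the aligner, empathizer, and predicates are assumed.
#+begin_listing clojure
#+begin_src clojure
(defn empathic-recognizer
  "Build a recognizer from an aligner (scene -> imagined body
   alignment), an empathizer (alignment -> synthesized experiences,
   chained from past experience), and a set of body-centered action
   predicates such as resting? or grand-circle?."
  [align empathize action-predicates]
  (fn [scene]
    (let [alignment   (align scene)          ; Alignment stage
          experiences (empathize alignment)] ; Empathy stage
      ;; Recognition stage: run the same predicates the creature
      ;; would use on first-hand experience.
      (set (filter (fn [action?] (action? experiences))
                   action-predicates)))))
#+end_src
#+end_listing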
122
123 For example, I think humans are able to label the cat video as
124 "drinking" because they imagine /themselves/ as the cat, and
125 imagine putting their face up against a stream of water and
126 sticking out their tongue. In that imagined world, they can feel
127 the cool water hitting their tongue, and feel the water entering
128 their body, and are able to recognize that /feeling/ as
129 drinking. So, the label of the action is not really in the pixels
130 of the image, but is found clearly in a simulation inspired by
131 those pixels. An imaginative system, having been trained on
132 drinking and non-drinking examples and learning that the most
133 important component of drinking is the feeling of water sliding
134 down one's throat, would analyze a video of a cat drinking in the
135 following manner:
136
137 1. Create a physical model of the video by putting a "fuzzy" model
138 of its own body in place of the cat. Possibly also create a
139 simulation of the stream of water.
140
141 2. Play out this simulated scene and generate imagined sensory
142 experience. This will include relevant muscle contractions, a
143 close up view of the stream from the cat's perspective, and most
144 importantly, the imagined feeling of water entering the
145 mouth. The imagined sensory experience can come both from a
146 simulation of the event and from pattern-matching against
147 previous, similar embodied experience.
148
149 3. The action is now easily identified as drinking by the sense of
150 taste alone. The other senses (such as the tongue moving in and
151 out) help to give plausibility to the simulated action. Note that
152 the sense of vision, while critical in creating the simulation,
153 is not critical for identifying the action from the simulation.
154
155 For the chair examples, the process is even easier (sketched below):
156
157 1. Align a model of your body to the person in the image.
158
159 2. Generate proprioceptive sensory data from this alignment.
160
161 3. Use the imagined proprioceptive data as a key to look up related
162 sensory experience associated with that particular proprioceptive
163 feeling.
164
165 4. Retrieve the feeling of your bottom resting on a surface and
166 your leg muscles relaxed.
167
168 5. This sensory information is consistent with the =sitting?=
169 sensory predicate, so you (and the entity in the image) must be
170 sitting.
171
172 6. There must be a chair-like object since you are sitting.
173
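As a purely illustrative rendering of steps 2--6, the chair
inference is little more than a composition of three body-centered
pieces. The helper names below (=proprioception-of=,
=lookup-experience=, =sitting?=) are hypothetical placeholders, not
part of =EMPATH=.

#+caption: A hypothetical sketch of the chair inference: imagined
#+caption: proprioception keys into past experience, and a sitting?
#+caption: predicate does the rest.
#+begin_listing clojure
#+begin_src clojure
(defn chair-detector
  "Build a chair detector from three assumed pieces:
     proprioception-of  -- alignment -> imagined proprioceptive data
     lookup-experience  -- proprioceptive key -> related experience
     sitting?           -- body-centered sensory predicate
   If the retrieved experience satisfies sitting?, there must be a
   chair-like object in the scene."
  [proprioception-of lookup-experience sitting?]
  (fn [alignment]
    (-> alignment
        proprioception-of   ; step 2: imagined proprioceptive data
        lookup-experience   ; steps 3-4: retrieve related experience
        sitting?)))         ; steps 5-6: sitting implies a chair
#+end_src
#+end_listing
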
174 Empathy offers yet another alternative to the age-old AI
175 representation question: ``What is a chair?'' --- A chair is the
176 feeling of sitting.
177
178 My program, =EMPATH=, uses this empathic problem-solving technique
179 to interpret the actions of a simple, worm-like creature.
180
181 #+caption: The worm performs many actions during free play such as
182 #+caption: curling, wiggling, and resting.
183 #+name: worm-intro
184 #+ATTR_LaTeX: :width 10cm
185 [[./images/wall-push.png]]
186
187 #+caption: This sensory predicate detects when the worm is resting on the
188 #+caption: ground.
189 #+name: resting-intro
190 #+begin_listing clojure
191 #+begin_src clojure
192 (defn resting?
193 "Is the worm resting on the ground?"
194 [experiences]
195 (every?
196 (fn [touch-data]
197 (< 0.9 (contact worm-segment-bottom touch-data)))
198 (:touch (peek experiences))))
199 #+end_src
200 #+end_listing
201
202 #+caption: Body-centered actions are best expressed in a body-centered
203 #+caption: language. This code detects when the worm has curled into a
204 #+caption: full circle. Imagine how you would replicate this functionality
205 #+caption: using low-level pixel features such as HOG filters!
206 #+name: grand-circle-intro
207 #+begin_listing clojure
208 #+begin_src clojure
209 (defn grand-circle?
210 "Does the worm form a majestic circle (one end touching the other)?"
211 [experiences]
212 (and (curled? experiences)
213 (let [worm-touch (:touch (peek experiences))
214 tail-touch (worm-touch 0)
215 head-touch (worm-touch 4)]
216 (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
217 (< 0.55 (contact worm-segment-top-tip head-touch))))))
218 #+end_src
219 #+end_listing
220
221 #+caption: Even complicated actions such as ``wiggling'' are fairly simple
222 #+caption: to describe with a rich enough language.
223 #+name: wiggling-intro
224 #+begin_listing clojure
225 #+begin_src clojure
226 (defn wiggling?
227 "Is the worm wiggling?"
228 [experiences]
229 (let [analysis-interval 0x40]
230 (when (> (count experiences) analysis-interval)
231 (let [a-flex 3
232 a-ex 2
233 muscle-activity
234 (map :muscle (vector:last-n experiences analysis-interval))
235 base-activity
236 (map #(- (% a-flex) (% a-ex)) muscle-activity)]
237 (= 2
238 (first
239 (max-indexed
240 (map #(Math/abs %)
241 (take 20 (fft base-activity))))))))))
242 #+end_src
243 #+end_listing
244
245 #+caption: The actions of a worm in a video can be recognized from
246 #+caption: proprioceptive data and sensory predicates by filling
247 #+caption: in the missing sensory detail with previous experience.
248 #+name: worm-recognition-intro
249 #+ATTR_LaTeX: :width 10cm
250 [[./images/wall-push.png]]
251
252
253
254 One powerful advantage of empathic problem solving is that it
255 factors the action recognition problem into two easier problems. To
256 use empathy, you need an /aligner/, which takes the video and a
257 model of your body, and aligns the model with the video. Then, you
258 need a /recognizer/, which uses the aligned model to interpret the
259 action. The power in this method lies in the fact that you describe
260 all actions from a body-centered, rich viewpoint. This way, if you
261 teach the system what ``running'' is, and you have a good enough
262 aligner, the system will from then on be able to recognize running
263 from any point of view, even strange points of view like above or
264 underneath the runner. This is in contrast to action recognition
265 schemes that try to identify actions using a non-embodied approach
266 such as TODO:REFERENCE. If these systems learn about running as viewed
267 from the side, they will not automatically be able to recognize
268 running from any other viewpoint.
269
270 Another powerful advantage is that using the language of multiple
271 body-centered rich senses to describe body-centered actions offers a
272 massive boost in descriptive capability. Consider how difficult it
273 would be to compose a set of HOG filters to describe the action of
274 a simple worm-creature "curling" so that its head touches its tail,
275 and then behold the simplicity of describing this action in a
276 language designed for the task (listing \ref{grand-circle-intro}).
277
111 278
112 ** =CORTEX= is a toolkit for building sensate creatures 279 ** =CORTEX= is a toolkit for building sensate creatures
113 280
114 Hand integration demo 281 Hand integration demo
115 282
149 316
150 ** \Phi-space describes the worm's experiences 317 ** \Phi-space describes the worm's experiences
151 318
152 ** Empathy is the process of tracing through \Phi-space 319 ** Empathy is the process of tracing through \Phi-space
153 320
154 ** Efficient action recognition =EMPATH= 321 ** Efficient action recognition with =EMPATH=
155 322
156 * Contributions 323 * Contributions
157 - Built =CORTEX=, a comprehensive platform for embodied AI 324 - Built =CORTEX=, a comprehensive platform for embodied AI
158 experiments. Has many new features lacking in other systems, such 325 experiments. Has many new features lacking in other systems, such
159 as sound. Easy to model/create new creatures. 326 as sound. Easy to model/create new creatures.