#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

* Empathy and Embodiment as problem solving strategies
By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a system
for designing and simulating creatures with rich senses, which you
may choose to use in your own research.
This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous
corporeal experience, we greatly constrain the possibilities of what
would otherwise be an unwieldy exponential search. This extra
constraint can be the difference between easily understanding what
is happening in a video and being completely lost in a sea of
incomprehensible color and movement.
** Recognizing actions in video is extremely difficult

Consider, for example, the problem of determining what is happening
in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]
It is currently impossible for any computer program to reliably
label such a video as ``drinking''. And rightly so -- it is a very
hard problem! What features, expressed in terms of low-level
functions of pixels, can even begin to describe at a high level
what is happening here?

Or suppose that you are building a program that recognizes chairs.
How could you ``see'' the chair in figure \ref{invisible-chair} and
figure \ref{hidden-chair}?
#+caption: When you look at this, do you think ``chair''? I certainly do.
#+name: invisible-chair
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any computer program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]
Finally, how is it that you can easily tell the difference between
how the girl's /muscles/ are working in the two images of figure
\ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]
Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chairs show us that we are strongly triggered by cues
relating to the position of human bodies, and that we can determine
the overall physical configuration of a human body even if much of
that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to
imagined positions triggered by images of the human body.
** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts (a rough sketch of how the parts fit together follows
the list):

- Free/Guided Play (Training) :: The creature moves around and
     experiences the world through its unique perspective. Many
     otherwise complicated actions are easily described in the
     language of a full suite of body-centered, rich senses. For
     example, drinking is the feeling of water sliding down your
     throat, and cooling your insides. It's often accompanied by
     bringing your hand close to your face, or bringing your face
     close to water. Sitting down is the feeling of bending your
     knees, activating your quadriceps, then feeling a surface with
     your bottom and relaxing your legs. These body-centered action
     descriptions can be either learned or hard-coded.
- Alignment (Posture imitation) :: When trying to interpret a video
     or image, the creature takes a model of itself and aligns it
     with whatever it sees. This alignment can even cross species,
     as when humans try to align themselves with things like
     ponies, dogs, or other humans with a different body type.
- Empathy (Sensory extrapolation) :: The alignment triggers
     associations with sensory data from prior experiences. For
     example, the alignment itself easily maps to proprioceptive
     data. Any sounds or obvious skin contact in the video can, to
     a lesser extent, trigger previous experience. Segments of
     previous experiences are stitched together to form a coherent
     and complete sensory portrait of the scene.
- Recognition (Classification) :: With the scene described in terms
     of first-person sensory events, the creature can now run its
     action-identification programs on this synthesized sensory
     data, just as it would if it were actually experiencing the
     scene first-hand. If previous experience has been accurately
     retrieved, and if it is analogous enough to the scene, then
     the creature will correctly identify the action in the scene.
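Put as code, the pipeline is simple to state. The following is only a
rough sketch of how the four parts compose: the names
=align-to-video=, =infer-sensations=, and =classify-action=, and the
=phi-space= store of prior experience, are hypothetical placeholders
for machinery developed later in this thesis, not literal =EMPATH=
functions.

#+begin_src clojure
;; Sketch only -- all four helpers named here are placeholders.
(defn interpret-video
  "Guess which action the creature in `video` is performing, given a
   body `model` of the observer and its prior experience `phi-space`."
  [video model phi-space]
  (let [;; Alignment: fit the body model to each frame, yielding a
        ;; sequence of imagined postures (proprioceptive data).
        postures   (align-to-video model video)
        ;; Empathy: enrich each imagined posture with the full sensory
        ;; data remembered from similar postures during free play.
        sensations (map #(infer-sensations phi-space %) postures)]
    ;; Recognition: run ordinary first-person action predicates over
    ;; the imagined experience, as if it were actually being felt.
    (classify-action sensations)))
#+end_src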
For example, I think humans are able to label the cat video as
``drinking'' because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as drinking.
So, the label of the action is not really in the pixels of the
image, but is found clearly in a simulation inspired by those
pixels. An imaginative system, having been trained on drinking and
non-drinking examples and learning that the most important
component of drinking is the feeling of water sliding down one's
throat, would analyze a video of a cat drinking in the following
manner:
1. Create a physical model of the video by putting a ``fuzzy''
   model of its own body in place of the cat. Possibly also create
   a simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close-up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the
   mouth. The imagined sensory experience can come from a
   simulation of the event, but can also be pattern-matched from
   previous, similar embodied experience.

3. The action is now easily identified as drinking by the sense of
   taste alone. The other senses (such as the tongue moving in and
   out) help to give plausibility to the simulated action. Note that
   the sense of vision, while critical in creating the simulation,
   is not critical for identifying the action from the simulation.
   (A toy version of such a taste-based test is sketched after this
   list.)
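To make the last step concrete, a first-person recognition test over
imagined sensation could look like the following. This is purely
illustrative: no cat, stream, or =:taste= sense is ever implemented in
this thesis, and the map layout shown is assumed only for the sake of
the example.

#+begin_src clojure
;; Illustrative only.  Assumes each imagined experience is a map of
;; senses, with taste recorded as {:water <amount>, ...}.
(defn drinking?
  "Does the most recent imagined experience taste like drinking?"
  [experiences]
  ;; The action is identified by the sense of taste alone -- vision
  ;; was only needed to set up the imagined scene in the first place.
  (pos? (:water (:taste (peek experiences)) 0)))
#+end_src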
For the chair examples, the process is even easier:

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related
   sensory experience associated with that particular proprioceptive
   feeling (this lookup is sketched after the list).

4. Retrieve the feeling of your bottom resting on a surface, your
   knees bent, and your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?=
   sensory predicate, so you (and the entity in the image) must be
   sitting.

6. There must be a chair-like object since you are sitting.
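The lookup in step 3 can be pictured as nearest-neighbor retrieval
keyed on proprioception. The sketch below assumes =phi-space= is a
collection of experience maps recorded during play, each with a
=:proprioception= entry, and defines a deliberately crude
=proprioception-distance= (sum of absolute joint-angle differences)
purely as a placeholder for whatever similarity measure is actually
adopted later in the thesis.

#+begin_src clojure
;; Sketch only -- the distance measure is a placeholder.
(defn proprioception-distance
  "Sum of absolute differences between two joint-angle sequences;
   smaller means the postures are more alike."
  [posture-a posture-b]
  (reduce + (map (fn [a b] (Math/abs (double (- a b))))
                 (flatten posture-a) (flatten posture-b))))

(defn infer-sensations
  "Return the recorded experience whose posture most closely matches
   the posture imagined from the image."
  [phi-space imagined-posture]
  (apply min-key
         #(proprioception-distance imagined-posture (:proprioception %))
         phi-space))
#+end_src

Retrieval by posture is what makes steps 4--6 possible: the returned
experience already carries the touch and muscle sensations that the
=sitting?= predicate needs to test.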
Empathy offers yet another alternative to the age-old AI
representation question: ``What is a chair?'' --- A chair is the
feeling of sitting.

My program, =EMPATH=, uses this empathic problem solving technique
to interpret the actions of a simple, worm-like creature.
#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-intro-white.png]]

#+caption: =EMPATH= recognized and classified each of these poses by
#+caption: inferring the complete sensory experience from
#+caption: proprioceptive data.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-poses.png]]
One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered viewpoint. You are less tied to
the particulars of any visual representation of the actions. If you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach
such as TODO:REFERENCE. If these systems learn about running as
viewed from the side, they will not automatically be able to
recognize running from any other viewpoint.
Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers a
massive boost in descriptive capability. Consider how difficult it
would be to compose a set of HOG filters to describe the action of
a simple worm-creature ``curling'' so that its head touches its
tail, and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}):

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       ;; check the most recent touch readings at the two end segments
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         ;; both the tail's bottom tip and the head's top tip must
         ;; report significant contact for the circle to be closed
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing
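Listing \ref{grand-circle-intro} leans on two helpers, =curled?= and
=contact=, whose real definitions appear later in the worm chapter.
As a rough indication of their flavor only -- the bend threshold, the
simple averaging, and the assumed data shapes below are placeholder
choices for this sketch, not the actual code -- each amounts to a few
lines:

#+begin_src clojure
;; Placeholder sketches; the real definitions come later.
(defn curled?
  "Is the worm bent around on itself?  Looks only at proprioception:
   every joint in the most recent experience must be strongly flexed."
  [experiences]
  (every? (fn [[_pitch _yaw bend]] (> (Math/sin bend) 0.64))
          (:proprioception (peek experiences))))

(defn contact
  "Average touch activation over one region of a segment's touch
   receptors, as a number between 0 and 1."
  [region touch-data]
  (/ (reduce + (map touch-data region))
     (count region)))
#+end_src

Note that neither helper mentions pixels: both are written entirely
in terms of what the worm's own body feels.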
** =CORTEX= is a toolkit for building sensate creatures

Hand integration demo

** Contributions

* Building =CORTEX=

** To explore embodiment, we need a world, body, and senses

** Because of Time, simulation is preferable to reality

** Video game engines are a great starting point

** Bodies are composed of segments connected by joints

** Eyes reuse standard video game components

** Hearing is hard; =CORTEX= does it right

** Touch uses hundreds of hair-like elements

** Proprioception is the sense that makes everything ``real''

** Muscles are both effectors and sensors

** =CORTEX= brings complex creatures to life!

** =CORTEX= enables many possibilities for further research

* Empathy in a simulated worm

** Embodiment factors action recognition into manageable parts

** Action recognition is easy with a full gamut of senses

** Digression: bootstrapping touch using free exploration

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=
* Contributions

- Built =CORTEX=, a comprehensive platform for embodied AI
  experiments. =CORTEX= has many new features lacking in other
  systems, such as sound, and makes it easy to model and create new
  creatures.
- Created a novel concept for action recognition by using artificial
  imagination.
In the second half of the thesis I develop a computational model of
empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of sensations
that creature is feeling. My empathy algorithm involves multiple
phases. First is free-play, where the creature moves around and gains
sensory experience. From this experience I construct a representation
of the creature's sensory state space, which I call \Phi-space. Using
\Phi-space, I construct an efficient function for enriching the
limited data that comes from observing another creature with a full
complement of imagined sensory data based on previous experience. I
can then use the imagined sensory data to recognize what the observed
creature is doing and feeling, using straightforward embodied action
predicates. This is all demonstrated using a simple worm-like
creature, and recognizing worm-actions based on limited data.
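Concretely, \Phi-space can be pictured as nothing more exotic than
the time-ordered log of everything the creature felt during free
play. The record shape below is only an illustration using the worm's
senses; the field contents are made up, and the exact format used
later in the thesis may differ.

#+begin_src clojure
;; Illustrative shape of a tiny Phi-space: one map of sensations per
;; simulation step of free play.  All numbers here are placeholders.
(def example-phi-space
  [{:proprioception [[0.1 0.0 0.6] [0.2 0.0 0.7]] ; per-joint [pitch yaw bend]
    :touch          [[0.0 0.2 0.9] [0.0 0.0 0.0]] ; per-segment activations
    :muscle         [12 0 3]}])                   ; per-muscle effort

(defn record-experience
  "Return Phi-space extended with this step's sensations."
  [phi-space sensations]
  (conj phi-space sensations))
#+end_src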
Embodied representation using multiple senses such as touch,
proprioception, and muscle tension turns out to be exceedingly
efficient at describing body-centered actions. It is the ``right
language for the job''. For example, it takes only around 5 lines of
LISP code to describe the action of ``curling'' using embodied
primitives. It takes about 8 lines to describe the seemingly
complicated action of wiggling.
* COMMENT names for cortex
- bioland

# An anatomical joke:
# - Training
# - Skeletal imitation
# - Sensory fleshing-out
# - Classification