#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a system
for designing and simulating creatures with rich senses, which you
may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous
corporeal experience, we greatly constrain the possibilities of what
would otherwise be an unwieldy exponential search. This extra
constraint can be the difference between easily understanding what
is happening in a video and being completely lost in a sea of
incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider for example the problem of determining what is happening
in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as ``drinking''. And rightly so -- it is a very
hard problem! What features could you describe in terms of low-level
functions of pixels that even begin to capture, at a high level,
what is happening here?

Or suppose that you are building a program that recognizes chairs.
How could you ``see'' the chair in figure \ref{hidden-chair}?

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any modern computer vision program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference in how
the girl's /muscles/ are working in figure \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 7cm
[[./images/wall-push.png]]

Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chair example shows us that we are strongly triggered by
cues relating to the position of human bodies, and that we can
determine the overall physical configuration of a human body even if
much of that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to
imagined positions triggered by images of the human body.

** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts (a schematic sketch follows the list):

- Free/Guided Play :: The creature moves around and experiences the
     world through its unique perspective. Many otherwise
     complicated actions are easily described in the language of a
     full suite of body-centered, rich senses. For example,
     drinking is the feeling of water sliding down your throat, and
     cooling your insides. It's often accompanied by bringing your
     hand close to your face, or bringing your face close to water.
     Sitting down is the feeling of bending your knees, activating
     your quadriceps, then feeling a surface with your bottom and
     relaxing your legs. These body-centered action descriptions
     can be either learned or hard coded.
- Posture Imitation :: When trying to interpret a video or image,
     the creature takes a model of itself and aligns it with
     whatever it sees. This alignment can even cross species, as
     when humans try to align themselves with things like ponies,
     dogs, or other humans with a different body type.
- Empathy :: The alignment triggers associations with
     sensory data from prior experiences. For example, the
     alignment itself easily maps to proprioceptive data. Any
     sounds or obvious skin contact in the video can to a lesser
     extent trigger previous experience. Segments of previous
     experiences are stitched together to form a coherent and
     complete sensory portrait of the scene.
- Recognition :: With the scene described in terms of first
     person sensory events, the creature can now run its
     action-identification programs on this synthesized sensory
     data, just as it would if it were actually experiencing the
     scene first-hand. If previous experience has been accurately
     retrieved, and if it is analogous enough to the scene, then
     the creature will correctly identify the action in the scene.

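To make the decomposition concrete, here is a minimal sketch of the
four stages as a single Clojure pipeline. Every name in it
(=align-model=, =infer-sensations=, =recognize-action=,
=interpret-frame=) is a hypothetical placeholder invented for this
illustration, not a function from =EMPATH= itself; only the overall
structure mirrors the list above.

#+begin_src clojure
;; Hypothetical sketch of the four-part decomposition. The stubs
;; stand in for the real machinery described later in the thesis.

(defn align-model
  "Posture imitation (stub): align a body model with a video frame.
   Here it simply returns the model's default pose as the alignment."
  [body-model frame]
  {:joint-angles (:default-pose body-model)})

(defn infer-sensations
  "Empathy (stub): use the alignment as a key into prior experience
   gathered during free/guided play, returning an imagined, full
   sensory snapshot."
  [alignment experiences]
  (or (get experiences (:joint-angles alignment))
      {:touch [] :proprioception (:joint-angles alignment)}))

(defn recognize-action
  "Recognition (stub): run embodied action predicates over the
   imagined sensory data."
  [imagined]
  (if (seq (:touch imagined)) :touching-something :resting))

(defn interpret-frame
  "The whole pipeline: imitation, then empathy, then recognition."
  [body-model experiences frame]
  (-> (align-model body-model frame)
      (infer-sensations experiences)
      (recognize-action)))
#+end_src
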
For example, I think humans are able to label the cat video as
``drinking'' because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as drinking.
So, the label of the action is not really in the pixels of the
image, but is found clearly in a simulation inspired by those
pixels. An imaginative system, having been trained on drinking and
non-drinking examples and learning that the most important
component of drinking is the feeling of water sliding down one's
throat, would analyze a video of a cat drinking in the following
manner:

1. Create a physical model of the video by putting a ``fuzzy''
   model of its own body in place of the cat. Possibly also create
   a simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the
   mouth. The imagined sensory experience can come from a
   simulation of the event, but can also be pattern-matched from
   previous, similar embodied experience.

3. The action is now easily identified as drinking by the sense of
   taste alone. The other senses (such as the tongue moving in and
   out) help to give plausibility to the simulated action. Note that
   the sense of vision, while critical in creating the simulation,
   is not critical for identifying the action from the simulation.

For the chair examples, the process is even easier (a sketch of the
=sitting?= predicate follows this list):

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related
   sensory experience associated with that particular proprioceptive
   feeling.

4. Retrieve the feeling of your bottom resting on a surface, your
   knees bent, and your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?=
   sensory predicate, so you (and the entity in the image) must be
   sitting.

6. There must be a chair-like object since you are sitting.

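A minimal sketch of what such a =sitting?= predicate could look
like, in the same style as the worm predicates shown later in this
thesis. The helper functions =bottom-contact?=, =knees-bent?=, and
=legs-relaxed?=, and the keys they read, are names invented for this
illustration; the real predicates in =EMPATH= operate on the worm,
not on a human model.

#+begin_src clojure
;; Illustrative only: an embodied "sitting" predicate over imagined
;; sensory experience, mirroring steps 4 and 5 above.
(defn bottom-contact? [experience]
  (pos? (count (:touch-bottom experience []))))

(defn knees-bent? [experience]
  (every? #(< 1.0 %) (:knee-angles experience [])))

(defn legs-relaxed? [experience]
  (every? #(< % 0.2) (:quadriceps-tension experience [])))

(defn sitting?
  "Does the imagined sensory experience feel like sitting?"
  [experience]
  (and (bottom-contact? experience)
       (knees-bent? experience)
       (legs-relaxed? experience)))
#+end_src
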
Empathy offers yet another alternative to the age-old AI
representation question: ``What is a chair?'' --- A chair is the
feeling of sitting.

My program, =EMPATH=, uses this empathic problem solving technique
to interpret the actions of a simple, worm-like creature.

#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-intro-white.png]]

#+caption: =EMPATH= recognized and classified each of these poses by
#+caption: inferring the complete sensory experience from
#+caption: proprioceptive data.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-poses.png]]

One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered viewpoint. You are less tied to
the particulars of any visual representation of the actions. If you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach.
If these systems learn about running as viewed from the side, they
will not automatically be able to recognize running from any other
viewpoint.

Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers a
massive boost in descriptive capability. Consider how difficult it
would be to compose a set of HOG filters to describe the action of
a simple worm-creature ``curling'' so that its head touches its
tail, and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}):

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing

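The listing relies on helper functions such as =curled?= and
=contact= that are not shown here. To give a feel for what =contact=
measures, here is a plausible sketch, assuming each segment's touch
data is a vector of [proximity activation] pairs and a touch region
is a collection of sensor indices; both the data layout and the
definition are assumptions of this sketch, not the exact code used
by =EMPATH=.

#+begin_src clojure
;; Illustrative sketch of a `contact` helper: the fraction of the
;; touch sensors in `region` that are currently firing.
(defn contact
  "Fraction of the sensors in `region` that report contact, given
   `touch-data` as a vector of [proximity activation] pairs, one per
   sensor (assumed layout for this sketch)."
  [region touch-data]
  (let [firing (filter (fn [i]
                         (let [[_ activation] (nth touch-data i)]
                           (pos? activation)))
                       region)]
    (if (empty? region)
      0.0
      (double (/ (count firing) (count region))))))
#+end_src
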
** =CORTEX= is a toolkit for building sensate creatures

I built =CORTEX= to be a general AI research platform for doing
experiments involving multiple rich senses and a wide variety and
number of creatures. I intend it to be useful as a library for many
more projects than just this one. =CORTEX= addresses a common need
among AI researchers at CSAIL and beyond: people often invent neat
ideas that are best expressed in the language of creatures and
senses, but in order to explore those ideas they must first build a
platform in which they can create simulated creatures with rich
senses! There are many ideas that would be simple to execute (such
as =EMPATH=), but attached to them is the multi-month effort to
make a good creature simulator. Often, that initial investment of
time proves to be too much, and the project must make do with a
lesser environment.

=CORTEX= is well suited as an environment for embodied AI research
for three reasons:

- You can create new creatures using Blender, a popular 3D modeling
  program. Each sense can be specified using special Blender nodes
  with biologically inspired parameters. You need not write any
  code to create a creature, and can use a wide library of
  pre-existing Blender models as a base for your own creatures.

- =CORTEX= implements a wide variety of senses, including touch,
  proprioception, vision, hearing, and muscle tension. Complicated
  senses like touch and vision involve multiple sensory elements
  embedded in a 2D surface. You have complete control over the
  distribution of these sensor elements through the use of simple
  png image files. In particular, =CORTEX= implements more
  comprehensive hearing than any other creature simulation system
  available.

- =CORTEX= supports any number of creatures and any number of
  senses. Time in =CORTEX= dilates so that the simulated creatures
  always perceive a perfectly smooth flow of time, regardless of
  the actual computational load (a minimal sketch of this idea
  follows the list).

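The idea behind time dilation is the familiar fixed-timestep loop:
every tick advances simulated time by the same amount no matter how
long the computation takes in wall-clock time, so slow rendering
only slows the simulation down relative to reality, never relative
to the creatures inside it. The sketch below illustrates the idea
only; the names are invented here and are not part of =CORTEX='s
API.

#+begin_src clojure
;; Minimal sketch of fixed-timestep "time dilation".
(def timestep (/ 1 60))  ; seconds of simulated time per tick

(defn step-world
  "Advance the world by exactly `timestep` seconds of simulated
   time, regardless of how long this call takes in real time."
  [world]
  (-> world
      (update :time + timestep)
      (update :tick inc)))

(defn run-simulation
  "Run `n` ticks. Creatures always experience n * timestep seconds
   of smooth time, even if the loop runs far slower than real time."
  [world n]
  (nth (iterate step-world world) n))

;; Example: 3,000 ticks is always 50 simulated seconds.
(comment
  (run-simulation {:time 0 :tick 0} 3000))
#+end_src
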
=CORTEX= is built on top of =jMonkeyEngine3=, which is a video game
engine designed to create cross-platform 3D desktop games. =CORTEX=
is mainly written in Clojure, a dialect of =LISP= that runs on the
Java Virtual Machine (JVM). The API for creating and simulating
creatures is entirely expressed in Clojure. Hearing is implemented
as a layer of Clojure code on top of a layer of Java code on top of
a layer of =C++= code which implements a modified version of
=OpenAL= to support multiple listeners. =CORTEX= is the only
simulation environment that I know of that can support multiple
entities that can each hear the world from their own perspective.
Other senses also require a small layer of Java code. =CORTEX= also
uses =bullet=, a physics simulator written in =C=.

#+caption: Here is the worm from above modeled in Blender, a free
#+caption: 3D-modeling program. Senses and joints are described
#+caption: using special nodes in Blender.
#+name: blender-worm
#+ATTR_LaTeX: :width 12cm
[[./images/blender-worm.png]]

During one test with =CORTEX=, I created 3,000 entities each with
their own independent senses and ran them all at only 1/80 real
time. In another test, I created a detailed model of my own hand,
equipped with a realistic distribution of touch (more sensitive at
the fingertips), as well as eyes and ears, and it ran at around 1/4
real time.

#+caption: A detailed model of my own hand, created in Blender, with
#+caption: a realistic distribution of touch sensors, as well as eyes
#+caption: and ears.
#+name: full-hand
#+ATTR_LaTeX: :width 15cm
[[./images/full-hand.png]]

** Contributions

* Building =CORTEX=

** To explore embodiment, we need a world, body, and senses

** Because of Time, simulation is preferable to reality

** Video game engines are a great starting point

** Bodies are composed of segments connected by joints

** Eyes reuse standard video game components

** Hearing is hard; =CORTEX= does it right

** Touch uses hundreds of hair-like elements

** Proprioception is the sense that makes everything ``real''

** Muscles are both effectors and sensors

** =CORTEX= brings complex creatures to life!

** =CORTEX= enables many possibilities for further research

* Empathy in a simulated worm

** Embodiment factors action recognition into manageable parts

** Action recognition is easy with a full gamut of senses

** Digression: bootstrapping touch using free exploration

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions
- Built =CORTEX=, a comprehensive platform for embodied AI
  experiments. It has many new features lacking in other systems,
  such as sound, and makes it easy to model and create new
  creatures.
- Created a novel concept for action recognition by using artificial
  imagination.

In the second half of the thesis I develop a computational model of
empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of sensations
that creature is feeling. My empathy algorithm involves multiple
phases. First is free-play, where the creature moves around and gains
sensory experience. From this experience I construct a representation
of the creature's sensory state space, which I call \Phi-space. Using
\Phi-space, I construct an efficient function for enriching the
limited data that comes from observing another creature with a full
complement of imagined sensory data based on previous experience. I
can then use the imagined sensory data to recognize what the observed
creature is doing and feeling, using straightforward embodied action
predicates. This is all demonstrated using a simple worm-like
creature, and recognizing worm-actions based on limited data.

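As a rough sketch of the enrichment step, suppose \Phi-space is
represented as a map from proprioceptive snapshots gathered during
free-play to the full sensory experiences that accompanied them.
Enriching an observation then amounts to looking up the remembered
posture closest to the observed one. Both the representation and
the distance function here are assumptions made for this
illustration, not the exact machinery described later in the thesis.

#+begin_src clojure
;; Hypothetical sketch: enrich observed proprioception with
;; remembered full-sense experience from free-play.
(defn posture-distance
  "Sum of absolute differences between two vectors of joint angles."
  [p q]
  (reduce + (map (fn [a b] (Math/abs (double (- a b)))) p q)))

(defn enrich
  "Given `phi-space`, a map from proprioceptive snapshots (vectors
   of joint angles) to full sensory experiences, return the
   remembered experience whose posture is closest to
   `observed-posture`."
  [phi-space observed-posture]
  (val (apply min-key
              (fn [[posture _]] (posture-distance posture observed-posture))
              phi-space)))
#+end_src
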
Embodied representation using multiple senses such as touch,
proprioception, and muscle tension turns out to be exceedingly
efficient at describing body-centered actions. It is the ``right
language for the job''. For example, it takes only around 5 lines of
LISP code to describe the action of ``curling'' using embodied
primitives. It takes about 8 lines to describe the seemingly
complicated action of wiggling.

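To give a flavor of such a primitive, here is an illustrative
version of a curling predicate. It assumes each experience carries
=:proprioception= as a sequence of joint bend angles in radians;
this is a sketch in the spirit of the embodied language, not
necessarily the exact five-line definition used in the thesis.

#+begin_src clojure
;; Illustrative sketch: the worm is "curled" when every joint in its
;; most recent experience is bent past a right angle.
(defn curled?
  "Is the worm bent sharply at every joint in its latest experience?"
  [experiences]
  (every? (fn [angle] (< (/ Math/PI 2) (Math/abs (double angle))))
          (:proprioception (peek experiences))))
#+end_src
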
* COMMENT names for cortex
- bioland

# An anatomical joke:
# - Training
# - Skeletal imitation
# - Sensory fleshing-out
# - Classification