#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a
system for designing and simulating creatures with rich senses,
which you may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous
corporeal experience, we greatly constrain the possibilities of what
would otherwise be an unwieldy exponential search. This extra
constraint can be the difference between easily understanding what
is happening in a video and being completely lost in a sea of
incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider, for example, the problem of determining what is happening
in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". And rightly so -- it is a very
hard problem! What features can you describe in terms of low-level
functions of pixels that can even begin to describe at a high level
what is happening here?

Or suppose that you are building a program that recognizes
chairs. How could you ``see'' the chair in figure
\ref{invisible-chair} and figure \ref{hidden-chair}?
#+caption: When you look at this, do you think ``chair''? I certainly do.
#+name: invisible-chair
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any computer program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference between
how the girl's /muscles/ are working in figure \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chairs show us that we are strongly triggered by cues
relating to the position of human bodies, and that we can
determine the overall physical configuration of a human body even
if much of that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to
imagined positions triggered by images of the human body.

** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts (a sketch of how these phases might fit together in code
follows the list):

- Free/Guided Play :: The creature moves around and experiences the
     world through its unique perspective. Many otherwise
     complicated actions are easily described in the language of a
     full suite of body-centered, rich senses. For example,
     drinking is the feeling of water sliding down your throat and
     cooling your insides. It's often accompanied by bringing your
     hand close to your face, or bringing your face close to
     water. Sitting down is the feeling of bending your knees,
     activating your quadriceps, then feeling a surface with your
     bottom and relaxing your legs. These body-centered action
     descriptions can be either learned or hard coded.
- Alignment :: When trying to interpret a video or image, the
     creature takes a model of itself and aligns it with
     whatever it sees. This can be a rather loose
     alignment that can cross species, as when humans try
     to align themselves with things like ponies, dogs,
     or other humans with a different body type.
- Empathy :: The alignment triggers memories of previous
     experience. For example, the alignment itself easily
     maps to proprioceptive data. Any sounds or obvious
     skin contact in the video can to a lesser extent
     trigger previous experience. The creature's previous
     experience is chained together in short bursts to
     coherently describe the new scene.
- Recognition :: With the scene now described in terms of past
     experience, the creature can now run its
     action-identification programs on this synthesized
     sensory data, just as it would if it were actually
     experiencing the scene first-hand. If previous
     experience has been accurately retrieved, and if
     it is analogous enough to the scene, then the
     creature will correctly identify the action in the
     scene.
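What follows is only a schematic sketch of how these four phases
might compose, not the actual =EMPATH= code: the names
=align-model=, =infer-sensations=, =drinking?=, and =sitting?= are
hypothetical placeholders standing in for the aligner, the empathy
step, and the body-centered action predicates.

#+begin_src clojure
;; Hypothetical sketch of the four-phase pipeline described above.
;; The declared helpers are placeholder names, not the real EMPATH API.
(declare align-model infer-sensations drinking? sitting?)

(defn interpret-scene
  "Guess the action in `video` by aligning a body model to it
   (Alignment), filling in imagined sensations from past experience
   (Empathy), and running ordinary body-centered predicates on the
   result (Recognition). `experiences` comes from Free/Guided Play."
  [body-model experiences video]
  (let [alignment (align-model body-model video)
        imagined  (infer-sensations experiences alignment)]
    ;; Return the first action label whose predicate accepts the
    ;; imagined sensory data.
    (some (fn [[label predicate?]]
            (when (predicate? imagined) label))
          {:drinking drinking?
           :sitting  sitting?})))
#+end_src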
For example, I think humans are able to label the cat video as
"drinking" because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and learning that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

1. Create a physical model of the video by putting a "fuzzy" model
   of its own body in place of the cat. Possibly also create a
   simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close-up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the
   mouth. The imagined sensory experience can come from a
   simulation of the event, but can also be pattern-matched from
   previous, similar embodied experience.

3. The action is now easily identified as drinking by the sense of
   taste alone. The other senses (such as the tongue moving in and
   out) help to give plausibility to the simulated action. Note that
   the sense of vision, while critical in creating the simulation,
   is not critical for identifying the action from the simulation.

For the chair examples, the process is even easier:

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related
   sensory experience associated with that particular proprioceptive
   feeling.

4. Retrieve the feeling of your bottom resting on a surface, your
   knees bent, and your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?=
   sensory predicate (a sketch of such a predicate follows this
   list), so you (and the entity in the image) must be sitting.

6. There must be a chair-like object since you are sitting.
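To make the role of the =sitting?= predicate concrete, here is a
minimal sketch of what such a predicate might look like when written
against imagined sensory data. The helpers =joint-angle=,
=activation=, and =contact=, the region =pelvis-underside=, and the
thresholds are all invented for illustration; they are not the
actual =EMPATH= definitions.

#+begin_src clojure
;; Illustrative only: joint names, helpers, and thresholds are
;; invented here, not taken from EMPATH.
(declare joint-angle activation contact pelvis-underside)

(defn sitting?
  "True when the imagined experience describes bent knees, relaxed
   leg muscles, and pressure on the underside of the pelvis."
  [experiences]
  (let [{:keys [proprioception muscle touch]} (peek experiences)]
    (and (< 70 (joint-angle proprioception :knee) 110)
         (< (activation muscle :quadriceps) 0.2)
         (< 0.5 (contact pelvis-underside touch)))))
#+end_src

The important point is the shape of the definition: a small
conjunction over body-centered quantities, with no reference to
pixels at all.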
Empathy offers yet another alternative to the age-old AI
representation question: ``What is a chair?'' --- A chair is the
feeling of sitting.

My program, =EMPATH=, uses this empathic problem solving technique
to interpret the actions of a simple, worm-like creature.

#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-intro-white.png]]

#+caption: The actions of a worm in a video can be recognized by
#+caption: proprioceptive data and sensory predicates by filling
#+caption: in the missing sensory detail with previous experience.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-poses.png]]

One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered, rich viewpoint. This way, if you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach
such as TODO:REFERENCE. If these systems learn about running as
viewed from the side, they will not automatically be able to
recognize running from any other viewpoint.

Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers a
massive boost in descriptive capability. Consider how difficult it
would be to compose a set of HOG filters to describe the action of
a simple worm-creature "curling" so that its head touches its tail,
and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}):

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing

** =CORTEX= is a toolkit for building sensate creatures

Hand integration demo

** Contributions

* Building =CORTEX=

** To explore embodiment, we need a world, body, and senses

** Because of Time, simulation is preferable to reality

** Video game engines are a great starting point

** Bodies are composed of segments connected by joints

** Eyes reuse standard video game components

** Hearing is hard; =CORTEX= does it right

** Touch uses hundreds of hair-like elements

** Proprioception is the sense that makes everything ``real''

** Muscles are both effectors and sensors

** =CORTEX= brings complex creatures to life!

** =CORTEX= enables many possibilities for further research

* Empathy in a simulated worm

** Embodiment factors action recognition into manageable parts

** Action recognition is easy with a full gamut of senses

** Digression: bootstrapping touch using free exploration

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions

- Built =CORTEX=, a comprehensive platform for embodied AI
  experiments. It has many new features lacking in other systems,
  such as sound, and makes it easy to model and create new creatures.
- Created a novel concept for action recognition using artificial
  imagination.

In the second half of the thesis I develop a computational model of
empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of
sensations that creature is feeling. My empathy algorithm involves
multiple phases. First is free play, where the creature moves around
and gains sensory experience. From this experience I construct a
representation of the creature's sensory state space, which I call
\Phi-space. Using \Phi-space, I construct an efficient function for
enriching the limited data that comes from observing another
creature with a full complement of imagined sensory data based on
previous experience. I can then use the imagined sensory data to
recognize what the observed creature is doing and feeling, using
straightforward embodied action predicates. This is all demonstrated
using a simple worm-like creature, and recognizing worm actions
based on limited data.

Embodied representation using multiple senses such as touch,
proprioception, and muscle tension turns out to be exceedingly
efficient at describing body-centered actions. It is the ``right
language for the job''. For example, it takes only around 5 lines of
LISP code to describe the action of ``curling'' using embodied
primitives. It takes about 8 lines to describe the seemingly
complicated action of wiggling.
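As a rough illustration of that claim, a curling predicate of about
that size could be written purely in terms of proprioception. The
=bend-angle= helper and the threshold below are assumptions made for
this sketch; the actual =curled?= referenced by =grand-circle?=
above is defined later in the thesis.

#+begin_src clojure
;; Sketch only: bend-angle and the 2.5 radian threshold are invented
;; here; they are not the thesis's real definition of curled?.
(declare bend-angle)

(defn curled?
  "True when the worm's joints are, in total, flexed into a tight curl."
  [experiences]
  (let [bends (map bend-angle (:proprioception (peek experiences)))]
    (< 2.5 (reduce + bends))))
#+end_src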
* COMMENT names for cortex
- bioland