view thesis/org/first-chapter.org @ 403:92acbe7e5c91
begin final coding stretch.
author | Robert McIntyre <rlm@mit.edu> |
---|---|
date | Mon, 17 Mar 2014 14:01:02 -0400 |
parents | 7ee735a836da |
children |
#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil, num:nil

* Artificial Imagination

Imagine watching a video of someone skateboarding. When you watch
the video, you can imagine yourself skateboarding, and your
knowledge of the human body and its dynamics guides your
interpretation of the scene. For example, even if the skateboarder
is partially occluded, you can infer the positions of his arms and
body from your own knowledge of how your body would be positioned if
you were skateboarding. If the skateboarder suffers an accident, you
wince in sympathy, imagining the pain your own body would experience
if it were in the same situation. This empathy with other people
guides our understanding of whatever they are doing because it is a
powerful constraint on what is probable and possible. In order to
make use of this powerful empathy constraint, I need a system that
can generate and make sense of sensory data from the many different
senses that humans possess. The two key properties of such a system
are /embodiment/ and /imagination/.

** What is imagination?

One kind of imagination is /sympathetic/ imagination: you imagine
yourself in the position of something/someone you are
observing. This type of imagination comes into play when you follow
along visually when watching someone perform actions, or when you
sympathetically grimace when someone hurts themselves. This type of
imagination uses the constraints you have learned about your own
body to sharply narrow the possibilities in whatever you are
seeing. It uses all your senses, including your senses of touch,
proprioception, etc.
Humans are flexible when it comes to "putting
themselves in another's shoes," and can sympathetically understand
not only other humans, but entities ranging from animals to cartoon
characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

Another kind of imagination is /predictive/ imagination: you
construct scenes in your mind that are not entirely related to
whatever you are observing, but instead are predictions of the
future or simply flights of fancy. You use this type of imagination
to plan out multi-step actions, or play out dangerous situations in
your mind so as to avoid messing them up in reality.

Of course, sympathetic and predictive imagination blend into each
other and are not completely separate concepts. One dimension along
which you can distinguish types of imagination is dependence on raw
sense data. Sympathetic imagination is highly constrained by your
senses, while predictive imagination can be more or less dependent
on your senses depending on how far ahead you imagine. Daydreaming
is an extreme form of predictive imagination that wanders through
different possibilities without concern for whether they are
related to whatever is happening in reality.

For this thesis, I will mostly focus on sympathetic imagination and
the constraint it provides for understanding sensory data.

** What problems can imagination solve?

Consider a video of a cat drinking some water.

#+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
#+ATTR_LaTeX: width=5cm
[[../images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". I think humans are able to label
such a video as "drinking" because they imagine /themselves/ as the
cat, and imagine putting their face up against a stream of water
and sticking out their tongue.
In that imagined world, they can
feel the cool water hitting their tongue, and feel the water
entering their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and learning that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

- Create a physical model of the video by putting a "fuzzy" model
  of its own body in place of the cat. Also, create a simulation of
  the stream of water.

- Play out this simulated scene and generate imagined sensory
  experience. This will include relevant muscle contractions, a
  close-up view of the stream from the cat's perspective, and most
  importantly, the imagined feeling of water entering the mouth.

- The action is now easily identified as drinking by the sense of
  taste alone. The other senses (such as the tongue moving in and
  out) help to give plausibility to the simulated action. Note that
  the sense of vision, while critical in creating the simulation,
  is not critical for identifying the action from the simulation.

More generally, I expect imaginative systems to be particularly
good at identifying embodied actions in videos.

* Cortex

The previous example involves liquids, the sense of taste, and
imagining oneself as a cat. For this thesis I constrain myself to
simpler, more easily digitizable senses and situations.

My system, =CORTEX=, performs imagination in two different simplified
worlds: /worm world/ and /stick-figure world/. In each of these
In each of these113 worlds, entities capable of imagination recognize actions by114 simulating the experience from their own perspective, and then115 recognizing the action from a database of examples.117 In order to serve as a framework for experiments in imagination,118 =CORTEX= requires simulated bodies, worlds, and senses like vision,119 hearing, touch, proprioception, etc.121 ** A Video Game Engine takes care of some of the groundwork123 When it comes to simulation environments, the engines used to124 create the worlds in video games offer top-notch physics and125 graphics support. These engines also have limited support for126 creating cameras and rendering 3D sound, which can be repurposed127 for vision and hearing respectively. Physics collision detection128 can be expanded to create a sense of touch.130 jMonkeyEngine3 is one such engine for creating video games in131 Java. It uses OpenGL to render to the screen and uses screengraphs132 to avoid drawing things that do not appear on the screen. It has an133 active community and several games in the pipeline. The engine was134 not built to serve any particular game but is instead meant to be135 used for any 3D game. I chose jMonkeyEngine3 it because it had the136 most features out of all the open projects I looked at, and because137 I could then write my code in Clojure, an implementation of LISP138 that runs on the JVM.140 ** =CORTEX= Extends jMonkeyEngine3 to implement rich senses142 Using the game-making primitives provided by jMonkeyEngine3, I have143 constructed every major human sense except for smell and144 taste. =CORTEX= also provides an interface for creating creatures145 in Blender, a 3D modeling environment, and then "rigging" the146 creatures with senses using 3D annotations in Blender. 
A creature
can have any number of senses, and there can be any number of
creatures in a simulation.

The senses available in =CORTEX= are:

- [[../../cortex/html/vision.html][Vision]]
- [[../../cortex/html/hearing.html][Hearing]]
- [[../../cortex/html/touch.html][Touch]]
- [[../../cortex/html/proprioception.html][Proprioception]]
- [[../../cortex/html/movement.html][Muscle Tension]]

* A roadmap for =CORTEX= experiments

** Worm World

Worms in =CORTEX= are segmented creatures which vary in length and
number of segments, and have the senses of vision, proprioception,
touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]


#+begin_html
<div class="figure">
<center>
<video controls="controls" width="550">
<source src="../video/worm-touch.ogg" type="video/ogg"
preload="none" />
</video>
<br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
</center>
<p>The worm responds to touch.</p>
</div>
#+end_html

#+begin_html
<div class="figure">
<center>
<video controls="controls" width="550">
<source src="../video/test-proprioception.ogg" type="video/ogg"
preload="none" />
</video>
<br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
</center>
<p>Proprioception in a worm. The proprioceptive readout is
in the upper left corner of the screen.</p>
</div>
#+end_html

A worm is trained in various actions such as sinusoidal movement,
curling, flailing, and spinning by directly playing motor
contractions while the worm "feels" the experience.
These actions
are recorded not only as vectors of muscle tension, touch, and
proprioceptive data, but also in higher-level forms such as
frequencies of the various contractions and a symbolic name for the
action.

Then, the worm watches a video of another worm performing one of
the actions, and must judge which action was performed. Normally
this would be an extremely difficult problem, but the worm is able
to greatly diminish the search space through sympathetic
imagination. First, it creates an imagined copy of its body which
it observes from a third-person point of view. Then for each frame
of the video, it maneuvers its simulated body to be in registration
with the worm depicted in the video. The physical constraints
imposed by the physics simulation greatly decrease the number of
poses that have to be tried, making the search feasible. As the
imaginary worm moves, it generates imaginary muscle tension and
proprioceptive sensations. The worm determines the action not by
vision, but by matching the imagined proprioceptive data with
previous examples.

By using non-visual sensory data such as touch, the worms can also
answer body-related questions such as "did your head touch your
tail?" and "did worm A touch worm B?"

The proprioceptive information used for action identification is
body-centric, so only the registration step is dependent on point
of view, not the identification step. Registration is not specific
to any particular action. Thus, action identification can be
divided into a point-of-view dependent generic registration step,
and an action-specific step that is body-centered and invariant to
point of view.

** Stick Figure World

This environment is similar to Worm World, except the creatures are
more complicated and the actions and questions more varied. It is
an experiment to see how far imagination can go in interpreting
actions.
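
The identification step shared by both worlds, matching imagined
proprioceptive data against previously recorded examples, can be
sketched as a nearest-neighbor lookup. The sketch below is in Python
rather than the Clojure that =CORTEX= is written in, and it is only
an illustration under stated assumptions: the frame format (one list
of joint angles per frame), the distance measure, the names
=sequence_distance= and =identify_action=, and the example data are
all hypothetical and are not taken from =CORTEX=.

```python
import math

# ASSUMPTION: a proprioceptive recording is a list of frames, and each
# frame is a list of joint angles in radians. This format is invented
# for illustration; it is not the representation CORTEX uses.


def sequence_distance(a, b):
    """Mean Euclidean distance between two recordings, frame by frame.

    Only the frames that both recordings share are compared, so a longer
    recording is effectively truncated to the length of the shorter one.
    """
    n = min(len(a), len(b))
    if n == 0:
        raise ValueError("both recordings need at least one frame")
    # zip stops at the shorter recording, giving exactly n frame pairs.
    return sum(math.dist(fa, fb) for fa, fb in zip(a, b)) / n


def identify_action(imagined, examples):
    """Return the label of the stored example nearest to `imagined`.

    `examples` is a list of (label, recording) pairs.
    """
    if not examples:
        raise ValueError("the example database is empty")
    label, _ = min(examples, key=lambda ex: sequence_distance(imagined, ex[1]))
    return label


# Hypothetical two-joint training data for two of the worm's actions.
EXAMPLES = [
    ("curling", [[0.0, 0.0], [0.3, 0.3], [0.6, 0.6], [0.9, 0.9]]),
    ("sinusoidal", [[0.5, -0.5], [-0.5, 0.5], [0.5, -0.5], [-0.5, 0.5]]),
]

# Imagined proprioception produced while registering the simulated body
# with the observed worm (made-up numbers close to the curling example).
IMAGINED = [[0.1, 0.0], [0.35, 0.25], [0.55, 0.65], [0.8, 0.95]]

print(identify_action(IMAGINED, EXAMPLES))
```

Because the comparison uses only body-centric joint angles, nothing
in it depends on the camera's point of view; any viewpoint dependence
would have to be handled beforehand by the registration step.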