diff thesis/org/first-chapter.org @ 401:7ee735a836da

incorporate thesis.
author Robert McIntyre <rlm@mit.edu>
date Sun, 16 Mar 2014 23:31:16 -0400
parents
children
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/thesis/org/first-chapter.org	Sun Mar 16 23:31:16 2014 -0400
     1.3 @@ -0,0 +1,238 @@
     1.4 +#+title: =CORTEX=
     1.5 +#+author: Robert McIntyre
     1.6 +#+email: rlm@mit.edu
     1.7 +#+description: Using embodied AI to facilitate Artificial Imagination.
     1.8 +#+keywords: AI, clojure, embodiment
     1.9 +#+SETUPFILE: ../../aurellem/org/setup.org
    1.10 +#+INCLUDE: ../../aurellem/org/level-0.org
    1.11 +#+babel: :mkdirp yes :noweb yes :exports both
    1.12 +#+OPTIONS: toc:nil num:nil
    1.13 +
    1.14 +* Artificial Imagination
    1.15 +
    1.16 +  Imagine watching a video of someone skateboarding. When you watch
    1.17 +  the video, you can imagine yourself skateboarding, and your
    1.18 +  knowledge of the human body and its dynamics guides your
    1.19 +  interpretation of the scene. For example, even if the skateboarder
    1.20 +  is partially occluded, you can infer the positions of his arms and
    1.21 +  body from your own knowledge of how your body would be positioned if
    1.22 +  you were skateboarding. If the skateboarder suffers an accident, you
    1.23 +  wince in sympathy, imagining the pain your own body would experience
    1.24 +  if it were in the same situation. This empathy with other people
    1.25 +  guides our understanding of whatever they are doing because it is a
    1.26 +  powerful constraint on what is probable and possible. In order to
    1.27 +  make use of this powerful empathy constraint, I need a system that
    1.28 +  can generate and make sense of sensory data from the many different
    1.29 +   senses that humans possess. The two key properties of such a system
    1.30 +  are /embodiment/ and /imagination/.
    1.31 +
    1.32 +** What is imagination?
    1.33 +
    1.34 +   One kind of imagination is /sympathetic/ imagination: you imagine
    1.35 +   yourself in the position of something/someone you are
    1.36 +   observing. This type of imagination comes into play when you follow
    1.37 +   along visually when watching someone perform actions, or when you
    1.38 +   sympathetically grimace when someone hurts themselves. This type of
    1.39 +   imagination uses the constraints you have learned about your own
    1.40 +   body to sharply narrow the possibilities in whatever you are
    1.41 +   seeing. It uses all your senses, including your senses of touch,
    1.42 +   proprioception, etc. Humans are flexible when it comes to "putting
    1.43 +   themselves in another's shoes," and can sympathetically understand
    1.44 +   not only other humans, but entities ranging from animals to cartoon
    1.45 +   characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!
    1.46 +
    1.47 +   Another kind of imagination is /predictive/ imagination: you
    1.48 +   construct scenes in your mind that are not entirely related to
    1.49 +   whatever you are observing, but instead are predictions of the
    1.50 +   future or simply flights of fancy. You use this type of imagination
    1.51 +   to plan out multi-step actions, or play out dangerous situations in
    1.52 +   your mind so as to avoid messing them up in reality.
    1.53 +
    1.54 +   Of course, sympathetic and predictive imagination blend into each
    1.55 +   other and are not completely separate concepts. One dimension along
    1.56 +   which you can distinguish types of imagination is dependence on raw
    1.57 +   sense data. Sympathetic imagination is highly constrained by your
    1.58 +   senses, while predictive imagination can be more or less dependent
    1.59 +   on your senses depending on how far ahead you imagine. Daydreaming
    1.60 +   is an extreme form of predictive imagination that wanders through
    1.61 +   different possibilities without concern for whether they are
    1.62 +   related to whatever is happening in reality.
    1.63 +
    1.64 +   For this thesis, I will mostly focus on sympathetic imagination and
    1.65 +   the constraint it provides for understanding sensory data.
    1.66 +   
    1.67 +** What problems can imagination solve?
    1.68 +
    1.69 +   Consider a video of a cat drinking some water.
    1.70 +
    1.71 +   #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
    1.72 +   #+ATTR_LaTeX: width=5cm
    1.73 +   [[../images/cat-drinking.jpg]]
    1.74 +
    1.75 +   It is currently impossible for any computer program to reliably
    1.76 +   label such a video as "drinking". I think humans are able to label
    1.77 +   such video as "drinking" because they imagine /themselves/ as the
    1.78 +   cat, and imagine putting their face up against a stream of water
    1.79 +   and sticking out their tongue. In that imagined world, they can
    1.80 +   feel the cool water hitting their tongue, and feel the water
    1.81 +   entering their body, and are able to recognize that /feeling/ as
    1.82 +   drinking. So, the label of the action is not really in the pixels
    1.83 +   of the image, but is found clearly in a simulation inspired by
    1.84 +   those pixels. An imaginative system, having been trained on
    1.85 +   drinking and non-drinking examples and learning that the most
    1.86 +   important component of drinking is the feeling of water sliding
    1.87 +   down one's throat, would analyze a video of a cat drinking in the
    1.88 +   following manner:
    1.89 +   
    1.90 +   - Create a physical model of the video by putting a "fuzzy" model
    1.91 +     of its own body in place of the cat. Also, create a simulation of
    1.92 +     the stream of water.
    1.93 +
    1.94 +   - Play out this simulated scene and generate imagined sensory
    1.95 +     experience. This will include relevant muscle contractions, a
    1.96 +     close up view of the stream from the cat's perspective, and most
    1.97 +     importantly, the imagined feeling of water entering the mouth.
    1.98 +
    1.99 +   - The action is now easily identified as drinking by the sense of
   1.100 +     taste alone. The other senses (such as the feeling of the tongue
   1.101 +     moving in and out) help to give plausibility to the simulated
   1.102 +     action. Note that
   1.102 +     the sense of vision, while critical in creating the simulation,
   1.103 +     is not critical for identifying the action from the simulation.
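         +
         +   To make these steps concrete, here is a minimal Clojure sketch of
         +   the pipeline. Every function here is a stub I have invented for
         +   illustration; none of them are part of =CORTEX=.
         +
         +   #+begin_src clojure
         +   ;; A self-contained sketch of the three steps above. All names and
         +   ;; data shapes are hypothetical stand-ins, not CORTEX code.
         +   (defn fit-body-model
         +     "Step 1 (stub): put a fuzzy model of our own body in place of
         +     the cat, and simulate the stream of water."
         +     [video]
         +     {:body :fuzzy-self-model, :water :simulated-stream})
         +
         +   (defn simulate-scene
         +     "Step 2 (stub): play out the scene, generating imagined senses."
         +     [model]
         +     {:muscle [0.3 0.7], :vision :close-up-of-stream,
         +      :taste :water-on-tongue})
         +
         +   (defn classify-feeling
         +     "Step 3 (stub): the imagined feeling alone identifies drinking."
         +     [senses]
         +     (if (= (:taste senses) :water-on-tongue) :drinking :not-drinking))
         +
         +   (defn label-action [video]
         +     (-> video fit-body-model simulate-scene classify-feeling))
         +   #+end_src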
   1.104 +
   1.105 +   More generally, I expect imaginative systems to be particularly
   1.106 +   good at identifying embodied actions in videos.
   1.107 +
   1.108 +* Cortex
   1.109 +
   1.110 +  The previous example involves liquids, the sense of taste, and
   1.111 +  imagining oneself as a cat. For this thesis I constrain myself to
   1.112 +  simpler, more easily digitizable senses and situations.
   1.113 +
   1.114 +  My system, =CORTEX=, performs imagination in two different simplified
   1.115 +  worlds: /worm world/ and /stick-figure world/. In each of these
   1.116 +  worlds, entities capable of imagination recognize actions by
   1.117 +  simulating the experience from their own perspective, and then
   1.118 +  recognizing the action from a database of examples.
   1.119 +
   1.120 +  In order to serve as a framework for experiments in imagination,
   1.121 +  =CORTEX= requires simulated bodies, worlds, and senses like vision,
   1.122 +  hearing, touch, proprioception, etc.
   1.123 +
   1.124 +** A Video Game Engine takes care of some of the groundwork
   1.125 +
   1.126 +   When it comes to simulation environments, the engines used to
   1.127 +   create the worlds in video games offer top-notch physics and
   1.128 +   graphics support. These engines also have limited support for
   1.129 +   creating cameras and rendering 3D sound, which can be repurposed
   1.130 +   for vision and hearing respectively. Physics collision detection
   1.131 +   can be expanded to create a sense of touch.
   1.132 +   
   1.133 +   jMonkeyEngine3 is one such engine for creating video games in
   1.134 +   Java. It uses OpenGL to render to the screen and uses scene graphs
   1.135 +   to avoid drawing things that do not appear on the screen. It has an
   1.136 +   active community and several games in the pipeline. The engine was
   1.137 +   not built to serve any particular game but is instead meant to be
   1.138 +   used for any 3D game. I chose jMonkeyEngine3 because it had the
   1.139 +   most features out of all the open-source projects I looked at, and
   1.140 +   because I could then write my code in Clojure, a dialect of Lisp
   1.141 +   that runs on the JVM.
   1.142 +
   1.143 +** =CORTEX= Extends jMonkeyEngine3 to implement rich senses
   1.144 +
   1.145 +   Using the game-making primitives provided by jMonkeyEngine3, I have
   1.146 +   constructed every major human sense except for smell and
   1.147 +   taste. =CORTEX= also provides an interface for creating creatures
   1.148 +   in Blender, a 3D modeling environment, and then "rigging" the
   1.149 +   creatures with senses using 3D annotations. A creature
   1.150 +   can have any number of senses, and there can be any number of
   1.151 +   creatures in a simulation.
   1.152 +   
   1.153 +   The senses available in =CORTEX= are:
   1.154 +
   1.155 +   - [[../../cortex/html/vision.html][Vision]]
   1.156 +   - [[../../cortex/html/hearing.html][Hearing]]
   1.157 +   - [[../../cortex/html/touch.html][Touch]]
   1.158 +   - [[../../cortex/html/proprioception.html][Proprioception]]
   1.159 +   - [[../../cortex/html/movement.html][Muscle Tension]]
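         +
         +   As an illustration only, rigging a creature with these senses
         +   might look something like the following from Clojure. The function
         +   names here (=load-blender-model=, =vision!=, and so on) are my
         +   assumptions for this sketch, not necessarily =CORTEX='s actual
         +   API.
         +
         +   #+begin_src clojure
         +   ;; Hypothetical sketch: load a Blender-modeled creature and attach
         +   ;; senses. Function names are illustrative guesses, not CORTEX's
         +   ;; confirmed API.
         +   (def worm (load-blender-model "Models/worm/worm.blend"))
         +
         +   ;; Each sense constructor would read the 3D annotations in the
         +   ;; Blender file and return a function that produces that sense's
         +   ;; data each frame.
         +   (def senses
         +     {:vision         (vision! worm)
         +      :hearing        (hearing! worm)
         +      :touch          (touch! worm)
         +      :proprioception (proprioception! worm)
         +      :muscle-tension (movement! worm)})
         +   #+end_src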
   1.160 +
   1.161 +* A roadmap for =CORTEX= experiments
   1.162 +
   1.163 +** Worm World
   1.164 +
   1.165 +   Worms in =CORTEX= are segmented creatures which vary in length and
   1.166 +   number of segments, and have the senses of vision, proprioception,
   1.167 +   touch, and muscle tension.
   1.168 +
   1.169 +#+attr_html: width=755
   1.170 +#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
   1.171 +[[../images/finger-UV.png]]
   1.172 +
   1.173 +
   1.174 +#+begin_html
   1.175 +<div class="figure">
   1.176 +  <center>
   1.177 +    <video controls="controls" width="550">
   1.178 +      <source src="../video/worm-touch.ogg" type="video/ogg"
   1.179 +	      preload="none" />
   1.180 +    </video>
   1.181 +    <br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
   1.182 +  </center>
   1.183 +  <p>The worm responds to touch.</p>
   1.184 +</div>
   1.185 +#+end_html
   1.186 +
   1.187 +#+begin_html
   1.188 +<div class="figure">
   1.189 +  <center>
   1.190 +    <video controls="controls" width="550">
   1.191 +      <source src="../video/test-proprioception.ogg" type="video/ogg"
   1.192 +	      preload="none" />
   1.193 +    </video>
   1.194 +    <br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
   1.195 +  </center>
   1.196 +  <p>Proprioception in a worm. The proprioceptive readout is
   1.197 +    in the upper left corner of the screen.</p>
   1.198 +</div>
   1.199 +#+end_html
   1.200 +
   1.201 +   A worm is trained in various actions such as sinusoidal movement,
   1.202 +   curling, flailing, and spinning by directly playing motor
   1.203 +   contractions while the worm "feels" the experience. These actions
   1.204 +   are recorded not only as vectors of muscle tension, touch, and
   1.205 +   proprioceptive data, but also in higher-level forms such as the
   1.206 +   frequencies of the various contractions and a symbolic name for the
   1.207 +   action.
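         +
         +   One plausible shape for such a recording, shown purely as an
         +   assumption of mine and not =CORTEX='s actual format, is a map
         +   pairing the raw per-frame sense vectors with the higher-level
         +   summary:
         +
         +   #+begin_src clojure
         +   ;; Hypothetical shape of one recorded training example; the keys,
         +   ;; values, and truncated data are illustrative only.
         +   (def example-action
         +     {:name           :curling  ; symbolic label for the action
         +      ;; raw per-frame sense data (truncated to two frames here)
         +      :muscle-tension [[0.0 0.8 0.2] [0.1 0.7 0.3]]
         +      :touch          [[0 0 1 1]     [0 1 1 0]]
         +      :proprioception [[0.1 1.2]     [0.2 1.1]]
         +      ;; higher-level summary: dominant contraction frequency (Hz)
         +      ;; for each muscle
         +      :frequencies    [0.5 0.5 1.0]})
         +   #+end_src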
   1.208 +
   1.209 +   Then, the worm watches a video of another worm performing one of
   1.210 +   the actions, and must judge which action was performed. Normally
   1.211 +   this would be an extremely difficult problem, but the worm is able
   1.212 +   to greatly diminish the search space through sympathetic
   1.213 +   imagination. First, it creates an imagined copy of its body which
   1.214 +   it observes from a third-person point of view. Then, for each frame
   1.215 +   of the video, it maneuvers its simulated body to be in registration
   1.216 +   with the worm depicted in the video. The physical constraints
   1.217 +   imposed by the physics simulation greatly decrease the number of
   1.218 +   poses that have to be tried, making the search feasible. As the
   1.219 +   imaginary worm moves, it generates imaginary muscle tension and
   1.220 +   proprioceptive sensations. The worm determines the action not by
   1.221 +   vision, but by matching the imagined proprioceptive data with
   1.222 +   previous examples.
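         +
         +   The final matching step could be realized as something as simple
         +   as nearest-neighbor search over proprioceptive traces. The sketch
         +   below is one plausible way to do it, assuming the data layout from
         +   the example above; the text here only commits to matching against
         +   previous examples.
         +
         +   #+begin_src clojure
         +   ;; Sketch: identify an action by nearest-neighbor matching of
         +   ;; proprioceptive traces. Metric and data layout are assumptions.
         +   (defn frame-distance
         +     "Euclidean distance between two equal-length sense frames."
         +     [a b]
         +     (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d)))
         +                               a b))))
         +
         +   (defn trace-distance
         +     "Summed frame-by-frame distance between two traces."
         +     [trace-a trace-b]
         +     (reduce + (map frame-distance trace-a trace-b)))
         +
         +   (defn identify-action
         +     "Return the :name of the stored example whose proprioceptive
         +     trace is closest to the imagined one."
         +     [imagined-trace examples]
         +     (:name (apply min-key
         +                   #(trace-distance imagined-trace (:proprioception %))
         +                   examples)))
         +   #+end_src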
   1.223 +
   1.224 +   By using non-visual sensory data such as touch, the worms can also
   1.225 +   answer body-related questions such as "did your head touch your
   1.226 +   tail?" and "did worm A touch worm B?"
   1.227 +
   1.228 +   The proprioceptive information used for action identification is
   1.229 +   body-centric, so only the registration step is dependent on point
   1.230 +   of view, not the identification step. Registration is not specific
   1.231 +   to any particular action. Thus, action identification can be
   1.232 +   divided into a point-of-view dependent generic registration step,
   1.233 +   and an action-specific step that is body-centered and invariant to
   1.234 +   point of view.
   1.235 +
   1.236 +** Stick Figure World
   1.237 +
   1.238 +   This environment is similar to Worm World, except the creatures are
   1.239 +   more complicated and the actions and questions more varied. It is
   1.240 +   an experiment to see how far imagination can go in interpreting
   1.241 +   actions.