     1 #+title: =CORTEX=

     2 #+author: Robert McIntyre

     3 #+email: rlm@mit.edu

     4 #+description: Using embodied AI to facilitate Artificial Imagination.

     5 #+keywords: AI, clojure, embodiment

     6 #+SETUPFILE: ../../aurellem/org/setup.org

     7 #+INCLUDE: ../../aurellem/org/level-0.org

     8 #+babel: :mkdirp yes :noweb yes :exports both

     9 #+OPTIONS: toc:nil, num:nil

    10 

    11 * Artificial Imagination

    12 

    13   Imagine watching a video of someone skateboarding. When you watch

    14   the video, you can imagine yourself skateboarding, and your

    15   knowledge of the human body and its dynamics guides your

    16   interpretation of the scene. For example, even if the skateboarder

    17   is partially occluded, you can infer the positions of his arms and

    18   body from your own knowledge of how your body would be positioned if

    19   you were skateboarding. If the skateboarder suffers an accident, you

    20   wince in sympathy, imagining the pain your own body would experience

    21   if it were in the same situation. This empathy with other people

    22   guides our understanding of whatever they are doing because it is a

    23   powerful constraint on what is probable and possible. In order to

    24   make use of this powerful empathy constraint, I need a system that

    25   can generate and make sense of sensory data from the many different

    26   senses that humans possess. The two key properties of such a system

    27   are /embodiment/ and /imagination/.

    28 

    29 ** What is imagination?

    30 

    31    One kind of imagination is /sympathetic/ imagination: you imagine

    32    yourself in the position of something/someone you are

    33    observing. This type of imagination comes into play when you follow

    34    along visually when watching someone perform actions, or when you

    35    sympathetically grimace when someone hurts themselves. This type of

    36    imagination uses the constraints you have learned about your own

    37    body to sharply limit the possibilities in whatever you are

    38    seeing. It uses all your senses, including your senses of touch,

    39    proprioception, etc. Humans are flexible when it comes to "putting

    40    themselves in another's shoes," and can sympathetically understand

    41    not only other humans, but entities ranging from animals to cartoon

    42    characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

    43 

    44    Another kind of imagination is /predictive/ imagination: you

    45    construct scenes in your mind that are not entirely related to

    46    whatever you are observing, but instead are predictions of the

    47    future or simply flights of fancy. You use this type of imagination

    48    to plan out multi-step actions, or play out dangerous situations in

    49    your mind so as to avoid messing them up in reality.

    50 

    51    Of course, sympathetic and predictive imagination blend into each

    52    other and are not completely separate concepts. One dimension along

    53    which you can distinguish types of imagination is dependence on raw

    54    sense data. Sympathetic imagination is highly constrained by your

    55    senses, while predictive imagination can be more or less dependent

    56    on your senses depending on how far ahead you imagine. Daydreaming

    57    is an extreme form of predictive imagination that wanders through

    58    different possibilities without concern for whether they are

    59    related to whatever is happening in reality.

    60 

    61    For this thesis, I will mostly focus on sympathetic imagination and

    62    the constraint it provides for understanding sensory data.

    63    

    64 ** What problems can imagination solve?

    65 

    66    Consider a video of a cat drinking some water.

    67 

    68    #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.

    69    #+ATTR_LaTeX: width=5cm

    70    [[../images/cat-drinking.jpg]]

    71 

    72    It is currently impossible for any computer program to reliably

    73    label such a video as "drinking". I think humans are able to label

    74    such a video as "drinking" because they imagine /themselves/ as the

    75    cat, and imagine putting their face up against a stream of water

    76    and sticking out their tongue. In that imagined world, they can

    77    feel the cool water hitting their tongue, and feel the water

    78    entering their body, and are able to recognize that /feeling/ as

    79    drinking. So, the label of the action is not really in the pixels

    80    of the image, but is found clearly in a simulation inspired by

    81    those pixels. An imaginative system, having been trained on

    82    drinking and non-drinking examples and learning that the most

    83    important component of drinking is the feeling of water sliding

    84    down one's throat, would analyze a video of a cat drinking in the

    85    following manner:

    86    

    87    - Create a physical model of the video by putting a "fuzzy" model

    88      of its own body in place of the cat. Also, create a simulation of

    89      the stream of water.

    90 

    91    - Play out this simulated scene and generate imagined sensory

    92      experience. This will include relevant muscle contractions, a

    93      close up view of the stream from the cat's perspective, and most

    94      importantly, the imagined feeling of water entering the mouth.

    95 

    96    - The action is now easily identified as drinking by the sense of

    97      taste alone. The other senses (such as the tongue moving in and

    98      out) help to give plausibility to the simulated action. Note that

    99      the sense of vision, while critical in creating the simulation,

   100      is not critical for identifying the action from the simulation.

   101 

   102    More generally, I expect imaginative systems to be particularly

   103    good at identifying embodied actions in videos.

   104 

   105 * Cortex

   106 

   107   The previous example involves liquids, the sense of taste, and

   108   imagining oneself as a cat. For this thesis I constrain myself to

   109   simpler, more easily digitizable senses and situations.

   110 

   111   My system, =CORTEX=, performs imagination in two different simplified

   112   worlds: /worm world/ and /stick-figure world/. In each of these

   113   worlds, entities capable of imagination recognize actions by

   114   simulating the experience from their own perspective, and then

   115   recognizing the action from a database of examples.

   116 

   117   In order to serve as a framework for experiments in imagination,

   118   =CORTEX= requires simulated bodies, worlds, and senses like vision,

   119   hearing, touch, proprioception, etc.

   120 

   121 ** A Video Game Engine takes care of some of the groundwork

   122 

   123    When it comes to simulation environments, the engines used to

   124    create the worlds in video games offer top-notch physics and

   125    graphics support. These engines also have limited support for

   126    creating cameras and rendering 3D sound, which can be repurposed

   127    for vision and hearing respectively. Physics collision detection

   128    can be expanded to create a sense of touch.

   129    

   130    jMonkeyEngine3 is one such engine for creating video games in

   131    Java. It uses OpenGL to render to the screen and uses scene graphs

   132    to avoid drawing things that do not appear on the screen. It has an

   133    active community and several games in the pipeline. The engine was

   134    not built to serve any particular game but is instead meant to be

   135    used for any 3D game. I chose jMonkeyEngine3 because it had the

   136    most features out of all the open projects I looked at, and because

   137    I could then write my code in Clojure, a dialect of Lisp

   138    that runs on the JVM.

   139 

   140 ** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

   141 

   142    Using the game-making primitives provided by jMonkeyEngine3, I have

   143    constructed every major human sense except for smell and

   144    taste. =CORTEX= also provides an interface for creating creatures

   145    in Blender, a 3D modeling environment, and then "rigging" the

   146    creatures with senses using 3D annotations in Blender. A creature

   147    can have any number of senses, and there can be any number of

   148    creatures in a simulation.

   149    

   150    The senses available in =CORTEX= are:

   151 

   152    - [[../../cortex/html/vision.html][Vision]]

   153    - [[../../cortex/html/hearing.html][Hearing]]

   154    - [[../../cortex/html/touch.html][Touch]]

   155    - [[../../cortex/html/proprioception.html][Proprioception]]

   156    - [[../../cortex/html/movement.html][Muscle Tension]]

   157 

   158 * A roadmap for =CORTEX= experiments

   159 

   160 ** Worm World

   161 

   162    Worms in =CORTEX= are segmented creatures which vary in length and

   163    number of segments, and have the senses of vision, proprioception,

   164    touch, and muscle tension.

   165 

   166 #+attr_html: width=755

   167 #+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).

   168 [[../images/finger-UV.png]]
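
   One way to read a profile like the one above is to treat every
   white pixel as the location of one touch sensor, so that pixel
   density becomes touch sensitivity. The sketch below is only an
   illustration under that assumption. It is written in Python rather
   than the Clojure used by =CORTEX=, and the names
   =sensor_coordinates=, =sensitivity=, and the tiny hand-written
   =PROFILE= grid are hypothetical.

#+begin_src python
def sensor_coordinates(profile):
    """Return the (row, col) position of every white pixel (value 1)."""
    return [(r, c)
            for r, row in enumerate(profile)
            for c, value in enumerate(row)
            if value == 1]


def sensitivity(profile, top, bottom):
    """Fraction of white pixels in rows top..bottom-1 (a crude density)."""
    rows = profile[top:bottom]
    total = sum(len(row) for row in rows)
    white = sum(sum(row) for row in rows)
    return white / total if total else 0.0


# Hypothetical 4x4 profile: the top half is densely covered with
# sensors, the bottom half only sparsely.
PROFILE = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [0, 0, 0, 1],
]

sensors = sensor_coordinates(PROFILE)
dense = sensitivity(PROFILE, 0, 2)
sparse = sensitivity(PROFILE, 2, 4)
#+end_src

   In this toy grid the upper rows yield a higher density than the
   lower rows, which mirrors the high- and low-sensitivity regions
   described in the caption.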

   169 

   170 

   171 #+begin_html

   172 <div class="figure">

   173   <center>

   174     <video controls="controls" width="550">

   175       <source src="../video/worm-touch.ogg" type="video/ogg"

   176 	      preload="none" />

   177     </video>

   178     <br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>

   179   </center>

   180   <p>The worm responds to touch.</p>

   181 </div>

   182 #+end_html

   183 

   184 #+begin_html

   185 <div class="figure">

   186   <center>

   187     <video controls="controls" width="550">

   188       <source src="../video/test-proprioception.ogg" type="video/ogg"

   189 	      preload="none" />

   190     </video>

   191     <br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>

   192   </center>

   193   <p>Proprioception in a worm. The proprioceptive readout is

   194     in the upper left corner of the screen.</p>

   195 </div>

   196 #+end_html

   197 

   198    A worm is trained in various actions such as sinusoidal movement,

   199    curling, flailing, and spinning by directly playing motor

   200    contractions while the worm "feels" the experience. These actions

   201    are recorded both as vectors of muscle tension, touch, and

   202    proprioceptive data and in higher-level forms such as

   203    frequencies of the various contractions and a symbolic name for the

   204    action.
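
   A recorded action of this kind could be represented roughly as
   follows. This is a minimal sketch, assuming a single muscle sampled
   at a fixed frame rate. The =Experience= record, the
   =dominant_frequency= helper, and the 60 frames-per-second rate are
   hypothetical choices made for illustration (in Python, not the
   Clojure of =CORTEX=), and the naive discrete Fourier transform
   stands in for whatever frequency analysis is actually used.

#+begin_src python
import cmath
import math
from dataclasses import dataclass


@dataclass
class Experience:
    name: str             # symbolic name of the action
    muscle: list          # muscle tension, one value per frame
    touch: list           # touch data per frame (unused in this sketch)
    proprioception: list  # joint data per frame (unused in this sketch)


def dominant_frequency(signal, frame_rate):
    """Return the strongest non-zero frequency in Hz via a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return best_k * frame_rate / n


FRAME_RATE = 60  # assumed frames per second
# One second of a muscle contracting three times per second.
tension = [math.sin(2 * math.pi * 3 * t / FRAME_RATE)
           for t in range(FRAME_RATE)]
record = Experience(name="wiggle", muscle=tension,
                    touch=[], proprioception=[])
freq = dominant_frequency(record.muscle, FRAME_RATE)
#+end_src

   The raw vector and the derived frequency describe the same
   experience at two levels of abstraction, with the symbolic name
   attached to both.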

   205 

   206    Then, the worm watches a video of another worm performing one of

   207    the actions, and must judge which action was performed. Normally

   208    this would be an extremely difficult problem, but the worm is able

   209    to greatly diminish the search space through sympathetic

   210    imagination. First, it creates an imagined copy of its body which

   211    it observes from a third person point of view. Then for each frame

   212    of the video, it maneuvers its simulated body to be in registration

   213    with the worm depicted in the video. The physical constraints

   214    imposed by the physics simulation greatly decrease the number of

   215    poses that have to be tried, making the search feasible. As the

   216    imaginary worm moves, it generates imaginary muscle tension and

   217    proprioceptive sensations. The worm determines the action not by

   218    vision, but by matching the imagined proprioceptive data with

   219    previous examples.
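
   The final matching step can be pictured as a nearest-neighbour
   lookup. The sketch below is an assumption made for illustration,
   not necessarily the matching method used in =CORTEX=. Each example
   is a short sequence of joint-angle pairs, and the names =distance=,
   =identify=, and =DATABASE=, along with all the numbers, are
   hypothetical.

#+begin_src python
import math


def distance(a, b):
    """Mean Euclidean distance between two equal-length sequences of
    joint-angle vectors."""
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def identify(imagined, database):
    """Return the label of the stored example closest to the imagined
    proprioceptive data."""
    return min(database,
               key=lambda label: distance(imagined, database[label]))


# Hypothetical stored examples: three frames, two joint angles each.
DATABASE = {
    "curl":   [(0.0, 0.0), (0.5, 0.5), (1.0, 1.0)],
    "wiggle": [(0.5, -0.5), (-0.5, 0.5), (0.5, -0.5)],
}

# Imagined proprioception produced while tracking the observed worm.
imagined = [(0.1, 0.0), (0.4, 0.6), (0.9, 1.1)]
label = identify(imagined, DATABASE)
#+end_src

   Note that no pixel data enters =identify=. Vision is needed only to
   produce the imagined joint angles.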

   220 

   221    By using non-visual sensory data such as touch, the worms can also

   222    answer body related questions such as "did your head touch your

   223    tail?" and "did worm A touch worm B?"
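
   Questions like these reduce to simple queries once touch data has
   been imagined or recorded. The following sketch assumes that each
   frame stores a set of (body part, thing touched) pairs. That
   representation, the function name =first_contact=, and the sample
   frames are all hypothetical.

#+begin_src python
def first_contact(frames, part, other):
    """Return the index of the first frame in which `part` touched
    `other`, or None if they never touched."""
    for index, contacts in enumerate(frames):
        if (part, other) in contacts:
            return index
    return None


# Hypothetical touch record for three frames of a curling worm.
frames = [
    set(),
    {("head", "floor")},
    {("head", "floor"), ("head", "tail")},
]

head_tail = first_contact(frames, "head", "tail")
head_worm_b = first_contact(frames, "head", "worm-B")
#+end_src

   A result of =None= answers the question with "no", and any frame
   index answers it with "yes" and also says when the contact
   happened.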

   224 

   225    The proprioceptive information used for action identification is

   226    body-centric, so only the registration step is dependent on point

   227    of view, not the identification step. Registration is not specific

   228    to any particular action. Thus, action identification can be

   229    divided into a point-of-view dependent generic registration step,

   230    and an action-specific step that is body-centered and invariant to

   231    point of view.
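
   The invariance claimed here can be checked on a toy example. The
   sketch below simplifies the worm to a 2D chain of points and treats
   a change of viewpoint as a rotation plus a translation. Both are
   assumptions made for illustration, and the names =joint_angles= and
   =change_view= are hypothetical. The bend angle at each joint comes
   out the same from either viewpoint.

#+begin_src python
import math


def joint_angles(points):
    """Bend angle at each interior joint of a 2D chain of points."""
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(b[1] - a[1], b[0] - a[0])
        heading_out = math.atan2(c[1] - b[1], c[0] - b[0])
        diff = heading_out - heading_in
        # Wrap the difference into the range -pi..pi.
        angles.append(math.atan2(math.sin(diff), math.cos(diff)))
    return angles


def change_view(points, rotation, shift):
    """Rotate by `rotation` radians, then translate by `shift`."""
    cos_r, sin_r = math.cos(rotation), math.sin(rotation)
    return [(cos_r * x - sin_r * y + shift[0],
             sin_r * x + cos_r * y + shift[1])
            for x, y in points]


worm = [(0, 0), (1, 0), (2, 1), (2, 2)]
original = joint_angles(worm)
moved = joint_angles(change_view(worm, rotation=1.2, shift=(5, -3)))
#+end_src

   Because =original= and =moved= agree up to floating-point error,
   any classifier built on joint angles never needs to know where the
   camera was. Only the registration step does.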

   232 

   233 ** Stick Figure World

   234 

   235    This environment is similar to Worm World, except the creatures are

   236    more complicated and the actions and questions more varied. It is

   237    an experiment to see how far imagination can go in interpreting

   238    actions.
author	Robert McIntyre <rlm@mit.edu>
date	Wed, 19 Mar 2014 22:02:06 -0400