     1 #+title: =CORTEX=

     2 #+author: Robert McIntyre

     3 #+email: rlm@mit.edu

     4 #+description: Using embodied AI to facilitate Artificial Imagination.

     5 #+keywords: AI, clojure, embodiment

     6 #+SETUPFILE: ../../aurellem/org/setup.org

     7 #+INCLUDE: ../../aurellem/org/level-0.org

     8 #+babel: :mkdirp yes :noweb yes :exports both

     9 #+OPTIONS: toc:nil num:nil

    10 

    11 * Artificial Imagination

    12   Imagine watching a video of someone skateboarding. When you watch

    13   the video, you can imagine yourself skateboarding, and your

    14   knowledge of the human body and its dynamics guides your

    15   interpretation of the scene. For example, even if the skateboarder

    16   is partially occluded, you can infer the positions of his arms and

    17   body from your own knowledge of how your body would be positioned if

    18   you were skateboarding. If the skateboarder suffers an accident, you

    19   wince in sympathy, imagining the pain your own body would experience

    20   if it were in the same situation. This empathy with other people

    21   guides our understanding of whatever they are doing because it is a

    22   powerful constraint on what is probable and possible. In order to

    23   make use of this powerful empathy constraint, I need a system that

    24   can generate and make sense of sensory data from the many different

    25   senses that humans possess. The two key properties of such a system

    26   are /embodiment/ and /imagination/.

    27 

    28 ** What is imagination?

    29 

    30    One kind of imagination is /sympathetic/ imagination: you imagine

    31    yourself in the position of something/someone you are

    32    observing. This type of imagination comes into play when you follow

    33    along visually when watching someone perform actions, or when you

    34    sympathetically grimace when someone hurts themselves. This type of

    35    imagination uses the constraints you have learned about your own

    36    body to highly constrain the possibilities in whatever you are

    37    seeing. It uses all your senses, including your senses of touch,

    38    proprioception, etc. Humans are flexible when it comes to "putting

    39    themselves in another's shoes," and can sympathetically understand

    40    not only other humans, but entities ranging from animals to cartoon

    41    characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

    42 

    43 # and can infer intention from the actions of not only other humans,

    44 # but also animals, cartoon characters, and even abstract moving dots

    45 # on a screen!

    46 

    47    Another kind of imagination is /predictive/ imagination: you

    48    construct scenes in your mind that are not entirely related to

    49    whatever you are observing, but instead are predictions of the

    50    future or simply flights of fancy. You use this type of imagination

    51    to plan out multi-step actions, or play out dangerous situations in

    52    your mind so as to avoid messing them up in reality.

    53 

    54    Of course, sympathetic and predictive imagination blend into each

    55    other and are not completely separate concepts. One dimension along

    56    which you can distinguish types of imagination is dependence on raw

    57    sense data. Sympathetic imagination is highly constrained by your

    58    senses, while predictive imagination can be more or less dependent

    59    on your senses depending on how far ahead you imagine. Daydreaming

    60    is an extreme form of predictive imagination that wanders through

    61    different possibilities without concern for whether they are

    62    related to whatever is happening in reality.

    63 

    64    For this thesis, I will mostly focus on sympathetic imagination and

    65    the constraint it provides for understanding sensory data.

    66    

    67 ** What problems can imagination solve?

    68 

    69    Consider a video of a cat drinking some water.

    70 

    71    #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.

    72    #+ATTR_LaTeX: width=5cm

    73    [[../images/cat-drinking.jpg]]

    74 

    75    It is currently impossible for any computer program to reliably

    76    label such a video as "drinking". I think humans are able to label

    77    such a video as "drinking" because they imagine /themselves/ as the

    78    cat, and imagine putting their face up against a stream of water

    79    and sticking out their tongue. In that imagined world, they can

    80    feel the cool water hitting their tongue, and feel the water

    81    entering their body, and are able to recognize that /feeling/ as

    82    drinking. So, the label of the action is not really in the pixels

    83    of the image, but is found clearly in a simulation inspired by

    84    those pixels. An imaginative system, having been trained on

    85    drinking and non-drinking examples and learning that the most

    86    important component of drinking is the feeling of water sliding

    87    down one's throat, would analyze a video of a cat drinking in the

    88    following manner:

    89    

    90    - Create a physical model of the video by putting a "fuzzy" model

    91      of its own body in place of the cat. Also, create a simulation of

    92      the stream of water.

    93 

    94    - Play out this simulated scene and generate imagined sensory

    95      experience. This will include relevant muscle contractions, a

    96      close up view of the stream from the cat's perspective, and most

    97      importantly, the imagined feeling of water entering the mouth.

    98 

    99    - The action is now easily identified as drinking by the sense of

   100      taste alone. The other senses (such as the tongue moving in and

   101      out) help to give plausibility to the simulated action. Note that

   102      the sense of vision, while critical in creating the simulation,

   103      is not critical for identifying the action from the simulation.

   104 

   105    More generally, I expect imaginative systems to be particularly

   106    good at identifying embodied actions in videos.

   107 

   108 * Cortex

   109 

   110   The previous example involves liquids, the sense of taste, and

   111   imagining oneself as a cat. For this thesis I constrain myself to

   112   simpler, more easily digitizable senses and situations.

   113 

   114   My system, =CORTEX=, performs imagination in two different simplified

   115   worlds: /worm world/ and /stick-figure world/. In each of these

   116   worlds, entities capable of imagination recognize actions by

   117   simulating the experience from their own perspective, and then

   118   recognizing the action from a database of examples.

   119 

   120   In order to serve as a framework for experiments in imagination,

   121   =CORTEX= requires simulated bodies, worlds, and senses like vision,

   122   hearing, touch, proprioception, etc.

   123 

   124 ** A Video Game Engine takes care of some of the groundwork

   125 

   126    When it comes to simulation environments, the engines used to

   127    create the worlds in video games offer top-notch physics and

   128    graphics support. These engines also have limited support for

   129    creating cameras and rendering 3D sound, which can be repurposed

   130    for vision and hearing respectively. Physics collision detection

   131    can be expanded to create a sense of touch.

   132    

   133    jMonkeyEngine3 is one such engine for creating video games in

   134    Java. It uses OpenGL to render to the screen and uses scene graphs

   135    to avoid drawing things that do not appear on the screen. It has an

   136    active community and several games in the pipeline. The engine was

   137    not built to serve any particular game but is instead meant to be

   138    used for any 3D game. I chose jMonkeyEngine3 because it had the

   139    most features out of all the open projects I looked at, and because

   140    I could then write my code in Clojure, a dialect of Lisp

   141    that runs on the JVM.

   142 

   143 ** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

   144 

   145    Using the game-making primitives provided by jMonkeyEngine3, I have

   146    constructed every major human sense except for smell and

   147    taste. =CORTEX= also provides an interface for creating creatures

   148    in Blender, a 3D modeling environment, and then "rigging" the

   149    creatures with senses using 3D annotations in Blender. A creature

   150    can have any number of senses, and there can be any number of

   151    creatures in a simulation.

   152    

   153    The senses available in =CORTEX= are:

   154 

   155    - [[../../cortex/html/vision.html][Vision]]

   156    - [[../../cortex/html/hearing.html][Hearing]]

   157    - [[../../cortex/html/touch.html][Touch]]

   158    - [[../../cortex/html/proprioception.html][Proprioception]]

   159    - [[../../cortex/html/movement.html][Muscle Tension]]

   160 

   161 * A roadmap for =CORTEX= experiments

   162 

   163 ** Worm World

   164 

   165    Worms in =CORTEX= are segmented creatures which vary in length and

   166    number of segments, and have the senses of vision, proprioception,

   167    touch, and muscle tension.

   168 

   169 #+attr_html: width=755

   170 #+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).

   171 [[../images/finger-UV.png]]
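
   As a rough illustration of how such a profile can be read, the
   sketch below treats the image as a grid of 0/1 values in which each
   white pixel (1) marks one touch sensor. This is not the actual
   =CORTEX= code, which is written in Clojure; the grid, the function
   names =sensor_coordinates= and =sensor_density=, and the 0/1
   encoding are all assumptions made for the example.

#+begin_src python
def sensor_coordinates(profile):
    """Return (row, col) for every white pixel, i.e. every touch sensor."""
    return [(r, c)
            for r, row in enumerate(profile)
            for c, pixel in enumerate(row)
            if pixel == 1]


def sensor_density(profile, rows, cols):
    """Fraction of white pixels inside the given row and column ranges."""
    cells = [profile[r][c] for r in rows for c in cols]
    return sum(cells) / len(cells)


# Hypothetical 4x4 profile: densely sensitive in the upper left,
# sparsely sensitive in the lower right.
profile = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

sensors = sensor_coordinates(profile)
dense = sensor_density(profile, range(0, 2), range(0, 2))
sparse = sensor_density(profile, range(2, 4), range(2, 4))
#+end_src

   Regions with many white pixels yield many sensor coordinates and
   therefore finer touch resolution; sparse regions yield few.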

   172 

   173 

   174 #+begin_html

   175 <div class="figure">

   176   <center>

   177     <video controls="controls" width="550"

   178            preload="none">

   179       <source src="../video/worm-touch.ogg" type="video/ogg" />

   180     </video>

   181     <br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>

   182   </center>

   183   <p>The worm responds to touch.</p>

   184 </div>

   185 #+end_html

   186 

   187 #+begin_html

   188 <div class="figure">

   189   <center>

   190     <video controls="controls" width="550"

   191            preload="none">

   192       <source src="../video/test-proprioception.ogg" type="video/ogg" />

   193     </video>

   194     <br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>

   195   </center>

   196   <p>Proprioception in a worm. The proprioceptive readout is

   197     in the upper left corner of the screen.</p>

   198 </div>

   199 #+end_html

   200 

   201    A worm is trained in various actions such as sinusoidal movement,

   202    curling, flailing, and spinning by directly playing motor

   203    contractions while the worm "feels" the experience. These actions

   204    are recorded both as vectors of muscle tension, touch, and

   205    proprioceptive data and in higher-level forms such as

   206    frequencies of the various contractions and a symbolic name for the

   207    action.
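
   One possible shape for such a recording is sketched below. It
   stores the raw per-frame sensory vectors alongside a symbolic name,
   and derives one higher-level feature, the dominant contraction
   frequency of a muscle, with a plain discrete Fourier transform. The
   class =ActionRecord=, its fields, and the 60 frames-per-second rate
   are hypothetical and chosen only for illustration; they do not
   describe the real =CORTEX= data format.

#+begin_src python
import cmath
import math
from dataclasses import dataclass, field


@dataclass
class ActionRecord:
    name: str                    # symbolic name of the action
    frame_rate: float            # frames per second (assumed constant)
    muscle: list = field(default_factory=list)          # one vector per frame
    touch: list = field(default_factory=list)           # one vector per frame
    proprioception: list = field(default_factory=list)  # one vector per frame

    def add_frame(self, muscle, touch, proprioception):
        """Append the sensory vectors experienced during one frame."""
        self.muscle.append(list(muscle))
        self.touch.append(list(touch))
        self.proprioception.append(list(proprioception))

    def dominant_frequency(self, channel):
        """Frequency in Hz of the strongest oscillation of one muscle."""
        signal = [frame[channel] for frame in self.muscle]
        n = len(signal)
        mean = sum(signal) / n
        centered = [x - mean for x in signal]  # drop the constant offset
        best_k, best_power = 0, 0.0
        for k in range(1, n // 2 + 1):
            coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                        for t, x in enumerate(centered))
            if abs(coeff) > best_power:
                best_k, best_power = k, abs(coeff)
        return best_k * self.frame_rate / n


# One second of a single muscle contracting three times per second.
record = ActionRecord(name="wiggle", frame_rate=60.0)
for t in range(60):
    activation = 0.5 + 0.5 * math.sin(2 * math.pi * 3 * t / 60)
    record.add_frame(muscle=[activation], touch=[0.0], proprioception=[0.0])

wiggle_hz = record.dominant_frequency(channel=0)
#+end_src

   Keeping the raw vectors and the derived frequency in one record
   lets later matching work at whichever level is more convenient.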

   208 

   209    Then, the worm watches a video of another worm performing one of

   210    the actions, and must judge which action was performed. Normally

   211    this would be an extremely difficult problem, but the worm is able

   212    to greatly diminish the search space through sympathetic

   213    imagination. First, it creates an imagined copy of its body which

   214    it observes from a third person point of view. Then for each frame

   215    of the video, it maneuvers its simulated body to be in registration

   216    with the worm depicted in the video. The physical constraints

   217    imposed by the physics simulation greatly decrease the number of

   218    poses that have to be tried, making the search feasible. As the

   219    imaginary worm moves, it generates imaginary muscle tension and

   220    proprioceptive sensations. The worm determines the action not by

   221    vision, but by matching the imagined proprioceptive data with

   222    previous examples.
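
   The final matching step could be as simple as a nearest-neighbor
   lookup, sketched below. This is an assumption on my part for the
   sake of illustration, not a description of the matching method used
   in =CORTEX=. Each trace is a sequence of joint-angle vectors, all
   traces are assumed to have the same length, and the names
   =trace_distance= and =identify= are hypothetical.

#+begin_src python
import math


def trace_distance(a, b):
    """Mean Euclidean distance between two equal-length proprioceptive traces."""
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def identify(imagined, examples):
    """Return the name of the stored example closest to the imagined trace."""
    return min(examples,
               key=lambda name: trace_distance(imagined, examples[name]))


# Hypothetical training examples: one joint angle per frame, four frames.
examples = {
    "curl":   [[0.0], [0.5], [1.0], [1.5]],
    "wiggle": [[0.0], [0.5], [0.0], [-0.5]],
}

# Imagined proprioception produced while following the video.
imagined = [[0.1], [0.4], [0.1], [-0.6]]

action = identify(imagined, examples)
#+end_src

   No pixel data appears in this step; only the imagined body
   sensations are compared against stored experience.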

   223 

   224    By using non-visual sensory data such as touch, the worms can also

   225    answer body related questions such as "did your head touch your

   226    tail?" and "did worm A touch worm B?"

   227 

   228    The proprioceptive information used for action identification is

   229    body-centric, so only the registration step is dependent on point

   230    of view, not the identification step. Registration is not specific

   231    to any particular action. Thus, action identification can be

   232    divided into a point-of-view dependent generic registration step,

   233    and an action-specific step that is body-centered and invariant to

   234    point of view.
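
   The invariance claim can be checked on a toy example. The sketch
   below models a worm as a two-dimensional chain of points (a
   simplifying assumption), computes the bend angle at each joint, and
   confirms that rotating the whole scene, which stands in for a
   change of viewpoint, leaves those angles unchanged. The helper
   names =joint_angles= and =rotate= are hypothetical.

#+begin_src python
import math


def joint_angles(points):
    """Signed bend angle at each interior joint of a 2D chain of points."""
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        heading_in = math.atan2(b[1] - a[1], b[0] - a[0])
        heading_out = math.atan2(c[1] - b[1], c[0] - b[0])
        bend = heading_out - heading_in
        # Wrap the difference back into the range [-pi, pi].
        angles.append(math.atan2(math.sin(bend), math.cos(bend)))
    return angles


def rotate(points, theta):
    """Rotate every point about the origin by theta radians."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [(x * cos_t - y * sin_t, x * sin_t + y * cos_t)
            for x, y in points]


worm = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
seen_from_elsewhere = rotate(worm, 1.2)

body_angles = joint_angles(worm)
rotated_angles = joint_angles(seen_from_elsewhere)
invariant = all(math.isclose(p, q, abs_tol=1e-9)
                for p, q in zip(body_angles, rotated_angles))
#+end_src

   The point coordinates change with the viewpoint, but the joint
   angles, which are what proprioception reports, do not.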

   235 

   236 ** Stick Figure World

   237 

   238    This environment is similar to Worm World, except the creatures are

   239    more complicated and the actions and questions more varied. It is

   240    an experiment to see how far imagination can go in interpreting

   241    actions.
author	Robert McIntyre <rlm@mit.edu>
date	Mon, 21 Apr 2014 02:11:29 -0400