#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to interpreting video using embodiment and empathy. You will have also seen one way to efficiently implement empathy for embodied creatures. Finally, you will become familiar with =CORTEX=, a system for designing and simulating creatures with rich senses, which you may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways in which we understand others is by imagining ourselves in their position and empathically feeling experiences relative to our own bodies. By understanding events in terms of our own previous corporeal experience, we greatly constrain the possibilities of what would otherwise be an unwieldy exponential search. This extra constraint can be the difference between easily understanding what is happening in a video and being completely lost in a sea of incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider for example the problem of determining what is happening in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably label such a video as ``drinking''. And rightly so -- it is a very hard problem! What features can you describe in terms of low-level functions of pixels that can even begin to describe at a high level what is happening here?

Or suppose that you are building a program that recognizes chairs. How could you ``see'' the chair in figure \ref{hidden-chair}?

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any modern computer vision program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference between how the girl's /muscles/ are working in the two images of figure \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 7cm
[[./images/wall-push.png]]

Each of these examples tells us something about what might be going on in our minds as we easily solve these recognition problems.

The hidden chair shows us that we are strongly triggered by cues relating to the position of human bodies, and that we can determine the overall physical configuration of a human body even if much of that body is occluded.

The picture of the girl pushing against the wall tells us that we have common sense knowledge about the kinetics of our own bodies. We know well how our muscles would have to work to maintain us in most positions, and we can easily project this self-knowledge to imagined positions triggered by images of the human body.

** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition problems above in a form amenable to computation. It is split into four parts (see the sketch after this list):

- Free/Guided Play :: The creature moves around and experiences the world through its unique perspective. Many otherwise complicated actions are easily described in the language of a full suite of body-centered, rich senses. For example, drinking is the feeling of water sliding down your throat and cooling your insides. It's often accompanied by bringing your hand close to your face, or bringing your face close to water. Sitting down is the feeling of bending your knees, activating your quadriceps, then feeling a surface with your bottom and relaxing your legs. These body-centered action descriptions can be either learned or hard-coded.
- Posture Imitation :: When trying to interpret a video or image, the creature takes a model of itself and aligns it with whatever it sees. This alignment can even cross species, as when humans try to align themselves with things like ponies, dogs, or other humans with a different body type.
- Empathy :: The alignment triggers associations with sensory data from prior experiences. For example, the alignment itself easily maps to proprioceptive data. Any sounds or obvious skin contact in the video can to a lesser extent trigger previous experience. Segments of previous experiences are stitched together to form a coherent and complete sensory portrait of the scene.
- Recognition :: With the scene described in terms of first-person sensory events, the creature can now run its action-identification programs on this synthesized sensory data, just as it would if it were actually experiencing the scene first-hand. If previous experience has been accurately retrieved, and if it is analogous enough to the scene, then the creature will correctly identify the action in the scene.
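To make the flow between these four parts concrete, here is a minimal sketch of how they might compose in code. This is not the actual =EMPATH= implementation; =align-model= and =infer-senses= are hypothetical helpers standing in for the posture-imitation and empathy phases.

#+begin_src clojure
;; Hedged sketch of the four-part pipeline. The helpers align-model
;; and infer-senses are hypothetical placeholders, not real EMPATH
;; functions. prior-experiences comes from the free/guided play phase.
(defn empath-recognize
  "Label the action in a video frame by aligning a self-model to it,
  enriching the alignment into imagined sensory data, and running
  first-person action predicates over that data."
  [self-model prior-experiences video-frame action-predicates]
  (let [alignment (align-model self-model video-frame)        ; posture imitation
        imagined  (infer-senses alignment prior-experiences)] ; empathy
    ;; recognition: keep the labels of every predicate that fires
    (for [[label predicate?] action-predicates
          :when (predicate? imagined)]
      label)))
#+end_src

Here =action-predicates= would be a map from labels to body-centered predicates like the =grand-circle?= example shown later.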
For example, I think humans are able to label the cat video as ``drinking'' because they imagine /themselves/ as the cat, and imagine putting their face up against a stream of water and sticking out their tongue. In that imagined world, they can feel the cool water hitting their tongue, and feel the water entering their body, and are able to recognize that /feeling/ as drinking. So, the label of the action is not really in the pixels of the image, but is found clearly in a simulation inspired by those pixels. An imaginative system, having been trained on drinking and non-drinking examples and learning that the most important component of drinking is the feeling of water sliding down one's throat, would analyze a video of a cat drinking in the following manner:

1. Create a physical model of the video by putting a ``fuzzy'' model of its own body in place of the cat. Possibly also create a simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory experience. This will include relevant muscle contractions, a close-up view of the stream from the cat's perspective, and most importantly, the imagined feeling of water entering the mouth. The imagined sensory experience can come from a simulation of the event, but can also be pattern-matched from previous, similar embodied experience (see the sketch after this list).

3. The action is now easily identified as drinking by the sense of taste alone. The other senses (such as the tongue moving in and out) help to give plausibility to the simulated action. Note that the sense of vision, while critical in creating the simulation, is not critical for identifying the action from the simulation.
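As a minimal sketch of the pattern-matching mentioned in step 2, suppose each stored experience carries a =:proprioception= feature vector, and that plain Euclidean distance over those vectors is an acceptable similarity measure; both are simplifying assumptions for illustration, not the thesis's actual mechanism.

#+begin_src clojure
;; Hedged sketch: retrieve the prior experience whose proprioceptive
;; signature best matches an imagined posture, so that its other
;; senses (touch, taste, ...) can stand in for full simulation.
(defn proprioceptive-distance
  "Euclidean distance between two equal-length feature vectors."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn closest-experience
  "Find the stored experience nearest to the imagined posture."
  [imagined-posture prior-experiences]
  (apply min-key
         #(proprioceptive-distance imagined-posture (:proprioception %))
         prior-experiences))
#+end_src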
For the chair examples, the process is even easier:

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related sensory experience associated with that particular proprioceptive feeling.

4. Retrieve the feeling of your bottom resting on a surface, your knees bent, and your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?= sensory predicate, so you (and the entity in the image) must be sitting (a sketch of such a predicate follows below).

6. There must be a chair-like object since you are sitting.

Empathy offers yet another alternative to the age-old AI representation question: ``What is a chair?'' --- A chair is the feeling of sitting.
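The =sitting?= predicate is only named above; here is a hedged sketch of what it could look like, written in the style of the worm predicates shown later. The accessors =knee-angles=, =leg-muscle-tension=, and =seat-contact=, and all thresholds, are assumptions for illustration.

#+begin_src clojure
(defn sitting?
  "Sketch only: knees bent, leg muscles relaxed, and bottom in
  contact with a surface. All accessors and thresholds here are
  hypothetical, not part of the real EMPATH code."
  [experiences]
  (let [now (peek experiences)]
    (and (every? #(> % 1.0) (knee-angles now))        ; knees bent well past straight
         (every? #(< % 0.2) (leg-muscle-tension now)) ; legs mostly relaxed
         (< 0.5 (seat-contact now)))))                ; bottom touching something
#+end_src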
My program, =EMPATH=, uses this empathic problem solving technique to interpret the actions of a simple, worm-like creature.

#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-intro-white.png]]

#+caption: =EMPATH= recognized and classified each of these poses by
#+caption: inferring the complete sensory experience from
#+caption: proprioceptive data.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-poses.png]]

One powerful advantage of empathic problem solving is that it factors the action recognition problem into two easier problems. To use empathy, you need an /aligner/, which takes the video and a model of your body, and aligns the model with the video. Then, you need a /recognizer/, which uses the aligned model to interpret the action. The power in this method lies in the fact that you describe all actions from a body-centered viewpoint. You are less tied to the particulars of any visual representation of the actions. If you teach the system what ``running'' is, and you have a good enough aligner, the system will from then on be able to recognize running from any point of view, even strange points of view like above or underneath the runner. This is in contrast to action recognition schemes that try to identify actions using a non-embodied approach. If these systems learn about running as viewed from the side, they will not automatically be able to recognize running from any other viewpoint.

Another powerful advantage is that using the language of multiple body-centered rich senses to describe body-centered actions offers a massive boost in descriptive capability. Consider how difficult it would be to compose a set of HOG filters to describe the action of a simple worm-creature ``curling'' so that its head touches its tail, and then behold the simplicity of describing this action in a language designed for the task (listing \ref{grand-circle-intro}):

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing
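=grand-circle?= builds on a simpler =curled?= predicate over proprioceptive data. As a sketch of what such a predicate could look like, consistent with the call above, assume each proprioceptive sample is a =[heading pitch bend]= triple per joint; the threshold value is likewise an assumption.

#+begin_src clojure
(defn curled?
  "Sketch: is every joint of the worm bent in the same direction?
  Assumes each proprioceptive sample is a [heading pitch bend]
  triple per joint; the 0.64 threshold is an assumption."
  [experiences]
  (every? (fn [[_ _ bend]] (> (Math/sin bend) 0.64))
          (:proprioception (peek experiences))))
#+end_src

A =wiggling?= predicate would similarly look for periodic oscillation in the same proprioceptive stream over a short window of experiences.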
** =CORTEX= is a toolkit for building sensate creatures

I built =CORTEX= to be a general AI research platform for doing experiments involving multiple rich senses and a wide variety and number of creatures. I intend it to be useful as a library for many more projects than just this one. =CORTEX= meets a need among AI researchers at CSAIL and beyond: people often invent neat ideas that are best expressed in the language of creatures and senses, but in order to explore those ideas they must first build a platform in which they can create simulated creatures with rich senses! There are many ideas that would be simple to execute (such as =EMPATH=), but attached to them is the multi-month effort of building a good creature simulator. Often, that initial investment of time proves to be too much, and the project must make do with a lesser environment.

=CORTEX= is well suited as an environment for embodied AI research for three reasons:

- You can create new creatures using Blender, a popular 3D modeling program. Each sense can be specified using special Blender nodes with biologically inspired parameters. You need not write any code to create a creature, and can use a wide library of pre-existing Blender models as a base for your own creatures.

- =CORTEX= implements a wide variety of senses, including touch, proprioception, vision, hearing, and muscle tension. Complicated senses like touch and vision involve multiple sensory elements embedded in a 2D surface. You have complete control over the distribution of these sensor elements through the use of simple png image files. In particular, =CORTEX= implements more comprehensive hearing than any other creature simulation system available.

- =CORTEX= supports any number of creatures and any number of senses. Time in =CORTEX= dilates so that the simulated creatures always perceive a perfectly smooth flow of time, regardless of the actual computational load.

=CORTEX= is built on top of =jMonkeyEngine3=, which is a video game engine designed to create cross-platform 3D desktop games. =CORTEX= is mainly written in Clojure, a dialect of =LISP= that runs on the Java Virtual Machine (JVM). The API for creating and simulating creatures is entirely expressed in Clojure. Hearing is implemented as a layer of Clojure code on top of a layer of Java code on top of a layer of =C++= code which implements a modified version of =OpenAL= to support multiple listeners. =CORTEX= is the only simulation environment that I know of that can support multiple entities that can each hear the world from their own perspective. Other senses also require a small layer of Java code. =CORTEX= also uses =bullet=, a physics simulator written in =C++=.

#+caption: Here is the worm from above modeled in Blender, a free
#+caption: 3D-modeling program. Senses and joints are described
#+caption: using special nodes in Blender.
#+name: blender-worm
#+ATTR_LaTeX: :width 12cm
[[./images/blender-worm.png]]

During one test with =CORTEX=, I created 3,000 entities each with their own independent senses and ran them all at only 1/80 real time. In another test, I created a detailed model of my own hand, equipped with a realistic distribution of touch (more sensitive at the fingertips), as well as eyes and ears, and it ran at around 1/4 real time.

#+caption: A model of my own hand, modeled in Blender and equipped
#+caption: with eyes, ears, and a realistic distribution of touch
#+caption: sensors.
#+name: full-hand
#+ATTR_LaTeX: :width 15cm
[[./images/full-hand.png]]
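To give a flavor of the intended workflow, here is a hedged sketch of loading a Blender-modeled creature and attaching its senses. The function names (=load-blender-model=, =body!=, =touch!=, =proprioception!=, =movement!=) are assumptions about the =CORTEX= API, shown only for illustration.

#+begin_src clojure
;; Hedged sketch: every function name below is an assumption about
;; the CORTEX API, used only to illustrate the intended workflow.
(defn make-worm
  "Load a creature modeled in Blender and attach its senses."
  []
  (let [model (load-blender-model "Models/worm/worm.blend")]
    {:body           (body! model)            ; physical body in the world
     :touch          (touch! model)           ; touch sensors placed via a png map
     :proprioception (proprioception! model)  ; joint-angle sense
     :muscles        (movement! model)}))     ; muscle effectors/sensors
#+end_src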
For example, it takes only around 5 lines of372 LISP code to describe the action of ``curling'' using embodied373 primitives. It takes about 8 lines to describe the seemingly374 complicated action of wiggling.378 * COMMENT names for cortex379 - bioland384 # An anatomical joke:385 # - Training386 # - Skeletal imitation387 # - Sensory fleshing-out388 # - Classification