#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a
system for designing and simulating creatures with rich senses,
which you may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous
corporeal experience, we greatly constrain the possibilities of what
would otherwise be an unwieldy exponential search. This extra
constraint can be the difference between easily understanding what
is happening in a video and being completely lost in a sea of
incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider, for example, the problem of determining what is happening
in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". And rightly so -- it is a very
hard problem! What features can you describe in terms of low-level
functions of pixels that can even begin to describe at a high level
what is happening here?

Or suppose that you are building a program that recognizes
chairs. How could you ``see'' the chair in figure
\ref{invisible-chair} and figure \ref{hidden-chair}?
#+caption: When you look at this, do you think ``chair''? I certainly do.
#+name: invisible-chair
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any computer program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference between
how the girl's /muscles/ are working in figure \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chairs show us that we are strongly triggered by cues
relating to the position of human bodies, and that we can
determine the overall physical configuration of a human body even
if much of that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to
imagined positions triggered by images of the human body.

** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts (a sketch of how these phases might fit together in code
follows the list):

- Free/Guided Play :: The creature moves around and experiences the
     world through its unique perspective. Many otherwise
     complicated actions are easily described in the language of a
     full suite of body-centered, rich senses. For example,
     drinking is the feeling of water sliding down your throat and
     cooling your insides. It's often accompanied by bringing your
     hand close to your face, or bringing your face close to
     water. Sitting down is the feeling of bending your knees,
     activating your quadriceps, then feeling a surface with your
     bottom and relaxing your legs. These body-centered action
     descriptions can be either learned or hard coded.
- Alignment :: When trying to interpret a video or image, the
     creature takes a model of itself and aligns it with
     whatever it sees. This can be a rather loose
     alignment that can cross species, as when humans try
     to align themselves with things like ponies, dogs,
     or other humans with a different body type.
- Empathy :: The alignment triggers memories of previous
     experience. For example, the alignment itself easily
     maps to proprioceptive data. Any sounds or obvious
     skin contact in the video can to a lesser extent
     trigger previous experience. The creature's previous
     experience is chained together in short bursts to
     coherently describe the new scene.
- Recognition :: With the scene now described in terms of past
     experience, the creature can now run its
     action-identification programs on this synthesized
     sensory data, just as it would if it were actually
     experiencing the scene first-hand. If previous
     experience has been accurately retrieved, and if
     it is analogous enough to the scene, then the
     creature will correctly identify the action in the
     scene.
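What follows is only a schematic sketch of how these four phases
might compose, not the actual =EMPATH= code: the names
=align-model=, =infer-sensations=, =drinking?=, and =sitting?= are
hypothetical placeholders standing in for the aligner, the empathy
step, and the body-centered action predicates.

#+begin_src clojure
;; Hypothetical sketch of the four-phase pipeline described above.
;; The declared helpers are placeholder names, not the real EMPATH API.
(declare align-model infer-sensations drinking? sitting?)

(defn interpret-scene
  "Guess the action in `video` by aligning a body model to it
   (Alignment), filling in imagined sensations from past experience
   (Empathy), and running ordinary body-centered predicates on the
   result (Recognition). `experiences` comes from Free/Guided Play."
  [body-model experiences video]
  (let [alignment (align-model body-model video)
        imagined  (infer-sensations experiences alignment)]
    ;; Return the first action label whose predicate accepts the
    ;; imagined sensory data.
    (some (fn [[label predicate?]]
            (when (predicate? imagined) label))
          {:drinking drinking?
           :sitting  sitting?})))
#+end_src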
For example, I think humans are able to label the cat video as
"drinking" because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and learning that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

1. Create a physical model of the video by putting a "fuzzy" model
   of its own body in place of the cat. Possibly also create a
   simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close-up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the
   mouth. The imagined sensory experience can come from a
   simulation of the event, but can also be pattern-matched from
   previous, similar embodied experience.

3. The action is now easily identified as drinking by the sense of
   taste alone. The other senses (such as the tongue moving in and
   out) help to give plausibility to the simulated action. Note that
   the sense of vision, while critical in creating the simulation,
   is not critical for identifying the action from the simulation.

For the chair examples, the process is even easier:

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related
   sensory experience associated with that particular proprioceptive
   feeling.

4. Retrieve the feeling of your bottom resting on a surface, your
   knees bent, and your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?=
   sensory predicate (a sketch of such a predicate follows this
   list), so you (and the entity in the image) must be sitting.

6. There must be a chair-like object since you are sitting.
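To make the role of the =sitting?= predicate concrete, here is a
minimal sketch of what such a predicate might look like when written
against imagined sensory data. The helpers =joint-angle=,
=activation=, and =contact=, the region =pelvis-underside=, and the
thresholds are all invented for illustration; they are not the
actual =EMPATH= definitions.

#+begin_src clojure
;; Illustrative only: joint names, helpers, and thresholds are
;; invented here, not taken from EMPATH.
(declare joint-angle activation contact pelvis-underside)

(defn sitting?
  "True when the imagined experience describes bent knees, relaxed
   leg muscles, and pressure on the underside of the pelvis."
  [experiences]
  (let [{:keys [proprioception muscle touch]} (peek experiences)]
    (and (< 70 (joint-angle proprioception :knee) 110)
         (< (activation muscle :quadriceps) 0.2)
         (< 0.5 (contact pelvis-underside touch)))))
#+end_src

The important point is the shape of the definition: a small
conjunction over body-centered quantities, with no reference to
pixels at all.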
Empathy offers yet another alternative to the age-old AI
representation question: ``What is a chair?'' --- A chair is the
feeling of sitting.

My program, =EMPATH=, uses this empathic problem solving technique
to interpret the actions of a simple, worm-like creature.

#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-intro-white.png]]

#+caption: The actions of a worm in a video can be recognized by
#+caption: proprioceptive data and sensory predicates by filling
#+caption: in the missing sensory detail with previous experience.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 15cm
[[./images/worm-poses.png]]

One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered, rich viewpoint. This way, if you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach
such as TODO:REFERENCE. If these systems learn about running as
viewed from the side, they will not automatically be able to
recognize running from any other viewpoint.

Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers a
massive boost in descriptive capability. Consider how difficult it
would be to compose a set of HOG filters to describe the action of
a simple worm-creature "curling" so that its head touches its tail,
and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}):

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing

** =CORTEX= is a toolkit for building sensate creatures

Hand integration demo

** Contributions

* Building =CORTEX=

** To explore embodiment, we need a world, body, and senses

** Because of Time, simulation is preferable to reality

** Video game engines are a great starting point

** Bodies are composed of segments connected by joints

** Eyes reuse standard video game components

** Hearing is hard; =CORTEX= does it right

** Touch uses hundreds of hair-like elements

** Proprioception is the sense that makes everything ``real''

** Muscles are both effectors and sensors

** =CORTEX= brings complex creatures to life!

** =CORTEX= enables many possibilities for further research

* Empathy in a simulated worm

** Embodiment factors action recognition into manageable parts

** Action recognition is easy with a full gamut of senses

** Digression: bootstrapping touch using free exploration

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions

- Built =CORTEX=, a comprehensive platform for embodied AI
  experiments. It has many new features lacking in other systems,
  such as sound, and makes it easy to model and create new creatures.
- Created a novel concept for action recognition using artificial
  imagination.

In the second half of the thesis I develop a computational model of
empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of
sensations that creature is feeling. My empathy algorithm involves
multiple phases. First is free play, where the creature moves around
and gains sensory experience. From this experience I construct a
representation of the creature's sensory state space, which I call
\Phi-space. Using \Phi-space, I construct an efficient function for
enriching the limited data that comes from observing another
creature with a full complement of imagined sensory data based on
previous experience. I can then use the imagined sensory data to
recognize what the observed creature is doing and feeling, using
straightforward embodied action predicates. This is all demonstrated
using a simple worm-like creature, and recognizing worm actions
based on limited data.

Embodied representation using multiple senses such as touch,
proprioception, and muscle tension turns out to be exceedingly
efficient at describing body-centered actions. It is the ``right
language for the job''. For example, it takes only around 5 lines of
LISP code to describe the action of ``curling'' using embodied
primitives. It takes about 8 lines to describe the seemingly
complicated action of wiggling.
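As a rough illustration of that claim, a curling predicate of about
that size could be written purely in terms of proprioception. The
=bend-angle= helper and the threshold below are assumptions made for
this sketch; the actual =curled?= referenced by =grand-circle?=
above is defined later in the thesis.

#+begin_src clojure
;; Sketch only: bend-angle and the 2.5 radian threshold are invented
;; here; they are not the thesis's real definition of curled?.
(declare bend-angle)

(defn curled?
  "True when the worm's joints are, in total, flexed into a tight curl."
  [experiences]
  (let [bends (map bend-angle (:proprioception (peek experiences)))]
    (< 2.5 (reduce + bends))))
#+end_src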
* COMMENT names for cortex
- bioland