
#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

     6 

     7 

* Empathy and Embodiment as problem solving strategies

  By the end of this thesis, you will have seen a novel approach to
  interpreting video using embodiment and empathy. You will also have
  seen one way to efficiently implement empathy for embodied
  creatures. Finally, you will become familiar with =CORTEX=, a system
  for designing and simulating creatures with rich senses, which you
  may choose to use in your own research.

  This is the core vision of my thesis: that one of the important ways
  in which we understand others is by imagining ourselves in their
  position and empathically feeling experiences relative to our own
  bodies. By understanding events in terms of our own previous
  corporeal experience, we greatly constrain the possibilities of what
  would otherwise be an unwieldy exponential search. This extra
  constraint can be the difference between easily understanding what
  is happening in a video and being completely lost in a sea of
  incomprehensible color and movement.

    26 

** Recognizing actions in video is extremely difficult

   Consider, for example, the problem of determining what is happening
   in a video of which this is one frame:

   #+caption: A cat drinking some water. Identifying this action is 
   #+caption: beyond the state of the art for computers.
   #+ATTR_LaTeX: :width 7cm
   [[./images/cat-drinking.jpg]]

   It is currently impossible for any computer program to reliably
   label such a video as ``drinking''. And rightly so -- it is a very
   hard problem! What features can you describe in terms of low-level
   functions of pixels that can even begin to describe at a high level
   what is happening here?

    42   

   Or suppose that you are building a program that recognizes chairs.
   How could you ``see'' the chair in figure \ref{invisible-chair} and
   figure \ref{hidden-chair}?

   #+caption: When you look at this, do you think ``chair''? I certainly do.
   #+name: invisible-chair
   #+ATTR_LaTeX: :width 10cm
   [[./images/invisible-chair.png]]

   #+caption: The chair in this image is quite obvious to humans, but I 
   #+caption: doubt that any computer program can find it.
   #+name: hidden-chair
   #+ATTR_LaTeX: :width 10cm
   [[./images/fat-person-sitting-at-desk.jpg]]

    57    

   Finally, how is it that you can easily tell the difference between
   how the girl's /muscles/ are working in the two images of figure
   \ref{girl}?

   #+caption: The mysterious ``common sense'' appears here as you are able 
   #+caption: to discern the difference in how the girl's arm muscles
   #+caption: are activated between the two images.
   #+name: girl
   #+ATTR_LaTeX: :width 10cm
   [[./images/wall-push.png]]

    67   

   Each of these examples tells us something about what might be going
   on in our minds as we easily solve these recognition problems.

   The hidden chairs show us that we are strongly triggered by cues
   relating to the position of human bodies, and that we can determine
   the overall physical configuration of a human body even if much of
   that body is occluded.

   The picture of the girl pushing against the wall tells us that we
   have common sense knowledge about the kinetics of our own bodies.
   We know well how our muscles would have to work to maintain us in
   most positions, and we can easily project this self-knowledge to
   imagined positions triggered by images of the human body.

    81 

** =EMPATH= neatly solves recognition problems

   I propose a system that can express the types of recognition
   problems above in a form amenable to computation. It is split into
   four parts:

   - Free/Guided Play (Training) :: The creature moves around and
        experiences the world through its unique perspective. Many
        otherwise complicated actions are easily described in the
        language of a full suite of body-centered, rich senses. For
        example, drinking is the feeling of water sliding down your
        throat and cooling your insides. It's often accompanied by
        bringing your hand close to your face, or bringing your face
        close to water. Sitting down is the feeling of bending your
        knees, activating your quadriceps, then feeling a surface with
        your bottom and relaxing your legs. These body-centered action
        descriptions can be either learned or hard-coded.
   - Alignment (Posture imitation) :: When trying to interpret a video
        or image, the creature takes a model of itself and aligns it
        with whatever it sees. This alignment can even cross species,
        as when humans try to align themselves with things like
        ponies, dogs, or other humans with a different body type.
   - Empathy (Sensory extrapolation) :: The alignment triggers
        associations with sensory data from prior experiences. For
        example, the alignment itself easily maps to proprioceptive
        data. Any sounds or obvious skin contact in the video can, to a
        lesser extent, trigger previous experience. Segments of
        previous experiences are stitched together to form a coherent
        and complete sensory portrait of the scene.
   - Recognition (Classification) :: With the scene described in terms
        of first-person sensory events, the creature can now run its
        action-identification programs on this synthesized sensory
        data, just as it would if it were actually experiencing the
        scene first-hand. If previous experience has been accurately
        retrieved, and if it is analogous enough to the scene, then
        the creature will correctly identify the action in the scene.
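   The last three phases compose naturally as a pipeline. The
   following is only a minimal sketch under stated assumptions: the
   names =interpret=, =align=, =empathize=, and =recognize=, as well as
   the toy data, are hypothetical and are not taken from =EMPATH=. The
   stages are passed in as plain functions so that only the
   composition is shown.

   #+begin_src clojure
(defn interpret
  "Hypothetical sketch: label a scene by aligning a body model to it,
   filling in imagined senses, and running an action classifier."
  [{:keys [align empathize recognize]} body-model scene]
  (-> (align body-model scene)   ; posture imitation -> proprioception
      (empathize)                ; sensory extrapolation from experience
      (recognize)))              ; classification on imagined senses

;; Toy stand-ins for the real stages (all of these are assumptions).
(def toy-stages
  {:align     (fn [_body scene]
                {:proprioception (:pose scene)})
   :empathize (fn [partial]
                (assoc partial :touch
                       (if (= :seated (:proprioception partial))
                         :bottom-contact
                         :none)))
   :recognize (fn [senses]
                (if (= :bottom-contact (:touch senses))
                  :sitting
                  :unknown))})

(interpret toy-stages {:name "toy-body"} {:pose :seated})
   #+end_src

   Keeping each stage behind a plain function boundary mirrors the
   split described above: the classifier only ever sees first-person
   sensory data, never the scene itself.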

   118    

   For example, I think humans are able to label the cat video as
   ``drinking'' because they imagine /themselves/ as the cat, and
   imagine putting their face up against a stream of water and
   sticking out their tongue. In that imagined world, they can feel
   the cool water hitting their tongue, and feel the water entering
   their body, and are able to recognize that /feeling/ as drinking.
   So, the label of the action is not really in the pixels of the
   image, but is found clearly in a simulation inspired by those
   pixels. An imaginative system, having been trained on drinking and
   non-drinking examples and learning that the most important
   component of drinking is the feeling of water sliding down one's
   throat, would analyze a video of a cat drinking in the following
   manner:

   1. Create a physical model of the video by putting a ``fuzzy''
      model of its own body in place of the cat. Possibly also create
      a simulation of the stream of water.

   2. Play out this simulated scene and generate imagined sensory
      experience. This will include relevant muscle contractions, a
      close-up view of the stream from the cat's perspective, and most
      importantly, the imagined feeling of water entering the
      mouth. The imagined sensory experience can come from a
      simulation of the event, but can also be pattern-matched from
      previous, similar embodied experience.

   3. The action is now easily identified as drinking by the sense of
      taste alone. The other senses (such as the tongue moving in and
      out) help to give plausibility to the simulated action. Note that
      the sense of vision, while critical in creating the simulation,
      is not critical for identifying the action from the simulation.

   150 

   For the chair examples, the process is even easier:

   1. Align a model of your body to the person in the image.

   2. Generate proprioceptive sensory data from this alignment.

   3. Use the imagined proprioceptive data as a key to look up related
      sensory experience associated with that particular proprioceptive
      feeling.

   4. Retrieve the feeling of your bottom resting on a surface, your
      knees bent, and your leg muscles relaxed.

   5. This sensory information is consistent with the =sitting?=
      sensory predicate, so you (and the entity in the image) must be
      sitting.

   6. There must be a chair-like object since you are sitting.
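   Steps 3 through 5 can be sketched as a nearest-neighbor lookup
   followed by a sensory predicate. This is an illustration under
   assumptions, not the method used by =EMPATH=: the helper names
   =distance= and =recall=, the experience keys =:bottom-contact= and
   =:leg-tension=, the 0.2 threshold, and the two-entry memory are all
   hypothetical, and this =sitting?= is only a stand-in for the
   predicate named above.

   #+begin_src clojure
(defn distance
  "Euclidean distance between two equal-length vectors of joint angles."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn recall
  "Return the remembered experience whose proprioception is closest
   to the observed joint angles (steps 3 and 4)."
  [memory observed-angles]
  (apply min-key #(distance (:proprioception %) observed-angles) memory))

(defn sitting?
  "Hypothetical predicate: bottom resting on a surface, legs relaxed."
  [experience]
  (and (:bottom-contact experience)
       (< (:leg-tension experience) 0.2)))

;; Two remembered experiences from (imaginary) free play.
(def memory
  [{:proprioception [0.0 0.0] :bottom-contact false :leg-tension 0.8}   ; standing
   {:proprioception [1.5 1.5] :bottom-contact true  :leg-tension 0.1}]) ; sitting

;; Joint angles obtained by aligning the body model to the image.
(sitting? (recall memory [1.4 1.6]))
   #+end_src

   The observed angles never include touch or muscle data; those come
   entirely from the recalled experience, which is what lets the
   predicate conclude that a chair-like surface must be present.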

   169 

   Empathy offers yet another alternative to the age-old AI
   representation question: ``What is a chair?'' One answer is that a
   chair is the feeling of sitting.

   My program, =EMPATH=, uses this empathic problem solving technique
   to interpret the actions of a simple, worm-like creature.

   #+caption: The worm performs many actions during free play such as 
   #+caption: curling, wiggling, and resting.
   #+name: worm-intro
   #+ATTR_LaTeX: :width 15cm
   [[./images/worm-intro-white.png]]

   #+caption: =EMPATH= recognized and classified each of these poses by
   #+caption: inferring the complete sensory experience from 
   #+caption: proprioceptive data.
   #+name: worm-recognition-intro
   #+ATTR_LaTeX: :width 15cm
   [[./images/worm-poses.png]]

   189    

   One powerful advantage of empathic problem solving is that it
   factors the action recognition problem into two easier problems. To
   use empathy, you need an /aligner/, which takes the video and a
   model of your body, and aligns the model with the video. Then, you
   need a /recognizer/, which uses the aligned model to interpret the
   action. The power in this method lies in the fact that you describe
   all actions from a body-centered viewpoint. You are less tied to
   the particulars of any visual representation of the actions. If you
   teach the system what ``running'' is, and you have a good enough
   aligner, the system will from then on be able to recognize running
   from any point of view, even strange points of view like above or
   underneath the runner. This is in contrast to action recognition
   schemes that try to identify actions using a non-embodied approach
   such as TODO:REFERENCE. If these systems learn about running as
   viewed from the side, they will not automatically be able to
   recognize running from any other viewpoint.

   206 

   Another powerful advantage is that using the language of multiple
   body-centered rich senses to describe body-centered actions offers a
   massive boost in descriptive capability. Consider how difficult it
   would be to compose a set of HOG filters to describe the action of
   a simple worm-creature ``curling'' so that its head touches its
   tail, and then behold the simplicity of describing this action in a
   language designed for the task (listing \ref{grand-circle-intro}):

   214 

   #+caption: Body-centered actions are best expressed in a body-centered 
   #+caption: language. This code detects when the worm has curled into a 
   #+caption: full circle. Imagine how you would replicate this functionality
   #+caption: using low-level pixel features such as HOG filters!
   #+name: grand-circle-intro
   #+begin_listing clojure
   #+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       ;; Inspect the touch data of the most recent experience.
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         ;; Both tips must report more than 0.55 contact.
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip    head-touch))))))
   #+end_src
   #+end_listing

   233 

   234 

** =CORTEX= is a toolkit for building sensate creatures

   Hand integration demo

** Contributions

* Building =CORTEX=

** To explore embodiment, we need a world, body, and senses

** Because of Time, simulation is preferable to reality

** Video game engines are a great starting point

** Bodies are composed of segments connected by joints

** Eyes reuse standard video game components

** Hearing is hard; =CORTEX= does it right

** Touch uses hundreds of hair-like elements

** Proprioception is the sense that makes everything ``real''

** Muscles are both effectors and sensors

** =CORTEX= brings complex creatures to life!

** =CORTEX= enables many possibilities for further research

   264 

* Empathy in a simulated worm

** Embodiment factors action recognition into manageable parts

** Action recognition is easy with a full gamut of senses

** Digression: bootstrapping touch using free exploration

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions
  - Built =CORTEX=, a comprehensive platform for embodied AI
    experiments. It has many features lacking in other systems, such
    as sound, and makes it easy to model and create new creatures.
  - Created a novel concept for action recognition by using artificial
    imagination.

   285 

In the second half of the thesis I develop a computational model of
empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of sensations
that creature is feeling. My empathy algorithm involves multiple
phases. First is free-play, where the creature moves around and gains
sensory experience. From this experience I construct a representation
of the creature's sensory state space, which I call \Phi-space. Using
\Phi-space, I construct an efficient function for enriching the
limited data that comes from observing another creature with a full
complement of imagined sensory data based on previous experience. I
can then use the imagined sensory data to recognize what the observed
creature is doing and feeling, using straightforward embodied action
predicates. This is all demonstrated using a simple worm-like
creature, and recognizing worm-actions based on limited data.

   300 

Embodied representation using multiple senses such as touch,
proprioception, and muscle tension turns out to be exceedingly
efficient at describing body-centered actions. It is the ``right
language for the job''. For example, it takes only around 5 lines of
LISP code to describe the action of ``curling'' using embodied
primitives. It takes about 8 lines to describe the seemingly
complicated action of wiggling.
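To give a sense of scale, here is a minimal sketch of what such a
curling predicate might look like. It is written under assumptions and
is not necessarily the definition used in this thesis: it supposes that
each experience stores =:proprioception= as a sequence of per-joint
bend angles in radians, and the 0.64 threshold is an arbitrary
illustrative value.

#+begin_src clojure
(defn curled?
  "Hypothetical sketch: the worm counts as curled when every joint in
   the most recent experience is flexed past a threshold."
  [experiences]
  (every? (fn [bend] (> (Math/sin bend) 0.64))
          (:proprioception (peek experiences))))

;; One experience with three strongly flexed joints.
(curled? [{:proprioception [1.0 1.2 0.9]}])
#+end_src

The predicate never mentions pixels or viewpoints; it only asks what
the body would feel, which is why it stays this short.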

   308 

   309 

   310 

* COMMENT names for cortex
 - bioland

# An anatomical joke:
# - Training
# - Skeletal imitation
# - Sensory fleshing-out
# - Classification
author	Robert McIntyre <rlm@mit.edu>
date	Tue, 25 Mar 2014 11:30:15 -0400