
     1 #+title: =CORTEX=

     2 #+author: Robert McIntyre

     3 #+email: rlm@mit.edu

     4 #+description: Using embodied AI to facilitate Artificial Imagination.

     5 #+keywords: AI, clojure, embodiment

     6 

     7 

     8 * Empathy and Embodiment as problem solving strategies

     9   

    10   By the end of this thesis, you will have seen a novel approach to

    11   interpreting video using embodiment and empathy. You will have also

    12   seen one way to efficiently implement empathy for embodied

    13   creatures. Finally, you will become familiar with =CORTEX=, a

    14   system for designing and simulating creatures with rich senses,

    15   which you may choose to use in your own research.

    16   

    17   This is the core vision of my thesis: That one of the important ways

    18   in which we understand others is by imagining ourselves in their

    19   position and empathically feeling experiences relative to our own

    20   bodies. By understanding events in terms of our own previous

    21   corporeal experience, we greatly constrain the possibilities of what

    22   would otherwise be an unwieldy exponential search. This extra

    23   constraint can be the difference between easily understanding what

    24   is happening in a video and being completely lost in a sea of

    25   incomprehensible color and movement.

    26 

    27 ** Recognizing actions in video is extremely difficult

    28 

    29    Consider for example the problem of determining what is happening in

    30    a video of which this is one frame:

    31 

    32    #+caption: A cat drinking some water. Identifying this action is 

    33    #+caption: beyond the state of the art for computers.

    34    #+ATTR_LaTeX: :width 7cm

    35    [[./images/cat-drinking.jpg]]

    36    

    37    It is currently impossible for any computer program to reliably

    38    label such a video as "drinking".  And rightly so -- it is a very

    39    hard problem! What features can you describe in terms of low level

    40    functions of pixels that can even begin to describe at a high level

    41    what is happening here?

    42   

    43    Or suppose that you are building a program that recognizes

    44    chairs. How could you ``see'' the chair in figure

    45    \ref{invisible-chair} and figure \ref{hidden-chair}?

    46    

    47    #+caption: When you look at this, do you think ``chair''? I certainly do.

    48    #+name: invisible-chair

    49    #+ATTR_LaTeX: :width 10cm

    50    [[./images/invisible-chair.png]]

    51    

    52    #+caption: The chair in this image is quite obvious to humans, but I 

    53    #+caption: doubt that any computer program can find it.

    54    #+name: hidden-chair

    55    #+ATTR_LaTeX: :width 10cm

    56    [[./images/fat-person-sitting-at-desk.jpg]]

    57    

    58    Finally, how is it that you can easily tell the difference in how

    59    the girl's /muscles/ are working in the two images of figure \ref{girl}?

    60    

    61    #+caption: The mysterious ``common sense'' appears here as you are able 

    62    #+caption: to discern the difference in how the girl's arm muscles

    63    #+caption: are activated between the two images.

    64    #+name: girl

    65    #+ATTR_LaTeX: :width 10cm

    66    [[./images/wall-push.png]]

    67   

    68    Each of these examples tells us something about what might be going

    69    on in our minds as we easily solve these recognition problems.

    70    

    71    The hidden chairs show us that we are strongly triggered by cues

    72    relating to the position of human bodies, and that we can

    73    determine the overall physical configuration of a human body even

    74    if much of that body is occluded.

    75 

    76    The picture of the girl pushing against the wall tells us that we

    77    have common sense knowledge about the kinetics of our own bodies.

    78    We know well how our muscles would have to work to maintain us in

    79    most positions, and we can easily project this self-knowledge to

    80    imagined positions triggered by images of the human body.

    81 

    82 ** =EMPATH= neatly solves recognition problems  

    83    

    84    I propose a system that can express the types of recognition

    85    problems above in a form amenable to computation. It is split into

    86    four parts:

    87 

    88    - Free/Guided Play :: The creature moves around and experiences the

    89         world through its unique perspective. Many otherwise

    90         complicated actions are easily described in the language of a

    91         full suite of body-centered, rich senses. For example,

    92         drinking is the feeling of water sliding down your throat, and

    93         cooling your insides. It's often accompanied by bringing your

    94         hand close to your face, or bringing your face close to

    95         water. Sitting down is the feeling of bending your knees,

    96         activating your quadriceps, then feeling a surface with your

    97         bottom and relaxing your legs. These body-centered action

    98         descriptions can be either learned or hard coded.

    99    - Alignment :: When trying to interpret a video or image, the

   100                   creature takes a model of itself and aligns it with

   101                   whatever it sees. This can be a rather loose

   102                   alignment that can cross species, as when humans try

   103                   to align themselves with things like ponies, dogs,

   104                   or other humans with a different body type.

   105    - Empathy :: The alignment triggers the memories of previous

   106                 experience. For example, the alignment itself easily

   107                 maps to proprioceptive data. Any sounds or obvious

   108                 skin contact in the video can to a lesser extent

   109                 trigger previous experience. The creature's previous

   110                 experience is chained together in short bursts to

   111                 coherently describe the new scene.

   112    - Recognition :: With the scene now described in terms of past

   113                     experience, the creature can now run its

   114                     action-identification programs on this synthesized

   115                     sensory data, just as it would if it were actually

   116                     experiencing the scene first-hand. If previous

   117                     experience has been accurately retrieved, and if

   118                     it is analogous enough to the scene, then the

   119                     creature will correctly identify the action in the

   120                     scene.

   121 		    

   122 

   123    For example, I think humans are able to label the cat video as

   124    "drinking" because they imagine /themselves/ as the cat, and

   125    imagine putting their face up against a stream of water and

   126    sticking out their tongue. In that imagined world, they can feel

   127    the cool water hitting their tongue, and feel the water entering

   128    their body, and are able to recognize that /feeling/ as

   129    drinking. So, the label of the action is not really in the pixels

   130    of the image, but is found clearly in a simulation inspired by

   131    those pixels. An imaginative system, having been trained on

   132    drinking and non-drinking examples and learning that the most

   133    important component of drinking is the feeling of water sliding

   134    down one's throat, would analyze a video of a cat drinking in the

   135    following manner:

   136    

   137    1. Create a physical model of the video by putting a "fuzzy" model

   138       of its own body in place of the cat. Possibly also create a

   139       simulation of the stream of water.

   140 

   141    2. Play out this simulated scene and generate imagined sensory

   142       experience. This will include relevant muscle contractions, a

   143       close up view of the stream from the cat's perspective, and most

   144       importantly, the imagined feeling of water entering the

   145       mouth. The imagined sensory experience can come from a

   146       simulation of the event, or it can be pattern-matched from

   147       previous, similar embodied experience.

   148 

   149    3. The action is now easily identified as drinking by the sense of

   150       taste alone. The other senses (such as the tongue moving in and

   151       out) help to give plausibility to the simulated action. Note that

   152       the sense of vision, while critical in creating the simulation,

   153       is not critical for identifying the action from the simulation.

   154 

   155    For the chair examples, the process is even easier:

   156 

   157     1. Align a model of your body to the person in the image.

   158 

   159     2. Generate proprioceptive sensory data from this alignment.

   160   

   161     3. Use the imagined proprioceptive data as a key to look up related

   162        sensory experience associated with that particular proprioceptive

   163        feeling.

   164 

   165     4. Retrieve the feeling of your bottom resting on a surface and

   166        your leg muscles relaxed.

   167 

   168     5. This sensory information is consistent with the =sitting?=

   169        sensory predicate, so you (and the entity in the image) must be

   170        sitting.

   171 

   172     6. There must be a chair-like object since you are sitting.
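
   The six steps above can be sketched in Clojure as a lookup keyed on
   proprioception. This is a minimal sketch under stated assumptions, not
   the =EMPATH= implementation: the experience maps, the two-angle poses,
   the =:bottom= and =:legs= keys, the thresholds, and the helper names
   =pose-distance=, =recall=, and =chair-present?= are all hypothetical,
   and this =sitting?= is only an illustrative stand-in for the predicate
   named in step 5.

   #+begin_src clojure
;; Hypothetical remembered experiences. Each pairs a proprioceptive pose
;; (hip and knee angles, in radians) with the touch and muscle
;; sensations that were felt in that pose.
(def remembered
  [{:proprioception [0.0 0.0]    ; standing
    :touch {:bottom 0.0}
    :muscle {:legs 0.8}}
   {:proprioception [1.5 1.6]    ; sitting
    :touch {:bottom 0.95}
    :muscle {:legs 0.1}}])

(defn pose-distance
  "Squared Euclidean distance between two joint-angle vectors."
  [a b]
  (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b)))

(defn recall
  "Steps 3 and 4: use an imagined pose as a key to retrieve the
  remembered experience whose pose is closest to it."
  [memory pose]
  (apply min-key #(pose-distance pose (:proprioception %)) memory))

(defn sitting?
  "Step 5: sitting feels like weight on the bottom and relaxed legs.
  The thresholds are illustrative assumptions."
  [experience]
  (and (< 0.9 (get-in experience [:touch :bottom]))
       (> 0.2 (get-in experience [:muscle :legs]))))

(defn chair-present?
  "Step 6: if the aligned body must be sitting, infer a chair-like
  object underneath it."
  [memory aligned-pose]
  (sitting? (recall memory aligned-pose)))

;; Steps 1 and 2 (alignment) are assumed to have produced this pose.
(chair-present? remembered [1.4 1.5])
   #+end_src

   Note that no feature of the chair itself is ever inspected; the
   conclusion follows entirely from the recalled feeling of sitting.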

   173 

   174    Empathy offers yet another alternative to the age-old AI

   175    representation question: ``What is a chair?'' --- A chair is the

   176    feeling of sitting.

   177 

   178    My program, =EMPATH=, uses this empathic problem solving technique

   179    to interpret the actions of a simple, worm-like creature. 

   180    

   181    #+caption: The worm performs many actions during free play such as 

   182    #+caption: curling, wiggling, and resting.

   183    #+name: worm-intro

   184    #+ATTR_LaTeX: :width 10cm

   185    [[./images/wall-push.png]]

   186 

   187    #+caption: This sensory predicate detects when the worm is resting on the 

   188    #+caption: ground.

   189    #+name: resting-intro

   190    #+begin_listing clojure

   191    #+begin_src clojure

   192 (defn resting?

   193   "Is the worm resting on the ground?"

   194   [experiences]

   195   (every?

   196    (fn [touch-data]

   197      (< 0.9 (contact worm-segment-bottom touch-data)))

   198    (:touch (peek experiences))))

   199    #+end_src

   200    #+end_listing

   201 

   202    #+caption: Body-centered actions are best expressed in a body-centered 

   203    #+caption: language. This code detects when the worm has curled into a 

   204    #+caption: full circle. Imagine how you would replicate this functionality

   205    #+caption: using low-level pixel features such as HOG filters!

   206    #+name: grand-circle-intro

   207    #+begin_listing clojure

   208    #+begin_src clojure

   209 (defn grand-circle?

   210   "Does the worm form a majestic circle (one end touching the other)?"

   211   [experiences]

   212   (and (curled? experiences)

   213        (let [worm-touch (:touch (peek experiences))

   214              tail-touch (worm-touch 0)

   215              head-touch (worm-touch 4)]

   216          (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))

   217               (< 0.55 (contact worm-segment-top-tip    head-touch))))))

   218    #+end_src

   219    #+end_listing

   220 

   221    #+caption: Even complicated actions such as ``wiggling'' are fairly simple

   222    #+caption: to describe with a rich enough language.

   223    #+name: wiggling-intro

   224    #+begin_listing clojure

   225    #+begin_src clojure

   226 (defn wiggling?

   227   "Is the worm wiggling?"

   228   [experiences]

   229   (let [analysis-interval 0x40]

   230     (when (> (count experiences) analysis-interval)

   231       (let [a-flex 3

   232             a-ex   2

   233             muscle-activity

   234             (map :muscle (vector:last-n experiences analysis-interval))

   235             base-activity

   236             (map #(- (% a-flex) (% a-ex)) muscle-activity)]

   237         (= 2

   238            (first

   239             (max-indexed

   240              (map #(Math/abs %)

   241                   (take 20 (fft base-activity))))))))))

   242    #+end_src

   243    #+end_listing

   244 

   245    #+caption: The actions of a worm in a video can be recognized by

   246    #+caption: proprioceptive data and sensory predicates by filling

   247    #+caption:  in the missing sensory detail with previous experience.

   248    #+name: worm-recognition-intro

   249    #+ATTR_LaTeX: :width 10cm

   250    [[./images/wall-push.png]]

   251 

   252 

   253    

   254    One powerful advantage of empathic problem solving is that it

   255    factors the action recognition problem into two easier problems. To

   256    use empathy, you need an /aligner/, which takes the video and a

   257    model of your body, and aligns the model with the video. Then, you

   258    need a /recognizer/, which uses the aligned model to interpret the

   259    action. The power in this method lies in the fact that you describe

   260    all actions from a body-centered, rich viewpoint. This way, if you

   261    teach the system what ``running'' is, and you have a good enough

   262    aligner, the system will from then on be able to recognize running

   263    from any point of view, even strange points of view like above or

   264    underneath the runner. This is in contrast to action recognition

   265    schemes that try to identify actions using a non-embodied approach

   266    such as TODO:REFERENCE. If these systems learn about running as viewed

   267    from the side, they will not automatically be able to recognize

   268    running from any other viewpoint.
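
   The factoring described above can be sketched as a higher-order
   function. This is only an illustration under stated assumptions:
   =make-empathic-recognizer=, =toy-aligner=, =legs-alternating?=, and
   the shape of the video and body-model maps are hypothetical, and the
   toy aligner skips the hard vision problem entirely by reading joint
   angles that are already stored in each frame.

   #+begin_src clojure
(defn make-empathic-recognizer
  "Compose an aligner (video + body model -> sequence of body-centered
  experiences) with a map of label -> predicate over those experiences.
  Returns a function that yields the set of labels that apply."
  [aligner predicates]
  (fn [video body-model]
    (let [experiences (aligner video body-model)]
      (set (for [[label pred] predicates
                 :when (pred experiences)]
             label)))))

(defn toy-aligner
  "Hypothetical stand-in for a real aligner. Each frame already carries
  the joint angles that a vision system would have to estimate."
  [video body-model]
  (mapv (fn [frame]
          {:proprioception
           (select-keys (:pose frame) (:joints body-model))})
        (:frames video)))

(defn legs-alternating?
  "Crude stand-in for a `running' predicate: in every observed moment
  the left and right hip angles have opposite signs."
  [experiences]
  (and (boolean (seq experiences))
       (every? (fn [{:keys [proprioception]}]
                 (neg? (* (:left-hip proprioception)
                          (:right-hip proprioception))))
               experiences)))

(def body {:joints [:left-hip :right-hip]})

(def recognize
  (make-empathic-recognizer toy-aligner {:running legs-alternating?}))

(def side-view
  {:viewpoint :side
   :frames [{:pose {:left-hip 0.6  :right-hip -0.5}}
            {:pose {:left-hip -0.4 :right-hip 0.7}}]})

;; Same motion, different camera. The predicate never sees the
;; viewpoint, only the aligned body.
(def top-view (assoc side-view :viewpoint :above))

(recognize side-view body)
(recognize top-view body)
   #+end_src

   Both calls return the same set of labels, because the recognizer
   operates only on the aligned body. All of the viewpoint dependence is
   confined to the aligner.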

   269 

   270    Another powerful advantage is that using the language of multiple

   271    body-centered rich senses to describe body-centered actions offers a

   272    massive boost in descriptive capability. Consider how difficult it

   273    would be to compose a set of HOG filters to describe the action of

   274    a simple worm-creature "curling" so that its head touches its tail,

   275    and then behold the simplicity of describing this action in a

   276    language designed for the task (listing \ref{grand-circle-intro}).

   277 

   278 

   279 ** =CORTEX= is a toolkit for building sensate creatures

   280 

   281    Hand integration demo

   282 

   283 ** Contributions

   284 

   285 * Building =CORTEX=

   286 

   287 ** To explore embodiment, we need a world, body, and senses

   288 

   289 ** Because of Time, simulation is preferable to reality

   290 

   291 ** Video game engines are a great starting point

   292 

   293 ** Bodies are composed of segments connected by joints

   294 

   295 ** Eyes reuse standard video game components

   296 

   297 ** Hearing is hard; =CORTEX= does it right

   298 

   299 ** Touch uses hundreds of hair-like elements

   300 

   301 ** Proprioception is the sense that makes everything ``real''

   302 

   303 ** Muscles are both effectors and sensors

   304 

   305 ** =CORTEX= brings complex creatures to life!

   306 

   307 ** =CORTEX= enables many possibilities for further research

   308 

   309 * Empathy in a simulated worm

   310 

   311 ** Embodiment factors action recognition into manageable parts

   312 

   313 ** Action recognition is easy with a full gamut of senses

   314 

   315 ** Digression: bootstrapping touch using free exploration

   316 

   317 ** \Phi-space describes the worm's experiences

   318 

   319 ** Empathy is the process of tracing through \Phi-space 

   320   

   321 ** Efficient action recognition with =EMPATH=

   322 

   323 * Contributions

   324   - Built =CORTEX=, a comprehensive platform for embodied AI

   325     experiments. Has many new features lacking in other systems, such

   326     as sound. Easy to model/create new creatures.

   327   - Created a novel concept for action recognition by using artificial

   328     imagination. 

   329 

   330 In the second half of the thesis I develop a computational model of

   331 empathy, using =CORTEX= as a base. Empathy in this context is the

   332 ability to observe another creature and infer what sorts of sensations

   333 that creature is feeling. My empathy algorithm involves multiple

   334 phases. First is free-play, where the creature moves around and gains

   335 sensory experience. From this experience I construct a representation

   336 of the creature's sensory state space, which I call \phi-space. Using

   337 \phi-space, I construct an efficient function for enriching the

   338 limited data that comes from observing another creature with a full

   339 complement of imagined sensory data based on previous experience. I

   340 can then use the imagined sensory data to recognize what the observed

   341 creature is doing and feeling, using straightforward embodied action

   342 predicates. This is all demonstrated using a simple worm-like

   343 creature, and recognizing worm-actions based on limited data.
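
One way to picture the enrichment step is as an index from coarse
proprioceptive poses to full remembered experiences. The sketch below
is an assumption-laden illustration, not the \phi-space construction
itself: the binning scheme, the =resolution= parameter, the helper
names =bin-pose=, =index-experiences=, and =imagine=, and the
experience maps are all hypothetical.

#+begin_src clojure
(defn bin-pose
  "Quantize a vector of joint angles so that nearby poses share a key."
  [resolution pose]
  (mapv #(Math/round (double (/ % resolution))) pose))

(defn index-experiences
  "Free play: group remembered experiences by their binned pose, so
  that later lookups are a single hash-map access."
  [resolution experiences]
  (group-by #(bin-pose resolution (:proprioception %)) experiences))

(defn imagine
  "Enrich an observed pose with a full remembered experience that was
  felt in a similar pose. Returns nil for a pose never experienced."
  [index resolution observed-pose]
  (first (get index (bin-pose resolution observed-pose))))

;; Hypothetical free-play memories of a worm-like creature.
(def free-play
  [{:proprioception [0.1 0.2] :touch :ground :muscle :relaxed}
   {:proprioception [1.4 1.6] :touch :self   :muscle :flexed}])

(def index (index-experiences 0.5 free-play))

;; Only the pose is observed; touch and muscle data are recalled.
(imagine index 0.5 [1.5 1.5])
#+end_src

The design choice illustrated here is to pay for the indexing once,
during free play, so that each later observation costs one lookup
rather than a search over all past experience.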

   344 

   345 Embodied representation using multiple senses such as touch,

   346 proprioception, and muscle tension turns out to be exceedingly

   347 efficient at describing body-centered actions. It is the ``right

   348 language for the job''. For example, it takes only around 5 lines of

   349 LISP code to describe the action of ``curling'' using embodied

   350 primitives. It takes about 8 lines to describe the seemingly

   351 complicated action of wiggling.
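
To give a sense of scale, here is a sketch of what a body-centered
=curled?= predicate of that size might look like. It assumes,
hypothetically, that each experience stores =:proprioception= as one
=[heading pitch bend]= triple per joint, with angles in radians, and
the 0.64 threshold is illustrative rather than a tuned value.

#+begin_src clojure
(defn curled?
  "Is the worm curled up? True when every joint in the most recent
  experience is bent past an illustrative threshold."
  [experiences]
  (every?
   (fn [[_ _ bend]]
     (> (Math/sin bend) 0.64))
   (:proprioception (peek experiences))))

;; A hypothetical two-joint worm with both joints strongly bent.
(curled? [{:proprioception [[0.0 0.0 1.2] [0.0 0.0 1.0]]}])
#+end_src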

   352 

   353 

   354 

   355 * COMMENT names for cortex

   356  - bioland
author	Robert McIntyre <rlm@mit.edu>
date	Mon, 24 Mar 2014 20:59:35 -0400