changeset 441:c20de2267d39

completing first third of first chapter.
author Robert McIntyre <rlm@mit.edu>
date Mon, 24 Mar 2014 20:59:35 -0400 (2014-03-25)
parents b01c070b03d4
children eaf8c591372b
files thesis/abstract.org thesis/cortex.org thesis/cover.tex thesis/rlm-cortex-meng.tex thesis/to-frames.pl
diffstat 5 files changed, 268 insertions(+), 86 deletions(-) [+]
line wrap: on
line diff
     1.1 --- a/thesis/abstract.org	Sun Mar 23 23:43:20 2014 -0400
     1.2 +++ b/thesis/abstract.org	Mon Mar 24 20:59:35 2014 -0400
     1.3 @@ -6,11 +6,11 @@
     1.4  curling and wiggling.
     1.5  
     1.6  To attack the action recognition problem, I developed a computational
     1.7 -model of empathy (=EMPATH=) which allows me to use simple, embodied
     1.8 -representations of actions (which require rich sensory data), even
     1.9 -when that sensory data is not actually available. The missing sense
    1.10 -data is ``imagined'' by the system by combining previous experiences
    1.11 -gained from unsupervised free play.
    1.12 +model of empathy (=EMPATH=) which allows me to recognize actions using
    1.13 +simple, embodied representations of actions (which require rich
    1.14 +sensory data), even when that sensory data is not actually
    1.15 +available. The missing sense data is ``imagined'' by the system by
    1.16 +combining previous experiences gained from unsupervised free play.
    1.17  
    1.18  In order to build this empathic, action-recognizing system, I created
    1.19  a program called =CORTEX=, which is a complete platform for embodied
     2.1 --- a/thesis/cortex.org	Sun Mar 23 23:43:20 2014 -0400
     2.2 +++ b/thesis/cortex.org	Mon Mar 24 20:59:35 2014 -0400
     2.3 @@ -10,104 +10,271 @@
     2.4    By the end of this thesis, you will have seen a novel approach to
     2.5    interpreting video using embodiment and empathy. You will have also
     2.6    seen one way to efficiently implement empathy for embodied
     2.7 -  creatures.
     2.8 +  creatures. Finally, you will become familiar with =CORTEX=, a
     2.9 +  system for designing and simulating creatures with rich senses,
    2.10 +  which you may choose to use in your own research.
    2.11    
    2.12 -  The core vision of this thesis is that one of the important ways in
    2.13 -  which we understand others is by imagining ourselves in their
    2.14 -  posistion and empathicaly feeling experiences based on our own past
    2.15 -  experiences and imagination.
    2.16 -
    2.17 -  By understanding events in terms of our own previous corperal
    2.18 -  experience, we greatly constrain the possibilities of what would
    2.19 -  otherwise be an unweidly exponential search. This extra constraint
    2.20 -  can be the difference between easily understanding what is happening
    2.21 -  in a video and being completely lost in a sea of incomprehensible
    2.22 -  color and movement.
    2.23 +  This is the core vision of my thesis: That one of the important ways
    2.24 +  in which we understand others is by imagining ourselves in their
     2.25 +  position and empathically feeling experiences relative to our own
    2.26 +  bodies. By understanding events in terms of our own previous
    2.27 +  corporeal experience, we greatly constrain the possibilities of what
    2.28 +  would otherwise be an unwieldy exponential search. This extra
    2.29 +  constraint can be the difference between easily understanding what
    2.30 +  is happening in a video and being completely lost in a sea of
    2.31 +  incomprehensible color and movement.
    2.32  
    2.33  ** Recognizing actions in video is extremely difficult
    2.34  
    2.35 -  Consider for example the problem of determining what is happening in
    2.36 -  a video of which this is one frame:
    2.37 +   Consider for example the problem of determining what is happening in
    2.38 +   a video of which this is one frame:
    2.39  
    2.40 -  #+caption: A cat drinking some water. Identifying this action is 
    2.41 -  #+caption: beyond the state of the art for computers.
    2.42 -  #+ATTR_LaTeX: :width 7cm
    2.43 -  [[./images/cat-drinking.jpg]]
    2.44 +   #+caption: A cat drinking some water. Identifying this action is 
    2.45 +   #+caption: beyond the state of the art for computers.
    2.46 +   #+ATTR_LaTeX: :width 7cm
    2.47 +   [[./images/cat-drinking.jpg]]
    2.48 +   
    2.49 +   It is currently impossible for any computer program to reliably
     2.50 +   label such a video as "drinking".  And rightly so -- it is a very
    2.51 +   hard problem! What features can you describe in terms of low level
    2.52 +   functions of pixels that can even begin to describe at a high level
    2.53 +   what is happening here?
    2.54    
    2.55 -  It is currently impossible for any computer program to reliably
    2.56 -  label such an video as "drinking".  And rightly so -- it is a very
    2.57 -  hard problem! What features can you describe in terms of low level
    2.58 -  functions of pixels that can even begin to describe what is
    2.59 -  happening here? 
    2.60 +   Or suppose that you are building a program that recognizes
    2.61 +   chairs. How could you ``see'' the chair in figure
    2.62 +   \ref{invisible-chair} and figure \ref{hidden-chair}?
    2.63 +   
    2.64 +   #+caption: When you look at this, do you think ``chair''? I certainly do.
    2.65 +   #+name: invisible-chair
    2.66 +   #+ATTR_LaTeX: :width 10cm
    2.67 +   [[./images/invisible-chair.png]]
    2.68 +   
    2.69 +   #+caption: The chair in this image is quite obvious to humans, but I 
    2.70 +   #+caption: doubt that any computer program can find it.
    2.71 +   #+name: hidden-chair
    2.72 +   #+ATTR_LaTeX: :width 10cm
    2.73 +   [[./images/fat-person-sitting-at-desk.jpg]]
    2.74 +   
    2.75 +   Finally, how is it that you can easily tell the difference between
     2.76 +   how the girl's /muscles/ are working in figure \ref{girl}?
    2.77 +   
    2.78 +   #+caption: The mysterious ``common sense'' appears here as you are able 
    2.79 +   #+caption: to discern the difference in how the girl's arm muscles
    2.80 +   #+caption: are activated between the two images.
    2.81 +   #+name: girl
    2.82 +   #+ATTR_LaTeX: :width 10cm
    2.83 +   [[./images/wall-push.png]]
    2.84    
    2.85 -  Or suppose that you are building a program that recognizes
    2.86 -  chairs. How could you ``see'' the chair in the following pictures?
    2.87 +   Each of these examples tells us something about what might be going
    2.88 +   on in our minds as we easily solve these recognition problems.
    2.89 +   
    2.90 +   The hidden chairs show us that we are strongly triggered by cues
    2.91 +   relating to the position of human bodies, and that we can
    2.92 +   determine the overall physical configuration of a human body even
    2.93 +   if much of that body is occluded.
    2.94  
    2.95 -  #+caption: When you look at this, do you think ``chair''? I certainly do.
    2.96 -  #+ATTR_LaTeX: :width 10cm
    2.97 -  [[./images/invisible-chair.png]]
    2.98 +   The picture of the girl pushing against the wall tells us that we
    2.99 +   have common sense knowledge about the kinetics of our own bodies.
   2.100 +   We know well how our muscles would have to work to maintain us in
   2.101 +   most positions, and we can easily project this self-knowledge to
   2.102 +   imagined positions triggered by images of the human body.
   2.103 +
   2.104 +** =EMPATH= neatly solves recognition problems  
   2.105 +   
   2.106 +   I propose a system that can express the types of recognition
   2.107 +   problems above in a form amenable to computation. It is split into
   2.108 +   four parts:
   2.109 +
   2.110 +   - Free/Guided Play :: The creature moves around and experiences the
   2.111 +        world through its unique perspective. Many otherwise
   2.112 +        complicated actions are easily described in the language of a
   2.113 +        full suite of body-centered, rich senses. For example,
   2.114 +        drinking is the feeling of water sliding down your throat, and
   2.115 +        cooling your insides. It's often accompanied by bringing your
   2.116 +        hand close to your face, or bringing your face close to
   2.117 +        water. Sitting down is the feeling of bending your knees,
   2.118 +        activating your quadriceps, then feeling a surface with your
   2.119 +        bottom and relaxing your legs. These body-centered action
   2.120 +        descriptions can be either learned or hard coded.
   2.121 +   - Alignment :: When trying to interpret a video or image, the
   2.122 +                  creature takes a model of itself and aligns it with
   2.123 +                  whatever it sees. This can be a rather loose
   2.124 +                  alignment that can cross species, as when humans try
   2.125 +                  to align themselves with things like ponies, dogs,
   2.126 +                  or other humans with a different body type.
   2.127 +   - Empathy :: The alignment triggers the memories of previous
   2.128 +                experience. For example, the alignment itself easily
   2.129 +                maps to proprioceptive data. Any sounds or obvious
   2.130 +                skin contact in the video can to a lesser extent
    2.131 +                trigger previous experience. The creature's previous
   2.132 +                experience is chained together in short bursts to
   2.133 +                coherently describe the new scene.
   2.134 +   - Recognition :: With the scene now described in terms of past
   2.135 +                    experience, the creature can now run its
   2.136 +                    action-identification programs on this synthesized
   2.137 +                    sensory data, just as it would if it were actually
   2.138 +                    experiencing the scene first-hand. If previous
   2.139 +                    experience has been accurately retrieved, and if
   2.140 +                    it is analogous enough to the scene, then the
   2.141 +                    creature will correctly identify the action in the
   2.142 +                    scene.
   2.143 +		    
   2.144 +
   2.145 +   For example, I think humans are able to label the cat video as
   2.146 +   "drinking" because they imagine /themselves/ as the cat, and
   2.147 +   imagine putting their face up against a stream of water and
   2.148 +   sticking out their tongue. In that imagined world, they can feel
   2.149 +   the cool water hitting their tongue, and feel the water entering
   2.150 +   their body, and are able to recognize that /feeling/ as
   2.151 +   drinking. So, the label of the action is not really in the pixels
   2.152 +   of the image, but is found clearly in a simulation inspired by
   2.153 +   those pixels. An imaginative system, having been trained on
   2.154 +   drinking and non-drinking examples and learning that the most
   2.155 +   important component of drinking is the feeling of water sliding
   2.156 +   down one's throat, would analyze a video of a cat drinking in the
   2.157 +   following manner:
   2.158 +   
   2.159 +   1. Create a physical model of the video by putting a "fuzzy" model
   2.160 +      of its own body in place of the cat. Possibly also create a
   2.161 +      simulation of the stream of water.
   2.162 +
   2.163 +   2. Play out this simulated scene and generate imagined sensory
   2.164 +      experience. This will include relevant muscle contractions, a
   2.165 +      close up view of the stream from the cat's perspective, and most
   2.166 +      importantly, the imagined feeling of water entering the
    2.167 +      mouth. The imagined sensory experience can come from a
    2.168 +      simulation of the event, but can also be pattern-matched from
    2.169 +      previous, similar embodied experience.
   2.170 +
   2.171 +   3. The action is now easily identified as drinking by the sense of
   2.172 +      taste alone. The other senses (such as the tongue moving in and
   2.173 +      out) help to give plausibility to the simulated action. Note that
   2.174 +      the sense of vision, while critical in creating the simulation,
   2.175 +      is not critical for identifying the action from the simulation.
   2.176 +
   2.177 +   For the chair examples, the process is even easier:
   2.178 +
   2.179 +    1. Align a model of your body to the person in the image.
   2.180 +
   2.181 +    2. Generate proprioceptive sensory data from this alignment.
   2.182    
   2.183 -  #+caption: The chair in this image is quite obvious to humans, but I 
   2.184 -  #+caption: doubt that any computer program can find it.
   2.185 -  #+ATTR_LaTeX: :width 10cm
   2.186 -  [[./images/fat-person-sitting-at-desk.jpg]]
    2.187 +    3. Use the imagined proprioceptive data as a key to look up related
    2.188 +       sensory experience associated with that particular proprioceptive
    2.189 +       feeling.
   2.190  
   2.191 -  Finally, how is it that you can easily tell the difference between
   2.192 -  how the girls /muscles/ are working in \ref{girl}?
   2.193 +    4. Retrieve the feeling of your bottom resting on a surface and
   2.194 +       your leg muscles relaxed.
   2.195  
   2.196 -  #+caption: The mysterious ``common sense'' appears here as you are able 
   2.197 -  #+caption: to ``see'' the difference in how the girl's arm muscles
   2.198 -  #+caption: are activated differently in the two images.
   2.199 -  #+name: girl
   2.200 -  #+ATTR_LaTeX: :width 10cm
   2.201 -  [[./images/wall-push.png]]
   2.202 -  
   2.203 +    5. This sensory information is consistent with the =sitting?=
   2.204 +       sensory predicate, so you (and the entity in the image) must be
   2.205 +       sitting.
   2.206  
   2.207 -  These problems are difficult because the language of pixels is far
   2.208 -  removed from what we would consider to be an acceptable description
   2.209 -  of the events in these images. In order to process them, we must
   2.210 -  raise the images into some higher level of abstraction where their
   2.211 -  descriptions become more similar to how we would describe them in
   2.212 -  English. The question is, how can we raise 
   2.213 -  
   2.214 +    6. There must be a chair-like object since you are sitting.
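
The six chair-recognition steps above can be sketched in Clojure roughly as follows. This is a minimal illustration under stated assumptions, not code from =EMPATH=: =experience-db=, =distance=, =nearest-experience=, =chair-present?=, the sensor keys, the thresholds, and this particular definition of =sitting?= are all hypothetical. Alignment (steps 1 and 2) is assumed to have already produced a vector of joint angles.

```clojure
;; Hypothetical sketch; none of these definitions come from EMPATH itself.

(def experience-db
  "Toy store of remembered experiences: joint angles (proprioception)
  paired with the touch and muscle sensations felt at the same time."
  [{:proprioception [0.0 0.0] :touch {:bottom 0.0} :muscle {:legs 0.8}}   ; standing
   {:proprioception [1.5 1.5] :touch {:bottom 0.9} :muscle {:legs 0.1}}]) ; sitting

(defn distance
  "Euclidean distance between two equal-length vectors of joint angles."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y] (let [d (- x y)] (* d d))) a b))))

(defn nearest-experience
  "Steps 3-4: use imagined proprioception as a key to retrieve the
  remembered experience whose joint angles are closest to it."
  [db proprioception]
  (apply min-key #(distance proprioception (:proprioception %)) db))

(defn sitting?
  "Step 5: bottom in firm contact with a surface and leg muscles relaxed.
  The thresholds are arbitrary illustrative values."
  [experience]
  (and (< 0.5 (get-in experience [:touch :bottom]))
       (> 0.3 (get-in experience [:muscle :legs]))))

(defn chair-present?
  "Step 6: infer a chair-like object from the imagined feeling of sitting."
  [db aligned-proprioception]
  (sitting? (nearest-experience db aligned-proprioception)))
```

With this toy database, a pose close to the remembered sitting pose, such as =[1.4 1.6]=, retrieves the sitting experience, so =chair-present?= answers true without ever examining pixels for a chair.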
   2.215  
   2.216 -  I think humans are able to label such video as "drinking" because
   2.217 -  they imagine /themselves/ as the cat, and imagine putting their face
   2.218 -  up against a stream of water and sticking out their tongue. In that
   2.219 -  imagined world, they can feel the cool water hitting their tongue,
   2.220 -  and feel the water entering their body, and are able to recognize
   2.221 -  that /feeling/ as drinking. So, the label of the action is not
   2.222 -  really in the pixels of the image, but is found clearly in a
   2.223 -  simulation inspired by those pixels. An imaginative system, having
   2.224 -  been trained on drinking and non-drinking examples and learning that
   2.225 -  the most important component of drinking is the feeling of water
   2.226 -  sliding down one's throat, would analyze a video of a cat drinking
   2.227 -  in the following manner:
   2.228 +   Empathy offers yet another alternative to the age-old AI
   2.229 +   representation question: ``What is a chair?'' --- A chair is the
   2.230 +   feeling of sitting.
   2.231 +
    2.232 +   My program, =EMPATH=, uses this empathic problem solving technique
   2.233 +   to interpret the actions of a simple, worm-like creature. 
   2.234     
   2.235 -   - Create a physical model of the video by putting a "fuzzy" model
   2.236 -     of its own body in place of the cat. Also, create a simulation of
   2.237 -     the stream of water.
   2.238 +   #+caption: The worm performs many actions during free play such as 
   2.239 +   #+caption: curling, wiggling, and resting.
   2.240 +   #+name: worm-intro
   2.241 +   #+ATTR_LaTeX: :width 10cm
   2.242 +   [[./images/wall-push.png]]
   2.243  
   2.244 -   - Play out this simulated scene and generate imagined sensory
   2.245 -     experience. This will include relevant muscle contractions, a
   2.246 -     close up view of the stream from the cat's perspective, and most
   2.247 -     importantly, the imagined feeling of water entering the mouth.
   2.248 +   #+caption: This sensory predicate detects when the worm is resting on the 
   2.249 +   #+caption: ground.
   2.250 +   #+name: resting-intro
   2.251 +   #+begin_listing clojure
   2.252 +   #+begin_src clojure
   2.253 +(defn resting?
   2.254 +  "Is the worm resting on the ground?"
   2.255 +  [experiences]
   2.256 +  (every?
   2.257 +   (fn [touch-data]
   2.258 +     (< 0.9 (contact worm-segment-bottom touch-data)))
   2.259 +   (:touch (peek experiences))))
   2.260 +   #+end_src
   2.261 +   #+end_listing
   2.262  
   2.263 -   - The action is now easily identified as drinking by the sense of
   2.264 -     taste alone. The other senses (such as the tongue moving in and
   2.265 -     out) help to give plausibility to the simulated action. Note that
   2.266 -     the sense of vision, while critical in creating the simulation,
   2.267 -     is not critical for identifying the action from the simulation.
    2.268 +   #+caption: Body-centered actions are best expressed in a body-centered 
   2.269 +   #+caption: language. This code detects when the worm has curled into a 
   2.270 +   #+caption: full circle. Imagine how you would replicate this functionality
   2.271 +   #+caption: using low-level pixel features such as HOG filters!
   2.272 +   #+name: grand-circle-intro
   2.273 +   #+begin_listing clojure
   2.274 +   #+begin_src clojure
   2.275 +(defn grand-circle?
   2.276 +  "Does the worm form a majestic circle (one end touching the other)?"
   2.277 +  [experiences]
   2.278 +  (and (curled? experiences)
   2.279 +       (let [worm-touch (:touch (peek experiences))
   2.280 +             tail-touch (worm-touch 0)
   2.281 +             head-touch (worm-touch 4)]
   2.282 +         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
   2.283 +              (< 0.55 (contact worm-segment-top-tip    head-touch))))))
   2.284 +   #+end_src
   2.285 +   #+end_listing
   2.286  
   2.287 -   cat drinking, mimes, leaning, common sense
   2.288 +   #+caption: Even complicated actions such as ``wiggling'' are fairly simple
   2.289 +   #+caption: to describe with a rich enough language.
   2.290 +   #+name: wiggling-intro
   2.291 +   #+begin_listing clojure
   2.292 +   #+begin_src clojure
   2.293 +(defn wiggling?
   2.294 +  "Is the worm wiggling?"
   2.295 +  [experiences]
   2.296 +  (let [analysis-interval 0x40]
   2.297 +    (when (> (count experiences) analysis-interval)
   2.298 +      (let [a-flex 3
   2.299 +            a-ex   2
   2.300 +            muscle-activity
   2.301 +            (map :muscle (vector:last-n experiences analysis-interval))
   2.302 +            base-activity
   2.303 +            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
   2.304 +        (= 2
   2.305 +           (first
   2.306 +            (max-indexed
   2.307 +             (map #(Math/abs %)
   2.308 +                  (take 20 (fft base-activity))))))))))
   2.309 +   #+end_src
   2.310 +   #+end_listing
   2.311  
   2.312 -** =EMPATH= neatly solves recognition problems
   2.313 +   #+caption: The actions of a worm in a video can be recognized by
    2.314 +   #+caption: proprioceptive data and sensory predicates by filling
    2.315 +   #+caption: in the missing sensory detail with previous experience.
   2.316 +   #+name: worm-recognition-intro
   2.317 +   #+ATTR_LaTeX: :width 10cm
   2.318 +   [[./images/wall-push.png]]
   2.319  
   2.320 -   factorization , right language, etc
   2.321  
   2.322 -   a new possibility for the question ``what is a chair?'' -- it's the
   2.323 -   feeling of your butt on something and your knees bent, with your
   2.324 -   back muscles and legs relaxed.
   2.325 +   
   2.326 +   One powerful advantage of empathic problem solving is that it
   2.327 +   factors the action recognition problem into two easier problems. To
   2.328 +   use empathy, you need an /aligner/, which takes the video and a
   2.329 +   model of your body, and aligns the model with the video. Then, you
   2.330 +   need a /recognizer/, which uses the aligned model to interpret the
   2.331 +   action. The power in this method lies in the fact that you describe
    2.332 +   all actions from a body-centered, rich viewpoint. This way, if you
   2.333 +   teach the system what ``running'' is, and you have a good enough
   2.334 +   aligner, the system will from then on be able to recognize running
   2.335 +   from any point of view, even strange points of view like above or
   2.336 +   underneath the runner. This is in contrast to action recognition
   2.337 +   schemes that try to identify actions using a non-embodied approach
   2.338 +   such as TODO:REFERENCE. If these systems learn about running as viewed
   2.339 +   from the side, they will not automatically be able to recognize
   2.340 +   running from any other viewpoint.
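
The aligner/recognizer factoring described above can be illustrated with a small Clojure sketch. Everything here is an assumption made for illustration (=make-action-recognizer=, =toy-aligner=, =toy-predicates=, =recognize=, and the frame and body-model shapes); it is not the =EMPATH= implementation. A real aligner would have to recover the pose from pixels, which is the hard part that this toy version skips.

```clojure
;; Hypothetical sketch of the aligner/recognizer split; the names and
;; data shapes are illustrative assumptions, not the EMPATH API.

(defn make-action-recognizer
  "Compose an aligner (video frame + body model -> body-centered pose)
  with a table of body-centered predicates (pose -> boolean).
  Returns a function from a frame to the set of recognized action labels."
  [aligner body-model predicates]
  (fn [frame]
    (let [pose (aligner frame body-model)]
      (set (for [[label pred] predicates
                 :when (pred pose)]
             label)))))

;; Toy stand-ins: a "frame" already carries the joint angles, tagged
;; with a camera viewpoint that the aligner simply discards.
(defn toy-aligner [frame _body-model]
  (:joint-angles frame))

(def toy-predicates
  {:curled   (fn [angles] (every? #(< 1.0 %) angles))
   :straight (fn [angles] (every? #(> 0.1 (Math/abs (double %))) angles))})

(def recognize
  (make-action-recognizer toy-aligner {:segments 5} toy-predicates))
```

Because the predicates only ever see the body-centered pose, =(recognize {:viewpoint :above :joint-angles [1.2 1.3 1.2 1.4]})= and the same call with =:viewpoint :side= give the same answer. Viewpoint invariance is therefore entirely the aligner's responsibility.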
   2.341 +
   2.342 +   Another powerful advantage is that using the language of multiple
    2.343 +   body-centered rich senses to describe body-centered actions offers a
   2.344 +   massive boost in descriptive capability. Consider how difficult it
   2.345 +   would be to compose a set of HOG filters to describe the action of
   2.346 +   a simple worm-creature "curling" so that its head touches its tail,
    2.347 +   and then behold the simplicity of describing this action in a
    2.348 +   language designed for the task (listing \ref{grand-circle-intro}).
   2.349 +
   2.350  
   2.351  ** =CORTEX= is a toolkit for building sensate creatures
   2.352  
   2.353 @@ -151,7 +318,7 @@
   2.354  
    2.355  ** Empathy is the process of tracing through \Phi-space 
   2.356    
   2.357 -** Efficient action recognition =EMPATH=
   2.358 +** Efficient action recognition with =EMPATH=
   2.359  
   2.360  * Contributions
   2.361    - Built =CORTEX=, a comprehensive platform for embodied AI
     3.1 --- a/thesis/cover.tex	Sun Mar 23 23:43:20 2014 -0400
     3.2 +++ b/thesis/cover.tex	Mon Mar 24 20:59:35 2014 -0400
     3.3 @@ -45,7 +45,7 @@
     3.4  % however the specifications can change.  We recommend that you verify the
     3.5  % layout of your title page with your thesis advisor and/or the MIT 
     3.6  % Libraries before printing your final copy.
     3.7 -\title{Solving Problems using Embodiment \& Empathy.}
     3.8 +\title{Solving Problems using Embodiment \& Empathy}
     3.9  \author{Robert Louis M\raisebox{\depth}{\small \underline{\underline{c}}}Intyre}
    3.10  %\author{Robert McIntyre}
    3.11  
     4.1 --- a/thesis/rlm-cortex-meng.tex	Sun Mar 23 23:43:20 2014 -0400
     4.2 +++ b/thesis/rlm-cortex-meng.tex	Mon Mar 24 20:59:35 2014 -0400
     4.3 @@ -25,7 +25,7 @@
     4.4  %% Page Intentionally Left Blank'', use the ``leftblank'' option, as
     4.5  %% above. 
     4.6  
     4.7 -\documentclass[12pt,twoside,singlespace]{mitthesis}
     4.8 +\documentclass[12pt,twoside,singlespace,vi]{mitthesis}
     4.9  \usepackage[utf8]{inputenc}
    4.10  \usepackage[T1]{fontenc}
    4.11  \usepackage{fixltx2e}
     5.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     5.2 +++ b/thesis/to-frames.pl	Mon Mar 24 20:59:35 2014 -0400
     5.3 @@ -0,0 +1,15 @@
     5.4 +#!/bin/perl
     5.5 +
     5.6 +$movie_file = shift(@ARGV);
     5.7 +
     5.8 +# get file name without extension
     5.9 +$movie_file =~ m/^([^.]+)\.[^.]+$/;
    5.10 +$movie_name = $1;
    5.11 +
    5.12 +@mkdir_command = ("mkdir", "-vp", $movie_name);
    5.13 +@ffmpeg_command = ("ffmpeg", "-i", $movie_file, $movie_name."/%07d.png");
    5.14 +
    5.15 +print "@mkdir_command\n";
    5.16 +system(@mkdir_command);
    5.17 +print "@ffmpeg_command\n";
    5.18 +system(@ffmpeg_command);
    5.19 \ No newline at end of file