diff thesis/cortex.org @ 441:c20de2267d39

completing first third of first chapter.
author Robert McIntyre <rlm@mit.edu>
date Mon, 24 Mar 2014 20:59:35 -0400
parents b01c070b03d4
children eaf8c591372b
line wrap: on
line diff
     1.1 --- a/thesis/cortex.org	Sun Mar 23 23:43:20 2014 -0400
     1.2 +++ b/thesis/cortex.org	Mon Mar 24 20:59:35 2014 -0400
     1.3 @@ -10,104 +10,271 @@
     1.4    By the end of this thesis, you will have seen a novel approach to
     1.5    interpreting video using embodiment and empathy. You will have also
     1.6    seen one way to efficiently implement empathy for embodied
     1.7 -  creatures.
     1.8 +  creatures. Finally, you will become familiar with =CORTEX=, a
     1.9 +  system for designing and simulating creatures with rich senses,
    1.10 +  which you may choose to use in your own research.
    1.11    
    1.12 -  The core vision of this thesis is that one of the important ways in
    1.13 -  which we understand others is by imagining ourselves in their
    1.14 -  posistion and empathicaly feeling experiences based on our own past
    1.15 -  experiences and imagination.
    1.16 -
    1.17 -  By understanding events in terms of our own previous corperal
    1.18 -  experience, we greatly constrain the possibilities of what would
    1.19 -  otherwise be an unweidly exponential search. This extra constraint
    1.20 -  can be the difference between easily understanding what is happening
    1.21 -  in a video and being completely lost in a sea of incomprehensible
    1.22 -  color and movement.
    1.23 +  This is the core vision of my thesis: That one of the important ways
    1.24 +  in which we understand others is by imagining ourselves in their
     1.25 +  position and empathically feeling experiences relative to our own
    1.26 +  bodies. By understanding events in terms of our own previous
    1.27 +  corporeal experience, we greatly constrain the possibilities of what
    1.28 +  would otherwise be an unwieldy exponential search. This extra
    1.29 +  constraint can be the difference between easily understanding what
    1.30 +  is happening in a video and being completely lost in a sea of
    1.31 +  incomprehensible color and movement.
    1.32  
    1.33  ** Recognizing actions in video is extremely difficult
    1.34  
    1.35 -  Consider for example the problem of determining what is happening in
    1.36 -  a video of which this is one frame:
    1.37 +   Consider for example the problem of determining what is happening in
    1.38 +   a video of which this is one frame:
    1.39  
    1.40 -  #+caption: A cat drinking some water. Identifying this action is 
    1.41 -  #+caption: beyond the state of the art for computers.
    1.42 -  #+ATTR_LaTeX: :width 7cm
    1.43 -  [[./images/cat-drinking.jpg]]
    1.44 +   #+caption: A cat drinking some water. Identifying this action is 
    1.45 +   #+caption: beyond the state of the art for computers.
    1.46 +   #+ATTR_LaTeX: :width 7cm
    1.47 +   [[./images/cat-drinking.jpg]]
    1.48 +   
    1.49 +   It is currently impossible for any computer program to reliably
     1.50 +   label such a video as "drinking".  And rightly so -- it is a very
    1.51 +   hard problem! What features can you describe in terms of low level
    1.52 +   functions of pixels that can even begin to describe at a high level
    1.53 +   what is happening here?
    1.54    
    1.55 -  It is currently impossible for any computer program to reliably
    1.56 -  label such an video as "drinking".  And rightly so -- it is a very
    1.57 -  hard problem! What features can you describe in terms of low level
    1.58 -  functions of pixels that can even begin to describe what is
    1.59 -  happening here? 
    1.60 +   Or suppose that you are building a program that recognizes
    1.61 +   chairs. How could you ``see'' the chair in figure
    1.62 +   \ref{invisible-chair} and figure \ref{hidden-chair}?
    1.63 +   
    1.64 +   #+caption: When you look at this, do you think ``chair''? I certainly do.
    1.65 +   #+name: invisible-chair
    1.66 +   #+ATTR_LaTeX: :width 10cm
    1.67 +   [[./images/invisible-chair.png]]
    1.68 +   
    1.69 +   #+caption: The chair in this image is quite obvious to humans, but I 
    1.70 +   #+caption: doubt that any computer program can find it.
    1.71 +   #+name: hidden-chair
    1.72 +   #+ATTR_LaTeX: :width 10cm
    1.73 +   [[./images/fat-person-sitting-at-desk.jpg]]
    1.74 +   
     1.75 +   Finally, how is it that you can easily tell the difference in
     1.76 +   how the girl's /muscles/ are working in figure \ref{girl}?
    1.77 +   
    1.78 +   #+caption: The mysterious ``common sense'' appears here as you are able 
    1.79 +   #+caption: to discern the difference in how the girl's arm muscles
    1.80 +   #+caption: are activated between the two images.
    1.81 +   #+name: girl
    1.82 +   #+ATTR_LaTeX: :width 10cm
    1.83 +   [[./images/wall-push.png]]
    1.84    
    1.85 -  Or suppose that you are building a program that recognizes
    1.86 -  chairs. How could you ``see'' the chair in the following pictures?
    1.87 +   Each of these examples tells us something about what might be going
    1.88 +   on in our minds as we easily solve these recognition problems.
    1.89 +   
    1.90 +   The hidden chairs show us that we are strongly triggered by cues
    1.91 +   relating to the position of human bodies, and that we can
    1.92 +   determine the overall physical configuration of a human body even
    1.93 +   if much of that body is occluded.
    1.94  
    1.95 -  #+caption: When you look at this, do you think ``chair''? I certainly do.
    1.96 -  #+ATTR_LaTeX: :width 10cm
    1.97 -  [[./images/invisible-chair.png]]
    1.98 +   The picture of the girl pushing against the wall tells us that we
    1.99 +   have common sense knowledge about the kinetics of our own bodies.
   1.100 +   We know well how our muscles would have to work to maintain us in
   1.101 +   most positions, and we can easily project this self-knowledge to
   1.102 +   imagined positions triggered by images of the human body.
   1.103 +
   1.104 +** =EMPATH= neatly solves recognition problems  
   1.105 +   
   1.106 +   I propose a system that can express the types of recognition
   1.107 +   problems above in a form amenable to computation. It is split into
    1.108 +   four parts (a sketch of how they might fit together follows the list):
   1.109 +
   1.110 +   - Free/Guided Play :: The creature moves around and experiences the
   1.111 +        world through its unique perspective. Many otherwise
   1.112 +        complicated actions are easily described in the language of a
   1.113 +        full suite of body-centered, rich senses. For example,
   1.114 +        drinking is the feeling of water sliding down your throat, and
   1.115 +        cooling your insides. It's often accompanied by bringing your
   1.116 +        hand close to your face, or bringing your face close to
   1.117 +        water. Sitting down is the feeling of bending your knees,
   1.118 +        activating your quadriceps, then feeling a surface with your
   1.119 +        bottom and relaxing your legs. These body-centered action
   1.120 +        descriptions can be either learned or hard coded.
   1.121 +   - Alignment :: When trying to interpret a video or image, the
   1.122 +                  creature takes a model of itself and aligns it with
   1.123 +                  whatever it sees. This can be a rather loose
   1.124 +                  alignment that can cross species, as when humans try
   1.125 +                  to align themselves with things like ponies, dogs,
   1.126 +                  or other humans with a different body type.
   1.127 +   - Empathy :: The alignment triggers the memories of previous
   1.128 +                experience. For example, the alignment itself easily
   1.129 +                maps to proprioceptive data. Any sounds or obvious
   1.130 +                skin contact in the video can to a lesser extent
    1.131 +                trigger previous experience. The creature's previous
   1.132 +                experience is chained together in short bursts to
   1.133 +                coherently describe the new scene.
   1.134 +   - Recognition :: With the scene now described in terms of past
   1.135 +                    experience, the creature can now run its
   1.136 +                    action-identification programs on this synthesized
   1.137 +                    sensory data, just as it would if it were actually
   1.138 +                    experiencing the scene first-hand. If previous
   1.139 +                    experience has been accurately retrieved, and if
   1.140 +                    it is analogous enough to the scene, then the
   1.141 +                    creature will correctly identify the action in the
   1.142 +                    scene.
   1.143 +		    
   1.144 +
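          +   Taken together, these four parts suggest a simple pipeline. The
          +   listing below is only a sketch of how they might compose; the
          +   =align=, =empathize=, and =action-predicates= arguments are
          +   illustrative placeholders supplied by the caller, not part of
          +   =CORTEX= or =EMPATH= proper.
          +
          +   #+caption: A sketch (not part of =EMPATH= proper) of how the four
          +   #+caption: parts might compose into a single pipeline.
          +   #+name: empath-pipeline-sketch
          +   #+begin_listing clojure
          +   #+begin_src clojure
          +(defn interpret-scene
          +  "Sketch: guess the action in `video` by aligning `self-model` to
          +   the scene, empathically synthesizing the missing sensory
          +   experience from `past-experiences` (gathered during free play),
          +   and running the ordinary action predicates on the result."
          +  [align empathize action-predicates video self-model past-experiences]
          +  (let [alignment  (align self-model video)                ; Alignment
          +        experience (empathize alignment past-experiences)] ; Empathy
          +    ;; Recognition: the same predicates used during free play are
          +    ;; applied to the synthesized experience.
          +    (first (filter (fn [pred?] (pred? experience)) action-predicates))))
          +   #+end_src
          +   #+end_listing
          +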
   1.145 +   For example, I think humans are able to label the cat video as
   1.146 +   "drinking" because they imagine /themselves/ as the cat, and
   1.147 +   imagine putting their face up against a stream of water and
   1.148 +   sticking out their tongue. In that imagined world, they can feel
   1.149 +   the cool water hitting their tongue, and feel the water entering
   1.150 +   their body, and are able to recognize that /feeling/ as
   1.151 +   drinking. So, the label of the action is not really in the pixels
   1.152 +   of the image, but is found clearly in a simulation inspired by
   1.153 +   those pixels. An imaginative system, having been trained on
   1.154 +   drinking and non-drinking examples and learning that the most
   1.155 +   important component of drinking is the feeling of water sliding
   1.156 +   down one's throat, would analyze a video of a cat drinking in the
    1.157 +   following manner (a hypothetical taste-based predicate is
          +   sketched after these steps):
   1.158 +   
   1.159 +   1. Create a physical model of the video by putting a "fuzzy" model
   1.160 +      of its own body in place of the cat. Possibly also create a
   1.161 +      simulation of the stream of water.
   1.162 +
   1.163 +   2. Play out this simulated scene and generate imagined sensory
   1.164 +      experience. This will include relevant muscle contractions, a
   1.165 +      close up view of the stream from the cat's perspective, and most
   1.166 +      importantly, the imagined feeling of water entering the
    1.167 +      mouth. The imagined sensory experience can come both from a
    1.168 +      simulation of the event and from pattern-matching against
    1.169 +      previous, similar embodied experience.
   1.170 +
   1.171 +   3. The action is now easily identified as drinking by the sense of
   1.172 +      taste alone. The other senses (such as the tongue moving in and
   1.173 +      out) help to give plausibility to the simulated action. Note that
   1.174 +      the sense of vision, while critical in creating the simulation,
   1.175 +      is not critical for identifying the action from the simulation.
   1.176 +
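          +   Purely as an illustration of this style of analysis, a
          +   taste-based drinking detector might look like the sketch below.
          +   It is hypothetical: the =:taste= data and its =:tongue= and
          +   =:throat= entries are assumed sensor values between 0 and 1,
          +   not senses implemented later in this thesis.
          +
          +   #+caption: Hypothetical sketch of a taste-based drinking predicate.
          +   #+name: drinking-sketch
          +   #+begin_listing clojure
          +   #+begin_src clojure
          +(defn drinking?
          +  "Does the most recent imagined experience feel like drinking?
          +   (hypothetical -- assumes a :taste sense with :tongue and :throat
          +   sensor values between 0 and 1)"
          +  [experiences]
          +  (let [taste (:taste (peek experiences))]
          +    (and taste
          +         (< 0.5 (:tongue taste))      ; water felt on the tongue
          +         (< 0.5 (:throat taste)))))   ; water sliding down the throat
          +   #+end_src
          +   #+end_listing
          +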
   1.177 +   For the chair examples, the process is even easier:
   1.178 +
   1.179 +    1. Align a model of your body to the person in the image.
   1.180 +
   1.181 +    2. Generate proprioceptive sensory data from this alignment.
   1.182    
   1.183 -  #+caption: The chair in this image is quite obvious to humans, but I 
   1.184 -  #+caption: doubt that any computer program can find it.
   1.185 -  #+ATTR_LaTeX: :width 10cm
   1.186 -  [[./images/fat-person-sitting-at-desk.jpg]]
   1.187 +    3. Use the imagined proprioceptive data as a key to lookup related
    1.188 +       sensory experience associated with that particular proprioceptive
   1.189 +       feeling.
   1.190  
   1.191 -  Finally, how is it that you can easily tell the difference between
   1.192 -  how the girls /muscles/ are working in \ref{girl}?
   1.193 +    4. Retrieve the feeling of your bottom resting on a surface and
   1.194 +       your leg muscles relaxed.
   1.195  
   1.196 -  #+caption: The mysterious ``common sense'' appears here as you are able 
   1.197 -  #+caption: to ``see'' the difference in how the girl's arm muscles
   1.198 -  #+caption: are activated differently in the two images.
   1.199 -  #+name: girl
   1.200 -  #+ATTR_LaTeX: :width 10cm
   1.201 -  [[./images/wall-push.png]]
   1.202 -  
   1.203 +    5. This sensory information is consistent with the =sitting?=
   1.204 +       sensory predicate, so you (and the entity in the image) must be
    1.205 +       sitting (a sketch of such a predicate follows these steps).
   1.206  
   1.207 -  These problems are difficult because the language of pixels is far
   1.208 -  removed from what we would consider to be an acceptable description
   1.209 -  of the events in these images. In order to process them, we must
   1.210 -  raise the images into some higher level of abstraction where their
   1.211 -  descriptions become more similar to how we would describe them in
   1.212 -  English. The question is, how can we raise 
   1.213 -  
   1.214 +    6. There must be a chair-like object since you are sitting.
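          +
          +   The =sitting?= predicate mentioned in step 5 might be sketched as
          +   follows, in the same style as the worm predicates introduced
          +   below. The touch region =pelvis-underside= and the muscle indices
          +   are illustrative placeholders, and =contact= is assumed to return
          +   a value between 0 and 1 as it does for the worm.
          +
          +   #+caption: Hypothetical sketch of the =sitting?= sensory predicate.
          +   #+name: sitting-sketch
          +   #+begin_listing clojure
          +   #+begin_src clojure
          +(defn sitting?
          +  "Does the imagined experience feel like sitting? (hypothetical
          +   sketch; pelvis-underside and the muscle indices are placeholders)"
          +  [experiences]
          +  (let [exp        (peek experiences)
          +        quadriceps 0   ; illustrative muscle indices
          +        hamstrings 1]
          +    (and (< 0.9 (contact pelvis-underside (:touch exp))) ; weight on the bottom
          +         (> 0.2 ((:muscle exp) quadriceps))              ; leg muscles
          +         (> 0.2 ((:muscle exp) hamstrings)))))           ; relaxed
          +   #+end_src
          +   #+end_listing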
   1.215  
   1.216 -  I think humans are able to label such video as "drinking" because
   1.217 -  they imagine /themselves/ as the cat, and imagine putting their face
   1.218 -  up against a stream of water and sticking out their tongue. In that
   1.219 -  imagined world, they can feel the cool water hitting their tongue,
   1.220 -  and feel the water entering their body, and are able to recognize
   1.221 -  that /feeling/ as drinking. So, the label of the action is not
   1.222 -  really in the pixels of the image, but is found clearly in a
   1.223 -  simulation inspired by those pixels. An imaginative system, having
   1.224 -  been trained on drinking and non-drinking examples and learning that
   1.225 -  the most important component of drinking is the feeling of water
   1.226 -  sliding down one's throat, would analyze a video of a cat drinking
   1.227 -  in the following manner:
    1.228 +   Empathy offers yet another answer to the age-old AI
   1.229 +   representation question: ``What is a chair?'' --- A chair is the
   1.230 +   feeling of sitting.
   1.231 +
    1.232 +   My program, =EMPATH=, uses this empathic problem-solving technique
   1.233 +   to interpret the actions of a simple, worm-like creature. 
   1.234     
   1.235 -   - Create a physical model of the video by putting a "fuzzy" model
   1.236 -     of its own body in place of the cat. Also, create a simulation of
   1.237 -     the stream of water.
   1.238 +   #+caption: The worm performs many actions during free play such as 
   1.239 +   #+caption: curling, wiggling, and resting.
   1.240 +   #+name: worm-intro
   1.241 +   #+ATTR_LaTeX: :width 10cm
   1.242 +   [[./images/wall-push.png]]
   1.243  
   1.244 -   - Play out this simulated scene and generate imagined sensory
   1.245 -     experience. This will include relevant muscle contractions, a
   1.246 -     close up view of the stream from the cat's perspective, and most
   1.247 -     importantly, the imagined feeling of water entering the mouth.
   1.248 +   #+caption: This sensory predicate detects when the worm is resting on the 
   1.249 +   #+caption: ground.
   1.250 +   #+name: resting-intro
   1.251 +   #+begin_listing clojure
   1.252 +   #+begin_src clojure
   1.253 +(defn resting?
   1.254 +  "Is the worm resting on the ground?"
   1.255 +  [experiences]
   1.256 +  (every?
   1.257 +   (fn [touch-data]
   1.258 +     (< 0.9 (contact worm-segment-bottom touch-data)))
   1.259 +   (:touch (peek experiences))))
   1.260 +   #+end_src
   1.261 +   #+end_listing
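          +
          +   Purely as an illustration of how such predicates are meant to be
          +   used: =worm-experiences= below is a hypothetical atom holding the
          +   vector of sensory snapshots accumulated during free play, most
          +   recent last.
          +
          +   #+begin_src clojure
          +;; illustrative only -- `worm-experiences` is a placeholder name
          +(resting? @worm-experiences) ; => true while the worm lies on the ground
          +   #+end_src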
   1.262  
   1.263 -   - The action is now easily identified as drinking by the sense of
   1.264 -     taste alone. The other senses (such as the tongue moving in and
   1.265 -     out) help to give plausibility to the simulated action. Note that
   1.266 -     the sense of vision, while critical in creating the simulation,
   1.267 -     is not critical for identifying the action from the simulation.
    1.268 +   #+caption: Body-centered actions are best expressed in a body-centered 
   1.269 +   #+caption: language. This code detects when the worm has curled into a 
   1.270 +   #+caption: full circle. Imagine how you would replicate this functionality
   1.271 +   #+caption: using low-level pixel features such as HOG filters!
   1.272 +   #+name: grand-circle-intro
   1.273 +   #+begin_listing clojure
   1.274 +   #+begin_src clojure
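          +;; Touch data here is a vector indexed by segment: (worm-touch 0) is
          +;; the tail segment and (worm-touch 4) the head segment.  The test
          +;; asks whether the tip regions of both end segments report strong
          +;; contact -- that is, whether the two ends are touching each other.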
   1.275 +(defn grand-circle?
   1.276 +  "Does the worm form a majestic circle (one end touching the other)?"
   1.277 +  [experiences]
   1.278 +  (and (curled? experiences)
   1.279 +       (let [worm-touch (:touch (peek experiences))
   1.280 +             tail-touch (worm-touch 0)
   1.281 +             head-touch (worm-touch 4)]
   1.282 +         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
   1.283 +              (< 0.55 (contact worm-segment-top-tip    head-touch))))))
   1.284 +   #+end_src
   1.285 +   #+end_listing
   1.286  
   1.287 -   cat drinking, mimes, leaning, common sense
   1.288 +   #+caption: Even complicated actions such as ``wiggling'' are fairly simple
   1.289 +   #+caption: to describe with a rich enough language.
   1.290 +   #+name: wiggling-intro
   1.291 +   #+begin_listing clojure
   1.292 +   #+begin_src clojure
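          +;; Strategy: take the muscle data from the last 64 (0x40) frames,
          +;; form the difference between one flexor/extensor pair (indices
          +;; a-flex and a-ex), and check that the largest of the first 20 FFT
          +;; components of that signal is component 2 -- that is, the worm is
          +;; oscillating at a characteristic wiggling frequency.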
   1.293 +(defn wiggling?
   1.294 +  "Is the worm wiggling?"
   1.295 +  [experiences]
   1.296 +  (let [analysis-interval 0x40]
   1.297 +    (when (> (count experiences) analysis-interval)
   1.298 +      (let [a-flex 3
   1.299 +            a-ex   2
   1.300 +            muscle-activity
   1.301 +            (map :muscle (vector:last-n experiences analysis-interval))
   1.302 +            base-activity
   1.303 +            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
   1.304 +        (= 2
   1.305 +           (first
   1.306 +            (max-indexed
   1.307 +             (map #(Math/abs %)
   1.308 +                  (take 20 (fft base-activity))))))))))
   1.309 +   #+end_src
   1.310 +   #+end_listing
   1.311  
   1.312 -** =EMPATH= neatly solves recognition problems
    1.313 +   #+caption: The actions of a worm in a video can be recognized from
    1.314 +   #+caption: proprioceptive data and sensory predicates by filling
    1.315 +   #+caption: in the missing sensory detail with previous experience.
   1.316 +   #+name: worm-recognition-intro
   1.317 +   #+ATTR_LaTeX: :width 10cm
   1.318 +   [[./images/wall-push.png]]
   1.319  
   1.320 -   factorization , right language, etc
   1.321  
   1.322 -   a new possibility for the question ``what is a chair?'' -- it's the
   1.323 -   feeling of your butt on something and your knees bent, with your
   1.324 -   back muscles and legs relaxed.
   1.325 +   
   1.326 +   One powerful advantage of empathic problem solving is that it
   1.327 +   factors the action recognition problem into two easier problems. To
   1.328 +   use empathy, you need an /aligner/, which takes the video and a
   1.329 +   model of your body, and aligns the model with the video. Then, you
   1.330 +   need a /recognizer/, which uses the aligned model to interpret the
   1.331 +   action. The power in this method lies in the fact that you describe
    1.332 +   all actions from a body-centered, rich viewpoint. This way, if you
   1.333 +   teach the system what ``running'' is, and you have a good enough
   1.334 +   aligner, the system will from then on be able to recognize running
   1.335 +   from any point of view, even strange points of view like above or
   1.336 +   underneath the runner. This is in contrast to action recognition
   1.337 +   schemes that try to identify actions using a non-embodied approach
   1.338 +   such as TODO:REFERENCE. If these systems learn about running as viewed
   1.339 +   from the side, they will not automatically be able to recognize
   1.340 +   running from any other viewpoint.
   1.341 +
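          +   The factorization is easy to state in code. The sketch below is
          +   only illustrative; =aligner= and =recognizer= stand for the two
          +   components described above and are supplied by the caller.
          +
          +   #+caption: A sketch of the aligner/recognizer factorization.
          +   #+name: factorization-sketch
          +   #+begin_listing clojure
          +   #+begin_src clojure
          +(defn recognize
          +  "Sketch: `aligner` handles everything that depends on viewpoint
          +   and appearance; `recognizer` sees only body-centered data, so
          +   whatever it learns carries over to any viewpoint."
          +  [aligner recognizer body-model video]
          +  (-> video
          +      (aligner body-model) ; viewpoint-dependent: fit the model to the scene
          +      recognizer))         ; viewpoint-independent: label the action
          +   #+end_src
          +   #+end_listing
          +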
   1.342 +   Another powerful advantage is that using the language of multiple
    1.343 +   body-centered rich senses to describe body-centered actions offers a
   1.344 +   massive boost in descriptive capability. Consider how difficult it
   1.345 +   would be to compose a set of HOG filters to describe the action of
   1.346 +   a simple worm-creature "curling" so that its head touches its tail,
    1.347 +   and then behold the simplicity of describing this action in a
    1.348 +   language designed for the task (listing \ref{grand-circle-intro}).
   1.349 +
   1.350  
   1.351  ** =CORTEX= is a toolkit for building sensate creatures
   1.352  
   1.353 @@ -151,7 +318,7 @@
   1.354  
   1.355  ** Empathy is the process of tracing though \Phi-space 
   1.356    
   1.357 -** Efficient action recognition =EMPATH=
   1.358 +** Efficient action recognition with =EMPATH=
   1.359  
   1.360  * Contributions
   1.361    - Built =CORTEX=, a comprehensive platform for embodied AI