diff thesis/dylan-cortex-diff.diff @ 513:4c4d45f6f30b

accept/reject changes
author Robert McIntyre <rlm@mit.edu>
date Sun, 30 Mar 2014 10:41:18 -0400
parents
children 447c3c8405a2
line diff
     1.1 --- /dev/null	Thu Jan 01 00:00:00 1970 +0000
     1.2 +++ b/thesis/dylan-cortex-diff.diff	Sun Mar 30 10:41:18 2014 -0400
     1.3 @@ -0,0 +1,395 @@
     1.4 +diff -r f639e2139ce2 thesis/cortex.org
     1.5 +--- a/thesis/cortex.org	Sun Mar 30 01:34:43 2014 -0400
     1.6 ++++ b/thesis/cortex.org	Sun Mar 30 10:07:17 2014 -0400
     1.7 +@@ -41,49 +41,46 @@
     1.8 +     [[./images/aurellem-gray.png]]
     1.9 + 
    1.10 + 
    1.11 +-* Empathy and Embodiment as problem solving strategies
    1.12 ++* Empathy \& Embodiment: problem solving strategies
    1.13 +   
    1.14 +-  By the end of this thesis, you will have seen a novel approach to
    1.15 +-  interpreting video using embodiment and empathy. You will have also
    1.16 +-  seen one way to efficiently implement empathy for embodied
    1.17 +-  creatures. Finally, you will become familiar with =CORTEX=, a system
    1.18 +-  for designing and simulating creatures with rich senses, which you
    1.19 +-  may choose to use in your own research.
    1.20 +-  
    1.21 +-  This is the core vision of my thesis: That one of the important ways
    1.22 +-  in which we understand others is by imagining ourselves in their
    1.23 +-  position and emphatically feeling experiences relative to our own
    1.24 +-  bodies. By understanding events in terms of our own previous
    1.25 +-  corporeal experience, we greatly constrain the possibilities of what
    1.26 +-  would otherwise be an unwieldy exponential search. This extra
    1.27 +-  constraint can be the difference between easily understanding what
    1.28 +-  is happening in a video and being completely lost in a sea of
    1.29 +-  incomprehensible color and movement.
    1.30 +-  
    1.31 +-** Recognizing actions in video is extremely difficult
    1.32 +-
    1.33 +-   Consider for example the problem of determining what is happening
    1.34 +-   in a video of which this is one frame:
    1.35 +-
    1.36 ++** The problem: recognizing actions in video is extremely difficult
    1.37 ++# developing / requires useful representations
    1.38 ++   
    1.39 ++   Examine the following collection of images. As you, and indeed very
    1.40 ++   young children, can easily determine, each one is a picture of
    1.41 ++   someone drinking. 
    1.42 ++
    1.43 ++   # dxh: cat, cup, drinking fountain, rain, straw, coconut
    1.44 +    #+caption: A cat drinking some water. Identifying this action is 
    1.45 +-   #+caption: beyond the state of the art for computers.
    1.46 ++   #+caption: beyond the capabilities of existing computer vision systems.
    1.47 +    #+ATTR_LaTeX: :width 7cm
    1.48 +    [[./images/cat-drinking.jpg]]
    1.49 ++     
    1.50 ++   Nevertheless, it is beyond the state of the art for a computer
    1.51 ++   vision program to describe what's happening in each of these
    1.52 ++   images, or what's common to them. Part of the problem is that many
    1.53 ++   computer vision systems focus on pixel-level details or probability
    1.54 ++   distributions of pixels, with little focus on [...]
    1.55 ++
    1.56 ++
    1.57 ++   In fact, the contents of a scene may have much less to do with pixel
    1.58 ++   probabilities than with recognizing various affordances: things you
    1.59 ++   can move, objects you can grasp, spaces that can be filled
    1.60 ++   (Gibson). For example, what processes might enable you to see the
    1.61 ++   chair in figure \ref{hidden-chair}? 
    1.62 ++   # Or suppose that you are building a program that recognizes chairs.
    1.63 ++   # How could you ``see'' the chair ?
    1.64 +    
    1.65 +-   It is currently impossible for any computer program to reliably
    1.66 +-   label such a video as ``drinking''. And rightly so -- it is a very
    1.67 +-   hard problem! What features can you describe in terms of low level
    1.68 +-   functions of pixels that can even begin to describe at a high level
    1.69 +-   what is happening here?
    1.70 +-  
    1.71 +-   Or suppose that you are building a program that recognizes chairs.
    1.72 +-   How could you ``see'' the chair in figure \ref{hidden-chair}?
    1.73 +-   
    1.74 ++   # dxh: blur chair
    1.75 +    #+caption: The chair in this image is quite obvious to humans, but I 
    1.76 +    #+caption: doubt that any modern computer vision program can find it.
    1.77 +    #+name: hidden-chair
    1.78 +    #+ATTR_LaTeX: :width 10cm
    1.79 +    [[./images/fat-person-sitting-at-desk.jpg]]
    1.80 ++
    1.81 ++
    1.82 ++   
    1.83 ++
    1.84 +    
    1.85 +    Finally, how is it that you can easily tell the difference between
    1.86 +    how the girl's /muscles/ are working in figure \ref{girl}?
    1.87 +@@ -95,10 +92,13 @@
    1.88 +    #+ATTR_LaTeX: :width 7cm
    1.89 +    [[./images/wall-push.png]]
    1.90 +   
    1.91 ++
    1.92 ++
    1.93 ++
    1.94 +    Each of these examples tells us something about what might be going
    1.95 +    on in our minds as we easily solve these recognition problems.
    1.96 +    
    1.97 +-   The hidden chairs show us that we are strongly triggered by cues
    1.98 ++   The hidden chair shows us that we are strongly triggered by cues
    1.99 +    relating to the position of human bodies, and that we can determine
   1.100 +    the overall physical configuration of a human body even if much of
   1.101 +    that body is occluded.
   1.102 +@@ -109,10 +109,107 @@
   1.103 +    most positions, and we can easily project this self-knowledge to
   1.104 +    imagined positions triggered by images of the human body.
   1.105 + 
   1.106 +-** =EMPATH= neatly solves recognition problems  
   1.107 ++** A step forward: the sensorimotor-centered approach
   1.108 ++# ** =EMPATH= recognizes what creatures are doing
   1.109 ++# neatly solves recognition problems  
   1.110 ++   In this thesis, I explore the idea that our knowledge of our own
   1.111 ++   bodies enables us to recognize the actions of others. 
   1.112 ++
   1.113 ++   First, I built a system for constructing virtual creatures with
   1.114 ++   physiologically plausible sensorimotor systems and detailed
   1.115 ++   environments. The result is =CORTEX=, which is described in section
   1.116 ++   \ref{sec-2}. (=CORTEX= was built to be flexible and useful to other
   1.117 ++   AI researchers; it is provided in full with detailed instructions
   1.118 ++   on the web [here].)
   1.119 ++
   1.120 ++   Next, I wrote routines which enabled a simple worm-like creature to
   1.121 ++   infer the actions of a second worm-like creature, using only its
   1.122 ++   own prior sensorimotor experiences and knowledge of the second
   1.123 ++   worm's joint positions. This program, =EMPATH=, is described in
   1.124 ++   section \ref{sec-3}, and the key results of this experiment are
   1.125 ++   summarized below.
   1.126 ++
   1.127 ++  #+caption: From only \emph{proprioceptive} data, =EMPATH= was able to infer 
   1.128 ++  #+caption: the complete sensory experience and classify these four poses.
   1.129 ++  #+caption: The last image is a composite, depicting the intermediate stages of \emph{wriggling}.
   1.130 ++  #+name: worm-recognition-intro-2
   1.131 ++  #+ATTR_LaTeX: :width 15cm
   1.132 ++   [[./images/empathy-1.png]]
   1.133 ++
   1.134 ++   # =CORTEX= provides a language for describing the sensorimotor
   1.135 ++   # experiences of various creatures. 
   1.136 ++
   1.137 ++   # Next, I developed an experiment to test the power of =CORTEX='s
   1.138 ++   # sensorimotor-centered language for solving recognition problems. As
   1.139 ++   # a proof of concept, I wrote routines which enabled a simple
   1.140 ++   # worm-like creature to infer the actions of a second worm-like
   1.141 ++   # creature, using only its own previous sensorimotor experiences and
   1.142 ++   # knowledge of the second worm's joints (figure
   1.143 ++   # \ref{worm-recognition-intro-2}). The result of this proof of
   1.144 ++   # concept was the program =EMPATH=, described in section
   1.145 ++   # \ref{sec-3}. The key results of this
   1.146 ++
   1.147 ++   # Using only first-person sensorimotor experiences and third-person
   1.148 ++   # proprioceptive data, 
   1.149 ++
   1.150 ++*** Key results
   1.151 ++   - After one-shot supervised training, =EMPATH= was able to recognize
   1.152 ++     a wide variety of static poses and dynamic actions --- ranging from
   1.153 ++     curling in a circle to wriggling with a particular frequency ---
   1.154 ++     with 95\% accuracy.
   1.155 ++   - These results were completely independent of viewing angle,
   1.156 ++     because the underlying body-centered language is itself
   1.157 ++     viewpoint-independent; once an action is learned, it can be
   1.158 ++     recognized equally well from any angle (see the sketch below).
   1.159 ++   - =EMPATH= is surprisingly short; the sensorimotor-centered
   1.160 ++     language provided by =CORTEX= resulted in extremely economical
   1.161 ++     recognition routines --- about 0000 lines in all --- suggesting
   1.162 ++     that such representations are very powerful, and often
   1.163 ++     indispensable for the types of recognition tasks considered here.
   1.164 ++   - Although, for expediency's sake, I relied on direct knowledge of
   1.165 ++     joint positions in this proof of concept, it would be
   1.166 ++     straightforward to extend =EMPATH= so that it (more
   1.167 ++     realistically) infers joint positions from its visual data.
   1.168 ++
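         ++   As a rough, purely illustrative sketch of why body-centered
         ++   predicates are viewpoint-independent (=curled?= and the joint-angle
         ++   vectors below are illustrative names, not the actual =EMPATH=
         ++   routines):
         ++
         ++   #+begin_src clojure
         ++     ;; A body-centered predicate sees only joint angles, never world
         ++     ;; coordinates, so rotating or translating the whole worm leaves
         ++     ;; the answer unchanged.
         ++     (defn curled?
         ++       "True when every joint is flexed past a threshold (radians)."
         ++       [joint-angles]
         ++       (every? (fn [angle] (> angle 0.5)) joint-angles))
         ++
         ++     (curled? [0.7 0.9 0.8 0.6]) ; => true
         ++     (curled? [0.1 0.0 0.2 0.1]) ; => false
         ++   #+end_src
         ++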
   1.169 ++# because the underlying language is fundamentally orientation-independent
   1.170 ++
   1.171 ++# recognize the actions of a worm with 95\% accuracy. The
   1.172 ++#      recognition tasks 
   1.173 +    
   1.174 +-   I propose a system that can express the types of recognition
   1.175 +-   problems above in a form amenable to computation. It is split into
   1.176 ++
   1.177 ++
   1.178 ++
   1.179 ++   [Talk about these results and what you find promising about them]
   1.180 ++
   1.181 ++** Roadmap
   1.182 ++   [I'm going to explain how =CORTEX= works, then break down how
   1.183 ++   =EMPATH= does its thing. Because the details reveal such-and-such
   1.184 ++   about the approach.]
   1.185 ++
   1.186 ++   # The success of this simple proof-of-concept offers a tantalizing
   1.187 ++
   1.188 ++
   1.189 ++   # explore the idea 
   1.190 ++   # The key contribution of this thesis is the idea that body-centered
   1.191 ++   # representations (which express 
   1.192 ++
   1.193 ++
   1.194 ++   # the
   1.195 ++   # body-centered approach --- in which I try to determine what's
   1.196 ++   # happening in a scene by bringing it into registration with my own
   1.197 ++   # bodily experiences --- are indispensable for recognizing what
   1.198 ++   # creatures are doing in a scene.
   1.199 ++
   1.200 ++* COMMENT
   1.201 ++# body-centered language
   1.202 ++   
   1.203 ++   In this thesis, I'll describe =EMPATH=, which solves a certain
   1.204 ++   class of recognition problems 
   1.205 ++
   1.206 ++   The key idea is to use self-centered (or first-person) language.
   1.207 ++
   1.208 ++   I have built a system that can express the types of recognition
   1.209 ++   problems in a form amenable to computation. It is split into
   1.210 +    four parts:
   1.211 + 
   1.212 +    - Free/Guided Play :: The creature moves around and experiences the
   1.213 +@@ -286,14 +383,14 @@
   1.214 +      code to create a creature, and can use a wide library of
   1.215 +      pre-existing blender models as a base for your own creatures.
   1.216 + 
   1.217 +-   - =CORTEX= implements a wide variety of senses, including touch,
   1.218 ++   - =CORTEX= implements a wide variety of senses: touch,
   1.219 +      proprioception, vision, hearing, and muscle tension. Complicated
   1.220 +     senses like touch and vision involve multiple sensory elements
   1.221 +      embedded in a 2D surface. You have complete control over the
   1.222 +      distribution of these sensor elements through the use of simple
   1.223 +      png image files. In particular, =CORTEX= implements more
   1.224 +      comprehensive hearing than any other creature simulation system
   1.225 +-     available. 
   1.226 ++     available.
   1.227 + 
   1.228 +    - =CORTEX= supports any number of creatures and any number of
   1.229 +     senses. Time in =CORTEX= dilates so that the simulated creatures
   1.230 +@@ -353,7 +450,24 @@
   1.231 +    \end{sidewaysfigure}
   1.232 + #+END_LaTeX
   1.233 + 
   1.234 +-** Contributions
   1.235 ++** Road map
   1.236 ++
   1.237 ++  By the end of this thesis, you will have seen a novel approach to
   1.238 ++  interpreting video using embodiment and empathy. You will have also
   1.239 ++  seen one way to efficiently implement empathy for embodied
   1.240 ++  creatures. Finally, you will become familiar with =CORTEX=, a system
   1.241 ++  for designing and simulating creatures with rich senses, which you
   1.242 ++  may choose to use in your own research.
   1.243 ++  
   1.244 ++  This is the core vision of my thesis: That one of the important ways
   1.245 ++  in which we understand others is by imagining ourselves in their
   1.246 ++  position and empathically feeling experiences relative to our own
   1.247 ++  bodies. By understanding events in terms of our own previous
   1.248 ++  corporeal experience, we greatly constrain the possibilities of what
   1.249 ++  would otherwise be an unwieldy exponential search. This extra
   1.250 ++  constraint can be the difference between easily understanding what
   1.251 ++  is happening in a video and being completely lost in a sea of
   1.252 ++  incomprehensible color and movement.
   1.253 + 
   1.254 +    - I built =CORTEX=, a comprehensive platform for embodied AI
   1.255 +      experiments. =CORTEX= supports many features lacking in other
   1.256 +@@ -363,18 +477,22 @@
   1.257 +    - I built =EMPATH=, which uses =CORTEX= to identify the actions of
   1.258 +      a worm-like creature using a computational model of empathy.
   1.259 +    
   1.260 +-* Building =CORTEX=
   1.261 +-
   1.262 +-  I intend for =CORTEX= to be used as a general-purpose library for
   1.263 +-  building creatures and outfitting them with senses, so that it will
   1.264 +-  be useful for other researchers who want to test out ideas of their
   1.265 +-  own. To this end, wherver I have had to make archetictural choices
   1.266 +-  about =CORTEX=, I have chosen to give as much freedom to the user as
   1.267 +-  possible, so that =CORTEX= may be used for things I have not
   1.268 +-  forseen.
   1.269 +-
   1.270 +-** Simulation or Reality?
   1.271 +-   
   1.272 ++
   1.273 ++* Designing =CORTEX=
   1.274 ++  In this section, I outline the design decisions that went into
   1.275 ++  making =CORTEX=, along with some details about its
   1.276 ++  implementation. (A practical guide to getting started with =CORTEX=,
   1.277 ++  which skips over the history and implementation details presented
   1.278 ++  here, is provided in an appendix \ref{} at the end of this paper.)
   1.279 ++
   1.280 ++  Throughout this project, I intended for =CORTEX= to be flexible and
   1.281 ++  extensible enough to be useful for other researchers who want to
   1.282 ++  test out ideas of their own. To this end, wherever I have had to make
   1.283 ++  architectural choices about =CORTEX=, I have chosen to give as much
   1.284 ++  freedom to the user as possible, so that =CORTEX= may be used for
   1.285 ++  things I have not foreseen.
   1.286 ++
   1.287 ++** Building in simulation versus reality
   1.288 +    The most important architectural decision of all is the choice to
   1.289 +    use a computer-simulated environment in the first place! The world
   1.290 +    is a vast and rich place, and for now simulations are a very poor
   1.291 +@@ -436,7 +554,7 @@
   1.292 +     doing everything in software is far cheaper than building custom
   1.293 +     real-time hardware. All you need is a laptop and some patience.
   1.294 + 
   1.295 +-** Because of Time, simulation is perferable to reality
   1.296 ++** Simulated time enables rapid prototyping and complex scenes 
   1.297 + 
   1.298 +    I envision =CORTEX= being used to support rapid prototyping and
   1.299 +    iteration of ideas. Even if I could put together a well constructed
   1.300 +@@ -459,8 +577,8 @@
   1.301 +    simulations of very simple creatures in =CORTEX= generally run at
   1.302 +    40x on my machine!
   1.303 + 
   1.304 +-** What is a sense?
   1.305 +-   
   1.306 ++** All sense organs are two-dimensional surfaces
   1.307 ++# What is a sense?   
   1.308 +    If =CORTEX= is to support a wide variety of senses, it would help
   1.309 +    to have a better understanding of what a ``sense'' actually is!
   1.310 +    While vision, touch, and hearing all seem like they are quite
   1.311 +@@ -956,7 +1074,7 @@
   1.312 +     #+ATTR_LaTeX: :width 15cm
   1.313 +     [[./images/physical-hand.png]]
   1.314 + 
   1.315 +-** Eyes reuse standard video game components
   1.316 ++** Sight reuses standard video game components...
   1.317 + 
   1.318 +    Vision is one of the most important senses for humans, so I need to
   1.319 +    build a simulated sense of vision for my AI. I will do this with
   1.320 +@@ -1257,8 +1375,8 @@
   1.321 +     community and is now (in modified form) part of a system for
   1.322 +     capturing in-game video to a file.
   1.323 + 
   1.324 +-** Hearing is hard; =CORTEX= does it right
   1.325 +-   
   1.326 ++** ...but hearing must be built from scratch
   1.327 ++# is hard; =CORTEX= does it right
   1.328 +    At the end of this section I will have simulated ears that work the
   1.329 +    same way as the simulated eyes in the last section. I will be able to
   1.330 +    place any number of ear-nodes in a blender file, and they will bind to
   1.331 +@@ -1565,7 +1683,7 @@
   1.332 +     jMonkeyEngine3 community and is used to record audio for demo
   1.333 +     videos.
   1.334 + 
   1.335 +-** Touch uses hundreds of hair-like elements
   1.336 ++** Hundreds of hair-like elements provide a sense of touch
   1.337 + 
   1.338 +    Touch is critical to navigation and spatial reasoning and as such I
   1.339 +    need a simulated version of it to give to my AI creatures.
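         ++
         ++   To make the idea in the heading above concrete, here is a rough,
         ++   purely illustrative sketch of the feeler approach: each sensor
         ++   point casts a short ray along its surface normal and reports how
         ++   close any contact is. The names and data layout are illustrative
         ++   simplifications, not the =CORTEX= implementation.
         ++
         ++   #+begin_src clojure
         ++     ;; feelers: a sequence of [point normal] pairs on the creature's
         ++     ;; skin.  ray-hit-distance: an assumed helper that returns the
         ++     ;; distance to the nearest obstacle along a ray, or nil for none.
         ++     (defn touch-activation
         ++       "Return, for each feeler, 1.0 scaled by proximity of contact
         ++        within feeler-length, or 0.0 when nothing is touched."
         ++       [feelers feeler-length ray-hit-distance]
         ++       (for [[point normal] feelers]
         ++         (if-let [d (ray-hit-distance point normal)]
         ++           (if (< d feeler-length) (- 1.0 (/ d feeler-length)) 0.0)
         ++           0.0)))
         ++   #+end_src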
   1.340 +@@ -2059,7 +2177,7 @@
   1.341 +     #+ATTR_LaTeX: :width 15cm
   1.342 +     [[./images/touch-cube.png]]
   1.343 + 
   1.344 +-** Proprioception is the sense that makes everything ``real''
   1.345 ++** Proprioception provides knowledge of your own body's position
   1.346 + 
   1.347 +    Close your eyes, and touch your nose with your right index finger.
   1.348 +    How did you do it? You could not see your hand, and neither your
   1.349 +@@ -2193,7 +2311,7 @@
   1.350 +     #+ATTR_LaTeX: :width 11cm
   1.351 +     [[./images/proprio.png]]
   1.352 + 
   1.353 +-** Muscles are both effectors and sensors
   1.354 ++** Muscles contain both sensors and effectors
   1.355 + 
   1.356 +    Surprisingly enough, terrestrial creatures only move by using
   1.357 +    torque applied about their joints. There's not a single straight
   1.358 +@@ -2440,7 +2558,8 @@
   1.359 +         hard control problems without worrying about physics or
   1.360 +         senses.
   1.361 + 
   1.362 +-* Empathy in a simulated worm
   1.363 ++* =EMPATH=: the simulated worm experiment
   1.364 ++# Empathy in a simulated worm
   1.365 + 
   1.366 +   Here I develop a computational model of empathy, using =CORTEX= as a
   1.367 +   base. Empathy in this context is the ability to observe another
   1.368 +@@ -2732,7 +2851,7 @@
   1.369 +    provided by an experience vector and reliably inferring the rest of
   1.370 +    the senses.
   1.371 + 
   1.372 +-** Empathy is the process of tracing though \Phi-space 
   1.373 ++** ``Empathy'' requires retracing steps through \Phi-space
   1.374 + 
   1.375 +    Here is the core of a basic empathy algorithm, starting with an
   1.376 +    experience vector:
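         ++
         ++   (Before the full listing, a rough, purely illustrative sketch of
         ++   the kind of lookup involved: a nearest-neighbor completion over
         ++   stored experiences. The names =phi-space=, =proprio-distance=, and
         ++   the map keys below are illustrative, not the actual listing.)
         ++
         ++   #+begin_src clojure
         ++     ;; phi-space: a vector of past experience maps, each holding
         ++     ;; :proprioception along with the other senses felt at that
         ++     ;; instant.
         ++     (defn infer-experience
         ++       "Return the stored experience whose proprioceptive signature is
         ++        closest to the observed one, completing the missing senses."
         ++       [phi-space proprio-distance observed-proprio]
         ++       (apply min-key
         ++              (fn [experience]
         ++                (proprio-distance (:proprioception experience)
         ++                                  observed-proprio))
         ++              phi-space))
         ++   #+end_src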
   1.377 +@@ -2888,7 +3007,7 @@
   1.378 +    #+end_src
   1.379 +    #+end_listing
   1.380 +   
   1.381 +-** Efficient action recognition with =EMPATH=
   1.382 ++** =EMPATH= recognizes actions efficiently
   1.383 +    
   1.384 +    To use =EMPATH= with the worm, I first need to gather a set of
   1.385 +    experiences from the worm that includes the actions I want to
   1.386 +@@ -3044,9 +3163,9 @@
   1.387 +   to interpretation, and disagreement between empathy and experience
   1.388 +   is more excusable.
   1.389 + 
   1.390 +-** Digression: bootstrapping touch using free exploration
   1.391 +-
   1.392 +-   In the previous section I showed how to compute actions in terms of
   1.393 ++** Digression: learning the touch-sensor layout through haptic experimentation
   1.394 ++# Bootstrapping touch using free exploration
   1.395 ++   In the previous section I showed how to compute actions in terms of
   1.396 +    body-centered predicates which relied on the average touch activation of
   1.397 +    pre-defined regions of the worm's skin. What if, instead of receiving
   1.398 +    touch pre-grouped into the six faces of each worm segment, the true