diff thesis/cortex.org @ 441:c20de2267d39
completing first third of first chapter.
| author   | Robert McIntyre <rlm@mit.edu>   |
| date     | Mon, 24 Mar 2014 20:59:35 -0400 |
| parents  | b01c070b03d4                    |
| children | eaf8c591372b                    |
--- a/thesis/cortex.org	Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cortex.org	Mon Mar 24 20:59:35 2014 -0400
@@ -10,104 +10,271 @@
   By the end of this thesis, you will have seen a novel approach to
   interpreting video using embodiment and empathy. You will have also
   seen one way to efficiently implement empathy for embodied
-  creatures.
+  creatures. Finally, you will become familiar with =CORTEX=, a
+  system for designing and simulating creatures with rich senses,
+  which you may choose to use in your own research.
 
-  The core vision of this thesis is that one of the important ways in
-  which we understand others is by imagining ourselves in their
-  posistion and empathicaly feeling experiences based on our own past
-  experiences and imagination.
-
-  By understanding events in terms of our own previous corperal
-  experience, we greatly constrain the possibilities of what would
-  otherwise be an unweidly exponential search. This extra constraint
-  can be the difference between easily understanding what is happening
-  in a video and being completely lost in a sea of incomprehensible
-  color and movement.
+  This is the core vision of my thesis: that one of the important ways
+  in which we understand others is by imagining ourselves in their
+  position and empathically feeling experiences relative to our own
+  bodies. By understanding events in terms of our own previous
+  corporeal experience, we greatly constrain the possibilities of what
+  would otherwise be an unwieldy exponential search. This extra
+  constraint can be the difference between easily understanding what
+  is happening in a video and being completely lost in a sea of
+  incomprehensible color and movement.
 
 ** Recognizing actions in video is extremely difficult
 
-  Consider for example the problem of determining what is happening in
-  a video of which this is one frame:
+  Consider, for example, the problem of determining what is happening
+  in a video of which this is one frame:
 
-  #+caption: A cat drinking some water. Identifying this action is
-  #+caption: beyond the state of the art for computers.
-  #+ATTR_LaTeX: :width 7cm
-  [[./images/cat-drinking.jpg]]
+  #+caption: A cat drinking some water. Identifying this action is
+  #+caption: beyond the state of the art for computers.
+  #+ATTR_LaTeX: :width 7cm
+  [[./images/cat-drinking.jpg]]
+
+  It is currently impossible for any computer program to reliably
+  label such a video as "drinking". And rightly so -- it is a very
+  hard problem! What features can you describe in terms of low-level
+  functions of pixels that can even begin to describe at a high level
+  what is happening here?
 
-  It is currently impossible for any computer program to reliably
-  label such an video as "drinking". And rightly so -- it is a very
-  hard problem! What features can you describe in terms of low level
-  functions of pixels that can even begin to describe what is
-  happening here?
+  Or suppose that you are building a program that recognizes chairs.
+  How could you ``see'' the chair in figure \ref{invisible-chair} and
+  figure \ref{hidden-chair}?
+
+  #+caption: When you look at this, do you think ``chair''? I certainly do.
+  #+name: invisible-chair
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/invisible-chair.png]]
+
+  #+caption: The chair in this image is quite obvious to humans, but I
+  #+caption: doubt that any computer program can find it.
+  #+name: hidden-chair
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/fat-person-sitting-at-desk.jpg]]
+
+  Finally, how is it that you can easily tell the difference between
+  how the girl's /muscles/ are working in figure \ref{girl}?
+
+  #+caption: The mysterious ``common sense'' appears here as you are able
+  #+caption: to discern the difference in how the girl's arm muscles
+  #+caption: are activated between the two images.
+  #+name: girl
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/wall-push.png]]
 
-  Or suppose that you are building a program that recognizes
-  chairs. How could you ``see'' the chair in the following pictures?
+  Each of these examples tells us something about what might be going
+  on in our minds as we easily solve these recognition problems.
+
+  The hidden chairs show us that we are strongly triggered by cues
+  relating to the position of human bodies, and that we can
+  determine the overall physical configuration of a human body even
+  if much of that body is occluded.
 
-  #+caption: When you look at this, do you think ``chair''? I certainly do.
-  #+ATTR_LaTeX: :width 10cm
-  [[./images/invisible-chair.png]]
+  The picture of the girl pushing against the wall tells us that we
+  have common sense knowledge about the kinetics of our own bodies.
+  We know well how our muscles would have to work to maintain us in
+  most positions, and we can easily project this self-knowledge to
+  imagined positions triggered by images of the human body.
+
+** =EMPATH= neatly solves recognition problems
+
+  I propose a system that can express the types of recognition
+  problems above in a form amenable to computation. It is split into
+  four parts:
+
+  - Free/Guided Play :: The creature moves around and experiences the
+       world through its unique perspective. Many otherwise
+       complicated actions are easily described in the language of a
+       full suite of body-centered, rich senses. For example,
+       drinking is the feeling of water sliding down your throat, and
+       cooling your insides. It's often accompanied by bringing your
+       hand close to your face, or bringing your face close to
+       water. Sitting down is the feeling of bending your knees,
+       activating your quadriceps, then feeling a surface with your
+       bottom and relaxing your legs. These body-centered action
+       descriptions can be either learned or hard coded.
+  - Alignment :: When trying to interpret a video or image, the
+       creature takes a model of itself and aligns it with whatever
+       it sees. This can be a rather loose alignment that can cross
+       species, as when humans try to align themselves with things
+       like ponies, dogs, or other humans with a different body type.
+  - Empathy :: The alignment triggers the memories of previous
+       experience. For example, the alignment itself easily maps to
+       proprioceptive data. Any sounds or obvious skin contact in the
+       video can, to a lesser extent, trigger previous experience.
+       The creature's previous experience is chained together in
+       short bursts to coherently describe the new scene.
+  - Recognition :: With the scene now described in terms of past
+       experience, the creature can now run its action-identification
+       programs on this synthesized sensory data, just as it would if
+       it were actually experiencing the scene first-hand. If
+       previous experience has been accurately retrieved, and if it
+       is analogous enough to the scene, then the creature will
+       correctly identify the action in the scene.
+
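+  To make the Empathy step a little more concrete, one can picture
+  the retrieval of previous experience as a nearest-neighbor lookup
+  keyed on proprioceptive data: the aligned posture selects the
+  stored experiences that ``feel'' closest. The following is only an
+  illustrative sketch, not code from =EMPATH=; the =:proprioception=
+  key, the flat vector of joint angles, and the Euclidean distance
+  are stand-in assumptions for this example.
+
+  #+begin_src clojure
+;; Illustrative sketch only (not part of EMPATH). Assumes each
+;; experience is a map holding a flat vector of joint angles under
+;; :proprioception, and uses Euclidean distance as a stand-in
+;; similarity measure.
+(defn proprioceptive-distance
+  "Euclidean distance between two vectors of joint angles."
+  [a b]
+  (Math/sqrt (reduce + (map #(Math/pow (- %1 %2) 2) a b))))
+
+(defn recall-similar
+  "Return the k stored experiences whose proprioceptive data most
+  closely matches the aligned posture."
+  [experiences posture k]
+  (take k (sort-by #(proprioceptive-distance posture (:proprioception %))
+                   experiences)))
+  #+end_src
+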
+  For example, I think humans are able to label the cat video as
+  "drinking" because they imagine /themselves/ as the cat, and
+  imagine putting their face up against a stream of water and
+  sticking out their tongue. In that imagined world, they can feel
+  the cool water hitting their tongue, and feel the water entering
+  their body, and are able to recognize that /feeling/ as
+  drinking. So, the label of the action is not really in the pixels
+  of the image, but is found clearly in a simulation inspired by
+  those pixels. An imaginative system, having been trained on
+  drinking and non-drinking examples and learning that the most
+  important component of drinking is the feeling of water sliding
+  down one's throat, would analyze a video of a cat drinking in the
+  following manner:
+
+  1. Create a physical model of the video by putting a "fuzzy" model
+     of its own body in place of the cat. Possibly also create a
+     simulation of the stream of water.
+
+  2. Play out this simulated scene and generate imagined sensory
+     experience. This will include relevant muscle contractions, a
+     close up view of the stream from the cat's perspective, and most
+     importantly, the imagined feeling of water entering the
+     mouth. The imagined sensory experience can come both from a
+     simulation of the event and from pattern-matching against
+     previous, similar embodied experience.
+
+  3. The action is now easily identified as drinking by the sense of
+     taste alone. The other senses (such as the tongue moving in and
+     out) help to give plausibility to the simulated action. Note that
+     the sense of vision, while critical in creating the simulation,
+     is not critical for identifying the action from the simulation.
+
+  For the chair examples, the process is even easier:
+
+  1. Align a model of your body to the person in the image.
+
+  2. Generate proprioceptive sensory data from this alignment.
 
-  #+caption: The chair in this image is quite obvious to humans, but I
-  #+caption: doubt that any computer program can find it.
-  #+ATTR_LaTeX: :width 10cm
-  [[./images/fat-person-sitting-at-desk.jpg]]
+  3. Use the imagined proprioceptive data as a key to look up related
+     sensory experience associated with that particular proprioceptive
+     feeling.
 
-  Finally, how is it that you can easily tell the difference between
-  how the girls /muscles/ are working in \ref{girl}?
+  4. Retrieve the feeling of your bottom resting on a surface and
+     your leg muscles relaxed.
 
-  #+caption: The mysterious ``common sense'' appears here as you are able
-  #+caption: to ``see'' the difference in how the girl's arm muscles
-  #+caption: are activated differently in the two images.
-  #+name: girl
-  #+ATTR_LaTeX: :width 10cm
-  [[./images/wall-push.png]]
-
+  5. This sensory information is consistent with the =sitting?=
+     sensory predicate, so you (and the entity in the image) must be
+     sitting.
 
-  These problems are difficult because the language of pixels is far
-  removed from what we would consider to be an acceptable description
-  of the events in these images. In order to process them, we must
-  raise the images into some higher level of abstraction where their
-  descriptions become more similar to how we would describe them in
-  English. The question is, how can we raise
-
+  6. There must be a chair-like object since you are sitting.
 
-  I think humans are able to label such video as "drinking" because
-  they imagine /themselves/ as the cat, and imagine putting their face
-  up against a stream of water and sticking out their tongue. In that
-  imagined world, they can feel the cool water hitting their tongue,
-  and feel the water entering their body, and are able to recognize
-  that /feeling/ as drinking. So, the label of the action is not
-  really in the pixels of the image, but is found clearly in a
-  simulation inspired by those pixels. An imaginative system, having
-  been trained on drinking and non-drinking examples and learning that
-  the most important component of drinking is the feeling of water
-  sliding down one's throat, would analyze a video of a cat drinking
-  in the following manner:
+  Empathy offers yet another alternative to the age-old AI
+  representation question: ``What is a chair?'' --- A chair is the
+  feeling of sitting.
+
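+  As a rough sketch (not code from =EMPATH=), such a =sitting?=
+  predicate might look like the following, written in the same style
+  as the worm predicates shown below. The touch region =seat-region=,
+  the joint and muscle names, and the thresholds are all invented for
+  illustration; only =contact= corresponds to a helper actually used
+  by the worm code.
+
+  #+begin_src clojure
+;; Hypothetical sketch. The body regions, joint names, and thresholds
+;; are invented; `contact` stands in for the touch-summarizing helper
+;; used by the worm predicates.
+(defn- bent? [angle] (< (/ Math/PI 3) angle))     ; knee noticeably flexed
+(defn- relaxed? [activation] (> 0.2 activation))  ; muscle nearly quiet
+
+(defn sitting?
+  "Is the (imagined) body sitting: weight on the seat, knees bent,
+  leg muscles relaxed?"
+  [experiences]
+  (let [{:keys [touch proprioception muscle]} (peek experiences)]
+    (and (< 0.9 (contact seat-region touch))
+         (every? bent? (map proprioception [:left-knee :right-knee]))
+         (every? relaxed? (map muscle [:left-quadriceps :right-quadriceps])))))
+  #+end_src
+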
+  My program, =EMPATH=, uses this empathic problem-solving technique
+  to interpret the actions of a simple, worm-like creature.
 
-  - Create a physical model of the video by putting a "fuzzy" model
-    of its own body in place of the cat. Also, create a simulation of
-    the stream of water.
+  #+caption: The worm performs many actions during free play such as
+  #+caption: curling, wiggling, and resting.
+  #+name: worm-intro
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/wall-push.png]]
 
-  - Play out this simulated scene and generate imagined sensory
-    experience. This will include relevant muscle contractions, a
-    close up view of the stream from the cat's perspective, and most
-    importantly, the imagined feeling of water entering the mouth.
+  #+caption: This sensory predicate detects when the worm is resting on the
+  #+caption: ground.
+  #+name: resting-intro
+  #+begin_listing clojure
+  #+begin_src clojure
+(defn resting?
+  "Is the worm resting on the ground?"
+  [experiences]
+  (every?
+   (fn [touch-data]
+     (< 0.9 (contact worm-segment-bottom touch-data)))
+   (:touch (peek experiences))))
+  #+end_src
+  #+end_listing
 
-  - The action is now easily identified as drinking by the sense of
-    taste alone. The other senses (such as the tongue moving in and
-    out) help to give plausibility to the simulated action. Note that
-    the sense of vision, while critical in creating the simulation,
-    is not critical for identifying the action from the simulation.
+  #+caption: Body-centered actions are best expressed in a body-centered
+  #+caption: language. This code detects when the worm has curled into a
+  #+caption: full circle. Imagine how you would replicate this functionality
+  #+caption: using low-level pixel features such as HOG filters!
+  #+name: grand-circle-intro
+  #+begin_listing clojure
+  #+begin_src clojure
+(defn grand-circle?
+  "Does the worm form a majestic circle (one end touching the other)?"
+  [experiences]
+  (and (curled? experiences)
+       (let [worm-touch (:touch (peek experiences))
+             tail-touch (worm-touch 0)
+             head-touch (worm-touch 4)]
+         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
+              (< 0.55 (contact worm-segment-top-tip head-touch))))))
+  #+end_src
+  #+end_listing
 
-  cat drinking, mimes, leaning, common sense
+  #+caption: Even complicated actions such as ``wiggling'' are fairly simple
+  #+caption: to describe with a rich enough language.
+  #+name: wiggling-intro
+  #+begin_listing clojure
+  #+begin_src clojure
+(defn wiggling?
+  "Is the worm wiggling?"
+  [experiences]
+  (let [analysis-interval 0x40]
+    (when (> (count experiences) analysis-interval)
+      (let [a-flex 3
+            a-ex   2
+            muscle-activity
+            (map :muscle (vector:last-n experiences analysis-interval))
+            base-activity
+            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
+        (= 2
+           (first
+            (max-indexed
+             (map #(Math/abs %)
+                  (take 20 (fft base-activity))))))))))
+  #+end_src
+  #+end_listing
 
-** =EMPATH= neatly solves recognition problems
+  #+caption: The actions of a worm in a video can be recognized by
+  #+caption: proprioceptive data and sensory predicates by filling
+  #+caption: in the missing sensory detail with previous experience.
+  #+name: worm-recognition-intro
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/wall-push.png]]
 
-  factorization , right language, etc
 
-  a new possibility for the question ``what is a chair?'' -- it's the
-  feeling of your butt on something and your knees bent, with your
-  back muscles and legs relaxed.
+
+  One powerful advantage of empathic problem solving is that it
+  factors the action recognition problem into two easier problems. To
+  use empathy, you need an /aligner/, which takes the video and a
+  model of your body, and aligns the model with the video. Then, you
+  need a /recognizer/, which uses the aligned model to interpret the
+  action. The power in this method lies in the fact that you describe
+  all actions from a body-centered, rich viewpoint. This way, if you
+  teach the system what ``running'' is, and you have a good enough
+  aligner, the system will from then on be able to recognize running
+  from any point of view, even strange points of view like above or
+  underneath the runner. This is in contrast to action recognition
+  schemes that try to identify actions using a non-embodied approach
+  such as TODO:REFERENCE. If these systems learn about running as
+  viewed from the side, they will not automatically be able to
+  recognize running from any other viewpoint.
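+
+  The factorization itself is small enough to sketch. The composition
+  below is only schematic and is not code from =EMPATH=: the aligner,
+  the experience-filling step, and the predicate map are passed in as
+  arguments rather than named after anything in the real system.
+
+  #+begin_src clojure
+;; Schematic of the aligner/recognizer factorization. `align` maps a
+;; body model onto a video, `imagine-experience` fills in sensory
+;; detail from past experience, and `action-predicates` is a map from
+;; action name to a body-centered predicate.
+(defn make-empathic-recognizer
+  [align imagine-experience action-predicates]
+  (fn [body-model video]
+    (let [posture     (align body-model video)       ; aligner
+          experiences (imagine-experience posture)]  ; empathy
+      (set (for [[action-name action?] action-predicates
+                 :when (action? experiences)]
+             action-name)))))
+  #+end_src
+
+  Here the aligner and the body-centered predicates vary
+  independently, which is exactly what lets a single ``running''
+  predicate keep working no matter the viewpoint.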
+
+  Another powerful advantage is that using the language of multiple
+  body-centered rich senses to describe body-centered actions offers a
+  massive boost in descriptive capability. Consider how difficult it
+  would be to compose a set of HOG filters to describe the action of
+  a simple worm-creature "curling" so that its head touches its tail,
+  and then behold the simplicity of describing this action in a
+  language designed for the task (listing \ref{grand-circle-intro}).
+
 
 ** =CORTEX= is a toolkit for building sensate creatures
 
@@ -151,7 +318,7 @@
 
 ** Empathy is the process of tracing through \Phi-space
 
-** Efficient action recognition =EMPATH=
+** Efficient action recognition with =EMPATH=
 
 * Contributions
   - Built =CORTEX=, a comprehensive platform for embodied AI