changeset 441:c20de2267d39
completing first third of first chapter.
| author   | Robert McIntyre <rlm@mit.edu> |
| date     | Mon, 24 Mar 2014 20:59:35 -0400 |
| parents  | b01c070b03d4 |
| children | eaf8c591372b |
| files    | thesis/abstract.org thesis/cortex.org thesis/cover.tex thesis/rlm-cortex-meng.tex thesis/to-frames.pl |
| diffstat | 5 files changed, 268 insertions(+), 86 deletions(-) |
--- a/thesis/abstract.org  Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/abstract.org  Mon Mar 24 20:59:35 2014 -0400
@@ -6,11 +6,11 @@
   curling and wiggling.
 
   To attack the action recognition problem, I developed a computational
- model of empathy (=EMPATH=) which allows me to use simple, embodied
- representations of actions (which require rich sensory data), even
- when that sensory data is not actually available. The missing sense
- data is ``imagined'' by the system by combining previous experiences
- gained from unsupervised free play.
+ model of empathy (=EMPATH=) which allows me to recognize actions using
+ simple, embodied representations of actions (which require rich
+ sensory data), even when that sensory data is not actually
+ available. The missing sense data is ``imagined'' by the system by
+ combining previous experiences gained from unsupervised free play.
 
   In order to build this empathic, action-recognizing system, I created
   a program called =CORTEX=, which is a complete platform for embodied
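The abstract's central move -- ``imagining'' the missing sense data by combining previous experiences gained from free play -- can be pictured, in its very simplest form, as a nearest-neighbor lookup from the one sense the video does provide (proprioception, recovered by alignment) into full remembered experience. The sketch below only illustrates that idea: the function names, the flat feature vectors, and the Euclidean distance are my assumptions, not the thesis's actual \Phi-space machinery.

#+begin_src clojure
;; Hypothetical sketch, not code from this changeset: "imagine" the
;; missing senses by finding the stored experience whose proprioceptive
;; signature is closest to the one recovered from the video, and borrow
;; its other senses.

(defn proprio-distance
  "Euclidean distance between two equal-length proprioceptive feature
   vectors (e.g. joint angles).  The flat-vector encoding is an
   assumption made for this sketch."
  [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(defn imagine
  "Given a non-empty collection of past experience maps (each assumed
   to hold a :proprioception vector plus other sense keys such as
   :touch and :muscle) and an observed proprioceptive vector, return
   the most similar past experience; its other senses stand in for the
   ones the video cannot provide."
  [experiences observed-proprioception]
  (apply min-key
         #(proprio-distance observed-proprioception (:proprioception %))
         experiences))
#+end_src

In the system described in cortex.org below, short bursts of previous experience are chained together to describe a whole scene rather than a single nearest neighbor being returned, but the lookup-by-partial-sense-data idea is the same.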
--- a/thesis/cortex.org  Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cortex.org  Mon Mar 24 20:59:35 2014 -0400
@@ -10,104 +10,271 @@
   By the end of this thesis, you will have seen a novel approach to
   interpreting video using embodiment and empathy. You will have also
   seen one way to efficiently implement empathy for embodied
- creatures.
+  creatures. Finally, you will become familiar with =CORTEX=, a
+  system for designing and simulating creatures with rich senses,
+  which you may choose to use in your own research.
 
- The core vision of this thesis is that one of the important ways in
- which we understand others is by imagining ourselves in their
- posistion and empathicaly feeling experiences based on our own past
- experiences and imagination.
-
- By understanding events in terms of our own previous corperal
- experience, we greatly constrain the possibilities of what would
- otherwise be an unweidly exponential search. This extra constraint
- can be the difference between easily understanding what is happening
- in a video and being completely lost in a sea of incomprehensible
- color and movement.
+  This is the core vision of my thesis: that one of the important ways
+  in which we understand others is by imagining ourselves in their
+  position and empathically feeling experiences relative to our own
+  bodies. By understanding events in terms of our own previous
+  corporeal experience, we greatly constrain the possibilities of what
+  would otherwise be an unwieldy exponential search. This extra
+  constraint can be the difference between easily understanding what
+  is happening in a video and being completely lost in a sea of
+  incomprehensible color and movement.
 
 ** Recognizing actions in video is extremely difficult
 
- Consider for example the problem of determining what is happening in
- a video of which this is one frame:
+  Consider for example the problem of determining what is happening in
+  a video of which this is one frame:
 
- #+caption: A cat drinking some water. Identifying this action is
- #+caption: beyond the state of the art for computers.
- #+ATTR_LaTeX: :width 7cm
- [[./images/cat-drinking.jpg]]
+  #+caption: A cat drinking some water. Identifying this action is
+  #+caption: beyond the state of the art for computers.
+  #+ATTR_LaTeX: :width 7cm
+  [[./images/cat-drinking.jpg]]
+
+  It is currently impossible for any computer program to reliably
+  label such a video as "drinking". And rightly so -- it is a very
+  hard problem! What features can you describe in terms of low level
+  functions of pixels that can even begin to describe at a high level
+  what is happening here?
 
- It is currently impossible for any computer program to reliably
- label such an video as "drinking". And rightly so -- it is a very
- hard problem! What features can you describe in terms of low level
- functions of pixels that can even begin to describe what is
- happening here?
+  Or suppose that you are building a program that recognizes
+  chairs. How could you ``see'' the chair in figure
+  \ref{invisible-chair} and figure \ref{hidden-chair}?
+
+  #+caption: When you look at this, do you think ``chair''? I certainly do.
+  #+name: invisible-chair
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/invisible-chair.png]]
+
+  #+caption: The chair in this image is quite obvious to humans, but I
+  #+caption: doubt that any computer program can find it.
+  #+name: hidden-chair
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/fat-person-sitting-at-desk.jpg]]
+
+  Finally, how is it that you can easily tell the difference between
+  how the girl's /muscles/ are working in figure \ref{girl}?
+
+  #+caption: The mysterious ``common sense'' appears here as you are able
+  #+caption: to discern the difference in how the girl's arm muscles
+  #+caption: are activated between the two images.
+  #+name: girl
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/wall-push.png]]
 
- Or suppose that you are building a program that recognizes
- chairs. How could you ``see'' the chair in the following pictures?
+  Each of these examples tells us something about what might be going
+  on in our minds as we easily solve these recognition problems.
+
+  The hidden chairs show us that we are strongly triggered by cues
+  relating to the position of human bodies, and that we can
+  determine the overall physical configuration of a human body even
+  if much of that body is occluded.
 
- #+caption: When you look at this, do you think ``chair''? I certainly do.
- #+ATTR_LaTeX: :width 10cm
- [[./images/invisible-chair.png]]
+  The picture of the girl pushing against the wall tells us that we
+  have common sense knowledge about the kinetics of our own bodies.
+  We know well how our muscles would have to work to maintain us in
+  most positions, and we can easily project this self-knowledge to
+  imagined positions triggered by images of the human body.
+
+** =EMPATH= neatly solves recognition problems
+
+  I propose a system that can express the types of recognition
+  problems above in a form amenable to computation. It is split into
+  four parts:
+
+  - Free/Guided Play :: The creature moves around and experiences the
+       world through its unique perspective. Many otherwise
+       complicated actions are easily described in the language of a
+       full suite of body-centered, rich senses. For example,
+       drinking is the feeling of water sliding down your throat, and
+       cooling your insides. It's often accompanied by bringing your
+       hand close to your face, or bringing your face close to
+       water. Sitting down is the feeling of bending your knees,
+       activating your quadriceps, then feeling a surface with your
+       bottom and relaxing your legs. These body-centered action
+       descriptions can be either learned or hard coded.
+  - Alignment :: When trying to interpret a video or image, the
+       creature takes a model of itself and aligns it with
+       whatever it sees. This can be a rather loose
+       alignment that can cross species, as when humans try
+       to align themselves with things like ponies, dogs,
+       or other humans with a different body type.
+  - Empathy :: The alignment triggers the memories of previous
+       experience. For example, the alignment itself easily
+       maps to proprioceptive data. Any sounds or obvious
+       skin contact in the video can to a lesser extent
+       trigger previous experience. The creature's previous
+       experience is chained together in short bursts to
+       coherently describe the new scene.
+  - Recognition :: With the scene now described in terms of past
+       experience, the creature can now run its
+       action-identification programs on this synthesized
+       sensory data, just as it would if it were actually
+       experiencing the scene first-hand. If previous
+       experience has been accurately retrieved, and if
+       it is analogous enough to the scene, then the
+       creature will correctly identify the action in the
+       scene.
+
+
+  For example, I think humans are able to label the cat video as
+  "drinking" because they imagine /themselves/ as the cat, and
+  imagine putting their face up against a stream of water and
+  sticking out their tongue. In that imagined world, they can feel
+  the cool water hitting their tongue, and feel the water entering
+  their body, and are able to recognize that /feeling/ as
+  drinking. So, the label of the action is not really in the pixels
+  of the image, but is found clearly in a simulation inspired by
+  those pixels. An imaginative system, having been trained on
+  drinking and non-drinking examples and learning that the most
+  important component of drinking is the feeling of water sliding
+  down one's throat, would analyze a video of a cat drinking in the
+  following manner:
+
+  1. Create a physical model of the video by putting a "fuzzy" model
+     of its own body in place of the cat. Possibly also create a
+     simulation of the stream of water.
+
+  2. Play out this simulated scene and generate imagined sensory
+     experience. This will include relevant muscle contractions, a
+     close up view of the stream from the cat's perspective, and most
+     importantly, the imagined feeling of water entering the
+     mouth. The imagined sensory experience can come from a
+     simulation of the event, but can also be pattern-matched from
+     previous, similar embodied experience.
+
+  3. The action is now easily identified as drinking by the sense of
+     taste alone. The other senses (such as the tongue moving in and
+     out) help to give plausibility to the simulated action. Note that
+     the sense of vision, while critical in creating the simulation,
+     is not critical for identifying the action from the simulation.
+
+  For the chair examples, the process is even easier:
+
+  1. Align a model of your body to the person in the image.
+
+  2. Generate proprioceptive sensory data from this alignment.
 
- #+caption: The chair in this image is quite obvious to humans, but I
- #+caption: doubt that any computer program can find it.
- #+ATTR_LaTeX: :width 10cm
- [[./images/fat-person-sitting-at-desk.jpg]]
+  3. Use the imagined proprioceptive data as a key to look up related
+     sensory experience associated with that particular proprioceptive
+     feeling.
 
- Finally, how is it that you can easily tell the difference between
- how the girls /muscles/ are working in \ref{girl}?
+  4. Retrieve the feeling of your bottom resting on a surface and
+     your leg muscles relaxed.
+
- #+caption: The mysterious ``common sense'' appears here as you are able
- #+caption: to ``see'' the difference in how the girl's arm muscles
- #+caption: are activated differently in the two images.
- #+name: girl
- #+ATTR_LaTeX: :width 10cm
- [[./images/wall-push.png]]
-
+  5. This sensory information is consistent with the =sitting?=
+     sensory predicate, so you (and the entity in the image) must be
+     sitting.
 
- These problems are difficult because the language of pixels is far
- removed from what we would consider to be an acceptable description
- of the events in these images. In order to process them, we must
- raise the images into some higher level of abstraction where their
- descriptions become more similar to how we would describe them in
- English. The question is, how can we raise
-
+  6. There must be a chair-like object since you are sitting.
 
- I think humans are able to label such video as "drinking" because
- they imagine /themselves/ as the cat, and imagine putting their face
- up against a stream of water and sticking out their tongue. In that
- imagined world, they can feel the cool water hitting their tongue,
- and feel the water entering their body, and are able to recognize
- that /feeling/ as drinking. So, the label of the action is not
- really in the pixels of the image, but is found clearly in a
- simulation inspired by those pixels. An imaginative system, having
- been trained on drinking and non-drinking examples and learning that
- the most important component of drinking is the feeling of water
- sliding down one's throat, would analyze a video of a cat drinking
- in the following manner:
+  Empathy offers yet another alternative to the age-old AI
+  representation question: ``What is a chair?'' --- A chair is the
+  feeling of sitting.
+
+  My program, =EMPATH=, uses this empathic problem solving technique
+  to interpret the actions of a simple, worm-like creature.
 
- - Create a physical model of the video by putting a "fuzzy" model
-   of its own body in place of the cat. Also, create a simulation of
-   the stream of water.
+  #+caption: The worm performs many actions during free play such as
+  #+caption: curling, wiggling, and resting.
+  #+name: worm-intro
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/wall-push.png]]
 
- - Play out this simulated scene and generate imagined sensory
-   experience. This will include relevant muscle contractions, a
-   close up view of the stream from the cat's perspective, and most
-   importantly, the imagined feeling of water entering the mouth.
+  #+caption: This sensory predicate detects when the worm is resting on the
+  #+caption: ground.
+  #+name: resting-intro
+  #+begin_listing clojure
+  #+begin_src clojure
+(defn resting?
+  "Is the worm resting on the ground?"
+  [experiences]
+  (every?
+   (fn [touch-data]
+     (< 0.9 (contact worm-segment-bottom touch-data)))
+   (:touch (peek experiences))))
+  #+end_src
+  #+end_listing
 
-   The action is now easily identified as drinking by the sense of
-   taste alone. The other senses (such as the tongue moving in and
-   out) help to give plausibility to the simulated action. Note that
-   the sense of vision, while critical in creating the simulation,
-   is not critical for identifying the action from the simulation.
+  #+caption: Body-centered actions are best expressed in a body-centered
+  #+caption: language. This code detects when the worm has curled into a
+  #+caption: full circle. Imagine how you would replicate this functionality
+  #+caption: using low-level pixel features such as HOG filters!
+  #+name: grand-circle-intro
+  #+begin_listing clojure
+  #+begin_src clojure
+(defn grand-circle?
+  "Does the worm form a majestic circle (one end touching the other)?"
+  [experiences]
+  (and (curled? experiences)
+       (let [worm-touch (:touch (peek experiences))
+             tail-touch (worm-touch 0)
+             head-touch (worm-touch 4)]
+         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
+              (< 0.55 (contact worm-segment-top-tip head-touch))))))
+  #+end_src
+  #+end_listing
 
- cat drinking, mimes, leaning, common sense
+  #+caption: Even complicated actions such as ``wiggling'' are fairly simple
+  #+caption: to describe with a rich enough language.
+  #+name: wiggling-intro
+  #+begin_listing clojure
+  #+begin_src clojure
+(defn wiggling?
+  "Is the worm wiggling?"
+  [experiences]
+  (let [analysis-interval 0x40]
+    (when (> (count experiences) analysis-interval)
+      (let [a-flex 3
+            a-ex   2
+            muscle-activity
+            (map :muscle (vector:last-n experiences analysis-interval))
+            base-activity
+            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
+        (= 2
+           (first
+            (max-indexed
+             (map #(Math/abs %)
+                  (take 20 (fft base-activity))))))))))
+  #+end_src
+  #+end_listing
 
-** =EMPATH= neatly solves recognition problems
+  #+caption: The actions of a worm in a video can be recognized by
+  #+caption: proprioceptive data and sensory predicates by filling
+  #+caption: in the missing sensory detail with previous experience.
+  #+name: worm-recognition-intro
+  #+ATTR_LaTeX: :width 10cm
+  [[./images/wall-push.png]]
 
- factorization , right language, etc
-
- a new possibility for the question ``what is a chair?'' -- it's the
- feeling of your butt on something and your knees bent, with your
- back muscles and legs relaxed.
+
+  One powerful advantage of empathic problem solving is that it
+  factors the action recognition problem into two easier problems. To
+  use empathy, you need an /aligner/, which takes the video and a
+  model of your body, and aligns the model with the video. Then, you
+  need a /recognizer/, which uses the aligned model to interpret the
+  action. The power in this method lies in the fact that you describe
+  all actions from a body-centered, rich viewpoint. This way, if you
+  teach the system what ``running'' is, and you have a good enough
+  aligner, the system will from then on be able to recognize running
+  from any point of view, even strange points of view like above or
+  underneath the runner. This is in contrast to action recognition
+  schemes that try to identify actions using a non-embodied approach
+  such as TODO:REFERENCE. If these systems learn about running as viewed
+  from the side, they will not automatically be able to recognize
+  running from any other viewpoint.
+
+  Another powerful advantage is that using the language of multiple
+  body-centered rich senses to describe body-centered actions offers a
+  massive boost in descriptive capability. Consider how difficult it
+  would be to compose a set of HOG filters to describe the action of
+  a simple worm-creature "curling" so that its head touches its tail,
+  and then behold the simplicity of describing this action in a
+  language designed for the task (listing \ref{grand-circle-intro}).
+
 
 ** =CORTEX= is a toolkit for building sensate creatures
 
@@ -151,7 +318,7 @@
 
 ** Empathy is the process of tracing through \Phi-space
 
-** Efficient action recognition =EMPATH=
+** Efficient action recognition with =EMPATH=
 
 * Contributions
   - Built =CORTEX=, a comprehensive platform for embodied AI
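The three Clojure listings added above call a handful of helpers (=contact=, =vector:last-n=, =max-indexed=, and =fft=) whose definitions live elsewhere in the thesis and are not part of this changeset. As a rough sketch of the shapes those helpers could take -- assuming that each segment's touch data is a vector of [proximity activation] pairs, that a body region such as =worm-segment-bottom= is just a collection of indices into that vector, and that =experiences= is a Clojure vector of experience maps; the real definitions in =CORTEX= may well differ -- something like the following would satisfy the way the predicates call them:

#+begin_src clojure
;; Hypothetical sketches of the helpers assumed by resting?,
;; grand-circle?, and wiggling?.  Illustrations only, not thesis code.

(defn vector:last-n
  "The last n elements of the vector v (all of v if it is shorter)."
  [v n]
  (let [c (count v)]
    (subvec v (max 0 (- c n)) c)))

(defn max-indexed
  "The [index value] pair of the largest value in the sequence s,
   so (first (max-indexed s)) is the index of the maximum."
  [s]
  (apply max-key second (map-indexed vector s)))

(defn contact
  "Average touch activation over a region, where region is assumed to
   be a collection of indices into touch-data and each element of
   touch-data is assumed to be a [proximity activation] pair.  Yields
   a value in [0,1] when the activations are in [0,1]."
  [region touch-data]
  (/ (reduce + (map (fn [i] (second (nth touch-data i))) region))
     (count region)))
#+end_src

=fft= is left out of the sketch; for the purposes of =wiggling?=, any routine whose first twenty coefficients can be compared by magnitude to find the dominant frequency bin would serve.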
--- a/thesis/cover.tex  Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cover.tex  Mon Mar 24 20:59:35 2014 -0400
@@ -45,7 +45,7 @@
 % however the specifications can change. We recommend that you verify the
 % layout of your title page with your thesis advisor and/or the MIT
 % Libraries before printing your final copy.
-\title{Solving Problems using Embodiment \& Empathy.}
+\title{Solving Problems using Embodiment \& Empathy}
 \author{Robert Louis M\raisebox{\depth}{\small \underline{\underline{c}}}Intyre}
 %\author{Robert McIntyre}
 
--- a/thesis/rlm-cortex-meng.tex  Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/rlm-cortex-meng.tex  Mon Mar 24 20:59:35 2014 -0400
@@ -25,7 +25,7 @@
 %% Page Intentionally Left Blank'', use the ``leftblank'' option, as
 %% above.
 
-\documentclass[12pt,twoside,singlespace]{mitthesis}
+\documentclass[12pt,twoside,singlespace,vi]{mitthesis}
 \usepackage[utf8]{inputenc}
 \usepackage[T1]{fontenc}
 \usepackage{fixltx2e}
--- /dev/null  Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/to-frames.pl  Mon Mar 24 20:59:35 2014 -0400
@@ -0,0 +1,15 @@
+#!/bin/perl
+
+$movie_file = shift(@ARGV);
+
+# get file name without extension
+$movie_file =~ m/^([^.]+)\.[^.]+$/;
+$movie_name = $1;
+
+@mkdir_command = ("mkdir", "-vp", $movie_name);
+@ffmpeg_command = ("ffmpeg", "-i", $movie_file, $movie_name."/%07d.png");
+
+print "@mkdir_command\n";
+system(@mkdir_command);
+print "@ffmpeg_command\n";
+system(@ffmpeg_command);
\ No newline at end of file
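Usage note for the new script: running, say, =perl thesis/to-frames.pl coiled-worm.avi= (a hypothetical movie file) from the directory containing the movie prints and then executes the two commands it builds, creating a directory named =coiled-worm= and filling it with frames =0000001.png=, =0000002.png=, and so on via ffmpeg's image-sequence output. It assumes =ffmpeg= is on the =PATH= and that the movie's path contains exactly one dot, since the extension-stripping regex forbids dots on either side of the one it matches.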