diff thesis/dylan-cortex-diff.diff @ 513:4c4d45f6f30b
accept/reject changes

author   | Robert McIntyre <rlm@mit.edu>
date     | Sun, 30 Mar 2014 10:41:18 -0400
parents  |
children | 447c3c8405a2
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/dylan-cortex-diff.diff Sun Mar 30 10:41:18 2014 -0400
@@ -0,0 +1,395 @@
+diff -r f639e2139ce2 thesis/cortex.org
+--- a/thesis/cortex.org Sun Mar 30 01:34:43 2014 -0400
++++ b/thesis/cortex.org Sun Mar 30 10:07:17 2014 -0400
+@@ -41,49 +41,46 @@
+ [[./images/aurellem-gray.png]]
+
+
+-* Empathy and Embodiment as problem solving strategies
++* Empathy \& Embodiment: problem solving strategies
+
+- By the end of this thesis, you will have seen a novel approach to
+- interpreting video using embodiment and empathy. You will have also
+- seen one way to efficiently implement empathy for embodied
+- creatures. Finally, you will become familiar with =CORTEX=, a system
+- for designing and simulating creatures with rich senses, which you
+- may choose to use in your own research.
+-
+- This is the core vision of my thesis: That one of the important ways
+- in which we understand others is by imagining ourselves in their
+- position and emphatically feeling experiences relative to our own
+- bodies. By understanding events in terms of our own previous
+- corporeal experience, we greatly constrain the possibilities of what
+- would otherwise be an unwieldy exponential search. This extra
+- constraint can be the difference between easily understanding what
+- is happening in a video and being completely lost in a sea of
+- incomprehensible color and movement.
+-
+-** Recognizing actions in video is extremely difficult
+-
+- Consider for example the problem of determining what is happening
+- in a video of which this is one frame:
+-
++** The problem: recognizing actions in video is extremely difficult
++# developing / requires useful representations
++
++ Examine the following collection of images. As you, and indeed very
++ young children, can easily determine, each one is a picture of
++ someone drinking.
++
++ # dxh: cat, cup, drinking fountain, rain, straw, coconut
+ #+caption: A cat drinking some water. Identifying this action is
+- #+caption: beyond the state of the art for computers.
++ #+caption: beyond the capabilities of existing computer vision systems.
+ #+ATTR_LaTeX: :width 7cm
+ [[./images/cat-drinking.jpg]]
++
++ Nevertheless, it is beyond the state of the art for a computer
++ vision program to describe what's happening in each of these
++ images, or what's common to them. Part of the problem is that many
++ computer vision systems focus on pixel-level details or probability
++ distributions of pixels, with little focus on [...]
++
++
++ In fact, the contents of scene may have much less to do with pixel
++ probabilities than with recognizing various affordances: things you
++ can move, objects you can grasp, spaces that can be filled
++ (Gibson). For example, what processes might enable you to see the
++ chair in figure \ref{hidden-chair}?
++ # Or suppose that you are building a program that recognizes chairs.
++ # How could you ``see'' the chair ?
+
+- It is currently impossible for any computer program to reliably
+- label such a video as ``drinking''. And rightly so -- it is a very
+- hard problem! What features can you describe in terms of low level
+- functions of pixels that can even begin to describe at a high level
+- what is happening here?
+-
+- Or suppose that you are building a program that recognizes chairs.
+- How could you ``see'' the chair in figure \ref{hidden-chair}?
+-
++ # dxh: blur chair
+ #+caption: The chair in this image is quite obvious to humans, but I
+ #+caption: doubt that any modern computer vision program can find it.
+ #+name: hidden-chair
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/fat-person-sitting-at-desk.jpg]]
++
++
++
++
+
+ Finally, how is it that you can easily tell the difference between
+ how the girls /muscles/ are working in figure \ref{girl}?
+@@ -95,10 +92,13 @@
+ #+ATTR_LaTeX: :width 7cm
+ [[./images/wall-push.png]]
+
++
++
++
+ Each of these examples tells us something about what might be going
+ on in our minds as we easily solve these recognition problems.
+
+- The hidden chairs show us that we are strongly triggered by cues
++ The hidden chair shows us that we are strongly triggered by cues
+ relating to the position of human bodies, and that we can determine
+ the overall physical configuration of a human body even if much of
+ that body is occluded.
+@@ -109,10 +109,107 @@
+ most positions, and we can easily project this self-knowledge to
+ imagined positions triggered by images of the human body.
+
+-** =EMPATH= neatly solves recognition problems
++** A step forward: the sensorimotor-centered approach
++# ** =EMPATH= recognizes what creatures are doing
++# neatly solves recognition problems
++ In this thesis, I explore the idea that our knowledge of our own
++ bodies enables us to recognize the actions of others.
++
++ First, I built a system for constructing virtual creatures with
++ physiologically plausible sensorimotor systems and detailed
++ environments. The result is =CORTEX=, which is described in section
++ \ref{sec-2}. (=CORTEX= was built to be flexible and useful to other
++ AI researchers; it is provided in full with detailed instructions
++ on the web [here].)
++
++ Next, I wrote routines which enabled a simple worm-like creature to
++ infer the actions of a second worm-like creature, using only its
++ own prior sensorimotor experiences and knowledge of the second
++ worm's joint positions. This program, =EMPATH=, is described in
++ section \ref{sec-3}, and the key results of this experiment are
++ summarized below.
++
++ #+caption: From only \emph{proprioceptive} data, =EMPATH= was able to infer
++ #+caption: the complete sensory experience and classify these four poses.
++ #+caption: The last image is a composite, depicting the intermediate stages of \emph{wriggling}.
++ #+name: worm-recognition-intro-2
++ #+ATTR_LaTeX: :width 15cm
++ [[./images/empathy-1.png]]
++
++ # =CORTEX= provides a language for describing the sensorimotor
++ # experiences of various creatures.
++
++ # Next, I developed an experiment to test the power of =CORTEX='s
++ # sensorimotor-centered language for solving recognition problems. As
++ # a proof of concept, I wrote routines which enabled a simple
++ # worm-like creature to infer the actions of a second worm-like
++ # creature, using only its own previous sensorimotor experiences and
++ # knowledge of the second worm's joints (figure
++ # \ref{worm-recognition-intro-2}). The result of this proof of
++ # concept was the program =EMPATH=, described in section
++ # \ref{sec-3}. The key results of this
++
++ # Using only first-person sensorimotor experiences and third-person
++ # proprioceptive data,
++
++*** Key results
++ - After one-shot supervised training, =EMPATH= was able recognize a
++ wide variety of static poses and dynamic actions---ranging from
++ curling in a circle to wriggling with a particular frequency ---
++ with 95\% accuracy.
++ - These results were completely independent of viewing angle
++ because the underlying body-centered language fundamentally is;
++ once an action is learned, it can be recognized equally well from
++ any viewing angle.
++ - =EMPATH= is surprisingly short; the sensorimotor-centered
++ language provided by =CORTEX= resulted in extremely economical
++ recognition routines --- about 0000 lines in all --- suggesting
++ that such representations are very powerful, and often
++ indispensible for the types of recognition tasks considered here.
++ - Although for expediency's sake, I relied on direct knowledge of
++ joint positions in this proof of concept, it would be
++ straightforward to extend =EMPATH= so that it (more
++ realistically) infers joint positions from its visual data.
++
++# because the underlying language is fundamentally orientation-independent
++
++# recognize the actions of a worm with 95\% accuracy. The
++# recognition tasks
+
+- I propose a system that can express the types of recognition
+- problems above in a form amenable to computation. It is split into
++
++
++
++ [Talk about these results and what you find promising about them]
++
++** Roadmap
++ [I'm going to explain how =CORTEX= works, then break down how
++ =EMPATH= does its thing. Because the details reveal such-and-such
++ about the approach.]
++
++ # The success of this simple proof-of-concept offers a tantalizing
++
++
++ # explore the idea
++ # The key contribution of this thesis is the idea that body-centered
++ # representations (which express
++
++
++ # the
++ # body-centered approach --- in which I try to determine what's
++ # happening in a scene by bringing it into registration with my own
++ # bodily experiences --- are indispensible for recognizing what
++ # creatures are doing in a scene.
++
++* COMMENT
++# body-centered language
++
++ In this thesis, I'll describe =EMPATH=, which solves a certain
++ class of recognition problems
++
++ The key idea is to use self-centered (or first-person) language.
++
++ I have built a system that can express the types of recognition
++ problems in a form amenable to computation. It is split into
+ four parts:
+
+ - Free/Guided Play :: The creature moves around and experiences the
+@@ -286,14 +383,14 @@
+ code to create a creature, and can use a wide library of
+ pre-existing blender models as a base for your own creatures.
+
+- - =CORTEX= implements a wide variety of senses, including touch,
++ - =CORTEX= implements a wide variety of senses: touch,
+ proprioception, vision, hearing, and muscle tension. Complicated
+ senses like touch, and vision involve multiple sensory elements
+ embedded in a 2D surface. You have complete control over the
+ distribution of these sensor elements through the use of simple
+ png image files. In particular, =CORTEX= implements more
+ comprehensive hearing than any other creature simulation system
+- available.
++ available.
+
+ - =CORTEX= supports any number of creatures and any number of
+ senses. Time in =CORTEX= dialates so that the simulated creatures
+@@ -353,7 +450,24 @@
+ \end{sidewaysfigure}
+ #+END_LaTeX
+
+-** Contributions
++** Road map
++
++ By the end of this thesis, you will have seen a novel approach to
++ interpreting video using embodiment and empathy. You will have also
++ seen one way to efficiently implement empathy for embodied
++ creatures. Finally, you will become familiar with =CORTEX=, a system
++ for designing and simulating creatures with rich senses, which you
++ may choose to use in your own research.
++
++ This is the core vision of my thesis: That one of the important ways
++ in which we understand others is by imagining ourselves in their
++ position and emphatically feeling experiences relative to our own
++ bodies. By understanding events in terms of our own previous
++ corporeal experience, we greatly constrain the possibilities of what
++ would otherwise be an unwieldy exponential search. This extra
++ constraint can be the difference between easily understanding what
++ is happening in a video and being completely lost in a sea of
++ incomprehensible color and movement.
+
+ - I built =CORTEX=, a comprehensive platform for embodied AI
+ experiments. =CORTEX= supports many features lacking in other
+@@ -363,18 +477,22 @@
+ - I built =EMPATH=, which uses =CORTEX= to identify the actions of
+ a worm-like creature using a computational model of empathy.
+
+-* Building =CORTEX=
+-
+- I intend for =CORTEX= to be used as a general-purpose library for
+- building creatures and outfitting them with senses, so that it will
+- be useful for other researchers who want to test out ideas of their
+- own. To this end, wherver I have had to make archetictural choices
+- about =CORTEX=, I have chosen to give as much freedom to the user as
+- possible, so that =CORTEX= may be used for things I have not
+- forseen.
+-
+-** Simulation or Reality?
+-
++
++* Designing =CORTEX=
++ In this section, I outline the design decisions that went into
++ making =CORTEX=, along with some details about its
++ implementation. (A practical guide to getting started with =CORTEX=,
++ which skips over the history and implementation details presented
++ here, is provided in an appendix \ref{} at the end of this paper.
++
++ Throughout this project, I intended for =CORTEX= to be flexible and
++ extensible enough to be useful for other researchers who want to
++ test out ideas of their own. To this end, wherver I have had to make
++ archetictural choices about =CORTEX=, I have chosen to give as much
++ freedom to the user as possible, so that =CORTEX= may be used for
++ things I have not forseen.
++
++** Building in simulation versus reality
+ The most important archetictural decision of all is the choice to
+ use a computer-simulated environemnt in the first place! The world
+ is a vast and rich place, and for now simulations are a very poor
+@@ -436,7 +554,7 @@
+ doing everything in software is far cheaper than building custom
+ real-time hardware. All you need is a laptop and some patience.
+
+-** Because of Time, simulation is perferable to reality
++** Simulated time enables rapid prototyping and complex scenes
+
+ I envision =CORTEX= being used to support rapid prototyping and
+ iteration of ideas. Even if I could put together a well constructed
+@@ -459,8 +577,8 @@
+ simulations of very simple creatures in =CORTEX= generally run at
+ 40x on my machine!
+
+-** What is a sense?
+-
++** All sense organs are two-dimensional surfaces
++# What is a sense?
+ If =CORTEX= is to support a wide variety of senses, it would help
+ to have a better understanding of what a ``sense'' actually is!
+ While vision, touch, and hearing all seem like they are quite
+@@ -956,7 +1074,7 @@
+ #+ATTR_LaTeX: :width 15cm
+ [[./images/physical-hand.png]]
+
+-** Eyes reuse standard video game components
++** Sight reuses standard video game components...
+
+ Vision is one of the most important senses for humans, so I need to
+ build a simulated sense of vision for my AI. I will do this with
+@@ -1257,8 +1375,8 @@
+ community and is now (in modified form) part of a system for
+ capturing in-game video to a file.
+
+-** Hearing is hard; =CORTEX= does it right
+-
++** ...but hearing must be built from scratch
++# is hard; =CORTEX= does it right
+ At the end of this section I will have simulated ears that work the
+ same way as the simulated eyes in the last section. I will be able to
+ place any number of ear-nodes in a blender file, and they will bind to
+@@ -1565,7 +1683,7 @@
+ jMonkeyEngine3 community and is used to record audio for demo
+ videos.
+
+-** Touch uses hundreds of hair-like elements
++** Hundreds of hair-like elements provide a sense of touch
+
+ Touch is critical to navigation and spatial reasoning and as such I
+ need a simulated version of it to give to my AI creatures.
+@@ -2059,7 +2177,7 @@
+ #+ATTR_LaTeX: :width 15cm
+ [[./images/touch-cube.png]]
+
+-** Proprioception is the sense that makes everything ``real''
++** Proprioception provides knowledge of your own body's position
+
+ Close your eyes, and touch your nose with your right index finger.
+ How did you do it? You could not see your hand, and neither your
+@@ -2193,7 +2311,7 @@
+ #+ATTR_LaTeX: :width 11cm
+ [[./images/proprio.png]]
+
+-** Muscles are both effectors and sensors
++** Muscles contain both sensors and effectors
+
+ Surprisingly enough, terrestrial creatures only move by using
+ torque applied about their joints. There's not a single straight
+@@ -2440,7 +2558,8 @@
+ hard control problems without worrying about physics or
+ senses.
+
+-* Empathy in a simulated worm
++* =EMPATH=: the simulated worm experiment
++# Empathy in a simulated worm
+
+ Here I develop a computational model of empathy, using =CORTEX= as a
+ base. Empathy in this context is the ability to observe another
+@@ -2732,7 +2851,7 @@
+ provided by an experience vector and reliably infering the rest of
+ the senses.
+
+-** Empathy is the process of tracing though \Phi-space
++** ``Empathy'' requires retracing steps though \Phi-space
+
+ Here is the core of a basic empathy algorithm, starting with an
+ experience vector:
+@@ -2888,7 +3007,7 @@
+ #+end_src
+ #+end_listing
+
+-** Efficient action recognition with =EMPATH=
++** =EMPATH= recognizes actions efficiently
+
+ To use =EMPATH= with the worm, I first need to gather a set of
+ experiences from the worm that includes the actions I want to
+@@ -3044,9 +3163,9 @@
+ to interpretation, and dissaggrement between empathy and experience
+ is more excusable.
+
+-** Digression: bootstrapping touch using free exploration
+-
+- In the previous section I showed how to compute actions in terms of
++** Digression: Learn touch sensor layout through haptic experimentation, instead
++# Boostraping touch using free exploration
++In the previous section I showed how to compute actions in terms of
+ body-centered predicates which relied averate touch activation of
+ pre-defined regions of the worm's skin. What if, instead of recieving
+ touch pre-grouped into the six faces of each worm segment, the true