diff thesis/cortex.org @ 511:07c3feb32df3
go over changes by Dylan.
author   | Robert McIntyre <rlm@mit.edu>
date     | Sun, 30 Mar 2014 10:17:43 -0400
parents  | f639e2139ce2
children | 447c3c8405a2
line diff
--- a/thesis/cortex.org	Sun Mar 30 01:34:43 2014 -0400
+++ b/thesis/cortex.org	Sun Mar 30 10:17:43 2014 -0400
@@ -41,49 +41,46 @@
 [[./images/aurellem-gray.png]]


-* Empathy and Embodiment as problem solving strategies
+* Empathy \& Embodiment: problem-solving strategies

-  By the end of this thesis, you will have seen a novel approach to
-  interpreting video using embodiment and empathy. You will have also
-  seen one way to efficiently implement empathy for embodied
-  creatures. Finally, you will become familiar with =CORTEX=, a system
-  for designing and simulating creatures with rich senses, which you
-  may choose to use in your own research.
-
-  This is the core vision of my thesis: That one of the important ways
-  in which we understand others is by imagining ourselves in their
-  position and emphatically feeling experiences relative to our own
-  bodies. By understanding events in terms of our own previous
-  corporeal experience, we greatly constrain the possibilities of what
-  would otherwise be an unwieldy exponential search. This extra
-  constraint can be the difference between easily understanding what
-  is happening in a video and being completely lost in a sea of
-  incomprehensible color and movement.
-
-** Recognizing actions in video is extremely difficult
-
-  Consider for example the problem of determining what is happening
-  in a video of which this is one frame:
-
+** The problem: recognizing actions in video is extremely difficult
+# developing / requires useful representations
+
+  Examine the following collection of images. As you, and indeed very
+  young children, can easily determine, each one is a picture of
+  someone drinking.
+
+  # dxh: cat, cup, drinking fountain, rain, straw, coconut
  #+caption: A cat drinking some water. Identifying this action is
- #+caption: beyond the state of the art for computers.
+ #+caption: beyond the capabilities of existing computer vision systems.
  #+ATTR_LaTeX: :width 7cm
  [[./images/cat-drinking.jpg]]
+
+  Nevertheless, it is beyond the state of the art for a computer
+  vision program to describe what's happening in each of these
+  images, or what's common to them. Part of the problem is that many
+  computer vision systems focus on pixel-level details or probability
+  distributions of pixels, with little focus on [...]
+
+
+  In fact, the contents of a scene may have much less to do with pixel
+  probabilities than with recognizing various affordances: things you
+  can move, objects you can grasp, spaces that can be filled
+  (Gibson). For example, what processes might enable you to see the
+  chair in figure \ref{hidden-chair}?
+  # Or suppose that you are building a program that recognizes chairs.
+  # How could you ``see'' the chair?

-  It is currently impossible for any computer program to reliably
-  label such a video as ``drinking''. And rightly so -- it is a very
-  hard problem! What features can you describe in terms of low level
-  functions of pixels that can even begin to describe at a high level
-  what is happening here?
-
-  Or suppose that you are building a program that recognizes chairs.
-  How could you ``see'' the chair in figure \ref{hidden-chair}?
-
+  # dxh: blur chair
  #+caption: The chair in this image is quite obvious to humans, but I
  #+caption: doubt that any modern computer vision program can find it.
  #+name: hidden-chair
  #+ATTR_LaTeX: :width 10cm
  [[./images/fat-person-sitting-at-desk.jpg]]
+
+
+
+

  Finally, how is it that you can easily tell the difference between
  how the girl's /muscles/ are working in figure \ref{girl}?
@@ -95,10 +92,13 @@
  #+ATTR_LaTeX: :width 7cm
  [[./images/wall-push.png]]

+
+
+
  Each of these examples tells us something about what might be going
  on in our minds as we easily solve these recognition problems.

- The hidden chairs show us that we are strongly triggered by cues
+ The hidden chair shows us that we are strongly triggered by cues
  relating to the position of human bodies, and that we can determine
  the overall physical configuration of a human body even if much of
  that body is occluded.
@@ -109,10 +109,107 @@
  most positions, and we can easily project this self-knowledge to
  imagined positions triggered by images of the human body.

-** =EMPATH= neatly solves recognition problems
+** A step forward: the sensorimotor-centered approach
+# ** =EMPATH= recognizes what creatures are doing
+# neatly solves recognition problems
+  In this thesis, I explore the idea that our knowledge of our own
+  bodies enables us to recognize the actions of others.
+
+  First, I built a system for constructing virtual creatures with
+  physiologically plausible sensorimotor systems and detailed
+  environments. The result is =CORTEX=, which is described in section
+  \ref{sec-2}. (=CORTEX= was built to be flexible and useful to other
+  AI researchers; it is provided in full with detailed instructions
+  on the web [here].)
+
+  Next, I wrote routines which enabled a simple worm-like creature to
+  infer the actions of a second worm-like creature, using only its
+  own prior sensorimotor experiences and knowledge of the second
+  worm's joint positions. This program, =EMPATH=, is described in
+  section \ref{sec-3}, and the key results of this experiment are
+  summarized below.
+
+  #+caption: From only \emph{proprioceptive} data, =EMPATH= was able to infer
+  #+caption: the complete sensory experience and classify these four poses.
+  #+caption: The last image is a composite, depicting the intermediate stages of \emph{wriggling}.
+  #+name: worm-recognition-intro-2
+  #+ATTR_LaTeX: :width 15cm
+  [[./images/empathy-1.png]]
+
+  # =CORTEX= provides a language for describing the sensorimotor
+  # experiences of various creatures.
+
+  # Next, I developed an experiment to test the power of =CORTEX='s
+  # sensorimotor-centered language for solving recognition problems. As
+  # a proof of concept, I wrote routines which enabled a simple
+  # worm-like creature to infer the actions of a second worm-like
+  # creature, using only its own previous sensorimotor experiences and
+  # knowledge of the second worm's joints (figure
+  # \ref{worm-recognition-intro-2}). The result of this proof of
+  # concept was the program =EMPATH=, described in section
+  # \ref{sec-3}. The key results of this
+
+  # Using only first-person sensorimotor experiences and third-person
+  # proprioceptive data,
+
+*** Key results
+  - After one-shot supervised training, =EMPATH= was able to recognize
+    a wide variety of static poses and dynamic actions --- ranging from
+    curling in a circle to wriggling with a particular frequency ---
+    with 95\% accuracy.
+  - These results were completely independent of viewing angle
+    because the underlying body-centered language is fundamentally
+    viewpoint-independent; once an action is learned, it can be
+    recognized equally well from any viewing angle.
+  - =EMPATH= is surprisingly short; the sensorimotor-centered
+    language provided by =CORTEX= resulted in extremely economical
+    recognition routines --- about 0000 lines in all --- suggesting
+    that such representations are very powerful, and often
+    indispensable for the types of recognition tasks considered here.
+  - Although for expediency's sake I relied on direct knowledge of
+    joint positions in this proof of concept, it would be
+    straightforward to extend =EMPATH= so that it (more
+    realistically) infers joint positions from its visual data.

+# because the underlying language is fundamentally orientation-independent

+# recognize the actions of a worm with 95\% accuracy. The
+# recognition tasks

- I propose a system that can express the types of recognition
- problems above in a form amenable to computation. It is split into
+
+
+
+  [Talk about these results and what you find promising about them]
+
+** Roadmap
+  [I'm going to explain how =CORTEX= works, then break down how
+  =EMPATH= does its thing. Because the details reveal such-and-such
+  about the approach.]
+
+  # The success of this simple proof-of-concept offers a tantalizing
+
+
+  # explore the idea
+  # The key contribution of this thesis is the idea that body-centered
+  # representations (which express
+
+
+  # the
+  # body-centered approach --- in which I try to determine what's
+  # happening in a scene by bringing it into registration with my own
+  # bodily experiences --- are indispensable for recognizing what
+  # creatures are doing in a scene.
+
+* COMMENT
+# body-centered language
+
+  In this thesis, I'll describe =EMPATH=, which solves a certain
+  class of recognition problems
+
+  The key idea is to use self-centered (or first-person) language.
+
+  I have built a system that can express the types of recognition
+  problems in a form amenable to computation. It is split into
  four parts:

  - Free/Guided Play :: The creature moves around and experiences the
@@ -286,14 +383,14 @@
    code to create a creature, and can use a wide library of
    pre-existing blender models as a base for your own creatures.

-  - =CORTEX= implements a wide variety of senses, including touch,
+  - =CORTEX= implements a wide variety of senses: touch,
    proprioception, vision, hearing, and muscle tension. Complicated
    senses like touch and vision involve multiple sensory elements
    embedded in a 2D surface. You have complete control over the
    distribution of these sensor elements through the use of simple
    png image files. In particular, =CORTEX= implements more
    comprehensive hearing than any other creature simulation system
-   available. 
+   available.

  - =CORTEX= supports any number of creatures and any number of
    senses. Time in =CORTEX= dilates so that the simulated creatures
@@ -353,7 +450,24 @@
 \end{sidewaysfigure}
 #+END_LaTeX

-** Contributions
+** Road map
+
+  By the end of this thesis, you will have seen a novel approach to
+  interpreting video using embodiment and empathy. You will have also
+  seen one way to efficiently implement empathy for embodied
+  creatures. Finally, you will become familiar with =CORTEX=, a system
+  for designing and simulating creatures with rich senses, which you
+  may choose to use in your own research.
+
+  This is the core vision of my thesis: That one of the important ways
+  in which we understand others is by imagining ourselves in their
+  position and empathically feeling experiences relative to our own
+  bodies. By understanding events in terms of our own previous
+  corporeal experience, we greatly constrain the possibilities of what
+  would otherwise be an unwieldy exponential search. This extra
+  constraint can be the difference between easily understanding what
+  is happening in a video and being completely lost in a sea of
+  incomprehensible color and movement.

  - I built =CORTEX=, a comprehensive platform for embodied AI
    experiments. =CORTEX= supports many features lacking in other
@@ -363,18 +477,22 @@
  - I built =EMPATH=, which uses =CORTEX= to identify the actions of
    a worm-like creature using a computational model of empathy.

-* Building =CORTEX=
-
-  I intend for =CORTEX= to be used as a general-purpose library for
-  building creatures and outfitting them with senses, so that it will
-  be useful for other researchers who want to test out ideas of their
-  own. To this end, wherver I have had to make archetictural choices
-  about =CORTEX=, I have chosen to give as much freedom to the user as
-  possible, so that =CORTEX= may be used for things I have not
-  forseen.
-
-** Simulation or Reality?
-
+
+* Designing =CORTEX=
+  In this section, I outline the design decisions that went into
+  making =CORTEX=, along with some details about its
+  implementation. (A practical guide to getting started with =CORTEX=,
+  which skips over the history and implementation details presented
+  here, is provided in an appendix \ref{} at the end of this paper.)
+
+  Throughout this project, I intended for =CORTEX= to be flexible and
+  extensible enough to be useful for other researchers who want to
+  test out ideas of their own. To this end, wherever I have had to make
+  architectural choices about =CORTEX=, I have chosen to give as much
+  freedom to the user as possible, so that =CORTEX= may be used for
+  things I have not foreseen.
+
+** Building in simulation versus reality
  The most important architectural decision of all is the choice to
  use a computer-simulated environment in the first place! The world
  is a vast and rich place, and for now simulations are a very poor
@@ -436,7 +554,7 @@
  doing everything in software is far cheaper than building custom
  real-time hardware. All you need is a laptop and some patience.

-** Because of Time, simulation is perferable to reality
+** Simulated time enables rapid prototyping and complex scenes

  I envision =CORTEX= being used to support rapid prototyping and
  iteration of ideas. Even if I could put together a well-constructed
@@ -459,8 +577,8 @@
  simulations of very simple creatures in =CORTEX= generally run at
  40x on my machine!

-** What is a sense?
-
+** All sense organs are two-dimensional surfaces
+# What is a sense?
  If =CORTEX= is to support a wide variety of senses, it would help
  to have a better understanding of what a ``sense'' actually is!
  While vision, touch, and hearing all seem like they are quite
@@ -956,7 +1074,7 @@
  #+ATTR_LaTeX: :width 15cm
  [[./images/physical-hand.png]]

-** Eyes reuse standard video game components
+** Sight reuses standard video game components...

  Vision is one of the most important senses for humans, so I need to
  build a simulated sense of vision for my AI. I will do this with
@@ -1257,8 +1375,8 @@
  community and is now (in modified form) part of a system for
  capturing in-game video to a file.

-** Hearing is hard; =CORTEX= does it right
-
+** ...but hearing must be built from scratch
+# is hard; =CORTEX= does it right
  At the end of this section I will have simulated ears that work the
  same way as the simulated eyes in the last section. I will be able to
  place any number of ear-nodes in a blender file, and they will bind to
@@ -1565,7 +1683,7 @@
  jMonkeyEngine3 community and is used to record audio for demo
  videos.

-** Touch uses hundreds of hair-like elements
+** Hundreds of hair-like elements provide a sense of touch

  Touch is critical to navigation and spatial reasoning and as such I
  need a simulated version of it to give to my AI creatures.
@@ -2059,7 +2177,7 @@
  #+ATTR_LaTeX: :width 15cm
  [[./images/touch-cube.png]]

-** Proprioception is the sense that makes everything ``real''
+** Proprioception provides knowledge of your own body's position

  Close your eyes, and touch your nose with your right index finger.
  How did you do it? You could not see your hand, and neither your
@@ -2193,7 +2311,7 @@
  #+ATTR_LaTeX: :width 11cm
  [[./images/proprio.png]]

-** Muscles are both effectors and sensors
+** Muscles contain both sensors and effectors

  Surprisingly enough, terrestrial creatures only move by using
  torque applied about their joints. There's not a single straight
@@ -2440,7 +2558,8 @@
  hard control problems without worrying about physics or
  senses.

-* Empathy in a simulated worm
+* =EMPATH=: the simulated worm experiment
+# Empathy in a simulated worm

  Here I develop a computational model of empathy, using =CORTEX= as a
  base. Empathy in this context is the ability to observe another
@@ -2732,7 +2851,7 @@
  provided by an experience vector and reliably inferring the rest of
  the senses.

-** Empathy is the process of tracing though \Phi-space
+** ``Empathy'' requires retracing steps through \Phi-space

  Here is the core of a basic empathy algorithm, starting with an
  experience vector:
@@ -2888,7 +3007,7 @@
  #+end_src
  #+end_listing

-** Efficient action recognition with =EMPATH=
+** =EMPATH= recognizes actions efficiently

  To use =EMPATH= with the worm, I first need to gather a set of
  experiences from the worm that includes the actions I want to
@@ -3044,9 +3163,9 @@
  to interpretation, and disagreement between empathy and experience
  is more excusable.

-** Digression: bootstrapping touch using free exploration
-
-  In the previous section I showed how to compute actions in terms of
+** Digression: Learn touch sensor layout through haptic experimentation, instead
+# Bootstrapping touch using free exploration
+In the previous section I showed how to compute actions in terms of
 body-centered predicates which relied on average touch activation of
 pre-defined regions of the worm's skin. What if, instead of receiving
 touch pre-grouped into the six faces of each worm segment, the true
@@ -3210,13 +3329,14 @@

  In this thesis you have seen the =CORTEX= system, a complete
  environment for creating simulated creatures. You have seen how to
- implement five senses including touch, proprioception, hearing,
- vision, and muscle tension. You have seen how to create new creatues
- using blender, a 3D modeling tool. I hope that =CORTEX= will be
- useful in further research projects. To this end I have included the
- full source to =CORTEX= along with a large suite of tests and
- examples. I have also created a user guide for =CORTEX= which is
- inculded in an appendix to this thesis.
+ implement five senses: touch, proprioception, hearing, vision, and
+ muscle tension. You have seen how to create new creatures using
+ blender, a 3D modeling tool. I hope that =CORTEX= will be useful in
+ further research projects. To this end I have included the full
+ source to =CORTEX= along with a large suite of tests and examples. I
+ have also created a user guide for =CORTEX= which is included in an
+ appendix to this thesis \ref{}.
+# dxh: todo reference appendix

  You have also seen how I used =CORTEX= as a platform to attack the
  /action recognition/ problem, which is the problem of recognizing
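The =EMPATH= sections in this diff describe recognition as retracing observed proprioceptive data through \Phi-space (the store of previously recorded full-sense experiences) to recover the rest of the senses. The thesis implements this in Clojure on top of =CORTEX=; the sketch below restates only the core lookup step in Python. Every name here (=phi_space=, =infer_experience=, the two sample frames) is an invented illustration, not the thesis's actual API:

```python
# Toy sketch of the Phi-space lookup idea: given only another
# creature's joint angles, retrieve the remembered experience whose
# proprioceptive part matches best, and with it the full sensory
# context. All names and data are hypothetical illustrations.

def proprio_distance(a, b):
    """Squared Euclidean distance between two joint-angle vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def infer_experience(phi_space, observed_proprio):
    """Return the stored full-sense frame whose proprioceptive
    component is nearest to the observed joint positions."""
    return min(phi_space,
               key=lambda frame: proprio_distance(frame["proprio"],
                                                  observed_proprio))

# Two remembered experiences: joint angles plus the touch data and
# action label that accompanied them when they were first lived.
phi_space = [
    {"proprio": [3.1, 3.0], "touch": "ventral surface in contact",
     "action": "curled"},
    {"proprio": [0.1, 0.0], "touch": "resting flat",
     "action": "straight"},
]

# Observing another worm's joint angles, infer its full experience.
frame = infer_experience(phi_space, [2.9, 3.2])
print(frame["action"])  # the curled frame is the nearest match
```

The single-frame matching above is a deliberate simplification: distinguishing dynamic actions such as wriggling at a particular frequency requires comparing sequences of proprioceptive frames over time, not isolated snapshots.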