diff thesis/aux/org/first-chapter.org @ 422:6b0f77df0e53
building latex scaffolding for thesis.
author   | Robert McIntyre <rlm@mit.edu>
date     | Fri, 21 Mar 2014 01:17:41 -0400
parents  | thesis/org/first-chapter.org@7ee735a836da
children | b5d0f0adf19f

--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/aux/org/first-chapter.org	Fri Mar 21 01:17:41 2014 -0400
@@ -0,0 +1,238 @@

#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil, num:nil

* Artificial Imagination

  Imagine watching a video of someone skateboarding. When you watch
  the video, you can imagine yourself skateboarding, and your
  knowledge of the human body and its dynamics guides your
  interpretation of the scene. For example, even if the skateboarder
  is partially occluded, you can infer the positions of his arms and
  body from your own knowledge of how your body would be positioned
  if you were skateboarding. If the skateboarder suffers an accident,
  you wince in sympathy, imagining the pain your own body would
  experience if it were in the same situation. This empathy with
  other people guides our understanding of whatever they are doing
  because it is a powerful constraint on what is probable and
  possible. In order to make use of this powerful empathy constraint,
  I need a system that can generate and make sense of sensory data
  from the many different senses that humans possess. The two key
  properties of such a system are /embodiment/ and /imagination/.

** What is imagination?

   One kind of imagination is /sympathetic/ imagination: you imagine
   yourself in the position of something or someone you are
   observing. This type of imagination comes into play when you
   follow along visually while watching someone perform actions, or
   when you sympathetically grimace when someone hurts
   themselves. This type of imagination uses the constraints you have
   learned about your own body to tightly constrain the possibilities
   in whatever you are seeing. It uses all of your senses, including
   touch, proprioception, etc. Humans are flexible when it comes to
   "putting themselves in another's shoes," and can sympathetically
   understand not only other humans, but entities ranging from
   animals to cartoon characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

   Another kind of imagination is /predictive/ imagination: you
   construct scenes in your mind that are not entirely related to
   whatever you are observing, but instead are predictions of the
   future or simply flights of fancy. You use this type of
   imagination to plan out multi-step actions, or to play out
   dangerous situations in your mind so as to avoid messing them up
   in reality.

   Of course, sympathetic and predictive imagination blend into each
   other and are not completely separate concepts. One dimension
   along which you can distinguish types of imagination is dependence
   on raw sense data. Sympathetic imagination is highly constrained
   by your senses, while predictive imagination can be more or less
   dependent on your senses depending on how far ahead you
   imagine. Daydreaming is an extreme form of predictive imagination
   that wanders through different possibilities without concern for
   whether they are related to whatever is happening in reality.

   For this thesis, I will mostly focus on sympathetic imagination
   and the constraint it provides for understanding sensory data.

** What problems can imagination solve?

   Consider a video of a cat drinking some water.

   #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
   #+ATTR_LaTeX: width=5cm
   [[../images/cat-drinking.jpg]]

   It is currently impossible for any computer program to reliably
   label such a video as "drinking". I think humans are able to label
   such a video as "drinking" because they imagine /themselves/ as
   the cat, and imagine putting their face up against a stream of
   water and sticking out their tongue. In that imagined world, they
   can feel the cool water hitting their tongue, and feel the water
   entering their body, and are able to recognize that /feeling/ as
   drinking. So, the label of the action is not really in the pixels
   of the image, but is found clearly in a simulation inspired by
   those pixels. An imaginative system, having been trained on
   drinking and non-drinking examples and having learned that the
   most important component of drinking is the feeling of water
   sliding down one's throat, would analyze a video of a cat drinking
   in the following manner:

   - Create a physical model of the video by putting a "fuzzy" model
     of its own body in place of the cat. Also, create a simulation
     of the stream of water.

   - Play out this simulated scene and generate imagined sensory
     experience. This will include relevant muscle contractions, a
     close-up view of the stream from the cat's perspective, and most
     importantly, the imagined feeling of water entering the mouth.

   - The action is now easily identified as drinking by the sense of
     taste alone. The other senses (such as the tongue moving in and
     out) help to give plausibility to the simulated action. Note
     that the sense of vision, while critical in creating the
     simulation, is not critical for identifying the action from the
     simulation.

   More generally, I expect imaginative systems to be particularly
   good at identifying embodied actions in videos.
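
   A minimal sketch, in Clojure, of the final identification step:
   once the simulated scene has produced an imagined sensory
   experience, the action can be labeled from the non-visual channels
   alone. The =imagined-experience= map and the =drinking?= predicate
   below are invented for illustration and are not part of any
   existing system.

   #+begin_src clojure
     ;; A hypothetical imagined sensory experience produced by playing
     ;; out the simulated scene. The keys and numbers are invented for
     ;; illustration; vision is deliberately absent, since the action
     ;; is identified from feeling alone.
     (def imagined-experience
       {:taste {:water 0.9}
        :touch {:tongue 0.8 :face 0.3}})

     ;; Identify "drinking" from the imagined feeling: a strong taste
     ;; of water together with touch on the tongue.
     (defn drinking? [experience]
       (and (> (get-in experience [:taste :water] 0) 0.5)
            (> (get-in experience [:touch :tongue] 0) 0.5)))

     (drinking? imagined-experience) ;; => true
   #+end_src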

* Cortex

  The previous example involves liquids, the sense of taste, and
  imagining oneself as a cat. For this thesis I constrain myself to
  simpler, more easily digitizable senses and situations.

  My system, =CORTEX=, performs imagination in two different
  simplified worlds: /worm world/ and /stick-figure world/. In each
  of these worlds, entities capable of imagination recognize actions
  by simulating the experience from their own perspective, and then
  recognizing the action from a database of examples.

  In order to serve as a framework for experiments in imagination,
  =CORTEX= requires simulated bodies, worlds, and senses like vision,
  hearing, touch, proprioception, etc.

** A Video Game Engine takes care of some of the groundwork

   When it comes to simulation environments, the engines used to
   create the worlds in video games offer top-notch physics and
   graphics support. These engines also have limited support for
   creating cameras and rendering 3D sound, which can be repurposed
   for vision and hearing respectively. Physics collision detection
   can be expanded to create a sense of touch.

   jMonkeyEngine3 is one such engine for creating video games in
   Java. It uses OpenGL to render to the screen and uses a scene
   graph to avoid drawing things that do not appear on the screen. It
   has an active community and several games in the pipeline. The
   engine was not built to serve any particular game but is instead
   meant to be used for any 3D game. I chose jMonkeyEngine3 because
   it had the most features of all the open projects I looked at, and
   because I could then write my code in Clojure, a dialect of Lisp
   that runs on the JVM.
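
   As a rough illustration of what driving jMonkeyEngine3 from
   Clojure looks like, the sketch below opens a window containing a
   single blue box. It is ordinary jMonkeyEngine3 usage expressed
   through Clojure's Java interop, not code from =CORTEX= itself, and
   it assumes the jMonkeyEngine3 jars are on the classpath.

   #+begin_src clojure
     ;; Minimal jMonkeyEngine3 example via Clojure's Java interop.
     ;; Assumes the jMonkeyEngine3 (jME3) jars are on the classpath.
     (import '(com.jme3.app SimpleApplication)
             '(com.jme3.scene Geometry)
             '(com.jme3.scene.shape Box)
             '(com.jme3.material Material)
             '(com.jme3.math ColorRGBA))

     (defn blue-box-app
       "Returns a SimpleApplication that displays one blue box."
       []
       (proxy [SimpleApplication] []
         (simpleInitApp []
           (let [box  (Box. 1 1 1)
                 geom (Geometry. "blue-box" box)
                 mat  (Material. (.getAssetManager this)
                                 "Common/MatDefs/Misc/Unshaded.j3md")]
             (.setColor mat "Color" ColorRGBA/Blue)
             (.setMaterial geom mat)
             (.attachChild (.getRootNode this) geom)))))

     ;; (.start (blue-box-app)) ; opens a window showing the box
   #+end_src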

** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

   Using the game-making primitives provided by jMonkeyEngine3, I
   have constructed every major human sense except for smell and
   taste. =CORTEX= also provides an interface for creating creatures
   in Blender, a 3D modeling environment, and then "rigging" the
   creatures with senses using 3D annotations in Blender. A creature
   can have any number of senses, and there can be any number of
   creatures in a simulation.

   The senses available in =CORTEX= are:

   - [[../../cortex/html/vision.html][Vision]]
   - [[../../cortex/html/hearing.html][Hearing]]
   - [[../../cortex/html/touch.html][Touch]]
   - [[../../cortex/html/proprioception.html][Proprioception]]
   - [[../../cortex/html/movement.html][Muscle Tension]]

* A roadmap for =CORTEX= experiments

** Worm World

   Worms in =CORTEX= are segmented creatures which vary in length and
   number of segments, and have the senses of vision, proprioception,
   touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]


#+begin_html
<div class="figure">
  <center>
    <video controls="controls" width="550">
      <source src="../video/worm-touch.ogg" type="video/ogg"
              preload="none" />
    </video>
    <br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
  </center>
  <p>The worm responds to touch.</p>
</div>
#+end_html

#+begin_html
<div class="figure">
  <center>
    <video controls="controls" width="550">
      <source src="../video/test-proprioception.ogg" type="video/ogg"
              preload="none" />
    </video>
    <br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
  </center>
  <p>Proprioception in a worm. The proprioceptive readout is in the
  upper left corner of the screen.</p>
</div>
#+end_html

   A worm is trained in various actions such as sinusoidal movement,
   curling, flailing, and spinning by directly playing motor
   contractions while the worm "feels" the experience. These actions
   are recorded not only as vectors of muscle tension, touch, and
   proprioceptive data, but also in higher-level forms such as
   frequencies of the various contractions and a symbolic name for
   the action.

   Then, the worm watches a video of another worm performing one of
   the actions, and must judge which action was performed. Normally
   this would be an extremely difficult problem, but the worm is able
   to greatly diminish the search space through sympathetic
   imagination. First, it creates an imagined copy of its body which
   it observes from a third-person point of view. Then, for each
   frame of the video, it maneuvers its simulated body to be in
   registration with the worm depicted in the video. The physical
   constraints imposed by the physics simulation greatly decrease the
   number of poses that have to be tried, making the search
   feasible. As the imaginary worm moves, it generates imaginary
   muscle tension and proprioceptive sensations. The worm determines
   the action not by vision, but by matching the imagined
   proprioceptive data with previous examples.

   By using non-visual sensory data such as touch, the worms can also
   answer body-related questions such as "did your head touch your
   tail?" and "did worm A touch worm B?"

   The proprioceptive information used for action identification is
   body-centric, so only the registration step depends on point of
   view, not the identification step. Registration is not specific to
   any particular action. Thus, action identification can be divided
   into a point-of-view-dependent, generic registration step, and an
   action-specific step that is body-centered and invariant to point
   of view.
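
   A minimal, self-contained sketch of the matching step is shown
   below. The representation (one flat vector of joint angles per
   action) and the numbers are invented for illustration; in =CORTEX=
   itself the recordings involve several senses and higher-level
   summaries, as described above, but the idea is the same: return
   the label of the recorded example closest to the imagined
   proprioceptive trace.

   #+begin_src clojure
     ;; Previously recorded, labeled proprioceptive traces. Each trace
     ;; is a flat vector of joint angles over time (invented data).
     (def example-traces
       {:curling    [0.1 0.4 0.9 1.2 1.3]
        :flailing   [0.9 -0.8 1.0 -0.9 0.8]
        :sinusoidal [0.0 0.5 0.0 -0.5 0.0]})

     (defn distance
       "Euclidean distance between two equal-length traces."
       [a b]
       (Math/sqrt (reduce + (map #(* (- %1 %2) (- %1 %2)) a b))))

     (defn identify-action
       "Return the label of the recorded trace closest to the imagined
        proprioceptive trace."
       [imagined-trace examples]
       (key (apply min-key #(distance imagined-trace (val %)) examples)))

     (identify-action [0.2 0.5 0.8 1.1 1.2] example-traces)
     ;; => :curling
   #+end_src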

** Stick Figure World

   This environment is similar to Worm World, except the creatures
   are more complicated and the actions and questions more varied. It
   is an experiment to see how far imagination can go in interpreting
   actions.