diff thesis/org/first-chapter.org @ 430:5205535237fb
fix skew in self-organizing-touch, work on thesis.
| author   | Robert McIntyre <rlm@mit.edu>                 |
| date     | Sat, 22 Mar 2014 16:10:34 -0400               |
| parents  | thesis/aux/org/first-chapter.org@b5d0f0adf19f |
| children |                                               |
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/org/first-chapter.org	Sat Mar 22 16:10:34 2014 -0400
@@ -0,0 +1,241 @@

#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil num:nil

* Artificial Imagination

  Imagine watching a video of someone skateboarding. When you watch
  the video, you can imagine yourself skateboarding, and your
  knowledge of the human body and its dynamics guides your
  interpretation of the scene. For example, even if the skateboarder
  is partially occluded, you can infer the positions of his arms and
  body from your own knowledge of how your body would be positioned
  if you were skateboarding. If the skateboarder suffers an accident,
  you wince in sympathy, imagining the pain your own body would
  experience if it were in the same situation. This empathy with
  other people guides our understanding of whatever they are doing
  because it is a powerful constraint on what is probable and
  possible. In order to make use of this powerful empathy constraint,
  I need a system that can generate and make sense of sensory data
  from the many different senses that humans possess. The two key
  properties of such a system are /embodiment/ and /imagination/.

** What is imagination?

   One kind of imagination is /sympathetic/ imagination: you imagine
   yourself in the position of something/someone you are
   observing. This type of imagination comes into play when you
   follow along visually while watching someone perform actions, or
   when you sympathetically grimace when someone hurts
   themselves. This type of imagination uses the constraints you have
   learned about your own body to tightly constrain the possibilities
   in whatever you are seeing. It uses all of your senses, including
   your senses of touch, proprioception, etc. Humans are flexible
   when it comes to "putting themselves in another's shoes," and can
   sympathetically understand not only other humans, but entities
   ranging from animals to cartoon characters to
   [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

# and can infer intention from the actions of not only other humans,
# but also animals, cartoon characters, and even abstract moving dots
# on a screen!

   Another kind of imagination is /predictive/ imagination: you
   construct scenes in your mind that are not entirely related to
   whatever you are observing, but instead are predictions of the
   future or simply flights of fancy. You use this type of
   imagination to plan out multi-step actions, or to play out
   dangerous situations in your mind so as to avoid messing them up
   in reality.

   Of course, sympathetic and predictive imagination blend into each
   other and are not completely separate concepts. One dimension
   along which you can distinguish types of imagination is dependence
   on raw sense data.
   Sympathetic imagination is highly constrained by your senses,
   while predictive imagination can be more or less dependent on your
   senses depending on how far ahead you imagine. Daydreaming is an
   extreme form of predictive imagination that wanders through
   different possibilities without concern for whether they are
   related to whatever is happening in reality.

   For this thesis, I will mostly focus on sympathetic imagination
   and the constraint it provides for understanding sensory data.

** What problems can imagination solve?

   Consider a video of a cat drinking some water.

   #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
   #+ATTR_LaTeX: width=5cm
   [[../images/cat-drinking.jpg]]

   It is currently impossible for any computer program to reliably
   label such a video as "drinking". I think humans are able to label
   such a video as "drinking" because they imagine /themselves/ as
   the cat, and imagine putting their face up against a stream of
   water and sticking out their tongue. In that imagined world, they
   can feel the cool water hitting their tongue, and feel the water
   entering their body, and are able to recognize that /feeling/ as
   drinking. So, the label of the action is not really in the pixels
   of the image, but is found clearly in a simulation inspired by
   those pixels. An imaginative system, having been trained on
   drinking and non-drinking examples and having learned that the
   most important component of drinking is the feeling of water
   sliding down one's throat, would analyze a video of a cat drinking
   in the following manner:

   - Create a physical model of the video by putting a "fuzzy" model
     of its own body in place of the cat. Also, create a simulation
     of the stream of water.

   - Play out this simulated scene and generate imagined sensory
     experience. This will include relevant muscle contractions, a
     close-up view of the stream from the cat's perspective, and most
     importantly, the imagined feeling of water entering the
     mouth.

   - The action is now easily identified as drinking by the sense of
     taste alone. The other senses (such as the tongue moving in and
     out) help to give plausibility to the simulated action. Note
     that the sense of vision, while critical in creating the
     simulation, is not critical for identifying the action from the
     simulation.

   More generally, I expect imaginative systems to be particularly
   good at identifying embodied actions in videos.

* Cortex

  The previous example involves liquids, the sense of taste, and
  imagining oneself as a cat. For this thesis I constrain myself to
  simpler, more easily digitizable senses and situations.

  My system, =CORTEX=, performs imagination in two different
  simplified worlds: /worm world/ and /stick-figure world/. In each
  of these worlds, entities capable of imagination recognize actions
  by simulating the experience from their own perspective, and then
  recognizing the action from a database of examples.
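
  In rough pseudocode, this recognize-by-simulation loop might look
  like the following Clojure sketch. The helpers used here
  (=fit-body-to-video=, =imagine-senses=, =similarity=) are
  hypothetical placeholders standing in for the registration,
  simulation, and comparison machinery described later in this
  chapter; they are not part of =CORTEX='s actual API.

#+begin_src clojure
;; A minimal sketch, assuming three hypothetical helpers:
;; `fit-body-to-video` - register a model of one's own body against
;;                       each frame of the observed video,
;; `imagine-senses`    - play the fitted poses through the physics
;;                       simulation and return imagined, non-visual
;;                       sense data,
;; `similarity`        - score imagined sense data against one
;;                       stored example.
(defn recognize-action
  "Guess which labeled action best explains `video`, given `examples`,
   a map from action name to previously recorded sense data."
  [video examples]
  (let [imagined (imagine-senses (fit-body-to-video video))]
    (key (apply max-key #(similarity imagined (val %)) examples))))
#+end_src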

  In order to serve as a framework for experiments in imagination,
  =CORTEX= requires simulated bodies, worlds, and senses like vision,
  hearing, touch, proprioception, etc.

** A Video Game Engine takes care of some of the groundwork

   When it comes to simulation environments, the engines used to
   create the worlds in video games offer top-notch physics and
   graphics support. These engines also have limited support for
   creating cameras and rendering 3D sound, which can be repurposed
   for vision and hearing respectively. Physics collision detection
   can be expanded to create a sense of touch.

   jMonkeyEngine3 is one such engine for creating video games in
   Java. It uses OpenGL to render to the screen and uses scene graphs
   to avoid drawing things that do not appear on the screen. It has
   an active community and several games in the pipeline. The engine
   was not built to serve any particular game but is instead meant to
   be used for any 3D game. I chose jMonkeyEngine3 because it had the
   most features out of all the open source projects I looked at, and
   because I could then write my code in Clojure, a dialect of Lisp
   that runs on the JVM.

** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

   Using the game-making primitives provided by jMonkeyEngine3, I
   have constructed every major human sense except for smell and
   taste. =CORTEX= also provides an interface for creating creatures
   in Blender, a 3D modeling environment, and then "rigging" the
   creatures with senses using 3D annotations in Blender. A creature
   can have any number of senses, and there can be any number of
   creatures in a simulation.

   The senses available in =CORTEX= are:

   - [[../../cortex/html/vision.html][Vision]]
   - [[../../cortex/html/hearing.html][Hearing]]
   - [[../../cortex/html/touch.html][Touch]]
   - [[../../cortex/html/proprioception.html][Proprioception]]
   - [[../../cortex/html/movement.html][Muscle Tension]]
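
   To give a concrete feel for how a creature and its senses fit
   together, here is a hedged Clojure sketch of rigging a
   Blender-modeled creature with senses. The names used here
   (=load-blender-model=, =attach-vision!=, and so on) are
   illustrative placeholders rather than =CORTEX='s actual API, which
   is documented in the pages linked above.

#+begin_src clojure
;; Illustrative only -- not CORTEX's actual API. Assume
;; `load-blender-model` reads a creature, along with its 3D sense
;; annotations, from a Blender file, and that each `attach-...!`
;; function wires one sense into the simulation and returns a
;; function that samples that sense when called.
(def worm (load-blender-model "creatures/worm.blend"))

(def senses
  {:vision         (attach-vision! worm)
   :hearing        (attach-hearing! worm)
   :touch          (attach-touch! worm)
   :proprioception (attach-proprioception! worm)
   :muscle-tension (attach-muscles! worm)})

(defn sense-snapshot
  "Sample every sense once, yielding the creature's total sensory
   experience for the current simulation step."
  []
  (into {} (for [[sense-name read-sense] senses]
             [sense-name (read-sense)])))
#+end_src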

* A roadmap for =CORTEX= experiments

** Worm World

   Worms in =CORTEX= are segmented creatures which vary in length and
   number of segments, and have the senses of vision, proprioception,
   touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]


#+begin_html
<div class="figure">
  <center>
    <video controls="controls" width="550">
      <source src="../video/worm-touch.ogg" type="video/ogg"
              preload="none" />
    </video>
    <br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
  </center>
  <p>The worm responds to touch.</p>
</div>
#+end_html

#+begin_html
<div class="figure">
  <center>
    <video controls="controls" width="550">
      <source src="../video/test-proprioception.ogg" type="video/ogg"
              preload="none" />
    </video>
    <br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
  </center>
  <p>Proprioception in a worm. The proprioceptive readout is in the
  upper left corner of the screen.</p>
</div>
#+end_html

   A worm is trained in various actions such as sinusoidal movement,
   curling, flailing, and spinning by directly playing out motor
   contractions while the worm "feels" the experience. These actions
   are recorded both as vectors of muscle tension, touch, and
   proprioceptive data, and in higher-level forms such as the
   frequencies of the various contractions and a symbolic name for
   the action.

   Then, the worm watches a video of another worm performing one of
   the actions, and must judge which action was performed. Normally
   this would be an extremely difficult problem, but the worm is able
   to greatly diminish the search space through sympathetic
   imagination. First, it creates an imagined copy of its body, which
   it observes from a third-person point of view. Then, for each
   frame of the video, it maneuvers its simulated body to be in
   registration with the worm depicted in the video. The physical
   constraints imposed by the physics simulation greatly decrease the
   number of poses that have to be tried, making the search
   feasible. As the imaginary worm moves, it generates imaginary
   muscle tension and proprioceptive sensations. The worm determines
   the action not by vision, but by matching the imagined
   proprioceptive data against previous examples.

   By using non-visual sensory data such as touch, the worms can also
   answer body-related questions such as "did your head touch your
   tail?" and "did worm A touch worm B?"

   The proprioceptive information used for action identification is
   body-centric, so only the registration step is dependent on point
   of view, not the identification step. Registration is not specific
   to any particular action. Thus, action identification can be
   divided into a point-of-view-dependent, generic registration step
   and an action-specific step that is body-centered and invariant to
   point of view.

** Stick Figure World

   This environment is similar to Worm World, except the creatures
   are more complicated and the actions and questions more varied. It
   is an experiment to see how far imagination can go in interpreting
   actions.
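
   Returning to Worm World, the split between generic registration
   and body-centered identification described above might be
   sketched in Clojure as follows. The helpers =register-pose=,
   =imagined-proprioception=, and =sequence-distance= are
   hypothetical placeholders (for example, =sequence-distance= could
   be dynamic time warping over joint-angle vectors); this is not the
   actual =CORTEX= implementation.

#+begin_src clojure
;; A sketch of the two-step identification procedure, assuming
;; hypothetical helpers:
;; `register-pose`           - search for a physically valid pose of
;;                             the imagined worm that lines up with
;;                             one video frame (point-of-view
;;                             dependent, action-generic),
;; `imagined-proprioception` - read joint angles off that pose,
;; `sequence-distance`       - compare two proprioceptive sequences
;;                             (body-centered, action-specific).
(defn identify-worm-action
  "Return the name of the recorded action whose proprioceptive
   sequence best matches the sequence imagined while registering
   against `video`. `examples` maps action names to recorded
   proprioceptive sequences."
  [video examples]
  (let [imagined (map (comp imagined-proprioception register-pose)
                      video)]
    (key (apply min-key #(sequence-distance imagined (val %))
                examples))))
#+end_src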