cortex: thesis/org/first-chapter.org annotate

annotate thesis/org/first-chapter.org @ 564:ecae29320b00

even more changes from winston.

author	Robert McIntyre <rlm@mit.edu>
date	Mon, 12 May 2014 13:31:55 -0400
parents	5205535237fb
children

rev	line source
rlm@401	1 #+title: =CORTEX=
rlm@401	2 #+author: Robert McIntyre
rlm@401	3 #+email: rlm@mit.edu
rlm@401	4 #+description: Using embodied AI to facilitate Artificial Imagination.
rlm@401	5 #+keywords: AI, clojure, embodiment
rlm@401	6 #+SETUPFILE: ../../aurellem/org/setup.org
rlm@401	7 #+INCLUDE: ../../aurellem/org/level-0.org
rlm@401	8 #+babel: :mkdirp yes :noweb yes :exports both
rlm@401	9 #+OPTIONS: toc:nil, num:nil
rlm@401	10
rlm@401	11 * Artificial Imagination
rlm@401	12 Imagine watching a video of someone skateboarding. When you watch
rlm@401	13 the video, you can imagine yourself skateboarding, and your
rlm@401	14 knowledge of the human body and its dynamics guides your
rlm@401	15 interpretation of the scene. For example, even if the skateboarder
rlm@401	16 is partially occluded, you can infer the positions of his arms and
rlm@401	17 body from your own knowledge of how your body would be positioned if
rlm@401	18 you were skateboarding. If the skateboarder suffers an accident, you
rlm@401	19 wince in sympathy, imagining the pain your own body would experience
rlm@401	20 if it were in the same situation. This empathy with other people
rlm@401	21 guides our understanding of whatever they are doing because it is a
rlm@401	22 powerful constraint on what is probable and possible. In order to
rlm@401	23 make use of this powerful empathy constraint, I need a system that
rlm@401	24 can generate and make sense of sensory data from the many different
rlm@401	25 senses that humans possess. The two key properties of such a system
rlm@401	26 are /embodiment/ and /imagination/.
rlm@401	27
rlm@401	28 ** What is imagination?
rlm@401	29
rlm@401	30 One kind of imagination is /sympathetic/ imagination: you imagine
rlm@401	31 yourself in the position of something/someone you are
rlm@401	32 observing. This type of imagination comes into play when you follow
rlm@401	33 along visually when watching someone perform actions, or when you
rlm@401	34 sympathetically grimace when someone hurts themselves. This type of
rlm@401	35 imagination uses the constraints you have learned about your own
rlm@401	36 body to highly constrain the possibilities in whatever you are
rlm@401	37 seeing. It uses all your senses, including your senses of touch,
rlm@401	38 proprioception, etc. Humans are flexible when it comes to "putting
rlm@401	39 themselves in another's shoes," and can sympathetically understand
rlm@401	40 not only other humans, but entities ranging from animals to cartoon
rlm@401	41 characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!
rlm@401	42
rlm@429	43 # and can infer intention from the actions of not only other humans,
rlm@429	44 # but also animals, cartoon characters, and even abstract moving dots
rlm@429	45 # on a screen!
rlm@429	46
rlm@401	47 Another kind of imagination is /predictive/ imagination: you
rlm@401	48 construct scenes in your mind that are not entirely related to
rlm@401	49 whatever you are observing, but instead are predictions of the
rlm@401	50 future or simply flights of fancy. You use this type of imagination
rlm@401	51 to plan out multi-step actions, or play out dangerous situations in
rlm@401	52 your mind so as to avoid messing them up in reality.
rlm@401	53
rlm@401	54 Of course, sympathetic and predictive imagination blend into each
rlm@401	55 other and are not completely separate concepts. One dimension along
rlm@401	56 which you can distinguish types of imagination is dependence on raw
rlm@401	57 sense data. Sympathetic imagination is highly constrained by your
rlm@401	58 senses, while predictive imagination can be more or less dependent
rlm@401	59 on your senses depending on how far ahead you imagine. Daydreaming
rlm@401	60 is an extreme form of predictive imagination that wanders through
rlm@401	61 different possibilities without concern for whether they are
rlm@401	62 related to whatever is happening in reality.
rlm@401	63
rlm@401	64 For this thesis, I will mostly focus on sympathetic imagination and
rlm@401	65 the constraint it provides for understanding sensory data.
rlm@401	66
rlm@401	67 ** What problems can imagination solve?
rlm@401	68
rlm@401	69 Consider a video of a cat drinking some water.
rlm@401	70
rlm@401	71 #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
rlm@401	72 #+ATTR_LaTeX: width=5cm
rlm@401	73 [[../images/cat-drinking.jpg]]
rlm@401	74
rlm@401	75 It is currently impossible for any computer program to reliably
rlm@401	76 label such a video as "drinking". I think humans are able to label
rlm@401	77 such video as "drinking" because they imagine /themselves/ as the
rlm@401	78 cat, and imagine putting their face up against a stream of water
rlm@401	79 and sticking out their tongue. In that imagined world, they can
rlm@401	80 feel the cool water hitting their tongue, and feel the water
rlm@401	81 entering their body, and are able to recognize that /feeling/ as
rlm@401	82 drinking. So, the label of the action is not really in the pixels
rlm@401	83 of the image, but is found clearly in a simulation inspired by
rlm@401	84 those pixels. An imaginative system, having been trained on
rlm@401	85 drinking and non-drinking examples and learning that the most
rlm@401	86 important component of drinking is the feeling of water sliding
rlm@401	87 down one's throat, would analyze a video of a cat drinking in the
rlm@401	88 following manner:
rlm@401	89
rlm@401	90 - Create a physical model of the video by putting a "fuzzy" model
rlm@401	91 of its own body in place of the cat. Also, create a simulation of
rlm@401	92 the stream of water.
rlm@401	93
rlm@401	94 - Play out this simulated scene and generate imagined sensory
rlm@401	95 experience. This will include relevant muscle contractions, a
rlm@401	96 close up view of the stream from the cat's perspective, and most
rlm@401	97 importantly, the imagined feeling of water entering the mouth.
rlm@401	98
rlm@401	99 - The action is now easily identified as drinking by the sense of
rlm@401	100 taste alone. The other senses (such as the tongue moving in and
rlm@401	101 out) help to give plausibility to the simulated action. Note that
rlm@401	102 the sense of vision, while critical in creating the simulation,
rlm@401	103 is not critical for identifying the action from the simulation.
rlm@401	104
rlm@401	105 More generally, I expect imaginative systems to be particularly
rlm@401	106 good at identifying embodied actions in videos.
rlm@401	107
rlm@401	108 * Cortex
rlm@401	109
rlm@401	110 The previous example involves liquids, the sense of taste, and
rlm@401	111 imagining oneself as a cat. For this thesis I constrain myself to
rlm@401	112 simpler, more easily digitizable senses and situations.
rlm@401	113
rlm@401	114 My system, =CORTEX=, performs imagination in two different simplified
rlm@401	115 worlds: /worm world/ and /stick-figure world/. In each of these
rlm@401	116 worlds, entities capable of imagination recognize actions by
rlm@401	117 simulating the experience from their own perspective, and then
rlm@401	118 recognizing the action from a database of examples.
rlm@401	119
rlm@401	120 In order to serve as a framework for experiments in imagination,
rlm@401	121 =CORTEX= requires simulated bodies, worlds, and senses like vision,
rlm@401	122 hearing, touch, proprioception, etc.
rlm@401	123
rlm@401	124 ** A Video Game Engine takes care of some of the groundwork
rlm@401	125
rlm@401	126 When it comes to simulation environments, the engines used to
rlm@401	127 create the worlds in video games offer top-notch physics and
rlm@401	128 graphics support. These engines also have limited support for
rlm@401	129 creating cameras and rendering 3D sound, which can be repurposed
rlm@401	130 for vision and hearing respectively. Physics collision detection
rlm@401	131 can be expanded to create a sense of touch.
rlm@401	132
rlm@401	133 jMonkeyEngine3 is one such engine for creating video games in
rlm@401	134 Java. It uses OpenGL to render to the screen and uses scene graphs
rlm@401	135 to avoid drawing things that do not appear on the screen. It has an
rlm@401	136 active community and several games in the pipeline. The engine was
rlm@401	137 not built to serve any particular game but is instead meant to be
rlm@401	138 used for any 3D game. I chose jMonkeyEngine3 because it had the
rlm@401	139 most features out of all the open projects I looked at, and because
rlm@401	140 I could then write my code in Clojure, a dialect of Lisp
rlm@401	141 that runs on the JVM.
rlm@401	142
rlm@401	143 ** =CORTEX= Extends jMonkeyEngine3 to implement rich senses
rlm@401	144
rlm@401	145 Using the game-making primitives provided by jMonkeyEngine3, I have
rlm@401	146 constructed every major human sense except for smell and
rlm@401	147 taste. =CORTEX= also provides an interface for creating creatures
rlm@401	148 in Blender, a 3D modeling environment, and then "rigging" the
rlm@401	149 creatures with senses using 3D annotations in Blender. A creature
rlm@401	150 can have any number of senses, and there can be any number of
rlm@401	151 creatures in a simulation.
rlm@401	152
rlm@401	153 The senses available in =CORTEX= are:
rlm@401	154
rlm@401	155 - [[../../cortex/html/vision.html][Vision]]
rlm@401	156 - [[../../cortex/html/hearing.html][Hearing]]
rlm@401	157 - [[../../cortex/html/touch.html][Touch]]
rlm@401	158 - [[../../cortex/html/proprioception.html][Proprioception]]
rlm@401	159 - [[../../cortex/html/movement.html][Muscle Tension]]
rlm@401	160
rlm@401	161 * A roadmap for =CORTEX= experiments
rlm@401	162
rlm@401	163 ** Worm World
rlm@401	164
rlm@401	165 Worms in =CORTEX= are segmented creatures which vary in length and
rlm@401	166 number of segments, and have the senses of vision, proprioception,
rlm@401	167 touch, and muscle tension.
rlm@401	168
rlm@401	169 #+attr_html: width=755
rlm@401	170 #+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
rlm@401	171 [[../images/finger-UV.png]]
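
A rough sketch of how such a profile could be read. It is not taken from the =CORTEX= source: it assumes the image has already been loaded as a grid of grayscale values from 0 to 255 and that each pure white pixel marks one touch sensor. The names =sensor_coordinates= and =sensor_density= are hypothetical.

```python
def sensor_coordinates(image, threshold=255):
    """Return (row, col) for every pixel at least as bright as threshold."""
    return [(r, c)
            for r, row in enumerate(image)
            for c, value in enumerate(row)
            if value >= threshold]


def sensor_density(image, top, left, height, width, threshold=255):
    """Fraction of pixels in a rectangular region that hold a sensor."""
    count = sum(1 for (r, c) in sensor_coordinates(image, threshold)
                if top <= r < top + height and left <= c < left + width)
    return count / (height * width)


# Toy 4x4 profile (assumed data): dense sensors on the left half,
# a single sensor on the right half.
profile = [
    [255, 255,   0,   0],
    [255, 255,   0, 255],
    [255, 255,   0,   0],
    [255, 255,   0,   0],
]

left_density = sensor_density(profile, top=0, left=0, height=4, width=2)
right_density = sensor_density(profile, top=0, left=2, height=4, width=2)
```

Comparing =left_density= with =right_density= gives a crude measure of which region of the skin is more touch-sensitive.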
rlm@401	172
rlm@401	173
rlm@401	174 #+begin_html
rlm@401	175 <div class="figure">
rlm@401	176 <center>
rlm@401	177 <video controls="controls" width="550">
rlm@401	178 <source src="../video/worm-touch.ogg" type="video/ogg"
rlm@401	179 preload="none" />
rlm@401	180 </video>
rlm@401	181 <br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
rlm@401	182 </center>
rlm@401	183 <p>The worm responds to touch.</p>
rlm@401	184 </div>
rlm@401	185 #+end_html
rlm@401	186
rlm@401	187 #+begin_html
rlm@401	188 <div class="figure">
rlm@401	189 <center>
rlm@401	190 <video controls="controls" width="550">
rlm@401	191 <source src="../video/test-proprioception.ogg" type="video/ogg"
rlm@401	192 preload="none" />
rlm@401	193 </video>
rlm@401	194 <br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
rlm@401	195 </center>
rlm@401	196 <p>Proprioception in a worm. The proprioceptive readout is
rlm@401	197 in the upper left corner of the screen.</p>
rlm@401	198 </div>
rlm@401	199 #+end_html
rlm@401	200
rlm@401	201 A worm is trained in various actions such as sinusoidal movement,
rlm@401	202 curling, flailing, and spinning by directly playing motor
rlm@401	203 contractions while the worm "feels" the experience. These actions
rlm@401	204 are recorded both as vectors of muscle tension, touch, and
rlm@401	205 proprioceptive data, and in higher-level forms such as
rlm@401	206 frequencies of the various contractions and a symbolic name for the
rlm@401	207 action.
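
One possible shape for such a record, as a minimal sketch. The field names, the muscle name, the sample rate, and the use of a naive discrete Fourier transform to estimate contraction frequency are all assumptions made for illustration, not the format =CORTEX= actually uses.

```python
import cmath
import math
from dataclasses import dataclass, field


def dominant_frequency(samples, sample_rate):
    """Return the strongest non-zero frequency (Hz) found by a naive DFT."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]  # drop the constant offset
    best_k, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return best_k * sample_rate / n


@dataclass
class ActionRecord:
    name: str              # symbolic name of the action
    sample_rate: float     # frames per second (assumed)
    muscle_tension: dict   # muscle name -> activations, one per frame
    touch: list = field(default_factory=list)
    proprioception: list = field(default_factory=list)

    def contraction_frequencies(self):
        """Higher-level summary: dominant frequency of each muscle."""
        return {muscle: dominant_frequency(values, self.sample_rate)
                for muscle, values in self.muscle_tension.items()}


# Hypothetical recording: one muscle contracting at 2 Hz for one second.
curling = ActionRecord(
    name="curling",
    sample_rate=60.0,
    muscle_tension={
        "segment-1-flexor": [0.5 + 0.5 * math.sin(2 * math.pi * 2 * t / 60.0)
                             for t in range(60)],
    },
)
frequencies = curling.contraction_frequencies()
```

Keeping the raw vectors alongside the derived frequencies and the symbolic name lets later stages choose whichever level of description suits the question being asked.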
rlm@401	208
rlm@401	209 Then, the worm watches a video of another worm performing one of
rlm@401	210 the actions, and must judge which action was performed. Normally
rlm@401	211 this would be an extremely difficult problem, but the worm is able
rlm@401	212 to greatly diminish the search space through sympathetic
rlm@401	213 imagination. First, it creates an imagined copy of its body which
rlm@401	214 it observes from a third-person point of view. Then, for each frame
rlm@401	215 of the video, it maneuvers its simulated body to be in registration
rlm@401	216 with the worm depicted in the video. The physical constraints
rlm@401	217 imposed by the physics simulation greatly decrease the number of
rlm@401	218 poses that have to be tried, making the search feasible. As the
rlm@401	219 imaginary worm moves, it generates imaginary muscle tension and
rlm@401	220 proprioceptive sensations. The worm determines the action not by
rlm@401	221 vision, but by matching the imagined proprioceptive data with
rlm@401	222 previous examples.
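
The final matching step can be sketched as a nearest-neighbor lookup. This is an illustration under stated assumptions rather than the method used in =CORTEX=: each example is assumed to be a fixed-length sequence of joint-angle frames, similarity is assumed to be plain Euclidean distance, and the function names and data are hypothetical.

```python
import math


def sequence_distance(a, b):
    """Euclidean distance between two equal-length sequences of frames."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same number of frames")
    return math.sqrt(sum((x - y) ** 2
                         for frame_a, frame_b in zip(a, b)
                         for x, y in zip(frame_a, frame_b)))


def identify_action(imagined, examples):
    """Return the label of the stored example closest to the imagined data."""
    return min(examples,
               key=lambda label: sequence_distance(imagined, examples[label]))


# Stored experiences (assumed data): three frames, two joint angles
# in radians per frame.
examples = {
    "curling": [(0.2, 0.2), (0.8, 0.8), (1.4, 1.4)],
    "sinusoidal movement": [(0.5, -0.5), (-0.5, 0.5), (0.5, -0.5)],
}

# Proprioceptive data generated by the imagined body during registration.
imagined = [(0.25, 0.15), (0.75, 0.85), (1.35, 1.45)]
label = identify_action(imagined, examples)
```

Nothing in this step looks at pixels: once registration has produced the imagined joint angles, identification works entirely on body-centered data.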
rlm@401	223
rlm@401	224 By using non-visual sensory data such as touch, the worms can also
rlm@401	225 answer body-related questions such as "did your head touch your
rlm@401	226 tail?" and "did worm A touch worm B?"
rlm@401	227
rlm@401	228 The proprioceptive information used for action identification is
rlm@401	229 body-centric, so only the registration step is dependent on point
rlm@401	230 of view, not the identification step. Registration is not specific
rlm@401	231 to any particular action. Thus, action identification can be
rlm@401	232 divided into a point-of-view dependent generic registration step,
rlm@401	233 and an action-specific step that is body-centered and invariant to
rlm@401	234 point of view.
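
A small worked example of why body-centered data is invariant to point of view. It is a simplification that assumes a flat 2D worm described by the positions of its joints; the functions =joint_angles= and =change_viewpoint= are hypothetical names introduced here.

```python
import math


def joint_angles(points):
    """Signed bend angle (radians) at each interior joint of a 2D chain."""
    angles = []
    for (x0, y0), (x1, y1), (x2, y2) in zip(points, points[1:], points[2:]):
        ax, ay = x1 - x0, y1 - y0   # segment entering the joint
        bx, by = x2 - x1, y2 - y1   # segment leaving the joint
        angles.append(math.atan2(ax * by - ay * bx, ax * bx + ay * by))
    return angles


def change_viewpoint(points, rotation, dx, dy):
    """Rotate the whole chain by `rotation` radians, then translate it."""
    c, s = math.cos(rotation), math.sin(rotation)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]


# The same pose seen from two different viewpoints (assumed data).
worm = [(0.0, 0.0), (1.0, 0.0), (2.0, 1.0), (2.0, 2.0)]
seen_elsewhere = change_viewpoint(worm, rotation=1.1, dx=5.0, dy=-3.0)

original = joint_angles(worm)
moved = joint_angles(seen_elsewhere)
```

The joint positions differ between the two views, but =original= and =moved= agree up to floating-point error, so only the registration step has to account for where the observer stands.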
rlm@401	235
rlm@401	236 ** Stick Figure World
rlm@401	237
rlm@401	238 This environment is similar to Worm World, except the creatures are
rlm@401	239 more complicated and the actions and questions more varied. It is
rlm@401	240 an experiment to see how far imagination can go in interpreting
rlm@401	241 actions.
