# thesis/org/first-chapter.org, revision 524:8e52a2802821
# Robert McIntyre <rlm@mit.edu>, Sun, 20 Apr 2014 21:46:46 -0400: incorporating winston's changes.
#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil, num:nil

* Artificial Imagination
Imagine watching a video of someone skateboarding. When you watch
the video, you can imagine yourself skateboarding, and your
knowledge of the human body and its dynamics guides your
interpretation of the scene. For example, even if the skateboarder
is partially occluded, you can infer the positions of his arms and
body from your own knowledge of how your body would be positioned if
you were skateboarding. If the skateboarder suffers an accident, you
wince in sympathy, imagining the pain your own body would experience
if it were in the same situation. This empathy with other people
guides our understanding of whatever they are doing because it is a
powerful constraint on what is probable and possible. To make use
of this empathy constraint, I need a system that can generate and
make sense of sensory data from the many different senses that
humans possess. The two key properties of such a system are
/embodiment/ and /imagination/.

** What is imagination?

One kind of imagination is /sympathetic/ imagination: you imagine
yourself in the position of something or someone you are
observing. This type of imagination comes into play when you
visually follow along as someone performs actions, or when you
sympathetically grimace when someone hurts themselves. It uses the
constraints you have learned about your own body to narrow down the
possibilities in whatever you are seeing, and it draws on all your
senses, including touch and proprioception. Humans are flexible
when it comes to "putting themselves in another's shoes," and can
sympathetically understand not only other humans, but entities
ranging from animals to cartoon characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a
screen!

# and can infer intention from the actions of not only other humans,
# but also animals, cartoon characters, and even abstract moving dots
# on a screen!

Another kind of imagination is /predictive/ imagination: you
construct scenes in your mind that are not directly tied to
whatever you are observing, but are instead predictions of the
future or simply flights of fancy. You use this type of imagination
to plan multi-step actions, or to rehearse dangerous situations in
your mind so that you do not botch them in reality.

Of course, sympathetic and predictive imagination blend into each
other and are not completely separate concepts. One dimension along
which you can distinguish types of imagination is dependence on raw
sense data. Sympathetic imagination is highly constrained by your
senses, while predictive imagination depends on them more or less
strongly according to how far ahead you imagine. Daydreaming is an
extreme form of predictive imagination that wanders through
different possibilities without concern for whether they are
related to whatever is happening in reality.

For this thesis, I will mostly focus on sympathetic imagination and
the constraint it provides for understanding sensory data.

** What problems can imagination solve?

Consider a video of a cat drinking some water.

#+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
#+ATTR_LaTeX: width=5cm
[[../images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". I think humans are able to label
such a video as "drinking" because they imagine /themselves/ as the
cat, and imagine putting their face up against a stream of water
and sticking out their tongue. In that imagined world, they can
feel the cool water hitting their tongue, and feel the water
entering their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is clearly present in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and having learned that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

- Create a physical model of the video by putting a "fuzzy" model
  of its own body in place of the cat. Also, create a simulation of
  the stream of water.

- Play out this simulated scene and generate imagined sensory
  experience. This will include relevant muscle contractions, a
  close-up view of the stream from the cat's perspective, and most
  importantly, the imagined feeling of water entering the mouth.

- The action is now easily identified as drinking by the sense of
  taste alone. The other senses (such as the tongue moving in and
  out) help to give plausibility to the simulated action. Note that
  the sense of vision, while critical in creating the simulation,
  is not critical for identifying the action from the simulation.
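
The three steps above can be caricatured in a few lines of Python. This is a toy sketch, not =CORTEX= code: the per-frame sensation format, the sense names, and the example vectors are all hypothetical, the simulation step is assumed to have already produced the imagined frames, and identification is plain nearest-neighbor matching on a single sense channel (taste), mirroring the point that vision is not consulted once the simulation exists.

```python
import math
from typing import Dict, List, Sequence, Tuple

# One imagined frame maps a sense name to a feature vector (hypothetical format).
Frame = Dict[str, List[float]]


def channel_summary(frames: Sequence[Frame], sense: str) -> List[float]:
    """Average one sense channel over all imagined frames."""
    vectors = [frame[sense] for frame in frames]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def identify(frames: Sequence[Frame],
             examples: Sequence[Tuple[str, List[float]]],
             sense: str = "taste") -> str:
    """Return the label of the stored example nearest on one sense channel.

    Vision is deliberately ignored: it was needed to build the simulation,
    not to read the label off the simulated feelings.
    """
    summary = channel_summary(frames, sense)
    label, _ = min(examples, key=lambda example: math.dist(summary, example[1]))
    return label


# Hypothetical taste features: [wetness, coolness].
examples = [("drinking", [0.9, 0.8]), ("not-drinking", [0.0, 0.1])]
imagined = [{"taste": [0.8, 0.7], "vision": [0.3]},
            {"taste": [1.0, 0.9], "vision": [0.6]}]
print(identify(imagined, examples))  # drinking
```

Swapping =sense= for another channel name is all it would take to test whether a different sense carries the label.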

More generally, I expect imaginative systems to be particularly
good at identifying embodied actions in videos.

* Cortex

The previous example involves liquids, the sense of taste, and
imagining oneself as a cat. For this thesis I constrain myself to
simpler, more easily digitizable senses and situations.

My system, =CORTEX=, performs imagination in two different simplified
worlds: /worm world/ and /stick-figure world/. In each of these
worlds, entities capable of imagination recognize actions by
simulating the experience from their own perspective, and then
matching the simulated experience against a database of examples.

In order to serve as a framework for experiments in imagination,
=CORTEX= requires simulated bodies, worlds, and senses like vision,
hearing, touch, proprioception, etc.

** A Video Game Engine takes care of some of the groundwork

When it comes to simulation environments, the engines used to
create the worlds in video games offer top-notch physics and
graphics support. These engines also have limited support for
creating cameras and rendering 3D sound, which can be repurposed
for vision and hearing respectively. Physics collision detection
can be expanded to create a sense of touch.
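
As a concrete illustration of turning collision detection into touch, suppose each touch sensor is a short "feeler" ray cast outward along the skin's surface normal, reporting how far the ray travels before it hits something. The sketch below assumes a single spherical obstacle, unit-length normals, and a linear falloff of sensation with distance; these are illustrative choices, and the geometry is written out directly rather than through jMonkeyEngine3's own collision API.

```python
import math
from typing import Optional, Sequence

Vec = Sequence[float]


def ray_sphere_hit(origin: Vec, direction: Vec, center: Vec,
                   radius: float) -> Optional[float]:
    """Distance along a unit-length ray to the nearest sphere hit, or None.

    Solves |origin + t * direction - center|^2 = radius^2 for t >= 0.
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    if c <= 0:
        return 0.0  # origin on or inside the sphere: already in contact
    disc = b * b - c
    if disc < 0 or b > 0:
        return None  # ray misses, or the sphere lies behind the origin
    return -b - math.sqrt(disc)  # nearer of the two intersections


def feeler_activation(origin: Vec, normal: Vec, length: float,
                      center: Vec, radius: float) -> float:
    """Touch strength in [0, 1]: 1 at the skin, falling to 0 at the feeler tip."""
    t = ray_sphere_hit(origin, normal, center, radius)
    if t is None or t > length:
        return 0.0
    return 1.0 - t / length


# A feeler of length 1 pointing along +x, with a ball whose surface is 0.5 away.
print(feeler_activation((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), 1.0,
                        (1.5, 0.0, 0.0), 1.0))  # 0.5
```

A skin covered with many such feelers yields a vector of activations per frame, which is the kind of data the touch examples later in this chapter consume.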

jMonkeyEngine3 is one such engine for creating video games in
Java. It uses OpenGL to render to the screen and uses scene graphs
to avoid drawing things that do not appear on the screen. It has an
active community and several games in the pipeline. The engine was
not built to serve any particular game but is instead meant to be
used for any 3D game. I chose jMonkeyEngine3 because it had the
most features out of all the open projects I looked at, and because
I could then write my code in Clojure, a dialect of Lisp that runs
on the JVM.
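
The scene-graph technique mentioned above pays off because bounding volumes nest: if a node's bounds are entirely out of view, its whole subtree can be skipped without visiting it. A minimal sketch, assuming bounding spheres and, for brevity, a view region that is itself a sphere (real engines typically test against the planes of the camera frustum instead); the node names are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Node:
    name: str
    center: Vec3              # center of a bounding sphere enclosing the subtree
    radius: float
    children: List["Node"] = field(default_factory=list)


def visible_leaves(node: Node, view_center: Vec3, view_radius: float) -> List[str]:
    """Collect names of leaves whose bounds intersect the spherical view region.

    If a node's bounding sphere misses the view, none of its descendants can
    be visible, so the entire subtree is culled in one test.
    """
    if math.dist(node.center, view_center) > node.radius + view_radius:
        return []  # culled: nothing below this node is drawn
    if not node.children:
        return [node.name]
    names: List[str] = []
    for child in node.children:
        names.extend(visible_leaves(child, view_center, view_radius))
    return names


scene = Node("root", (0.0, 0.0, 0.0), 100.0, [
    Node("near-group", (5.0, 0.0, 0.0), 3.0,
         [Node("cup", (4.0, 0.0, 0.0), 1.0), Node("cat", (6.0, 0.0, 0.0), 1.0)]),
    Node("far-group", (80.0, 0.0, 0.0), 5.0,
         [Node("tree", (80.0, 0.0, 0.0), 2.0)]),
])
print(visible_leaves(scene, (0.0, 0.0, 0.0), 10.0))  # ['cup', 'cat']
```

Here the far group is rejected with one distance test, so its leaves are never examined.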

** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

Using the game-making primitives provided by jMonkeyEngine3, I have
constructed every major human sense except for smell and
taste. =CORTEX= also provides an interface for creating creatures
in Blender, a 3D modeling environment, and then "rigging" the
creatures with senses using 3D annotations in Blender. A creature
can have any number of senses, and there can be any number of
creatures in a simulation.

The senses available in =CORTEX= are:

- [[../../cortex/html/vision.html][Vision]]
- [[../../cortex/html/hearing.html][Hearing]]
- [[../../cortex/html/touch.html][Touch]]
- [[../../cortex/html/proprioception.html][Proprioception]]
- [[../../cortex/html/movement.html][Muscle Tension]]
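
Because a creature can carry any number of senses and a simulation can hold any number of creatures, it helps to treat every sense uniformly: a function that reads the world once per step and returns that sense's data. The sketch below illustrates only that idea. The world dictionary, the sense functions, and the creature names are hypothetical and do not reflect the actual Clojure API of =CORTEX=.

```python
from typing import Any, Callable, Dict, List

World = Dict[str, Any]                 # hypothetical world state
Sense = Callable[[World, str], Any]    # reads the world on behalf of one creature


def step_senses(world: World,
                creatures: Dict[str, Dict[str, Sense]]) -> Dict[str, Dict[str, Any]]:
    """Poll every sense of every creature once, for one simulation step."""
    return {name: {sense_name: sense(world, name)
                   for sense_name, sense in senses.items()}
            for name, senses in creatures.items()}


def proprioception(world: World, creature: str) -> List[float]:
    """Hypothetical sense: the creature's current joint angles."""
    return world["joint-angles"][creature]


def touch(world: World, creature: str) -> List[float]:
    """Hypothetical sense: per-sensor contact strengths (empty if none recorded)."""
    return world["contacts"].get(creature, [])


world = {"joint-angles": {"worm-a": [0.1, -0.2], "worm-b": [0.0, 0.3]},
         "contacts": {"worm-a": [0.0, 1.0]}}
creatures = {"worm-a": {"proprioception": proprioception, "touch": touch},
             "worm-b": {"proprioception": proprioception}}
snapshot = step_senses(world, creatures)
print(snapshot["worm-a"]["touch"])  # [0.0, 1.0]
```

Adding a sense to a creature is then just another entry in its dictionary, with no change to the stepping loop.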

* A roadmap for =CORTEX= experiments

** Worm World

Worms in =CORTEX= are segmented creatures which vary in length and
number of segments, and have the senses of vision, proprioception,
touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]
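
The caption above says sensitivity is encoded as the density of white pixels. A minimal sketch of reading such a profile, assuming the image has already been loaded as rows of grayscale values from 0 to 255 and that any pixel at or above a threshold counts as one sensor; the threshold, the pixel-center (u, v) convention, and the helper names are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def sensor_coordinates(image: Sequence[Sequence[int]],
                       threshold: int = 128) -> List[Tuple[float, float]]:
    """Return (u, v) texture coordinates of every 'white' pixel's center.

    u runs left to right; v runs bottom to top, since image row 0 is the
    top edge of the texture.
    """
    height, width = len(image), len(image[0])
    return [((col + 0.5) / width, 1.0 - (row + 0.5) / height)
            for row, pixels in enumerate(image)
            for col, value in enumerate(pixels)
            if value >= threshold]


def region_density(image: Sequence[Sequence[int]], rows: range, cols: range,
                   threshold: int = 128) -> float:
    """Fraction of sensor pixels in a rectangle: a proxy for local sensitivity."""
    white = sum(1 for r in rows for c in cols if image[r][c] >= threshold)
    return white / (len(rows) * len(cols))


profile = [
    [255, 255, 255, 0],
    [255, 255, 0, 0],
    [0, 0, 0, 255],
    [0, 0, 0, 0],
]
print(len(sensor_coordinates(profile)))                    # 6
print(region_density(profile, range(0, 2), range(0, 2)))   # 1.0
print(region_density(profile, range(2, 4), range(2, 4)))   # 0.25
```

In this toy profile the top-left corner would be four times as sensitive as the bottom-right one.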

#+begin_html
<div class="figure">
<center>
<video controls="controls" width="550">
<source src="../video/worm-touch.ogg" type="video/ogg"
preload="none" />
</video>
<br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
</center>
<p>The worm responds to touch.</p>
</div>
#+end_html

#+begin_html
<div class="figure">
<center>
<video controls="controls" width="550">
<source src="../video/test-proprioception.ogg" type="video/ogg"
preload="none" />
</video>
<br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
</center>
<p>Proprioception in a worm. The proprioceptive readout is
in the upper left corner of the screen.</p>
</div>
#+end_html
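
The proprioceptive readout reports how the worm's body is bent. For a worm whose segments lie in a plane, a minimal version of that sense is the signed bend angle at each joint, computed from the direction vectors of consecutive segments. The planar simplification is an assumption made for illustration; a joint in three dimensions generally needs more than one angle to describe.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def joint_angles(points: Sequence[Point]) -> List[float]:
    """Signed bend (radians) at each interior joint of a planar chain.

    points[i] and points[i + 1] are the ends of segment i. A straight chain
    reads 0 at every joint; positive values are counterclockwise bends.
    """
    angles = []
    for a, b, c in zip(points, points[1:], points[2:]):
        ux, uy = b[0] - a[0], b[1] - a[1]   # direction of the incoming segment
        vx, vy = c[0] - b[0], c[1] - b[1]   # direction of the outgoing segment
        cross = ux * vy - uy * vx           # sine of the bend, scaled
        dot = ux * vx + uy * vy             # cosine of the bend, scaled
        angles.append(math.atan2(cross, dot))
    return angles


# Two straight segments, then a left turn of a quarter circle.
print(joint_angles([(0, 0), (1, 0), (2, 0), (2, 1)]))  # [0.0, 1.5707963267948966]
```

Because the angles are measured between the worm's own segments, they do not change when the whole worm is moved or rotated.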

A worm is trained in various actions such as sinusoidal movement,
curling, flailing, and spinning by directly playing motor
contractions while the worm "feels" the experience. These actions
are recorded both as vectors of muscle tension, touch, and
proprioceptive data, and in higher-level forms such as the
frequencies of the various contractions and a symbolic name for the
action.
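
Both levels of description, the raw sensory vectors and derived features such as contraction frequencies, can live in one record per training example. A sketch, assuming muscle tension is sampled at a fixed rate. The field names are hypothetical, and the dominant frequency is found with a plain discrete Fourier transform, which is a stand-in and not necessarily what =CORTEX= computes.

```python
import cmath
import math
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


def dominant_frequency(signal: Sequence[float], sample_rate: float) -> float:
    """Frequency (Hz) of the strongest non-constant component, via a naive DFT."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]   # drop the constant (DC) term
    best_k, best_power = 0, -1.0
    for k in range(1, n // 2 + 1):          # positive frequencies up to Nyquist
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        if abs(coeff) > best_power:
            best_k, best_power = k, abs(coeff)
    return best_k * sample_rate / n


@dataclass
class ActionExample:
    name: str                              # symbolic label, e.g. "curl"
    muscle_tension: List[List[float]]      # per frame, one value per muscle
    proprioception: List[List[float]]      # per frame, one angle per joint
    touch: List[List[float]] = field(default_factory=list)

    def contraction_frequencies(self, sample_rate: float) -> Dict[int, float]:
        """Dominant contraction frequency of each muscle, keyed by muscle index."""
        n_muscles = len(self.muscle_tension[0])
        return {m: dominant_frequency([frame[m] for frame in self.muscle_tension],
                                      sample_rate)
                for m in range(n_muscles)}


# One second at 60 frames/s: muscle 0 pulses at 2 Hz, muscle 1 at 5 Hz.
rate = 60.0
frames = [[math.sin(2 * math.pi * 2 * t / rate), math.sin(2 * math.pi * 5 * t / rate)]
          for t in range(60)]
example = ActionExample("sinusoidal", frames, proprioception=[[0.0]] * 60)
print(example.contraction_frequencies(rate))  # {0: 2.0, 1: 5.0}
```

The naive DFT is quadratic in the number of samples, which is acceptable for short training clips; a real system would use an FFT.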

Then, the worm watches a video of another worm performing one of
the actions, and must judge which action was performed. Normally
this would be an extremely difficult problem, but the worm is able
to greatly reduce the search space through sympathetic
imagination. First, it creates an imagined copy of its body which
it observes from a third-person point of view. Then for each frame
of the video, it maneuvers its simulated body to be in registration
with the worm depicted in the video. The physical constraints
imposed by the physics simulation greatly decrease the number of
poses that have to be tried, making the search feasible. As the
imaginary worm moves, it generates imaginary muscle tension and
proprioceptive sensations. The worm determines the action not by
vision, but by matching the imagined proprioceptive data with
previous examples.
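
The registration loop can be sketched for a planar worm under strong simplifying assumptions, all introduced here for illustration: each video frame has already been reduced to the 2D positions of the worm's segment endpoints, expressed with the tail at the origin and facing along +x; the imagined body is a chain of unit-length segments; and "physical constraints" are modeled as joint limits plus a cap on how far any joint may bend between frames. Each frame's search then covers only poses near the previous one, and the recovered joint angles are the imagined proprioceptive data.

```python
import itertools
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def forward_kinematics(bends: Sequence[float]) -> List[Point]:
    """Endpoints of a chain of unit segments; bends[i] is the turn at joint i."""
    points = [(0.0, 0.0), (1.0, 0.0)]   # tail at the origin, facing +x
    angle = 0.0
    for bend in bends:
        angle += bend
        x, y = points[-1]
        points.append((x + math.cos(angle), y + math.sin(angle)))
    return points


def mismatch(pose: Sequence[float], observed: Sequence[Point]) -> float:
    """Sum of squared distances between imagined and observed endpoints."""
    return sum(math.dist(p, q) ** 2
               for p, q in zip(forward_kinematics(pose), observed))


def register(frames: Sequence[Sequence[Point]], start: Sequence[float],
             max_step: float = 0.1,
             limit: float = math.pi / 2) -> List[Tuple[float, ...]]:
    """Track the observed worm frame by frame; return imagined joint angles.

    Only poses within max_step of the previous pose and within the joint
    limit are tried (brute force over 3**joints candidates, so this suits a
    handful of joints). The start pose must itself respect the limit.
    """
    pose = tuple(start)
    poses = []
    for observed in frames:
        candidates = (tuple(a + d for a, d in zip(pose, step))
                      for step in itertools.product((-max_step, 0.0, max_step),
                                                    repeat=len(pose)))
        feasible = [c for c in candidates if all(abs(a) <= limit for a in c)]
        pose = min(feasible, key=lambda c: mismatch(c, observed))
        poses.append(pose)
    return poses


# Synthetic "video": a two-joint worm curling into an S over three frames.
true_poses = [(0.1 * k, -0.1 * k) for k in range(1, 4)]
video = [forward_kinematics(p) for p in true_poses]
recovered = register(video, start=(0.0, 0.0))
print(recovered[-1])  # close to (0.3, -0.3)
```

Only the registration step sees positions in the camera's frame; its output is a body-centric sequence of angles.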

By using non-visual sensory data such as touch, the worms can also
answer body-related questions such as "did your head touch your
tail?" and "did worm A touch worm B?"
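
Once touch has been recorded, questions like these reduce to queries over the recorded stream. A sketch, assuming each frame of touch data has been summarized as the set of body-part pairs in contact; this summary format and the part names are hypothetical, chosen for brevity.

```python
from typing import FrozenSet, Optional, Sequence, Set, Tuple

Part = Tuple[str, str]              # (creature, body part)
Frame = Set[FrozenSet[Part]]        # unordered pairs of parts in contact


def first_contact(frames: Sequence[Frame], a: Part, b: Part) -> Optional[int]:
    """Index of the first frame in which parts a and b touch, or None."""
    pair = frozenset((a, b))
    for i, contacts in enumerate(frames):
        if pair in contacts:
            return i
    return None


def creatures_touched(frames: Sequence[Frame], x: str, y: str) -> bool:
    """True if any part of creature x ever touched any part of creature y."""
    for contacts in frames:
        for pair in contacts:
            if {creature for creature, _ in pair} == {x, y}:
                return True
    return False


frames = [
    set(),
    {frozenset({("A", "head"), ("A", "tail")})},
    {frozenset({("A", "tail"), ("B", "head")})},
]
print(first_contact(frames, ("A", "head"), ("A", "tail")))  # 1
print(creatures_touched(frames, "A", "B"))                  # True
```

Because the data are body-centric, the same queries work whether the touches were felt directly or generated by imagining another worm's motion.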

The proprioceptive information used for action identification is
body-centric, so only the registration step is dependent on point
of view, not the identification step. Registration is not specific
to any particular action. Thus, action identification can be
divided into a point-of-view dependent generic registration step,
and an action-specific step that is body-centered and invariant to
point of view.
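
Because the identification step sees only body-centric data, it can be written without any reference to the camera. A sketch of that step, assuming actions are compared as sequences of joint-angle vectors and classified by their nearest stored example. Dynamic time warping is used here as one plausible way to tolerate differences in speed and length; it is a swapped-in technique, not necessarily the matcher =CORTEX= uses.

```python
import math
from typing import Sequence, Tuple

Pose = Sequence[float]


def dtw_distance(a: Sequence[Pose], b: Sequence[Pose]) -> float:
    """Dynamic-time-warping distance between two joint-angle sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],       # a advances alone
                                 cost[i][j - 1],       # b advances alone
                                 cost[i - 1][j - 1])   # both advance
    return cost[n][m]


def identify_action(imagined: Sequence[Pose],
                    examples: Sequence[Tuple[str, Sequence[Pose]]]) -> str:
    """Label of the stored example closest to the imagined proprioception."""
    label, _ = min(examples, key=lambda example: dtw_distance(imagined, example[1]))
    return label


# Hypothetical two-joint training examples.
curl = [(0.1 * k, 0.1 * k) for k in range(6)]
wiggle = [(0.3, -0.3), (-0.3, 0.3)] * 3
examples = [("curl", curl), ("wiggle", wiggle)]

# A curl performed at half speed still matches "curl".
slow_curl = [(0.05 * k, 0.05 * k) for k in range(12)]
print(identify_action(slow_curl, examples))  # curl
```

Nothing here depends on where the camera was, which is the invariance the paragraph above describes.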

** Stick Figure World

This environment is similar to Worm World, except the creatures are
more complicated and the actions and questions more varied. It is
an experiment to see how far imagination can go in interpreting
actions.