#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil num:nil

* Artificial Imagination
Imagine watching a video of someone skateboarding. When you watch
the video, you can imagine yourself skateboarding, and your
knowledge of the human body and its dynamics guides your
interpretation of the scene. For example, even if the skateboarder
is partially occluded, you can infer the positions of his arms and
body from your own knowledge of how your body would be positioned
if you were skateboarding. If the skateboarder suffers an accident,
you wince in sympathy, imagining the pain your own body would
experience if it were in the same situation. This empathy with
other people guides our understanding of whatever they are doing
because it is a powerful constraint on what is probable and
possible. In order to make use of this powerful empathy constraint,
I need a system that can generate and make sense of sensory data
from the many different senses that humans possess. The two key
properties of such a system are /embodiment/ and /imagination/.

** What is imagination?

One kind of imagination is /sympathetic/ imagination: you imagine
yourself in the position of something or someone you are
observing. This type of imagination comes into play when you follow
along visually when watching someone perform actions, or when you
sympathetically grimace when someone hurts themselves. This type of
imagination uses the constraints you have learned about your own
body to highly constrain the possibilities in whatever you are
seeing. It uses all of your senses, including touch,
proprioception, etc. Humans are flexible when it comes to "putting
themselves in another's shoes," and can sympathetically understand
not only other humans, but entities ranging from animals to cartoon
characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

Another kind of imagination is /predictive/ imagination: you
construct scenes in your mind that are not entirely related to
whatever you are observing, but instead are predictions of the
future or simply flights of fancy. You use this type of imagination
to plan out multi-step actions, or play out dangerous situations in
your mind so as to avoid messing them up in reality.

Of course, sympathetic and predictive imagination blend into each
other and are not completely separate concepts. One dimension along
which you can distinguish types of imagination is dependence on raw
sense data. Sympathetic imagination is highly constrained by your
senses, while predictive imagination can be more or less dependent
on your senses depending on how far ahead you imagine. Daydreaming
is an extreme form of predictive imagination that wanders through
different possibilities without concern for whether they are
related to whatever is happening in reality.

For this thesis, I will mostly focus on sympathetic imagination and
the constraint it provides for understanding sensory data.

** What problems can imagination solve?

Consider a video of a cat drinking some water.

#+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
#+ATTR_LaTeX: width=5cm
[[../images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". I think humans are able to label
such a video as "drinking" because they imagine /themselves/ as the
cat, and imagine putting their face up against a stream of water
and sticking out their tongue. In that imagined world, they can
feel the cool water hitting their tongue, and feel the water
entering their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and having learned that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

- Create a physical model of the video by putting a "fuzzy" model
  of its own body in place of the cat. Also, create a simulation of
  the stream of water.

- Play out this simulated scene and generate imagined sensory
  experience. This will include relevant muscle contractions, a
  close-up view of the stream from the cat's perspective, and most
  importantly, the imagined feeling of water entering the mouth.

- The action is now easily identified as drinking by the sense of
  taste alone. The other senses (such as the tongue moving in and
  out) help to give plausibility to the simulated action. Note that
  the sense of vision, while critical in creating the simulation,
  is not critical for identifying the action from the simulation (a
  toy sketch of this final step follows the list).

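The following is a toy sketch (in Clojure, not actual =CORTEX=
code) of that final step: once imagination has produced per-frame
sensory data, "drinking" can be recognized from the sense of taste
alone. The frame format and the 0.5 threshold are invented for
illustration.

#+begin_src clojure
;; Label the imagined experience as drinking if water is tasted in
;; most of the imagined frames. Frames are hypothetical maps from
;; sense names to readings.
(defn drinking? [imagined-frames]
  (let [tasting (filter #(> (:taste % 0.0) 0.5) imagined-frames)]
    (> (count tasting) (/ (count imagined-frames) 2))))

(drinking? [{:taste 0.9 :touch 0.2}
            {:taste 0.8 :touch 0.3}
            {:taste 0.1 :touch 0.2}])
;; => true
#+end_src
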
More generally, I expect imaginative systems to be particularly
good at identifying embodied actions in videos.

* Cortex

The previous example involves liquids, the sense of taste, and
imagining oneself as a cat. For this thesis I constrain myself to
simpler, more easily digitizable senses and situations.

My system, =CORTEX=, performs imagination in two different
simplified worlds: /worm world/ and /stick-figure world/. In each
of these worlds, entities capable of imagination recognize actions
by simulating the experience from their own perspective, and then
recognizing the action from a database of examples.

In order to serve as a framework for experiments in imagination,
=CORTEX= requires simulated bodies, worlds, and senses like vision,
hearing, touch, proprioception, etc.

** A Video Game Engine takes care of some of the groundwork

When it comes to simulation environments, the engines used to
create the worlds in video games offer top-notch physics and
graphics support. These engines also have limited support for
creating cameras and rendering 3D sound, which can be repurposed
for vision and hearing respectively. Physics collision detection
can be expanded to create a sense of touch.

jMonkeyEngine3 is one such engine for creating video games in
Java. It uses OpenGL to render to the screen and uses scene graphs
to avoid drawing things that do not appear on the screen. It has an
active community and several games in the pipeline. The engine was
not built to serve any particular game but is instead meant to be
used for any 3D game. I chose jMonkeyEngine3 because it had the
most features out of all the open projects I looked at, and because
I could then write my code in Clojure, a dialect of Lisp that runs
on the JVM.

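As a minimal sketch of what driving jMonkeyEngine3 from Clojure
looks like (illustrative only, not =CORTEX= code), one can subclass
=SimpleApplication= with =proxy= and attach a shape to the scene
graph:

#+begin_src clojure
(ns cortex.test.hello-jme
  (:import (com.jme3.app SimpleApplication)
           (com.jme3.material Material)
           (com.jme3.math ColorRGBA)
           (com.jme3.scene Geometry)
           (com.jme3.scene.shape Box)))

;; Subclass SimpleApplication and attach a blue box to the root
;; node of the scene graph.
(defn hello-app []
  (proxy [SimpleApplication] []
    (simpleInitApp []
      (let [geom (Geometry. "box" (Box. 1 1 1))
            mat  (Material. (.getAssetManager this)
                            "Common/MatDefs/Misc/Unshaded.j3md")]
        (.setColor mat "Color" ColorRGBA/Blue)
        (.setMaterial geom mat)
        (.attachChild (.getRootNode this) geom)))))

;; (.start (hello-app)) ; opens a window and renders the box
#+end_src
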
** =CORTEX= extends jMonkeyEngine3 to implement rich senses

Using the game-making primitives provided by jMonkeyEngine3, I have
constructed every major human sense except for smell and
taste. =CORTEX= also provides an interface for creating creatures
in Blender, a 3D modeling environment, and then "rigging" the
creatures with senses using 3D annotations. A creature can have any
number of senses, and there can be any number of creatures in a
simulation.

The senses available in =CORTEX= are:

- [[../../cortex/html/vision.html][Vision]]
- [[../../cortex/html/hearing.html][Hearing]]
- [[../../cortex/html/touch.html][Touch]]
- [[../../cortex/html/proprioception.html][Proprioception]]
- [[../../cortex/html/movement.html][Muscle Tension]]

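To give a flavor of the design (the real =CORTEX= interface
differs; every name below is invented for illustration), a creature
with any number of senses can be modeled as a map from sense names
to sensor functions, polled uniformly:

#+begin_src clojure
;; Each sense is a function from world-state to a reading; a
;; creature is just a collection of such functions, so adding a
;; sense means adding an entry.
(def worm
  {:touch          (fn [world] (:contact world 0.0))
   :proprioception (fn [world] (:joint-angles world []))
   :muscle-tension (fn [world] (:tension world 0.0))})

(defn sense-all [creature world]
  (into {} (for [[sense read!] creature]
             [sense (read! world)])))

(sense-all worm {:contact 0.7 :joint-angles [0.1 0.4] :tension 0.2})
;; => {:touch 0.7, :proprioception [0.1 0.4], :muscle-tension 0.2}
#+end_src
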
* A roadmap for =CORTEX= experiments

** Worm World

Worms in =CORTEX= are segmented creatures which vary in length and
number of segments, and have the senses of vision, proprioception,
touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]

#+begin_html
<div class="figure">
<center>
<video controls="controls" width="550">
  <source src="../video/worm-touch.ogg" type="video/ogg"
          preload="none" />
</video>
<br> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
</center>
<p>The worm responds to touch.</p>
</div>
#+end_html

#+begin_html
<div class="figure">
<center>
<video controls="controls" width="550">
  <source src="../video/test-proprioception.ogg" type="video/ogg"
          preload="none" />
</video>
<br> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
</center>
<p>Proprioception in a worm. The proprioceptive readout is
in the upper left corner of the screen.</p>
</div>
#+end_html

A worm is trained in various actions such as sinusoidal movement,
curling, flailing, and spinning by directly playing motor
contractions while the worm "feels" the experience. These actions
are recorded both as raw vectors of muscle tension, touch, and
proprioceptive data, and in higher-level forms such as the
frequencies of the various contractions and a symbolic name for the
action.

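A recorded example might look something like the following (the
layout and numbers are invented for illustration; this is not the
actual =CORTEX= data format):

#+begin_src clojure
;; One training example: raw per-frame sensory vectors plus a
;; higher-level summary and a symbolic name for the action.
(def curl-example
  {:name        :curl
   :muscle      [[0.0 0.2] [0.1 0.5] [0.3 0.9]] ; tension per muscle, per frame
   :touch       [[0.0 0.0] [0.0 0.4] [0.2 0.8]] ; contact per region, per frame
   :proprio     [[0.0 0.1] [0.2 0.6] [0.5 1.1]] ; joint angles per frame
   :frequencies {:contraction-hz 0.5}})
#+end_src
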
Then, the worm watches a video of another worm performing one of
the actions, and must judge which action was performed. Normally
this would be an extremely difficult problem, but the worm is able
to greatly diminish the search space through sympathetic
imagination. First, it creates an imagined copy of its body which
it observes from a third-person point of view. Then, for each frame
of the video, it maneuvers its simulated body to be in registration
with the worm depicted in the video. The physical constraints
imposed by the physics simulation greatly decrease the number of
poses that have to be tried, making the search feasible. As the
imaginary worm moves, it generates imaginary muscle tension and
proprioceptive sensations. The worm determines the action not by
vision, but by matching the imagined proprioceptive data against
previous examples.

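The matching step can be sketched as a nearest-neighbor lookup over
stored examples (a toy, with invented data and a deliberately
simple distance function):

#+begin_src clojure
;; Sum of absolute differences between two equal-length sequences
;; of proprioceptive joint angles.
(defn distance [xs ys]
  (reduce + (map #(Math/abs (double (- %1 %2))) xs ys)))

;; Toy database: one proprioceptive trace per known action.
(def action-examples
  {:curl   [0.0 0.4 0.9 1.2]
   :wiggle [0.3 -0.3 0.3 -0.3]})

;; Label an imagined trace with the name of its nearest example.
(defn identify-action [imagined]
  (key (apply min-key #(distance imagined (val %)) action-examples)))

(identify-action [0.1 0.5 0.8 1.1])
;; => :curl
#+end_src
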
By using non-visual sensory data such as touch, the worms can also
answer body-related questions such as "did your head touch your
tail?" and "did worm A touch worm B?"

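Such questions reduce to simple predicates over the recorded touch
data. A toy version (the frame format is invented):

#+begin_src clojure
;; Given per-frame touch readings between named body regions, ask
;; whether the head ever registered contact with the tail.
(defn head-touched-tail? [frames]
  (some (fn [frame]
          (pos? (get-in frame [:touch :head :tail] 0.0)))
        frames))

(head-touched-tail? [{:touch {:head {:tail 0.0}}}
                     {:touch {:head {:tail 0.9}}}])
;; => true
#+end_src
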
The proprioceptive information used for action identification is
body-centric, so only the registration step is dependent on point
of view, not the identification step. Registration is not specific
to any particular action. Thus, action identification can be
divided into a point-of-view-dependent, generic registration step,
and an action-specific step that is body-centered and invariant to
point of view.

** Stick Figure World

This environment is similar to Worm World, except the creatures are
more complicated and the actions and questions more varied. It is
an experiment to see how far imagination can go in interpreting
actions.
|