Mercurial > cortex
comparison thesis/dylan-cortex-diff.diff @ 513:4c4d45f6f30b
author:   Robert McIntyre <rlm@mit.edu>
date:     Sun, 30 Mar 2014 10:41:18 -0400
parents:
children: 447c3c8405a2
comparison: 512:8b962ab418c8 -> 513:4c4d45f6f30b
diff -r f639e2139ce2 thesis/cortex.org
--- a/thesis/cortex.org	Sun Mar 30 01:34:43 2014 -0400
+++ b/thesis/cortex.org	Sun Mar 30 10:07:17 2014 -0400
@@ -41,49 +41,46 @@
 [[./images/aurellem-gray.png]]


-* Empathy and Embodiment as problem solving strategies
+* Empathy \& Embodiment: problem solving strategies

-  By the end of this thesis, you will have seen a novel approach to
-  interpreting video using embodiment and empathy. You will have also
-  seen one way to efficiently implement empathy for embodied
-  creatures. Finally, you will become familiar with =CORTEX=, a system
-  for designing and simulating creatures with rich senses, which you
-  may choose to use in your own research.
-
-  This is the core vision of my thesis: That one of the important ways
-  in which we understand others is by imagining ourselves in their
-  position and emphatically feeling experiences relative to our own
-  bodies. By understanding events in terms of our own previous
-  corporeal experience, we greatly constrain the possibilities of what
-  would otherwise be an unwieldy exponential search. This extra
-  constraint can be the difference between easily understanding what
-  is happening in a video and being completely lost in a sea of
-  incomprehensible color and movement.
-
-** Recognizing actions in video is extremely difficult
-
-   Consider for example the problem of determining what is happening
-   in a video of which this is one frame:
-
+** The problem: recognizing actions in video is extremely difficult
+# developing / requires useful representations
+
+   Examine the following collection of images. As you, and indeed very
+   young children, can easily determine, each one is a picture of
+   someone drinking.
+
+   # dxh: cat, cup, drinking fountain, rain, straw, coconut
    #+caption: A cat drinking some water. Identifying this action is
-   #+caption: beyond the state of the art for computers.
+   #+caption: beyond the capabilities of existing computer vision systems.
    #+ATTR_LaTeX: :width 7cm
    [[./images/cat-drinking.jpg]]
+
+   Nevertheless, it is beyond the state of the art for a computer
+   vision program to describe what's happening in each of these
+   images, or what's common to them. Part of the problem is that many
+   computer vision systems focus on pixel-level details or probability
+   distributions of pixels, with little focus on [...]
+
+
+   In fact, the contents of a scene may have much less to do with pixel
+   probabilities than with recognizing various affordances: things you
+   can move, objects you can grasp, spaces that can be filled
+   (Gibson). For example, what processes might enable you to see the
+   chair in figure \ref{hidden-chair}?
+   # Or suppose that you are building a program that recognizes chairs.
+   # How could you ``see'' the chair?

-   It is currently impossible for any computer program to reliably
-   label such a video as ``drinking''. And rightly so -- it is a very
-   hard problem! What features can you describe in terms of low level
-   functions of pixels that can even begin to describe at a high level
-   what is happening here?
-
-   Or suppose that you are building a program that recognizes chairs.
-   How could you ``see'' the chair in figure \ref{hidden-chair}?
-
+   # dxh: blur chair
    #+caption: The chair in this image is quite obvious to humans, but I
    #+caption: doubt that any modern computer vision program can find it.
    #+name: hidden-chair
    #+ATTR_LaTeX: :width 10cm
    [[./images/fat-person-sitting-at-desk.jpg]]
+
+
+
+

    Finally, how is it that you can easily tell the difference between
    how the girls /muscles/ are working in figure \ref{girl}?
@@ -95,10 +92,13 @@
    #+ATTR_LaTeX: :width 7cm
    [[./images/wall-push.png]]

+
+
+
    Each of these examples tells us something about what might be going
    on in our minds as we easily solve these recognition problems.

-   The hidden chairs show us that we are strongly triggered by cues
+   The hidden chair shows us that we are strongly triggered by cues
    relating to the position of human bodies, and that we can determine
    the overall physical configuration of a human body even if much of
    that body is occluded.
@@ -109,10 +109,107 @@
    most positions, and we can easily project this self-knowledge to
    imagined positions triggered by images of the human body.

-** =EMPATH= neatly solves recognition problems
+** A step forward: the sensorimotor-centered approach
+# ** =EMPATH= recognizes what creatures are doing
+# neatly solves recognition problems
+   In this thesis, I explore the idea that our knowledge of our own
+   bodies enables us to recognize the actions of others.
+
+   First, I built a system for constructing virtual creatures with
+   physiologically plausible sensorimotor systems and detailed
+   environments. The result is =CORTEX=, which is described in section
+   \ref{sec-2}. (=CORTEX= was built to be flexible and useful to other
+   AI researchers; it is provided in full with detailed instructions
+   on the web [here].)
+
+   Next, I wrote routines which enabled a simple worm-like creature to
+   infer the actions of a second worm-like creature, using only its
+   own prior sensorimotor experiences and knowledge of the second
+   worm's joint positions. This program, =EMPATH=, is described in
+   section \ref{sec-3}, and the key results of this experiment are
+   summarized below.
+
+   #+caption: From only \emph{proprioceptive} data, =EMPATH= was able to infer
+   #+caption: the complete sensory experience and classify these four poses.
+   #+caption: The last image is a composite, depicting the intermediate stages of \emph{wriggling}.
+   #+name: worm-recognition-intro-2
+   #+ATTR_LaTeX: :width 15cm
+   [[./images/empathy-1.png]]
+
+   # =CORTEX= provides a language for describing the sensorimotor
+   # experiences of various creatures.
+
+   # Next, I developed an experiment to test the power of =CORTEX='s
+   # sensorimotor-centered language for solving recognition problems. As
+   # a proof of concept, I wrote routines which enabled a simple
+   # worm-like creature to infer the actions of a second worm-like
+   # creature, using only its own previous sensorimotor experiences and
+   # knowledge of the second worm's joints (figure
+   # \ref{worm-recognition-intro-2}). The result of this proof of
+   # concept was the program =EMPATH=, described in section
+   # \ref{sec-3}. The key results of this
+
+   # Using only first-person sensorimotor experiences and third-person
+   # proprioceptive data,
+
+*** Key results
+   - After one-shot supervised training, =EMPATH= was able to recognize a
+     wide variety of static poses and dynamic actions---ranging from
+     curling in a circle to wriggling with a particular frequency---
+     with 95\% accuracy.
+   - These results were completely independent of viewing angle
+     because the underlying body-centered language is fundamentally
+     viewpoint-independent; once an action is learned, it can be
+     recognized equally well from any viewing angle.
+   - =EMPATH= is surprisingly short; the sensorimotor-centered
+     language provided by =CORTEX= resulted in extremely economical
+     recognition routines --- about 0000 lines in all --- suggesting
+     that such representations are very powerful, and often
+     indispensable for the types of recognition tasks considered here.
+   - Although for expediency's sake I relied on direct knowledge of
+     joint positions in this proof of concept, it would be
+     straightforward to extend =EMPATH= so that it (more
+     realistically) infers joint positions from its visual data.
+
+# because the underlying language is fundamentally orientation-independent
+
+# recognize the actions of a worm with 95\% accuracy. The
+# recognition tasks

-   I propose a system that can express the types of recognition
-   problems above in a form amenable to computation. It is split into
+
+
+
+   [Talk about these results and what you find promising about them]
+
+** Roadmap
+   [I'm going to explain how =CORTEX= works, then break down how
+   =EMPATH= does its thing. Because the details reveal such-and-such
+   about the approach.]
+
+   # The success of this simple proof-of-concept offers a tantalizing
+
+
+   # explore the idea
+   # The key contribution of this thesis is the idea that body-centered
+   # representations (which express
+
+
+   # the
+   # body-centered approach --- in which I try to determine what's
+   # happening in a scene by bringing it into registration with my own
+   # bodily experiences --- are indispensable for recognizing what
+   # creatures are doing in a scene.
+
+* COMMENT
+# body-centered language
+
+   In this thesis, I'll describe =EMPATH=, which solves a certain
+   class of recognition problems
+
+   The key idea is to use self-centered (or first-person) language.
+
+   I have built a system that can express the types of recognition
+   problems in a form amenable to computation. It is split into
    four parts:

    - Free/Guided Play :: The creature moves around and experiences the
@@ -286,14 +383,14 @@
      code to create a creature, and can use a wide library of
      pre-existing blender models as a base for your own creatures.

-   - =CORTEX= implements a wide variety of senses, including touch,
+   - =CORTEX= implements a wide variety of senses: touch,
      proprioception, vision, hearing, and muscle tension. Complicated
      senses like touch, and vision involve multiple sensory elements
      embedded in a 2D surface. You have complete control over the
      distribution of these sensor elements through the use of simple
      png image files. In particular, =CORTEX= implements more
      comprehensive hearing than any other creature simulation system
-     available. 
+     available.

    - =CORTEX= supports any number of creatures and any number of
      senses. Time in =CORTEX= dialates so that the simulated creatures
@@ -353,7 +450,24 @@
      \end{sidewaysfigure}
    #+END_LaTeX

-** Contributions
+** Road map
+
+   By the end of this thesis, you will have seen a novel approach to
+   interpreting video using embodiment and empathy. You will have also
+   seen one way to efficiently implement empathy for embodied
+   creatures. Finally, you will become familiar with =CORTEX=, a system
+   for designing and simulating creatures with rich senses, which you
+   may choose to use in your own research.
+
+   This is the core vision of my thesis: That one of the important ways
+   in which we understand others is by imagining ourselves in their
+   position and empathically feeling experiences relative to our own
+   bodies. By understanding events in terms of our own previous
+   corporeal experience, we greatly constrain the possibilities of what
+   would otherwise be an unwieldy exponential search. This extra
+   constraint can be the difference between easily understanding what
+   is happening in a video and being completely lost in a sea of
+   incomprehensible color and movement.

   - I built =CORTEX=, a comprehensive platform for embodied AI
     experiments. =CORTEX= supports many features lacking in other
@@ -363,18 +477,22 @@
   - I built =EMPATH=, which uses =CORTEX= to identify the actions of
     a worm-like creature using a computational model of empathy.

-* Building =CORTEX=
-
-  I intend for =CORTEX= to be used as a general-purpose library for
-  building creatures and outfitting them with senses, so that it will
-  be useful for other researchers who want to test out ideas of their
-  own. To this end, wherver I have had to make archetictural choices
-  about =CORTEX=, I have chosen to give as much freedom to the user as
-  possible, so that =CORTEX= may be used for things I have not
-  forseen.
-
-** Simulation or Reality?
-
+
+* Designing =CORTEX=
+  In this section, I outline the design decisions that went into
+  making =CORTEX=, along with some details about its
+  implementation. (A practical guide to getting started with =CORTEX=,
+  which skips over the history and implementation details presented
+  here, is provided in an appendix \ref{} at the end of this paper.)
+
+  Throughout this project, I intended for =CORTEX= to be flexible and
+  extensible enough to be useful for other researchers who want to
+  test out ideas of their own. To this end, wherever I have had to make
+  architectural choices about =CORTEX=, I have chosen to give as much
+  freedom to the user as possible, so that =CORTEX= may be used for
+  things I have not foreseen.
+
+** Building in simulation versus reality
   The most important archetictural decision of all is the choice to
   use a computer-simulated environemnt in the first place! The world
   is a vast and rich place, and for now simulations are a very poor
@@ -436,7 +554,7 @@
   doing everything in software is far cheaper than building custom
   real-time hardware. All you need is a laptop and some patience.

-** Because of Time, simulation is perferable to reality
+** Simulated time enables rapid prototyping and complex scenes

   I envision =CORTEX= being used to support rapid prototyping and
   iteration of ideas. Even if I could put together a well constructed
@@ -459,8 +577,8 @@
   simulations of very simple creatures in =CORTEX= generally run at
   40x on my machine!

-** What is a sense?
-
+** All sense organs are two-dimensional surfaces
+# What is a sense?
   If =CORTEX= is to support a wide variety of senses, it would help
   to have a better understanding of what a ``sense'' actually is!
   While vision, touch, and hearing all seem like they are quite
@@ -956,7 +1074,7 @@
   #+ATTR_LaTeX: :width 15cm
   [[./images/physical-hand.png]]

-** Eyes reuse standard video game components
+** Sight reuses standard video game components...

   Vision is one of the most important senses for humans, so I need to
   build a simulated sense of vision for my AI. I will do this with
@@ -1257,8 +1375,8 @@
   community and is now (in modified form) part of a system for
   capturing in-game video to a file.

-** Hearing is hard; =CORTEX= does it right
-
+** ...but hearing must be built from scratch
+# is hard; =CORTEX= does it right
   At the end of this section I will have simulated ears that work the
   same way as the simulated eyes in the last section. I will be able to
   place any number of ear-nodes in a blender file, and they will bind to
@@ -1565,7 +1683,7 @@
   jMonkeyEngine3 community and is used to record audio for demo
   videos.

-** Touch uses hundreds of hair-like elements
+** Hundreds of hair-like elements provide a sense of touch

   Touch is critical to navigation and spatial reasoning and as such I
   need a simulated version of it to give to my AI creatures.
@@ -2059,7 +2177,7 @@
   #+ATTR_LaTeX: :width 15cm
   [[./images/touch-cube.png]]

-** Proprioception is the sense that makes everything ``real''
+** Proprioception provides knowledge of your own body's position

   Close your eyes, and touch your nose with your right index finger.
   How did you do it? You could not see your hand, and neither your
@@ -2193,7 +2311,7 @@
   #+ATTR_LaTeX: :width 11cm
   [[./images/proprio.png]]

-** Muscles are both effectors and sensors
+** Muscles contain both sensors and effectors

   Surprisingly enough, terrestrial creatures only move by using
   torque applied about their joints. There's not a single straight
@@ -2440,7 +2558,8 @@
   hard control problems without worrying about physics or
   senses.

-* Empathy in a simulated worm
+* =EMPATH=: the simulated worm experiment
+# Empathy in a simulated worm

   Here I develop a computational model of empathy, using =CORTEX= as a
   base. Empathy in this context is the ability to observe another
@@ -2732,7 +2851,7 @@
   provided by an experience vector and reliably infering the rest of
   the senses.

-** Empathy is the process of tracing though \Phi-space
+** ``Empathy'' requires retracing steps through \Phi-space

   Here is the core of a basic empathy algorithm, starting with an
   experience vector:
@@ -2888,7 +3007,7 @@
   #+end_src
   #+end_listing

-** Efficient action recognition with =EMPATH=
+** =EMPATH= recognizes actions efficiently

   To use =EMPATH= with the worm, I first need to gather a set of
   experiences from the worm that includes the actions I want to
@@ -3044,9 +3163,9 @@
   to interpretation, and dissaggrement between empathy and experience
   is more excusable.

-** Digression: bootstrapping touch using free exploration
-
-   In the previous section I showed how to compute actions in terms of
+** Digression: Learn touch sensor layout through haptic experimentation, instead
+# Bootstrapping touch using free exploration
+In the previous section I showed how to compute actions in terms of
 body-centered predicates which relied averate touch activation of
 pre-defined regions of the worm's skin. What if, instead of recieving
 touch pre-grouped into the six faces of each worm segment, the true
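
The diff above repeatedly describes =EMPATH='s core mechanism: classifying what another creature is doing by matching its observed proprioceptive data against the observer's own stored first-person experiences (the \Phi-space). As a rough illustration of that idea only --- not the thesis's actual Clojure implementation, and with all names hypothetical --- a nearest-neighbor version of the matching step can be sketched in Python:

```python
import math

# Hypothetical sketch: each stored "experience" pairs a proprioceptive
# snapshot (a tuple of joint angles) with the action label the observer
# learned first-hand. An observed pose is classified by finding the
# closest stored snapshot -- a body-centered, viewpoint-independent
# comparison, since joint angles carry no camera orientation.

def distance(pose_a, pose_b):
    """Euclidean distance between two joint-angle tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose_a, pose_b)))

def classify(observed_pose, experiences):
    """Return the action label of the nearest stored experience."""
    best = min(experiences, key=lambda exp: distance(observed_pose, exp[0]))
    return best[1]

# Tiny worm-like example: two joints, angles in radians.
experiences = [
    ((0.0, 0.0), "resting"),
    ((1.5, 1.5), "curled"),
    ((1.5, -1.5), "wiggling"),
]

print(classify((1.4, 1.6), experiences))  # nearest stored pose is "curled"
```

Note that =EMPATH= as described works over *sequences* of proprioceptive frames (e.g. the intermediate stages of wriggling), not single poses; this per-frame sketch only conveys the flavor of matching observations against one's own stored bodily experience.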