comparison thesis/cortex.org @ 516:ced955c3c84f
resurrect old cortex to fix flow issues.
author | Robert McIntyre <rlm@mit.edu>
date | Sun, 30 Mar 2014 22:48:19 -0400
parents | 58fa1ffd481e
children | 68665d2c32a7
515:58fa1ffd481e | 516:ced955c3c84f
40 #+ATTR_LaTeX: :width 10cm | 40 #+ATTR_LaTeX: :width 10cm |
41 [[./images/aurellem-gray.png]] | 41 [[./images/aurellem-gray.png]] |
42 | 42 |
43 | 43 |
44 * Empathy \& Embodiment: problem solving strategies | 44 * Empathy \& Embodiment: problem solving strategies |
45 | |
46 By the end of this thesis, you will have seen a novel approach to | |
47 interpreting video using embodiment and empathy. You will have also | |
48 seen one way to efficiently implement empathy for embodied | |
49 creatures. Finally, you will become familiar with =CORTEX=, a system | |
50 for designing and simulating creatures with rich senses, which you | |
51 may choose to use in your own research. | |
45 | 52 |
46 ** The problem: recognizing actions in video is extremely difficult | 53 This is the core vision of my thesis: That one of the important ways |
47 # developing / requires useful representations | 54 in which we understand others is by imagining ourselves in their |
48 | 55 position and empathically feeling experiences relative to our own |
49 Examine the following collection of images. As you, and indeed very | 56 bodies. By understanding events in terms of our own previous |
50 young children, can easily determine, each one is a picture of | 57 corporeal experience, we greatly constrain the possibilities of what |
51 someone drinking. | 58 would otherwise be an unwieldy exponential search. This extra |
52 | 59 constraint can be the difference between easily understanding what |
53 # dxh: cat, cup, drinking fountain, rain, straw, coconut | 60 is happening in a video and being completely lost in a sea of |
61 incomprehensible color and movement. | |
62 | |
63 | |
64 ** The problem: recognizing actions in video is hard! | |
65 | |
66 Examine the following image. What is happening? As you, and indeed | |
67 very young children, can easily determine, this is an image of | |
68 drinking. | |
69 | |
54 #+caption: A cat drinking some water. Identifying this action is | 70 #+caption: A cat drinking some water. Identifying this action is |
55 #+caption: beyond the capabilities of existing computer vision systems. | 71 #+caption: beyond the capabilities of existing computer vision systems. |
56 #+ATTR_LaTeX: :width 7cm | 72 #+ATTR_LaTeX: :width 7cm |
57 [[./images/cat-drinking.jpg]] | 73 [[./images/cat-drinking.jpg]] |
58 | 74 |
59 Nevertheless, it is beyond the state of the art for a computer | 75 Nevertheless, it is beyond the state of the art for a computer |
60 vision program to describe what's happening in each of these | 76 vision program to describe what's happening in this image. Part of |
61 images, or what's common to them. Part of the problem is that many | 77 the problem is that many computer vision systems focus on |
62 computer vision systems focus on pixel-level details or probability | 78 pixel-level details or comparisons to example images (such as |
63 distributions of pixels, with little focus on [...] | 79 \cite{volume-action-recognition}), but the 3D world is so variable |
64 | 80 that it is hard to describe the world in terms of possible images. |
65 | 81 |
66 In fact, the contents of a scene may have much less to do with pixel | 82 In fact, the contents of a scene may have much less to do with pixel |
67 probabilities than with recognizing various affordances: things you | 83 probabilities than with recognizing various affordances: things you |
68 can move, objects you can grasp, spaces that can be filled | 84 can move, objects you can grasp, spaces that can be filled (Gibson). For |
69 (Gibson). For example, what processes might enable you to see the | 85 example, what processes might enable you to see the chair in figure |
70 chair in figure \ref{hidden-chair}? | 86 \ref{hidden-chair}? |
71 # Or suppose that you are building a program that recognizes chairs. | 87 |
72 # How could you ``see'' the chair ? | |
73 | |
74 # dxh: blur chair | |
75 #+caption: The chair in this image is quite obvious to humans, but I | 88 #+caption: The chair in this image is quite obvious to humans, but I |
76 #+caption: doubt that any modern computer vision program can find it. | 89 #+caption: doubt that any modern computer vision program can find it. |
77 #+name: hidden-chair | 90 #+name: hidden-chair |
78 #+ATTR_LaTeX: :width 10cm | 91 #+ATTR_LaTeX: :width 10cm |
79 [[./images/fat-person-sitting-at-desk.jpg]] | 92 [[./images/fat-person-sitting-at-desk.jpg]] |
80 | 93 |
81 | |
82 | |
83 | |
84 | |
85 Finally, how is it that you can easily tell the difference between | 94 Finally, how is it that you can easily tell the difference between |
86 how the girl's /muscles/ are working in the two images of figure \ref{girl}? | 95 how the girl's /muscles/ are working in the two images of figure \ref{girl}? |
87 | 96 |
88 #+caption: The mysterious ``common sense'' appears here as you are able | 97 #+caption: The mysterious ``common sense'' appears here as you are able |
89 #+caption: to discern the difference in how the girl's arm muscles | 98 #+caption: to discern the difference in how the girl's arm muscles |
90 #+caption: are activated between the two images. | 99 #+caption: are activated between the two images. |
91 #+name: girl | 100 #+name: girl |
92 #+ATTR_LaTeX: :width 7cm | 101 #+ATTR_LaTeX: :width 7cm |
93 [[./images/wall-push.png]] | 102 [[./images/wall-push.png]] |
94 | 103 |
95 | |
96 | |
97 | |
98 Each of these examples tells us something about what might be going | 104 Each of these examples tells us something about what might be going |
99 on in our minds as we easily solve these recognition problems. | 105 on in our minds as we easily solve these recognition problems. |
100 | 106 |
101 The hidden chair shows us that we are strongly triggered by cues | 107 The hidden chair shows us that we are strongly triggered by cues |
102 relating to the position of human bodies, and that we can determine | 108 relating to the position of human bodies, and that we can determine |
108 We know well how our muscles would have to work to maintain us in | 114 We know well how our muscles would have to work to maintain us in |
109 most positions, and we can easily project this self-knowledge to | 115 most positions, and we can easily project this self-knowledge to |
110 imagined positions triggered by images of the human body. | 116 imagined positions triggered by images of the human body. |
111 | 117 |
112 ** A step forward: the sensorimotor-centered approach | 118 ** A step forward: the sensorimotor-centered approach |
113 # ** =EMPATH= recognizes what creatures are doing | 119 |
114 # neatly solves recognition problems | |
115 In this thesis, I explore the idea that our knowledge of our own | 120 In this thesis, I explore the idea that our knowledge of our own |
116 bodies enables us to recognize the actions of others. | 121 bodies, combined with our own rich senses, enables us to recognize |
122 the actions of others. | |
123 | |
124 For example, I think humans are able to label the cat video as | |
125 ``drinking'' because they imagine /themselves/ as the cat, and | |
126 imagine putting their face up against a stream of water and | |
127 sticking out their tongue. In that imagined world, they can feel | |
128 the cool water hitting their tongue, and feel the water entering | |
129 their body, and are able to recognize that /feeling/ as drinking. | |
130 So, the label of the action is not really in the pixels of the | |
131 image, but is found clearly in a simulation inspired by those | |
132 pixels. An imaginative system, having been trained on drinking and | |
133 non-drinking examples and learning that the most important | |
134 component of drinking is the feeling of water sliding down one's | |
135 throat, would analyze a video of a cat drinking in the following | |
136 manner: | |
137 | |
138 1. Create a physical model of the video by putting a ``fuzzy'' | |
139 model of its own body in place of the cat. Possibly also create | |
140 a simulation of the stream of water. | |
141 | |
142 2. Play out this simulated scene and generate imagined sensory | |
143 experience. This will include relevant muscle contractions, a | |
144 close up view of the stream from the cat's perspective, and most | |
145 importantly, the imagined feeling of water entering the | |
146 mouth. The imagined sensory experience can come from a | |
147 simulation of the event, but can also be pattern-matched from | |
148 previous, similar embodied experience. | |
149 | |
150 3. The action is now easily identified as drinking by the sense of | |
151 taste alone. The other senses (such as the tongue moving in and | |
152 out) help to give plausibility to the simulated action. Note that | |
153 the sense of vision, while critical in creating the simulation, | |
154 is not critical for identifying the action from the simulation. | |
155 | |
156 For the chair examples, the process is even easier: | |
157 | |
158 1. Align a model of your body to the person in the image. | |
159 | |
160 2. Generate proprioceptive sensory data from this alignment. | |
161 | |
162 3. Use the imagined proprioceptive data as a key to look up related | |
163 sensory experience associated with that particular proprioceptive | |
164 feeling. | |
165 | |
166 4. Retrieve the feeling of your bottom resting on a surface, your | |
167 knees bent, and your leg muscles relaxed. | |
168 | |
169 5. This sensory information is consistent with your =sitting?= | |
170 sensory predicate, so you (and the entity in the image) must be | |
171 sitting. | |
172 | |
173 6. There must be a chair-like object since you are sitting. | |
174 | |
175 Empathy offers yet another alternative to the age-old AI | |
176 representation question: ``What is a chair?'' --- A chair is the | |
177 feeling of sitting! | |
178 | |
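As a rough illustration of the chair procedure above, the following is a minimal sketch in Clojure. The helpers =align-model=, =proprioception=, =lookup-experience=, and the way =sitting?= is applied here are hypothetical stand-ins, not the actual =EMPATH= functions described later.

#+begin_src clojure
;; Hypothetical sketch of the six-step chair procedure; none of these
;; helper functions are taken from EMPATH itself.
(defn implies-chair?
  "Guess whether the person in `image` is sitting on something
   chair-like by re-living the pose with one's own body model."
  [image body-model]
  (let [aligned    (align-model body-model image)   ; step 1: align the body model
        proprio    (proprioception aligned)         ; step 2: imagined proprioception
        experience (lookup-experience proprio)]     ; steps 3-4: recall related feelings
    (sitting? experience)))                         ; steps 5-6: sitting implies a chair
#+end_src
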
179 One powerful advantage of empathic problem solving is that it | |
180 factors the action recognition problem into two easier problems. To | |
181 use empathy, you need an /aligner/, which takes the video and a | |
182 model of your body, and aligns the model with the video. Then, you | |
183 need a /recognizer/, which uses the aligned model to interpret the | |
184 action. The power in this method lies in the fact that you describe | |
185 all actions from a body-centered viewpoint. You are less tied to | |
186 the particulars of any visual representation of the actions. If you | |
187 teach the system what ``running'' is, and you have a good enough | |
188 aligner, the system will from then on be able to recognize running | |
189 from any point of view, even strange points of view like above or | |
190 underneath the runner. This is in contrast to action recognition | |
191 schemes that try to identify actions using a non-embodied approach. | |
192 If these systems learn about running as viewed from the side, they | |
193 will not automatically be able to recognize running from any other | |
194 viewpoint. | |
195 | |
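To make this aligner/recognizer factoring concrete, here is a minimal sketch; =align-with-video= and =interpret-action= are assumed, hypothetical components rather than functions defined elsewhere in this thesis.

#+begin_src clojure
;; Hypothetical sketch of the two-part factoring: an aligner followed
;; by a body-centered recognizer. Both function names are assumptions.
(defn empathic-recognize
  "Label the action in `video` by aligning `body-model` to it and then
   interpreting the aligned, body-centered model."
  [video body-model]
  (-> video
      (align-with-video body-model)  ; aligner: video + body model -> aligned model
      (interpret-action)))           ; recognizer: aligned model -> action label
#+end_src
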
196 Another powerful advantage is that using the language of multiple | |
197 body-centered rich senses to describe body-centered actions offers a | |
198 massive boost in descriptive capability. Consider how difficult it | |
199 would be to compose a set of HOG filters to describe the action of | |
200 a simple worm-creature ``curling'' so that its head touches its | |
201 tail, and then behold the simplicity of describing this action in a | |
202 language designed for the task (listing \ref{grand-circle-intro}): | |
203 | |
204 #+caption: Body-centered actions are best expressed in a body-centered | |
205 #+caption: language. This code detects when the worm has curled into a | |
206 #+caption: full circle. Imagine how you would replicate this functionality | |
207 #+caption: using low-level pixel features such as HOG filters! | |
208 #+name: grand-circle-intro | |
209 #+begin_listing clojure | |
210 #+begin_src clojure | |
211 (defn grand-circle? | |
212 "Does the worm form a majestic circle (one end touching the other)?" | |
213 [experiences] | |
214 (and (curled? experiences) | |
215 (let [worm-touch (:touch (peek experiences)) | |
216 tail-touch (worm-touch 0) | |
217 head-touch (worm-touch 4)] | |
218 (and (< 0.2 (contact worm-segment-bottom-tip tail-touch)) | |
219 (< 0.2 (contact worm-segment-top-tip head-touch)))))) | |
220 #+end_src | |
221 #+end_listing | |
222 | |
223 ** =EMPATH= recognizes actions using empathy | |
117 | 224 |
118 First, I built a system for constructing virtual creatures with | 225 First, I built a system for constructing virtual creatures with |
119 physiologically plausible sensorimotor systems and detailed | 226 physiologically plausible sensorimotor systems and detailed |
120 environments. The result is =CORTEX=, which is described in section | 227 environments. The result is =CORTEX=, which is described in section |
121 \ref{sec-2}. (=CORTEX= was built to be flexible and useful to other | 228 \ref{sec-2}. (=CORTEX= was built to be flexible and useful to other |
126 infer the actions of a second worm-like creature, using only its | 233 infer the actions of a second worm-like creature, using only its |
127 own prior sensorimotor experiences and knowledge of the second | 234 own prior sensorimotor experiences and knowledge of the second |
128 worm's joint positions. This program, =EMPATH=, is described in | 235 worm's joint positions. This program, =EMPATH=, is described in |
129 section \ref{sec-3}, and the key results of this experiment are | 236 section \ref{sec-3}, and the key results of this experiment are |
130 summarized below. | 237 summarized below. |
131 | |
132 #+caption: From only \emph{proprioceptive} data, =EMPATH= was able to infer | |
133 #+caption: the complete sensory experience and classify these four poses. | |
134 #+caption: The last image is a composite, depicting the intermediate stages of \emph{wriggling}. | |
135 #+name: worm-recognition-intro-2 | |
136 #+ATTR_LaTeX: :width 15cm | |
137 [[./images/empathy-1.png]] | |
138 | |
139 # =CORTEX= provides a language for describing the sensorimotor | |
140 # experiences of various creatures. | |
141 | |
142 # Next, I developed an experiment to test the power of =CORTEX='s | |
143 # sensorimotor-centered language for solving recognition problems. As | |
144 # a proof of concept, I wrote routines which enabled a simple | |
145 # worm-like creature to infer the actions of a second worm-like | |
146 # creature, using only its own previous sensorimotor experiences and | |
147 # knowledge of the second worm's joints (figure | |
148 # \ref{worm-recognition-intro-2}). The result of this proof of | |
149 # concept was the program =EMPATH=, described in section | |
150 # \ref{sec-3}. The key results of this | |
151 | |
152 # Using only first-person sensorimotor experiences and third-person | |
153 # proprioceptive data, | |
154 | |
155 *** Key results | |
156 - After one-shot supervised training, =EMPATH= was able to recognize a | |
157 wide variety of static poses and dynamic actions---ranging from | |
158 curling in a circle to wriggling with a particular frequency --- | |
159 with 95\% accuracy. | |
160 - These results were completely independent of viewing angle | |
161 because the underlying body-centered language is fundamentally | |
162 viewpoint-independent; once an action is learned, it can be recognized | |
163 equally well from any viewing angle. | |
164 - =EMPATH= is surprisingly short; the sensorimotor-centered | |
165 language provided by =CORTEX= resulted in extremely economical | |
166 recognition routines --- about 500 lines in all --- suggesting | |
167 that such representations are very powerful, and often | |
168 indispensable for the types of recognition tasks considered here. | |
169 - Although for expediency's sake, I relied on direct knowledge of | |
170 joint positions in this proof of concept, it would be | |
171 straightforward to extend =EMPATH= so that it (more | |
172 realistically) infers joint positions from its visual data. | |
173 | |
174 # because the underlying language is fundamentally orientation-independent | |
175 | |
176 # recognize the actions of a worm with 95\% accuracy. The | |
177 # recognition tasks | |
178 | |
179 | |
180 | |
181 | |
182 [Talk about these results and what you find promising about them] | |
183 | |
184 ** Roadmap | |
185 [I'm going to explain how =CORTEX= works, then break down how | |
186 =EMPATH= does its thing. Because the details reveal such-and-such | |
187 about the approach.] | |
188 | |
189 # The success of this simple proof-of-concept offers a tantalizing | |
190 | |
191 | |
192 # explore the idea | |
193 # The key contribution of this thesis is the idea that body-centered | |
194 # representations (which express | |
195 | |
196 | |
197 # the | |
198 # body-centered approach --- in which I try to determine what's | |
199 # happening in a scene by bringing it into registration with my own | |
200 # bodily experiences --- are indispensible for recognizing what | |
201 # creatures are doing in a scene. | |
202 | |
203 * COMMENT | |
204 # body-centered language | |
205 | |
206 In this thesis, I'll describe =EMPATH=, which solves a certain | |
207 class of recognition problems | |
208 | |
209 The key idea is to use self-centered (or first-person) language. | |
210 | 238 |
211 I have built a system that can express the types of recognition | 239 I have built a system that can express the types of recognition |
212 problems in a form amenable to computation. It is split into | 240 problems in a form amenable to computation. It is split into |
213 four parts: | 241 four parts: |
214 | 242 |
241 data, just as it would if it were actually experiencing the | 269 data, just as it would if it were actually experiencing the |
242 scene first-hand. If previous experience has been accurately | 270 scene first-hand. If previous experience has been accurately |
243 retrieved, and if it is analogous enough to the scene, then | 271 retrieved, and if it is analogous enough to the scene, then |
244 the creature will correctly identify the action in the scene. | 272 the creature will correctly identify the action in the scene. |
245 | 273 |
246 For example, I think humans are able to label the cat video as | |
247 ``drinking'' because they imagine /themselves/ as the cat, and | |
248 imagine putting their face up against a stream of water and | |
249 sticking out their tongue. In that imagined world, they can feel | |
250 the cool water hitting their tongue, and feel the water entering | |
251 their body, and are able to recognize that /feeling/ as drinking. | |
252 So, the label of the action is not really in the pixels of the | |
253 image, but is found clearly in a simulation inspired by those | |
254 pixels. An imaginative system, having been trained on drinking and | |
255 non-drinking examples and learning that the most important | |
256 component of drinking is the feeling of water sliding down one's | |
257 throat, would analyze a video of a cat drinking in the following | |
258 manner: | |
259 | |
260 1. Create a physical model of the video by putting a ``fuzzy'' | |
261 model of its own body in place of the cat. Possibly also create | |
262 a simulation of the stream of water. | |
263 | |
264 2. Play out this simulated scene and generate imagined sensory | |
265 experience. This will include relevant muscle contractions, a | |
266 close up view of the stream from the cat's perspective, and most | |
267 importantly, the imagined feeling of water entering the | |
268 mouth. The imagined sensory experience can come from a | |
269 simulation of the event, but can also be pattern-matched from | |
270 previous, similar embodied experience. | |
271 | |
272 3. The action is now easily identified as drinking by the sense of | |
273 taste alone. The other senses (such as the tongue moving in and | |
274 out) help to give plausibility to the simulated action. Note that | |
275 the sense of vision, while critical in creating the simulation, | |
276 is not critical for identifying the action from the simulation. | |
277 | |
278 For the chair examples, the process is even easier: | |
279 | |
280 1. Align a model of your body to the person in the image. | |
281 | |
282 2. Generate proprioceptive sensory data from this alignment. | |
283 | |
284 3. Use the imagined proprioceptive data as a key to look up related | |
285 sensory experience associated with that particular proprioceptive | |
286 feeling. | |
287 | |
288 4. Retrieve the feeling of your bottom resting on a surface, your | |
289 knees bent, and your leg muscles relaxed. | |
290 | |
291 5. This sensory information is consistent with the =sitting?= | |
292 sensory predicate, so you (and the entity in the image) must be | |
293 sitting. | |
294 | |
295 6. There must be a chair-like object since you are sitting. | |
296 | |
297 Empathy offers yet another alternative to the age-old AI | |
298 representation question: ``What is a chair?'' --- A chair is the | |
299 feeling of sitting. | |
300 | 274 |
301 My program, =EMPATH=, uses this empathic problem-solving technique | 275 My program, =EMPATH=, uses this empathic problem-solving technique |
302 to interpret the actions of a simple, worm-like creature. | 276 to interpret the actions of a simple, worm-like creature. |
303 | 277 |
304 #+caption: The worm performs many actions during free play such as | 278 #+caption: The worm performs many actions during free play such as |
311 #+caption: poses by inferring the complete sensory experience | 285 #+caption: poses by inferring the complete sensory experience |
312 #+caption: from proprioceptive data. | 286 #+caption: from proprioceptive data. |
313 #+name: worm-recognition-intro | 287 #+name: worm-recognition-intro |
314 #+ATTR_LaTeX: :width 15cm | 288 #+ATTR_LaTeX: :width 15cm |
315 [[./images/worm-poses.png]] | 289 [[./images/worm-poses.png]] |
316 | 290 |
317 One powerful advantage of empathic problem solving is that it | 291 #+caption: From only \emph{proprioceptive} data, =EMPATH= was able to infer |
318 factors the action recognition problem into two easier problems. To | 292 #+caption: the complete sensory experience and classify these four poses. |
319 use empathy, you need an /aligner/, which takes the video and a | 293 #+caption: The last image is a composite, depicting the intermediate stages |
320 model of your body, and aligns the model with the video. Then, you | 294 #+caption: of \emph{wriggling}. |
321 need a /recognizer/, which uses the aligned model to interpret the | 295 #+name: worm-recognition-intro-2 |
322 action. The power in this method lies in the fact that you describe | 296 #+ATTR_LaTeX: :width 15cm |
323 all actions from a body-centered viewpoint. You are less tied to | 297 [[./images/empathy-1.png]] |
324 the particulars of any visual representation of the actions. If you | 298 |
325 teach the system what ``running'' is, and you have a good enough | 299 Next, I developed an experiment to test the power of =CORTEX='s |
326 aligner, the system will from then on be able to recognize running | 300 sensorimotor-centered language for solving recognition problems. As |
327 from any point of view, even strange points of view like above or | 301 a proof of concept, I wrote routines which enabled a simple |
328 underneath the runner. This is in contrast to action recognition | 302 worm-like creature to infer the actions of a second worm-like |
329 schemes that try to identify actions using a non-embodied approach. | 303 creature, using only its own previous sensorimotor experiences and |
330 If these systems learn about running as viewed from the side, they | 304 knowledge of the second worm's joints (figure |
331 will not automatically be able to recognize running from any other | 305 \ref{worm-recognition-intro-2}). The result of this proof of |
332 viewpoint. | 306 concept was the program =EMPATH=, described in section \ref{sec-3}. |
333 | 307 |
334 Another powerful advantage is that using the language of multiple | 308 ** =EMPATH= is built on =CORTEX=, an environment for making creatures. |
335 body-centered rich senses to describe body-centered actions offers a | 309 |
336 massive boost in descriptive capability. Consider how difficult it | 310 # =CORTEX= provides a language for describing the sensorimotor |
337 would be to compose a set of HOG filters to describe the action of | 311 # experiences of various creatures. |
338 a simple worm-creature ``curling'' so that its head touches its | |
339 tail, and then behold the simplicity of describing this action in a | |
340 language designed for the task (listing \ref{grand-circle-intro}): | |
341 | |
342 #+caption: Body-centered actions are best expressed in a body-centered | |
343 #+caption: language. This code detects when the worm has curled into a | |
344 #+caption: full circle. Imagine how you would replicate this functionality | |
345 #+caption: using low-level pixel features such as HOG filters! | |
346 #+name: grand-circle-intro | |
347 #+begin_listing clojure | |
348 #+begin_src clojure | |
349 (defn grand-circle? | |
350 "Does the worm form a majestic circle (one end touching the other)?" | |
351 [experiences] | |
352 (and (curled? experiences) | |
353 (let [worm-touch (:touch (peek experiences)) | |
354 tail-touch (worm-touch 0) | |
355 head-touch (worm-touch 4)] | |
356 (and (< 0.2 (contact worm-segment-bottom-tip tail-touch)) | |
357 (< 0.2 (contact worm-segment-top-tip head-touch)))))) | |
358 #+end_src | |
359 #+end_listing | |
360 | |
361 ** =CORTEX= is a toolkit for building sensate creatures | |
362 | 312 |
363 I built =CORTEX= to be a general AI research platform for doing | 313 I built =CORTEX= to be a general AI research platform for doing |
364 experiments involving multiple rich senses and a wide variety and | 314 experiments involving multiple rich senses and a wide variety and |
365 number of creatures. I intend it to be useful as a library for many | 315 number of creatures. I intend it to be useful as a library for many |
366 more projects than just this thesis. =CORTEX= was necessary to meet | 316 more projects than just this thesis. =CORTEX= was necessary to meet |
410 that I know of that can support multiple entities that can each | 360 that I know of that can support multiple entities that can each |
411 hear the world from their own perspective. Other senses also | 361 hear the world from their own perspective. Other senses also |
412 require a small layer of Java code. =CORTEX= also uses =bullet=, a | 362 require a small layer of Java code. =CORTEX= also uses =bullet=, a |
413 physics simulator written in =C++=. | 363 physics simulator written in =C++=. |
414 | 364 |
415 #+caption: Here is the worm from above modeled in Blender, a free | 365 #+caption: Here is the worm from figure \ref{worm-intro} modeled |
416 #+caption: 3D-modeling program. Senses and joints are described | 366 #+caption: in Blender, a free 3D-modeling program. Senses and |
417 #+caption: using special nodes in Blender. | 367 #+caption: joints are described using special nodes in Blender. |
418 #+name: worm-recognition-intro | 368 #+name: worm-recognition-intro |
419 #+ATTR_LaTeX: :width 12cm | 369 #+ATTR_LaTeX: :width 12cm |
420 [[./images/blender-worm.png]] | 370 [[./images/blender-worm.png]] |
421 | 371 |
422 Here are some things I anticipate that =CORTEX= might be used for: | 372 Here are some things I anticipate that =CORTEX= might be used for: |
448 its own finger from the eye in its palm, and that it can feel its | 398 its own finger from the eye in its palm, and that it can feel its |
449 own thumb touching its palm.} | 399 own thumb touching its palm.} |
450 \end{sidewaysfigure} | 400 \end{sidewaysfigure} |
451 #+END_LaTeX | 401 #+END_LaTeX |
452 | 402 |
453 ** Road map | 403 ** Contributions |
454 | |
455 By the end of this thesis, you will have seen a novel approach to | |
456 interpreting video using embodiment and empathy. You will have also | |
457 seen one way to efficiently implement empathy for embodied | |
458 creatures. Finally, you will become familiar with =CORTEX=, a system | |
459 for designing and simulating creatures with rich senses, which you | |
460 may choose to use in your own research. | |
461 | |
462 This is the core vision of my thesis: That one of the important ways | |
463 in which we understand others is by imagining ourselves in their | |
464 position and empathically feeling experiences relative to our own | |
465 bodies. By understanding events in terms of our own previous | |
466 corporeal experience, we greatly constrain the possibilities of what | |
467 would otherwise be an unwieldy exponential search. This extra | |
468 constraint can be the difference between easily understanding what | |
469 is happening in a video and being completely lost in a sea of | |
470 incomprehensible color and movement. | |
471 | 404 |
472 - I built =CORTEX=, a comprehensive platform for embodied AI | 405 - I built =CORTEX=, a comprehensive platform for embodied AI |
473 experiments. =CORTEX= supports many features lacking in other | 406 experiments. =CORTEX= supports many features lacking in other |
474 systems, such as proper simulation of hearing. It is easy to create | 407 systems, such as proper simulation of hearing. It is easy to create |
475 new =CORTEX= creatures using Blender, a free 3D modeling program. | 408 new =CORTEX= creatures using Blender, a free 3D modeling program. |
476 | 409 |
477 - I built =EMPATH=, which uses =CORTEX= to identify the actions of | 410 - I built =EMPATH=, which uses =CORTEX= to identify the actions of |
478 a worm-like creature using a computational model of empathy. | 411 a worm-like creature using a computational model of empathy. |
479 | 412 |
413 - After one-shot supervised training, =EMPATH= was able to recognize a | |
414 wide variety of static poses and dynamic actions---ranging from | |
415 curling in a circle to wriggling with a particular frequency --- | |
416 with 95\% accuracy. | |
417 | |
418 - These results were completely independent of viewing angle | |
419 because the underlying body-centered language is fundamentally | |
420 viewpoint-independent; once an action is learned, it can be recognized | |
421 equally well from any viewing angle. | |
422 | |
423 - =EMPATH= is surprisingly short; the sensorimotor-centered | |
424 language provided by =CORTEX= resulted in extremely economical | |
425 recognition routines --- about 500 lines in all --- suggesting | |
426 that such representations are very powerful, and often | |
427 indispensable for the types of recognition tasks considered here. | |
428 | |
429 - Although for expediency's sake, I relied on direct knowledge of | |
430 joint positions in this proof of concept, it would be | |
431 straightforward to extend =EMPATH= so that it (more | |
432 realistically) infers joint positions from its visual data. | |
480 | 433 |
481 * Designing =CORTEX= | 434 * Designing =CORTEX= |
435 | |
482 In this section, I outline the design decisions that went into | 436 In this section, I outline the design decisions that went into |
483 making =CORTEX=, along with some details about its | 437 making =CORTEX=, along with some details about its implementation. |
484 implementation. (A practical guide to getting started with =CORTEX=, | 438 (A practical guide to getting started with =CORTEX=, which skips |
485 which skips over the history and implementation details presented | 439 over the history and implementation details presented here, is |
486 here, is provided in an appendix \ref{} at the end of this paper.) | 440 provided in an appendix at the end of this thesis.) |
487 | 441 |
488 Throughout this project, I intended for =CORTEX= to be flexible and | 442 Throughout this project, I intended for =CORTEX= to be flexible and |
489 extensible enough to be useful for other researchers who want to | 443 extensible enough to be useful for other researchers who want to |
490 test out ideas of their own. To this end, wherever I have had to make | 444 test out ideas of their own. To this end, wherever I have had to make |
491 architectural choices about =CORTEX=, I have chosen to give as much | 445 architectural choices about =CORTEX=, I have chosen to give as much |
552 time in the simulated world can be slowed down to accommodate the | 506 time in the simulated world can be slowed down to accommodate the |
553 limitations of the character's programming. In terms of cost, | 507 limitations of the character's programming. In terms of cost, |
554 doing everything in software is far cheaper than building custom | 508 doing everything in software is far cheaper than building custom |
555 real-time hardware. All you need is a laptop and some patience. | 509 real-time hardware. All you need is a laptop and some patience. |
556 | 510 |
557 ** Simulated time enables rapid prototyping and complex scenes | 511 ** Simulated time enables rapid prototyping \& simple programs |
558 | 512 |
559 I envision =CORTEX= being used to support rapid prototyping and | 513 I envision =CORTEX= being used to support rapid prototyping and |
560 iteration of ideas. Even if I could put together a well-constructed | 514 iteration of ideas. Even if I could put together a well-constructed |
561 kit for creating robots, it would still not be enough because of | 515 kit for creating robots, it would still not be enough because of |
562 the scourge of real-time processing. Anyone who wants to test their | 516 the scourge of real-time processing. Anyone who wants to test their |