# HG changeset patch
# User Robert McIntyre
# Date 1395027076 14400
# Node ID 7ee735a836dad40d52a92768938d8fc26c7c6de8
# Parent  6ba908c1a0a976f050076696391ce41f2bf6a2d9
incorporate thesis.

diff -r 6ba908c1a0a9 -r 7ee735a836da thesis/images/cat-drinking.jpg
Binary file thesis/images/cat-drinking.jpg has changed
diff -r 6ba908c1a0a9 -r 7ee735a836da thesis/images/finger-UV.png
Binary file thesis/images/finger-UV.png has changed
diff -r 6ba908c1a0a9 -r 7ee735a836da thesis/org/first-chapter.html
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/org/first-chapter.html	Sun Mar 16 23:31:16 2014 -0400
@@ -0,0 +1,455 @@
CORTEX
aurellem
Written by Robert McIntyre
Artificial Imagination

Imagine watching a video of someone skateboarding. When you watch the video, you can imagine yourself skateboarding, and your knowledge of the human body and its dynamics guides your interpretation of the scene. For example, even if the skateboarder is partially occluded, you can infer the positions of his arms and body from your own knowledge of how your body would be positioned if you were skateboarding. If the skateboarder suffers an accident, you wince in sympathy, imagining the pain your own body would experience if it were in the same situation. This empathy with other people guides our understanding of whatever they are doing because it is a powerful constraint on what is probable and possible. In order to make use of this powerful empathy constraint, I need a system that can generate and make sense of sensory data from the many different senses that humans possess. The two key properties of such a system are embodiment and imagination.

What is imagination?

One kind of imagination is sympathetic imagination: you imagine yourself in the position of something/someone you are observing. This type of imagination comes into play when you follow along visually when watching someone perform actions, or when you sympathetically grimace when someone hurts themselves. This type of imagination uses the constraints you have learned about your own body to highly constrain the possibilities in whatever you are seeing. It uses all your senses, including your senses of touch, proprioception, etc. Humans are flexible when it comes to "putting themselves in another's shoes," and can sympathetically understand not only other humans, but entities ranging from animals to cartoon characters to single dots on a screen!

Another kind of imagination is predictive imagination: you construct scenes in your mind that are not entirely related to whatever you are observing, but instead are predictions of the future or simply flights of fancy. You use this type of imagination to plan out multi-step actions, or play out dangerous situations in your mind so as to avoid messing them up in reality.
Of course, sympathetic and predictive imagination blend into each other and are not completely separate concepts. One dimension along which you can distinguish types of imagination is dependence on raw sense data. Sympathetic imagination is highly constrained by your senses, while predictive imagination can be more or less dependent on your senses depending on how far ahead you imagine. Daydreaming is an extreme form of predictive imagination that wanders through different possibilities without concern for whether they are related to whatever is happening in reality.
For this thesis, I will mostly focus on sympathetic imagination and the constraint it provides for understanding sensory data.
What problems can imagination solve?

Consider a video of a cat drinking some water.

[figure: ../images/cat-drinking.jpg]
A cat drinking some water. Identifying this action is beyond the state of the art for computers.
It is currently impossible for any computer program to reliably label such a video as "drinking". I think humans are able to label such a video as "drinking" because they imagine themselves as the cat, and imagine putting their face up against a stream of water and sticking out their tongue. In that imagined world, they can feel the cool water hitting their tongue, and feel the water entering their body, and are able to recognize that feeling as drinking. So, the label of the action is not really in the pixels of the image, but is found clearly in a simulation inspired by those pixels. An imaginative system, having been trained on drinking and non-drinking examples and learning that the most important component of drinking is the feeling of water sliding down one's throat, would analyze a video of a cat drinking in the following manner:

  • Create a physical model of the video by putting a "fuzzy" model of its own body in place of the cat. Also, create a simulation of the stream of water.
  • Play out this simulated scene and generate imagined sensory experience. This will include relevant muscle contractions, a close-up view of the stream from the cat's perspective, and most importantly, the imagined feeling of water entering the mouth.
  • The action is now easily identified as drinking by the sense of taste alone. The other senses (such as the tongue moving in and out) help to give plausibility to the simulated action. Note that the sense of vision, while critical in creating the simulation, is not critical for identifying the action from the simulation. (A Clojure sketch of this pipeline follows the list.)
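To make the shape of this pipeline concrete, here is a minimal Clojure sketch of the three stages. Every function here is a hypothetical stub standing in for real modeling, simulation, and classification; only the overall data flow is meant seriously.

#+begin_src clojure
;; Toy sketch of the three-stage pipeline above. Each stage is a
;; hypothetical stub, not CORTEX code; only the data flow matters.
(defn fit-body-model
  "Stage 1: replace the cat in the video with a 'fuzzy' self-model."
  [frames]
  {:pose :crouched :mouth-at :water-stream})

(defn simulate-senses
  "Stage 2: play out the scene and collect imagined sensations."
  [model]
  {:taste :water :tongue :moving})

(defn classify-feeling
  "Stage 3: name the action from the imagined feeling alone."
  [senses]
  (if (= :water (:taste senses)) :drinking :unknown))

(defn identify-action [frames]
  (-> frames fit-body-model simulate-senses classify-feeling))

;; (identify-action [:frame-1 :frame-2]) ;=> :drinking
#+end_src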
More generally, I expect imaginative systems to be particularly good at identifying embodied actions in videos.

Cortex

The previous example involves liquids, the sense of taste, and imagining oneself as a cat. For this thesis I constrain myself to simpler, more easily digitizable senses and situations.

My system, Cortex, performs imagination in two different simplified worlds: worm world and stick-figure world. In each of these worlds, entities capable of imagination recognize actions by simulating the experience from their own perspective, and then recognizing the action from a database of examples.

In order to serve as a framework for experiments in imagination, Cortex requires simulated bodies, worlds, and senses like vision, hearing, touch, proprioception, etc.

A Video Game Engine takes care of some of the groundwork

When it comes to simulation environments, the engines used to create the worlds in video games offer top-notch physics and graphics support. These engines also have limited support for creating cameras and rendering 3D sound, which can be repurposed for vision and hearing respectively. Physics collision detection can be expanded to create a sense of touch.

jMonkeyEngine3 is one such engine for creating video games in Java. It uses OpenGL to render to the screen and uses scene graphs to avoid drawing things that do not appear on the screen. It has an active community and several games in the pipeline. The engine was not built to serve any particular game but is instead meant to be used for any 3D game. I chose jMonkeyEngine3 because it had the most features out of all the open projects I looked at, and because I could then write my code in Clojure, an implementation of LISP that runs on the JVM.
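As an illustration of what driving the engine from Clojure looks like, here is a minimal sketch. SimpleApplication, simpleInitApp, and start are real jMonkeyEngine3 API; the callback arrangement is just an assumption made for the example.

#+begin_src clojure
;; Minimal sketch of a jMonkeyEngine3 application driven from
;; Clojure; `setup!` runs once the engine and scene graph are ready.
(import 'com.jme3.app.SimpleApplication)

(defn make-world [setup!]
  (proxy [SimpleApplication] []
    (simpleInitApp []
      (setup! this))))

;; (.start (make-world (fn [app] (println "world ready"))))
#+end_src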

CORTEX Extends jMonkeyEngine3 to implement rich senses

Using the game-making primitives provided by jMonkeyEngine3, I have constructed every major human sense except for smell and taste. Cortex also provides an interface for creating creatures in Blender, a 3D modeling environment, and then "rigging" the creatures with senses using 3D annotations in Blender. A creature can have any number of senses, and there can be any number of creatures in a simulation.

The senses available in Cortex are:

  • Vision
  • Hearing
  • Touch
  • Proprioception
  • Muscle Tension
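To give a feel for "any number of senses, any number of creatures," here is one way a rigged creature could be written down as plain data. The key names and file name are invented for illustration; this is not Cortex's actual format.

#+begin_src clojure
;; Illustrative only: a rigged creature as plain data. The keys and
;; the .blend file name are assumptions, not CORTEX's real schema.
(def worm
  {:model  "worm.blend"  ; Blender file carrying the 3D annotations
   :senses #{:touch :proprioception :muscle-tension :vision}})

(def simulation
  {:creatures [worm worm]})  ; any number of creatures per world
#+end_src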
A roadmap for Cortex experiments

Worm World

Worms in Cortex are segmented creatures which vary in length and number of segments, and have the senses of vision, proprioception, touch, and muscle tension.

[figure: ../images/finger-UV.png]
This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
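A sketch of how such a profile image could be turned into discrete touch sensors, assuming (as the caption says) that whiter pixels mean denser sensing. This uses plain Java imaging from Clojure and is not Cortex's actual touch implementation.

#+begin_src clojure
;; Sketch: convert a UV sensor-profile image into touch-sensor
;; coordinates -- every sufficiently white pixel becomes one sensor.
(import '(javax.imageio ImageIO)
        '(java.io File))

(defn white? [rgb]
  ;; the profile is grayscale, so sampling the red channel suffices
  (> (bit-and (bit-shift-right rgb 16) 0xFF) 200))

(defn sensor-coordinates
  "Return the [u v] pixel coordinates of every touch sensor."
  [path]
  (let [img (ImageIO/read (File. path))]
    (for [u (range (.getWidth img))
          v (range (.getHeight img))
          :when (white? (.getRGB img u v))]
      [u v])))

;; (count (sensor-coordinates "../images/finger-UV.png"))
#+end_src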

[video: The worm responds to touch.]

[video: Proprioception in a worm. The proprioceptive readout is in the upper left corner of the screen.]
A worm is trained in various actions such as sinusoidal movement, curling, flailing, and spinning by directly playing motor contractions while the worm "feels" the experience. These actions are recorded both as vectors of muscle tension, touch, and proprioceptive data, but also in higher-level forms such as frequencies of the various contractions and a symbolic name for the action.
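Concretely, one recorded training action might look like the map below. The key names are an assumption for illustration, but the shapes mirror the description above.

#+begin_src clojure
;; One recorded training action. Key names are illustrative
;; assumptions, not CORTEX's actual schema.
(def curl-example
  {:name        :curl                    ; symbolic label
   :muscle      [[0.0 0.7] [0.1 0.9]]    ; per-frame muscle tensions
   :touch       [[0 0 1] [0 1 1]]        ; per-frame touch activations
   :proprio     [[0.05 0.4] [0.1 0.8]]   ; per-frame joint angles
   :frequencies {:tail-segment 2.5}})    ; higher-level summary (Hz)
#+end_src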

Then, the worm watches a video of another worm performing one of the actions, and must judge which action was performed. Normally this would be an extremely difficult problem, but the worm is able to greatly diminish the search space through sympathetic imagination. First, it creates an imagined copy of its body which it observes from a third-person point of view. Then for each frame of the video, it maneuvers its simulated body to be in registration with the worm depicted in the video. The physical constraints imposed by the physics simulation greatly decrease the number of poses that have to be tried, making the search feasible. As the imaginary worm moves, it generates imaginary muscle tension and proprioceptive sensations. The worm determines the action not by vision, but by matching the imagined proprioceptive data with previous examples.
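The matching step itself can be as simple as nearest-neighbor search over the stored examples; the roadmap at the end of this changeset suggests dot products for exactly this. A self-contained cosine-similarity sketch over flattened proprioceptive vectors (toy data, not real recordings):

#+begin_src clojure
;; Sketch: label an imagined proprioceptive trace by cosine
;; similarity against stored, labeled examples. Toy data only.
(defn dot [a b] (reduce + (map * a b)))
(defn norm [a] (Math/sqrt (dot a a)))

(defn cosine [a b]
  (let [d (* (norm a) (norm b))]
    (if (zero? d) 0.0 (/ (dot a b) d))))

(def examples
  [{:name :curl   :proprio [0.1 0.9 0.8 0.2]}
   {:name :wiggle :proprio [0.5 -0.5 0.5 -0.5]}])

(defn classify-proprioception [observed]
  (:name (apply max-key #(cosine observed (:proprio %)) examples)))

;; (classify-proprioception [0.2 0.8 0.9 0.1]) ;=> :curl
#+end_src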

By using non-visual sensory data such as touch, the worms can also answer body-related questions such as "did your head touch your tail?" and "did worm A touch worm B?"
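Such questions reduce to simple queries over recorded touch data. A sketch, assuming each frame maps a body-segment id to the set of segment ids it currently touches (an invented representation):

#+begin_src clojure
;; Sketch: answer "did your head touch your tail?" from touch data.
;; The frame shape (segment id -> set of touched ids) is assumed.
(defn head-touched-tail? [touch-frames]
  (boolean (some #(contains? (get % :head #{}) :tail)
                 touch-frames)))

;; (head-touched-tail? [{:head #{}} {:head #{:tail}}]) ;=> true
#+end_src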

The proprioceptive information used for action identification is body-centric, so only the registration step is dependent on point of view, not the identification step. Registration is not specific to any particular action. Thus, action identification can be divided into a point-of-view-dependent, generic registration step, and an action-specific step that is body-centered and invariant to point of view.

Stick Figure World

This environment is similar to Worm World, except the creatures are more complicated and the actions and questions more varied. It is an experiment to see how far imagination can go in interpreting actions.

diff -r 6ba908c1a0a9 -r 7ee735a836da thesis/org/first-chapter.org
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/org/first-chapter.org	Sun Mar 16 23:31:16 2014 -0400
@@ -0,0 +1,238 @@
#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil, num:nil

* Artificial Imagination

  Imagine watching a video of someone skateboarding. When you watch the video, you can imagine yourself skateboarding, and your knowledge of the human body and its dynamics guides your interpretation of the scene. For example, even if the skateboarder is partially occluded, you can infer the positions of his arms and body from your own knowledge of how your body would be positioned if you were skateboarding. If the skateboarder suffers an accident, you wince in sympathy, imagining the pain your own body would experience if it were in the same situation. This empathy with other people guides our understanding of whatever they are doing because it is a powerful constraint on what is probable and possible. In order to make use of this powerful empathy constraint, I need a system that can generate and make sense of sensory data from the many different senses that humans possess. The two key properties of such a system are /embodiment/ and /imagination/.

** What is imagination?

   One kind of imagination is /sympathetic/ imagination: you imagine yourself in the position of something/someone you are observing. This type of imagination comes into play when you follow along visually when watching someone perform actions, or when you sympathetically grimace when someone hurts themselves. This type of imagination uses the constraints you have learned about your own body to highly constrain the possibilities in whatever you are seeing. It uses all your senses, including your senses of touch, proprioception, etc. Humans are flexible when it comes to "putting themselves in another's shoes," and can sympathetically understand not only other humans, but entities ranging from animals to cartoon characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

   Another kind of imagination is /predictive/ imagination: you construct scenes in your mind that are not entirely related to whatever you are observing, but instead are predictions of the future or simply flights of fancy. You use this type of imagination to plan out multi-step actions, or play out dangerous situations in your mind so as to avoid messing them up in reality.

   Of course, sympathetic and predictive imagination blend into each other and are not completely separate concepts. One dimension along which you can distinguish types of imagination is dependence on raw sense data. Sympathetic imagination is highly constrained by your senses, while predictive imagination can be more or less dependent on your senses depending on how far ahead you imagine. Daydreaming is an extreme form of predictive imagination that wanders through different possibilities without concern for whether they are related to whatever is happening in reality.

   For this thesis, I will mostly focus on sympathetic imagination and the constraint it provides for understanding sensory data.
** What problems can imagination solve?

   Consider a video of a cat drinking some water.

   #+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
   #+ATTR_LaTeX: width=5cm
   [[../images/cat-drinking.jpg]]

   It is currently impossible for any computer program to reliably label such a video as "drinking". I think humans are able to label such a video as "drinking" because they imagine /themselves/ as the cat, and imagine putting their face up against a stream of water and sticking out their tongue. In that imagined world, they can feel the cool water hitting their tongue, and feel the water entering their body, and are able to recognize that /feeling/ as drinking. So, the label of the action is not really in the pixels of the image, but is found clearly in a simulation inspired by those pixels. An imaginative system, having been trained on drinking and non-drinking examples and learning that the most important component of drinking is the feeling of water sliding down one's throat, would analyze a video of a cat drinking in the following manner:

   - Create a physical model of the video by putting a "fuzzy" model of its own body in place of the cat. Also, create a simulation of the stream of water.

   - Play out this simulated scene and generate imagined sensory experience. This will include relevant muscle contractions, a close-up view of the stream from the cat's perspective, and most importantly, the imagined feeling of water entering the mouth.

   - The action is now easily identified as drinking by the sense of taste alone. The other senses (such as the tongue moving in and out) help to give plausibility to the simulated action. Note that the sense of vision, while critical in creating the simulation, is not critical for identifying the action from the simulation.

   More generally, I expect imaginative systems to be particularly good at identifying embodied actions in videos.

* Cortex

  The previous example involves liquids, the sense of taste, and imagining oneself as a cat. For this thesis I constrain myself to simpler, more easily digitizable senses and situations.

  My system, =CORTEX=, performs imagination in two different simplified worlds: /worm world/ and /stick-figure world/. In each of these worlds, entities capable of imagination recognize actions by simulating the experience from their own perspective, and then recognizing the action from a database of examples.

  In order to serve as a framework for experiments in imagination, =CORTEX= requires simulated bodies, worlds, and senses like vision, hearing, touch, proprioception, etc.

** A Video Game Engine takes care of some of the groundwork

   When it comes to simulation environments, the engines used to create the worlds in video games offer top-notch physics and graphics support. These engines also have limited support for creating cameras and rendering 3D sound, which can be repurposed for vision and hearing respectively. Physics collision detection can be expanded to create a sense of touch.

   jMonkeyEngine3 is one such engine for creating video games in Java. It uses OpenGL to render to the screen and uses scene graphs to avoid drawing things that do not appear on the screen. It has an active community and several games in the pipeline. The engine was not built to serve any particular game but is instead meant to be used for any 3D game.
   I chose jMonkeyEngine3 because it had the most features out of all the open projects I looked at, and because I could then write my code in Clojure, an implementation of LISP that runs on the JVM.

** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

   Using the game-making primitives provided by jMonkeyEngine3, I have constructed every major human sense except for smell and taste. =CORTEX= also provides an interface for creating creatures in Blender, a 3D modeling environment, and then "rigging" the creatures with senses using 3D annotations in Blender. A creature can have any number of senses, and there can be any number of creatures in a simulation.

   The senses available in =CORTEX= are:

   - [[../../cortex/html/vision.html][Vision]]
   - [[../../cortex/html/hearing.html][Hearing]]
   - [[../../cortex/html/touch.html][Touch]]
   - [[../../cortex/html/proprioception.html][Proprioception]]
   - [[../../cortex/html/movement.html][Muscle Tension]]

* A roadmap for =CORTEX= experiments

** Worm World

   Worms in =CORTEX= are segmented creatures which vary in length and number of segments, and have the senses of vision, proprioception, touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]

   [video: The worm responds to touch.]

   [video: Proprioception in a worm. The proprioceptive readout is in the upper left corner of the screen.]
   A worm is trained in various actions such as sinusoidal movement, curling, flailing, and spinning by directly playing motor contractions while the worm "feels" the experience. These actions are recorded both as vectors of muscle tension, touch, and proprioceptive data, but also in higher-level forms such as frequencies of the various contractions and a symbolic name for the action.

   Then, the worm watches a video of another worm performing one of the actions, and must judge which action was performed. Normally this would be an extremely difficult problem, but the worm is able to greatly diminish the search space through sympathetic imagination. First, it creates an imagined copy of its body which it observes from a third-person point of view. Then for each frame of the video, it maneuvers its simulated body to be in registration with the worm depicted in the video. The physical constraints imposed by the physics simulation greatly decrease the number of poses that have to be tried, making the search feasible. As the imaginary worm moves, it generates imaginary muscle tension and proprioceptive sensations. The worm determines the action not by vision, but by matching the imagined proprioceptive data with previous examples.

   By using non-visual sensory data such as touch, the worms can also answer body-related questions such as "did your head touch your tail?" and "did worm A touch worm B?"

   The proprioceptive information used for action identification is body-centric, so only the registration step is dependent on point of view, not the identification step. Registration is not specific to any particular action. Thus, action identification can be divided into a point-of-view-dependent, generic registration step, and an action-specific step that is body-centered and invariant to point of view.

** Stick Figure World

   This environment is similar to Worm World, except the creatures are more complicated and the actions and questions more varied. It is an experiment to see how far imagination can go in interpreting actions.

diff -r 6ba908c1a0a9 -r 7ee735a836da thesis/org/roadmap.org
--- /dev/null	Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/org/roadmap.org	Sun Mar 16 23:31:16 2014 -0400
@@ -0,0 +1,189 @@
In order for this to be a reasonable thesis that I can be proud of, what are the /minimum/ number of things I need to get done?

* worm OR hand registration
  - training from a few examples (2 to start out)
  - aligning the body with the scene
  - generating sensory data
  - matching previous labeled examples using dot-products or some other basic thing
  - showing that it works with different views

* first draft
  - draft of thesis without bibliography or formatting
  - should have basic experiment and have full description of framework with code
  - review with Winston

* final draft
  - implement stretch goals from Winston if possible
  - complete final formatting and submit

* CORTEX
  DEADLINE: <2014-05-09 Fri>
  SHIT THAT'S IN 67 DAYS!!!
** TODO program simple feature matching code for the worm's segments
   DEADLINE: <2014-03-11 Tue>
Subgoals:
*** DONE Get cortex working again, run tests, no jmonkeyengine updates
    CLOSED: [2014-03-03 Mon 22:07] SCHEDULED: <2014-03-03 Mon>
*** DONE get blender working again
    CLOSED: [2014-03-03 Mon 22:43] SCHEDULED: <2014-03-03 Mon>
*** DONE make sparse touch worm segment in blender
    CLOSED: [2014-03-03 Mon 23:16] SCHEDULED: <2014-03-03 Mon>
    CLOCK: [2014-03-03 Mon 22:44]--[2014-03-03 Mon 23:16] => 0:32
*** DONE make multi-segment touch worm with touch sensors and display
    CLOSED: [2014-03-03 Mon 23:54] SCHEDULED: <2014-03-03 Mon>
    CLOCK: [2014-03-03 Mon 23:17]--[2014-03-03 Mon 23:54] => 0:37
*** DONE Make a worm wiggle and curl
    CLOSED: [2014-03-04 Tue 23:03] SCHEDULED: <2014-03-04 Tue>
*** TODO work on alignment for the worm (can "cheat")
    SCHEDULED: <2014-03-05 Wed>

** First draft
   DEADLINE: <2014-03-14 Fri>
Subgoals:
*** Writeup new worm experiments.
*** Triage implementation code and get it into chapter form.

** for today

- guided worm :: control the worm with the keyboard. Useful for testing the body-centered recognition scripts, and for preparing a cool demo video.

- body-centered recognition :: detect actions using hard-coded body-centered scripts.

- cool demo video of the worm being moved and recognizing things :: will be a neat part of the thesis.

- thesis export :: refactoring and organization of code so that it spits out a thesis in addition to the web page.

- video alignment :: analyze the frames of a video in order to align the worm. Requires body-centered recognition. Can "cheat".

- smoother actions :: use debugging controls to directly influence the demo actions, and to generate recognition procedures.

- degenerate video demonstration :: show the system recognizing a curled worm from dead on. Crowning achievement of thesis.

** Ordered from easiest to hardest

Just report the positions of everything. I don't think that this necessarily shows anything useful.

Worm-segment vision -- you initialize a view of the worm, but instead of pixels you use labels via ray tracing. Has the advantage of still allowing for visual occlusion, but reliably identifies the objects, even without rainbow coloring. You can code this as an image.

Same as above, except just with worm/non-worm labels.

Color code each worm segment and then recognize them using blob detectors. Then you solve for the perspective and the action simultaneously.

The entire worm can be colored the same, high-contrast color against a nearly black background.

"Rooted" vision. You give the exact coordinates of ONE piece of the worm, but the algorithm figures out the rest.

More rooted vision -- start off the entire worm with one position.

The right way to do alignment is to use motion over multiple frames to snap individual pieces of the model into place, sharing and propagating the individual alignments over the whole model. We also want to limit the alignment search to just those actions we are prepared to identify. This might mean that I need some small "micro actions" such as the individual movements of the worm pieces.

Get just the centers of each segment projected onto the imaging plane. (best so far).

Repertoire of actions + video frames --> directed multi-frame-search alg

!! Could also have a bounding box around the worm provided by filtering the worm/non-worm render, and use bbbgs.
As a bonus, I get to include bbbgs in my thesis! Could finally do that recursive thing where I make bounding boxes be those things that give results that give good bounding boxes. If I did this I could use a disruptive pattern on the worm.

Re-imagining using default textures is very simple for this system, but hard for others.

Want to demonstrate, at minimum, alignment of some model of the worm to the video, and a lookup of the action by simulated perception.

note: the purple/white points make a very beautiful texture, because when the worm moves slightly, the white dots look like they're twinkling. Would look even better if it was a darker purple. Also would look better more spread out.

embed assumption of one frame of view, search by moving around in the simulated world.

Allowed to limit search by setting limits to a hemisphere around the imagined worm! This limits scale also.

!! Limited search with worm/non-worm rendering.
How much inverse kinematics do we have to do?
What about cached (allowed state-space) paths, derived from labeled training. You have to lead from one to another.

What about initial state? Could start the input videos at a specific state, then just match that explicitly.

!! The training doesn't have to be labeled -- you can just move around for a while!!

!! Limited search with motion-based alignment.

"play arounds" can establish a chain of linked sensoriums. Future matches must fall into one of the already experienced things, and once they do, it greatly limits the things that are possible in the future.

frame differences help to detect muscle exertion.

Can try to match on a few "representative" frames. Can also just have a few "bodies" in various states which we try to match.

Paths through state-space have the exact same signature as simulation. BUT, these can be searched in parallel and don't interfere with each other.
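The frame-difference idea above is easy to make concrete: motion energy as the summed absolute difference between consecutive frames, with each frame a flat vector of gray values. A sketch:

#+begin_src clojure
;; Sketch of "frame differences help to detect muscle exertion":
;; per-transition motion energy over grayscale frame vectors.
(defn frame-diff [a b]
  (reduce + (map (fn [x y] (Math/abs (double (- x y)))) a b)))

(defn exertion-profile [frames]
  (map frame-diff frames (rest frames)))

;; (exertion-profile [[0 0 0] [0 10 0] [0 10 5]]) ;=> (10.0 5.0)
#+end_src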