#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment
#+SETUPFILE: ../../aurellem/org/setup.org
#+INCLUDE: ../../aurellem/org/level-0.org
#+babel: :mkdirp yes :noweb yes :exports both
#+OPTIONS: toc:nil, num:nil

* Artificial Imagination

Imagine watching a video of someone skateboarding. When you watch
the video, you can imagine yourself skateboarding, and your
knowledge of the human body and its dynamics guides your
interpretation of the scene. For example, even if the skateboarder
is partially occluded, you can infer the positions of his arms and
body from your own knowledge of how your body would be positioned if
you were skateboarding. If the skateboarder suffers an accident, you
wince in sympathy, imagining the pain your own body would experience
if it were in the same situation. This empathy with other people
guides our understanding of whatever they are doing because it is a
powerful constraint on what is probable and possible. In order to
make use of this powerful empathy constraint, I need a system that
can generate and make sense of sensory data from the many different
senses that humans possess. The two key properties of such a system
are /embodiment/ and /imagination/.

** What is imagination?

One kind of imagination is /sympathetic/ imagination: you imagine
yourself in the position of something or someone you are
observing. This type of imagination comes into play when you follow
along visually while watching someone perform actions, or when you
sympathetically grimace when someone hurts themselves. This type of
imagination uses the constraints you have learned about your own
body to sharply limit the possibilities in whatever you are
seeing. It uses all of your senses, including your senses of touch,
proprioception, etc. Humans are flexible when it comes to "putting
themselves in another's shoes," and can sympathetically understand
not only other humans, but entities ranging from animals to cartoon
characters to [[http://www.youtube.com/watch?v=0jz4HcwTQmU][single dots]] on a screen!

Another kind of imagination is /predictive/ imagination: you
construct scenes in your mind that are not entirely related to
whatever you are observing, but instead are predictions of the
future or simply flights of fancy. You use this type of imagination
to plan out multi-step actions, or to play out dangerous situations
in your mind so as to avoid messing them up in reality.

Of course, sympathetic and predictive imagination blend into each
other and are not completely separate concepts. One dimension along
which you can distinguish types of imagination is dependence on raw
sense data. Sympathetic imagination is highly constrained by your
senses, while predictive imagination can be more or less dependent
on your senses depending on how far ahead you imagine.
Daydreaming is an extreme form of predictive imagination that
wanders through different possibilities without concern for whether
they are related to whatever is happening in reality.

For this thesis, I will mostly focus on sympathetic imagination and
the constraint it provides for understanding sensory data.

** What problems can imagination solve?

Consider a video of a cat drinking some water.

#+caption: A cat drinking some water. Identifying this action is beyond the state of the art for computers.
#+ATTR_LaTeX: width=5cm
[[../images/cat-drinking.jpg]]

It is currently beyond the state of the art for any computer
program to reliably label such a video as "drinking". I think
humans are able to label such a video as "drinking" because they
imagine /themselves/ as the cat, and imagine putting their face up
against a stream of water and sticking out their tongue. In that
imagined world, they can feel the cool water hitting their tongue,
and feel the water entering their body, and are able to recognize
that /feeling/ as drinking. So, the label of the action is not
really in the pixels of the image, but is found clearly in a
simulation inspired by those pixels. An imaginative system, having
been trained on drinking and non-drinking examples and having
learned that the most important component of drinking is the
feeling of water sliding down one's throat, would analyze a video
of a cat drinking in the following manner:

- Create a physical model of the video by putting a "fuzzy" model
  of its own body in place of the cat. Also, create a simulation of
  the stream of water.

- Play out this simulated scene and generate imagined sensory
  experience. This will include relevant muscle contractions, a
  close-up view of the stream from the cat's perspective, and most
  importantly, the imagined feeling of water entering the mouth.

- The action is now easily identified as drinking by the sense of
  taste alone. The other senses (such as the tongue moving in and
  out) help to give plausibility to the simulated action. Note that
  the sense of vision, while critical in creating the simulation,
  is not critical for identifying the action from the simulation.

More generally, I expect imaginative systems to be particularly
good at identifying embodied actions in videos.

* Cortex

The previous example involves liquids, the sense of taste, and
imagining oneself as a cat. For this thesis I constrain myself to
simpler, more easily digitizable senses and situations.

My system, =CORTEX=, performs imagination in two different
simplified worlds: /worm world/ and /stick-figure world/. In each
of these worlds, entities capable of imagination recognize actions
by simulating the experience from their own perspective, and then
recognizing the action from a database of examples.

In order to serve as a framework for experiments in imagination,
=CORTEX= requires simulated bodies, worlds, and senses like vision,
hearing, touch, proprioception, etc.
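To make this simulate-and-match strategy concrete, the sketch below
shows the shape of the computation in Clojure. It is only an
illustration: =simulate-scene=, the sense keywords, and the example
database are hypothetical placeholders standing in for the real
simulation and sensory machinery, not part of =CORTEX='s actual
interface.

#+begin_src clojure
;; Hypothetical sketch of the simulate-and-match loop:
;; video -> simulated body -> imagined senses -> label from examples.

(defn simulate-scene
  "Stand-in for the imagination step: place a fuzzy model of the
  observer's own body into the scene and play it forward, returning
  a map of imagined sensory channels. A real implementation would
  run the physics simulation; this stub returns fixed sensations."
  [video]
  {:taste :water, :touch :stream-on-tongue, :muscles :lapping})

(def example-database
  "Labeled examples pairing key imagined sensations with actions."
  [{:label :drinking     :senses {:taste :water}}
   {:label :not-drinking :senses {:taste nil}}])

(defn identify-action
  "Label a video by simulating it and finding the first stored
  example whose key sensations match the imagined experience."
  [video]
  (let [imagined (simulate-scene video)]
    (:label (first (filter #(= (:senses %)
                               (select-keys imagined (keys (:senses %))))
                           example-database)))))

;; (identify-action "cat-drinking.avi") ;=> :drinking
#+end_src

The important point is where the label comes from: the video only
seeds the simulation, and the action is read off of the imagined
senses rather than the pixels.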
** A Video Game Engine takes care of some of the groundwork

When it comes to simulation environments, the engines used to
create the worlds in video games offer top-notch physics and
graphics support. These engines also have limited support for
creating cameras and rendering 3D sound, which can be repurposed
for vision and hearing respectively. Physics collision detection
can be expanded to create a sense of touch.

jMonkeyEngine3 is one such engine for creating video games in
Java. It uses OpenGL to render to the screen and uses scene graphs
to avoid drawing things that do not appear on the screen. It has an
active community and several games in the pipeline. The engine was
not built to serve any particular game but is instead meant to be
used for any 3D game. I chose jMonkeyEngine3 because it had the
most features out of all the open projects I looked at, and because
I could then write my code in Clojure, a dialect of Lisp that runs
on the JVM.

** =CORTEX= Extends jMonkeyEngine3 to implement rich senses

Using the game-making primitives provided by jMonkeyEngine3, I have
constructed every major human sense except for smell and
taste. =CORTEX= also provides an interface for creating creatures
in Blender, a 3D modeling environment, and then "rigging" the
creatures with senses using 3D annotations in Blender. A creature
can have any number of senses, and there can be any number of
creatures in a simulation.

The senses available in =CORTEX= are:

- [[../../cortex/html/vision.html][Vision]]
- [[../../cortex/html/hearing.html][Hearing]]
- [[../../cortex/html/touch.html][Touch]]
- [[../../cortex/html/proprioception.html][Proprioception]]
- [[../../cortex/html/movement.html][Muscle Tension]]

* A roadmap for =CORTEX= experiments

** Worm World

Worms in =CORTEX= are segmented creatures which vary in length and
number of segments, and have the senses of vision, proprioception,
touch, and muscle tension.

#+attr_html: width=755
#+caption: This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
[[../images/finger-UV.png]]
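To make the sensor-profile idea concrete, here is a minimal Clojure
sketch of one way such an image could be converted into touch-sensor
locations. This is not the actual =CORTEX= touch implementation; the
brightness threshold and the normalized [u v] coordinates are
illustrative assumptions.

#+begin_src clojure
(import '(javax.imageio ImageIO)
        '(java.io File))

(defn sensor-coordinates
  "Treat sufficiently-white pixels in a UV image as touch-sensor
  locations, returning normalized [u v] pairs. Dense white regions
  therefore yield dense sensor coverage."
  [path]
  (let [img (ImageIO/read (File. path))
        w   (.getWidth img)
        h   (.getHeight img)]
    (for [x (range w), y (range h)
          :let [rgb (.getRGB img x y)
                r   (bit-and (bit-shift-right rgb 16) 0xff)
                g   (bit-and (bit-shift-right rgb 8) 0xff)
                b   (bit-and rgb 0xff)]
          :when (> (+ r g b) 600)]       ; "white enough" threshold
      [(/ x (double w)) (/ y (double h))])))

;; e.g. (count (sensor-coordinates "../images/finger-UV.png"))
;; gives the number of touch sensors on that worm segment.
#+end_src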
#+begin_html
<p>[YouTube video: The worm responds to touch.]</p>
#+end_html

#+begin_html
<p>[YouTube video: Proprioception in a worm. The proprioceptive
readout is in the upper left corner of the screen.]</p>
#+end_html
A worm is trained in various actions such as sinusoidal movement,
curling, flailing, and spinning by directly playing motor
contractions while the worm "feels" the experience. These actions
are recorded both as vectors of muscle tension, touch, and
proprioceptive data, and in higher-level forms such as the
frequencies of the various contractions and a symbolic name for the
action.

Then, the worm watches a video of another worm performing one of
the actions, and must judge which action was performed. Normally
this would be an extremely difficult problem, but the worm is able
to greatly diminish the search space through sympathetic
imagination. First, it creates an imagined copy of its body, which
it observes from a third-person point of view. Then, for each frame
of the video, it maneuvers its simulated body into registration
with the worm depicted in the video. The physical constraints
imposed by the physics simulation greatly decrease the number of
poses that have to be tried, making the search feasible. As the
imaginary worm moves, it generates imaginary muscle tension and
proprioceptive sensations. The worm determines the action not by
vision, but by matching the imagined proprioceptive data against
previous examples (see the sketch at the end of this roadmap).

By using non-visual sensory data such as touch, the worms can also
answer body-related questions such as "did your head touch your
tail?" and "did worm A touch worm B?"

The proprioceptive information used for action identification is
body-centric, so only the registration step depends on point of
view; the identification step does not. Registration is not
specific to any particular action. Thus, action identification can
be divided into a point-of-view-dependent but action-generic
registration step, and an action-specific identification step that
is body-centered and invariant to point of view.

** Stick Figure World

This environment is similar to Worm World, except the creatures are
more complicated and the actions and questions more varied. It is
an experiment to see how far imagination can go in interpreting
actions.
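To close the roadmap, here is a minimal Clojure sketch of the
body-centered identification step shared by both worlds, assuming
imagined proprioception has been reduced to sequences of joint-angle
vectors. The frame format, the distance measure, and the
nearest-example lookup are illustrative assumptions, not the actual
=CORTEX= code.

#+begin_src clojure
(defn frame-distance
  "Mean absolute difference between two joint-angle vectors."
  [a b]
  (/ (reduce + (map #(Math/abs (double (- %1 %2))) a b))
     (count a)))

(defn trace-distance
  "Average frame distance between two equal-length angle sequences."
  [xs ys]
  (/ (reduce + (map frame-distance xs ys))
     (count xs)))

(def action-examples
  "Stored proprioceptive traces, one per known action (toy data)."
  {:curling  [[0.1 0.2 0.3] [0.2 0.4 0.6] [0.3 0.6 0.9]]
   :flailing [[0.9 0.1 0.8] [0.1 0.9 0.2] [0.8 0.2 0.9]]})

(defn identify
  "Return the action whose stored trace is closest to the trace
  generated by the imagined worm."
  [imagined-trace]
  (key (apply min-key #(trace-distance imagined-trace (val %))
              action-examples)))

;; (identify [[0.1 0.2 0.3] [0.2 0.4 0.5] [0.3 0.5 0.8]]) ;=> :curling
#+end_src

Because the traces are expressed in the worm's own body coordinates,
this comparison is unaffected by the camera angle of the original
video; only the registration step has to deal with point of view.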