<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
          "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<title>CORTEX</title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name="title" content="CORTEX"/>
<meta name="generator" content="Org-mode"/>
<meta name="generated" content="2013-11-07 04:21:29 EST"/>
<meta name="author" content="Robert McIntyre"/>
<meta name="description" content="Using embodied AI to facilitate Artificial Imagination."/>
<meta name="keywords" content="AI, clojure, embodiment"/>
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
  html { font-family: Times, serif; font-size: 12pt; }
  .title  { text-align: center; }
  .todo   { color: red; }
  .done   { color: green; }
  .tag    { background-color: #add8e6; font-weight: normal; }
  .target { }
  .timestamp { color: #bebebe; }
  .timestamp-kwd { color: #5f9ea0; }
  .right  { margin-left: auto; margin-right: 0px;  text-align: right;  }
  .left   { margin-left: 0px;  margin-right: auto; text-align: left;   }
  .center { margin-left: auto; margin-right: auto; text-align: center; }
  p.verse { margin-left: 3%; }
  pre {
    border: 1pt solid #AEBDCC;
    background-color: #F3F5F7;
    padding: 5pt;
    font-family: courier, monospace;
    font-size: 90%;
    overflow: auto;
  }
  table { border-collapse: collapse; }
  td, th { vertical-align: top; }
  th.right  { text-align: center; }
  th.left   { text-align: center; }
  th.center { text-align: center; }
  td.right  { text-align: right;  }
  td.left   { text-align: left;   }
  td.center { text-align: center; }
  dt { font-weight: bold; }
  div.figure { padding: 0.5em; }
  div.figure p { text-align: center; }
  div.inlinetask {
    padding: 10px;
    border: 2px solid gray;
    margin: 10px;
    background: #ffffcc;
  }
  textarea { overflow-x: auto; }
  .linenr { font-size: smaller; }
  .code-highlighted { background-color: #ffff00; }
  .org-info-js_info-navigation { border-style: none; }
  #org-info-js_console-label { font-size: 10px; font-weight: bold;
                               white-space: nowrap; }
  .org-info-js_search-highlight { background-color: #ffff00; color: #000000;
                                  font-weight: bold; }
  /*]]>*/-->
</style>
<script type="text/javascript">var _gaq = _gaq || [];_gaq.push(['_setAccount', 'UA-31261312-1']);_gaq.push(['_trackPageview']);(function() {var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;ga.src = ('https:' == document.location.protocol ?
'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);})();</script><link rel="stylesheet" type="text/css" href="../../aurellem/css/argentum.css" />
<script type="text/javascript">
<!--/*--><![CDATA[/*><!--*/
 function CodeHighlightOn(elem, id)
 {
   var target = document.getElementById(id);
   if(null != target) {
     elem.cacheClassElem = elem.className;
     elem.cacheClassTarget = target.className;
     target.className = "code-highlighted";
     elem.className = "code-highlighted";
   }
 }
 function CodeHighlightOff(elem, id)
 {
   var target = document.getElementById(id);
   if(elem.cacheClassElem)
     elem.className = elem.cacheClassElem;
   if(elem.cacheClassTarget)
     target.className = elem.cacheClassTarget;
 }
/*]]>*///-->
</script>

</head>
<body>


<div id="content">
<h1 class="title"><code>CORTEX</code></h1>


<div class="header">
  <div class="float-right">
    <!--
    <form>
      <input type="text"/><input type="submit" value="search the blog »"/>
    </form>
    -->
  </div>

  <h1>aurellem <em>☉</em></h1>
  <ul class="nav">
    <li><a href="/">read the blog »</a></li>
    <!-- li><a href="#">learn about us »</a></li-->
  </ul>
</div>

<div class="author">Written by <author>Robert McIntyre</author></div>




<div id="outline-container-1" class="outline-2">
<h2 id="sec-1">Artificial Imagination</h2>
<div class="outline-text-2" id="text-1">

<p>
Imagine watching a video of someone skateboarding. When you watch
the video, you can imagine yourself skateboarding, and your
knowledge of the human body and its dynamics guides your
interpretation of the scene. For example, even if the skateboarder
is partially occluded, you can infer the positions of his arms and
body from your own knowledge of how your body would be positioned if
you were skateboarding. If the skateboarder suffers an accident, you
wince in sympathy, imagining the pain your own body would experience
if it were in the same situation. This empathy with other people
guides our understanding of whatever they are doing because it is a
powerful constraint on what is probable and possible. In order to
make use of this powerful empathy constraint, I need a system that
can generate and make sense of sensory data from the many different
senses that humans possess. The two key properties of such a system
are <i>embodiment</i> and <i>imagination</i>.
</p>

</div>

<div id="outline-container-1-1" class="outline-3">
<h3 id="sec-1-1">What is imagination?</h3>
<div class="outline-text-3" id="text-1-1">

<p>
One kind of imagination is <i>sympathetic</i> imagination: you imagine
yourself in the position of something or someone you are observing.
This type of imagination comes into play when you follow along
visually while watching someone perform actions, or when you
sympathetically grimace when someone hurts themselves. This type of
imagination uses the constraints you have learned about your own
body to highly constrain the possibilities in whatever you are
seeing. It uses all your senses, including your senses of touch,
proprioception, etc. Humans are flexible when it comes to "putting
themselves in another's shoes," and can sympathetically understand
not only other humans, but entities ranging from animals to cartoon
characters to <a href="http://www.youtube.com/watch?v=0jz4HcwTQmU">single dots</a> on a screen!
</p>
<p>
Another kind of imagination is <i>predictive</i> imagination: you
construct scenes in your mind that are not entirely related to
whatever you are observing, but instead are predictions of the
future or simply flights of fancy. You use this type of imagination
to plan out multi-step actions, or to play out dangerous situations in
your mind so as to avoid messing them up in reality.
</p>
<p>
Of course, sympathetic and predictive imagination blend into each
other and are not completely separate concepts. One dimension along
which you can distinguish types of imagination is dependence on raw
sense data. Sympathetic imagination is highly constrained by your
senses, while predictive imagination can be more or less dependent
on your senses depending on how far ahead you imagine. Daydreaming
is an extreme form of predictive imagination that wanders through
different possibilities without concern for whether they are
related to whatever is happening in reality.
</p>
<p>
For this thesis, I will mostly focus on sympathetic imagination and
the constraint it provides for understanding sensory data.
</p>
</div>

</div>

<div id="outline-container-1-2" class="outline-3">
<h3 id="sec-1-2">What problems can imagination solve?</h3>
<div class="outline-text-3" id="text-1-2">

<p>
Consider a video of a cat drinking some water.
</p>

<div class="figure">
<p><img src="../images/cat-drinking.jpg" alt="../images/cat-drinking.jpg" /></p>
<p>A cat drinking some water. Identifying this action is beyond the state of the art for computers.</p>
</div>

<p>
It is currently impossible for any computer program to reliably
label such a video as "drinking". I think humans are able to label
such a video as "drinking" because they imagine <i>themselves</i> as the
cat, and imagine putting their face up against a stream of water
and sticking out their tongue. In that imagined world, they can
feel the cool water hitting their tongue, and feel the water
entering their body, and are able to recognize that <i>feeling</i> as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and having learned that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:
</p>
<ul>
<li>Create a physical model of the video by putting a "fuzzy" model
of its own body in place of the cat. Also, create a simulation of
the stream of water.

</li>
<li>Play out this simulated scene and generate imagined sensory
experience. This will include relevant muscle contractions, a
close-up view of the stream from the cat's perspective, and most
importantly, the imagined feeling of water entering the mouth.

</li>
<li>The action is now easily identified as drinking by the sense of
taste alone. The other senses (such as the tongue moving in and
out) help to give plausibility to the simulated action. Note that
the sense of vision, while critical in creating the simulation,
is not critical for identifying the action from the simulation.
</li>
</ul>


<p>
More generally, I expect imaginative systems to be particularly
good at identifying embodied actions in videos.
</p>
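<p>
The three steps above can be read as a small program. The following
Clojure sketch shows their shape using hypothetical placeholder
functions; nothing here is <code>Cortex</code>'s actual API, and the
simulation step is stubbed out. The point is that the label is read
off the imagined sense of taste, not off the pixels.
</p>

<pre class="src src-clojure">;; A minimal sketch, assuming an imagined-sensation pipeline.  Every
;; name here is a hypothetical placeholder, not CORTEX's real API.

(defn simulate-self-as-cat
  "Steps 1 and 2 (stub): put a fuzzy model of one's own body where the
   cat is, add a stream of water, and play the scene forward.  Returns
   a sequence of imagined sensory frames, each a map of sense to value."
  [video-frames]
  ;; a real implementation would drive a physics engine here
  (for [_frame video-frames]
    {:taste   {:water 0.8}            ; imagined water on the tongue
     :touch   {:face :wet}
     :muscles {:tongue :extending}}))

(defn drinking?
  "Step 3: identify the action from the imagined senses alone, chiefly
   taste, rather than from the pixels of the video."
  [imagined-frames]
  (let [water-frames (filter #(pos? (get-in % [:taste :water] 0))
                             imagined-frames)]
    ;; call it drinking if imagined water reaches the mouth in most frames
    (> (count water-frames) (* 0.5 (count imagined-frames)))))

(defn label-video [video-frames]
  (if (drinking? (simulate-self-as-cat video-frames))
    :drinking
    :something-else))

;; (label-video (range 100)) ;=> :drinking
</pre>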
</div>
</div>

</div>

<div id="outline-container-2" class="outline-2">
<h2 id="sec-2">Cortex</h2>
<div class="outline-text-2" id="text-2">


<p>
The previous example involves liquids, the sense of taste, and
imagining oneself as a cat. For this thesis, I constrain myself to
simpler, more easily digitizable senses and situations.
</p>
<p>
My system, <code>Cortex</code>, performs imagination in two different simplified
worlds: <i>worm world</i> and <i>stick figure world</i>. In each of these
worlds, entities capable of imagination recognize actions by
simulating the experience from their own perspective, and then
recognizing the action by comparing it with a database of examples.
</p>
<p>
In order to serve as a framework for experiments in imagination,
<code>Cortex</code> requires simulated bodies, worlds, and senses like vision,
hearing, touch, proprioception, etc.
</p>

</div>

<div id="outline-container-2-1" class="outline-3">
<h3 id="sec-2-1">A Video Game Engine takes care of some of the groundwork</h3>
<div class="outline-text-3" id="text-2-1">


<p>
When it comes to simulation environments, the engines used to
create the worlds in video games offer top-notch physics and
graphics support. These engines also have limited support for
creating cameras and rendering 3D sound, which can be repurposed
for vision and hearing respectively. Physics collision detection
can be expanded to create a sense of touch.
</p>
<p>
jMonkeyEngine3 is one such engine for creating video games in
Java. It uses OpenGL to render to the screen and uses scene graphs
to avoid drawing things that do not appear on the screen. It has an
active community and several games in the pipeline. The engine was
not built to serve any particular game but is instead meant to be
used for any 3D game. I chose jMonkeyEngine3 because it had the
most features out of all the open projects I looked at, and because
I could then write my code in Clojure, a dialect of Lisp that runs
on the JVM.
</p>
</div>

</div>

<div id="outline-container-2-2" class="outline-3">
<h3 id="sec-2-2"><code>CORTEX</code> Extends jMonkeyEngine3 to implement rich senses</h3>
<div class="outline-text-3" id="text-2-2">


<p>
Using the game-making primitives provided by jMonkeyEngine3, I have
constructed every major human sense except for smell and
taste. <code>Cortex</code> also provides an interface for creating creatures
in Blender, a 3D modeling environment, and then "rigging" them
with senses using 3D annotations. A creature can have any number of
senses, and there can be any number of creatures in a simulation.
</p>
<p>
The senses available in <code>Cortex</code> are:
</p>
<ul>
<li><a href="../../cortex/html/vision.html">Vision</a>
</li>
<li><a href="../../cortex/html/hearing.html">Hearing</a>
</li>
<li><a href="../../cortex/html/touch.html">Touch</a>
</li>
<li><a href="../../cortex/html/proprioception.html">Proprioception</a>
</li>
<li><a href="../../cortex/html/movement.html">Muscle Tension</a>
</li>
</ul>
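<p>
As a rough illustration of how a rigged creature and its senses fit
together at simulation time, here is a self-contained Clojure
sketch. The function names (<code>load-blender-model</code>,
<code>rig-creature</code>, and so on) are hypothetical placeholders
rather than <code>Cortex</code>'s real interface; the point is only
the shape of the data flow: load a creature, build one reader
function per annotated sense, and poll the readers every simulation
step.
</p>

<pre class="src src-clojure">;; A minimal sketch, assuming hypothetical helper names; this is not
;; CORTEX's actual API.

(defn load-blender-model
  "Stub: load a creature whose senses were annotated in Blender."
  [path]
  {:model path
   :senses [:vision :hearing :touch :proprioception :muscles]})

(defn sense-reader
  "Stub: return a thunk that, when called once per physics tick,
   produces the current data for one sense (a rendered image, an audio
   buffer, touch activations, joint angles, or muscle tensions)."
  [creature sense]
  (fn [] {:creature (:model creature) :sense sense :data []}))

(defn rig-creature
  "Attach every annotated sense to the creature, returning a map of
   sense to reader function."
  [creature]
  (into {} (for [s (:senses creature)]
             [s (sense-reader creature s)])))

;; Usage: read all senses once per simulation step.
(let [worm    (load-blender-model "worm.blend")
      readers (rig-creature worm)]
  (doseq [[sense read-sense!] readers]
    (println sense (read-sense!))))
</pre>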
</div>
</div>

</div>

<div id="outline-container-3" class="outline-2">
<h2 id="sec-3">A roadmap for <code>Cortex</code> experiments</h2>
<div class="outline-text-2" id="text-3">



</div>

<div id="outline-container-3-1" class="outline-3">
<h3 id="sec-3-1">Worm World</h3>
<div class="outline-text-3" id="text-3-1">


<p>
Worms in <code>Cortex</code> are segmented creatures which vary in length and
number of segments, and have the senses of vision, proprioception,
touch, and muscle tension.
</p>

<div class="figure">
<p><img src="../images/finger-UV.png" width="755" alt="../images/finger-UV.png" /></p>
<p>This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).</p>
</div>


<div class="figure">
<center>
<video controls="controls" width="550">
<source src="../video/worm-touch.ogg" type="video/ogg"
preload="none" />
</video>
<br/> <a href="http://youtu.be/RHx2wqzNVcU"> YouTube </a>
</center>
<p>The worm responds to touch.</p>
</div>

<div class="figure">
<center>
<video controls="controls" width="550">
<source src="../video/test-proprioception.ogg" type="video/ogg"
preload="none" />
</video>
<br/> <a href="http://youtu.be/JjdDmyM8b0w"> YouTube </a>
</center>
<p>Proprioception in a worm. The proprioceptive readout is
in the upper left corner of the screen.</p>
</div>

<p>
A worm is trained in various actions such as sinusoidal movement,
curling, flailing, and spinning by directly playing motor
contractions while the worm "feels" the experience. These actions
are recorded not only as vectors of muscle tension, touch, and
proprioceptive data, but also in higher-level forms such as
frequencies of the various contractions and a symbolic name for the
action.
</p>
<p>
Then, the worm watches a video of another worm performing one of
the actions, and must judge which action was performed. Normally
this would be an extremely difficult problem, but the worm is able
to greatly diminish the search space through sympathetic
imagination. First, it creates an imagined copy of its body, which
it observes from a third-person point of view. Then, for each frame
of the video, it maneuvers its simulated body to be in registration
with the worm depicted in the video. The physical constraints
imposed by the physics simulation greatly decrease the number of
poses that have to be tried, making the search feasible. As the
imaginary worm moves, it generates imaginary muscle tension and
proprioceptive sensations. The worm determines the action not by
vision, but by matching the imagined proprioceptive data with
previous examples.
</p>
<p>
By using non-visual sensory data such as touch, the worms can also
answer body-related questions such as "did your head touch your
tail?" and "did worm A touch worm B?"
</p>
<p>
The proprioceptive information used for action identification is
body-centric, so only the registration step depends on point of
view, not the identification step. Registration is not specific
to any particular action. Thus, action identification can be
divided into a point-of-view-dependent generic registration step,
and an action-specific step that is body-centered and invariant to
point of view.
</p>
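<p>
The identification step described above is, at its core, a
nearest-neighbor match over body-centric proprioceptive data. The
following self-contained Clojure sketch illustrates that step with a
toy distance measure and a toy example database. It assumes
registration has already produced the imagined joint-angle sequence,
and it is not <code>Cortex</code>'s actual representation or matching
code.
</p>

<pre class="src src-clojure">;; A minimal sketch of matching imagined proprioception against stored
;; examples.  The data layout and distance measure are illustrative
;; assumptions, not CORTEX's real representation.

(defn frame-distance
  "Euclidean distance between two joint-angle vectors."
  [a b]
  (Math/sqrt (reduce + (map #(let [d (- %1 %2)] (* d d)) a b))))

(defn sequence-distance
  "Average per-frame distance between two equally sampled sequences."
  [xs ys]
  (/ (reduce + (map frame-distance xs ys))
     (max 1 (min (count xs) (count ys)))))

(defn identify-action
  "Return the symbolic label of the stored example whose recorded
   proprioception is closest to the imagined sequence."
  [examples imagined]
  (key (apply min-key #(sequence-distance (val %) imagined) examples)))

;; Toy database: symbolic action name to joint angles recorded per frame.
(def examples
  {:curling  [[0.0 0.1] [0.4 0.5] [0.9 1.0]]
   :flailing [[0.0 0.9] [0.9 0.0] [0.0 0.9]]})

;; (identify-action examples [[0.1 0.1] [0.5 0.5] [0.8 0.9]]) ;=> :curling
</pre>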
</div>

</div>

<div id="outline-container-3-2" class="outline-3">
<h3 id="sec-3-2">Stick Figure World</h3>
<div class="outline-text-3" id="text-3-2">


<p>
This environment is similar to Worm World, except the creatures are
more complicated and the actions and questions more varied. It is
an experiment to see how far imagination can go in interpreting
actions.
</p></div>
</div>
</div>
</div>

<div id="postamble">
<p class="date">Date: 2013-11-07 04:21:29 EST</p>
<p class="author">Author: Robert McIntyre</p>
<p class="creator">Org version 7.7 with Emacs version 24</p>
<a href="http://validator.w3.org/check?uri=referer">Validate XHTML 1.0</a>

</div>
</body>
</html>