#+title: =CORTEX=
#+author: Robert McIntyre
#+email: rlm@mit.edu
#+description: Using embodied AI to facilitate Artificial Imagination.
#+keywords: AI, clojure, embodiment

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will also have
seen one way to efficiently implement empathy for embodied creatures.

The core vision of this thesis is that one of the important ways in
which we understand others is by imagining ourselves in their
position and empathically feeling experiences based on our own past
experiences and imagination.

By understanding events in terms of our own previous corporeal
experience, we greatly constrain the possibilities of what would
otherwise be an unwieldy exponential search. This extra constraint
can be the difference between easily understanding what is happening
in a video and being completely lost in a sea of incomprehensible
color and movement.

** Recognizing actions in video is extremely difficult

Consider, for example, the problem of determining what is happening
in a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as ``drinking''. And rightly so -- it is a very
hard problem! What features can you describe in terms of low-level
functions of pixels that can even begin to describe what is
happening here?

Or suppose that you are building a program that recognizes
chairs. How could you ``see'' the chair in the following pictures?

#+caption: When you look at this, do you think ``chair''? I certainly do.
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any computer program can find it.
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference in how
the girl's /muscles/ are working in the two images of \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to ``see'' how the girl's arm muscles are activated
#+caption: differently in the two images.
#+name: girl
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

These problems are difficult because the language of pixels is far
removed from what we would consider to be an acceptable description
of the events in these images. In order to process them, we must
raise the images into some higher level of abstraction where their
descriptions become more similar to how we would describe them in
English.
The question is, how can we raise a video to this level of
abstraction?

I think humans are able to label such video as ``drinking'' because
they imagine /themselves/ as the cat, and imagine putting their face
up against a stream of water and sticking out their tongue. In that
imagined world, they can feel the cool water hitting their tongue,
and feel the water entering their body, and are able to recognize
that /feeling/ as drinking. So, the label of the action is not
really in the pixels of the image, but is found clearly in a
simulation inspired by those pixels. An imaginative system, having
been trained on drinking and non-drinking examples and learning that
the most important component of drinking is the feeling of water
sliding down one's throat, would analyze a video of a cat drinking
in the following manner:

- Create a physical model of the video by putting a ``fuzzy'' model
  of its own body in place of the cat. Also, create a simulation of
  the stream of water.

- Play out this simulated scene and generate imagined sensory
  experience. This will include relevant muscle contractions, a
  close-up view of the stream from the cat's perspective, and most
  importantly, the imagined feeling of water entering the mouth.

- The action is now easily identified as drinking by the sense of
  taste alone. The other senses (such as the tongue moving in and
  out) help to give plausibility to the simulated action. Note that
  the sense of vision, while critical in creating the simulation,
  is not critical for identifying the action from the simulation.

cat drinking, mimes, leaning, common sense

** =EMPATH= neatly solves recognition problems

factorization, right language, etc.

a new possibility for the question ``what is a chair?'' -- it's the
feeling of your butt on something and your knees bent, with your
back muscles and legs relaxed.

** =CORTEX= is a toolkit for building sensate creatures

Hand integration demo

** Contributions

* Building =CORTEX=

** To explore embodiment, we need a world, body, and senses

** Because of Time, simulation is preferable to reality

** Video game engines are a great starting point

** Bodies are composed of segments connected by joints

** Eyes reuse standard video game components

** Hearing is hard; =CORTEX= does it right

** Touch uses hundreds of hair-like elements

** Proprioception is the sense that makes everything ``real''

** Muscles are both effectors and sensors

** =CORTEX= brings complex creatures to life!
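The preceding sections each describe one sense in isolation; bringing
a creature to life means attaching all of them to a single body. The
following is a minimal sketch of how that composition might look. The
sense-constructor names in the usage comment (=touch!=,
=proprioception!=, and so on) are hypothetical placeholders, not the
real =CORTEX= API; the point is only the overall shape, in which each
sense is installed on a creature and hands back functions that yield
that sense's data every simulation step.

#+begin_src clojure
;; Sketch only: each sense is assumed to be a separate constructor
;; that takes a creature and returns data-producing functions.  The
;; constructor names in the usage example below are hypothetical
;; placeholders, not the real CORTEX API.
(defn attach-senses
  "Attach every sense constructor in sense-ctors to creature.
   Returns a map from sense name to that sense's data-producing
   functions."
  [creature sense-ctors]
  (into {} (for [[sense-name ctor] sense-ctors]
             [sense-name (ctor creature)])))

;; Hypothetical usage:
;; (attach-senses worm
;;   {:touch touch!, :proprioception proprioception!,
;;    :vision vision!, :hearing hearing!, :muscles movement!})
#+end_src

In this picture, the hand integration demo mentioned above would
presumably amount to doing this for a hand model and displaying each
sense's output while the simulation runs.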
** =CORTEX= enables many possibilities for further research

* Empathy in a simulated worm

** Embodiment factors action recognition into manageable parts

** Action recognition is easy with a full gamut of senses

** Digression: bootstrapping touch using free exploration

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions
- Built =CORTEX=, a comprehensive platform for embodied AI
  experiments. It has many new features lacking in other systems,
  such as sound, and makes it easy to model and create new creatures.
- Created a novel concept for action recognition using artificial
  imagination.

In the second half of the thesis I develop a computational model of
empathy, using =CORTEX= as a base. Empathy in this context is the
ability to observe another creature and infer what sorts of
sensations that creature is feeling. My empathy algorithm involves
multiple phases. First is free-play, where the creature moves around
and gains sensory experience. From this experience I construct a
representation of the creature's sensory state space, which I call
\phi-space. Using \phi-space, I construct an efficient function for
enriching the limited data that comes from observing another
creature with a full complement of imagined sensory data based on
previous experience. I can then use the imagined sensory data to
recognize what the observed creature is doing and feeling, using
straightforward embodied action predicates. This is all demonstrated
using a simple worm-like creature, and recognizing worm actions
based on limited data.

Embodied representation using multiple senses such as touch,
proprioception, and muscle tension turns out to be exceedingly
efficient at describing body-centered actions. It is the ``right
language for the job''. For example, it takes only around 5 lines of
LISP code to describe the action of ``curling'' using embodied
primitives (see the sketch at the end of this outline). It takes
about 8 lines to describe the seemingly complicated action of
wiggling.

* COMMENT names for cortex
- bioland
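To make the ``right language for the job'' claim above concrete, here
is a minimal sketch of what a curling predicate over embodied
primitives might look like. The =curled?= name, the assumption that
each experience frame carries a =:proprioception= entry of
=[pitch yaw bend]= joint-angle triples, and the bend threshold are
all illustrative guesses, not the definitions given in the thesis
itself.

#+begin_src clojure
;; Sketch only: assumes each frame of experience is a map whose
;; :proprioception entry is a sequence of [pitch yaw bend] joint
;; angles in radians.  Names and threshold are illustrative.
(defn curled?
  "True when every joint in the most recent experience frame is bent
   past roughly a right angle -- an embodied description of curling."
  [experiences]
  (every? (fn [[_pitch _yaw bend]] (> (Math/abs bend) (/ Math/PI 2)))
          (:proprioception (last experiences))))
#+end_src

Because the predicate consults joint angles rather than pixels, it is
indifferent to viewpoint and appearance, which is the sense in which
the embodied description is short and robust.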