# HG changeset patch
# User Robert McIntyre
# Date 1395709175 14400
# Node ID c20de2267d39866ae893b1cd772b27a3263ff6d5
# Parent b01c070b03d4a892b9acfcf5b9aa5f01f85ddd90
completing first third of first chapter.

diff -r b01c070b03d4 -r c20de2267d39 thesis/abstract.org
--- a/thesis/abstract.org Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/abstract.org Mon Mar 24 20:59:35 2014 -0400
@@ -6,11 +6,11 @@
 curling and wiggling.

 To attack the action recognition problem, I developed a computational
-model of empathy (=EMPATH=) which allows me to use simple, embodied
-representations of actions (which require rich sensory data), even
-when that sensory data is not actually available. The missing sense
-data is ``imagined'' by the system by combining previous experiences
-gained from unsupervised free play.
+model of empathy (=EMPATH=) which allows me to recognize actions using
+simple, embodied representations of actions (which require rich
+sensory data), even when that sensory data is not actually
+available. The missing sense data is ``imagined'' by the system by
+combining previous experiences gained from unsupervised free play.

 In order to build this empathic, action-recognizing system, I created
 a program called =CORTEX=, which is a complete platform for embodied
diff -r b01c070b03d4 -r c20de2267d39 thesis/cortex.org
--- a/thesis/cortex.org Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cortex.org Mon Mar 24 20:59:35 2014 -0400
@@ -10,104 +10,271 @@
 By the end of this thesis, you will have seen a novel approach to
 interpreting video using embodiment and empathy. You will have also
 seen one way to efficiently implement empathy for embodied
- creatures.
+ creatures. Finally, you will become familiar with =CORTEX=, a
+ system for designing and simulating creatures with rich senses,
+ which you may choose to use in your own research.
- The core vision of this thesis is that one of the important ways in
- which we understand others is by imagining ourselves in their
- posistion and empathicaly feeling experiences based on our own past
- experiences and imagination.
-
- By understanding events in terms of our own previous corperal
- experience, we greatly constrain the possibilities of what would
- otherwise be an unweidly exponential search. This extra constraint
- can be the difference between easily understanding what is happening
- in a video and being completely lost in a sea of incomprehensible
- color and movement.
+ This is the core vision of my thesis: that one of the important ways
+ in which we understand others is by imagining ourselves in their
+ position and empathically feeling experiences relative to our own
+ bodies. By understanding events in terms of our own previous
+ corporeal experience, we greatly constrain the possibilities of what
+ would otherwise be an unwieldy exponential search. This extra
+ constraint can be the difference between easily understanding what
+ is happening in a video and being completely lost in a sea of
+ incomprehensible color and movement.

 ** Recognizing actions in video is extremely difficult

- Consider for example the problem of determining what is happening in
- a video of which this is one frame:
+ Consider, for example, the problem of determining what is happening in
+ a video of which this is one frame:
- #+caption: A cat drinking some water. Identifying this action is
- #+caption: beyond the state of the art for computers.
- #+ATTR_LaTeX: :width 7cm
- [[./images/cat-drinking.jpg]]
+ #+caption: A cat drinking some water. Identifying this action is
+ #+caption: beyond the state of the art for computers.
+ #+ATTR_LaTeX: :width 7cm
+ [[./images/cat-drinking.jpg]]
+
+ It is currently impossible for any computer program to reliably
+ label such a video as "drinking". And rightly so -- it is a very
+ hard problem! What features can you describe in terms of low-level
+ functions of pixels that can even begin to describe at a high level
+ what is happening here?
- It is currently impossible for any computer program to reliably
- label such an video as "drinking". And rightly so -- it is a very
- hard problem! What features can you describe in terms of low level
- functions of pixels that can even begin to describe what is
- happening here?
+ Or suppose that you are building a program that recognizes
+ chairs. How could you ``see'' the chair in figure
+ \ref{invisible-chair} and figure \ref{hidden-chair}?
+
+ #+caption: When you look at this, do you think ``chair''? I certainly do.
+ #+name: invisible-chair
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/invisible-chair.png]]
+
+ #+caption: The chair in this image is quite obvious to humans, but I
+ #+caption: doubt that any computer program can find it.
+ #+name: hidden-chair
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/fat-person-sitting-at-desk.jpg]]
+
+ Finally, how is it that you can easily tell the difference between
+ how the girl's /muscles/ are working in figure \ref{girl}?
+
+ #+caption: The mysterious ``common sense'' appears here as you are able
+ #+caption: to discern the difference in how the girl's arm muscles
+ #+caption: are activated between the two images.
+ #+name: girl
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/wall-push.png]]
- #+caption: When you look at this, do you think ``chair''? I certainly do.
- #+ATTR_LaTeX: :width 10cm
- [[./images/invisible-chair.png]]
+ Each of these examples tells us something about what might be going
+ on in our minds as we easily solve these recognition problems.
+
+ The hidden chairs show us that we are strongly triggered by cues
+ relating to the position of human bodies, and that we can
+ determine the overall physical configuration of a human body even
+ if much of that body is occluded.
- #+caption: The chair in this image is quite obvious to humans, but I
- #+caption: doubt that any computer program can find it.
- #+ATTR_LaTeX: :width 10cm
- [[./images/fat-person-sitting-at-desk.jpg]]
+ The picture of the girl pushing against the wall tells us that we
+ have common sense knowledge about the kinetics of our own bodies.
+ We know well how our muscles would have to work to maintain us in
+ most positions, and we can easily project this self-knowledge to
+ imagined positions triggered by images of the human body.
+
+** =EMPATH= neatly solves recognition problems
+
+ I propose a system that can express the types of recognition
+ problems above in a form amenable to computation. It is split into
+ four parts:
+
+ - Free/Guided Play :: The creature moves around and experiences the
+ world through its unique perspective. Many otherwise
+ complicated actions are easily described in the language of a
+ full suite of body-centered, rich senses. For example,
+ drinking is the feeling of water sliding down your throat, and
+ cooling your insides. It's often accompanied by bringing your
+ hand close to your face, or bringing your face close to
+ water. Sitting down is the feeling of bending your knees,
+ activating your quadriceps, then feeling a surface with your
+ bottom and relaxing your legs. These body-centered action
+ descriptions can be either learned or hard-coded.
+ - Alignment :: When trying to interpret a video or image, the
+ creature takes a model of itself and aligns it with
+ whatever it sees. This can be a rather loose
+ alignment that can cross species, as when humans try
+ to align themselves with things like ponies, dogs,
+ or other humans with a different body type.
+ - Empathy :: The alignment triggers the memories of previous
+ experience. For example, the alignment itself easily
+ maps to proprioceptive data. Any sounds or obvious
+ skin contact in the video can, to a lesser extent,
+ trigger previous experience. The creature's previous
+ experience is chained together in short bursts to
+ coherently describe the new scene.
+ - Recognition :: With the scene now described in terms of past
+ experience, the creature can now run its
+ action-identification programs on this synthesized
+ sensory data, just as it would if it were actually
+ experiencing the scene first-hand. If previous
+ experience has been accurately retrieved, and if
+ it is analogous enough to the scene, then the
+ creature will correctly identify the action in the
+ scene.
+
+
+ For example, I think humans are able to label the cat video as
+ "drinking" because they imagine /themselves/ as the cat, and
+ imagine putting their face up against a stream of water and
+ sticking out their tongue. In that imagined world, they can feel
+ the cool water hitting their tongue, and feel the water entering
+ their body, and are able to recognize that /feeling/ as
+ drinking. So, the label of the action is not really in the pixels
+ of the image, but is found clearly in a simulation inspired by
+ those pixels. An imaginative system, having been trained on
+ drinking and non-drinking examples and learning that the most
+ important component of drinking is the feeling of water sliding
+ down one's throat, would analyze a video of a cat drinking in the
+ following manner:
+
+ 1. Create a physical model of the video by putting a "fuzzy" model
+ of its own body in place of the cat. Possibly also create a
+ simulation of the stream of water.
+
+ 2. Play out this simulated scene and generate imagined sensory
+ experience. This will include relevant muscle contractions, a
+ close-up view of the stream from the cat's perspective, and most
+ importantly, the imagined feeling of water entering the
+ mouth. The imagined sensory experience can come both from a
+ simulation of the event and from pattern-matching against
+ previous, similar embodied experience.
+
+ 3. The action is now easily identified as drinking by the sense of
+ taste alone. The other senses (such as the tongue moving in and
+ out) help to give plausibility to the simulated action. Note that
+ the sense of vision, while critical in creating the simulation,
+ is not critical for identifying the action from the simulation.
+
+ For the chair examples, the process is even easier:
+
+ 1. Align a model of your body to the person in the image.
+
+ 2. Generate proprioceptive sensory data from this alignment.
- #+caption: The chair in this image is quite obvious to humans, but I
- #+caption: doubt that any computer program can find it.
- #+ATTR_LaTeX: :width 10cm
- [[./images/fat-person-sitting-at-desk.jpg]]
+ 3. Use the imagined proprioceptive data as a key to look up related
+ sensory experience associated with that particular proprioceptive
+ feeling.
- Finally, how is it that you can easily tell the difference between
- how the girls /muscles/ are working in \ref{girl}?
+ 4. Retrieve the feeling of your bottom resting on a surface and
+ your leg muscles relaxed.
- #+caption: The mysterious ``common sense'' appears here as you are able
- #+caption: to ``see'' the difference in how the girl's arm muscles
- #+caption: are activated differently in the two images.
- #+name: girl
- #+ATTR_LaTeX: :width 10cm
- [[./images/wall-push.png]]
-
+ 5. This sensory information is consistent with the =sitting?=
+ sensory predicate, so you (and the entity in the image) must be
+ sitting.
- These problems are difficult because the language of pixels is far
- removed from what we would consider to be an acceptable description
- of the events in these images. In order to process them, we must
- raise the images into some higher level of abstraction where their
- descriptions become more similar to how we would describe them in
- English. The question is, how can we raise
-
+ 6. There must be a chair-like object since you are sitting.
- I think humans are able to label such video as "drinking" because
- they imagine /themselves/ as the cat, and imagine putting their face
- up against a stream of water and sticking out their tongue. In that
- imagined world, they can feel the cool water hitting their tongue,
- and feel the water entering their body, and are able to recognize
- that /feeling/ as drinking. So, the label of the action is not
- really in the pixels of the image, but is found clearly in a
- simulation inspired by those pixels. An imaginative system, having
- been trained on drinking and non-drinking examples and learning that
- the most important component of drinking is the feeling of water
- sliding down one's throat, would analyze a video of a cat drinking
- in the following manner:
+ Empathy offers yet another alternative to the age-old AI
+ representation question: ``What is a chair?'' --- A chair is the
+ feeling of sitting.
+
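+ (The =sitting?= predicate named in step 5 above is not shown in
+ this chapter. Purely as an illustration of how the ``feeling of
+ sitting'' could be phrased as a body-centered predicate, in the
+ same style as the worm predicates that follow, consider this
+ sketch; the helpers =pelvis-contact=, =knee-bend=, and
+ =leg-muscle-activity=, along with the =:proprioception= entry, are
+ hypothetical stand-ins rather than =CORTEX= functions.)
+
+ #+begin_src clojure
+;; Illustrative sketch only: the helper functions used here are
+;; hypothetical stand-ins, not part of CORTEX.
+(defn sitting?
+  "Sitting is felt as contact under the pelvis, bent knees, and
+   relaxed leg muscles."
+  [experiences]
+  (let [now (peek experiences)]
+    (and (< 0.9 (pelvis-contact (:touch now)))     ; surface felt under the pelvis
+         (< 1.0 (knee-bend (:proprioception now))) ; knees bent well past straight
+         (> 0.1 (leg-muscle-activity (:muscle now)))))) ; legs nearly relaxed
+ #+end_src
+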
+ My program, =EMPATH=, uses this empathic problem-solving technique
+ to interpret the actions of a simple, worm-like creature.
- - Create a physical model of the video by putting a "fuzzy" model
- of its own body in place of the cat. Also, create a simulation of
- the stream of water.
+ #+caption: The worm performs many actions during free play such as
+ #+caption: curling, wiggling, and resting.
+ #+name: worm-intro
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/wall-push.png]]
- - Play out this simulated scene and generate imagined sensory
- experience. This will include relevant muscle contractions, a
- close up view of the stream from the cat's perspective, and most
- importantly, the imagined feeling of water entering the mouth.
+ #+caption: This sensory predicate detects when the worm is resting on the
+ #+caption: ground.
+ #+name: resting-intro
+ #+begin_listing clojure
+ #+begin_src clojure
+(defn resting?
+  "Is the worm resting on the ground?"
+  [experiences]
+  (every?
+   (fn [touch-data]
+     (< 0.9 (contact worm-segment-bottom touch-data)))
+   (:touch (peek experiences))))
+ #+end_src
+ #+end_listing
- - The action is now easily identified as drinking by the sense of
- taste alone. The other senses (such as the tongue moving in and
- out) help to give plausibility to the simulated action. Note that
- the sense of vision, while critical in creating the simulation,
- is not critical for identifying the action from the simulation.
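+ (The =grand-circle?= predicate below relies on a =curled?= helper
+ that is not shown in this section. Purely as an illustrative
+ sketch, assuming that each =:proprioception= entry is a
+ [heading pitch roll] tuple of joint angles, such a helper might
+ read as follows; the actual =CORTEX= proprioception format may
+ differ.)
+
+ #+begin_src clojure
+;; Illustrative sketch only: the proprioception format assumed here
+;; is a guess, not the definitive CORTEX representation.
+(defn curled?
+  "Is the worm curled up? Here: every joint is bent well past straight."
+  [experiences]
+  (every?
+   (fn [[_heading pitch _roll]]
+     (> (Math/sin pitch) 0.64))  ; each joint bent by more than ~40 degrees
+   (:proprioception (peek experiences))))
+ #+end_src
+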
+ #+caption: Body-centered actions are best expressed in a body-centered
+ #+caption: language. This code detects when the worm has curled into a
+ #+caption: full circle. Imagine how you would replicate this functionality
+ #+caption: using low-level pixel features such as HOG filters!
+ #+name: grand-circle-intro
+ #+begin_listing clojure
+ #+begin_src clojure
+(defn grand-circle?
+  "Does the worm form a majestic circle (one end touching the other)?"
+  [experiences]
+  (and (curled? experiences)
+       (let [worm-touch (:touch (peek experiences))
+             tail-touch (worm-touch 0)
+             head-touch (worm-touch 4)]
+         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
+              (< 0.55 (contact worm-segment-top-tip head-touch))))))
+ #+end_src
+ #+end_listing
- cat drinking, mimes, leaning, common sense
+ #+caption: Even complicated actions such as ``wiggling'' are fairly simple
+ #+caption: to describe with a rich enough language.
+ #+name: wiggling-intro
+ #+begin_listing clojure
+ #+begin_src clojure
+(defn wiggling?
+  "Is the worm wiggling?"
+  [experiences]
+  (let [analysis-interval 0x40]
+    (when (> (count experiences) analysis-interval)
+      (let [a-flex 3
+            a-ex   2
+            muscle-activity
+            (map :muscle (vector:last-n experiences analysis-interval))
+            base-activity
+            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
+        (= 2
+           (first
+            (max-indexed
+             (map #(Math/abs %)
+                  (take 20 (fft base-activity))))))))))
+ #+end_src
+ #+end_listing
-** =EMPATH= neatly solves recognition problems
+ #+caption: The actions of a worm in a video can be recognized using
+ #+caption: proprioceptive data and sensory predicates, by filling
+ #+caption: in the missing sensory detail with previous experience.
+ #+name: worm-recognition-intro
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/wall-push.png]]
- factorization , right language, etc
- a new possibility for the question ``what is a chair?'' -- it's the
- feeling of your butt on something and your knees bent, with your
- back muscles and legs relaxed.
+
+ One powerful advantage of empathic problem solving is that it
+ factors the action recognition problem into two easier problems. To
+ use empathy, you need an /aligner/, which takes the video and a
+ model of your body, and aligns the model with the video. Then, you
+ need a /recognizer/, which uses the aligned model to interpret the
+ action. The power in this method lies in the fact that you describe
+ all actions from a body-centered, rich viewpoint. This way, if you
+ teach the system what ``running'' is, and you have a good enough
+ aligner, the system will from then on be able to recognize running
+ from any point of view, even strange points of view like above or
+ underneath the runner. This is in contrast to action recognition
+ schemes that try to identify actions using a non-embodied approach
+ such as TODO:REFERENCE. If these systems learn about running as viewed
+ from the side, they will not automatically be able to recognize
+ running from any other viewpoint.
+
+ Another powerful advantage is that using the language of multiple
+ body-centered, rich senses to describe body-centered actions offers a
+ massive boost in descriptive capability. Consider how difficult it
+ would be to compose a set of HOG filters to describe the action of
+ a simple worm-creature "curling" so that its head touches its tail,
+ and then behold the simplicity of describing this action in a
+ language designed for the task (listing \ref{grand-circle-intro}).
+

 ** =CORTEX= is a toolkit for building sensate creatures
@@ -151,7 +318,7 @@

 ** Empathy is the process of tracing though \Phi-space

-** Efficient action recognition =EMPATH=
+** Efficient action recognition with =EMPATH=

 * Contributions
   - Built =CORTEX=, a comprehensive platform for embodied AI
diff -r b01c070b03d4 -r c20de2267d39 thesis/cover.tex
--- a/thesis/cover.tex Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cover.tex Mon Mar 24 20:59:35 2014 -0400
@@ -45,7 +45,7 @@
 % however the specifications can change. We recommend that you verify the
 % layout of your title page with your thesis advisor and/or the MIT
 % Libraries before printing your final copy.
-\title{Solving Problems using Embodiment \& Empathy.}
+\title{Solving Problems using Embodiment \& Empathy}
 \author{Robert Louis M\raisebox{\depth}{\small \underline{\underline{c}}}Intyre}
 %\author{Robert McIntyre}
diff -r b01c070b03d4 -r c20de2267d39 thesis/rlm-cortex-meng.tex
--- a/thesis/rlm-cortex-meng.tex Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/rlm-cortex-meng.tex Mon Mar 24 20:59:35 2014 -0400
@@ -25,7 +25,7 @@
 %% Page Intentionally Left Blank'', use the ``leftblank'' option, as
 %% above.

-\documentclass[12pt,twoside,singlespace]{mitthesis}
+\documentclass[12pt,twoside,singlespace,vi]{mitthesis}
 \usepackage[utf8]{inputenc}
 \usepackage[T1]{fontenc}
 \usepackage{fixltx2e}
diff -r b01c070b03d4 -r c20de2267d39 thesis/to-frames.pl
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/to-frames.pl Mon Mar 24 20:59:35 2014 -0400
@@ -0,0 +1,15 @@
+#!/bin/perl
+
+$movie_file = shift(@ARGV);
+
+# get file name without extension
+$movie_file =~ m/^([^.]+)\.[^.]+$/;
+$movie_name = $1;
+
+@mkdir_command = ("mkdir", "-vp", $movie_name);
+@ffmpeg_command = ("ffmpeg", "-i", $movie_file, $movie_name."/%07d.png");
+
+print "@mkdir_command\n";
+system(@mkdir_command);
+print "@ffmpeg_command\n";
+system(@ffmpeg_command);
\ No newline at end of file