# HG changeset patch
# User Robert McIntyre
# Date 1395709175 14400
# Node ID c20de2267d39866ae893b1cd772b27a3263ff6d5
# Parent b01c070b03d4a892b9acfcf5b9aa5f01f85ddd90
completing first third of first chapter.

diff -r b01c070b03d4 -r c20de2267d39 thesis/abstract.org
--- a/thesis/abstract.org Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/abstract.org Mon Mar 24 20:59:35 2014 -0400
@@ -6,11 +6,11 @@
 curling and wiggling.

 To attack the action recognition problem, I developed a computational
-model of empathy (=EMPATH=) which allows me to use simple, embodied
-representations of actions (which require rich sensory data), even
-when that sensory data is not actually available. The missing sense
-data is ``imagined'' by the system by combining previous experiences
-gained from unsupervised free play.
+model of empathy (=EMPATH=) which allows me to recognize actions using
+simple, embodied representations of actions (which require rich
+sensory data), even when that sensory data is not actually
+available. The missing sense data is ``imagined'' by the system by
+combining previous experiences gained from unsupervised free play.

 In order to build this empathic, action-recognizing system, I created
 a program called =CORTEX=, which is a complete platform for embodied
diff -r b01c070b03d4 -r c20de2267d39 thesis/cortex.org
--- a/thesis/cortex.org Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cortex.org Mon Mar 24 20:59:35 2014 -0400
@@ -10,104 +10,271 @@
 By the end of this thesis, you will have seen a novel approach to
 interpreting video using embodiment and empathy. You will have also
 seen one way to efficiently implement empathy for embodied
- creatures.
+ creatures. Finally, you will become familiar with =CORTEX=, a
+ system for designing and simulating creatures with rich senses,
+ which you may choose to use in your own research.
- The core vision of this thesis is that one of the important ways in
- which we understand others is by imagining ourselves in their
- posistion and empathicaly feeling experiences based on our own past
- experiences and imagination.
-
- By understanding events in terms of our own previous corperal
- experience, we greatly constrain the possibilities of what would
- otherwise be an unweidly exponential search. This extra constraint
- can be the difference between easily understanding what is happening
- in a video and being completely lost in a sea of incomprehensible
- color and movement.
+ This is the core vision of my thesis: that one of the important ways
+ in which we understand others is by imagining ourselves in their
+ position and empathically feeling experiences relative to our own
+ bodies. By understanding events in terms of our own previous
+ corporeal experience, we greatly constrain the possibilities of what
+ would otherwise be an unwieldy exponential search. This extra
+ constraint can be the difference between easily understanding what
+ is happening in a video and being completely lost in a sea of
+ incomprehensible color and movement.

 ** Recognizing actions in video is extremely difficult

- Consider for example the problem of determining what is happening in
- a video of which this is one frame:
+ Consider, for example, the problem of determining what is happening in
+ a video of which this is one frame:
- #+caption: A cat drinking some water. Identifying this action is
- #+caption: beyond the state of the art for computers.
- #+ATTR_LaTeX: :width 7cm
- [[./images/cat-drinking.jpg]]
+ #+caption: A cat drinking some water. Identifying this action is
+ #+caption: beyond the state of the art for computers.
+ #+ATTR_LaTeX: :width 7cm
+ [[./images/cat-drinking.jpg]]
+
+ It is currently impossible for any computer program to reliably
+ label such a video as "drinking". And rightly so -- it is a very
+ hard problem! What features can you describe in terms of low-level
+ functions of pixels that can even begin to describe at a high level
+ what is happening here?
- It is currently impossible for any computer program to reliably
- label such an video as "drinking". And rightly so -- it is a very
- hard problem! What features can you describe in terms of low level
- functions of pixels that can even begin to describe what is
- happening here?
+ Or suppose that you are building a program that recognizes
+ chairs. How could you ``see'' the chair in figure
+ \ref{invisible-chair} and figure \ref{hidden-chair}?
+
+ #+caption: When you look at this, do you think ``chair''? I certainly do.
+ #+name: invisible-chair
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/invisible-chair.png]]
+
+ #+caption: The chair in this image is quite obvious to humans, but I
+ #+caption: doubt that any computer program can find it.
+ #+name: hidden-chair
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/fat-person-sitting-at-desk.jpg]]
+
+ Finally, how is it that you can easily tell the difference between
+ how the girl's /muscles/ are working in figure \ref{girl}?
+
+ #+caption: The mysterious ``common sense'' appears here as you are able
+ #+caption: to discern the difference in how the girl's arm muscles
+ #+caption: are activated between the two images.
+ #+name: girl
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/wall-push.png]]
- #+caption: When you look at this, do you think ``chair''? I certainly do.
- #+ATTR_LaTeX: :width 10cm
- [[./images/invisible-chair.png]]
+ Each of these examples tells us something about what might be going
+ on in our minds as we easily solve these recognition problems.
+
+ The hidden chairs show us that we are strongly triggered by cues
+ relating to the position of human bodies, and that we can
+ determine the overall physical configuration of a human body even
+ if much of that body is occluded.
- #+caption: The chair in this image is quite obvious to humans, but I
- #+caption: doubt that any computer program can find it.
- #+ATTR_LaTeX: :width 10cm
- [[./images/fat-person-sitting-at-desk.jpg]]
+ The picture of the girl pushing against the wall tells us that we
+ have common sense knowledge about the kinetics of our own bodies.
+ We know well how our muscles would have to work to maintain us in
+ most positions, and we can easily project this self-knowledge to
+ imagined positions triggered by images of the human body.
+
+** =EMPATH= neatly solves recognition problems
+
+ I propose a system that can express the types of recognition
+ problems above in a form amenable to computation. It is split into
+ four parts:
+
+ - Free/Guided Play :: The creature moves around and experiences the
+ world through its unique perspective. Many otherwise
+ complicated actions are easily described in the language of a
+ full suite of body-centered, rich senses. For example,
+ drinking is the feeling of water sliding down your throat, and
+ cooling your insides. It's often accompanied by bringing your
+ hand close to your face, or bringing your face close to
+ water. Sitting down is the feeling of bending your knees,
+ activating your quadriceps, then feeling a surface with your
+ bottom and relaxing your legs. These body-centered action
+ descriptions can be either learned or hard-coded.
+ - Alignment :: When trying to interpret a video or image, the
+ creature takes a model of itself and aligns it with
+ whatever it sees. This can be a rather loose
+ alignment that can cross species, as when humans try
+ to align themselves with things like ponies, dogs,
+ or other humans with a different body type.
+ - Empathy :: The alignment triggers the memories of previous
+ experience. For example, the alignment itself easily
+ maps to proprioceptive data. Any sounds or obvious
+ skin contact in the video can, to a lesser extent,
+ trigger previous experience. The creature's previous
+ experience is chained together in short bursts to
+ coherently describe the new scene.
+ - Recognition :: With the scene now described in terms of past
+ experience, the creature can now run its
+ action-identification programs on this synthesized
+ sensory data, just as it would if it were actually
+ experiencing the scene first-hand. If previous
+ experience has been accurately retrieved, and if
+ it is analogous enough to the scene, then the
+ creature will correctly identify the action in the
+ scene.
+
+
+ For example, I think humans are able to label the cat video as
+ "drinking" because they imagine /themselves/ as the cat, and
+ imagine putting their face up against a stream of water and
+ sticking out their tongue. In that imagined world, they can feel
+ the cool water hitting their tongue, and feel the water entering
+ their body, and are able to recognize that /feeling/ as
+ drinking. So, the label of the action is not really in the pixels
+ of the image, but is found clearly in a simulation inspired by
+ those pixels. An imaginative system, having been trained on
+ drinking and non-drinking examples and learning that the most
+ important component of drinking is the feeling of water sliding
+ down one's throat, would analyze a video of a cat drinking in the
+ following manner:
+
+ 1. Create a physical model of the video by putting a "fuzzy" model
+ of its own body in place of the cat. Possibly also create a
+ simulation of the stream of water.
+
+ 2. Play out this simulated scene and generate imagined sensory
+ experience. This will include relevant muscle contractions, a
+ close-up view of the stream from the cat's perspective, and most
+ importantly, the imagined feeling of water entering the
+ mouth. The imagined sensory experience can come both from a
+ simulation of the event and from pattern-matching against
+ previous, similar embodied experience.
+
+ 3. The action is now easily identified as drinking by the sense of
+ taste alone. The other senses (such as the tongue moving in and
+ out) help to give plausibility to the simulated action. Note that
+ the sense of vision, while critical in creating the simulation,
+ is not critical for identifying the action from the simulation.
+
+ For the chair examples, the process is even easier:
+
+ 1. Align a model of your body to the person in the image.
+
+ 2. Generate proprioceptive sensory data from this alignment.
- #+caption: The chair in this image is quite obvious to humans, but I
- #+caption: doubt that any computer program can find it.
- #+ATTR_LaTeX: :width 10cm
- [[./images/fat-person-sitting-at-desk.jpg]]
+ 3. Use the imagined proprioceptive data as a key to look up related
+ sensory experience associated with that particular proprioceptive
+ feeling.
- Finally, how is it that you can easily tell the difference between
- how the girls /muscles/ are working in \ref{girl}?
+ 4. Retrieve the feeling of your bottom resting on a surface and
+ your leg muscles relaxed.
- #+caption: The mysterious ``common sense'' appears here as you are able
- #+caption: to ``see'' the difference in how the girl's arm muscles
- #+caption: are activated differently in the two images.
- #+name: girl
- #+ATTR_LaTeX: :width 10cm
- [[./images/wall-push.png]]
-
+ 5. This sensory information is consistent with the =sitting?=
+ sensory predicate, so you (and the entity in the image) must be
+ sitting.
- These problems are difficult because the language of pixels is far
- removed from what we would consider to be an acceptable description
- of the events in these images. In order to process them, we must
- raise the images into some higher level of abstraction where their
- descriptions become more similar to how we would describe them in
- English. The question is, how can we raise
-
+ 6. There must be a chair-like object since you are sitting.
- I think humans are able to label such video as "drinking" because
- they imagine /themselves/ as the cat, and imagine putting their face
- up against a stream of water and sticking out their tongue. In that
- imagined world, they can feel the cool water hitting their tongue,
- and feel the water entering their body, and are able to recognize
- that /feeling/ as drinking. So, the label of the action is not
- really in the pixels of the image, but is found clearly in a
- simulation inspired by those pixels. An imaginative system, having
- been trained on drinking and non-drinking examples and learning that
- the most important component of drinking is the feeling of water
- sliding down one's throat, would analyze a video of a cat drinking
- in the following manner:
+ Empathy offers yet another alternative to the age-old AI
+ representation question: ``What is a chair?'' --- A chair is the
+ feeling of sitting.
+
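+ (The =sitting?= predicate named in step 5 above is not shown in
+ this chapter. Purely as an illustration of how the ``feeling of
+ sitting'' could be phrased as a body-centered predicate, in the
+ same style as the worm predicates that follow, consider this
+ sketch; the helpers =pelvis-contact=, =knee-bend=, and
+ =leg-muscle-activity=, along with the =:proprioception= entry, are
+ hypothetical stand-ins rather than =CORTEX= functions.)
+
+ #+begin_src clojure
+;; Illustrative sketch only: the helper functions used here are
+;; hypothetical stand-ins, not part of CORTEX.
+(defn sitting?
+  "Sitting is felt as contact under the pelvis, bent knees, and
+   relaxed leg muscles."
+  [experiences]
+  (let [now (peek experiences)]
+    (and (< 0.9 (pelvis-contact (:touch now)))     ; surface felt under the pelvis
+         (< 1.0 (knee-bend (:proprioception now))) ; knees bent well past straight
+         (> 0.1 (leg-muscle-activity (:muscle now)))))) ; legs nearly relaxed
+ #+end_src
+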
+ My program, =EMPATH=, uses this empathic problem-solving technique
+ to interpret the actions of a simple, worm-like creature.
- - Create a physical model of the video by putting a "fuzzy" model
- of its own body in place of the cat. Also, create a simulation of
- the stream of water.
+ #+caption: The worm performs many actions during free play such as
+ #+caption: curling, wiggling, and resting.
+ #+name: worm-intro
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/wall-push.png]]
- - Play out this simulated scene and generate imagined sensory
- experience. This will include relevant muscle contractions, a
- close up view of the stream from the cat's perspective, and most
- importantly, the imagined feeling of water entering the mouth.
+ #+caption: This sensory predicate detects when the worm is resting on the
+ #+caption: ground.
+ #+name: resting-intro
+ #+begin_listing clojure
+ #+begin_src clojure
+(defn resting?
+  "Is the worm resting on the ground?"
+  [experiences]
+  (every?
+   (fn [touch-data]
+     (< 0.9 (contact worm-segment-bottom touch-data)))
+   (:touch (peek experiences))))
+ #+end_src
+ #+end_listing
- - The action is now easily identified as drinking by the sense of
- taste alone. The other senses (such as the tongue moving in and
- out) help to give plausibility to the simulated action. Note that
- the sense of vision, while critical in creating the simulation,
- is not critical for identifying the action from the simulation.
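+ (The =grand-circle?= predicate below relies on a =curled?= helper
+ that is not shown in this section. Purely as an illustrative
+ sketch, assuming that each =:proprioception= entry is a
+ [heading pitch roll] tuple of joint angles, such a helper might
+ read as follows; the actual =CORTEX= proprioception format may
+ differ.)
+
+ #+begin_src clojure
+;; Illustrative sketch only: the proprioception format assumed here
+;; is a guess, not the definitive CORTEX representation.
+(defn curled?
+  "Is the worm curled up? Here: every joint is bent well past straight."
+  [experiences]
+  (every?
+   (fn [[_heading pitch _roll]]
+     (> (Math/sin pitch) 0.64))  ; each joint bent by more than ~40 degrees
+   (:proprioception (peek experiences))))
+ #+end_src
+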
+ #+caption: Body-centered actions are best expressed in a body-centered
+ #+caption: language. This code detects when the worm has curled into a
+ #+caption: full circle. Imagine how you would replicate this functionality
+ #+caption: using low-level pixel features such as HOG filters!
+ #+name: grand-circle-intro
+ #+begin_listing clojure
+ #+begin_src clojure
+(defn grand-circle?
+  "Does the worm form a majestic circle (one end touching the other)?"
+  [experiences]
+  (and (curled? experiences)
+       (let [worm-touch (:touch (peek experiences))
+             tail-touch (worm-touch 0)
+             head-touch (worm-touch 4)]
+         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
+              (< 0.55 (contact worm-segment-top-tip head-touch))))))
+ #+end_src
+ #+end_listing
- cat drinking, mimes, leaning, common sense
+ #+caption: Even complicated actions such as ``wiggling'' are fairly simple
+ #+caption: to describe with a rich enough language.
+ #+name: wiggling-intro
+ #+begin_listing clojure
+ #+begin_src clojure
+(defn wiggling?
+  "Is the worm wiggling?"
+  [experiences]
+  (let [analysis-interval 0x40]
+    (when (> (count experiences) analysis-interval)
+      (let [a-flex 3
+            a-ex   2
+            muscle-activity
+            (map :muscle (vector:last-n experiences analysis-interval))
+            base-activity
+            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
+        (= 2
+           (first
+            (max-indexed
+             (map #(Math/abs %)
+                  (take 20 (fft base-activity))))))))))
+ #+end_src
+ #+end_listing
-** =EMPATH= neatly solves recognition problems
+ #+caption: The actions of a worm in a video can be recognized using
+ #+caption: proprioceptive data and sensory predicates, by filling
+ #+caption: in the missing sensory detail with previous experience.
+ #+name: worm-recognition-intro
+ #+ATTR_LaTeX: :width 10cm
+ [[./images/wall-push.png]]
- factorization , right language, etc
- a new possibility for the question ``what is a chair?'' -- it's the
- feeling of your butt on something and your knees bent, with your
- back muscles and legs relaxed.
+
+ One powerful advantage of empathic problem solving is that it
+ factors the action recognition problem into two easier problems. To
+ use empathy, you need an /aligner/, which takes the video and a
+ model of your body, and aligns the model with the video. Then, you
+ need a /recognizer/, which uses the aligned model to interpret the
+ action. The power in this method lies in the fact that you describe
+ all actions from a body-centered, rich viewpoint. This way, if you
+ teach the system what ``running'' is, and you have a good enough
+ aligner, the system will from then on be able to recognize running
+ from any point of view, even strange points of view like above or
+ underneath the runner. This is in contrast to action recognition
+ schemes that try to identify actions using a non-embodied approach
+ such as TODO:REFERENCE. If these systems learn about running as viewed
+ from the side, they will not automatically be able to recognize
+ running from any other viewpoint.
+
+ Another powerful advantage is that using the language of multiple
+ body-centered, rich senses to describe body-centered actions offers a
+ massive boost in descriptive capability. Consider how difficult it
+ would be to compose a set of HOG filters to describe the action of
+ a simple worm-creature "curling" so that its head touches its tail,
+ and then behold the simplicity of describing this action in a
+ language designed for the task (listing \ref{grand-circle-intro}).
+

 ** =CORTEX= is a toolkit for building sensate creatures
@@ -151,7 +318,7 @@

 ** Empathy is the process of tracing though \Phi-space

-** Efficient action recognition =EMPATH=
+** Efficient action recognition with =EMPATH=

 * Contributions
   - Built =CORTEX=, a comprehensive platform for embodied AI
diff -r b01c070b03d4 -r c20de2267d39 thesis/cover.tex
--- a/thesis/cover.tex Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/cover.tex Mon Mar 24 20:59:35 2014 -0400
@@ -45,7 +45,7 @@
 % however the specifications can change. We recommend that you verify the
 % layout of your title page with your thesis advisor and/or the MIT
 % Libraries before printing your final copy.
-\title{Solving Problems using Embodiment \& Empathy.}
+\title{Solving Problems using Embodiment \& Empathy}
 \author{Robert Louis M\raisebox{\depth}{\small \underline{\underline{c}}}Intyre}
 %\author{Robert McIntyre}
diff -r b01c070b03d4 -r c20de2267d39 thesis/rlm-cortex-meng.tex
--- a/thesis/rlm-cortex-meng.tex Sun Mar 23 23:43:20 2014 -0400
+++ b/thesis/rlm-cortex-meng.tex Mon Mar 24 20:59:35 2014 -0400
@@ -25,7 +25,7 @@
 %% Page Intentionally Left Blank'', use the ``leftblank'' option, as
 %% above.

-\documentclass[12pt,twoside,singlespace]{mitthesis}
+\documentclass[12pt,twoside,singlespace,vi]{mitthesis}
 \usepackage[utf8]{inputenc}
 \usepackage[T1]{fontenc}
 \usepackage{fixltx2e}
diff -r b01c070b03d4 -r c20de2267d39 thesis/to-frames.pl
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/thesis/to-frames.pl Mon Mar 24 20:59:35 2014 -0400
@@ -0,0 +1,15 @@
+#!/bin/perl
+
+$movie_file = shift(@ARGV);
+
+# get file name without extension
+$movie_file =~ m/^([^.]+)\.[^.]+$/;
+$movie_name = $1;
+
+@mkdir_command = ("mkdir", "-vp", $movie_name);
+@ffmpeg_command = ("ffmpeg", "-i", $movie_file, $movie_name."/%07d.png");
+
+print "@mkdir_command\n";
+system(@mkdir_command);
+print "@ffmpeg_command\n";
+system(@ffmpeg_command);
\ No newline at end of file