comparison thesis/cortex.org @ 441:c20de2267d39
completing first third of first chapter.

author | Robert McIntyre <rlm@mit.edu>
date | Mon, 24 Mar 2014 20:59:35 -0400
parents | b01c070b03d4
children | eaf8c591372b

* Empathy and Embodiment as problem solving strategies

By the end of this thesis, you will have seen a novel approach to
interpreting video using embodiment and empathy. You will have also
seen one way to efficiently implement empathy for embodied
creatures. Finally, you will become familiar with =CORTEX=, a
system for designing and simulating creatures with rich senses,
which you may choose to use in your own research.

This is the core vision of my thesis: that one of the important ways
in which we understand others is by imagining ourselves in their
position and empathically feeling experiences relative to our own
bodies. By understanding events in terms of our own previous
corporeal experience, we greatly constrain the possibilities of what
would otherwise be an unwieldy exponential search. This extra
constraint can be the difference between easily understanding what
is happening in a video and being completely lost in a sea of
incomprehensible color and movement.

** Recognizing actions in video is extremely difficult

Consider for example the problem of determining what is happening in
a video of which this is one frame:

#+caption: A cat drinking some water. Identifying this action is
#+caption: beyond the state of the art for computers.
#+ATTR_LaTeX: :width 7cm
[[./images/cat-drinking.jpg]]

It is currently impossible for any computer program to reliably
label such a video as "drinking". And rightly so -- it is a very
hard problem! What features can you describe in terms of low level
functions of pixels that can even begin to describe at a high level
what is happening here?

Or suppose that you are building a program that recognizes
chairs. How could you ``see'' the chair in figure
\ref{invisible-chair} and figure \ref{hidden-chair}?

#+caption: When you look at this, do you think ``chair''? I certainly do.
#+name: invisible-chair
#+ATTR_LaTeX: :width 10cm
[[./images/invisible-chair.png]]

#+caption: The chair in this image is quite obvious to humans, but I
#+caption: doubt that any computer program can find it.
#+name: hidden-chair
#+ATTR_LaTeX: :width 10cm
[[./images/fat-person-sitting-at-desk.jpg]]

Finally, how is it that you can easily tell the difference between
how the girl's /muscles/ are working in figure \ref{girl}?

#+caption: The mysterious ``common sense'' appears here as you are able
#+caption: to discern the difference in how the girl's arm muscles
#+caption: are activated between the two images.
#+name: girl
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

Each of these examples tells us something about what might be going
on in our minds as we easily solve these recognition problems.

The hidden chairs show us that we are strongly triggered by cues
relating to the position of human bodies, and that we can
determine the overall physical configuration of a human body even
if much of that body is occluded.

The picture of the girl pushing against the wall tells us that we
have common sense knowledge about the kinetics of our own bodies.
We know well how our muscles would have to work to maintain us in
most positions, and we can easily project this self-knowledge to
imagined positions triggered by images of the human body.

** =EMPATH= neatly solves recognition problems

I propose a system that can express the types of recognition
problems above in a form amenable to computation. It is split into
four parts (a code sketch of the full pipeline follows this list):

- Free/Guided Play :: The creature moves around and experiences the
     world through its unique perspective. Many otherwise
     complicated actions are easily described in the language of a
     full suite of body-centered, rich senses. For example,
     drinking is the feeling of water sliding down your throat, and
     cooling your insides. It's often accompanied by bringing your
     hand close to your face, or bringing your face close to
     water. Sitting down is the feeling of bending your knees,
     activating your quadriceps, then feeling a surface with your
     bottom and relaxing your legs. These body-centered action
     descriptions can be either learned or hard coded.
- Alignment :: When trying to interpret a video or image, the
     creature takes a model of itself and aligns it with
     whatever it sees. This can be a rather loose
     alignment that can cross species, as when humans try
     to align themselves with things like ponies, dogs,
     or other humans with a different body type.
- Empathy :: The alignment triggers the memories of previous
     experience. For example, the alignment itself easily
     maps to proprioceptive data. Any sounds or obvious
     skin contact in the video can to a lesser extent
     trigger previous experience. The creature's previous
     experience is chained together in short bursts to
     coherently describe the new scene.
- Recognition :: With the scene now described in terms of past
     experience, the creature can now run its
     action-identification programs on this synthesized
     sensory data, just as it would if it were actually
     experiencing the scene first-hand. If previous
     experience has been accurately retrieved, and if
     it is analogous enough to the scene, then the
     creature will correctly identify the action in the
     scene.

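To make that factorization concrete, here is a minimal sketch of how
the last three stages might compose, assuming free play has already
produced the embodied experience that =empathize= draws on. Every
name in this listing is a hypothetical illustration, not part of
=EMPATH= itself:

#+begin_listing clojure
#+begin_src clojure
;; Hypothetical sketch only: the stage functions are passed in as
;; arguments, since their real implementations are the subject of
;; the rest of this thesis. `action-predicates` is a sequence of
;; [name predicate] pairs.
(defn interpret-video
  "Recognize the action in `video` by aligning `body-model` to it,
   synthesizing imagined sensory experience, and running ordinary
   action predicates on the result."
  [align empathize action-predicates body-model video]
  (let [alignment (align body-model video)  ; Alignment stage
        imagined  (empathize alignment)]    ; Empathy stage
    ;; Recognition stage: the name of the first action predicate
    ;; satisfied by the imagined sensory experience, if any.
    (some (fn [[action-name predicate?]]
            (when (predicate? imagined) action-name))
          action-predicates)))
#+end_src
#+end_listing
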
For example, I think humans are able to label the cat video as
"drinking" because they imagine /themselves/ as the cat, and
imagine putting their face up against a stream of water and
sticking out their tongue. In that imagined world, they can feel
the cool water hitting their tongue, and feel the water entering
their body, and are able to recognize that /feeling/ as
drinking. So, the label of the action is not really in the pixels
of the image, but is found clearly in a simulation inspired by
those pixels. An imaginative system, having been trained on
drinking and non-drinking examples and learning that the most
important component of drinking is the feeling of water sliding
down one's throat, would analyze a video of a cat drinking in the
following manner:

1. Create a physical model of the video by putting a "fuzzy" model
   of its own body in place of the cat. Possibly also create a
   simulation of the stream of water.

2. Play out this simulated scene and generate imagined sensory
   experience. This will include relevant muscle contractions, a
   close up view of the stream from the cat's perspective, and most
   importantly, the imagined feeling of water entering the
   mouth. The imagined sensory experience can come both from a
   simulation of the event and from pattern-matching against
   previous, similar embodied experience.

3. The action is now easily identified as drinking by the sense of
   taste alone. The other senses (such as the tongue moving in and
   out) help to give plausibility to the simulated action. Note that
   the sense of vision, while critical in creating the simulation,
   is not critical for identifying the action from the simulation.

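If imagined sensory experience is represented in the same form as
first-hand experience, this last step reduces to an ordinary sensory
predicate in the style of those shown later in this section. As a
purely hypothetical illustration (the =:taste= representation below
is invented for this sketch):

#+begin_listing clojure
#+begin_src clojure
;; Hypothetical sketch: assumes each experience frame carries a
;; :taste map with a :water channel in [0.0, 1.0]. This
;; representation is invented for illustration only.
(defn drinking?
  "Is the creature drinking, judged by the sense of taste alone?"
  [experiences]
  (< 0.7 (-> experiences peek :taste :water)))
#+end_src
#+end_listing
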
For the chair examples, the process is even easier:

1. Align a model of your body to the person in the image.

2. Generate proprioceptive sensory data from this alignment.

3. Use the imagined proprioceptive data as a key to look up related
   sensory experience associated with that particular proprioceptive
   feeling (see the sketch after this list).

4. Retrieve the feeling of your bottom resting on a surface and
   your leg muscles relaxed.

5. This sensory information is consistent with the =sitting?=
   sensory predicate, so you (and the entity in the image) must be
   sitting.

6. There must be a chair-like object since you are sitting.

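A minimal sketch of the lookup in steps 2--5, assuming that
remembered experience was indexed by quantized proprioceptive data
during free play. Both function names and the binning scheme here
are inventions for this illustration:

#+begin_listing clojure
#+begin_src clojure
;; Hypothetical sketch: `experience-index` is assumed to be a map
;; from binned proprioceptive data (vectors of joint angles) to
;; full sensory frames, built up during free play. The binning
;; resolution is an arbitrary choice.
(defn bin-proprioception
  "Quantize joint angles so that similar poses share a key."
  [joint-angles]
  (mapv #(Math/round (* 10.0 %)) joint-angles))

(defn recall-experience
  "Retrieve the sensory frame remembered for a given pose, if any."
  [experience-index joint-angles]
  (get experience-index (bin-proprioception joint-angles)))
#+end_src
#+end_listing

Under these assumptions, =(sitting? (recall-experience
experience-index aligned-pose))= would complete steps 4 and 5.
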
Empathy offers yet another alternative to the age-old AI
representation question: ``What is a chair?'' --- A chair is the
feeling of sitting.

My program, =EMPATH=, uses this empathic problem solving technique
to interpret the actions of a simple, worm-like creature.

#+caption: The worm performs many actions during free play such as
#+caption: curling, wiggling, and resting.
#+name: worm-intro
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

#+caption: This sensory predicate detects when the worm is resting on the
#+caption: ground.
#+name: resting-intro
#+begin_listing clojure
#+begin_src clojure
(defn resting?
  "Is the worm resting on the ground?"
  [experiences]
  (every?
   (fn [touch-data]
     (< 0.9 (contact worm-segment-bottom touch-data)))
   (:touch (peek experiences))))
#+end_src
#+end_listing
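
The =contact= function used here is not defined in this excerpt; it
appears to measure how much of a named skin region is touching
something, returning a value between 0.0 and 1.0. A plausible
reconstruction, inferred only from the way it is called, might be:

#+begin_listing clojure
#+begin_src clojure
;; Plausible reconstruction, inferred from usage only: average the
;; activation of the touch receptors belonging to `region`, so that
;; 1.0 means the whole region is firmly in contact.
(defn contact
  "Average touch activation over the receptor indices in `region`."
  [region touch-data]
  (/ (reduce + (map #(nth touch-data %) region))
     (count region)))
#+end_src
#+end_listing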

#+caption: Body-centered actions are best expressed in a body-centered
#+caption: language. This code detects when the worm has curled into a
#+caption: full circle. Imagine how you would replicate this functionality
#+caption: using low-level pixel features such as HOG filters!
#+name: grand-circle-intro
#+begin_listing clojure
#+begin_src clojure
(defn grand-circle?
  "Does the worm form a majestic circle (one end touching the other)?"
  [experiences]
  (and (curled? experiences)
       (let [worm-touch (:touch (peek experiences))
             tail-touch (worm-touch 0)
             head-touch (worm-touch 4)]
         (and (< 0.55 (contact worm-segment-bottom-tip tail-touch))
              (< 0.55 (contact worm-segment-top-tip head-touch))))))
#+end_src
#+end_listing

#+caption: Even complicated actions such as ``wiggling'' are fairly simple
#+caption: to describe with a rich enough language.
#+name: wiggling-intro
#+begin_listing clojure
#+begin_src clojure
(defn wiggling?
  "Is the worm wiggling?"
  [experiences]
  (let [analysis-interval 0x40]
    (when (> (count experiences) analysis-interval)
      (let [a-flex 3
            a-ex   2
            muscle-activity
            (map :muscle (vector:last-n experiences analysis-interval))
            base-activity
            (map #(- (% a-flex) (% a-ex)) muscle-activity)]
        (= 2
           (first
            (max-indexed
             (map #(Math/abs %)
                  (take 20 (fft base-activity))))))))))
#+end_src
#+end_listing
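
=wiggling?= leans on three helpers that are not shown in this
excerpt: =fft=, =vector:last-n=, and =max-indexed=. The FFT is not
reconstructed here, but plausible implementations of the other two,
inferred from how they are called above, might look like:

#+begin_listing clojure
#+begin_src clojure
;; Plausible reconstructions, inferred from usage only.
(defn vector:last-n
  "Return the last n elements of the vector v."
  [v n]
  (subvec v (- (count v) n)))

(defn max-indexed
  "Return [index value] for the maximum element of s, so that
   `first` of the result is the index of the peak."
  [s]
  (apply max-key second (map-indexed vector s)))
#+end_src
#+end_listing

Read this way, the predicate checks for a dominant peak at index 2
of the FFT of the last 64 (=0x40=) frames of flexor-minus-extensor
muscle activity: wiggling is periodic motion at a particular
frequency.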

#+caption: The actions of a worm in a video can be recognized by
#+caption: combining proprioceptive data with sensory predicates and
#+caption: filling in the missing sensory detail with previous experience.
#+name: worm-recognition-intro
#+ATTR_LaTeX: :width 10cm
[[./images/wall-push.png]]

One powerful advantage of empathic problem solving is that it
factors the action recognition problem into two easier problems. To
use empathy, you need an /aligner/, which takes the video and a
model of your body, and aligns the model with the video. Then, you
need a /recognizer/, which uses the aligned model to interpret the
action. The power in this method lies in the fact that you describe
all actions from a body-centered, rich viewpoint. This way, if you
teach the system what ``running'' is, and you have a good enough
aligner, the system will from then on be able to recognize running
from any point of view, even strange points of view like above or
underneath the runner. This is in contrast to action recognition
schemes that try to identify actions using a non-embodied approach
such as TODO:REFERENCE. If these systems learn about running as viewed
from the side, they will not automatically be able to recognize
running from any other viewpoint.

Another powerful advantage is that using the language of multiple
body-centered rich senses to describe body-centered actions offers
a massive boost in descriptive capability. Consider how difficult
it would be to compose a set of HOG filters to describe the action
of a simple worm-creature "curling" so that its head touches its
tail, and then behold the simplicity of describing this action in a
language designed for the task (listing \ref{grand-circle-intro}).

** =CORTEX= is a toolkit for building sensate creatures

Hand integration demo

...

** \Phi-space describes the worm's experiences

** Empathy is the process of tracing through \Phi-space

** Efficient action recognition with =EMPATH=

* Contributions
  - Built =CORTEX=, a comprehensive platform for embodied AI
    experiments. =CORTEX= has many new features lacking in other
    systems, such as sound, and makes it easy to model and create
    new creatures.