cortex: changeset 447:284316604be0
minor changes from Dylan.
| author   | Robert McIntyre <rlm@mit.edu>                       |
|----------|-----------------------------------------------------|
| date     | Tue, 25 Mar 2014 11:30:15 -0400                     |
| parents  | 3e91585b2a1c                                        |
| children | af13fc73e851                                        |
| files    | thesis/cortex.org                                   |
| diffstat | 1 files changed, 76 insertions(+), 71 deletions(-)  |
--- a/thesis/cortex.org Tue Mar 25 03:24:28 2014 -0400
+++ b/thesis/cortex.org Tue Mar 25 11:30:15 2014 -0400
@@ -10,9 +10,9 @@
 By the end of this thesis, you will have seen a novel approach to
 interpreting video using embodiment and empathy. You will have also
 seen one way to efficiently implement empathy for embodied
- creatures. Finally, you will become familiar with =CORTEX=, a
- system for designing and simulating creatures with rich senses,
- which you may choose to use in your own research.
+ creatures. Finally, you will become familiar with =CORTEX=, a system
+ for designing and simulating creatures with rich senses, which you
+ may choose to use in your own research.
 
 This is the core vision of my thesis: That one of the important ways
 in which we understand others is by imagining ourselves in their
@@ -26,8 +26,8 @@
 
 ** Recognizing actions in video is extremely difficult
 
- Consider for example the problem of determining what is happening in
- a video of which this is one frame:
+ Consider for example the problem of determining what is happening
+ in a video of which this is one frame:
 
 #+caption: A cat drinking some water. Identifying this action is
 #+caption: beyond the state of the art for computers.
@@ -35,14 +35,14 @@
 [[./images/cat-drinking.jpg]]
 
 It is currently impossible for any computer program to reliably
- label such a video as "drinking". And rightly so -- it is a very
+ label such a video as ``drinking''. And rightly so -- it is a very
 hard problem! What features can you describe in terms of low level
 functions of pixels that can even begin to describe at a high level
 what is happening here?
 
- Or suppose that you are building a program that recognizes
- chairs. How could you ``see'' the chair in figure
- \ref{invisible-chair} and figure \ref{hidden-chair}?
+ Or suppose that you are building a program that recognizes chairs.
+ How could you ``see'' the chair in figure \ref{invisible-chair} and
+ figure \ref{hidden-chair}?
 
 #+caption: When you look at this, do you think ``chair''? I certainly do.
 #+name: invisible-chair
@@ -69,9 +69,9 @@
 on in our minds as we easily solve these recognition problems.
 
 The hidden chairs show us that we are strongly triggered by cues
- relating to the position of human bodies, and that we can
- determine the overall physical configuration of a human body even
- if much of that body is occluded.
+ relating to the position of human bodies, and that we can determine
+ the overall physical configuration of a human body even if much of
+ that body is occluded.
 
 The picture of the girl pushing against the wall tells us that we
 have common sense knowledge about the kinetics of our own bodies.
@@ -85,58 +85,54 @@
 problems above in a form amenable to computation. It is split into
 four parts:
 
- - Free/Guided Play :: The creature moves around and experiences the
- world through its unique perspective. Many otherwise
- complicated actions are easily described in the language of a
- full suite of body-centered, rich senses. For example,
- drinking is the feeling of water sliding down your throat, and
- cooling your insides. It's often accompanied by bringing your
- hand close to your face, or bringing your face close to
- water. Sitting down is the feeling of bending your knees,
- activating your quadriceps, then feeling a surface with your
- bottom and relaxing your legs. These body-centered action
+ - Free/Guided Play (Training) :: The creature moves around and
+ experiences the world through its unique perspective. Many
+ otherwise complicated actions are easily described in the
+ language of a full suite of body-centered, rich senses. For
+ example, drinking is the feeling of water sliding down your
+ throat, and cooling your insides. It's often accompanied by
+ bringing your hand close to your face, or bringing your face
+ close to water. Sitting down is the feeling of bending your
+ knees, activating your quadriceps, then feeling a surface with
+ your bottom and relaxing your legs. These body-centered action
 descriptions can be either learned or hard coded.
- - Alignment :: When trying to interpret a video or image, the
- creature takes a model of itself and aligns it with
- whatever it sees. This can be a rather loose
- alignment that can cross species, as when humans try
- to align themselves with things like ponies, dogs,
- or other humans with a different body type.
- - Empathy :: The alignment triggers the memories of previous
- experience. For example, the alignment itself easily
- maps to proprioceptive data. Any sounds or obvious
- skin contact in the video can to a lesser extent
- trigger previous experience. The creatures previous
- experience is chained together in short bursts to
- coherently describe the new scene.
- - Recognition :: With the scene now described in terms of past
- experience, the creature can now run its
- action-identification programs on this synthesized
- sensory data, just as it would if it were actually
- experiencing the scene first-hand. If previous
- experience has been accurately retrieved, and if
- it is analogous enough to the scene, then the
- creature will correctly identify the action in the
- scene.
-
-
+ - Alignment (Posture imitation) :: When trying to interpret a video
+ or image, the creature takes a model of itself and aligns it
+ with whatever it sees. This alignment can even cross species,
+ as when humans try to align themselves with things like
+ ponies, dogs, or other humans with a different body type.
+ - Empathy (Sensory extrapolation) :: The alignment triggers
+ associations with sensory data from prior experiences. For
+ example, the alignment itself easily maps to proprioceptive
+ data. Any sounds or obvious skin contact in the video can to a
+ lesser extent trigger previous experience. Segments of
+ previous experiences are stitched together to form a coherent
+ and complete sensory portrait of the scene.
+ - Recognition (Classification) :: With the scene described in terms
+ of first person sensory events, the creature can now run its
+ action-identification programs on this synthesized sensory
+ data, just as it would if it were actually experiencing the
+ scene first-hand. If previous experience has been accurately
+ retrieved, and if it is analogous enough to the scene, then
+ the creature will correctly identify the action in the scene.
+
 For example, I think humans are able to label the cat video as
- "drinking" because they imagine /themselves/ as the cat, and
+ ``drinking'' because they imagine /themselves/ as the cat, and
 imagine putting their face up against a stream of water and
 sticking out their tongue. In that imagined world, they can feel
 the cool water hitting their tongue, and feel the water entering
- their body, and are able to recognize that /feeling/ as
- drinking. So, the label of the action is not really in the pixels
- of the image, but is found clearly in a simulation inspired by
- those pixels. An imaginative system, having been trained on
- drinking and non-drinking examples and learning that the most
- important component of drinking is the feeling of water sliding
- down one's throat, would analyze a video of a cat drinking in the
- following manner:
+ their body, and are able to recognize that /feeling/ as drinking.
+ So, the label of the action is not really in the pixels of the
+ image, but is found clearly in a simulation inspired by those
+ pixels. An imaginative system, having been trained on drinking and
+ non-drinking examples and learning that the most important
+ component of drinking is the feeling of water sliding down one's
+ throat, would analyze a video of a cat drinking in the following
+ manner:
 
- 1. Create a physical model of the video by putting a "fuzzy" model
- of its own body in place of the cat. Possibly also create a
- simulation of the stream of water.
+ 1. Create a physical model of the video by putting a ``fuzzy''
+ model of its own body in place of the cat. Possibly also create
+ a simulation of the stream of water.
 
 2. Play out this simulated scene and generate imagined sensory
 experience. This will include relevant muscle contractions, a
@@ -184,13 +180,12 @@
 #+ATTR_LaTeX: :width 15cm
 [[./images/worm-intro-white.png]]
 
- #+caption: The actions of a worm in a video can be recognized by
- #+caption: proprioceptive data and sentory predicates by filling
- #+caption: in the missing sensory detail with previous experience.
+ #+caption: =EMPATH= recognized and classified each of these poses by
+ #+caption: inferring the complete sensory experience from
+ #+caption: proprioceptive data.
 #+name: worm-recognition-intro
 #+ATTR_LaTeX: :width 15cm
 [[./images/worm-poses.png]]
-
 
 One powerful advantage of empathic problem solving is that it
 factors the action recognition problem into two easier problems. To
@@ -198,22 +193,23 @@
 model of your body, and aligns the model with the video. Then, you
 need a /recognizer/, which uses the aligned model to interpret the
 action. The power in this method lies in the fact that you describe
- all actions form a body-centered, rich viewpoint. This way, if you
+ all actions form a body-centered, viewpoint You are less tied to
+ the particulars of any visual representation of the actions. If you
 teach the system what ``running'' is, and you have a good enough
 aligner, the system will from then on be able to recognize running
 from any point of view, even strange points of view like above or
 underneath the runner. This is in contrast to action recognition
 schemes that try to identify actions using a non-embodied approach
- such as TODO:REFERENCE. If these systems learn about running as viewed
- from the side, they will not automatically be able to recognize
- running from any other viewpoint.
+ such as TODO:REFERENCE. If these systems learn about running as
+ viewed from the side, they will not automatically be able to
+ recognize running from any other viewpoint.
 
 Another powerful advantage is that using the language of multiple
 body-centered rich senses to describe body-centerd actions offers a
 massive boost in descriptive capability. Consider how difficult it
 would be to compose a set of HOG filters to describe the action of
- a simple worm-creature "curling" so that its head touches its tail,
- and then behold the simplicity of describing thus action in a
+ a simple worm-creature ``curling'' so that its head touches its
+ tail, and then behold the simplicity of describing thus action in a
 language designed for the task (listing \ref{grand-circle-intro}):
 
 #+caption: Body-centerd actions are best expressed in a body-centered
@@ -293,8 +289,8 @@
 that creature is feeling. My empathy algorithm involves multiple
 phases. First is free-play, where the creature moves around and gains
 sensory experience. From this experience I construct a representation
-of the creature's sensory state space, which I call \phi-space. Using
-\phi-space, I construct an efficient function for enriching the
+of the creature's sensory state space, which I call \Phi-space. Using
+\Phi-space, I construct an efficient function for enriching the
 limited data that comes from observing another creature with a full
 compliment of imagined sensory data based on previous experience. I
 can then use the imagined sensory data to recognize what the observed
@@ -313,4 +309,13 @@
 
 
 * COMMENT names for cortex
- - bioland
\ No newline at end of file
+ - bioland
+
+
+
+
+# An anatomical joke:
+# - Training
+# - Skeletal imitation
+# - Sensory fleshing-out
+# - Classification
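
Editor's note: the changeset refers to a body-centered language for describing actions (the =grand-circle-intro= listing) and to a \Phi-space of prior sensory experience, but neither listing falls inside this diff's context lines. The Clojure sketch below illustrates what such definitions could look like; it is a hypothetical sketch, not the thesis's code. The names `curled?`, `infer-experience`, and `phi-space`, and the assumption that proprioception arrives as a vector of joint angles, are invented for this example.

;; Hypothetical sketch only -- not the thesis's actual grand-circle-intro
;; listing.  Assumes proprioception is a vector of joint angles (radians,
;; one per worm joint) and that phi-space is a sequence of past experience
;; maps like {:proprioception [...], :touch [...], :muscle [...]}.

(defn curled?
  "True when every joint bends in the same direction and the total bend
   is large enough that the head approaches the tail."
  [joint-angles]
  (let [total (reduce + joint-angles)]
    (and (apply = (map #(Math/signum (double %)) joint-angles))
         (> (Math/abs (double total)) (* 1.5 Math/PI)))))

(defn infer-experience
  "Empathy step: return the stored experience whose proprioceptive
   signature is closest (nearest neighbor) to the observed angles."
  [phi-space observed]
  (apply min-key
         (fn [experience]
           (reduce + (map #(Math/pow (- %1 %2) 2.0)
                          (:proprioception experience)
                          observed)))
         phi-space))

(comment
  ;; Recognition runs on imagined, body-centered data, never on pixels:
  (curled? [0.9 1.0 1.1 1.0 0.9])                              ;=> true
  (:touch (infer-experience phi-space [0.9 1.0 1.1 1.0 0.9])))

This mirrors the argument made in the diff: the predicate inspects joint angles rather than pixels, so once alignment and empathy have supplied proprioceptive data, the same definition recognizes the action from any viewpoint.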