CORTEX
Artificial Imagination
Imagine watching a video of someone skateboarding. When you watch the video, you can imagine yourself skateboarding, and your knowledge of the human body and its dynamics guides your interpretation of the scene. For example, even if the skateboarder is partially occluded, you can infer the positions of his arms and body from your own knowledge of how your body would be positioned if you were skateboarding. If the skateboarder suffers an accident, you wince in sympathy, imagining the pain your own body would experience if it were in the same situation. This empathy with other people guides our understanding of whatever they are doing because it is a powerful constraint on what is probable and possible. In order to make use of this powerful empathy constraint, I need a system that can generate and make sense of sensory data from the many different senses that humans possess. The two key properties of such a system are embodiment and imagination.
What is imagination?
One kind of imagination is sympathetic imagination: you imagine yourself in the position of something/someone you are observing. This type of imagination comes into play when you follow along visually when watching someone perform actions, or when you sympathetically grimace when someone hurts themselves. This type of imagination uses the constraints you have learned about your own body to sharply limit the possibilities in whatever you are seeing. It uses all your senses, including your senses of touch, proprioception, etc. Humans are flexible when it comes to "putting themselves in another's shoes," and can sympathetically understand not only other humans, but entities ranging from animals to cartoon characters to single dots on a screen!
Another kind of imagination is predictive imagination: you construct scenes in your mind that are not entirely related to whatever you are observing, but instead are predictions of the future or simply flights of fancy. You use this type of imagination to plan out multi-step actions, or play out dangerous situations in your mind so as to avoid messing them up in reality.
Of course, sympathetic and predictive imagination blend into each other and are not completely separate concepts. One dimension along which you can distinguish types of imagination is dependence on raw sense data. Sympathetic imagination is highly constrained by your senses, while predictive imagination can be more or less dependent on your senses depending on how far ahead you imagine. Daydreaming is an extreme form of predictive imagination that wanders through different possibilities without concern for whether they are related to whatever is happening in reality.
For this thesis, I will mostly focus on sympathetic imagination and the constraint it provides for understanding sensory data.
What problems can imagination solve?
Consider a video of a cat drinking some water.
A cat drinking some water. Identifying this action is beyond the state of the art for computers.
It is currently impossible for any computer program to reliably label such a video as "drinking". I think humans are able to label such a video as "drinking" because they imagine themselves as the cat, and imagine putting their face up against a stream of water and sticking out their tongue. In that imagined world, they can feel the cool water hitting their tongue and feel the water entering their body, and they are able to recognize that feeling as drinking. So the label of the action is not really in the pixels of the image, but is found clearly in a simulation inspired by those pixels. An imaginative system, having been trained on drinking and non-drinking examples and having learned that the most important component of drinking is the feeling of water sliding down one's throat, would analyze a video of a cat drinking in the following manner (sketched in code after the list):
- Create a physical model of the scene in the video by putting a "fuzzy" model of its own body in place of the cat. Also, create a simulation of the stream of water.
- Play out this simulated scene and generate imagined sensory experience. This will include relevant muscle contractions, a close up view of the stream from the cat's perspective, and most importantly, the imagined feeling of water entering the mouth.
- The action is now easily identified as drinking by the sense of taste alone. The other senses (such as the tongue moving in and out) help to give plausibility to the simulated action. Note that the sense of vision, while critical in creating the simulation, is not critical for identifying the action from the simulation.
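To make the shape of this computation concrete, here is a minimal Clojure sketch of the three steps above. All of the names (fit-body-model, simulate-senses, classify-by-senses) are hypothetical placeholders standing in for substantial subsystems, not actual Cortex functions:

```clojure
;; Hypothetical placeholders, declared only so the sketch compiles.
(declare fit-body-model simulate-senses classify-by-senses)

(defn label-action
  "Guess the action in `video` by imagining performing it."
  [video action-database]
  (let [;; Step 1: fit a 'fuzzy' model of one's own body (plus props
        ;; such as the stream of water) to each frame of the video.
        scene-model     (fit-body-model video)
        ;; Step 2: play the fitted scene forward in a physics
        ;; simulation, generating imagined muscle, touch, taste,
        ;; and proprioceptive data.
        imagined-senses (simulate-senses scene-model)]
    ;; Step 3: identify the action from the imagined non-visual
    ;; senses, e.g. the feeling of water entering the mouth.
    (classify-by-senses imagined-senses action-database)))
```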
More generally, I expect imaginative systems to be particularly good at identifying embodied actions in videos.
Cortex
The previous example involves liquids, the sense of taste, and imagining oneself as a cat. For this thesis I constrain myself to simpler, more easily digitizable senses and situations.
My system, Cortex, performs imagination in two different simplified worlds: worm world and stick figure world. In each of these worlds, entities capable of imagination recognize actions by simulating the experience from their own perspective and then matching the simulated experience against a database of examples.
In order to serve as a framework for experiments in imagination, Cortex requires simulated bodies, worlds, and senses like vision, hearing, touch, proprioception, etc.
A Video Game Engine takes care of some of the groundwork
When it comes to simulation environments, the engines used to create the worlds in video games offer top-notch physics and graphics support. These engines also have limited support for creating cameras and rendering 3D sound, which can be repurposed for vision and hearing respectively. Physics collision detection can be expanded to create a sense of touch.
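As a small illustration of that last point, here is a hedged Clojure sketch of one plausible way to turn collision detection into touch, using the ray-casting facilities of jMonkeyEngine3 (the engine introduced in the next paragraph). This is only one possible encoding, not necessarily the scheme Cortex uses: cast a short "feeler" ray outward from a point on the body's surface and check whether anything lies within reach.

```clojure
(import '(com.jme3.math Ray Vector3f)
        '(com.jme3.collision CollisionResults))

(defn feeler-touch?
  "Cast a short ray from `origin` along `direction` into `scene`
   (a jME3 Spatial). Returns true if anything lies within `reach`
   world units -- a crude single-point touch sensor."
  [scene ^Vector3f origin ^Vector3f direction reach]
  (let [ray     (doto (Ray. origin (.normalize direction))
                  (.setLimit (float reach)))
        results (CollisionResults.)]
    (.collideWith scene ray results)
    (pos? (.size results))))
```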
jMonkeyEngine3 is one such engine for creating video games in Java. It uses OpenGL to render to the screen and uses scene graphs to avoid drawing things that do not appear on the screen. It has an active community and several games in the pipeline. The engine was not built to serve any particular game but is instead meant to be used for any 3D game. I chose jMonkeyEngine3 because it had the most features out of all the open projects I looked at, and because I could then write my code in Clojure, an implementation of Lisp that runs on the JVM.
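For a sense of what that interop looks like, the following hello-world sketch (not Cortex code, just the pattern Cortex builds on) opens a jMonkeyEngine3 window containing a single blue box, assuming jME3 is on the classpath:

```clojure
(import '(com.jme3.app SimpleApplication)
        '(com.jme3.material Material)
        '(com.jme3.math ColorRGBA)
        '(com.jme3.scene Geometry)
        '(com.jme3.scene.shape Box))

(defn hello-world-app
  "A minimal jMonkeyEngine3 application: one blue unit box."
  []
  (proxy [SimpleApplication] []
    (simpleInitApp []
      (let [geom (Geometry. "box" (Box. 1 1 1))
            mat  (Material. (.getAssetManager this)
                            "Common/MatDefs/Misc/Unshaded.j3md")]
        (.setColor mat "Color" ColorRGBA/Blue)
        (.setMaterial geom mat)
        (.attachChild (.getRootNode this) geom)))))

;; (.start (hello-world-app)) ; opens a window and renders the box
```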
CORTEX extends jMonkeyEngine3 to implement rich senses
Using the game-making primitives provided by jMonkeyEngine3, I have constructed every major human sense except for smell and taste. Cortex also provides an interface for creating creatures in Blender, a 3D modeling environment, and then "rigging" the creatures with senses using 3D annotations in Blender. A creature can have any number of senses, and there can be any number of creatures in a simulation.
The senses available in Cortex are:
- vision
- hearing
- touch
- proprioception
- muscle tension
A roadmap for Cortex experiments
Worm World
Worms in Cortex are segmented creatures which vary in length and number of segments, and have the senses of vision, proprioception, touch, and muscle tension.
This is the tactile-sensor-profile for the upper segment of a worm. It defines regions of high touch sensitivity (where there are many white pixels) and regions of low sensitivity (where white pixels are sparse).
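One simple way to convert such a profile into individual sensors, sketched here under my own assumptions rather than as Cortex's actual scheme, is to read the image and treat each sufficiently white pixel as the location of one touch receptor:

```clojure
(import '(javax.imageio ImageIO)
        '(java.io File))

(defn sensor-coordinates
  "Return the [x y] pixel coordinates of every 'white enough' pixel
   in a tactile-sensor-profile image -- one touch receptor per white
   pixel, so denser white means higher sensitivity."
  [image-path]
  (let [img (ImageIO/read (File. image-path))]
    (for [x (range (.getWidth img))
          y (range (.getHeight img))
          ;; a red channel near maximum picks out the white regions
          ;; of a grayscale profile image
          :when (> (bit-and (bit-shift-right (.getRGB img x y) 16) 0xFF)
                   200)]
      [x y])))
```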
Video: The worm responds to touch.
Video: Proprioception in a worm. The proprioceptive readout is in the upper left corner of the screen.
A worm is trained in various actions such as sinusoidal movement, curling, flailing, and spinning by directly playing motor contractions while the worm "feels" the experience. These actions are recorded both as vectors of muscle tension, touch, and proprioceptive data, and in higher-level forms such as the frequencies of the various contractions and a symbolic name for the action.
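Under assumed data shapes of my own choosing (the exact keys are not Cortex's actual schema), one recorded training example might look like this Clojure map:

```clojure
;; A hypothetical training record: raw per-frame sense vectors plus
;; higher-level summaries and a symbolic label.
(def example-action
  {:name           :curling
   :muscle         [[0.0 0.7 0.1] [0.0 0.9 0.2]]  ; tension per muscle, per frame
   :touch          [#{:segment-2} #{:segment-2 :segment-3}]
   :proprioception [[0.1 -0.4] [0.2 -0.6]]        ; joint angles per frame
   :frequencies    {:contraction-hz 2.5}})        ; higher-level summary
```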
Then, the worm watches a video of another worm performing one of the actions, and must judge which action was performed. Normally this would be an extremely difficult problem, but the worm is able to greatly diminish the search space through sympathetic imagination. First, it creates an imagined copy of its body which it observes from a third person point of view. Then for each frame of the video, it maneuvers its simulated body to be in registration with the worm depicted in the video. The physical constraints imposed by the physics simulation greatly decrease the number of poses that have to be tried, making the search feasible. As the imaginary worm moves, it generates imaginary muscle tension and proprioceptive sensations. The worm determines the action not by vision, but by matching the imagined proprioceptive data with previous examples.
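The final matching step can be sketched as a nearest-neighbor lookup over proprioceptive traces. The distance function and database shape below are my hedged assumptions, not Cortex's implementation:

```clojure
(defn trace-distance
  "Summed Euclidean distance between two equal-length sequences of
   joint-angle vectors (one vector per frame)."
  [trace-a trace-b]
  (reduce + (map (fn [frame-a frame-b]
                   (Math/sqrt (reduce + (map #(let [d (- %1 %2)]
                                                (* d d))
                                             frame-a frame-b))))
                 trace-a trace-b)))

(defn identify-action
  "Pick the stored action whose proprioceptive trace is closest to
   the trace imagined while following the video."
  [imagined-trace action-database]
  (:name (apply min-key
                #(trace-distance imagined-trace (:proprioception %))
                action-database)))
```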
By using non-visual sensory data such as touch, the worms can also answer body-related questions such as "did your head touch your tail?" and "did worm A touch worm B?"
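Such queries reduce to simple predicates over the recorded touch data. Assuming a hypothetical touch stream of one map per frame, from body-segment id to the set of things that segment currently contacts:

```clojure
(defn head-touched-tail?
  "Did the head segment ever contact the tail segment?"
  [touch-frames]
  (boolean (some #(contains? (get % :head #{}) :tail) touch-frames)))
```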
The proprioceptive information used for action identification is body-centric, so only the registration step depends on point of view; the identification step does not. Registration is also not specific to any particular action. Thus, action identification divides into a point-of-view-dependent but action-generic registration step, and an action-specific step that is body-centered and invariant to point of view.
Stick Figure World
This environment is similar to Worm World, except the creatures are more complicated and the actions and questions more varied. It is an experiment to see how far imagination can go in interpreting actions.