Prediction as a Rule for Unsupervised Learning in Deep Neural Networks
Abstract
While machine learning systems have recently achieved impressive, (super)human-level performance in several tasks, they have often relied on unnatural amounts of supervision, e.g. large numbers of labeled images or continuous scores in video games. In contrast, human learning is largely unsupervised, driven by observation and interaction with the world. Emulating this type of learning in machines is an open challenge, and one that is critical for general artificial intelligence. Here, we explore prediction of future frames in video sequences as an unsupervised learning rule. A key insight is that to predict how the visual world will change over time, an agent must have at least some implicit model of object structure and the possible transformations objects can undergo. To this end, we have designed several models capable of accurate prediction in complex sequences. Our first model consists of a recurrent extension to the standard autoencoder framework. Trained end-to-end to predict the movement of synthetic stimuli, the model learns a representation of the underlying latent parameters of the 3D objects themselves. Importantly, we find that this representation is naturally tolerant to object transformations and generalizes well to new tasks, such as classification of static images. Similar models trained solely with a reconstruction loss fail to generalize as effectively. In addition, we explore the use of an adversarial loss, as in a Generative Adversarial Network, illustrating its complementary effects to traditional pixel losses for the task of next-frame prediction.
Next, we propose a novel architecture based on the concept of predictive coding from the neuroscience literature. The model, which we informally call the “PredNet”, is trained to continually make hierarchical predictions of future video frames. Top-down and lateral connections convey these predictions, and residual errors are propagated forward. We again find that the model learns a robust representation of the underlying stimuli in artificial video sequences. The model can also scale to complex natural image streams (car-mounted camera videos), capturing key aspects of both egocentric movement and the movement of objects in the visual scene. In this setting, the representation learned is useful for estimating the steering angle of the car.
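The core mechanic described above, each layer comparing its top-down prediction against the incoming signal and forwarding only the residual error, can be illustrated with a minimal sketch. This is not the thesis's implementation (the actual PredNet uses convolutional LSTM representation units operating on image tensors); the flat lists and function names here are purely illustrative.

```python
# Illustrative sketch of a PredNet-style error unit (not the actual model,
# which uses convolutional LSTMs on image tensors). A layer forwards the
# rectified difference between its prediction and its input, split into
# positive (over-prediction) and negative (under-prediction) halves, as is
# common in predictive coding formulations.

def relu(x):
    return [max(0.0, v) for v in x]

def error_unit(prediction, target):
    """Return the split rectified prediction error [E+, E-]."""
    over = relu([p - t for p, t in zip(prediction, target)])   # predicted too much
    under = relu([t - p for p, t in zip(prediction, target)])  # predicted too little
    return over + under  # concatenated error signal propagated forward

frame = [1.0, 3.0, 2.0]
# A perfect prediction yields zero error, so nothing propagates forward.
print(error_unit([1.0, 3.0, 2.0], frame))  # -> [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
print(error_unit([1.0, 2.0, 4.0], frame))  # -> [0.0, 0.0, 2.0, 0.0, 1.0, 0.0]
```

Because a correct prediction silences the error signal entirely, higher layers only receive input when the world deviates from expectation, which is what makes the scheme attractive as an unsupervised learning signal.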
Finally, we examine a variety of neural phenomena through the lens of our predictive coding model. First, we demonstrate that our model exhibits extra-classical receptive field effects commonly observed in biological visual processing, specifically end-stopping and surround suppression. These effects are disrupted when the recurrent connections in the model are silenced. Going beyond simple stimuli, we find that our model expresses a norm-based coding of faces, akin to neurophysiology findings in macaques. Lastly, our model provides insight into the well-studied flash-lag illusion. Trained on natural stimuli, the model's output predictions align with the common percept of the illusion, providing an empirical explanation of the effect. Altogether, our results suggest that prediction is a prominent component of neural processing. Combined with the machine learning experiments, our efforts demonstrate the potential of prediction as a powerful source of unsupervised learning in artificial and biological deep neural networks.
Terms of Use
This article is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Citable link to this page
http://nrs.harvard.edu/urn-3:HUL.InstRepos:39987892
Collections
- FAS Theses and Dissertations [6136]