#49 – Meta-Gradients in RL – Dr. Tomas Zahavy (DeepMind)

The race is on: we are on a collective mission to understand and create artificial general intelligence. Dr. Tom Zahavy, a Research Scientist at DeepMind, thinks that reinforcement learning is the most general learning framework that we have today, and in his opinion it could lead to artificial general intelligence. He thinks there are no tasks which could not be solved by simply maximising a reward.

Back in 2012, when Tom was an undergraduate, before the deep learning revolution, he attended an online lecture on how CNNs automatically discover representations. This was an epiphany for Tom. He decided in that very moment that he was going to become an ML researcher. Tom’s view is that the ability to recognise patterns and discover structure is the most important aspect of intelligence. This has been his quest ever since. He is particularly focused on using diversity preservation and meta-gradients to discover this structure.

In this discussion we dive deep into meta-gradients in reinforcement learning.

Tim Introduction [00:00:00]
Main show kick off [00:07:15]
On meta gradients [00:09:27]
Taxonomy of meta gradient methods developed in recent years [00:11:43]
Why don’t you just do one big learning run? [00:13:58]
Transfer learning / life long learning [00:16:01]
Does the meta algorithm also have hyperparameters? [00:17:55]
Are monolithic learning architectures bad then? [00:19:45]
Why not have the learning agent (self-) modify its own parameters? [00:24:44]
Learning optimizers using evolutionary approaches [00:26:29]
Which parameters should we leave alone in meta optimization? [00:28:24]
Evolutionary methods are great in this space! Diversity preservation [00:30:42]
Approaches to divergence, intrinsic control [00:33:25]
How to decide on parameters to optimise and build a meta learning framework [00:35:55]
Proxy models to move from discrete domain to differentiable domain [00:39:32]
Multi lifetime training — picking environments [00:43:35]
2016 Minecraft paper [00:46:07]
Lifelong learning [00:49:54]
Corporations are real world AIs. Could we recognise non-human AGIs? [00:52:09]
Tim invokes Francois Chollet, of course! [00:55:09]
But David Silver says that reward is all you need? [00:56:57]
Program centric generalization [00:59:59]
Sara Hooker — The hardware lottery, JAX, Bitter Lesson [01:02:10]
Concerning trends in the community right now? [01:05:15]
Unexplored areas in ML research? [01:06:47]
Should PhD students be going into meta gradient work? [01:08:18]
Is RL too hard for the average person to embark on? [01:10:45]
People back in the 80s had a pretty good idea already, concept papers were cool [01:15:16]
Non-stationary data, do you have to re-train the model all the time? [01:17:36]
Graying the Blackbox paper and visualizing the structure of DQNs with t-SNE [01:19:16]

Transcript: https://docs.google.com/document/d/142599ttt1O7gWed45_uGKlp1D8iVNkYeLF9waNgpgtA/edit?usp=sharing

Meta-Policy Gradients: A Survey [Robert Lange]
https://roberttlange.github.io/posts/2020/12/meta-policy-gradients/

A Self-Tuning Actor-Critic Algorithm [Tom Zahavy et al]
https://arxiv.org/abs/2002.12928

Graying the black box: Understanding DQNs [Zahavy et al]
https://utstat.toronto.edu/droy/icml16/publish/zahavy16.pdf

Is a picture worth a thousand words? [Tom Zahavy et al]
https://arxiv.org/abs/1611.09534

Diversity is All You Need: Learning Skills without a Reward Function [Benjamin Eysenbach et al]
https://arxiv.org/abs/1802.06070

Evolutionary principles in self-referential learning [Jürgen Schmidhuber]
https://people.idsia.ch//~juergen/diploma.html

Learn What Not to Learn: Action Elimination with Deep Reinforcement Learning [Tom Zahavy et al]
https://arxiv.org/abs/1809.02121

Training Learned Optimizers with Randomly Initialized Learned Optimizers [Luke Metz et al]
https://arxiv.org/pdf/2101.07367.pdf

A Deep Hierarchical Approach to Lifelong Learning in Minecraft [Chen Tessler, ..Tom Zahavy et al]
https://arxiv.org/abs/1604.07255

Mutual Information State Intrinsic Control [Rui Zhao et al]
https://openreview.net/pdf?id=OthEq8I5v1

Mutual Information-based State-Control for Intrinsically Motivated Reinforcement Learning [Rui Zhao et al]
https://arxiv.org/abs/2002.01963

Rainbow: Combining Improvements in Deep Reinforcement Learning [Matteo Hessel et al]
https://arxiv.org/abs/1710.02298

Variational Intrinsic Control [Karol Gregor et al]
https://arxiv.org/abs/1611.07507

Meta-Gradient Reinforcement Learning [Zhongwen Xu et al]
https://arxiv.org/abs/1805.09801

On Learning Intrinsic Rewards for Policy Gradient Methods [Zeyu Zheng, Junhyuk Oh, Satinder Singh]
https://arxiv.org/abs/1804.06459

Visuals and music: melodysheep
Please support them on Patreon and buy their soundtrack, as we did, at https://melodysheep.bandcamp.com/album/life-beyond-chapter-1-original-soundtrack
LIFE BEYOND: Chapter 1: https://www.youtube.com/watch?v=SUelbSa-OkA
Keep in mind that MLST is 100% non-monetized, non-commercial and educational
