Add RL Unplugged data loading code and examples

PiperOrigin-RevId: 321746296
Sergio Gomez
2020-07-17 11:07:55 +01:00
committed by Saran Tunyasuvunakool
parent bd29e1b710
commit 1ea4cc033c
12 changed files with 3021 additions and 5 deletions
+1 -1
@@ -24,7 +24,7 @@ https://deepmind.com/research/publications/
## Projects
* [RL Unplugged: Benchmarks for Offline Reinforcement Learning] (rl_unplugged)
* [RL Unplugged: Benchmarks for Offline Reinforcement Learning](rl_unplugged)
* [Disentangling by Subspace Diffusion (GEOMANCER)](geomancer)
* [What can I do here? A theory of affordances in reinforcement learning](affordances_theory), ICML 2020
* [Scaling data-driven robotics with reward sketching and batch reinforcement learning](sketchy), RSS 2020
+25 -4
@@ -26,8 +26,6 @@ In this suite of benchmarks, we try to focus on the following problems:
The data is available under
[RL Unplugged GCP bucket](https://console.cloud.google.com/storage/browser/rl_unplugged).
Data loading code and examples will be available soon.
## Atari Dataset
We are releasing a large and diverse dataset of gameplay following the protocol
@@ -40,7 +38,7 @@ transition include stacks of four frames to be able to do frame-stacking with
our baselines. We release datasets for 46 Atari games. For details on how the
dataset was generated, please refer to the paper.
## Deepmind Locomotion Dataset
## DeepMind Locomotion Dataset
These tasks are made up of the corridor locomotion tasks involving the CMU
Humanoid, for which prior efforts have either used motion capture data [Merel et
@@ -51,7 +49,7 @@ Locomotion tasks feature the combination of challenging high-DoF continuous
control along with perception from rich egocentric observations. For details on
how the dataset was generated, please refer to the paper.
## Deepmind Control Suite Dataset
## DeepMind Control Suite Dataset
DeepMind Control Suite [Tassa et al., 2018] is a set of control tasks
implemented in MuJoCo [Todorov et al., 2012]. We consider a subset of the tasks
@@ -73,6 +71,29 @@ We release 8 datasets in total -- with no combined challenge and easy combined
challenge on the cartpole, walker, quadruped, and humanoid tasks. For details on
how the dataset was generated, please refer to the paper.
## Running the code
### Installation
* Install dependencies: `pip install -r requirements.txt`
* (Optional) Set up MuJoCo license key for DM Control environments
([instructions](https://github.com/deepmind/dm_control#requirements-and-installation)).
* (Optional) Install
[realworldrl_suite](https://github.com/google-research/realworldrl_suite#installation).
### Atari example
```
mkdir -p /tmp/dataset/Asterix
gsutil cp gs://rl_unplugged/atari/Asterix/run_1-00000-of-00100 \
/tmp/dataset/Asterix/run_1-00000-of-00001
python atari_example.py --path=/tmp/dataset --game=Asterix
```
This copies a single shard from one of the Asterix datasets from GCP to a local
folder, and then runs a script that loads a single example, prints its spec, and
resets the Atari environment.
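The shard is renamed on copy because the example script loads the data with `num_shards=1`, and the loader derives its file names from that count. A minimal sketch of the naming scheme, mirroring the pattern used in `atari.dataset` (the helper name `shard_filenames` is made up for this illustration and is not part of the package):

```python
import os


def shard_filenames(path, game, run, num_shards):
  """Builds the shard paths the loader would look for, one per shard."""
  prefix = os.path.join(path, f'{game}/run_{run}')
  return [f'{prefix}-{i:05d}-of-{num_shards:05d}' for i in range(num_shards)]


# With num_shards=1 the loader expects a single file ending in -of-00001,
# so a shard downloaded as run_1-00000-of-00100 has to be renamed.
print(shard_filenames('/tmp/dataset', 'Asterix', run=1, num_shards=1))
```

To load the full run instead, all 100 shards would need to be copied under their original names and `num_shards` left at 100.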
## Citation
Please use the following bibtex for citations:
+252
@@ -0,0 +1,252 @@
# Lint as: python3
# Copyright 2020 DeepMind Technologies Limited.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Atari RL Unplugged datasets.
Examples in the dataset represent SARSA transitions stored during a
DQN training run as described in https://arxiv.org/pdf/1907.04543.
For every training run we have recorded all 50 million transitions corresponding
to 200 million environment steps (4x factor because of frame skipping). There
are 5 separate datasets for each of the 45 games.
Every transition in the dataset is a tuple containing the following features:
* o_t: Observation at time t. Observations have been processed using the
canonical Atari frame processing, including 4x frame stacking. The shape
of a single observation is [84, 84, 4].
* a_t: Action taken at time t.
* r_t: Reward after a_t.
* d_t: Discount after a_t.
* o_tp1: Observation at time t+1.
* a_tp1: Action at time t+1.
* extras:
* episode_id: Episode identifier.
* episode_return: Total episode return computed using per-step [-1, 1]
clipping.
"""
import functools
import os
from typing import Dict
from acme import wrappers
import dm_env
from dm_env import specs
from dopamine.discrete_domains import atari_lib
import reverb
import tensorflow as tf
# 9 tuning games.
TUNING_SUITE = [
'BeamRider',
'DemonAttack',
'DoubleDunk',
'IceHockey',
'MsPacman',
'Pooyan',
'RoadRunner',
'Robotank',
'Zaxxon',
]
# 37 testing games.
TESTING_SUITE = [
'Alien',
'Amidar',
'Assault',
'Asterix',
'Atlantis',
'BankHeist',
'BattleZone',
'Boxing',
'Breakout',
'Carnival',
'Centipede',
'ChopperCommand',
'CrazyClimber',
'Enduro',
'FishingDerby',
'Freeway',
'Frostbite',
'Gopher',
'Gravitar',
'Hero',
'Jamesbond',
'Kangaroo',
'Krull',
'KungFuMaster',
'NameThisGame',
'Phoenix',
'Pong',
'Qbert',
'Riverraid',
'Seaquest',
'SpaceInvaders',
'StarGunner',
'TimePilot',
'UpNDown',
'VideoPinball',
'WizardOfWor',
'YarsRevenge',
]
# Total of 46 games.
ALL = TUNING_SUITE + TESTING_SUITE
def _decode_frames(pngs):
  """Decode PNGs.

  Args:
    pngs: String Tensor of size (4,) containing PNG encoded images.

  Returns:
    4 84x84 grayscale images packed in a (84, 84, 4) uint8 Tensor.
  """
  # Statically unroll png decoding
  frames = [tf.image.decode_png(pngs[i], channels=1) for i in range(4)]
  frames = tf.concat(frames, axis=2)
  frames.set_shape((84, 84, 4))
  return frames
def _make_reverb_sample(o_t,
                        a_t,
                        r_t,
                        d_t,
                        o_tp1,
                        a_tp1,
                        extras):
  """Create Reverb sample with offline data.

  Args:
    o_t: Observation at time t.
    a_t: Action at time t.
    r_t: Reward at time t.
    d_t: Discount at time t.
    o_tp1: Observation at time t+1.
    a_tp1: Action at time t+1.
    extras: Dictionary with extra features.

  Returns:
    Replay sample with fake info: key=0, probability=1, table_size=0.
  """
  info = reverb.SampleInfo(key=tf.constant(0, tf.uint64),
                           probability=tf.constant(1.0, tf.float64),
                           table_size=tf.constant(0, tf.int64),
                           priority=tf.constant(1.0, tf.float64))
  data = (o_t, a_t, r_t, d_t, o_tp1, a_tp1, extras)
  return reverb.ReplaySample(info=info, data=data)
def _tf_example_to_reverb_sample(tf_example):
  """Create a Reverb replay sample from a TF example."""

  # Parse tf.Example.
  feature_description = {
      'o_t': tf.io.FixedLenFeature([4], tf.string),
      'o_tp1': tf.io.FixedLenFeature([4], tf.string),
      'a_t': tf.io.FixedLenFeature([], tf.int64),
      'a_tp1': tf.io.FixedLenFeature([], tf.int64),
      'r_t': tf.io.FixedLenFeature([], tf.float32),
      'd_t': tf.io.FixedLenFeature([], tf.float32),
      'episode_id': tf.io.FixedLenFeature([], tf.int64),
      'episode_return': tf.io.FixedLenFeature([], tf.float32),
  }
  data = tf.io.parse_single_example(tf_example, feature_description)

  # Process data.
  o_t = _decode_frames(data['o_t'])
  o_tp1 = _decode_frames(data['o_tp1'])
  a_t = tf.cast(data['a_t'], tf.int32)
  a_tp1 = tf.cast(data['a_tp1'], tf.int32)
  episode_id = tf.bitcast(data['episode_id'], tf.uint64)

  # Build Reverb replay sample.
  extras = {
      'episode_id': episode_id,
      'return': data['episode_return']
  }
  return _make_reverb_sample(o_t, a_t, data['r_t'], data['d_t'], o_tp1, a_tp1,
                             extras)
def dataset(path,
            game,
            run,
            num_shards=100,
            shuffle_buffer_size=100000):
  """TF dataset of Atari SARSA tuples."""
  path = os.path.join(path, f'{game}/run_{run}')
  filenames = [f'{path}-{i:05d}-of-{num_shards:05d}' for i in range(num_shards)]
  file_ds = tf.data.Dataset.from_tensor_slices(filenames)
  file_ds = file_ds.repeat().shuffle(num_shards)
  example_ds = file_ds.interleave(
      functools.partial(tf.data.TFRecordDataset, compression_type='GZIP'),
      cycle_length=tf.data.experimental.AUTOTUNE,
      block_length=5)
  example_ds = example_ds.shuffle(shuffle_buffer_size)
  return example_ds.map(_tf_example_to_reverb_sample,
                        num_parallel_calls=tf.data.experimental.AUTOTUNE)
class AtariDopamineWrapper(dm_env.Environment):
  """Wrapper for Atari Dopamine environment."""

  def __init__(self, env, max_episode_steps=108000):
    self._env = env
    self._max_episode_steps = max_episode_steps
    self._episode_steps = 0
    # Force a reset on the first call to step().
    self._reset_next_step = True

  def reset(self):
    self._episode_steps = 0
    self._reset_next_step = False
    observation = self._env.reset()
    return dm_env.restart(observation.squeeze(-1))

  def step(self, action):
    if self._reset_next_step:
      return self.reset()

    observation, reward, terminal, _ = self._env.step(action.item())
    observation = observation.squeeze(-1)
    discount = 1 - float(terminal)
    self._episode_steps += 1
    if terminal:
      # Same flag that step() checks, so the next call starts a new episode.
      self._reset_next_step = True
      return dm_env.termination(reward, observation)
    elif self._episode_steps == self._max_episode_steps:
      self._reset_next_step = True
      return dm_env.truncation(reward, observation, discount)
    else:
      return dm_env.transition(reward, observation, discount)

  def observation_spec(self):
    space = self._env.observation_space
    return specs.Array(space.shape[:-1], space.dtype)

  def action_spec(self):
    return specs.DiscreteArray(self._env.action_space.n)
def environment(game):
  """Atari environment."""
  env = atari_lib.create_atari_environment(game_name=game,
                                           sticky_actions=True)
  env = AtariDopamineWrapper(env)
  env = wrappers.FrameStackingWrapper(env, num_frames=4)
  return wrappers.SinglePrecisionWrapper(env)
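The episode handling in `AtariDopamineWrapper` follows a reset-on-next-step pattern: once an episode ends, the following call to `step()` ignores its action and starts a new episode. The sketch below illustrates only that control flow, without Atari, Dopamine, or dm_env. `TimeStep`, `CountingEnv`, and `ResetOnNextStepWrapper` are hypothetical stand-ins written for this example, not classes from any of those libraries.

```python
import collections

# Hypothetical stand-in for dm_env.TimeStep, used only in this sketch.
TimeStep = collections.namedtuple('TimeStep', ['kind', 'reward', 'observation'])


class CountingEnv:
  """Hypothetical Gym-style environment whose episodes end after 3 steps."""

  def __init__(self):
    self._t = 0

  def reset(self):
    self._t = 0
    return self._t

  def step(self, action):
    self._t += 1
    terminal = self._t >= 3
    return self._t, 1.0, terminal, {}


class ResetOnNextStepWrapper:
  """Restarts the episode on the first step() after an episode has ended."""

  def __init__(self, env, max_episode_steps=108000):
    self._env = env
    self._max_episode_steps = max_episode_steps
    self._episode_steps = 0
    self._reset_next_step = True  # Forces a reset before the first real step.

  def reset(self):
    self._episode_steps = 0
    self._reset_next_step = False
    return TimeStep('first', None, self._env.reset())

  def step(self, action):
    if self._reset_next_step:
      return self.reset()
    observation, reward, terminal, _ = self._env.step(action)
    self._episode_steps += 1
    if terminal or self._episode_steps == self._max_episode_steps:
      # The flag set here is the same one checked at the top of step().
      self._reset_next_step = True
      return TimeStep('last', reward, observation)
    return TimeStep('mid', reward, observation)


env = ResetOnNextStepWrapper(CountingEnv())
kinds = [env.step(0).kind for _ in range(5)]
print(kinds)  # ['first', 'mid', 'mid', 'last', 'first']
```

The pattern only works if the constructor, `reset()`, and `step()` all read and write one and the same flag attribute.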
+407
@@ -0,0 +1,407 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KDiJzbb8KFvP"
},
"source": [
"Copyright 2020 DeepMind Technologies Limited.\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use\n",
"this file except in compliance with the License. You may obtain a copy of the\n",
"License at\n",
"\n",
"[https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)\n",
"\n",
"Unless required by applicable law or agreed to in writing, software distributed\n",
"under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR\n",
"CONDITIONS OF ANY KIND, either express or implied. See the License for the\n",
"specific language governing permissions and limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "ULdrhOaVbsdO"
},
"source": [
"# RL Unplugged: Offline DQN - Atari\n",
"## Guide to training an Acme DQN agent on Atari data.\n",
"# \u003ca href=\"https://colab.research.google.com/github/deepmind/deepmind_research/blob/master/rl_unplugged/atari_dqn.ipynb\" target=\"_parent\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\u003c/a\u003e\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "xaJxoatMhJ71"
},
"source": [
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "both",
"colab": {},
"colab_type": "code",
"id": "KH3O0zcXUeun"
},
"outputs": [],
"source": [
"!pip install dm-acme\n",
"!pip install dm-acme[reverb]\n",
"!pip install dm-acme[tf]\n",
"!pip install dm-sonnet\n",
"!pip install dopamine-rl==3.0.1\n",
"!pip install atari-py\n",
"!git clone https://github.com/deepmind/deepmind-research.git\n",
"%cd deepmind-research"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "c-H2d6UZi7Sf"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "both",
"colab": {},
"colab_type": "code",
"id": "HJ74Id-8MERq"
},
"outputs": [],
"source": [
"import copy\n",
"\n",
"import acme\n",
"from acme.agents.tf import actors\n",
"from acme.agents.tf.dqn import learning as dqn\n",
"from acme.tf import utils as acme_utils\n",
"from acme.utils import loggers\n",
"from rl_unplugged import atari\n",
"import sonnet as snt\n",
"import tensorflow as tf"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "JrOSnoWiY4Xl"
},
"source": [
"## Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "Vi3_H_h1zy_0"
},
"outputs": [],
"source": [
"game = 'Pong' #@param\n",
"run = 1 #@param\n",
"\n",
"tmp_path = '/tmp/atari'\n",
"gs_path = 'gs://rl_unplugged/atari'\n",
"\n",
"!mkdir -p {tmp_path}/{game}\n",
"!gsutil cp {gs_path}/{game}/run_{run}-00000-of-00001 {tmp_path}/{game}"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "a9vF7LtYvLzy"
},
"source": [
"## Dataset and environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "01AHHNd9cEX2"
},
"outputs": [],
"source": [
"batch_size = 10 #@param\n",
"\n",
"def discard_extras(sample):\n",
" return sample._replace(data=sample.data[:5])\n",
"\n",
"dataset = atari.dataset(path=tmp_path, game='Pong', run=1, num_shards=1)\n",
"# Small batch size, experiments in the paper were run with batch size 256.\n",
"dataset = dataset.map(discard_extras).batch(batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KoYBhjPtI_N6"
},
"source": [
""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "4b4_rHwCmQg-"
},
"outputs": [],
"source": [
"environment = atari.environment(game='Pong')"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "BukOfOsmtSQn"
},
"source": [
"## DQN learner"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 83,
"status": "ok",
"timestamp": 1593614657342,
"user": {
"displayName": "",
"photoUrl": "",
"userId": ""
},
"user_tz": -60
},
"id": "3Jcjk1w6oHVX",
"outputId": "1746b0bb-5a5c-45dd-b5a1-c77852545e12"
},
"outputs": [
{
"data": {
"text/plain": [
"TensorSpec(shape=(6,), dtype=tf.float32, name=None)"
]
},
"execution_count": 20,
"metadata": {
"tags": []
},
"output_type": "execute_result"
}
],
"source": [
"# Get total number of actions.\n",
"num_actions = environment.action_spec().num_values\n",
"\n",
"# Create the Q network.\n",
"network = snt.Sequential([\n",
" lambda x: tf.image.convert_image_dtype(x, tf.float32),\n",
" snt.Conv2D(32, [8, 8], [4, 4]),\n",
" tf.nn.relu,\n",
" snt.Conv2D(64, [4, 4], [2, 2]),\n",
" tf.nn.relu,\n",
" snt.Conv2D(64, [3, 3], [1, 1]),\n",
" tf.nn.relu,\n",
" snt.Flatten(),\n",
" snt.nets.MLP([512, num_actions])\n",
"])\n",
"acme_utils.create_variables(network, [environment.observation_spec()])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "9CD2sNK-oA9S"
},
"outputs": [],
"source": [
"# Create a logger.\n",
"logger = loggers.TerminalLogger(label='learner', time_delta=1.)\n",
"\n",
"# Create the DQN learner.\n",
"learner = dqn.DQNLearner(\n",
" network=network,\n",
" target_network=copy.deepcopy(network),\n",
" discount=0.99,\n",
" learning_rate=3e-4,\n",
" importance_sampling_exponent=0.2,\n",
" target_update_period=2500,\n",
" dataset=dataset,\n",
" logger=logger)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "oKeGQxzitXYC"
},
"source": [
"## Training loop"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"height": 51
},
"colab_type": "code",
"executionInfo": {
"elapsed": 4694,
"status": "ok",
"timestamp": 1593614662237,
"user": {
"displayName": "",
"photoUrl": "",
"userId": ""
},
"user_tz": -60
},
"id": "VWZd5N-Qoz82",
"outputId": "5ee2ce7c-b3fe-483b-8893-5a6e13519f48"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Learner] Loss = 0.003 | Steps = 1 | Walltime = 0\n",
"[Learner] Loss = 0.004 | Steps = 54 | Walltime = 1.126\n"
]
}
],
"source": [
"for _ in range(100):\n",
" learner.step()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "qFQDrp0CgIzU"
},
"source": [
"## Evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"height": 102
},
"colab_type": "code",
"executionInfo": {
"elapsed": 15099,
"status": "ok",
"timestamp": 1593614677360,
"user": {
"displayName": "",
"photoUrl": "",
"userId": ""
},
"user_tz": -60
},
"id": "DWYHBalygIDF",
"outputId": "4ec412c3-810a-4208-b521-919a8ece40df"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Evaluation] Episode Length = 842 | Episode Return = -20.000 | Episodes = 1 | Steps = 842 | Steps Per Second = 265.850\n",
"[Evaluation] Episode Length = 792 | Episode Return = -21.000 | Episodes = 2 | Steps = 1634 | Steps Per Second = 270.043\n",
"[Evaluation] Episode Length = 812 | Episode Return = -21.000 | Episodes = 3 | Steps = 2446 | Steps Per Second = 274.792\n",
"[Evaluation] Episode Length = 812 | Episode Return = -21.000 | Episodes = 4 | Steps = 3258 | Steps Per Second = 270.967\n",
"[Evaluation] Episode Length = 812 | Episode Return = -21.000 | Episodes = 5 | Steps = 4070 | Steps Per Second = 274.253\n"
]
}
],
"source": [
"# Create a logger.\n",
"logger = loggers.TerminalLogger(label='evaluation', time_delta=1.)\n",
"\n",
"# Create an environment loop.\n",
"policy_network = snt.Sequential([\n",
" network,\n",
" lambda q: tf.argmax(q, axis=-1),\n",
"])\n",
"loop = acme.EnvironmentLoop(\n",
" environment=environment,\n",
" actor=actors.FeedForwardActor(policy_network=policy_network),\n",
" logger=logger)\n",
"\n",
"loop.run(5)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"last_runtime": {
"build_target": "",
"kind": "local"
},
"name": "RL Unplugged: Offline DQN - Atari",
"provenance": [
{
"file_id": "1g9yTbTuk9aeERxWflOWqUGpx2M3osx0l",
"timestamp": 1593685504110
}
]
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
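The notebook's `discard_extras` helper trims each sample from the loader's 7-element tuple to its first five elements, `(o_t, a_t, r_t, d_t, o_tp1)`, dropping `a_tp1` and `extras` before the data reaches the learner. A small sketch of that slicing, using a namedtuple as a hypothetical stand-in for `reverb.ReplaySample` so it runs without Reverb or TensorFlow (the string placeholders are illustrative, not real tensors):

```python
import collections

# Hypothetical stand-in for reverb.ReplaySample, used only in this sketch.
ReplaySample = collections.namedtuple('ReplaySample', ['info', 'data'])


def discard_extras(sample):
  """Keeps the first five data fields and leaves `info` untouched."""
  return sample._replace(data=sample.data[:5])


sample = ReplaySample(
    info=None,
    data=('o_t', 'a_t', 'r_t', 'd_t', 'o_tp1', 'a_tp1', {'episode_id': 0}))
trimmed = discard_extras(sample)
print(trimmed.data)  # ('o_t', 'a_t', 'r_t', 'd_t', 'o_tp1')
```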
+54
@@ -0,0 +1,54 @@
# Lint as: python3
# Copyright 2020 DeepMind Technologies Limited.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""Atari dataset example.
Instructions:
> mkdir -p /tmp/dataset/Asterix
> gsutil cp gs://rl_unplugged/atari/Asterix/run_1-00000-of-00100 \
/tmp/dataset/Asterix/run_1-00000-of-00001
> python atari_example.py --path=/tmp/dataset --game=Asterix
"""
from absl import app
from absl import flags
from acme import specs
import tree
from rl_unplugged import atari
flags.DEFINE_string('path', '/tmp/dataset', 'Path to dataset.')
flags.DEFINE_string('game', 'Asterix', 'Game.')
FLAGS = flags.FLAGS
def main(_):
  ds = atari.dataset(FLAGS.path, FLAGS.game, 1,
                     num_shards=1,
                     shuffle_buffer_size=1)
  for sample in ds.take(1):
    print('Data spec')
    print(tree.map_structure(lambda x: (x.dtype, x.shape), sample.data))

  env = atari.environment(FLAGS.game)
  print('Environment spec')
  print(specs.make_environment_spec(env))
  print('Environment observation')
  timestep = env.reset()
  print(tree.map_structure(lambda x: (x.dtype, x.shape), timestep.observation))


if __name__ == '__main__':
  app.run(main)
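The example prints a "data spec" by mapping `lambda x: (x.dtype, x.shape)` over every leaf of the nested sample. The sketch below shows the idea with a small recursive helper; it is a simplified illustration and not the dm-tree implementation, it handles only plain tuples, lists, and dicts, and `FakeTensor` is a hypothetical placeholder for a real tensor.

```python
def map_structure(fn, structure):
  """Applies fn to every leaf of nested plain tuples, lists, and dicts."""
  if isinstance(structure, dict):
    return {k: map_structure(fn, v) for k, v in structure.items()}
  if isinstance(structure, (tuple, list)):
    return type(structure)(map_structure(fn, v) for v in structure)
  return fn(structure)


class FakeTensor:
  """Hypothetical leaf object exposing only dtype and shape."""

  def __init__(self, dtype, shape):
    self.dtype = dtype
    self.shape = shape


data = (
    FakeTensor('uint8', (84, 84, 4)),
    FakeTensor('int32', ()),
    {'episode_id': FakeTensor('uint64', ())},
)
spec = map_structure(lambda x: (x.dtype, x.shape), data)
print(spec)
```

The result has the same nesting as the input, with each leaf replaced by its `(dtype, shape)` pair.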
+475
@@ -0,0 +1,475 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "KDiJzbb8KFvP"
},
"source": [
"Copyright 2020 DeepMind Technologies Limited.\n",
"\n",
"Licensed under the Apache License, Version 2.0 (the \"License\"); you may not use\n",
"this file except in compliance with the License. You may obtain a copy of the\n",
"License at\n",
"\n",
"[https://www.apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0)\n",
"\n",
"Unless required by applicable law or agreed to in writing, software distributed\n",
"under the License is distributed on an \"AS IS\" BASIS, WITHOUT WARRANTIES OR\n",
"CONDITIONS OF ANY KIND, either express or implied. See the License for the\n",
"specific language governing permissions and limitations under the License."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "zzJlIvx4tnrM"
},
"source": [
"# RL Unplugged: Offline D4PG - DM control\n",
"\n",
"## Guide to training an Acme D4PG agent on DM control data.\n",
"# \u003ca href=\"https://colab.research.google.com/github/deepmind/deepmind_research/blob/master/rl_unplugged/dm_control_d4gp.ipynb\" target=\"_parent\"\u003e\u003cimg src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/\u003e\u003c/a\u003e\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "o1eig5zGEL4y"
},
"source": [
"## Installation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "WbpMoLbgEL41"
},
"outputs": [],
"source": [
"!pip install dm-acme\n",
"!pip install dm-acme[reverb]\n",
"!pip install dm-acme[tf]\n",
"!pip install dm-sonnet\n",
"!git clone https://github.com/deepmind/deepmind-research.git\n",
"%cd deepmind-research"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "04bMANoeEPM3"
},
"source": [
"### dm_control\n",
"\n",
"More detailed instructions in [this tutorial](https://colab.research.google.com/github/deepmind/dm_control/blob/master/tutorial.ipynb#scrollTo=YvyGCsgSCxHQ)."
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "VEEj3Qw60y73"
},
"source": [
"#### Institutional MuJoCo license."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "both",
"colab": {},
"colab_type": "code",
"id": "IbZxYDxzoz5R"
},
"outputs": [],
"source": [
"#@title Edit and run\n",
"mjkey = \"\"\"\n",
"\n",
"REPLACE THIS LINE WITH YOUR MUJOCO LICENSE KEY\n",
"\n",
"\"\"\".strip()\n",
"\n",
"mujoco_dir = \"$HOME/.mujoco\"\n",
"\n",
"# Install OpenGL deps\n",
"!apt-get update \u0026\u0026 apt-get install -y --no-install-recommends \\\n",
" libgl1-mesa-glx libosmesa6 libglew2.0\n",
"\n",
"# Fetch MuJoCo binaries from Roboti\n",
"!wget -q https://www.roboti.us/download/mujoco200_linux.zip -O mujoco.zip\n",
"!unzip -o -q mujoco.zip -d \"$mujoco_dir\"\n",
"\n",
"# Copy over MuJoCo license\n",
"!echo \"$mjkey\" \u003e \"$mujoco_dir/mjkey.txt\"\n",
"\n",
"\n",
"# Configure dm_control to use the OSMesa rendering backend\n",
"%env MUJOCO_GL=osmesa\n",
"\n",
"# Install dm_control, including extra dependencies needed for the locomotion\n",
"# mazes.\n",
"!pip install dm_control[locomotion_mazes]"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "-_7tVg-zzjzW"
},
"source": [
"#### Machine-locked MuJoCo license."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "form",
"colab": {},
"colab_type": "code",
"id": "OvMLEDE-D9oF"
},
"outputs": [],
"source": [
"#@title Add your MuJoCo License and run\n",
"mjkey = \"\"\"\n",
"\"\"\".strip()\n",
"\n",
"mujoco_dir = \"$HOME/.mujoco\"\n",
"\n",
"# Install OpenGL dependencies\n",
"!apt-get update \u0026\u0026 apt-get install -y --no-install-recommends \\\n",
" libgl1-mesa-glx libosmesa6 libglew2.0\n",
"\n",
"# Get MuJoCo binaries\n",
"!wget -q https://www.roboti.us/download/mujoco200_linux.zip -O mujoco.zip\n",
"!unzip -o -q mujoco.zip -d \"$mujoco_dir\"\n",
"\n",
"# Copy over MuJoCo license\n",
"!echo \"$mjkey\" \u003e \"$mujoco_dir/mjkey.txt\"\n",
"\n",
"# Install dm_control\n",
"!pip install dm_control[locomotion_mazes]\n",
"\n",
"# Configure dm_control to use the OSMesa rendering backend\n",
"%env MUJOCO_GL=osmesa"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "IE2nV9Hivnv5"
},
"source": [
"## Imports"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "RI7NgnJIvs4s"
},
"outputs": [],
"source": [
"import collections\n",
"import copy\n",
"from typing import Mapping, Sequence\n",
"\n",
"import acme\n",
"from acme import specs\n",
"from acme.agents.tf import actors\n",
"from acme.agents.tf import d4pg\n",
"from acme.tf import networks\n",
"from acme.tf import utils as tf2_utils\n",
"from acme.utils import loggers\n",
"from acme.wrappers import single_precision\n",
"from acme.tf import utils as tf2_utils\n",
"import numpy as np\n",
"from rl_unplugged import dm_control_suite\n",
"import sonnet as snt\n",
"import tensorflow as tf"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "a2PCwF3bwBII"
},
"source": [
"## Data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"cellView": "both",
"colab": {},
"colab_type": "code",
"id": "VaEJbXjampPy"
},
"outputs": [],
"source": [
"task_name = 'cartpole_swingup' #@param\n",
"tmp_path = '/tmp/dm_control_suite'\n",
"gs_path = 'gs://rl_unplugged/dm_control_suite'\n",
"\n",
"!mkdir -p {tmp_path}/{task_name}\n",
"!gsutil cp {gs_path}/{task_name}/* {tmp_path}/{task_name}\n",
"\n",
"num_shards_str, = !ls {tmp_path}/{task_name}/* | wc -l\n",
"num_shards = int(num_shards_str)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "mQ1as51Mww7X"
},
"source": [
"## Dataset and environment"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "5kHzJpfcw306"
},
"outputs": [],
"source": [
"batch_size = 10 #@param\n",
"\n",
"task = dm_control_suite.ControlSuite(task_name)\n",
"\n",
"environment = task.environment\n",
"environment_spec = specs.make_environment_spec(environment)\n",
"\n",
"dataset = dm_control_suite.dataset(\n",
" '/tmp',\n",
" data_path=task.data_path,\n",
" shapes=task.shapes,\n",
" uint8_features=task.uint8_features,\n",
" num_threads=1,\n",
" batch_size=batch_size,\n",
" num_shards=num_shards)\n",
"\n",
"def discard_extras(sample):\n",
" return sample._replace(data=sample.data[:5])\n",
"\n",
"dataset = dataset.map(discard_extras).batch(batch_size)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "adb0cyE5qu9G"
},
"source": [
"## D4PG learner"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {},
"colab_type": "code",
"id": "83naOY7a_A4I"
},
"outputs": [],
"source": [
"# Create the networks to optimize.\n",
"action_spec = environment_spec.actions\n",
"action_size = np.prod(action_spec.shape, dtype=int)\n",
"\n",
"policy_network = snt.Sequential([\n",
" tf2_utils.batch_concat,\n",
" networks.LayerNormMLP(layer_sizes=(300, 200, action_size)),\n",
" networks.TanhToSpec(spec=environment_spec.actions)])\n",
"\n",
"critic_network = snt.Sequential([\n",
" networks.CriticMultiplexer(\n",
" observation_network=tf2_utils.batch_concat,\n",
" action_network=tf.identity,\n",
" critic_network=networks.LayerNormMLP(\n",
" layer_sizes=(400, 300),\n",
" activate_final=True)),\n",
" # Value-head gives a 51-atomed delta distribution over state-action values.\n",
" networks.DiscreteValuedHead(vmin=-150., vmax=150., num_atoms=51)])\n",
"\n",
"# Create the target networks\n",
"target_policy_network = copy.deepcopy(policy_network)\n",
"target_critic_network = copy.deepcopy(critic_network)\n",
"\n",
"# Create variables.\n",
"tf2_utils.create_variables(network=policy_network,\n",
" input_spec=[environment_spec.observations])\n",
"tf2_utils.create_variables(network=critic_network,\n",
" input_spec=[environment_spec.observations,\n",
" environment_spec.actions])\n",
"tf2_utils.create_variables(network=target_policy_network,\n",
" input_spec=[environment_spec.observations])\n",
"tf2_utils.create_variables(network=target_critic_network,\n",
" input_spec=[environment_spec.observations,\n",
" environment_spec.actions])\n",
"\n",
"# The learner updates the parameters (and initializes them).\n",
"learner = d4pg.D4PGLearner(\n",
" policy_network=policy_network,\n",
" critic_network=critic_network,\n",
" target_policy_network=target_policy_network,\n",
" target_critic_network=target_critic_network,\n",
" dataset=dataset,\n",
" discount=0.99,\n",
" target_update_period=100)"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "PYkjKaduy_xj"
},
"source": [
"## Training loop"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"height": 34
},
"colab_type": "code",
"executionInfo": {
"elapsed": 3493,
"status": "ok",
"timestamp": 1593622068277,
"user": {
"displayName": "",
"photoUrl": "",
"userId": ""
},
"user_tz": -60
},
"id": "HbQOyCG4zCwa",
"outputId": "cfb99d00-da2d-4ce8-e010-034a26e2ada0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Learner] Critic Loss = 3.919 | Policy Loss = 0.326 | Steps = 1 | Walltime = 0\n"
]
}
],
"source": [
"for _ in range(100):\n",
" learner.step()"
]
},
{
"cell_type": "markdown",
"metadata": {
"colab_type": "text",
"id": "LJ_XsuQSzFSV"
},
"source": [
"## Evaluation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"colab": {
"height": 51
},
"colab_type": "code",
"executionInfo": {
"elapsed": 4197,
"status": "ok",
"timestamp": 1593620604870,
"user": {
"displayName": "",
"photoUrl": "",
"userId": ""
},
"user_tz": -60
},
"id": "blvNCANKb22J",
"outputId": "af5ae073-9847-45cc-e51e-a803fc2148b0"
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[Evaluation] Episode Length = 1000 | Episode Return = 129.717 | Episodes = 2 | Steps = 2000 | Steps Per Second = 1480.399\n",
"[Evaluation] Episode Length = 1000 | Episode Return = 34.790 | Episodes = 4 | Steps = 4000 | Steps Per Second = 1449.009\n"
]
}
],
"source": [
"# Create a logger.\n",
"logger = loggers.TerminalLogger(label='evaluation', time_delta=1.)\n",
"\n",
"# Create an environment loop.\n",
"loop = acme.EnvironmentLoop(\n",
" environment=environment,\n",
" actor=actors.FeedForwardActor(policy_network),\n",
" logger=logger)\n",
"\n",
"loop.run(5)"
]
}
],
"metadata": {
"colab": {
"collapsed_sections": [],
"last_runtime": {
"build_target": "",
"kind": "local"
},
"name": "RL Unplugged: Offline D4PG - DM control",
"provenance": [
{
"file_id": "1OerSIsTjv4d3rQCjAsi0ljPaLan87juJ",
"timestamp": 1593080049369
}
]
},
"kernelspec": {
"display_name": "Python 3",
"name": "python3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
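The critic in the notebook ends in a 51-atom value head spanning `vmin=-150` to `vmax=150`, so it outputs a categorical distribution over fixed return values instead of a single scalar. The sketch below assumes the atoms sit on an evenly spaced grid between the two bounds, which is the usual construction for such heads. It is plain Python for illustration, not Acme's code, and `atom_values` and `expected_value` are names invented here.

```python
def atom_values(vmin, vmax, num_atoms):
  """Evenly spaced support points from vmin to vmax, both included."""
  step = (vmax - vmin) / (num_atoms - 1)
  return [vmin + i * step for i in range(num_atoms)]


def expected_value(probabilities, atoms):
  """Mean of a categorical distribution placed on the given atoms."""
  return sum(p * z for p, z in zip(probabilities, atoms))


atoms = atom_values(-150.0, 150.0, 51)
print(atoms[0], atoms[1], atoms[-1])  # -150.0 -144.0 150.0

# A distribution with all mass on one atom has that atom's value as its mean.
one_hot = [0.0] * len(atoms)
one_hot[30] = 1.0
print(expected_value(one_hot, atoms))  # 30.0
```

With these settings neighbouring atoms are 6 apart, which bounds how finely the head can resolve returns.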
+71
@@ -0,0 +1,71 @@
# Lint as: python3
# Copyright 2020 DeepMind Technologies Limited.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""DM control suite and locomotion dataset examples.

Instructions:
> export TMP_PATH=/tmp/dataset
> export TASK_NAME=humanoid_run
> mkdir -p $TMP_PATH/dm_control_suite/$TASK_NAME
> gsutil cp gs://rl_unplugged/dm_control_suite/$TASK_NAME/train-00000-of-00100 \
$TMP_PATH/dm_control_suite/$TASK_NAME/train-00000-of-00001
> python dm_control_suite_example.py --path=$TMP_PATH \
--task_class=control_suite --task_name=$TASK_NAME
"""
from absl import app
from absl import flags
import tree
from rl_unplugged import dm_control_suite
flags.DEFINE_string('path', '/tmp/dataset', 'Path to dataset.')
flags.DEFINE_string('task_name', 'humanoid_run', 'Task name.')
flags.DEFINE_enum('task_class', 'control_suite',
['humanoid', 'rodent', 'control_suite'],
'Task classes.')
FLAGS = flags.FLAGS
def main(_):
  if FLAGS.task_class == 'control_suite':
    task = dm_control_suite.ControlSuite(task_name=FLAGS.task_name)
  elif FLAGS.task_class == 'humanoid':
    task = dm_control_suite.CmuThirdParty(task_name=FLAGS.task_name)
  elif FLAGS.task_class == 'rodent':
    task = dm_control_suite.Rodent(task_name=FLAGS.task_name)

  ds = dm_control_suite.dataset(root_path=FLAGS.path,
                                data_path=task.data_path,
                                shapes=task.shapes,
                                num_threads=1,
                                batch_size=2,
                                uint8_features=task.uint8_features,
                                num_shards=1,
                                shuffle_buffer_size=10)

  for sample in ds.take(1):
    print('Data spec')
    print(tree.map_structure(lambda x: (x.dtype, x.shape), sample.data))

  environment = task.environment
  timestep = environment.reset()
  print(tree.map_structure(lambda x: (x.dtype, x.shape), timestep.observation))


if __name__ == '__main__':
  app.run(main)
+58
@@ -0,0 +1,58 @@
absl-py==0.9.0
astunparse==1.6.3
atari-py==0.2.6
cachetools==4.1.1
certifi==2020.6.20
chardet==3.0.4
cloudpickle==1.3.0
decorator==4.4.2
dm-acme==0.1.7
dm-control==0.0.319497192
dm-env==1.2
dm-reverb-nightly==0.1.0.dev20200616
dm-sonnet==2.0.0
dm-tree==0.1.5
dopamine-rl==3.0.1
future==0.18.2
gast==0.3.3
gin-config==0.3.0
glfw==1.11.2
google-auth==1.18.0
google-auth-oauthlib==0.4.1
google-pasta==0.2.0
grpcio==1.30.0
gym==0.17.2
h5py==2.10.0
idna==2.10
Keras-Preprocessing==1.1.2
lxml==4.5.1
Markdown==3.2.2
numpy==1.19.0
oauthlib==3.1.0
opencv-python==4.3.0.36
opt-einsum==3.2.1
Pillow==7.2.0
pkg-resources==0.0.0
portpicker==1.3.1
protobuf==3.12.2
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyglet==1.5.0
PyOpenGL==3.1.5
pyparsing==2.4.7
requests==2.24.0
requests-oauthlib==1.3.0
rsa==4.6
scipy==1.4.1
six==1.15.0
tabulate==0.8.7
tb-nightly==2.3.0a20200706
tensorboard-plugin-wit==1.7.0
termcolor==1.1.0
tf-estimator-nightly==2.4.0.dev2020070701
tf-nightly==2.3.0.dev20200616
tfp-nightly==0.11.0.dev20200707
trfl==1.1.0
urllib3==1.25.9
Werkzeug==1.0.1
wrapt==1.12.1
+193
@@ -0,0 +1,193 @@
# Lint as: python3
# Copyright 2020 DeepMind Technologies Limited.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Real World RL for RL Unplugged datasets.
Examples in the dataset represent SARS transitions stored when running a
partially online trained agent as described in https://arxiv.org/abs/1904.12901.
We release 8 datasets in total -- with no combined challenge and easy combined
challenge on the cartpole, walker, quadruped, and humanoid tasks. For details
on how the dataset was generated, please refer to the paper.
Every transition in the dataset is a tuple containing the following features:
* o_t: Observation at time t. Observations have been processed using the
canonical
* a_t: Action taken at time t.
* r_t: Reward at time t.
* d_t: Discount at time t.
* o_tp1: Observation at time t+1.
* a_tp1: Action taken at time t+1. This is set equal to the last action
for the last timestep.
Note that this serves as an example. For optimal data loading speed, consider
separating out data preprocessing from the data loading loop during training,
e.g. saving the preprocessed data.
"""
import collections
import functools
import os
from typing import Any, Dict, Optional, Sequence

from acme import wrappers
import dm_env
import realworldrl_suite.environments as rwrl_envs
import reverb
import tensorflow as tf
import tree

DELIMITER = ':'
# Control suite tasks have 1000 timesteps per episode. One additional timestep
# accounts for the very first observation where no action has been taken yet.
DEFAULT_NUM_TIMESTEPS = 1001
def _decombine_key(k: str, delimiter: str = DELIMITER) -> Sequence[str]:
  return k.split(delimiter)
def tf_example_to_feature_description(example,
                                      num_timesteps=DEFAULT_NUM_TIMESTEPS):
  """Takes a string tensor encoding a tf.Example and returns its features."""
  if not tf.executing_eagerly():
    raise AssertionError(
        'tf_example_to_feature_description() only works under eager mode.')
  example = tf.train.Example.FromString(example.numpy())

  ret = {}
  for k, v in example.features.feature.items():
    l = len(v.float_list.value)
    if l % num_timesteps:
      raise ValueError('Unexpected feature length %d. It should be divisible '
                       'by num_timesteps: %d' % (l, num_timesteps))
    size = l // num_timesteps
    ret[k] = tf.io.FixedLenFeature([num_timesteps, size], tf.float32)
  return ret
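The shape inference above can be illustrated without TensorFlow. The following is a minimal sketch, not part of the library: `per_step_sizes` and the feature lengths are hypothetical names and values chosen for illustration. It shows how a flat float list of length `num_timesteps * size` is interpreted as `num_timesteps` rows of width `size`, and that a length not divisible by `num_timesteps` is rejected.

```python
from typing import Dict, Mapping


def per_step_sizes(feature_lengths: Mapping[str, int],
                   num_timesteps: int = 1001) -> Dict[str, int]:
  """Maps each flat feature name to its per-timestep width."""
  sizes = {}
  for name, length in feature_lengths.items():
    # A whole episode is stored as one flat list, so its length must be an
    # exact multiple of the number of timesteps.
    if length % num_timesteps:
      raise ValueError(
          'Unexpected feature length %d. It should be divisible '
          'by num_timesteps: %d' % (length, num_timesteps))
    sizes[name] = length // num_timesteps
  return sizes


# Hypothetical flat lengths for one episode of 1001 timesteps.
example_lengths = {'action': 1001 * 12, 'reward': 1001}
print(per_step_sizes(example_lengths))  # {'action': 12, 'reward': 1}
```

In the loader, each inferred width becomes the second dimension of a `tf.io.FixedLenFeature([num_timesteps, size], tf.float32)`.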
def tree_deflatten_with_delimiter(
    flat_dict: Dict[str, Any], delimiter: str = DELIMITER) -> Dict[str, Any]:
  """De-flattens a dict to its originally nested structure.

  Does the opposite of {combine_nested_keys(k): v
                        for k, v in tree.flatten_with_path(nested_dicts)}

  Example: {'a:b': 1} -> {'a': {'b': 1}}

  Args:
    flat_dict: the keys of which equal the `path` separated by `delimiter`.
    delimiter: the delimiter that separates the keys of the nested dict.

  Returns:
    An un-flattened dict.
  """
  root = collections.defaultdict(dict)
  for delimited_key, v in flat_dict.items():
    keys = _decombine_key(delimited_key, delimiter=delimiter)
    node = root
    for k in keys[:-1]:
      # setdefault creates missing intermediate dicts at any depth; plain
      # indexing would raise KeyError below the first level.
      node = node.setdefault(k, {})
    node[keys[-1]] = v
  return dict(root)
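As a quick illustration of the de-flattening step, here is a standalone sketch in plain Python. The function name `deflatten` and the feature keys are hypothetical and chosen only for this example; the delimiter matches the `':'` used above.

```python
from typing import Any, Dict


def deflatten(flat_dict: Dict[str, Any], delimiter: str = ':') -> Dict[str, Any]:
  """Rebuilds a nested dict from keys such as 'observation:position'."""
  root: Dict[str, Any] = {}
  for delimited_key, value in flat_dict.items():
    keys = delimited_key.split(delimiter)
    node = root
    # Walk (and create) the intermediate dicts, then store the leaf value.
    for key in keys[:-1]:
      node = node.setdefault(key, {})
    node[keys[-1]] = value
  return root


# Hypothetical flat feature names, for illustration only.
flat = {
    'observation:position': [0.0, 1.0],
    'observation:velocity': [0.5],
    'reward': 1.0,
}
nested = deflatten(flat)
print(nested['observation']['velocity'])  # [0.5]
```

Keys without a delimiter, such as `'reward'` here, stay at the top level.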
def get_slice_of_nested(nested: Any, start: int, end: int) -> Any:
  return tree.map_structure(lambda item: item[start:end], nested)


def repeat_last_and_append_to_nested(nested: Any) -> Any:
  return tree.map_structure(
      lambda item: tf.concat((item, item[-1:]), axis=0), nested)
def tf_example_to_reverb_sample(example,
                                feature_description,
                                num_timesteps=DEFAULT_NUM_TIMESTEPS):
  """Converts the episode encoded as a tf example into SARSA reverb samples."""
  example = tf.io.parse_single_example(example, feature_description)
  kv = tree_deflatten_with_delimiter(example)
  output = (
      get_slice_of_nested(kv['observation'], 0, num_timesteps - 1),
      get_slice_of_nested(kv['action'], 1, num_timesteps),
      kv['reward'][1:num_timesteps],
      # The two fields below aren't needed for learning,
      # but are kept here to be compatible with acme learner format.
      kv['discount'][1:num_timesteps],
      get_slice_of_nested(kv['observation'], 1, num_timesteps),
      repeat_last_and_append_to_nested(
          get_slice_of_nested(kv['action'], 2, num_timesteps)))
  ret = tf.data.Dataset.from_tensor_slices(output)
  ret = ret.map(lambda *x: reverb.ReplaySample(info=b'None', data=x))  # pytype: disable=wrong-arg-types
  return ret
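The slicing above turns one episode of `num_timesteps` steps into `num_timesteps - 1` SARSA tuples. The sketch below mirrors the same index arithmetic on plain Python lists so the alignment is easy to inspect. It is an illustration only: `episode_to_sarsa` and the string labels are hypothetical, and it assumes, in line with the comment on `DEFAULT_NUM_TIMESTEPS`, that index 0 of the action, reward and discount arrays is a placeholder for the first observation.

```python
from typing import Any, List, Sequence, Tuple


def episode_to_sarsa(observations: Sequence[Any], actions: Sequence[Any],
                     rewards: Sequence[Any],
                     discounts: Sequence[Any]) -> List[Tuple[Any, ...]]:
  """Slices one episode into (o_t, a_t, r_t, d_t, o_tp1, a_tp1) tuples."""
  num_timesteps = len(observations)
  o_t = list(observations[0:num_timesteps - 1])
  a_t = list(actions[1:num_timesteps])
  r_t = list(rewards[1:num_timesteps])
  d_t = list(discounts[1:num_timesteps])
  o_tp1 = list(observations[1:num_timesteps])
  # The final transition has no following action, so the last one is repeated.
  a_tp1 = list(actions[2:num_timesteps])
  a_tp1 = a_tp1 + a_tp1[-1:]
  return list(zip(o_t, a_t, r_t, d_t, o_tp1, a_tp1))


# A toy episode with 4 timesteps; entry 0 of actions/rewards/discounts is the
# placeholder that accompanies the very first observation.
transitions = episode_to_sarsa(
    observations=['o0', 'o1', 'o2', 'o3'],
    actions=['a0', 'a1', 'a2', 'a3'],
    rewards=[0.0, 1.0, 2.0, 3.0],
    discounts=[1.0, 1.0, 1.0, 0.0])
print(len(transitions))  # 3
print(transitions[0])    # ('o0', 'a1', 1.0, 1.0, 'o1', 'a2')
print(transitions[-1])   # ('o2', 'a3', 3.0, 0.0, 'o3', 'a3')
```

The last tuple shows the repeated action: `a_tp1` equals `a_t` for the final transition.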
def dataset(path: str,
            combined_challenge: str,
            domain: str,
            task: str,
            difficulty: str,
            num_shards: int = 100,
            shuffle_buffer_size: int = 100000) -> tf.data.Dataset:
  """TF dataset of RWRL SARSA tuples."""
  path = os.path.join(
      path,
      f'combined_challenge_{combined_challenge}/{domain}/{task}/'
      f'offline_rl_challenge_{difficulty}'
  )
  filenames = [
      f'{path}/episodes.tfrecord-{i:05d}-of-{num_shards:05d}'
      for i in range(num_shards)
  ]
  file_ds = tf.data.Dataset.from_tensor_slices(filenames)
  file_ds = file_ds.repeat().shuffle(num_shards)
  tf_example_ds = file_ds.interleave(
      tf.data.TFRecordDataset,
      cycle_length=tf.data.experimental.AUTOTUNE,
      block_length=5)

  # Take one item to get the output types and shapes.
  example_item = None
  for example_item in tf.data.TFRecordDataset(filenames[:1]).take(1):
    break
  if example_item is None:
    raise ValueError('Empty dataset')

  feature_description = tf_example_to_feature_description(example_item)

  # Each episode expands into a dataset of transitions; interleave them.
  reverb_ds = tf_example_ds.interleave(
      functools.partial(
          tf_example_to_reverb_sample, feature_description=feature_description),
      num_parallel_calls=tf.data.experimental.AUTOTUNE,
      deterministic=False)
  reverb_ds = reverb_ds.prefetch(100)
  reverb_ds = reverb_ds.shuffle(shuffle_buffer_size)
  return reverb_ds
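Because the shard count is baked into every file name, the files on disk have to match the `num_shards` argument exactly. The sketch below reproduces only the path construction from `dataset()` in plain Python; `shard_filenames` is a hypothetical helper written for illustration, and the root directory is an assumed local path.

```python
import os
from typing import List


def shard_filenames(root: str, combined_challenge: str, domain: str, task: str,
                    difficulty: str, num_shards: int) -> List[str]:
  """Builds the list of shard paths that the loader will try to open."""
  path = os.path.join(
      root,
      f'combined_challenge_{combined_challenge}/{domain}/{task}/'
      f'offline_rl_challenge_{difficulty}')
  return [
      f'{path}/episodes.tfrecord-{i:05d}-of-{num_shards:05d}'
      for i in range(num_shards)
  ]


# Assumed local root; with num_shards=1 the loader expects a single file.
names = shard_filenames('/tmp/dataset/rwrl', 'easy', 'quadruped', 'walk',
                        'easy', num_shards=1)
print(os.path.basename(names[0]))  # episodes.tfrecord-00000-of-00001
```

This is why a single downloaded shard has to be renamed to `episodes.tfrecord-00000-of-00001` before calling `dataset()` with `num_shards=1`.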
def environment(
    combined_challenge: str,
    domain: str,
    task: str,
    log_output: Optional[str] = None,
    environment_kwargs: Optional[Dict[str, Any]] = None) -> dm_env.Environment:
  """RWRL environment."""
  env = rwrl_envs.load(
      domain_name=domain,
      task_name=task,
      log_output=log_output,
      environment_kwargs=environment_kwargs,
      combined_challenge=combined_challenge)
  return wrappers.SinglePrecisionWrapper(env)
File diff suppressed because it is too large
+51
@@ -0,0 +1,51 @@
# Lint as: python3
# pylint: disable=line-too-long
# Copyright 2020 DeepMind Technologies Limited.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
r"""RWRL dataset example.
Instructions:
> export TMP_PATH=/tmp/dataset/rwrl
> export DATA_PATH=combined_challenge_easy/quadruped/walk/offline_rl_challenge_easy
> mkdir -p $TMP_PATH/$DATA_PATH
> gsutil cp gs://rl_unplugged/rwrl/$DATA_PATH/episodes.tfrecord-00001-of-00015 \
$TMP_PATH/$DATA_PATH/episodes.tfrecord-00000-of-00001
> python rwrl_example.py --path=$TMP_PATH
"""
# pylint: enable=line-too-long
from absl import app
from absl import flags
import tree

from rl_unplugged import rwrl

flags.DEFINE_string('path', '/tmp/dataset', 'Path to dataset.')


def main(_):
  ds = rwrl.dataset(
      flags.FLAGS.path,
      combined_challenge='easy',
      domain='quadruped',
      task='walk',
      difficulty='easy',
      num_shards=1,
      shuffle_buffer_size=1)
  for replay_sample in ds.take(1):
    print(tree.map_structure(lambda x: (x.dtype, x.shape), replay_sample.data))
    break


if __name__ == '__main__':
  app.run(main)