add explanation and requirements for running the UVFA approximation for the future tasks paper

PiperOrigin-RevId: 336880872
2026-05-09 21:07:49 +08:00 · 2020-10-13 16:29:27 +01:00
parent 0b9372d5e6
commit 2e48a73ee4
2 changed files with 118 additions and 94 deletions
@@ -2,6 +2,8 @@

 Side effects are unnecessary disruptions to the agent's environment while completing a task. Instead of trying to explicitly penalize all possible side effects, we give the agent a general penalty for impacting the environment, defined as a deviation from some baseline state. For example, a reversibility penalty measures unreachability (deviation) of the starting state (baseline). This code implements a tabular Q-learning agent with different impact penalties. Each penalty consists of a deviation measure (none, unreachability, relative reachability, or attainable utility), a baseline (starting state, inaction, or stepwise inaction), and some other design choices. This is the code for the paper [Penalizing side effects using stepwise relative reachability](https://arxiv.org/abs/1806.01186) by Krakovna et al (2019).

+In our latest paper, "Avoiding Side Effects By Considering Future Tasks" by Krakovna et al. (NeurIPS 2020), the agent receives an auxiliary reward for preserving the ability to perform future tasks. This approach is equivalent to relative reachability with an inaction baseline in deterministic environments. The UVFA approximation for the auxiliary reward is included as an option for the deviation measure.
+
 ## Instructions

 Clone the repository:
@@ -14,20 +16,28 @@ Run the agent with a given penalty on an AI Safety Gridworlds environment:

 `python -m side_effects_penalties.run_experiment -baseline <X> -dev_measure <Y> -env_name <Z> -suffix <S>`

-The following parameters can be specified for the side effects penalty:
+The following settings can be specified for the side effects penalty:
 * Baseline state (`-baseline`): starting state (`start`), inaction (`inaction`),
  stepwise inaction with rollouts (`stepwise`), stepwise inaction without
  rollouts (`step_noroll`)
 * Deviation measure (`-dev_measure`): none (`none`), unreachability (`reach`),
-  relative reachability (`rel_reach`), attainable utility (`att_util`)
-* Discount factor for the deviation measure value function (`-value_discount`)
+  relative reachability (`rel_reach`), attainable utility (`att_util`),
+  UVFA approximation of relative reachability (`uvfa_rel_reach`)
 * Summary function to apply to the relative reachability or attainable utility
  deviation measure (`-dev_fun`): max (0, x) (`truncation`) or |x| (`absolute`)
+* Discount factor for rewards (`-discount`). We use `discount=0.95` for the UVFA
+  approximation of relative reachability.
+* Discount factor for the deviation measure value function (`-value_discount`).
+  Should be the same as `discount` unless using an undiscounted reachability
+  measure.
 * Weight for the side effects penalty relative to the reward (`-beta`)
+* Penalty for nonterminal states relative to terminal states (`-nonterminal`):
+  1 (`full`) is used in the stepwise relative reachability paper, while
+  (1-discount) (`disc`) is used in the future tasks paper.

-Other arguments:
-* AI Safety Gridworlds environment name (`-env_name`)
+Other settings include:
 * Number of episodes (`-num_episodes`)
+* AI Safety Gridworlds environment name (`-env_name`)
 * Filename suffix for saving result files (`-suffix`)

 ### Plotting the results
@@ -53,17 +63,20 @@ learning curve plot.

 * Python 2.7 or 3 (tested with Python 2.7.15 and 3.6.7)
 * [AI Safety Gridworlds](https://github.com/deepmind/ai-safety-gridworlds) suite
-of safety environments
+  of safety environments
 * [Abseil](https://github.com/abseil/abseil-py) Python common libraries
 * Numpy
+* Tensorflow 1
+* Sonnet
 * Pandas
 * Six
 * Matplotlib
 * Seaborn

+
 ## Citing this work

-If you use this code in your work, please cite the accompanying paper:
+If you use this code in your work, please cite one of the accompanying papers:

 `@article{srr2019,
  title = {Penalizing Side Effects using Stepwise Relative Reachability},
@@ -72,3 +85,10 @@ If you use this code in your work, please cite the accompanying paper:
  volume = {abs/1806.01186},
  year = {2019}
 }`
+
+`@inproceedings{ft2020,
+  title = {Avoiding Side Effects By Considering Future Tasks},
+  author = {Victoria Krakovna and Laurent Orseau and Richard Ngo and Miljan Martic and Shane Legg},
+  booktitle = {Neural Information Processing Systems},
+  year = {2020}
+}`
@@ -1,111 +1,115 @@
-absl-py==0.7.1
-activity-log-manager==0.8.0
-apt-xapian-index==0.49
+absl-py==0.10.0
+apparmor==2.13.4
 asn1crypto==0.24.0
-attrs==18.2.0
-Automat==0.6.0
-backports.functools-lru-cache==1.5
-bcrypt==3.1.6
-beautifulsoup4==4.6.3
+attrs==19.3.0
+bcrypt==3.1.7
+beautifulsoup4==4.9.1
 blinker==1.4
-ccsm==0.9.13.1
-certifi==2018.8.24
+Brlapi==0.7.0
+certifi==2020.4.5.1
 chardet==3.0.4
-Click==7.0
-colorama==0.3.7
-compizconfig-python==0.9.13.1
-configparser==3.5.0b2
-constantly==15.1.0
+chrome-gnome-shell==0.0.0
+cloudpickle==1.6.0
 CredentialKit==0.7
-cryptography==2.3
-cycler==0.10.0
-defer==1.0.6
-defusedxml==0.5.0
-dirspec==13.10
-duplicity==0.7.18.2
+credentialkit-client==1
+cryptography==2.8
+cupshelpers==1.0
+dbus-python==1.2.16
+decorator==4.4.2
+distro==1.5.0
+distro-info==0.23
+dm-tree==0.1.5
+duplicity==0.8.12.0
 entrypoints==0.3
-enum34==1.1.6
-fanotify==0.1
-fasteners==0.12.0
-fpconst==0.7.2
-future==0.16.0
-glinux-identity==1
-goobuntu-config-tools==0.1
-goobuntu-sso-watcher==0.1
-goobuntu-welcome==11
-googlenetworkaccess==0.1
-gpg==1.12.0
-gprof2dot==2017.9.19
-hg-evolve==9.2.0.dev0
+enum34==1.1.10
+evdev==1.3.0
+extras==1.0.0
+fasteners==0.14.1
+fixtures==3.0.0
+future==0.18.2
+gast==0.4.0
+gbulb==0.6.1
+gpg===1.13.1-unknown
+hg-evolve==10.1.0.dev0
 html5lib==1.0.1
-httplib2==0.11.3
-hyperlink==17.3.1
-idna==2.6
-incremental==16.10.1
-inotifyx==0.2.0
-ipaddress==1.0.17
-IPy==0.83
-kernel-pruner==47
-keyring==17.1.1
-keyrings.alt==3.1.1
-kiwisolver==1.1.0
+httplib2==0.18.1
+idna==2.9
+iniparse==0.4
+IPy==1.0
+jeepney==0.4.3
+kernel-pruner==56
+keyring==18.0.1
+keyrings.alt==3.4.0
+LibAppArmor==2.13.4
+linecache2==1.0.0
 lockfile==0.12.2
-lxml==4.3.2
-lz4==1.1.0+dfsg
-matplotlib==2.2.4
-mercurial==5.1.1+194.5ca351ba2478
-monotonic==1.0
-mox==0.5.3
-numpy==1.16.4
-oauthlib==2.1.0
+louis==3.14.0
+lxml==4.5.0
+lz4==3.0.2+dfsg
+mercurial==5.5.1+348.80bf7b1ada15
+monotonic==1.5
+mox3==1.0.0
+networkx==1.8.1
+numpy==1.18.4
+oauthlib==3.1.0
+obno==39
 olefile==0.46
+onboard==1.4.1
 PAM==0.4.2
-pandas==0.24.2
-paramiko==2.4.2
-parse==1.6.6
+pandas==1.1.3
+paramiko==2.6.0
+pbr==5.4.5
 pexpect==4.6.0
-Pillow==4.3.0
-protobuf==3.6.1
-psutil==5.5.1
+Pillow==7.2.0
+protobuf==3.11.4
+psutil==5.6.7
 pyasn1==0.4.2
 pyasn1-modules==0.2.1
 pycairo==1.16.2
 pycrypto==2.6.1
 pycups==1.9.73
 pycurl==7.43.0.2
-PyGObject==3.30.4
+Pygments==2.3.1
+PyGObject==3.36.0
 pyinotify==0.9.6
-PyJWT==1.7.0
-PyKCS11==1.2.4
-PyNaCl==1.3.0
-pyOpenSSL==19.0.0
-pyparsing==2.4.2
-pyserial==3.4
-pysmbc==1.0.15.6
-python-apt==1.8.4
+PyJWT==1.7.1
+PyKCS11==1.5.8
+PyNaCl==1.4.0
+pyOpenSSL==19.1.0
+pyparsing==2.4.7
+pysmbc==1.0.22
+python-apt==2.1.3
 python-augeas==0.5.0
-python-dateutil==2.8.0
-python-debian==0.1.34
+python-dateutil==2.8.1
+python-debian==0.1.37
+python-mimeparse==1.6.0
 python-networkmanager==2.1
-python2-pythondialog==3.3.0
-pytz==2019.2
+python-pam==1.8.4
+python-xapp==2.0.1
+python-xlib==0.27
+pytz==2020.1
 pyudev==0.21.0
+pyusb==1.0.2
 pyxattr==0.6.1
-pyxdg==0.25
-PyYAML==3.13
+pyxdg==0.26
+PyYAML==5.3.1
+reboot-enforcer==0.1
+reconfigure==0.1.81
 rekey==1
-reportlab==3.5.13
-requests==2.21.0
-scipy==1.2.2
+requests==2.23.0
 scour==0.37
-seaborn==0.9.0
-SecretStorage==2.3.1
-service-identity==16.0.0
-six==1.12.0
-SOAPpy==0.12.22
-subprocess32==3.5.4
-Twisted==18.9.0
-urllib3==1.24.1
+SecretStorage==3.1.2
+setproctitle==1.1.10
+six==1.15.0
+sonnet==0.1.6
+soupsieve==2.0.1
+tensorflow-probability==0.11.1
+testtools==2.3.0
+tinycss==0.4
+tinycss2==1.0.2
+traceback2==1.4.0
+ufw==0.36
+unittest2==1.1.0
+urllib3==1.25.9
 webencodings==0.5.1
-wstools==0.4.3
-zope.interface==4.3.2
+youtube-dl==2020.6.16.1