# Side effects penalties
This is the code for the paper [Penalizing side effects using stepwise relative reachability](https://arxiv.org/abs/1806.01186) by Krakovna et al. (2019). It implements a tabular Q-learning agent with different side effects penalties. Each penalty consists of a deviation measure (none, unreachability, relative reachability, or attainable utility) and a baseline (starting state, inaction, or stepwise inaction).
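As background: for a given transition, the penalty is the chosen deviation measure evaluated against the chosen baseline state, scaled by a weight beta and subtracted from the reward. Below is a minimal illustrative sketch (not the repository's implementation) of the relative reachability deviation with the truncation summary function; the function names and reachability arrays are assumptions for illustration:

```python
import numpy as np

def relative_reachability(reach_baseline, reach_current):
  """Truncated decrease in reachability, averaged over states.

  Only reductions relative to the baseline are penalized
  (the truncation summary function, max(0, x))."""
  return np.mean(np.maximum(0.0, reach_baseline - reach_current))

def side_effects_penalty(reach_baseline, reach_current, beta):
  """Penalty subtracted from the reward, scaled by beta."""
  return beta * relative_reachability(reach_baseline, reach_current)

# Toy example: the agent's action makes one state unreachable.
reach_baseline = np.array([1.0, 1.0, 0.5])  # reachability from the baseline state
reach_current = np.array([1.0, 0.0, 0.5])   # reachability from the current state
print(side_effects_penalty(reach_baseline, reach_current, beta=0.1))  # 0.1 * 1/3
```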
## Instructions
Clone the repository (this code lives in the `side_effects_penalties` subdirectory of the `deepmind-research` repository):

```
git clone https://github.com/deepmind/deepmind-research.git
```
### Running an agent with a side effects penalty
Run the agent with a given penalty on an AI Safety Gridworlds environment (a complete example follows the argument lists below):

```
python -m side_effects_penalties.run_experiment --baseline <X> --dev_measure <Y> --env_name <Z> --suffix <S>
```
The following parameters can be specified for the side effects penalty:
- Baseline state (`--baseline`): starting state (`start`), inaction (`inaction`), stepwise inaction with rollouts (`stepwise`), or stepwise inaction without rollouts (`step_noroll`)
- Deviation measure (`--dev_measure`): none (`none`), unreachability (`reach`), relative reachability (`rel_reach`), or attainable utility (`att_util`)
- Discount factor for the deviation measure value function (`--value_discount`)
- Summary function to apply to the relative reachability or attainable utility deviation measure (`--dev_fun`): max(0, x) (`truncation`) or |x| (`absolute`)
- Weight for the side effects penalty relative to the reward (`--beta`)
Other arguments:
- AI Safety Gridworlds environment name (`--env_name`)
- Number of episodes (`--num_episodes`)
- Filename suffix for saving result files (`--suffix`)
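For example, the following runs the agent with the relative reachability penalty and the stepwise inaction baseline. The environment name (`box`) and the suffix are illustrative placeholders; check the AI Safety Gridworlds suite for the available environment names:

```
python -m side_effects_penalties.run_experiment --baseline stepwise --dev_measure rel_reach --env_name box --suffix my_run
```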
### Plotting the results
Make a summary data frame from the result files generated by `run_experiment` (an example follows the argument list below):

```
python -m side_effects_penalties.results_summary --compare_penalties --input_suffix <S>
```
Arguments:
- `--bar_plot`: make a data frame for a bar plot (`True`) or a learning curve plot (`False`)
- `--compare_penalties`: compare different penalties using the best beta value for each penalty (`True`), or compare different beta values for a given penalty (`False`)
- If `--compare_penalties=False`, specify the penalty parameters (`--dev_measure`, `--dev_fun`, and `--value_discount`)
- Environment name (`--env_name`)
- Filename suffix for loading result files (`--input_suffix`)
- Filename suffix for the summary data frame (`--output_suffix`)
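For example, the following summarizes the result files from the run above into a bar-plot data frame (the suffixes are illustrative placeholders matching the earlier example):

```
python -m side_effects_penalties.results_summary --compare_penalties --bar_plot --env_name box --input_suffix my_run --output_suffix my_run_summary
```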
Import the summary data frame into `plot_results.ipynb` to make a bar plot or a learning curve plot.
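Alternatively, a summary data frame can be plotted directly with pandas and seaborn. The file name and column names below are hypothetical placeholders; adjust them to match the data frame that `results_summary` actually produces:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical file name and columns -- adjust to the actual summary output.
df = pd.read_csv('summary_my_run_summary.csv')

# Bar plot comparing final performance across penalties.
sns.barplot(data=df, x='penalty', y='performance')
plt.show()
```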
## Dependencies
- Python 2.7 or 3 (tested with Python 2.7.15 and 3.6.7)
- AI Safety Gridworlds suite of safety environments
- Abseil Python common libraries
- NumPy
- Pandas
- Six
- Matplotlib
- Seaborn
## Citing this work
If you use this code in your work, please cite the accompanying paper:
```bibtex
@article{srr2019,
  title   = {Penalizing Side Effects using Stepwise Relative Reachability},
  author  = {Victoria Krakovna and Laurent Orseau and Ramana Kumar and Miljan Martic and Shane Legg},
  journal = {CoRR},
  volume  = {abs/1806.01186},
  year    = {2019},
}
```