Policy Gradient in Reinforcement Learning


title: Policy Gradient in Reinforcement Learning
categories: Reinforcement Learning


Since it’s hard to display formulas here, I reposted it here.

This article assumes you already have some knowledge about reinforcement learning and machine learning.

Directly Parameterize the Policy

There are several branches of methods in reinforcement learning. Apart from Q-learning, where you approximate the Q-function of the state and action, you can directly parameterize the policy.

Given a vector of parameters $latex \vec{\theta}$ for the policy, you have a policy $latex \pi(a|s,\vec{\theta})$ which gives you the probability of a certain action $latex a$ under state $latex s$. Then you sample an action from that distribution, take this action, and observe the next state and reward. After you run the same policy for a period, you evaluate the current policy with the data collected and figure out which actions are responsible for better reward. Then you increase the probability of the ‘good’ actions and decrease the probability of the ‘bad’ ones.

Parameterize the Probability of Actions under Given States

Obviously, the input to the policy $latex \pi(a|s,\vec{\theta})$ is the state we are currently in. Instead of using a policy that gives us a deterministic action, here we make the policy output a distribution over actions. The reason is that by making the policy random we are actually exploring the world, just like the epsilon-greedy algorithm in Q-learning.

If the action space is discrete, where you have a fixed number of actions, the output of $latex \pi(a|s,\vec{\theta})$ can be the log-probabilities of the different actions, like the output of a neural network in a classification problem. If the action space is continuous, one way is to assume your action follows a normal distribution, where the mean of the normal distribution is calculated by your function. The variance of the normal distribution can be fixed or parameterized, too.

So, if you use a linear function to parameterize the policy, your code looks like this:

import numpy as np

# discrete action: one logit per action, like the output of a classifier
a_logit = state @ W
# continuous action: the mean of a Gaussian over actions
a_mean = state @ W

Of course, neural networks are good choices to fit the policy, too.
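To sample an action from these outputs you can do something like the following (a minimal sketch; a_logit and a_mean come from the snippet above, and a_std is assumed here to be a fixed standard deviation):

import numpy as np

# discrete: sample an action from the softmax of the logits
probs = np.exp(a_logit - a_logit.max())
probs /= probs.sum()
action = np.random.choice(len(probs), p=probs)

# continuous: sample from a Gaussian around the predicted mean
a_std = 0.5  # fixed here, but it could be parameterized as well
action = np.random.normal(a_mean, a_std)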

Measure ‘Goodness’ of Policy

As we want the ‘good’ actions to become more likely and the ‘bad’ actions less likely, we need a baseline to decide whether an action is good or not. A near-optimal baseline is the expected discounted return we got under the previous policy, i.e. the value function of the policy we used to collect the data. Below is a simple implementation that calculates the discounted return along a path:

# discount factor
gamma = 0.99
for path in paths:
    # calculate the discounted return along a path
    reward = path['reward']
    path_len = len(reward)
    # our discounted returns, computed backwards
    return_t = []
    # the discounted return of the last step is just its reward
    return_t.append(reward[-1])
    # remaining steps, still in reverse order
    for i in range(1, path_len):
        return_t.append(return_t[i - 1] * gamma + reward[path_len - 1 - i])
    # reverse so that return_t[t] corresponds to time step t
    return_t = return_t[::-1]
    path['return_t'] = return_t

Because we can only visit a finite number of states (and only a small fraction of them if the state space is very large), we estimate the baseline by fitting another linear function or a neural network.
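For example, here is a minimal sketch of a linear least-squares baseline (the class name and details are my own; the text only assumes some value_function object with fit and predict methods):

import numpy as np

class LinearValueFunction:
    # a simple linear baseline fit by ridge-regularized least squares
    def __init__(self):
        self.w = None

    def _features(self, states):
        # append a bias column to the raw state features
        return np.hstack([states, np.ones((states.shape[0], 1))])

    def fit(self, states, returns):
        feats = self._features(states)
        reg = 1e-3 * np.eye(feats.shape[1])
        self.w = np.linalg.solve(feats.T @ feats + reg, feats.T @ returns)

    def predict(self, states):
        if self.w is None:
            return np.zeros(states.shape[0])
        return self._features(states) @ self.w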

Next, we can use the baseline to evaluate our current policy. First, run the current policy in the environment to collect data. Then calculate the discounted return for each path, and use the baseline function to predict the value of each state in the path. Subtract the predicted return from the actual discounted return and we know how good an action is compared to our old policy. The values we get are called advantage values. The code is:

import numpy as np
# for every path
for path in paths:
    adv_t = np.array(path['return_t']) - value_function.predict(np.vstack(path['state']))
    path['adv_t'] = adv_t

# it is good to normalize the advantage values
# concatenate all the advantage values into one numpy array
adv_t_all = np.hstack([path['adv_t'] for path in paths])
# normalize the advantage values; 1e-8 stabilizes the division
adv_t_all = (adv_t_all - adv_t_all.mean()) / (adv_t_all.std() + 1e-8)

And then we use the advantage values to help update the policy. We also need to retrain the baseline function on the new data and use it as the baseline for the updated policy.

Define Loss

Now we have the advantage values to measure the actions. The loss for each action is the product of the log-probability of the action taken and the advantage value:

surr_loss = - a_logprob * adv_t_all

Then we can improve our policy by taking a gradient step that minimizes this loss.

If you wonder why we use the log-probability of the action, I derive it at the end.
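For concreteness, here is one way the log-probabilities could be computed for the two action parameterizations above (a sketch with my own variable names; actions holds the actions actually taken):

import numpy as np

def discrete_logprob(a_logit, actions):
    # log-softmax over the logits, then pick the log-probability of each taken action
    z = a_logit - a_logit.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return logp[np.arange(len(actions)), actions]

def gaussian_logprob(a_mean, a_std, actions):
    # log-density of a diagonal Gaussian policy at the taken actions
    return np.sum(-0.5 * ((actions - a_mean) / a_std) ** 2
                  - np.log(a_std) - 0.5 * np.log(2 * np.pi), axis=1)

# e.g. a_logprob = discrete_logprob(a_logit, actions)
# surr_loss = -np.mean(a_logprob * adv_t_all)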

Combine them all

Now you have everything you need to update your policy. In summary, the steps to run the policy gradient method in an environment are as follows (a sketch of the whole loop follows the list):

  1. Define an initial policy and initial value function
  2. Run the policy in some environment to collect data. The data contain every state you experienced and the respective rewards. Each episode is called a path.
  3. Use your value function and discounted return to calculate the advantage value
  4. Calculate loss and update your policy
  5. Fit your value function to new paths
  6. Repeat 2-5 until your policy is good enough
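Putting it together, the loop looks roughly like this (a sketch only; run_policy, compute_discounted_returns, compute_advantages and policy.update are hypothetical helpers wrapping the snippets above, and n_iterations, n_episodes are placeholders):

import numpy as np

for iteration in range(n_iterations):
    # 2. run the current policy to collect paths
    paths = run_policy(env, policy, n_episodes)
    compute_discounted_returns(paths, gamma)   # as in the earlier snippet
    # 3. advantage = actual return minus baseline prediction
    compute_advantages(paths, value_function)
    # 4. compute the surrogate loss and take a gradient step on the policy
    policy.update(paths, stepsize)
    # 5. refit the baseline on the newly observed returns
    states = np.vstack([np.vstack(p['state']) for p in paths])
    returns = np.hstack([p['return_t'] for p in paths])
    value_function.fit(states, returns)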

Advanced Policy Update – Constrain Policy Update with KL Divergence

Because we use the value function trained on the old policy to predict values for the new policy, we introduce some bias. To keep this bias small, we should ensure that the policy does not change too much, which we can do by constraining the step size of the policy update. When we find we have made a big change to the policy, we decrease the step size; if the change is small, we increase it. A measure of this change is the KL divergence between the old policy and the updated policy. The formula for the KL divergence between two probability distributions is

$KL(p_1 \| p_2) = \int p_1(x)\log p_1(x)\,dx - \int p_1(x)\log p_2(x)\,dx$

For discrete actions, the KL divergence is calculated as

kl = (a_prob_old * (a_logprob_old - a_logprob)).sum(axis=1)
# take the mean KL over all states
kl = kl.mean()

For actions sampled from a normal distribution, the formula is a little more involved. After some derivation, the code looks like this:

# KL(old || new) for two univariate Gaussians
kl = np.log(a_std / a_std_old) + (a_std_old ** 2 + (a_mean_old - a_mean) ** 2) / (2 * a_std ** 2) - 0.5

Now, we can adaptively change the step size to update our policy.

# adaptively change the step size of the update, e.g. the learning rate
if kl > desired_kl * 2:
    stepsize /= 1.5
    print('stepsize -> %s' % stepsize)
elif kl < desired_kl / 2:
    stepsize *= 1.5
    print('stepsize -> %s' % stepsize)
else:
    print('stepsize OK')

Derivation of Loss

Suppose you have a path $latex \tau$, whose probability of occurring under a certain policy is $latex \pi(\tau|\theta)$, and whose discounted return is $latex R(\tau)$. The expected return of this policy is $latex E_{\tau \sim \pi(\theta)}[R(\tau)]$. Now we want to improve this policy, i.e. improve this expectation. Let's take the gradient of the expectation with respect to $latex \theta$:
$\begin{aligned} \nabla_\theta E_{\tau \sim \pi(\theta)}[R(\tau)] & = \nabla_\theta \int \pi(\tau, \theta) R(\tau)\, d\tau \\ & = \int \nabla_\theta \pi(\tau, \theta) R(\tau)\, d\tau \\ & = \int \pi(\tau, \theta) \frac{\nabla_\theta \pi(\tau, \theta)}{\pi(\tau, \theta)} R(\tau)\, d\tau \\ & = \int \pi(\tau, \theta) [\nabla_\theta \log(\pi(\tau, \theta)) R(\tau)]\, d\tau \\ & = E_{\tau \sim \pi(\theta)}[\nabla_\theta \log(\pi(\tau, \theta)) R(\tau)] \end{aligned}$
The first and last steps are exactly the definition of expectation. We can do the fourth step because
$\nabla_\theta \log(\pi(\tau, \theta)) = \frac{1}{\pi(\tau, \theta)} \nabla_\theta \pi(\tau, \theta)$
This formula tells us that to adjust our policy, we can adjust the log-probability of the path, weighted by its return. Subtracting our baseline function $latex B(\tau)$, we have
$\begin{aligned} \nabla_\theta E[R(\tau) - B(\tau)] & = E[\nabla_\theta \log(\pi(\tau, \theta)) (R(\tau) - B(\tau))] \\ & = E[\nabla_\theta \log(\pi(\tau, \theta)) A(\tau)] \end{aligned}$
This is the loss we are actually using.

A small review of image retrieval using deep features

With the rise of deep learning, a series of methods has been proposed for Content-Based Image Retrieval (CBIR). In my capstone project for the Machine Learning Nanodegree at Udacity, I decided to explore some of these methods.

Generally, I have read about several ways to search for visually similar images. One way, used in [1], [2], [3], is to use a deep learning model trained on a classification dataset. Another way is to train a model on a special dataset in which some images are labeled as similar. More complex methods may involve new network structures or new loss functions. Here, I use the first approach because it is simple and easy to understand.

This method is very simple and performs acceptably. You can just use a model trained on ImageNet directly; for example, [1], [2], [3] used such a method. After you obtain an ImageNet model, you use it to extract features from the images, and then calculate distances between images to find the similar ones. These features are usually taken from an intermediate layer of the CNN, such as the output of the last or second-to-last convolutional layer (before or after the pooling layer), or the first or second fully connected layer of the VGG16 network. Usually, the dimension of these features is still very large, so some aggregation method is used.

One way is to use PCA directly, as in [1]. Another way is to create a scheme to summarize the features and then apply PCA to them. Several schemes have been proposed, and I find the scheme in [2] simple yet very effective. In its simplest version, they used the output of the last pooling layer of the VGG network, summed the activations over spatial locations to get a 512-dimensional vector for each image, and then performed PCA and whitening on it. The features were normalized before and after the PCA transform, and it was shown that this normalization is very important when the distance metric is the Euclidean distance. A possible reason is the varied luminance conditions across images.
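To make this concrete, here is a minimal sketch of that pipeline using a pretrained VGG16 from Keras and scikit-learn's PCA (the papers do not prescribe a framework; image_paths and the number of PCA components are placeholders):

import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing import image
from sklearn.decomposition import PCA

# pretrained VGG16 without the fully connected layers;
# its output is the feature map after the last pooling layer
model = VGG16(weights='imagenet', include_top=False)

def image_descriptor(img_path):
    # load and preprocess one image
    img = image.load_img(img_path, target_size=(224, 224))
    x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
    # feature map of shape (1, 7, 7, 512): sum over spatial locations
    feat = model.predict(x).sum(axis=(1, 2)).ravel()
    # L2-normalize before PCA
    return feat / (np.linalg.norm(feat) + 1e-8)

# descriptors for the whole database, then PCA with whitening
descriptors = np.vstack([image_descriptor(p) for p in image_paths])
pca = PCA(n_components=128, whiten=True).fit(descriptors)
reduced = pca.transform(descriptors)
# L2-normalize again after PCA so Euclidean distance behaves well
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True) + 1e-8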

Lastly, paper [1] also shows that performance can be improved by retraining the classification model on a relevant dataset.

Reference:

[1] Babenko, Artem, et al. “Neural codes for image retrieval.” European conference on computer vision. Springer International Publishing, 2014.

[2] Kalantidis, Yannis, Clayton Mellina, and Simon Osindero. “Cross-dimensional weighting for aggregated deep convolutional features.” European Conference on Computer Vision. Springer International Publishing, 2016.

[3] Babenko, Artem, and Victor Lempitsky. “Aggregating local deep features for image retrieval.” Proceedings of the IEEE international conference on computer vision. 2015.

[4] Wan, Ji, et al. “Deep learning for content-based image retrieval: A comprehensive study.” Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014.

What have I learnt in doing my dissertation?

While most of my classmates spent around three to four months on their master's dissertations, I spent almost half a year on mine. After I finished it, I felt I had to write something about what I have learnt from it. In my opinion, these experiences apply to other research areas, too.

1. Start by repeating others' results

The first thing I have learnt is that I should reproduce the original results reported in a paper before I start my own research. Only by reproducing the results can you make sure that you truly understand the paper. Just reading a paper many times can give you the illusion that you have mastered the method. I once read a paper more than ten times and even implemented an algorithm according to its formulas, yet I still found I did not understand it fully. As a result, I spent weeks debugging my wrong algorithm, thinking that after fixing these bugs my code would definitely work. However, I never got the code to work. It was not until I read the paper again some days later that I found my code did not follow exactly what the paper said. People may think it is a waste of time to repeat work that has already been done by others, but in practice it is the only way to test whether you truly understand the paper. And in my experience, once you are able to repeat the previous results, you only need a little time to adapt them to new problems.

Another reason for reproducing the original results is that doing so makes you more confident when you apply the method. If you go directly to your problem and everything works fine, then you are very lucky. However, in most cases you will get stuck at some point, and then your suffering begins. You start to doubt a lot of things. Can this method work here? Have I done something wrong? Have I left out any important details? Everything will feel like a mess. However, if you have reproduced the results before, you can be confident about where the true problem is.

Moreover, papers make mistakes, too. They might fail to convey the method clearly, or they may use vague descriptions. By reproducing the results, you can verify whether you have understood the idea correctly. The weaknesses of the method can also be revealed by investigating it yourself.

2. Develop your code step by step and test it step by step

This suggestion is more about writing code. Sometimes you have to write very long and complicated code, even for academic research. Debugging a complicated program is painful, on top of debugging code and problems you are not familiar with. So now, every time I realize that a program is going to get a little complicated, I try to make sure the small piece of code I am about to add contains no bugs, and that after I add it the program still works properly. It is easy to break this rule and go too far without testing your program, because people are usually too confident: they always think their code is perfect. It is not true. Sometimes I also cannot resist the temptation to finish my code in one go without inspecting it step by step, and I always suffer the consequences as soon as I run it. There are always problems, such as tough bugs or strange results, once I launch it. And since I have added so many new components, I really cannot remember what might have caused the problem, especially when there is no crash but the results are simply not correct. Then I have to go back to the beginning of the code and check it step by step, printing out intermediate results and checking whether they are reasonable. To avoid this rework, it is best to check the result immediately after you finish a small segment of code.

3. When you are about to run a long-running program

There are several things worth attending to before you launch a long-running program. First, you should make sure that your long job is bug-free before you run the program formally and go to have a rest. Even though computers can work 24 hours a day, they cannot work on programs containing bugs. So it is important to check the results of the first several loops to make sure they are doing what you want. Besides, it is essential to find some way to check that your program will stop when the terminating conditions are satisfied. Many problems and bugs can be avoided just by checking the initial loops and the last loops.

Second, if you are dealing with matrices or numerical problems, vectorize your code. It is easy to use for-loops for everything, but for-loops are slow. You need to spend some effort to make your program run faster, and by vectorizing your code you will gain a huge speedup. It is worth the effort, and in Matlab or NumPy (if you use Python) it is very easy.
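A tiny NumPy illustration of the difference (a sketch, not a benchmark):

import numpy as np

x = np.random.rand(1_000_000)
y = np.random.rand(1_000_000)

# loop version: millions of Python-level operations, slow
total = 0.0
for i in range(len(x)):
    total += x[i] * y[i]

# vectorized version: one call into optimized C code, much faster
total = np.dot(x, y)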

If you have access to computers with many processors or even a computer cluster, do not waste their computational power by running your program in a single thread. In most cases it is easy to split your job into several independent parts and combine the results at the end. In Matlab you can use batch processing; in Python you can use multiprocessing. Making a program run in parallel is not as hard as it sounds, and just by using batch processing you can often get an almost linear speedup on a multiprocessor system. But before making your code run in parallel, again, test each part before combining them, because debugging a batch program is not as intuitive as debugging a regular program.
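Here is a minimal Python multiprocessing sketch of this split-and-combine pattern (process_chunk is just a stand-in for your real per-part work):

from multiprocessing import Pool

def process_chunk(chunk):
    # independent work on one part of the job
    return sum(x * x for x in chunk)

if __name__ == '__main__':
    # split the job into independent parts
    chunks = [range(i, i + 1000) for i in range(0, 10000, 1000)]
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, chunks)
    # combine the results at the end
    total = sum(results)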

4. Gradually increase the scale of the problem

When I was doing my dissertation, another thing that wasted a lot of my time was that at the beginning I chose a very, very large dataset with ten million nodes and used it to test an algorithm I had only run on a dataset of hundreds of nodes. The dataset was so large that even after compression it still took tens of GBs of disk space. Even though I had access to a powerful computer with 128 GB of memory, it still took a long time to make the data format compatible with my algorithm. Merely loading the dataset into memory took tens of minutes, and after loading it a quarter of the memory was already occupied. In the end, I spent several weeks preprocessing my data but made no progress. After my supervisor got to know this, he commented "it is impossible!" and I immediately realized that I was not on the right track. I switched to a small dataset and began to make progress. Later, after I studied machine learning, I learnt that if I am going to work on a large dataset, I should pick a small subset of the data and test my idea first. If everything works fine, then I can switch to a larger dataset.

5. Check if the author of the paper has the source code

It was not until I had almost finished my own version of the code for the method my studies were based on that I accidentally found the author had already shared his code online. I need not have written the code myself. Even though writing the code myself made me very familiar with every detail of the paper, it is not necessary in every case. Therefore, ever since, I have decided to search online or contact the author for the code first whenever I am about to investigate another algorithm. Usually authors are willing to share their code with you. Although they may not always be able to provide code in the programming language I need, the code they provide in another language makes it much easier to rewrite than implementing from scratch after reading the paper and its formulas.

Wrap-up

To wrap up, the most important thing I have learnt is that people are always too confident when doing new things. The correct way to do new things is to start with something small and not new, then gradually add new components, making sure everything still works each time you add something. When you are going to conduct research, it is good to first reproduce the original results to make sure you fully understand the papers, and then move forward. The same applies to dealing with large datasets: use a small dataset to get fast feedback at first, and then shift to the big ones if everything works fine.

How to use AWS EC2 spot instance to do Deep Learning?

If you need

While using EC2 for development, the standard assumption is that after setting up the development environment, you will not change it. However, if you are just starting to learn deep learning, how do you know which packages you will need to install? When the format of your input data changes, you may need to add new Python packages.

The way to solve this problem is to create an EBS volume to store the AMI and the new packages that you install, and use it as the root device to boot your instance every time. EC2 does not support

Create the EBS Volume You Work on (Skip if you already have)

First, start a spot instance from any AMI you want, but remember not to choose to delete the volume when the instance terminates. This volume will be the true root volume in the future. After configuring your environment and getting your data onto the volume, you can terminate the instance. The data and packages remain stored in the volume.

Create a Special AMI for Initializing the Spot Instances

Boot another instance from the AMI you used just now. This instance is used to create another AMI that has a lower boot priority than the previous EBS volume. After connecting to the instance, type

$ sudo e2label /dev/xvda1

This command shows the label that decides which disk you boot from. Usually, you will get

$ sudo e2label /dev/xvda1
/

or

$ sudo e2label /dev/xvda1
cloudrootfs

Use the following command to change the label to something else, such as ‘old/’:

$ sudo e2label /dev/xvda1 old/

To check the change, run the command again without a label argument:

$ sudo e2label /dev/xvda1
old/

Last, take a snapshot of this volume and save it as a new AMI. Here, we call it ‘fake-AMI’.

Attach the EBS volume and Boot into Correct Volume

The next time you start a spot instance, first check the availability zone of the EBS volume where your environment and data are stored. Start the instance in the same availability zone as the EBS volume, because you can only attach an EBS volume to an instance if they are in the same availability zone.

Use the ‘fake-AMI’ we just created to boot the instance; since no other volume is attached, it can only boot from this AMI. Then attach the EBS volume from above to the instance and reboot it. You will find the instance boots into the EBS volume, and all your environment and packages are restored!
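If you prefer to script the attach step, it can also be done with boto3, for example (a sketch; the region, volume ID, instance ID and device name are placeholders to replace with your own):

import boto3

ec2 = boto3.client('ec2', region_name='us-east-1')
# attach the EBS volume that holds your environment to the running instance
ec2.attach_volume(VolumeId='vol-0123456789abcdef0',
                  InstanceId='i-0123456789abcdef0',
                  Device='/dev/sdf')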

Is EC2 instance really fast?

When I began learning deep learning, I was told that if I did not have a super powerful GPU myself, I could use Amazon AWS EC2 g2 and p2 instances. After doing experiments on EC2, I was quite doubtful about its speed, so I ran some tests and compared it to my laptop, which has a GTX 860M.

I tested the performance of the deep learning framework MXNet using the neural-style transfer example. The code can be found at https://github.com/dmlc/mxnet/tree/master/example/neural-style

I used the command


time python nstyle.py

to measure the run time, and got the following results:

Platform: MXNet + Ubuntu 14.04 + CUDA 8.0 + cuDNN 5.1

The EC2 instances are on-demand instances on shared hardware, since they are a lot cheaper than dedicated ones.

my laptop (GTX 860M)    6m 58s
g2.xlarge (K520)        11m 17s
p2.xlarge (K80)         8m 48s

More detailed timings:

                        real time    user time    system time
my laptop (GTX 860M)    6m 58s       7m 22s       1m 12s
g2.xlarge (K520)        11m 17s      10m 58s      5m 25s
p2.xlarge (K80)         8m 48s       8m 26s       3m 5s

It seems that the system time for the EC2 instances is significantly larger. What does this mean?

But there are still situations where you may find the g2.xlarge instance is better. The biggest advantage of this EC2 instance is its very large memory: 61 GB of RAM and 24 GB of GPU memory, which means you can load almost anything you like into memory without worry. If you are using your laptop, you may need to consider the size of your memory while loading data, as deep learning datasets are usually very large!

Clustering Properties

There does not exist a clustering algorithm that has all of the properties below (this is known as the impossibility theorem for clustering).

Richness

For any assignment of objects to clusters, there is some distance matrix such that the clustering scheme returns that clustering. In other words, there is no assignment that cannot be produced by the clustering scheme.

Scale-invariance

Scaling distance by a positive value does not change the clustering.

Consistency

Shrinking intracluster distances and expanding intercluster distances does not change the clustering.

Recall and Precision

Imagine we keep showing A to our algorithm. Recall is the rate of correct predictions in this situation, i.e. the fraction of true A's that are predicted as A.

Sometimes we predict something that is not A as A. The fraction of correct predictions among all predictions whose result is A is called precision.
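In the usual notation, with TP, FP and FN the numbers of true positives, false positives and false negatives for class A:

$\text{recall} = \frac{TP}{TP + FN}, \qquad \text{precision} = \frac{TP}{TP + FP}$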

Mode, Mean and Median

The mode describes the most frequent value/bin; it is not stable, which means the mode can vary a lot when the number of samples is small.

The mean is more stable than the mode but is easily affected by extreme values (outliers).

The median is the most robust.
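A tiny NumPy example of the outlier effect:

import numpy as np

x = np.array([1, 2, 2, 3, 100])   # 100 is an outlier
print(np.mean(x))    # 21.6 -- pulled up by the outlier
print(np.median(x))  # 2.0  -- unaffected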

Maximum Independent Set

Assume you want to buy a book. If you buy it, your friends can borrow it from you, which means they do not have to buy it themselves but still enjoy the benefit, as if one of their neighbors in the network had bought it. And vice versa.

In an abstract manner: if you buy the book, you get a payoff of 1 - c; if one of your friends buys the book, you get a payoff of 1; and if neither you nor any of your friends buys the book, everyone's payoff is 0.

You will have a payoff graph like this:

[Figure: payoff graph with each node labeled 0 or 1]

Note that the nodes labeled 0 and the nodes labeled 1 constitute two maximum independent sets, i.e. the largest possible independent sets of the graph. An independent set is a set of nodes in which no two nodes are adjacent.