lijiwei3: applying RL to dialogue

Studying Q-learning and policy gradient.

Q-learning:

```
Initialize Q arbitrarily                        // initialize the Q table with arbitrary values
Repeat (for each episode):                      // one episode runs from the bird's birth to its death
    Initialize S                                // the bird starts flying; S is the initial state
    Repeat (for each step of episode):
        Using the current Q and state S, pick an action A with some policy   // e.g. ε-greedy
        Take action A; the bird reaches new state S' and receives reward R   // e.g. 1, 50 or -1000
        Q(S,A) ← (1-α)*Q(S,A) + α*[R + γ*max_a Q(S',a)]                      // update the entry for (S,A) in Q
        S ← S'
    until S is terminal                         // i.e. until the bird dies
```

This version is more straightforward.
Q holds the maximum expected future reward, so it serves as the decision matrix: for the row corresponding to the current state, pick the action A with the largest Q value.
S is the state and A is an action available in state S. R is the reward produced after taking A.
α (alpha) is the learning rate; it controls how strongly each update changes the current estimate.
γ (gamma) is a parameter smaller than 1; it weights how much the best future choice (the maximum Q of the next state) influences the present estimate.

Understanding Q: every step has its own reward, but Q depends on both S and A. To collect more reward we cannot look only at the immediate reward; we also have to account for rewards still to come. Starting from the current time t, the total (discounted) future reward is R_t = r_t + γ*r_{t+1} + γ^2*r_{t+2} + … = r_t + γ*R_{t+1}.
The meaning of γ: the environment is usually stochastic, so performing a particular action does not always lead to a particular state, and the further a reward lies in the future, the less we should count on it. Setting γ = 1 would only make sense in a deterministic environment, where the same action always yields the same reward. For example, with γ = 0.9 a reward of 10 received two steps ahead contributes only 0.9² × 10 = 8.1 to R_t.

So r + γ*max_a Q(S',a) is the new estimate of Q; subtracting the old Q(S,A) gives the difference (the temporal-difference error) by which the estimate is updated.
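To make the update rule concrete, here is a minimal tabular Q-learning sketch (my own illustration, not from the post's source). It assumes states are integer indices, the environment interface is gym-style with env.step(a) returning (next_state, reward, done), and the sizes and hyperparameters are placeholders.

```python
import numpy as np

N_STATES, N_ACTIONS = 100, 2           # placeholder sizes for the example
alpha, gamma, epsilon = 0.1, 0.9, 0.1  # learning rate, discount, exploration rate

Q = np.zeros((N_STATES, N_ACTIONS))    # "Initialize Q arbitrarily" (here: zeros)

def run_episode(env):
    """One episode; assumes env.reset() -> state and env.step(a) -> (state, reward, done)."""
    s = env.reset()                                # initial state S
    done = False
    while not done:                                # until S is terminal
        # epsilon-greedy policy derived from the current Q
        if np.random.rand() < epsilon:
            a = np.random.randint(N_ACTIONS)
        else:
            a = int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)              # take A, observe R and S'
        # Q(S,A) <- (1-alpha)*Q(S,A) + alpha*[R + gamma*max_a Q(S',a)]
        target = r + gamma * np.max(Q[s_next]) * (not done)
        Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
        s = s_next                                 # S <- S'
```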

https://blog.csdn.net/itplus/article/details/9361915

Introduction to policy gradient:

It can choose actions in a continuous action space.

Select an action -> backpropagate -> the reward/penalty signal decides how much to backpropagate (see the sketch below).

Both the state and the policy are represented as distributions.
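As a rough sketch of "the reward decides how much to backpropagate", here is a minimal REINFORCE-style policy gradient for a linear softmax policy (my own toy example, not lijiwei3's code). The gym-style environment interface, the assumption that states are feature vectors of length N_FEATURES, and all sizes and hyperparameters are placeholders.

```python
import numpy as np

N_FEATURES, N_ACTIONS = 8, 3                 # placeholder sizes
lr, gamma = 0.01, 0.99                       # learning rate and discount factor
W = np.zeros((N_ACTIONS, N_FEATURES))        # policy parameters ("theta")

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def run_episode(env):
    """Assumes env.reset() -> feature vector and env.step(a) -> (features, reward, done)."""
    global W                                 # update the module-level parameters in place
    states, actions, rewards = [], [], []
    s, done = env.reset(), False
    while not done:
        p = softmax(W @ s)                   # pi(a|s): a distribution over actions
        a = int(np.random.choice(N_ACTIONS, p=p))  # sample an action from the policy
        s_next, r, done = env.step(a)
        states.append(s); actions.append(a); rewards.append(r)
        s = s_next

    # discounted return G_t for every time step
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    returns.reverse()

    # policy-gradient update: grad log pi(a|s), scaled by the return
    for s, a, G in zip(states, actions, returns):
        p = softmax(W @ s)
        grad_logp = -np.outer(p, s)          # d log pi(a|s) / dW, softmax part
        grad_logp[a] += s                    # plus the indicator for the taken action
        W += lr * G * grad_logp              # larger return => larger update
```

The last line is the whole point: the gradient of log pi(a|s) is the same quantity a supervised classifier would backpropagate, but it is scaled by the return G, so actions from high-reward episodes are reinforced more strongly.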

Good! Next, let's look at lijiwei3's code.

Think about what functionality it needs to implement.

1.

```python
# imports shared by the code excerpts below
import os
import time

import tensorflow as tf


def train():
    with tf.Session() as sess:
        # build (or restore) the three supervised seq2seq models and the RL model
        st_model = create_st_model(sess, gst_config, True, gst_config.name_model)
        bk_model = create_st_model(sess, gbk_config, True, gbk_config.name_model)
        cc_model = create_st_model(sess, gcc_config, True, gcc_config.name_model)
        rl_model = create_rl_model(sess, grl_config, False, grl_config.name_model)

        # running totals reported every steps_per_checkpoint steps (reporting code omitted here)
        step_time, loss = 0.0, 0.0
        current_step = 0
        while True:
            # Get a batch and make a step.
            # train_set and bucket_id come from data preparation earlier in the script (omitted in this excerpt)
            start_time = time.time()
            encoder_inputs, decoder_inputs, target_weights, batch_source_encoder, _ = \
                rl_model.get_batch(train_set, bucket_id)

            updata, norm, step_loss = rl_model.step_rl(
                sess, st_model=st_model, bk_model=bk_model,
                encoder_inputs=encoder_inputs, decoder_inputs=decoder_inputs,
                target_weights=target_weights,
                batch_source_encoder=batch_source_encoder, bucket_id=bucket_id)

            step_time += (time.time() - start_time) / grl_config.steps_per_checkpoint
            loss += step_loss / grl_config.steps_per_checkpoint
            current_step += 1
```
2.
```python
def create_st_model(session, st_config, forward_only, name_scope):
    with tf.variable_scope(name_or_scope=name_scope):
        st_model = gst_rnn_model.gst_model(gst_config=st_config, name_scope=name_scope, forward_only=forward_only)
        ckpt = tf.train.get_checkpoint_state(os.path.join(st_config.train_dir, "checkpoints"))
        if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
            print("Read %s model from %s" % (name_scope, ckpt.model_checkpoint_path))
            st_model.saver.restore(session, ckpt.model_checkpoint_path)
        else:
            print("Creating %s model with fresh parameters" % name_scope)
            global_variables = [gv for gv in tf.global_variables() if name_scope in gv.name]
            session.run(tf.variables_initializer(global_variables))
            print("Created %s model with fresh parameters" % name_scope)
        return st_model
```


3.
```python
def create_rl_model(session, rl_config, forward_only, name_scope):
    with tf.variable_scope(name_or_scope=name_scope):
        rl_model = grl_rnn_model.grl_model(grl_config=rl_config, name_scope=name_scope, forward=forward_only)
        ckpt = tf.train.get_checkpoint_state(os.path.join(rl_config.train_dir, "checkpoints"))
        if ckpt and tf.train.checkpoint_exists(ckpt.model_checkpoint_path):
            print("Read %s model from %s" % (name_scope, ckpt.model_checkpoint_path))
            rl_model.saver.restore(session, ckpt.model_checkpoint_path)
        else:
            print("Creating %s model with fresh parameters" % name_scope)
            global_variables = [gv for gv in tf.global_variables() if name_scope in gv.name]
            session.run(tf.variables_initializer(global_variables))
            print("Created %s model with fresh parameters" % name_scope)
        return rl_model
```

What create_st_model and create_rl_model do:

Exactly what the names say: they create the corresponding models (restoring from a checkpoint when one exists, otherwise initializing fresh parameters), so the only two model classes that really matter are gst_model and grl_model.

grl_model is in turn built on grl_seq2seq; it is not entirely obvious at first why the standard seq2seq code had to be modified.

```
def _extract_argmax_and_embed(embedding, output_projection=None, update_embedding=True):
def loop_function(prev, _):
def rnn_decoder(decoder_inputs, initial_state, cell, loop_function=None, scope=None):
def beam_rnn_decoder
def embedding_rnn_decoder
def embedding_rnn_seq2seq
def attention_decoder
def embedding_attention_decoder
def embedding_attention_seq2seq
def decoder
def sequence_loss_by_example
def sequence_loss
def model_with_buckets
def decode_model_with_buckets
```

12.11

So seq2seq trained with RL is not very different from ordinary seq2seq training; the only real change is the gradient used to update theta at each step, which is reweighted so as to maximize the reward R.
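A rough sketch of that difference (my own illustration, not the repo's actual grl_seq2seq code), using placeholder numbers for the decoder's per-token log-probabilities and for the reward:

```python
import numpy as np

# log p(y_t | y_<t, x) for each token of a reply, as the decoder would produce
# (placeholder values for illustration)
token_logprobs = np.array([-1.2, -0.4, -0.9, -2.1])
reward = 0.7   # scalar reward R for the whole sampled reply

# Ordinary seq2seq (MLE): the tokens are the ground-truth reply, and we simply
# minimize the negative log-likelihood (cross-entropy).
mle_loss = -token_logprobs.sum()

# RL seq2seq: the tokens are a reply *sampled* from the model, and the same
# log-likelihood term is weighted by the reward, so the gradient w.r.t. theta
# becomes R * grad(log p(reply)) -- the update that maximizes expected R.
rl_loss = -reward * token_logprobs.sum()
```

With the reward fixed at 1 and the ground-truth reply as the target, rl_loss reduces to the ordinary cross-entropy loss, which is why the rest of the seq2seq machinery can stay the same.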


Reposted from blog.csdn.net/yagreenhand/article/details/86503372