Paddle Study Notes: Preparing Data

PaddlePaddle Fluid supports two ways of feeding data:

1. Synchronous Python Reader: configure the data input layer with fluid.layers.data, then pass the training data through executor.run(feed=...) on a fluid.Executor or fluid.ParallelExecutor.

2. Asynchronous py_reader interface: configure the data input layer with fluid.layers.py_reader, set the data source with py_reader's decorate_paddle_reader or decorate_tensor_provider method, then read the data with fluid.layers.read_file.

Use fluid.layers.data() to configure the data layers the network needs. The code is as follows:

# -*- coding: utf-8 -*-
import paddle.fluid as fluid
import numpy as np

# shape omits the batch dimension; Fluid prepends -1 by default
image = fluid.layers.data(name="image", shape=[3, 224, 224])
label = fluid.layers.data(name="label", shape=[1], dtype="int64")

# use image and label as layer inputs
prediction = fluid.layers.fc(input=image, size=1000, act="softmax")
loss = fluid.layers.cross_entropy(input=prediction, label=label)

# run on CPU with fluid.CPUPlace(); use fluid.CUDAPlace(0) for GPU
exe = fluid.Executor(fluid.CPUPlace())
exe.run(fluid.default_startup_program())  # initialize parameters first
# feed one batch of random data; keys match the data layer names
exe.run(feed={
    "image": np.random.random(size=(32, 3, 224, 224)).astype("float32"),
    "label": np.random.randint(0, 1000, size=(32, 1)).astype("int64")
}, fetch_list=[loss])

The following code reads training and test data with PyReader objects:

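import numpy as np
import paddle
import paddle.fluid as fluid
import paddle.dataset.mnist as mnist

# a standalone py_reader; shapes include the batch dimension (-1)
py_reader = fluid.layers.py_reader(
    capacity=64,
    shapes=[(-1, 3, 64, 64), (-1, 1)],  # (-1, 1) is the label shape
    dtypes=['float32', 'int64'],
    name='py_reader',
    use_double_buffer=True
)
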
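def network(is_train):
    # use a different reader name for the training and test programs
    reader = fluid.layers.py_reader(
        capacity=10,
        shapes=((-1, 784), (-1, 1)),
        dtypes=('float32', 'int64'),
        name="train_reader" if is_train else "test_reader",
        use_double_buffer=True
    )
    img, label = fluid.layers.read_file(reader)
    # placeholder model (an assumption; the notes do not define the loss)
    prediction = fluid.layers.fc(input=img, size=10, act="softmax")
    loss = fluid.layers.mean(fluid.layers.cross_entropy(input=prediction, label=label))
    return loss, reader
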
train_prog = fluid.Program()
train_startup = fluid.Program()  # initializes parameters
with fluid.program_guard(train_prog, train_startup):
    # unique_name.guard() gives both programs the same parameter names
    with fluid.unique_name.guard():
        train_loss, train_reader = network(True)
        adam = fluid.optimizer.Adam(learning_rate=0.01)
        adam.minimize(train_loss)

test_prog = fluid.Program()
test_startup = fluid.Program()
with fluid.program_guard(test_prog, test_startup):
    with fluid.unique_name.guard():
        test_loss, test_reader = network(False)

# use the PyReader objects to train and test the model
place = fluid.CUDAPlace(0)
startup_exe = fluid.Executor(place)
startup_exe.run(train_startup)
startup_exe.run(test_startup)

trainer = fluid.ParallelExecutor(
    use_cuda=True, loss_name=train_loss.name, main_program=train_prog
)

# share_vars_from lets the tester evaluate the parameters the trainer updates
tester = fluid.ParallelExecutor(
    use_cuda=True, share_vars_from=trainer, main_program=test_prog
)

# mnist here is paddle.dataset.mnist
train_reader.decorate_paddle_reader(
    paddle.reader.shuffle(paddle.batch(mnist.train(), 512), buf_size=8192)
)
test_reader.decorate_paddle_reader(paddle.batch(mnist.test(), 512))

for epoch_id in range(10):
    train_reader.start()
    try:
        while True:
            print('train loss', np.array(trainer.run(fetch_list=[train_loss.name])))
    except fluid.core.EOFException:
        print('end of epoch', epoch_id)
        train_reader.reset()
    test_reader.start()
    try:
        while True:
            print('test loss', np.array(
                tester.run(fetch_list=[test_loss.name])
            ))
    except fluid.core.EOFException:
        print('end of testing')
        test_reader.reset()
# Before each epoch, call start() to launch the PyReader object. At the end of each epoch, read_file raises fluid.core.EOFException;
# after catching it, call reset() to restore the PyReader object's state so it can be started again.
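
The data source handed to decorate_paddle_reader follows Paddle's reader convention: a function that, when called, returns an iterable of samples, with a batching wrapper that groups samples into lists. Below is a minimal pure-Python sketch of that convention. It does not need Paddle installed; sample_reader and batch_reader are hypothetical stand-ins for mnist.train() and paddle.batch(), not their real implementations.

```python
# Pure-Python sketch of the reader convention (no Paddle required).
# sample_reader and batch_reader are hypothetical stand-ins for
# mnist.train() and paddle.batch(); they are not Paddle's code.

def sample_reader():
    """Return a reader: a function that yields (features, label) samples."""
    def reader():
        for i in range(5):
            yield [float(i)] * 4, i % 2
    return reader


def batch_reader(reader, batch_size):
    """Wrap `reader` so it yields lists of at most `batch_size` samples."""
    def batched():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # keep the final, smaller batch
            yield batch
    return batched


batches = list(batch_reader(sample_reader(), 2)())
print([len(b) for b in batches])
```

Five samples with a batch size of 2 give batches of 2, 2 and 1 samples. Calling the reader again starts a fresh pass over the data, which is what allows one pass per epoch.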

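Paddle Fluid supports py_reader for feeding data from Python to the C++ side. Unlike feeding NumPy arrays through feed, with py_reader the data loading on the Python side and the data reading done by executor.run() on the C++ side proceed asynchronously.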
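
To make the asynchronous behaviour concrete, here is a toy producer/consumer sketch using only the Python standard library. ToyAsyncReader and EndOfData are hypothetical names that mimic the start() / EOFException / reset() cycle; they are not how Paddle implements py_reader, and the bounded queue only plays the role of the capacity argument.

```python
import queue
import threading


class EndOfData(Exception):
    """Hypothetical stand-in for fluid.core.EOFException."""


class ToyAsyncReader:
    """Toy model of an asynchronous reader; not Paddle's implementation."""

    _SENTINEL = object()

    def __init__(self, capacity, source):
        self._capacity = capacity
        self._source = source  # callable returning an iterable of batches
        self._queue = None
        self._thread = None

    def start(self):
        # Launch a producer thread that fills a bounded queue.
        self._queue = queue.Queue(maxsize=self._capacity)
        self._thread = threading.Thread(target=self._produce, daemon=True)
        self._thread.start()

    def _produce(self):
        for batch in self._source():
            self._queue.put(batch)  # blocks while the queue is full
        self._queue.put(self._SENTINEL)  # signal the end of the epoch

    def read(self):
        # Consumer side: blocks until the producer has a batch ready.
        item = self._queue.get()
        if item is self._SENTINEL:
            raise EndOfData()
        return item

    def reset(self):
        # Wait for the producer to exit and clear state for the next start().
        self._thread.join()
        self._queue = None
        self._thread = None


reader = ToyAsyncReader(capacity=2, source=lambda: iter(range(5)))
seen = []
for epoch in range(2):
    reader.start()
    try:
        while True:
            seen.append(reader.read())
    except EndOfData:
        reader.reset()
print(seen)
```

The producer thread keeps loading while the consumer processes earlier batches, so the two sides overlap instead of taking turns. That overlap is the point of the asynchronous mode.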

In addition to the two approaches above, here is a brief introduction to LoD-Tensor.

LoD-Tensor is a concept specific to Fluid. It attaches sequence information to a Tensor so that variable-length data can be handled:

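Much training data consists of variable-length sequences. The usual approach is to fix a common length and pad every shorter sequence with zeros.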
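
For reference, the padding approach looks roughly like this in plain Python. pad_sequences is a hypothetical helper written for these notes, not a Paddle API.

```python
def pad_sequences(sequences, pad_value=0):
    """Pad every sequence on the right up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]


batch = [[1, 2, 3], [4], [5, 6]]
padded = pad_sequences(batch)
print(padded)  # [[1, 2, 3], [4, 0, 0], [5, 6, 0]]
```

Three of the nine stored values here are padding. That wasted storage and computation is what LoD-Tensor avoids.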

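In Fluid, thanks to LoD-Tensor, the sequences in a mini-batch are not required to have the same length. No padding is needed, and tasks with sequence requirements, such as NLP, can still be handled.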

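In Fluid, the sequence information of a LoD-Tensor can be expressed in two forms: raw lengths and offsets. Internally, Paddle uses the offset form for faster sequence access. The Python API uses the raw-length form, which is easier to understand and compute with, and calls it recursive_sequence_lengths.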
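
The two forms carry the same information. A small pure-Python sketch of the conversion for a single LoD level follows; lengths_to_offsets and offsets_to_lengths are hypothetical helper names, not Paddle functions, and a multi-level LoD would apply the same conversion to each level.

```python
from itertools import accumulate


def lengths_to_offsets(lengths):
    """[3, 1, 2] -> [0, 3, 4, 6]: running sum with a leading zero."""
    return [0] + list(accumulate(lengths))


def offsets_to_lengths(offsets):
    """[0, 3, 4, 6] -> [3, 1, 2]: differences of neighbouring offsets."""
    return [end - start for start, end in zip(offsets, offsets[1:])]


# Three sequences of lengths 3, 1 and 2 stored back to back, without padding.
flat = [1, 2, 3, 4, 5, 6]
lengths = [3, 1, 2]
offsets = lengths_to_offsets(lengths)

# With offsets, sequence i is a direct slice; no summing of lengths is needed.
second = flat[offsets[1]:offsets[2]]
print(offsets, second)  # [0, 3, 4, 6] [4]
```

This also shows why the offset form gives faster access: sequence i is located by reading two adjacent offsets, whereas the length form needs a sum over all preceding lengths first.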
