TensorFlow 2.0 Study Notes (Chapter 4)

Creating a dataset from slices
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
for item in dataset:
    print(item)
dataset = dataset.repeat(3).batch(7) # repeat and batch

Datasets can be created from NumPy arrays, and also from Lists, Dicts, and other data structures.
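For instance, a plain Python list works directly as input (the Dict case is shown further below); a minimal sketch with made-up values:

list_dataset = tf.data.Dataset.from_tensor_slices([10, 20, 30])  # a Python list instead of an ndarray
for item in list_dataset:
    print(item.numpy())  # 10, 20, 30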

Dataset.interleave
dataset2 = dataset.interleave(lambda v: tf.data.Dataset.from_tensor_slices(v), # map_func
                              cycle_length=5, 
                              block_length=6)
for item in dataset2.batch(6):
    print(item.numpy())

interleave maps each element of the dataset to a dataset of its own, then interleaves the results.

A common use case is mapping filenames to the contents of the corresponding files.

Parameters (note: each element a of dataset A is mapped to a dataset B, and B contains multiple elements b):

  1. map_func: the mapping function; maps one element (a) to a dataset (B).
  2. cycle_length: the degree of parallelism; how many elements (a) are drawn from (A) at a time. Defaults to the number of CPU cores.
  3. block_length: how many elements (b) are taken from each dataset (B) per turn.

Effect: each turn pulls a chunk of data from several of the dataset's items. At first the elements come out in a regular pattern; toward the end, when some inner datasets run out, the remaining ones fill in.
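A minimal sketch of how cycle_length and block_length shape the output order (values and sizes are made up for illustration):

ds = tf.data.Dataset.range(3).interleave(
    lambda i: tf.data.Dataset.from_tensor_slices(tf.fill([4], i)),  # element i -> inner dataset [i, i, i, i]
    cycle_length=2,   # pull from 2 inner datasets at a time
    block_length=2)   # take 2 consecutive elements from each inner dataset per turn
print([int(x) for x in ds])  # [0, 0, 1, 1, 0, 0, 1, 1, 2, 2, 2, 2]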

Iterating over multiple arrays at once
# Iterate over two arrays at once, x and y
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset3 = tf.data.Dataset.from_tensor_slices((x, y))
for itemx, itemy in dataset3:
    print(itemx.numpy(), itemy.numpy())
    
# Iterate over several arrays at once, using a Dict
dataset4 = tf.data.Dataset.from_tensor_slices({"feature": x, "label": y})
for item in dataset4:
    print(item["feature"].numpy(), item["label"].numpy())
Some NumPy operations
  • numpy.arange([start, ]stop, [step, ]dtype=None)

    Behaves like the slice start:stop:step.

  • numpy.array_split(ary, indices_or_sections, axis=0)

    Splits an array into multiple sub-arrays.

  • numpy.c_[]

    Translates slice objects to concatenation along the second axis.

    Concatenates matrices along the direction of increasing column index.

  • numpy.r_[]

    Translates slice objects to concatenation along the first axis.

    Concatenates matrices along the direction of increasing row index (see the sketch after this list).
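A quick sketch of the four operations above (values are arbitrary):

print(np.arange(0, 10, 2))                         # [0 2 4 6 8], like the slice 0:10:2
print(np.array_split(np.arange(7), 3))             # [array([0, 1, 2]), array([3, 4]), array([5, 6])]
print(np.c_[np.array([1, 2]), np.array([3, 4])])   # [[1 3], [2 4]]  (concatenate column-wise)
print(np.r_[np.array([1, 2]), np.array([3, 4])])   # [1 2 3 4]       (concatenate row-wise)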

Dataset.list_files
file_name_dataset = tf.data.Dataset.list_files("train\\*")    # Dataset Adapter
for file_name in file_name_dataset: # contains many Tensors; each one is a filename
    print(file_name)

Returns a dataset of filenames matching the given glob pattern(s). The function is declared as follows:

@staticmethod
list_files(file_pattern, shuffle=None, seed=None)

The file_pattern argument should be a small number of glob patterns.

file_pattern: A string, a list of strings, or a tf.Tensor of string type (scalar or vector), representing the filename glob (i.e. shell wildcard) pattern(s) that will be matched.

NOTE: The default behavior of this method is to return filenames in a non-deterministic random shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.
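A small sketch of the deterministic-order case (the "train/*.csv" pattern is a placeholder, not a path from these notes):

files = tf.data.Dataset.list_files("train/*.csv", shuffle=False)  # deterministic order
for f in files.take(3):
    print(f.numpy())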

Dataset.map

Maps map_func across the elements of this dataset.

Applies map_func to every element of the dataset.

This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input. map_func can be used to change both the values and the structure of a dataset's elements. For example, adding 1 to each element, or projecting a subset of element components.
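Two minimal examples of those two uses (values made up):

ds = tf.data.Dataset.range(5).map(lambda x: x + 1)           # change values: add 1 to each element
print([int(v) for v in ds])                                  # [1, 2, 3, 4, 5]

pairs = tf.data.Dataset.from_tensor_slices(([1, 2], [3, 4]))
firsts = pairs.map(lambda a, b: a)                           # change structure: keep only one component
print([int(v) for v in firsts])                              # [1, 2]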

TextLineDataset

A Dataset comprising lines from one or more text files.

# Preprocess 4 files concurrently, and interleave blocks of 16 records 
# from each file. 
filenames = ["/var/data/file1.txt", "/var/data/file2.txt", 
             "/var/data/file3.txt", "/var/data/file4.txt"] 
dataset = tf.data.Dataset.from_tensor_slices(filenames) 
def parse_fn(filename): 
  return tf.data.Dataset.range(10) 
dataset = dataset.interleave(lambda x: 
    tf.data.TextLineDataset(x).map(parse_fn, num_parallel_calls=1), 
    cycle_length=4, block_length=16)

In the example above, we first build 4 filenames, then map each file to a dataset and operate on each element of that dataset (i.e. each line).
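A self-contained sketch of the basic usage ("demo.txt" is a made-up filename used only for illustration):

with open("demo.txt", "w") as f:
    f.write("line one\nline two\nline three\n")

for line in tf.data.TextLineDataset("demo.txt"):   # each element is one line, as a string Tensor
    print(line.numpy().decode("utf-8"))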

Parsing CSV
sample_str = ",".join([repr(x) for x in range(1,6)]) # 1,2,3,4,5
record_defaults = [tf.constant(0, dtype=tf.int32), 0, np.nan, "hello", tf.constant([])]
# tf.int32 => 1,    int32
# 0        => 2,    int32
# nan      => 3.0,  float32, 自动推测
# "str"    => b"4", string
# 空数组    => 5.0,  float32, 自动推测
parsed_fields = tf.io.decode_csv(sample_str, record_defaults=record_defaults) # 解析CSV
pprint(parsed_fields)

Parses a single CSV record. Convert CSV records to tensors. Each column maps to one tensor.

tf.io.decode_csv(
    records, record_defaults, field_delim=',', use_quote_delim=True, na_value='',
    select_cols=None, name=None
)
  • records: A Tensor of type string. Each string should be one line of a CSV record.
  • record_defaults: A list of Tensor objects with specific types. Acceptable types are float32, float64, int32, int64, string. One tensor per column of the input record, with either a scalar default value for that column or an empty vector if the column is required.
def parse_csv_line(line, n_fields=9):
    defs = [np.nan] * n_fields
    parsed_fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(parsed_fields[0:-1])   # stack the separate scalar tensors into one tensor of N-1 elements
    y = tf.stack(parsed_fields[-1:])    # keep the last field as the label (note the slice, not an index)
    return x, y
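A quick check on a hand-written line, assuming 9 numeric fields:

line = "1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0"
x, y = parse_csv_line(line, n_fields=9)
print(x.numpy())   # [1. 2. 3. 4. 5. 6. 7. 8.]
print(y.numpy())   # [9.]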

def csv_reader_dataset(filenames, n_readers=5, batch_size=32, n_parse_threads=4, 
                       shuffle_buffer_size=10000):
    return tf.data.Dataset.list_files(filenames)\
        .repeat() \
        .interleave(
            lambda file_name: tf.data.TextLineDataset(file_name).skip(1),  # skip each file's header line
            cycle_length=n_readers
        )\
        .shuffle(shuffle_buffer_size)\
        .map(map_func=parse_csv_line, num_parallel_calls=n_parse_threads) \
        .batch(batch_size)
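A hypothetical usage sketch ("train_*.csv" is a placeholder pattern, and the 8-feature / 1-label shapes assume the 9-field layout above):

train_set = csv_reader_dataset("train_*.csv", batch_size=32)
for x_batch, y_batch in train_set.take(1):
    print(x_batch.shape, y_batch.shape)   # (32, 8) (32, 1)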
tf.Example

A tf.Example is a {"string": tf.train.Feature} mapping.

A Feature can hold one of the following:

  1. tf.train.BytesList (the following types can be coerced): string, byte
  2. tf.train.FloatList (the following types can be coerced): float (float32), double (float64)
  3. tf.train.Int64List (the following types can be coerced): bool, enum, int32, uint32, int64, uint64
favorite_books = [name.encode("utf-8") for name in ["machine learning", "cc150"]]
favorite_books_bytelist = tf.train.BytesList(value=favorite_books)
hours_floatlist = tf.train.FloatList(value=[15.5, 17, 12, 11, 8])
age_int64list = tf.train.Int64List(value=[42])
feature = {
        "favorite_books" : tf.train.Feature(bytes_list=favorite_books_bytelist),
        "hours": tf.train.Feature(float_list=hours_floatlist),
        "age": tf.train.Feature(int64_list=age_int64list)
}
features = tf.train.Features(feature=feature)
example_proto = tf.train.Example(features=features)

Judging from the code, an Example contains one Features, and a Features contains a Dict<string, Feature>. The content of the Example above looks like this:

feature {
  key: "age"
  value {
    int64_list {
      value: 42
    }
  }
}
feature {
  key: "favorite_books"
  value {
    bytes_list {
      value: "maching learning"
      value: "cc150"
    }
  }
}
feature {
  key: "hours"
  value {
    float_list {
      value: 15.5
      value: 17.0
      value: 12.0
      value: 11.0
      value: 8.0
    }
  }
}
Saving TFRecords
serialized_example = example_proto.SerializeToString()
options = tf.io.TFRecordOptions(compression_type="GZIP") # compression options (optional)
with tf.io.TFRecordWriter(file_name_full_path, options=options) as writer: # options can be omitted
    writer.write(serialized_example)
    writer.write(another_example) # multiple Examples can be written to one file
TFRecordDataset
expected_features = {
    "favorite_books": tf.io.VarLenFeature(dtype=tf.string), 
    "hours": tf.io.VarLenFeature(dtype=tf.float32), # variable-length feature (float)
    "age": tf.io.FixedLenFeature(shape=[], dtype=tf.int64) # fixed-length feature (int)
}
dataset = tf.data.TFRecordDataset([file_name_full_path], compression_type="GZIP")
for serialized_example_tensor in dataset: # serialized_example_tensor is a string Tensor
    example = tf.io.parse_single_example(serialized_example_tensor, expected_features)
    books = tf.sparse.to_dense(example["favorite_books"], default_value=b"")
    books = [book.numpy().decode('utf-8') for book in books]

TFRecordDataset: A Dataset comprising records from one or more TFRecord files.
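A hedged sketch of folding the parsing step into a batched pipeline, reusing file_name_full_path and expected_features from above (the batch size and thread count are arbitrary):

def parse_batch(serialized_batch):
    # tf.io.parse_example parses a whole batch of serialized Examples at once
    return tf.io.parse_example(serialized_batch, expected_features)

tfrecord_dataset = tf.data.TFRecordDataset([file_name_full_path], compression_type="GZIP") \
    .batch(2) \
    .map(parse_batch, num_parallel_calls=4)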


Reposted from blog.csdn.net/u010099177/article/details/104682747