Creating a dataset from slices
import numpy as np
import tensorflow as tf

dataset = tf.data.Dataset.from_tensor_slices(np.arange(10))
for item in dataset:
    print(item)
dataset = dataset.repeat(3).batch(7)  # repeat, then batch
The dataset above is created from a NumPy array; a List, a Dict, or other data structures can be used as well.
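For reference, a minimal sketch (assuming TF 2.x eager mode) of what repeat(3).batch(7) yields on the ten-element dataset above: 30 elements in total, so the last batch holds only two.

for batch in tf.data.Dataset.from_tensor_slices(np.arange(10)).repeat(3).batch(7):
    print(batch.numpy())
# [0 1 2 3 4 5 6]
# [7 8 9 0 1 2 3]
# [4 5 6 7 8 9 0]
# [1 2 3 4 5 6 7]
# [8 9]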
Dataset.interleave
dataset2 = dataset.interleave(lambda v: tf.data.Dataset.from_tensor_slices(v),  # map_fn
                              cycle_length=5,
                              block_length=6)
for item in dataset2.batch(6):
    print(item.numpy())
interleave maps every element of the dataset to a dataset of its own.
The common use case is mapping file names to the contents of those files.
Parameters (notation: every element a of dataset A is mapped to a dataset B, which contains multiple elements b):
- map_fn: the mapping function; maps one element (a) to a dataset (B).
- cycle_length: the degree of parallelism; how many elements (a) are consumed from (A) at a time. Defaults to the number of CPU cores.
- block_length: how many elements (b) are taken from each dataset (B) per turn.
Effect: each turn takes a block of data from every item of dataset. At first the pattern is regular; toward the end, when some inner datasets run out, the remaining ones fill the gaps. A concrete ordering example is sketched below.
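A minimal self-contained sketch of that ordering (the toy values here are an assumption, not from the original): with cycle_length=2 and block_length=3, interleave pulls from two inner datasets at a time and emits three elements from each per turn; once an inner dataset is exhausted, the next one takes its slot.

ds = tf.data.Dataset.from_tensor_slices([[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3]])
ds = ds.interleave(lambda v: tf.data.Dataset.from_tensor_slices(v),
                   cycle_length=2, block_length=3)
print([e.numpy() for e in ds])
# [1, 1, 1, 2, 2, 2, 1, 2, 3, 3, 3, 3]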
Iterating over several arrays at once
# Iterate over two arrays, x and y, at the same time
x = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array(['cat', 'dog', 'fox'])
dataset3 = tf.data.Dataset.from_tensor_slices((x, y))
for itemx, itemy in dataset3:
    print(itemx.numpy(), itemy.numpy())

# Iterate over several arrays at once, using a Dict
dataset4 = tf.data.Dataset.from_tensor_slices({"feature": x, "label": y})
for item in dataset4:
    print(item["feature"].numpy(), item["label"].numpy())
Some NumPy operations
- numpy.arange([start, ]stop, [step, ]dtype=None): behaves like start:stop:step.
- numpy.array_split(ary, indices_or_sections, axis=0): splits an array into multiple sub-arrays.
- numpy.c_[...]: "Translates slice objects to concatenation along the second axis", i.e. concatenates matrices in the direction of increasing column index.
- numpy.r_[...]: "Translates slice objects to concatenation along the first axis", i.e. concatenates matrices in the direction of increasing row index.
Note that c_ and r_ are index objects, used with brackets rather than called like functions. Short examples follow.
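A few quick checks of the four operations (a minimal sketch; the printed values follow NumPy's documented behavior):

import numpy as np

print(np.arange(1, 10, 2))              # [1 3 5 7 9], like 1:10:2
print(np.array_split(np.arange(8), 3))  # [array([0, 1, 2]), array([3, 4, 5]), array([6, 7])]
print(np.c_[np.array([1, 2]), np.array([3, 4])])  # [[1 3]
                                                  #  [2 4]]  (joined as columns)
print(np.r_[np.array([1, 2]), np.array([3, 4])])  # [1 2 3 4] (joined as rows)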
Dataset.list_files
file_name_dataset = tf.data.Dataset.list_files("train\\*")  # a dataset of file names
for file_name in file_name_dataset:  # yields one string Tensor per matching file name
    print(file_name)
Returns a dataset of file names matching the given glob pattern. The declaration is as follows:
@staticmethod
list_files(file_pattern, shuffle=None, seed=None)
The file_pattern argument should be a small number of glob patterns.
- file_pattern: A string, a list of strings, or a tf.Tensor of string type (scalar or vector), representing the filename glob (i.e. shell wildcard) pattern(s) that will be matched.
NOTE: The default behavior of this method is to return filenames in a non-deterministic random shuffled order. Pass a seed or shuffle=False to get results in a deterministic order.
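So, per the note above, a deterministic listing would look like this (same pattern as the earlier snippet):

file_name_dataset = tf.data.Dataset.list_files("train\\*", shuffle=False)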
Dataset.map
Maps map_func across the elements of this dataset, i.e. applies map_func to every element.
This transformation applies map_func to each element of this dataset, and returns a new dataset containing the transformed elements, in the same order as they appeared in the input. map_func can be used to change both the values and the structure of a dataset's elements. For example, adding 1 to each element, or projecting a subset of element components.
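For instance, the "adding 1 to each element" case mentioned above, as a minimal sketch:

dataset = tf.data.Dataset.from_tensor_slices(np.arange(5))
dataset = dataset.map(lambda x: x + 1)
print([e.numpy() for e in dataset])  # [1, 2, 3, 4, 5]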
TextLineDataset
A Dataset comprising lines from one or more text files.
# Preprocess 4 files concurrently, and interleave blocks of 16 records
# from each file.
filenames = ["/var/data/file1.txt", "/var/data/file2.txt",
             "/var/data/file3.txt", "/var/data/file4.txt"]
dataset = tf.data.Dataset.from_tensor_slices(filenames)

def parse_fn(line):
    return line  # placeholder parser: a map_func must return tensors, one per line

dataset = dataset.interleave(
    lambda x: tf.data.TextLineDataset(x).map(parse_fn, num_parallel_calls=1),
    cycle_length=4, block_length=16)
In the example above, four file names are created first; each file is then mapped to a dataset of its lines, and every element of that dataset (i.e. every line) is passed through parse_fn.
Parsing CSV
sample_str = ",".join([repr(x) for x in range(1,6)]) # 1,2,3,4,5
record_defaults = [tf.constant(0, dtype=tf.int32), 0, np.nan, "hello", tf.constant([])]
# tf.int32 => 1, int32
# 0 => 2, int32
# nan => 3.0, float32, 自动推测
# "str" => b"4", string
# 空数组 => 5.0, float32, 自动推测
parsed_fields = tf.io.decode_csv(sample_str, record_defaults=record_defaults) # 解析CSV
pprint(parsed_fields)
Parses one line of a CSV record: "Convert CSV records to tensors. Each column maps to one tensor."

tf.io.decode_csv(
    records, record_defaults, field_delim=',', use_quote_delim=True, na_value='',
    select_cols=None, name=None
)

- records: A Tensor of type string. Each string should be one row of the CSV.
- record_defaults: A list of Tensor objects with specific types. Acceptable types are float32, float64, int32, int64, string. One tensor per column of the input record, with either a scalar default value for that column or an empty vector if the column is required.
def parse_csv_line(line, n_fields=9):
    defs = [np.nan] * n_fields
    parsed_fields = tf.io.decode_csv(line, record_defaults=defs)
    x = tf.stack(parsed_fields[0:-1])  # turn the separate scalar tensors into one tensor of N elements
    y = tf.stack(parsed_fields[-1])    # the last field is the label
    return x, y
def csv_reader_dataset(filenames, n_readers=5, batch_size=32, n_parse_threads=4,
                       shuffle_buffer_size=10000):
    return tf.data.Dataset.list_files(filenames) \
        .repeat() \
        .interleave(
            lambda file_name: tf.data.TextLineDataset(file_name).skip(1),  # skip(1) drops the header line
            cycle_length=n_readers
        ) \
        .shuffle(shuffle_buffer_size) \
        .map(map_func=parse_csv_line, num_parallel_calls=n_parse_threads) \
        .batch(batch_size)
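A hypothetical usage sketch (the train_*.csv pattern is an assumption; any CSV files with a one-line header, eight feature columns, and one label column fit parse_csv_line's defaults):

train_set = csv_reader_dataset("train_*.csv", batch_size=32)
for x_batch, y_batch in train_set.take(2):  # repeat() makes the dataset endless, so take() is needed
    print(x_batch.shape, y_batch.shape)     # (32, 8) (32,)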
tf.Example
A tf.Example is a {"string": tf.train.Feature} mapping.
A Feature can be one of the following:
- tf.train.BytesList (the following types can be coerced): string, byte
- tf.train.FloatList (the following types can be coerced): float (float32), double (float64)
- tf.train.Int64List (the following types can be coerced): bool, enum, int32, uint32, int64, uint64
favorite_books = [name.encode("utf-8") for name in ["machine learning", "cc150"]]
favorite_books_bytelist = tf.train.BytesList(value=favorite_books)
hours_floatlist = tf.train.FloatList(value=[15.5, 17, 12, 11, 8])
age_int64list = tf.train.Int64List(value=[42])
feature = {
    "favorite_books": tf.train.Feature(bytes_list=favorite_books_bytelist),
    "hours": tf.train.Feature(float_list=hours_floatlist),
    "age": tf.train.Feature(int64_list=age_int64list),
}
features = tf.train.Features(feature=feature)
example_proto = tf.train.Example(features=features)
Judging from the code, an Example contains one Features, and a Features contains one Dict<string, Feature>. Below is the content of one Example:
feature {
  key: "age"
  value {
    int64_list {
      value: 42
    }
  }
}
feature {
  key: "favorite_books"
  value {
    bytes_list {
      value: "machine learning"
      value: "cc150"
    }
  }
}
feature {
  key: "hours"
  value {
    float_list {
      value: 15.5
      value: 17.0
      value: 12.0
      value: 11.0
      value: 8.0
    }
  }
}
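For what it's worth, the dump above is what printing the message yields (assuming the variables from the earlier block):

print(features)       # the Features message shown above
print(example_proto)  # the same content wrapped in an outer features { ... } block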
Saving TFRecord files
serialized_example = example_proto.SerializeToString()
options = tf.io.TFRecordOptions(compression_type="GZIP")  # compression options (optional)
with tf.io.TFRecordWriter(file_name_full_path, options=options) as writer:  # options may be omitted
    writer.write(serialized_example)
    writer.write(another_example)  # a file can hold several Examples
TFRecordDataset
expected_features = {
    "favorite_books": tf.io.VarLenFeature(dtype=tf.string),    # variable-length feature (string)
    "hours": tf.io.VarLenFeature(dtype=tf.float32),            # variable-length feature (float)
    "age": tf.io.FixedLenFeature(shape=[], dtype=tf.int64),    # fixed-length feature (int)
}
dataset = tf.data.TFRecordDataset([file_name_full_path], compression_type="GZIP")
for serialized_example_tensor in dataset:  # each element is a serialized-string Tensor
    example = tf.io.parse_single_example(serialized_example_tensor, expected_features)
    books = tf.sparse.to_dense(example["favorite_books"], default_value=b"")
    books = [book.numpy().decode("utf-8") for book in books]
TFRecordDataset: A Dataset comprising records from one or more TFRecord files.
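As a closing sketch (assuming the same expected_features and file path as above), batching before parsing with tf.io.parse_example handles a whole batch of serialized Examples at once:

dataset = tf.data.TFRecordDataset([file_name_full_path], compression_type="GZIP").batch(32)
for serialized_batch in dataset:
    examples = tf.io.parse_example(serialized_batch, expected_features)
    print(examples["age"])  # dense Tensor of shape (batch_size,)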