(Guide) Data input pipeline : tf.data - part03 (Reading input data)

2020. 5. 24. 16:20

Reading input data

Consuming NumPy arrays (creating a Dataset from NumPy arrays)

If all of the input data fits in memory, the simplest way to create a Dataset from it is to convert it to tf.Tensor objects and use Dataset.from_tensor_slices().

import tensorflow as tf

train, test = tf.keras.datasets.fashion_mnist.load_data()

>>> The dataset above is downloaded as NumPy arrays.

Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-labels-idx1-ubyte.gz
32768/29515 [=================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/train-images-idx3-ubyte.gz
26427392/26421880 [==============================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-labels-idx1-ubyte.gz
8192/5148 [===============================================] - 0s 0us/step
Downloading data from https://storage.googleapis.com/tensorflow/tf-keras-datasets/t10k-images-idx3-ubyte.gz
4423680/4422102 [==============================] - 0s 0us/step

images, labels = train
images = images/255

dataset = tf.data.Dataset.from_tensor_slices((images, labels))
dataset

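Note: the code above embeds the feature and label arrays in the TensorFlow graph as tf.constant() operations. This works well for a small dataset, but it wastes memory because the contents of a constant array are copied several times during execution. It can also run into the 2GB limit of the tf.GraphDef protocol buffer.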
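To make the slicing behaviour concrete, here is a minimal sketch that uses tiny made-up arrays instead of Fashion-MNIST. The arrays `toy_images` and `toy_labels` are hypothetical stand-ins and are not part of the original example. It shows that from_tensor_slices yields one element per slice along the first axis.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in data: 6 tiny 2x2 "images" and 6 labels.
toy_images = np.arange(24, dtype=np.float32).reshape(6, 2, 2) / 255
toy_labels = np.arange(6, dtype=np.uint8)

toy_dataset = tf.data.Dataset.from_tensor_slices((toy_images, toy_labels))

# Each element is one (image, label) pair sliced along axis 0.
print(toy_dataset.element_spec)

num_elements = sum(1 for _ in toy_dataset)
first_image, first_label = next(iter(toy_dataset))
print(num_elements, first_image.shape, int(first_label))
```

The leading dimension (6 here) disappears from the element shape and becomes the number of elements in the dataset.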

Consuming Python generators (creating a Dataset from a Python generator)

Another common way to create a tf.data.Dataset is to use a Python generator as the data source.

Caution: this approach is convenient, but it has limited portability and scalability. The generator must run in the same Python process that created it, and it is still subject to the Python GIL.

def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

for n in count(5):
    print(n)

0
1
2
3
4

The Dataset.from_generator constructor converts a Python generator into a tf.data.Dataset.

@staticmethod
from_generator(
    generator, output_types, output_shapes=None, args=None
)

Args

generator: A callable object that returns an object supporting the iter() protocol. If args is not specified, generator must take no arguments. Otherwise it must take as many arguments as there are values in args.

output_types: A nested structure of tf.DType objects corresponding to each component of an element yielded by generator.

output_shapes: (Optional.) A nested structure of tf.TensorShape objects corresponding to each component of an element yielded by generator.

args: (Optional.) A tuple of tf.Tensor objects that will be evaluated and passed to generator as NumPy-array arguments.

Returns

Dataset: A Dataset.

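The constructor takes a callable as input (the generator argument above), not an iterator. This lets it restart the generator when it reaches the end. It also takes an optional args argument, whose values are passed to the callable as its arguments.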
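The difference between passing a callable and passing an iterator can be seen in plain Python, without TensorFlow. This is only an illustration of the restart idea, not of how tf.data is implemented internally.

```python
def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

# An iterator (an already-started generator object) can be consumed only once.
it = count(3)
first_pass = list(it)
second_pass = list(it)   # exhausted, so nothing is left

# A callable can simply be called again to get a fresh iterator.
restarted = list(count(3))

print(first_pass, second_pass, restarted)
```

This is why operations such as repeat() can work on a generator-backed dataset: the callable is invoked again for each new pass.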

The output_types argument is required because tf.data builds a tf.Graph internally, and graph edges require a tf.dtype (for background on graph edges, see https://ratsgo.github.io/data%20structure&algorithm/2017/11/18/graph/).

ds_counter = tf.data.Dataset.from_generator(
    count, args=[25], output_types=tf.int32, output_shapes=())

for count_batch in ds_counter.repeat().batch(10).take(10):
    print(count_batch.numpy())

[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
[0 1 2 3 4 5 6 7 8 9]
[10 11 12 13 14 15 16 17 18 19]
[20 21 22 23 24 0 1 2 3 4]
[ 5 6 7 8 9 10 11 12 13 14]
[15 16 17 18 19 20 21 22 23 24]
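Newer TensorFlow releases (2.4 onward, to my knowledge) deprecate output_types and output_shapes in favour of a single output_signature argument built from tf.TensorSpec objects. Below is a minimal sketch of the same counter dataset written that way, assuming such a version is installed. The name `ds_counter_sig` is only used for this illustration.

```python
import tensorflow as tf

def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

# output_signature carries both the dtype and the shape of each element.
ds_counter_sig = tf.data.Dataset.from_generator(
    count,
    args=[5],
    output_signature=tf.TensorSpec(shape=(), dtype=tf.int32),
)

values = [int(x) for x in ds_counter_sig]
print(values)
```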

The output_shapes argument is optional, but it is strongly recommended because many TensorFlow operations do not support tensors of unknown rank. If the length of a particular axis is unknown or variable, set that axis to None in output_shapes.
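As a rough check of what "unknown rank" means here, the sketch below builds the same generator dataset with and without output_shapes and inspects element_spec. It assumes a TensorFlow 2.x version that still accepts the output_types / output_shapes arguments.

```python
import tensorflow as tf

def count(stop):
    i = 0
    while i < stop:
        yield i
        i += 1

# Without output_shapes, tf.data knows the dtype but nothing about the shape.
ds_unknown = tf.data.Dataset.from_generator(
    count, args=[3], output_types=tf.int32)

# With output_shapes=(), each element is declared to be a scalar (rank 0).
ds_known = tf.data.Dataset.from_generator(
    count, args=[3], output_types=tf.int32, output_shapes=())

unknown_rank = ds_unknown.element_spec.shape.rank   # None means unknown rank
known_rank = ds_known.element_spec.shape.rank
print(unknown_rank, known_rank)
```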

In the next example the generator yields tuples of arrays, where the second array is a vector of unknown length.

import numpy as np

def gen_series():
    i = 0
    while True:
        # Random length between 0 and 9 for each series.
        size = np.random.randint(0, 10)
        yield i, np.random.normal(size=(size,))
        i += 1

for i, series in gen_series():
    print(i, ":", str(series))
    if i > 5:
        break

0 : [-1.3464]
1 : [1.1378 1.4187]
2 : []
3 : [0.5072]
4 : []
5 : [ 0.9612 -0.9169 -1.3101 -0.6701 -0.1871 -1.3154]
6 : [ 0.0797 -1.3373 -1.1804 1.2345 -1.1356 0.1872 -1.0534 1.6253]

The first item of the yielded tuple is an int32 and the second is a float32. The first is a scalar with shape (), and the second is a vector of unknown length with shape (None,).

ds_series = tf.data.Dataset.from_generator(
    gen_series,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))
ds_series

It can now be used like any other tf.data.Dataset. Note that when batching a dataset whose elements have a variable shape, you need to use Dataset.padded_batch, which pads the elements to a common length.

ds_series_batch = ds_series.shuffle(20).padded_batch(10)

ids, sequence_batch = next(iter(ds_series_batch))
print(ids.numpy())
print()
print(sequence_batch.numpy())

[15 20 12 22 7 13 1 25 6 19]

[[ 0.      0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.3469  0.4085  0.      0.      0.      0.      0.      0.      0.    ]
 [-1.0471  0.9392  0.5535  0.0488  0.7002 -0.0405  0.765  -1.9565  0.8094]
 [ 0.9449  0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 1.2106  0.5748  0.2427  1.8562  0.      0.      0.      0.      0.    ]
 [-0.7365  0.4269 -1.5557 -0.2955 -0.6937  0.4122 -0.2994 -0.2061  0.    ]
 [ 1.4302  0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.3324  0.      0.      0.      0.      0.      0.      0.      0.    ]
 [ 0.0063 -0.6145  0.6581 -1.4268  0.      0.      0.      0.      0.    ]
 [-0.4615  0.2173  0.      0.      0.      0.      0.      0.      0.    ]]

padded_batch(
    batch_size, padded_shapes=None, padding_values=None, drop_remainder=False
)

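Args

batch_size: A tf.int64 scalar tf.Tensor, representing the number of consecutive elements of the dataset to combine into a single batch.

padded_shapes: (Optional.) A nested structure of tf.TensorShape objects or tf.int64 vector tensor-like objects, representing the shape to which each component of each input element should be padded before batching. Any unknown dimension is padded to the maximum size of that dimension in each batch. If unset, all dimensions of all components are padded to the maximum size in the batch. padded_shapes must be set if any component has an unknown rank.

padding_values: (Optional.) A nested structure of scalar-shaped tf.Tensor objects, representing the padding value to use for each component. None means the nested structure is padded with default values: 0 for numeric types and the empty string for string types.

drop_remainder: (Optional.) A tf.bool scalar tf.Tensor, representing whether the last batch should be dropped when it has fewer than batch_size elements. The default behaviour is not to drop the smaller batch.

Returns

Dataset: A Dataset.

Raises

ValueError: If a component has an unknown rank and the padded_shapes argument is not set.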

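In the example above, the second component of the tuple has a known rank (it is a vector) and only its length is unknown, so padded_shapes can be left unset: each batch is padded to the longest sequence it contains, which is why the rows above have 9 columns. To pad every batch to a fixed length of 10 instead, pass padded_shapes=((), (10,)).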
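The effect of padded_shapes, padding_values and drop_remainder is easier to see with a deterministic generator. The sketch below uses a hypothetical generator `gen_fixed` (not part of the original example) whose i-th element is a vector of i ones.

```python
import numpy as np
import tensorflow as tf

def gen_fixed():
    # Hypothetical generator: element i is (i, vector of i ones).
    for i in range(4):
        yield i, np.ones(i, dtype=np.float32)

ds = tf.data.Dataset.from_generator(
    gen_fixed,
    output_types=(tf.int32, tf.float32),
    output_shapes=((), (None,)))

# Default: pad to the longest sequence in the batch (length 3 here).
_, default_seqs = next(iter(ds.padded_batch(4)))

# Explicit: pad every sequence to length 10, filling with -1 instead of 0.
_, fixed_seqs = next(iter(ds.padded_batch(
    4,
    padded_shapes=((), (10,)),
    padding_values=(tf.constant(0, tf.int32), tf.constant(-1.0, tf.float32)))))

# drop_remainder=True discards the final, smaller batch (4 elements, batch size 3).
num_batches = sum(1 for _ in ds.padded_batch(3, drop_remainder=True))

print(default_seqs.shape, fixed_seqs.shape, num_batches)
```

Padding to the per-batch maximum keeps batches small, while a fixed padded_shapes gives every batch the same static shape, which some models require.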

For a more realistic example, try wrapping preprocessing.image.ImageDataGenerator as a tf.data.Dataset.

1. Download the data.

flowers = tf.keras.utils.get_file(
    'flower_photos',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz',
    untar=True)

Downloading data from https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz
228818944/228813984 [==============================] - 6s 0us/step

2. Create the image.ImageDataGenerator.

img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1./255, rotation_range=20)

images, labels = next(img_gen.flow_from_directory(flowers))

Found 3670 images belonging to 5 classes.

print(images.dtype, images.shape)
print(labels.dtype, labels.shape)

float32 (32, 256, 256, 3)

float32 (32, 5)

3. Create the Dataset.

ds = tf.data.Dataset.from_generator(
    img_gen.flow_from_directory, args=[flowers],
    output_types=(tf.float32, tf.float32),
    output_shapes=([32,256,256,3], [32,5])
)
ds
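One caveat, stated as an assumption about how the Keras iterator batches: 3670 images are not divisible by 32, so the last batch of each pass probably holds only 22 images and would not match a fixed batch dimension of 32. A common way around this is to declare the batch axis as None. The sketch below demonstrates the idea with a hypothetical generator `batch_gen` that yields small synthetic batches instead of reading the flower photos.

```python
import numpy as np
import tensorflow as tf

def batch_gen():
    # Hypothetical stand-in for flow_from_directory:
    # two full batches of 4 and a smaller final batch of 2.
    for n in (4, 4, 2):
        images = np.zeros((n, 8, 8, 3), dtype=np.float32)
        labels = np.zeros((n, 5), dtype=np.float32)
        yield images, labels

# None on the first axis lets batches of any size pass through.
ds_batches = tf.data.Dataset.from_generator(
    batch_gen,
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, 8, 8, 3], [None, 5]))

batch_sizes = [int(images.shape[0]) for images, labels in ds_batches]
print(batch_sizes)
```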
