(Guide) Data input pipeline : tf.data - part02 (Dataset structure)

2020. 5. 24. 14:18

Import the following packages to run the examples.

import tensorflow as tf

import pathlib
import os
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

# Set the print precision for NumPy output
np.set_printoptions(precision=4)

Data structure

A Dataset has a nested structure, and each of its components can be any type representable by tf.TypeSpec, including tf.Tensor, tf.sparse.SparseTensor, tf.RaggedTensor, tf.TensorArray, and tf.data.Dataset.

The Dataset.element_spec property describes the type of each element of the dataset.

This property returns a nested structure of tf.TypeSpec objects, each describing a TensorFlow value type, that matches the structure of the element. The element may be a single component, a tuple of components, or a nested tuple of components.

dataset1 = tf.data.Dataset.from_tensor_slices(tf.random.uniform([4, 10]))
dataset1.element_spec

TensorSpec(shape=(10,), dtype=tf.float32, name=None)

dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4]),
     tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))
dataset2.element_spec

(TensorSpec(shape=(), dtype=tf.float32, name=None),
TensorSpec(shape=(100,), dtype=tf.int32, name=None))

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))
dataset3.element_spec

(TensorSpec(shape=(10,), dtype=tf.float32, name=None),
(TensorSpec(shape=(), dtype=tf.float32, name=None),
TensorSpec(shape=(100,), dtype=tf.int32, name=None)))

# Dataset containing a sparse tensor.
dataset4 = tf.data.Dataset.from_tensors(tf.SparseTensor(indices=[[0, 0], [1, 2]], values=[1, 2], dense_shape=[3, 4]))
dataset4.element_spec

SparseTensorSpec(TensorShape([3, 4]), tf.int32)

# Use value_type to see the type of value represented by the element spec
dataset4.element_spec.value_type

tensorflow.python.framework.sparse_tensor.SparseTensor
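The same inspection should work for the other component types listed above. As a minimal sketch, here is a dataset holding a tf.RaggedTensor; the name dataset_ragged and the sample values are made up for illustration and are not part of the original guide.

```python
import tensorflow as tf

# Hypothetical sample data: three rows of different lengths.
ragged = tf.ragged.constant([[1, 2, 3], [4], [5, 6]])

# from_tensors keeps the whole ragged tensor as a single element,
# mirroring the sparse tensor example above.
dataset_ragged = tf.data.Dataset.from_tensors(ragged)

spec = dataset_ragged.element_spec
print(spec)             # a RaggedTensorSpec describing the element
print(spec.value_type)  # the Python type the spec represents

# Iterating yields the ragged tensor back.
for element in dataset_ragged:
    print(element.to_list())
```

Here element_spec is a tf.RaggedTensorSpec instead of a TensorSpec, since the rows do not share one length.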

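Dataset transformations support datasets of any structure.

For example, the Dataset.map() and Dataset.filter() transformations apply a function to each element of the Dataset, and the element structure determines the arguments of that function.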

dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))

dataset2 = tf.data.Dataset.from_tensor_slices(
   (tf.random.uniform([4]),
    tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))

dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

dataset1
dataset2

dataset3

for z in dataset1:
    print(z.numpy())

[8 1 9 4 8 4 3 5 3 2]
[6 7 1 1 7 5 4 4 3 5]
[4 7 3 5 5 3 8 7 1 4]
[6 9 6 4 6 9 5 9 4 1]

dataset1_map = dataset1.map(lambda x: x + 1)
for z in dataset1_map:
    print(z.numpy())

[9 2 10 5 9 5 4 6 4 3]
[7 8 2 2 8 6 5 5 4 6]
[5 8 4 6 6 4 9 8 2 5]
[7 10 7 5 7 10 6 10 5 2]

dataset1_filter = dataset1.filter(lambda x : x<3)

<< Error >>

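Dataset.filter() requires the predicate to return a single Boolean value. In the case above, however, x in lambda x : x<3 is an entire row, so x<3 produces one Boolean per value in the row instead of a single Boolean, and an error is raised. Whether each row has to be split up and processed separately, or whether there is another way to handle this, needs further investigation.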
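As a possible answer to the open question above, here is a minimal sketch of three ways to get a scalar Boolean out of the predicate: reduce the row with tf.reduce_all, reduce it with tf.reduce_any, or split the rows into individual values with Dataset.unbatch() before filtering. The fixed sample values replace the random ones so the results are reproducible, and the names all_small, any_small and small_values are made up for this example.

```python
import tensorflow as tf

# Hypothetical fixed data (instead of tf.random.uniform) for reproducibility.
dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.constant([[8, 1, 9], [1, 2, 2], [4, 7, 3]], dtype=tf.int32))

# Option A: keep a row only if every value in it is < 3.
all_small = dataset1.filter(lambda x: tf.reduce_all(x < 3))

# Option B: keep a row if at least one value in it is < 3.
any_small = dataset1.filter(lambda x: tf.reduce_any(x < 3))

# Option C: split each row into scalar elements, then filter the scalars.
small_values = dataset1.unbatch().filter(lambda x: x < 3)

print([row.numpy().tolist() for row in all_small])   # [[1, 2, 2]]
print([row.numpy().tolist() for row in any_small])   # [[8, 1, 9], [1, 2, 2]]
print([int(v.numpy()) for v in small_values])        # [1, 1, 2, 2]
```

Which option fits depends on what "x < 3" is meant to select: whole rows (A or B) or individual values (C).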

for a, (b, c) in dataset3:
    print('shapes: {a.shape}, {b.shape}, {c.shape}'.format(a=a, b=b, c=c))

shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
shapes: (10,), (), (100,)
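To see how the element structure determines the function arguments, the sketch below applies Dataset.map() to dataset2 and dataset3. The function summarize and the names dataset2_map and dataset3_map are made up for this example. The top-level tuple of an element is unpacked into positional arguments, while a nested tuple arrives as a single tuple argument.

```python
import tensorflow as tf

dataset1 = tf.data.Dataset.from_tensor_slices(
    tf.random.uniform([4, 10], minval=1, maxval=10, dtype=tf.int32))
dataset2 = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([4]),
     tf.random.uniform([4, 100], maxval=100, dtype=tf.int32)))
dataset3 = tf.data.Dataset.zip((dataset1, dataset2))

# dataset2 elements are 2-tuples (b, c), so the function takes two arguments.
dataset2_map = dataset2.map(lambda b, c: (b * 2.0, tf.reduce_sum(c)))

# dataset3 elements are (a, (b, c)): the function still takes two arguments,
# and the second one is itself a tuple that is unpacked inside the function.
def summarize(a, bc):
    b, c = bc
    return tf.reduce_max(a), b, tf.shape(c)[0]

dataset3_map = dataset3.map(summarize)

print(dataset2_map.element_spec)
print(dataset3_map.element_spec)
```

The returned value defines the structure of the new dataset, so dataset3_map has flat 3-tuple elements even though its input was nested.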
