(Guide) Data input pipeline : tf.data - part08 (Training workflow)

2020. 5. 25. 21:38

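epoch : When the full dataset is split into n batches for processing, one epoch is complete once all n batches have been processed, that is, once the entire dataset has been processed exactly one time.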
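To make the definition concrete, the short sketch below counts the batches in one epoch. The figures 628 and 128 are assumptions chosen to match the Titanic example used in this post (four batches of 128 plus one of 116); they are illustrative and not part of the definition.

```python
import math

num_examples = 628   # assumed dataset size (4 * 128 + 116)
batch_size = 128     # assumed batch size

# One epoch means every example is seen once, which takes this many batches.
batches_per_epoch = math.ceil(num_examples / batch_size)

# The final batch holds whatever is left after the full batches.
last_batch_size = num_examples - (batches_per_epoch - 1) * batch_size

print(batches_per_epoch, last_batch_size)
```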

Training workflows

Processing multiple epochs

The tf.data API offers two main ways to process multiple epochs of the same data.

The simplest way to iterate over a dataset for multiple epochs is the Dataset.repeat() transformation. First, create a dataset from the Titanic data.

import tensorflow as tf
import matplotlib.pyplot as plt

titanic_file = tf.keras.utils.get_file("train.csv", "https://storage.googleapis.com/tf-datasets/titanic/train.csv")
titanic_lines = tf.data.TextLineDataset(titanic_file)

def plot_batch_sizes(ds):
  # Collect the size of every batch, then draw one bar per batch.
  batch_sizes = [batch.shape[0] for batch in ds]
  plt.bar(range(len(batch_sizes)), batch_sizes)
  plt.xlabel('Batch number')
  plt.ylabel('Batch size')

Applying the Dataset.repeat() transformation with no arguments repeats the input indefinitely.

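The Dataset.repeat transformation concatenates the repeated copies of its input without signaling where one epoch ends and the next begins. Because of this, a Dataset.batch applied after Dataset.repeat produces batches that straddle epoch boundaries.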

titanic_batches = titanic_lines.repeat(3).batch(128)
plot_batch_sizes(titanic_batches)

If you need the epochs to be clearly separated, put Dataset.batch before Dataset.repeat.

titanic_batches = titanic_lines.batch(128).repeat(3)
plot_batch_sizes(titanic_batches)
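The difference between the two orderings can be worked out with plain arithmetic. The sketch below is a pure-Python stand-in, not tf.data itself; it assumes the file has 628 lines (consistent with four batches of 128 plus one of 116), and `batch_sizes` is a hypothetical helper introduced only for illustration.

```python
def batch_sizes(num_elements, batch_size):
    """Sizes of the batches cut from num_elements items, keeping the final partial batch."""
    full, rest = divmod(num_elements, batch_size)
    return [batch_size] * full + ([rest] if rest else [])

NUM_LINES = 628  # assumed number of lines in the dataset
EPOCHS = 3
BATCH = 128

# repeat(3).batch(128): the three epochs are joined first, then cut into batches,
# so some batches contain lines from two different epochs.
repeat_then_batch = batch_sizes(NUM_LINES * EPOCHS, BATCH)

# batch(128).repeat(3): one epoch is cut into batches, then that pattern repeats,
# so every epoch ends with its own short batch.
batch_then_repeat = batch_sizes(NUM_LINES, BATCH) * EPOCHS

print(repeat_then_batch)
print(batch_then_repeat)
```

Under these assumptions the first ordering has a single short batch at the very end, while the second has a short batch at the end of every epoch, which is the pattern the bar charts are meant to show.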

If you want to run a custom computation at the end of each epoch (for example, to collect statistics), the simplest approach is to restart the dataset iteration on each epoch.

epochs = 3
dataset = titanic_lines.batch(128)

for epoch in range(epochs):
  for batch in dataset:
    print(batch.shape)
  print("End of epoch: ", epoch)

(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 0
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 1
(128,)
(128,)
(128,)
(128,)
(116,)
End of epoch: 2
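As a rough illustration of the "custom computation at the end of each epoch" idea, the sketch below collects simple per-epoch statistics. It uses a plain Python list in place of the dataset; the `batches` helper, the 628 stand-in rows, and the `epoch_stats` structure are all assumptions made for this example.

```python
def batches(items, batch_size):
    """Yield consecutive slices of items, keeping the final partial batch."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

lines = [f"row {i}" for i in range(628)]  # stand-in for the text lines (assumed size)
epoch_stats = []

for epoch in range(3):
    seen = 0
    num_batches = 0
    # Restart the iteration from the beginning on every epoch.
    for batch in batches(lines, 128):
        seen += len(batch)
        num_batches += 1
    # Custom end-of-epoch step: record what this pass over the data saw.
    epoch_stats.append({"epoch": epoch, "batches": num_batches, "examples": seen})

print(epoch_stats)
```

Because the inner loop restarts for every epoch, the end-of-epoch step always runs at a well-defined boundary.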

Randomly shuffling input data

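The Dataset.shuffle() transformation maintains a fixed-size buffer and picks the next element uniformly at random from that buffer.

Note : A larger buffer_size shuffles more thoroughly, but it can use a lot of memory and take significant time to fill the buffer. If that becomes a problem, consider using Dataset.interleave across files.

Add an index to the dataset so that the effect is visible.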

lines = tf.data.TextLineDataset(titanic_file)
counter = tf.data.experimental.Counter()

dataset = tf.data.Dataset.zip((counter, lines))
dataset = dataset.shuffle(buffer_size=100)
dataset = dataset.batch(20)
dataset

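tf.data.experimental.Counter(
    start=0, step=1, dtype=tf.dtypes.int64
)
Creates a Dataset that counts from start in steps of size step.

Args
- start : (Optional) The starting value of the counter. Defaults to 0.
- step : (Optional) The step size of the counter. Defaults to 1.
- dtype : (Optional) The type of the counter elements. Defaults to tf.int64.

Returns
A Dataset of scalar dtype elements.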

Because buffer_size is 100 and the batch size is 20, the first batch contains no element with an index of 120 or more.

n,line_batch = next(iter(dataset))
print(n.numpy())

[ 15 67 14 55 79 56 69 75 89 43 1 59 54 104 111 45 90 105 64 99]
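The bound on the first batch follows from how a fixed-size shuffle buffer works. The sketch below is a pure-Python model of that idea, not TensorFlow's actual implementation; `buffered_shuffle` is a hypothetical helper, and the 628-element input is an assumed dataset size.

```python
import random
from itertools import islice

def buffered_shuffle(source, buffer_size, seed=None):
    """Yield the items of source in shuffled order using a fixed-size buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    it = iter(source)
    buffer = []
    # Fill the buffer with the first buffer_size items.
    for item in it:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a uniformly chosen buffered item, then refill that slot from the source.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # The source is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

# The first 20 outputs can only come from the first 100 + 19 inputs.
first_batch = list(islice(buffered_shuffle(range(628), buffer_size=100, seed=0), 20))
print(max(first_batch))
```

In this model the k-th output (counting from zero) can never exceed index 99 + k, so a first batch of 20 stays below 120 regardless of the random seed.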

As with Dataset.batch, the order relative to Dataset.repeat matters.

Dataset.shuffle does not signal the end of an epoch until the shuffle buffer is empty. So a shuffle placed before a repeat shows every element of one epoch before moving on to the next.

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.shuffle(buffer_size=100).batch(10).repeat(2)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(60).take(5):
  print(n.numpy())

Here are the item ID's near the epoch boundary:

[390 454 593 603 533 612 509 379 579 471]
[599 584 338 517 613 438 464 484 615 607]
[481 559 621 526 622 510 545 406]
[ 83 96 17 63 70 98 30 102 48 92]
[88 40 9 5 23 65 71 85 47 37]

shuffle_repeat = [n.numpy().mean() for n, line_batch in shuffled]
plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.ylabel("Mean item ID")
plt.legend()

A repeat placed before a shuffle, however, mixes the epoch boundaries together.

dataset = tf.data.Dataset.zip((counter, lines))
shuffled = dataset.repeat(2).shuffle(buffer_size=100).batch(10)

print("Here are the item ID's near the epoch boundary:\n")
for n, line_batch in shuffled.skip(55).take(15):
  print(n.numpy())

Here are the item ID's near the epoch boundary:

[566 526 448 18 17 621 9 506 611 539]
[569 579 625 28 324 3 27 588 627 622]
[609 576 20 487 37 605 544 548 334 485]
[578 459 558 623 50 7 15 604 5 610]
[ 48 572 603 602 40 45 21 16 42 612]
[ 65 8 552 598 26 488 592 59 471 608]
[455 601 73 497 553 57 25 590 556 395]
[597 63 535 380 22 574 96 584 6 562]
[ 77 617 75 11 47 2 52 66 263 538]
[ 89 600 529 87 104 10 32 112 613 41]
[ 54 113 0 122 83 571 589 582 70 79]
[ 95 336 36 513 51 82 35 68 106 100]
[105 118 595 71 528 141 124 99 101 142]
[ 98 13 23 153 110 14 117 133 392 145]
[ 30 159 69 31 34 19 152 86 84 80]

repeat_shuffle = [n.numpy().mean() for n, line_batch in shuffled]

plt.plot(shuffle_repeat, label="shuffle().repeat()")
plt.plot(repeat_shuffle, label="repeat().shuffle()")
plt.ylabel("Mean item ID")
plt.legend()

As the figure shows, with repeat().shuffle() the boundary between epochs is no longer clearly defined.
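The two orderings can also be compared without TensorFlow. The sketch below is a pure-Python model under stated assumptions: `buffered_shuffle` is a hypothetical stand-in for a fixed-size shuffle buffer, 628 is an assumed dataset size, and the check simply asks whether the first 628 outputs contain every item ID exactly once.

```python
import random

def buffered_shuffle(source, buffer_size, rng):
    """Yield the items of source in shuffled order using a fixed-size buffer."""
    it = iter(source)
    buffer = []
    # Fill the buffer with the first buffer_size items.
    for item in it:
        buffer.append(item)
        if len(buffer) == buffer_size:
            break
    # Emit a uniformly chosen buffered item, then refill that slot from the source.
    for item in it:
        idx = rng.randrange(len(buffer))
        yield buffer[idx]
        buffer[idx] = item
    # The source is exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

N, EPOCHS, BUFFER = 628, 2, 100   # assumed sizes
rng = random.Random(0)

# shuffle then repeat: each epoch is shuffled and fully drained on its own.
shuffle_repeat = [x for _ in range(EPOCHS)
                  for x in buffered_shuffle(range(N), BUFFER, rng)]

# repeat then shuffle: the epochs are joined first, so the buffer can hold
# items from two epochs at once.
repeated = (i for _ in range(EPOCHS) for i in range(N))
repeat_shuffle = list(buffered_shuffle(repeated, BUFFER, rng))

def first_epoch_is_complete(ids):
    """True if the first N outputs contain every item ID exactly once."""
    return sorted(ids[:N]) == list(range(N))

print(first_epoch_is_complete(shuffle_repeat))
print(first_epoch_is_complete(repeat_shuffle))
```

In this model the first ordering always yields a complete epoch before the next one starts, whereas the second almost certainly emits some second-epoch items before the first epoch has finished.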
