Preprocessing in Data Science (Part 2): Centering, Scaling and Logistic Regression

2020. 12. 10. 08:35


We look at whether centering and scaling help a model in a logistic regression setting.

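In the previous post, we explored the role of preprocessing in a machine learning classification task using KNN and the wine dataset. There we saw that centering and scaling numerical data improved KNN's performance on several model performance measures, such as accuracy. We also saw that preprocessing does not happen in a vacuum: its value can be judged only in the context of a prediction-oriented ML pipeline. However, we looked at the importance of preprocessing only in the context of a single model, KNN. In that case the model performed considerably better, but is this always so? Not necessarily. In this post, we once again use the wine dataset to examine the role of scaling and centering numerical data in another basic model, logistic regression.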

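First, we give a brief introduction to regression, which can be used to predict numerical values as well as to classify. We look at linear regression and logistic regression, and then use logistic regression to predict the quality of red wine.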

A brief introduction to regression with Python

Linear regression in Python

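As mentioned above, regression is usually used to predict one numerical variable from another. For example, below we run a linear regression on the Boston housing data (a dataset that shipped with scikit-learn until version 1.2). In this case, the independent variable (x-axis) is the number of rooms and the dependent variable (y-axis) is the price.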

How does such a regression work? Roughly, the mechanics are as follows.

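We want to fit a model $y = ax + b$ to the data $(x_i, y_i)$. That is, given the data, we want to find the optimal a and b.
In the ordinary least squares (OLS, by far the most common) formulation, the assumption is that the error occurs in the dependent variable. For this reason, the optimal a and b are found by minimizing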

$SSE = \sum_i(y_i - (ax_i + b))^2$

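This optimization is commonly carried out with an algorithm known as gradient descent, although OLS also has a closed-form solution. Here we perform a simple linear regression on the Boston housing price data.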


# Import necessary packages
import pandas as pd
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from sklearn import datasets
from sklearn import linear_model
import numpy as np

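# Load data
# Note: load_boston was removed in scikit-learn 1.2, so this cell needs an older version
boston = datasets.load_boston()
yb = boston.target.reshape(-1, 1)
Xb = boston['data'][:,5].reshape(-1, 1)  # column 5 is RM, the average number of rooms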

# Plot data
plt.scatter(Xb,yb)
plt.ylabel('value of house /1000 ($)')
plt.xlabel('number of rooms')

# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit( Xb, yb)

# Plot outputs
plt.scatter(Xb, yb,  color='black')
plt.plot(Xb, regr.predict(Xb), color='blue',
         linewidth=3)
plt.show()

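This regression captures the general upward trend in the data, but not much more. We used only one predictor variable here, but we could use more. In that case the model has n coefficients $a_1, ..., a_n$, one for each predictor. The magnitude of the coefficient $a_i$ tells us how strongly the corresponding variable is related to the target variable (assuming the predictors are on comparable scales).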
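
To make the SSE minimization described earlier more concrete, here is a minimal sketch that fits $a$ and $b$ with plain gradient descent and compares the result with a closed-form least-squares fit. The synthetic data, the learning rate and the iteration count are assumptions chosen for illustration; none of them come from the Boston example.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y = 3x + 2 plus noise
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = 3.0 * x + 2.0 + rng.normal(scale=0.1, size=200)


def sse(a, b):
    """Sum of squared errors for the line y = a*x + b."""
    return np.sum((y - (a * x + b)) ** 2)


# Gradient descent on the mean squared error (SSE / n),
# which has the same minimizer as SSE itself
a, b = 0.0, 0.0
learning_rate = 0.1  # hypothetical value, small enough to be stable here
for _ in range(2000):
    residual = y - (a * x + b)
    grad_a = -2.0 * np.mean(residual * x)
    grad_b = -2.0 * np.mean(residual)
    a -= learning_rate * grad_a
    b -= learning_rate * grad_b

# Closed-form least-squares fit for comparison (slope first, then intercept)
a_ls, b_ls = np.polyfit(x, y, deg=1)

print(f"gradient descent: a={a:.4f}, b={b:.4f}, SSE={sse(a, b):.4f}")
print(f"least squares:    a={a_ls:.4f}, b={b_ls:.4f}, SSE={sse(a_ls, b_ls):.4f}")
```

Both approaches should land on essentially the same line, since they minimize the same objective.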

Logistic Regression in Python

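Regression can also be used for classification problems. The first natural example is logistic regression. In binary classification (two labels), we can think of the labels as 0 and 1. Again denoting the predictor variable by $x$, the logistic regression model is given by the logistic function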

$F(x) = \frac{1}{1 + e^{-(ax+b)}}$

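This is a sigmoid (S-shaped) curve, and an example is shown below. For any given $x$, if $F(x) < 0.5$ logistic regression predicts $y = 0$, and if $F(x) > 0.5$ the model predicts $y = 1$. Once again, with more than one predictor variable we have n coefficients $a_1, ..., a_n$, one for each predictor. In this case, the magnitude of $a_i$ tells us how much influence the corresponding variable has on the predicted variable.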


# Synthesize data
X1 = np.random.normal(size=150)
y1 = (X1 > 0).astype(float)  # np.float was removed from NumPy; the builtin float works
X1[X1 > 0] *= 4
X1 += .3 * np.random.normal(size=150)
X1= X1.reshape(-1, 1)

# Run the classifier
clf = linear_model.LogisticRegression()
clf.fit(X1, y1)

# Plot the result
X1_ordered = np.sort(X1, axis=0)  # sort the inputs so the fitted curve is drawn left to right
plt.scatter(X1.ravel(), y1, color='black', zorder=20 , alpha = 0.5)
plt.plot(X1_ordered, clf.predict_proba(X1_ordered)[:,1], color='blue' , linewidth = 3)
plt.ylabel('target variable')
plt.xlabel('predictor variable')
plt.show()
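
The decision rule described above can also be checked by hand. The following is a minimal sketch of the logistic function and the 0.5 threshold; the coefficients `a` and `b` are hypothetical values picked for illustration, not values fitted by the classifier above.

```python
import numpy as np


def logistic(x, a, b):
    """Logistic function F(x) = 1 / (1 + exp(-(a*x + b)))."""
    return 1.0 / (1.0 + np.exp(-(a * x + b)))


def predict_label(x, a, b):
    """Predict 1 where F(x) > 0.5 and 0 otherwise."""
    return (logistic(x, a, b) > 0.5).astype(int)


# Hypothetical coefficients; the decision boundary is where a*x + b = 0, i.e. x = 0.5
a, b = 2.0, -1.0
x = np.array([-1.0, 0.0, 1.0, 3.0])

probs = logistic(x, a, b)
labels = predict_label(x, a, b)
print(probs)
print(labels)
```

Points to the left of the boundary get a probability below 0.5 and are labeled 0, and points to the right are labeled 1.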

Logistic Regression and Data Scaling: The Wine Data Set

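Now that we have seen the mechanics of logistic regression, let's implement a logistic regression classifier on the delicious wine dataset. We import the data and plot the target variable (good/bad wine).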


# Import necessary modules
from sklearn import linear_model
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in 0.20

# Load data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable
y1 = df['quality'].values
y = y1 <= 5  # is the rating <= 5?

# plot histograms of original target variable
# and aggregated target variable
plt.figure(figsize=(20,5))
plt.subplot(1, 2, 1)
plt.hist(y1)
plt.xlabel('original target value')
plt.ylabel('count')
plt.subplot(1, 2, 2)
plt.hist(y.astype(int))  # cast booleans to 0/1 so the histogram bins are well defined
plt.xlabel('aggregated target value')
plt.show()

Now let's run logistic regression and see how it performs.


# Split the data into test and training sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

#initial logistic regression model
lr = linear_model.LogisticRegression()

# fit the model
lr = lr.fit(X_train, y_train)
print('Logistic Regression score for training set: %f' % lr.score(X_train, y_train))
from sklearn.metrics import classification_report
y_true, y_pred = y_test, lr.predict(X_test)
print(classification_report(y_true, y_pred))


Logistic Regression score for training set: 0.752932
             precision    recall  f1-score   support

      False       0.78      0.74      0.76       179
       True       0.69      0.74      0.71       141

avg / total       0.74      0.74      0.74       320

Out of the box, this logistic regression performs better than KNN. Now let's scale the data and run the logistic regression model again.


from sklearn.preprocessing import scale
Xs = scale(X)
Xs_train, Xs_test, y_train, y_test = train_test_split(Xs, y, test_size=0.2, random_state=42)
lr_2 = lr.fit(Xs_train, y_train)
print('Scaled Logistic Regression score for test set: %f' % lr_2.score(Xs_test, y_test))
y_true, y_pred = y_test, lr_2.predict(Xs_test)
print(classification_report(y_true, y_pred))


Scaled Logistic Regression score for test set: 0.740625
             precision    recall  f1-score   support

      False       0.79      0.74      0.76       179
       True       0.69      0.74      0.72       141

avg / total       0.74      0.74      0.74       320

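Scaling the data did not improve the performance of logistic regression. Why not, particularly when we saw that KNN's performance improved substantially with scaling? The reason is that if there is a predictor variable with a large range that does not affect the target variable, a regression algorithm will make the corresponding coefficient $a_i$ small so that it has little effect on the prediction. KNN has no such built-in strategy, so it very much needs the data to be scaled.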

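In the next post, we will unpack the very different effects of centering and scaling on k-NN and logistic regression by synthesizing a dataset, adding noise, and seeing how centering and scaling change the performance of both models as a function of noise strength.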

In the code below, you can scale the features by setting sc = True.


# Set sc = True to scale your features 
sc = False 

# Load data
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv', sep=';')
X = df.drop('quality', axis=1).values  # drop target variable

# Here we scale, if desired
if sc:
    X = scale(X)

# Target value 
y1 = df['quality'].values # original target variable 
y = y1 <= 5  # new target variable: is the rating <= 5? 

# Split the data into a test set and a training set 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42) 

# Train logistic regression model and print performance on the test set 
lr = linear_model.LogisticRegression() 
lr = lr.fit(X_train, y_train) 
print('Logistic Regression score for training set: %f' % lr.score(X_train, y_train)) 
y_true, y_pred = y_test, lr.predict(X_test) 
print(classification_report(y_true, y_pred))

Attribution, NonCommercial, ShareAlike
