31_Pima Indians Diabetes Prediction

chuuvelop 2025. 4. 16. 16:50

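Pima Indians Diabetes Dataset

Purpose of the data: build a machine learning model that predicts whether a patient has diabetes
Dataset overview
- Type 2 diabetes outcome data for the indigenous people of the Pima region in North America
- The commonly cited causes of diabetes are diet and heredity
  - The Pima lived in an isolated area, so their native lineage was preserved
  - A westernized diet in the late 20th century led to many diabetes cases
- Features
  - Pregnancies : number of pregnancies
  - Glucose : glucose tolerance test result
  - BloodPressure : blood pressure (mm Hg)
  - SkinThickness : triceps skinfold thickness (mm)
  - Insulin : serum insulin (mu U/ml)
  - BMI : body mass index (weight (kg) / height (m)^2)
  - DiabetesPedigreeFunction : diabetes pedigree weight value
  - Age : age
  - Outcome : class label (0 or 1)

※ The data cover women aged 21 and over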

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score, roc_auc_score,
f1_score, confusion_matrix, precision_recall_curve, roc_curve)
from sklearn.preprocessing import StandardScaler, Binarizer
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings("ignore")

df = pd.read_csv("./data/diabetes.csv")

df.head()

df["Outcome"].value_counts()

Outcome
0    500
1    268
Name: count, dtype: int64

Class 0 samples outnumber class 1 samples (500 vs. 268)
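
To make the imbalance concrete, the counts can be turned into proportions. The following is a minimal sketch that rebuilds the label column from the reported counts (500 and 268) instead of reading the CSV; the variable names `outcome`, `class_ratio`, and `majority_baseline` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Rebuild the Outcome column from the counts reported above (500 zeros, 268 ones)
outcome = pd.Series([0] * 500 + [1] * 268, name="Outcome")

# Share of each class in the data
class_ratio = outcome.value_counts(normalize=True)
print(class_ratio.round(4))

# Accuracy of a trivial model that always predicts the majority class (0)
majority_baseline = class_ratio.max()
print(f"Majority-class baseline accuracy: {majority_baseline:.4f}")
```

Any accuracy figure reported later is easier to judge against this majority-class baseline.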

df.shape

(768, 9)

df.dtypes

Pregnancies                   int64
Glucose                       int64
BloodPressure                 int64
SkinThickness                 int64
Insulin                       int64
BMI                         float64
DiabetesPedigreeFunction    float64
Age                           int64
Outcome                       int64
dtype: object

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Pregnancies               768 non-null    int64  
 1   Glucose                   768 non-null    int64  
 2   BloodPressure             768 non-null    int64  
 3   SkinThickness             768 non-null    int64  
 4   Insulin                   768 non-null    int64  
 5   BMI                       768 non-null    float64
 6   DiabetesPedigreeFunction  768 non-null    float64
 7   Age                       768 non-null    int64  
 8   Outcome                   768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB

There are no missing values, and every feature is numeric

# Model evaluation helper
def get_clf_eval(y_test, pred=None, pred_proba=None):
    confusion = confusion_matrix(y_test, pred)
    accuracy = accuracy_score(y_test, pred)
    precision = precision_score(y_test, pred)
    recall = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)

    # ROC-AUC is computed from the predicted probabilities, not the hard labels
    roc_auc = roc_auc_score(y_test, pred_proba)

    print("Confusion matrix")
    print(confusion)
    print(f"Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, "
          f"F1: {f1:.4f}, AUC: {roc_auc:.4f}")

def precision_recall_curve_plot(y_test, pred_proba_c1):
    # Extract the threshold ndarray and the precision/recall ndarrays for each threshold
    precisions, recalls, thresholds = precision_recall_curve(y_test, pred_proba_c1)

    # Plot thresholds on the X axis and precision/recall on the Y axis
    # Precision is drawn as a dashed line
    plt.figure(figsize = (8, 6))
    threshold_boundary = thresholds.shape[0]
    plt.plot(thresholds, precisions[:threshold_boundary], linestyle = "--", label = "precision")
    plt.plot(thresholds, recalls[:threshold_boundary], label = "recall")

    # Use 0.1 steps for the threshold ticks on the X axis
    start, end = plt.xlim()
    plt.xticks(np.round(np.arange(start, end, 0.1), 2))

    # Set the axis labels, legend, and grid
    plt.xlabel("Threshold")
    plt.ylabel("Precision & Recall")
    plt.legend()
    plt.grid()
    plt.show()

Baseline model

df.head()

x = df.iloc[:, :-1]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y, random_state = 26)

# Train, predict, and evaluate with logistic regression
lr_clf1 = LogisticRegression()
lr_clf1.fit(x_train, y_train)
pred1 = lr_clf1.predict(x_test)
pred_proba1 = lr_clf1.predict_proba(x_test)[:, 1]

get_clf_eval(y_test, pred1, pred_proba1)

Confusion matrix
[[88 12]
 [22 32]]
Accuracy: 0.7792, Precision: 0.7273, Recall: 0.5926, F1: 0.6531, AUC: 0.8413

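Telling an actual diabetic patient that they do not have diabetes can have serious consequences, so the optimization focuses on recall
The negative class is also the majority in the target variable, which is another reason to focus on recall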
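
To see why recall is the metric of concern, the baseline numbers can be recomputed by hand from the confusion matrix reported above. This is a small worked example; the variable names are illustrative.

```python
import numpy as np

# Baseline confusion matrix reported above: rows = actual class, columns = predicted class
confusion = np.array([[88, 12],
                      [22, 32]])
tn, fp, fn, tp = confusion.ravel()

precision = tp / (tp + fp)   # 32 / 44: share of predicted positives that are real
recall = tp / (tp + fn)      # 32 / 54: share of real positives that were found

print(f"precision={precision:.4f}, recall={recall:.4f}, missed patients={fn}")
# prints: precision=0.7273, recall=0.5926, missed patients=22
```

The 22 false negatives are diabetic patients that the baseline model classifies as healthy, which is the error the rest of the post tries to reduce.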

Precision and recall appear to balance at a threshold of about 0.4
Further data preprocessing is needed
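
The balance point between precision and recall can also be located numerically instead of being read off a plot. The sketch below uses synthetic stand-in data from `make_classification` (an assumption; it is not the diabetes CSV), so the threshold it prints will differ from the value observed in this post. All `*_syn` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 768 rows, 8 features, roughly 35% positives
X_syn, y_syn = make_classification(n_samples=768, n_features=8,
                                   weights=[0.65, 0.35], random_state=26)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2,
                                          stratify=y_syn, random_state=26)

clf_syn = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba_c1 = clf_syn.predict_proba(X_te)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_te, proba_c1)

# precisions and recalls have one more element than thresholds, so drop the last point
gap = np.abs(precisions[:-1] - recalls[:-1])
balance_threshold = thresholds[int(np.argmin(gap))]
print(f"precision and recall are closest near threshold {balance_threshold:.2f}")
```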

Data preprocessing

df.describe()

Some features have a minimum value of 0
- A value of 0 for glucose, blood pressure, skinfold thickness, insulin, or BMI is not plausible

plt.hist(df["Glucose"], bins = 100)
plt.show()

There are 5 records where Glucose is 0

# The data look roughly normally distributed → use logistic regression

df.columns

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')

# Features to check for zero values
zero_features = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Total number of records
total_count = df["Glucose"].count()

# For each feature, count the records whose value is 0 and compute their share
for feature in zero_features:
    zero_count = df.loc[df[feature] == 0, feature].count()
    print(f"{feature}: zero count = {zero_count}, share = {100 * zero_count/total_count:.2f}%")

Glucose: zero count = 5, share = 0.65%
BloodPressure: zero count = 35, share = 4.56%
SkinThickness: zero count = 227, share = 29.56%
Insulin: zero count = 374, share = 48.70%
BMI: zero count = 11, share = 1.43%

SkinThickness and Insulin have a very high share of zeros, so dropping those rows wholesale could hurt model training
- Replace the implausible zero values with the feature mean
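
One variation on mean replacement, which is not what this post does, is to treat 0 as missing so that the zeros do not pull the mean down, and to compute the mean from the training split only so that no test information leaks into preprocessing. The following is a minimal sketch on made-up toy values; the frames `train_toy` and `test_toy` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for a train/test split (values are made up)
train_toy = pd.DataFrame({"Glucose": [148, 0, 183, 89], "Insulin": [0, 0, 94, 168]})
test_toy = pd.DataFrame({"Glucose": [0, 116], "Insulin": [88, 0]})
zero_cols = ["Glucose", "Insulin"]

# Treat 0 as missing, then average only the remaining training values
train_means = train_toy[zero_cols].replace(0, np.nan).mean()

# Fill both splits with the means learned from the training split
train_filled = train_toy[zero_cols].replace(0, np.nan).fillna(train_means)
test_filled = test_toy[zero_cols].replace(0, np.nan).fillna(train_means)

print(train_means)
print(test_filled)
```

Whether this changes the final metrics on the real data is not something this sketch shows; it only illustrates the mechanics.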

# Replace the 0 values of each feature in zero_features with that feature's mean
mean_zero_features = df[zero_features].mean()
mean_zero_features

Glucose          120.894531
BloodPressure     69.105469
SkinThickness     20.536458
Insulin           79.799479
BMI               31.992578
dtype: float64

df[zero_features] = df[zero_features].replace(0, mean_zero_features)

df.head()

Testing model training approaches

x = df.iloc[:, :-1]
y = df.iloc[:, -1]

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, stratify = y,
random_state = 26)

# Apply StandardScaler (fit on the training split, then transform the test split)
ss = StandardScaler()
scaled_train = ss.fit_transform(x_train)
scaled_test = ss.transform(x_test)

lr_clf2 = LogisticRegression()
lr_clf2.fit(scaled_train, y_train)
pred2 = lr_clf2.predict(scaled_test)
pred_proba2 = lr_clf2.predict_proba(scaled_test)[:, 1]

get_clf_eval(y_test, pred2, pred_proba2)

Confusion matrix
[[87 13]
 [22 32]]
Accuracy: 0.7727, Precision: 0.7111, Recall: 0.5926, F1: 0.6465, AUC: 0.8424

lr_clf2.score(scaled_train, y_train)

0.7768729641693811

Hyperparameter tuning

logi = LogisticRegression(random_state = 26)

param = {"penalty" : ["l1", "l2", "elasticnet", None],
"C" : [0.01, 0.1, 1, 10, 100],
"solver" : ["lbfgs", "liblinear", "newton-cg", "newton-cholesky", "sag", "saga"]}

gs = GridSearchCV(logi, param, scoring = "roc_auc", n_jobs = -1) # cross-validate and treat the model with the highest AUC as the best one

gs.fit(scaled_train, y_train)

gs.best_params_

{'C': 1, 'penalty': 'l2', 'solver': 'saga'}
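
The grid above crosses every penalty with every solver, but not every pair is supported by `LogisticRegression` (for example, `lbfgs` does not support `l1`). With the default `error_score`, `GridSearchCV` scores those failed fits as NaN and moves on, and the warnings are suppressed in this notebook. A sketch of an alternative is to pass a list of grids that contains only supported pairs. It runs on synthetic data from `make_classification` (an assumption), covers only `l1` and `l2`, and uses hypothetical names such as `search` and `param_grid`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption), not the diabetes CSV
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=26)

C_values = [0.01, 0.1, 1, 10, 100]

# Each dict is its own sub-grid, so only supported penalty/solver pairs are tried
param_grid = [
    {"penalty": ["l2"], "C": C_values,
     "solver": ["lbfgs", "liblinear", "newton-cg", "sag", "saga"]},
    {"penalty": ["l1"], "C": C_values,
     "solver": ["liblinear", "saga"]},
]

search = GridSearchCV(LogisticRegression(max_iter=5000), param_grid,
                      scoring="roc_auc", cv=5)
search.fit(X_syn, y_syn)
print(search.best_params_)
```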

lr_clf3 = gs.best_estimator_

pred3 = lr_clf3.predict(scaled_test)
pred_proba3 = lr_clf3.predict_proba(scaled_test)[:, 1]

get_clf_eval(y_test, pred3, pred_proba3)

Confusion matrix
[[87 13]
 [22 32]]
Accuracy: 0.7727, Precision: 0.7111, Recall: 0.5926, F1: 0.6465, AUC: 0.8426

Model performance comparison

fpr1, tpr1, threshold1 = roc_curve(y_test, pred_proba1)
fpr2, tpr2, threshold2 = roc_curve(y_test, pred_proba2)
fpr3, tpr3, threshold3 = roc_curve(y_test, pred_proba3)

plt.figure(figsize = (10, 10))

plt.plot([0, 1], [0, 1], label = "random")
plt.plot(fpr1, tpr1, label = "baseline")
plt.plot(fpr2, tpr2, label = "standard_scaler")
plt.plot(fpr3, tpr3, label = "hyperparam")

plt.xlabel("FPR")
plt.ylabel("TPR")
plt.xlim(0, 1)
plt.ylim(0, 1)
plt.legend()
plt.show()

def get_eval_by_threshold(y_test, pred_proba_c1, thresholds):
    # Evaluate each value in thresholds in turn
    # pred_proba_c1 must be 2-D (n_samples, 1) because Binarizer expects 2-D input
    for custom_threshold in thresholds:
        binarizer = Binarizer(threshold = custom_threshold)
        binarizer.fit(pred_proba_c1)
        custom_predict = binarizer.transform(pred_proba_c1)

        print("Threshold:", custom_threshold)
        get_clf_eval(y_test, custom_predict, pred_proba_c1)
        print("-" * 80)

pred_proba1.shape

(154,)

binarizer = Binarizer(threshold = 0.3)
binarizer.fit(pred_proba1.reshape(-1, 1))

pred_proba1[:10]

array([0.95983915, 0.80271628, 0.22405142, 0.39843103, 0.10276872,
       0.13580544, 0.11813209, 0.58030832, 0.96204124, 0.18723707])

# Binarizer converts values to binary data (0 or 1) according to the threshold
binarizer.transform(pred_proba1[:10].reshape(-1, 1)).flatten()

array([1., 1., 0., 1., 0., 0., 0., 1., 1., 0.])
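
What `Binarizer` does here is equivalent to a plain comparison: values strictly greater than the threshold become 1 and the rest become 0. The sketch below checks this on the ten probabilities printed above; the names `by_binarizer` and `by_comparison` are illustrative.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# First ten class-1 probabilities printed above
proba = np.array([0.95983915, 0.80271628, 0.22405142, 0.39843103, 0.10276872,
                  0.13580544, 0.11813209, 0.58030832, 0.96204124, 0.18723707])
threshold = 0.3

# Binarizer expects 2-D input, so reshape to one column and flatten the result
by_binarizer = Binarizer(threshold=threshold).fit_transform(proba.reshape(-1, 1)).ravel()

# The same decision rule written as a NumPy comparison
by_comparison = (proba > threshold).astype(float)

print(by_binarizer)
print(np.array_equal(by_binarizer, by_comparison))
```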

thresholds = np.arange(0.3, 0.51, 0.04)
thresholds

array([0.3 , 0.34, 0.38, 0.42, 0.46, 0.5 ])
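
The array display rounds the values, but `np.arange` with a float step accumulates small floating-point errors, which is why a threshold can print as 0.33999999999999997 when it is shown at full precision. A minimal sketch of one way to get clean values is to round after generating them; `raw` and `clean_thresholds` are illustrative names.

```python
import numpy as np

# Same range as above; the elements carry tiny floating-point errors
raw = np.arange(0.3, 0.51, 0.04)

# Rounding to two decimals gives the intended values
clean_thresholds = np.round(raw, 2)

for t in clean_thresholds:
    print(f"Threshold: {t}")
```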

get_eval_by_threshold(y_test, pred_proba3.reshape(-1, 1), thresholds)

Threshold: 0.3
Confusion matrix
[[72 28]
 [10 44]]
Accuracy: 0.7532, Precision: 0.6111, Recall: 0.8148, F1: 0.6984, AUC: 0.8426
--------------------------------------------------------------------------------
Threshold: 0.33999999999999997
Confusion matrix
[[81 19]
 [10 44]]
Accuracy: 0.8117, Precision: 0.6984, Recall: 0.8148, F1: 0.7521, AUC: 0.8426
--------------------------------------------------------------------------------
Threshold: 0.37999999999999995
Confusion matrix
[[83 17]
 [13 41]]
Accuracy: 0.8052, Precision: 0.7069, Recall: 0.7593, F1: 0.7321, AUC: 0.8426
--------------------------------------------------------------------------------
Threshold: 0.41999999999999993
Confusion matrix
[[85 15]
 [18 36]]
Accuracy: 0.7857, Precision: 0.7059, Recall: 0.6667, F1: 0.6857, AUC: 0.8426
--------------------------------------------------------------------------------
Threshold: 0.4599999999999999
Confusion matrix
[[87 13]
 [22 32]]
Accuracy: 0.7727, Precision: 0.7111, Recall: 0.5926, F1: 0.6465, AUC: 0.8426
--------------------------------------------------------------------------------
Threshold: 0.4999999999999999
Confusion matrix
[[87 13]
 [22 32]]
Accuracy: 0.7727, Precision: 0.7111, Recall: 0.5926, F1: 0.6465, AUC: 0.8426
--------------------------------------------------------------------------------

# Final prediction
# Create a Binarizer with the threshold set to 0.34
binarizer = Binarizer(threshold = 0.34)

pred_th = binarizer.fit_transform(pred_proba3.reshape(-1, 1))

get_clf_eval(y_test, pred_th, pred_proba3)

Confusion matrix
[[81 19]
 [10 44]]
Accuracy: 0.8117, Precision: 0.6984, Recall: 0.8148, F1: 0.7521, AUC: 0.8426
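
The choice of 0.34 can be checked by recomputing precision, recall, and F1 from the six confusion matrices reported in the threshold sweep. This is a small worked example; the dictionary `results` simply copies those matrices, and the variable names are illustrative.

```python
import numpy as np

# Confusion matrices from the threshold sweep above: [[TN, FP], [FN, TP]]
results = {
    0.30: [[72, 28], [10, 44]],
    0.34: [[81, 19], [10, 44]],
    0.38: [[83, 17], [13, 41]],
    0.42: [[85, 15], [18, 36]],
    0.46: [[87, 13], [22, 32]],
    0.50: [[87, 13], [22, 32]],
}

best_threshold, best_f1 = None, -1.0
for threshold, matrix in results.items():
    tn, fp, fn, tp = np.array(matrix).ravel()
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"{threshold:.2f}: precision={precision:.4f}, recall={recall:.4f}, F1={f1:.4f}")
    if f1 > best_f1:
        best_threshold, best_f1 = threshold, f1

print(f"Best F1 at threshold {best_threshold:.2f}")
```

Thresholds 0.30 and 0.34 share the highest recall (0.8148), and 0.34 has the better precision and F1 of the two. Because the threshold is picked on the test split here, the resulting scores are likely somewhat optimistic; tuning it on a separate validation split would be a stricter variation.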
