🦕 Let's Become a Dinosaur!

House Price Prediction Analysis...1

Kirok Kim 2022. 2. 3. 20:34
Nominal data conversion and the heatmap will be covered in detail in part 3.
Practice notebook (work in progress)
!pip install pandas
!pip install numpy
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn

import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
# 1. train.csv : training data
# id : unique id of each row
# OverallQual : overall material and finish quality
# YearBuilt : year of construction
# YearRemodAdd : year of remodeling
# ExterQual : exterior material quality
# BsmtQual : basement height
# TotalBsmtSF : basement area
# 1stFlrSF : first-floor area
# GrLivArea : above-ground living area
# FullBath : number of full bathrooms above ground
# KitchenQual : kitchen quality
# GarageYrBlt : year the garage was built
# GarageCars : garage capacity in cars
# GarageArea : garage area
# target : house price (in dollars)

data = pd.read_csv('/content/drive/MyDrive/집값예측분석/train.csv')

data.drop('id',axis=1,inplace=True)

data

# Always check for missing values right after loading the data
def check(data):
    """Print and return the columns that contain missing values."""
    mcol = []
    for col in data.columns:
        mv = data[col].isna().sum()  # number of missing values in this column
        if mv >= 1:
            print(f'Missing: {col}')
            print(f'{mv} values')
            mcol.append([col, data[col].dtype])
    if not mcol:
        print('No missing values')
    return mcol

mcol = check(data)
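The same check can also be written without an explicit loop. The following is a minimal sketch, assuming a small made-up DataFrame (`toy`) in place of the competition data; `missing_summary` is a hypothetical helper name that does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column that has at least one missing value."""
    counts = df.isna().sum()          # missing values per column
    counts = counts[counts > 0]       # keep only columns with gaps
    dtypes = df.dtypes[counts.index].astype(str)
    return pd.DataFrame({"missing": counts, "dtype": dtypes})

# Made-up data for illustration only (not the real train.csv)
toy = pd.DataFrame({
    "Gr Liv Area": [1500, 1200, np.nan],
    "Kitchen Qual": ["Gd", None, "TA"],
    "target": [200000, 150000, 180000],
})

summary = missing_summary(toy)
print(summary)
```

An empty result means no column has missing values, which corresponds to the 'x' branch of `check`.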

data.describe()

data.info()

# Convert text (object) columns to numbers so correlation coefficients can be computed.
from sklearn.preprocessing import LabelEncoder

corr_df = data.copy()
obj_cols = corr_df.columns[corr_df.dtypes == 'O']
# LabelEncoder is refit on each column and assigns integer codes in alphabetical order of the labels
corr_df[obj_cols] = corr_df[obj_cols].astype(str).apply(LabelEncoder().fit_transform)
corr_df['Exter Qual']
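## The correlation analysis shows many negatively correlated features
I suspect this is because the sklearn encoding did not produce meaningful numbers: LabelEncoder numbers the labels alphabetically, so the codes for the quality grades do not follow the quality order.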
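To see why the encoded quality columns can correlate in unexpected directions, here is a small sketch comparing `LabelEncoder` with an explicit ordinal mapping. It assumes the quality columns use the grades Ex, Gd, TA, Fa, Po (best to worst); the `quality_order` dictionary is a hypothetical mapping chosen for illustration, not something taken from the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumption: quality grades from best to worst are Ex > Gd > TA > Fa > Po
grades = pd.Series(["Ex", "Gd", "TA", "Fa", "Po"], name="Exter Qual")

# LabelEncoder sorts the labels alphabetically, so the codes ignore quality order
label_codes = LabelEncoder().fit_transform(grades)

# Hypothetical explicit mapping: a higher number means better quality
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal_codes = grades.map(quality_order)

print(pd.DataFrame({
    "grade": grades,
    "label": label_codes,
    "ordinal": ordinal_codes,
}))
```

With the alphabetical codes, the best grade (Ex) gets the smallest number, which is one plausible reason a quality column would appear negatively correlated with price.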

 

# Correlation plot: I think this may be the most useful visualization for data analysis
plt.figure(figsize=(15,10))

heat_table = corr_df.corr()
mask = np.zeros_like(heat_table, dtype=bool)  # boolean mask; True cells are hidden
mask[np.triu_indices_from(mask)] = True
heatmap_ax = sns.heatmap(heat_table, annot=True, mask = mask, cmap='coolwarm')
heatmap_ax.set_xticklabels(heatmap_ax.get_xticklabels(), fontsize=15, rotation=45)
# Tilt the x tick labels and set the font size; the rotation angle is counterclockwise
heatmap_ax.set_yticklabels(heatmap_ax.get_yticklabels(), fontsize=15)
plt.title('correlation between features', fontsize=40)
plt.show()
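A full heatmap can be hard to read when there are many features. One complementary view is to rank the features by the strength of their correlation with the target. This is a rough sketch on synthetic data: the `toy` DataFrame, its `Age` and `noise` columns, and the coefficients used to build `target` are all made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the encoded training data (not the real train.csv)
toy = pd.DataFrame({
    "Gr Liv Area": rng.uniform(50, 250, size=200),
    "Age": rng.uniform(0, 60, size=200),
    "noise": rng.normal(size=200),
})
toy["target"] = (
    1000 * toy["Gr Liv Area"]
    - 1500 * toy["Age"]
    + rng.normal(scale=5000, size=200)
)

# Correlation of every feature with the target, strongest (by absolute value) first
target_corr = (
    toy.corr()["target"]
    .drop("target")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value keeps strongly negative features near the top instead of pushing them to the bottom of the list.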

# I still lack a general understanding of sns.heatmap; more reading is needed...

 
