Outlier Analysis in R - Detecting and Removing Outliers

Hello, readers! In this article, we will take a detailed look at outlier analysis in R programming.

So, let us begin!!

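What are outliers in data?

Before diving into the concept of outliers, let us first look at the preprocessing of data values.

In data science and machine learning, preprocessing is a crucial step. By preprocessing, we mean cleaning the data of errors and noise, as far as possible, before modelling.

In the previous post, we covered missing value analysis in R programming.

Today, we move on to a more advanced topic: detecting and removing outliers in R.

An outlier, as the name suggests, is a data point that lies far away from the other points in the dataset. Because such values sit apart from the rest of the data, they distort the overall distribution of the dataset.

Outliers are usually treated as abnormal values within the distribution of the data.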

Effects of outliers on a model:

The data turns out to be skewed.
The overall statistical distribution of the data changes in terms of mean, variance, and so on.
The accuracy of the model becomes biased.
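To make these effects concrete, here is a minimal sketch on toy numbers. The values are hypothetical and are not taken from the bike rental dataset. It compares the mean, variance, and median of a small vector before and after a single extreme value is added.

```r
# Toy data (hypothetical values, not from the bike rental dataset)
clean <- c(10, 12, 11, 13, 12, 11)
with_outlier <- c(clean, 100)   # add one extreme value

# The mean and variance react strongly to the single extreme value
mean_clean <- mean(clean)
mean_outlier <- mean(with_outlier)
var_clean <- var(clean)
var_outlier <- var(with_outlier)

# The median barely moves, which is why it is considered robust
median_clean <- median(clean)
median_outlier <- median(with_outlier)

print(c(mean_clean, mean_outlier))
print(c(var_clean, var_outlier))
print(c(median_clean, median_outlier))
```

A single value is enough to pull the mean far away from where most of the data sits, while the median stays close to its original position.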

Now that we understand the impact of outliers, it is time to get to the implementation.

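Outlier Analysis - Get Set GO!

First of all, it is important to detect whether outliers are present in the dataset.

So, let us begin. We have used the Bike Rental Count Prediction dataset. You can find the dataset here!

1. Loading the dataset

First, we load the dataset into the R environment using the read.csv() function.

Before detecting outliers, we perform a missing value analysis to check whether the data contains any missing (NA) values. For this, we use sum(is.na(data)).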

# Remove all existing objects from the workspace
rm(list = ls())

# Set the working directory
setwd("D:/Ediwsor_Project - Bike_Rental_Count/")
getwd()

# Load the dataset
bike_data <- read.csv("day.csv", header = TRUE)

### Missing Value Analysis ###
sum(is.na(bike_data))       # total number of missing cells
summary(is.na(bike_data))   # per-column TRUE/FALSE counts of missing cells

# From the above result, it is clear that the dataset contains NO missing values.

The data here contains no missing values.
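Since the bike rental data happens to be complete, the check above only ever returns zero. As a small sketch of what the same calls report when values are missing, here is a hypothetical toy data frame (the column names mirror the dataset, but the values are made up).

```r
# Hypothetical toy data frame with two missing entries
toy <- data.frame(
  hum = c(0.61, NA, 0.55),
  windspeed = c(0.16, 0.25, NA)
)

total_missing <- sum(is.na(toy))         # total number of NA cells
missing_per_col <- colSums(is.na(toy))   # NA count for each column

print(total_missing)
print(missing_per_col)
```

sum(is.na()) gives one overall count, while colSums(is.na()) shows which columns the missing values sit in.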

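2. Detecting outliers with the boxplot() function

Now it is time to detect outliers in the dataset. To do so, we store the names of the numeric columns in a separate variable using the c() function.

We then use the boxplot() function to detect outliers in the numeric variables.

Boxplot:

From the plot, it is clear that the variables 'hum' and 'windspeed' contain outliers.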
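The points drawn beyond the whiskers of a boxplot follow a simple rule: with the default settings, a value is flagged when it lies more than 1.5 times the interquartile range below the lower hinge or above the upper hinge. The sketch below reproduces that rule by hand on hypothetical readings and compares the result with boxplot.stats(), which boxplot() uses internally. The vector x and the variable names are illustrative assumptions.

```r
# Hypothetical readings; 0.95 is deliberately far from the rest
x <- c(0.10, 0.12, 0.15, 0.16, 0.18, 0.20, 0.22, 0.25, 0.95)

# Lower and upper hinges (Tukey's version of the first and third quartiles)
hinges <- fivenum(x)[c(2, 4)]
iqr <- diff(hinges)

# Fences at 1.5 * IQR beyond the hinges (the default coef of boxplot.stats)
lower_fence <- hinges[1] - 1.5 * iqr
upper_fence <- hinges[2] + 1.5 * iqr

manual_out <- x[x < lower_fence | x > upper_fence]
builtin_out <- boxplot.stats(x)$out

print(manual_out)
print(builtin_out)
```

Both approaches flag the same values, so boxplot.stats(x)$out can be used directly whenever the plot itself is not needed.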

3. Replacing outliers with NA values

Having identified the outliers with boxplot(), we now replace them with NA values, as shown below.

##############################Outlier Analysis -- DETECTION###########################

# 1. Outliers only make sense for continuous/numeric variables. Thus, we store the names of the numeric and the categorical independent variables in separate vectors.
numeric_col <- c('temp', 'atemp', 'hum', 'windspeed')
categorical_col <- c("season", "yr", "mnth", "holiday", "weekday", "workingday", "weathersit")

# 2. Use a boxplot to detect the presence of outliers in the numeric/continuous columns.
boxplot(bike_data[, numeric_col])

# From the above visualization, it is clear that the variables 'hum' and 'windspeed' contain outliers.
#OUTLIER ANALYSIS -- Removal of Outliers
# 1. The boxplot flags values lying more than 1.5 * IQR above the upper quartile or below the lower quartile (that is, beyond the whiskers) as outliers.
# 2. Now, we replace those outlier values with NA.

for (x in c('hum', 'windspeed')) {
  # Values that boxplot.stats() reports as outliers for this column
  outlier_values <- boxplot.stats(bike_data[, x])$out
  bike_data[bike_data[, x] %in% outlier_values, x] <- NA
}

# Check whether the outliers in the above columns were replaced by NA
sum(is.na(bike_data$hum))
sum(is.na(bike_data$windspeed))
as.data.frame(colSums(is.na(bike_data)))

4. Verifying that all outliers were replaced with NA

We now use sum(is.na()) to check for missing data, that is, to confirm that the outliers were properly converted into missing values.

Output:

> sum(is.na(bike_data$hum))
[1] 2
> sum(is.na(bike_data$windspeed))
[1] 13
> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                2
windspeed                         13
casual                             0
registered                         0
cnt                                0

As a result, 2 outliers in the 'hum' column and 13 outliers in the 'windspeed' column were converted to missing (NA) values.
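The replacement step can also be wrapped in a small reusable function. The following is only a sketch on hypothetical data: the helper name replace_outliers_with_na and the toy values are assumptions for illustration, not part of the original code.

```r
# Hypothetical helper: replace boxplot-flagged values in a numeric vector with NA
replace_outliers_with_na <- function(values) {
  flagged <- boxplot.stats(values)$out
  values[values %in% flagged] <- NA
  values
}

# Toy data with one obvious outlier per column (made-up values)
toy <- data.frame(
  hum = c(0.60, 0.62, 0.58, 0.65, 0.61, 0.63, 0.59, 0.05),
  windspeed = c(0.15, 0.18, 0.20, 0.17, 0.19, 0.16, 0.21, 0.90)
)

# Apply the helper to every column while keeping the data frame structure
toy[] <- lapply(toy, replace_outliers_with_na)

na_counts <- colSums(is.na(toy))
print(na_counts)
```

Keeping the logic in a function makes it easy to apply the same rule to any set of numeric columns without repeating the loop body.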

5. Dropping rows with missing values

Finally, we handle the missing values by dropping the affected rows with the drop_na() function from the 'tidyr' library.

# Remove the rows that contain missing (NA) values
library(tidyr)
bike_data <- drop_na(bike_data)
as.data.frame(colSums(is.na(bike_data)))

Output:

As a result, all the outliers have now been removed!

> as.data.frame(colSums(is.na(bike_data)))
           colSums(is.na(bike_data))
instant                            0
dteday                             0
season                             0
yr                                 0
mnth                               0
holiday                            0
weekday                            0
workingday                         0
weathersit                         0
temp                               0
atemp                              0
hum                                0
windspeed                          0
casual                             0
registered                         0
cnt                                0
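Note that dropping missing values removes entire rows, not just the affected cells. If tidyr is not available, base R's na.omit() behaves similarly for this purpose. The sketch below uses a hypothetical toy data frame to show how the row count changes.

```r
# Hypothetical toy data frame; two of the four rows contain an NA
toy <- data.frame(
  hum = c(0.61, NA, 0.55, 0.58),
  windspeed = c(0.16, 0.25, NA, 0.20),
  cnt = c(985, 801, 1349, 1562)
)

rows_before <- nrow(toy)
cleaned <- na.omit(toy)   # keep only rows with no NA in any column
rows_after <- nrow(cleaned)

print(c(rows_before, rows_after))
print(colSums(is.na(cleaned)))
```

Comparing nrow() before and after the call is a quick way to see how much data the outlier removal has cost.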

Conclusion

With this, we have come to the end of this topic. Feel free to comment below if you have any questions. For more posts related to R programming, stay tuned!!

Till then, Happy Learning!! :)