
[Pyspark] Dimensionality reduction with PCA from pyspark.ml.feature


[Reference] spark.apache.org/docs/1.5.1/ml-features.html#pca

 


After building the pyspark DataFrame you want to train on, the pyspark.ml estimators require a single vector-typed column, conventionally named features, so the individual numeric columns must first be assembled into one vector column.

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import PCA
from pyspark.ml.functions import vector_to_array  # Spark 3.0+

# List the columns to vectorize in inputCols; the combined vector goes to outputCol
training_vectorize = VectorAssembler(inputCols=['A', 'B', 'C', 'D', 'E', 'F', 'G'],
                                     outputCol="features")
dataset = training_vectorize.transform(final)

# Run PCA dimensionality reduction
# k - number of dimensions to reduce to
# inputCol - the column to reduce
# outputCol - the column that receives the reduced vectors
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
model = pca.fit(dataset)

# pcaFeatures is an ml Vector, which cannot be indexed with [] directly;
# convert it to an array first with vector_to_array, then pull out each component
result = (model.transform(dataset)
          .withColumn("pca_arr", vector_to_array("pcaFeatures"))
          .select(F.col("pca_arr")[0].alias('X'),
                  F.col("pca_arr")[1].alias('Y'),
                  'label'))

 
