본문 바로가기

Data/Data Analysis

[SQL] ROW_NUMBER(), LEAD(), LAG()

반응형

설명전에 해당 sql은 pyspark.sql 이라는 점 확인!

 

  • row_number() - 행 번호 매겨주는 sql함수
  • LEAD() - 다음 행 값을 가져오는 sql 함수
  • LAG() - 이전 행 값을 가져오는 sql 함수

아래와 같은 테이블이 있을때 (shot)

user timestamp shot_id
jaeyung 12:34:54 123
jaeyung 12:45:34 234
jaeyung 12:50:55 456
seung 12:55:23 334
seung 13:01:34 523
seung 13:12:43 646

1. row_number()

spark.sql("""
select user, timestamp, shot_id,
    row_number() over(partition by user order by timestamp) as row_number
""")

출력 테이블

user timestamp shot_id row_number
jaeyung 12:34:54 123 1
jaeyung 12:45:34 234 2
jaeyung 12:50:55 456 3
seung 12:55:23 334 1
seung 13:01:34 523 2
seung 13:12:43 646 3

 

2. LEAD()

spark.sql("""
select user, timestamp, shot_id,
    lead(shot_id) over(partition by user order by timestamp) as lead_shot_id
from shot
""")
user timestamp shot_id lead_shot_id
jaeyung 12:34:54 123 234
jaeyung 12:45:34 234 456
jaeyung 12:50:55 456 null
seung 12:55:23 334 523
seung 13:01:34 523 646
seung 13:12:43 646 null

행 뒤에 값이 없을경우엔 null값으로 찍힌다

 

3. LAG()

spark.sql("""
select user, timestamp, shot_id,
    lag(shot_id) over(partition by user order by timestamp) as lag_shot_id
from shot
""")
user timestamp shot_id lag_shot_id
jaeyung 12:34:54 123 null
jaeyung 12:45:34 234 123
jaeyung 12:50:55 456 234
seung 12:55:23 334 null
seung 13:01:34 523 334
seung 13:12:43 646 523

행 앞에 값이 없을경우엔 null값으로 찍힌다

반응형

'Data > Data Analysis' 카테고리의 다른 글

[SQL] ARRAY_CONTAINS 함수  (0) 2020.07.20
[Pyspark] UDF 함수에 parameter값 추가  (0) 2020.05.08
[Pandas] Dataframe resample 함수  (0) 2020.03.30
[Statistics] 회귀와 예측2  (0) 2020.03.09
[Statistics] 회귀와 예측  (0) 2020.03.06