Data/Data Analysis
[SQL] ROW_NUMBER(), LEAD(), LAG()
재융
2020. 4. 29. 02:08
반응형
설명전에 해당 sql은 pyspark.sql 이라는 점 확인!
- row_number() - 행 번호 매겨주는 sql함수
- LEAD() - 다음 행 값을 가져오는 sql 함수
- LAG() - 이전 행 값을 가져오는 sql 함수
아래와 같은 테이블이 있을때 (shot)
user | timestamp | shot_id |
jaeyung | 12:34:54 | 123 |
jaeyung | 12:45:34 | 234 |
jaeyung | 12:50:55 | 456 |
seung | 12:55:23 | 334 |
seung | 13:01:34 | 523 |
seung | 13:12:43 | 646 |
1. row_number()
spark.sql("""
select user, timestamp, shot_id,
row_number() over(partition by user order by timestamp) as row_number
""")
출력 테이블
user | timestamp | shot_id | row_number |
jaeyung | 12:34:54 | 123 | 1 |
jaeyung | 12:45:34 | 234 | 2 |
jaeyung | 12:50:55 | 456 | 3 |
seung | 12:55:23 | 334 | 1 |
seung | 13:01:34 | 523 | 2 |
seung | 13:12:43 | 646 | 3 |
2. LEAD()
spark.sql("""
select user, timestamp, shot_id,
lead(shot_id) over(partition by user order by timestamp) as lead_shot_id
from shot
""")
user | timestamp | shot_id | lead_shot_id |
jaeyung | 12:34:54 | 123 | 234 |
jaeyung | 12:45:34 | 234 | 456 |
jaeyung | 12:50:55 | 456 | null |
seung | 12:55:23 | 334 | 523 |
seung | 13:01:34 | 523 | 646 |
seung | 13:12:43 | 646 | null |
행 뒤에 값이 없을경우엔 null값으로 찍힌다
3. LAG()
spark.sql("""
select user, timestamp, shot_id,
lag(shot_id) over(partition by user order by timestamp) as lag_shot_id
from shot
""")
user | timestamp | shot_id | lag_shot_id |
jaeyung | 12:34:54 | 123 | null |
jaeyung | 12:45:34 | 234 | 123 |
jaeyung | 12:50:55 | 456 | 234 |
seung | 12:55:23 | 334 | null |
seung | 13:01:34 | 523 | 334 |
seung | 13:12:43 | 646 | 523 |
행 앞에 값이 없을경우엔 null값으로 찍힌다
반응형