HackerRank SQL - Interviews

Samantha interviews many candidates from different colleges using coding challenges and contests. Write a query to print the contest_id, hacker_id, name, and the sums of total_submissions, total_accepted_submissions, total_views, and total_unique_views for each contest sorted by contest_id. Exclude the contest from the result if all four sums are .

Note: A specific contest can be used to screen candidates at more than one college, but each college only holds screening contest.

Input Format

The following tables hold interview data:

Contests: The contest_id is the id of the contest, hacker_id is the id of the hacker who created the contest, and name is the name of the hacker.
Colleges: The college_id is the id of the college, and contest_id is the id of the contest that Samantha used to screen the candidates.
Challenges: The challenge_id is the id of the challenge that belongs to one of the contests whose contest_id Samantha forgot, and college_id is the id of the college where the challenge was given to candidates.
View_Stats: The challenge_id is the id of the challenge, total_views is the number of times the challenge was viewed by candidates, and total_unique_views is the number of times the challenge was viewed by unique candidates.
Submission_Stats: The challenge_id is the id of the challenge, total_submissions is the number of submissions for the challenge, and total_accepted_submission is the number of submissions that achieved full scores.

Problem

- contest_id, hacker_id, name, and the sums of total_submissions, total_accepted_submissions, total_views, and total_unique_views를 추출하라.

- 단, contest_id를 기준으로 오름차순 정렬하되, 4가지 집계합이 0이상인 경우만 표기하라.

Answer1

SELECT  C.contest_id
      , C.hacker_id
      , C.name
      , sum(SS.total_submissions)
      , sum(SS.total_accepted_submissions)
      , sum(VS.total_views)
      , sum(VS.total_unique_views)
FROM Contests C LEFT JOIN Colleges C2 ON C.contest_id = C2.contest_id
                LEFT JOIN Challenges C3 ON C3.college_id = C2.college_id
                LEFT JOIN (SELECT challenge_id 
                           ,SUM(total_views) total_views
                           ,SUM(total_unique_views) total_unique_views
                           FROM View_Stats
                       GROUP BY challenge_id) VS ON VS.challenge_id = C3.challenge_id
                LEFT JOIN (SELECT challenge_id
                                  ,SUM(total_submissions) total_submissions
                                  ,SUM(total_accepted_submissions) total_accepted_submissions
                            FROM Submission_Stats
                        GROUP BY challenge_id) SS ON SS.challenge_id = C3.challenge_id
GROUP BY C.contest_id, C.hacker_id, C.name
HAVING sum(SS.total_submissions)+sum(SS.total_accepted_submissions)+sum(VS.total_views)+sum(total_unique_views) >0
ORDER BY C.contest_id;

실제로 현업에서도 종종 실수하는 부분 중 하나인데 테이블에 PK나 FK 외 join을 하게 되면 집계가 꼬이게 되는데, 이때 group by나 distinct로 중복을 제거하고 join을 해야 한다. 해당 문제 역시 테이블을 꼼꼼하게 파악하고 관계를 머릿속에 잘 그린 상태로 쿼리를 작성해야 알 수 있었다.

실제로 업무 중에 이런 몇 가지 포인트가 있어서 노션에 정리해두고 틈틈이 회고용으로 보기도 한다.

How to solve

테이블은 5개로 이루어져 있고, Challenge_id로 결합된 View_Stats, submission_Stats는 집계테이블이라 할 수 있다.

단순 inner join으로 묶어내면 다음과 같은 구조가 될텐데,

이렇게 쿼리를 날리면 틀렸다고 나온다.

SELECT  C.contest_id
      , C.hacker_id
      , C.name
      , sum(SS.total_submissions) sums_of_total_submissions
      , sum(SS.total_accepted_submissions) total_accepted_submissions
      , sum(VS.total_views) total_views
      , sum(total_unique_views) total_unique_views
FROM Contests C
     ,Colleges C2
     ,Challenges C3
     ,View_Stats VS  
     ,Submission_Stats SS 
WHERE C.contest_id = C2.contest_id
  AND C3.college_id = C2.college_id
  AND VS.challenge_id = C3.challenge_id
  AND SS.challenge_id = C3.challenge_id
GROUP BY C.contest_id, C.hacker_id, C.name
HAVING sum(SS.total_submissions) + sum(SS.total_accepted_submissions) + sum(VS.total_views) + sum(total_unique_views) >0

다시 예시 테이블을 보면, 이상한 점이 있다.

집계테이블에서 중복값이 있다는 점이다.

이게 어떤 문제를 발생시키냐면 inner join으로 묶었을 때, 해당 컬럼값을 중복으로 계산한다는 소리가 된다.

말로 설명하기 어려우니 실제로 한 번 돌려보자

view_stats에서 challenge_id가 773인 레코드는 다음과 같다.

submission_stats에서 challenge_id가 773인 레코드는 다음과 같다.

이제 다음과 같은 쿼리를 돌려보면 어떻게 결과가 나올까?

SELECT c.challenge_id, vs.*, ss.*
FROM Challenges c, view_stats vs, submission_stats ss
WHERE c.challenge_id = 773
  AND c.challenge_id = vs.challenge_id
  AND c.challenge_id = ss.challenge_id
ORDER BY c.challenge_id

다음과 같이 한 번만 계산되어야 할 각각의 레코드가 크로스 조인되듯이 모두 펼쳐진다.

다시 돌아가서 stats인 집계테이블에는 challenge_id는 중복이 존재하기 때문에, 즉 정확히는 반정규화되어 있는 상태이기 때문에 distinct로 중복을 제거하든지, group by를 써서 중복을 제거해줘야 한다.

여기서는 각각의 합이 필요함으로 group by절을 이용해 각각의 stats 테이블을 from절의 서브쿼리로 묶어주어야 한다.

#from절
FROM Contests C LEFT JOIN Colleges C2 ON C.contest_id = C2.contest_id
                LEFT JOIN Challenges C3 ON C3.college_id = C2.college_id
                # view_stats table with grouping value
                LEFT JOIN (SELECT challenge_id 
                           ,SUM(total_views) total_views
                           ,SUM(total_unique_views) total_unique_views
                           FROM View_Stats
                       GROUP BY challenge_id) VS ON VS.challenge_id = C3.challenge_id
                # submission_stats table with grouping value
                LEFT JOIN (SELECT challenge_id
                                  ,SUM(total_submissions) total_submissions
                                  ,SUM(total_accepted_submissions) total_accepted_submissions
                            FROM Submission_Stats
                        GROUP BY challenge_id) SS ON SS.challenge_id = C3.challenge_id

좀 더 가독성을 위해 contests 테이블을 기준으로 left join으로 결합해주었다.

이렇게 하면 각각의 중복값이 중복 횟수가 아니라 한 번만 계산할 수 있게 결합된다.

마지막으로 having절에서 4개의 집계값 합이 0인 경우를 제외하면 끝이다.

참고자료

http://www.gurubee.net/article/84239

'데이터 > SQL 문제풀이' 카테고리의 다른 글

HackerRank SQL - 15 Days of Learning SQL (0)	2023.02.22
HackerRank SQL - Print Prime Numbers (0)	2023.02.21
HackerRank SQL - Symmetric Pairs (0)	2023.02.12
HackerRank SQL - Placements (1)	2023.02.05
HackerRank SQL - SQL Project Planning (2)	2023.02.02