There are some SQL patterns that, once you recognize them, you start seeing them everywhere. The solutions to the puzzles I'll present here are actually quite simple SQL queries, but understanding the idea behind them will unlock new approaches to the queries you write on a day-to-day basis.
These challenges are all based on real-world scenarios, as over the past few months I made a point of writing down every puzzle-like query I had to build. I also encourage you to try them for yourself, so you can challenge yourself first, which will improve your learning!
All queries to generate the datasets are provided in a PostgreSQL- and DuckDB-friendly syntax, so you can easily copy and play with them. At the end, I'll also share a link to a GitHub repo containing all the code, as well as the answer to the bonus challenge I'll leave for you!
I organized these puzzles in order of increasing difficulty, so if you find the first ones too easy, at least take a look at the last one, which uses a technique that I truly believe you won't have seen before.
Okay, let's get started.
I like this puzzle because of how short and simple the final query is, even though it deals with many edge cases. The data for this challenge shows tickets moving between Kanban stages, and the objective is to find how long, on average, tickets stay in the Doing stage.
The data contains the ID of the ticket, the date the ticket was created, the date of the move, and the "from" and "to" stages of the move. The stages present are New, Doing, Review, and Done.
Some things you need to know (edge cases):
- Tickets can move backwards, meaning tickets can return to the Doing stage.
- You shouldn't include tickets that are still stuck in the Doing stage, as there is no way to know how long they will stay there.
- Tickets are not always created in the New stage.
CREATE TABLE ticket_moves (
ticket_id INT NOT NULL,
create_date DATE NOT NULL,
move_date DATE NOT NULL,
from_stage TEXT NOT NULL,
to_stage TEXT NOT NULL
);
INSERT INTO ticket_moves (ticket_id, create_date, move_date, from_stage, to_stage)
VALUES
-- Ticket 1: Created in "New", then moves to Doing, Review, Done.
(1, '2024-09-01', '2024-09-03', 'New', 'Doing'),
(1, '2024-09-01', '2024-09-07', 'Doing', 'Review'),
(1, '2024-09-01', '2024-09-10', 'Review', 'Done'),
-- Ticket 2: Created in "New", then moves: New → Doing → Review → Doing again → Review.
(2, '2024-09-05', '2024-09-08', 'New', 'Doing'),
(2, '2024-09-05', '2024-09-12', 'Doing', 'Review'),
(2, '2024-09-05', '2024-09-15', 'Review', 'Doing'),
(2, '2024-09-05', '2024-09-20', 'Doing', 'Review'),
-- Ticket 3: Created in "New", then moves to Doing. (Edge case: no subsequent move from Doing.)
(3, '2024-09-10', '2024-09-16', 'New', 'Doing'),
-- Ticket 4: Created already in "Doing", then moves to Review.
(4, '2024-09-15', '2024-09-22', 'Doing', 'Review');
A summary of the data:
- Ticket 1: Created in the New stage, moves normally to Doing, then Review, then Done.
- Ticket 2: Created in New, then moves: New → Doing → Review → Doing again → Review.
- Ticket 3: Created in New, moves to Doing, but is still stuck there.
- Ticket 4: Created already in the Doing stage, moves to Review afterward.
It might be a good idea to stop for a bit and think about how you would deal with this. Can you figure out how long a ticket stays in a single stage?
Honestly, this sounds intimidating at first, and it looks like it will be a nightmare to deal with all the edge cases. Let me show you the full solution to the problem, and then I'll explain what is happening.
WITH stage_intervals AS (
SELECT
ticket_id,
from_stage,
move_date
- COALESCE(
LAG(move_date) OVER (
PARTITION BY ticket_id
ORDER BY move_date
),
create_date
) AS days_in_stage
FROM
ticket_moves
)
SELECT
SUM(days_in_stage) / COUNT(DISTINCT ticket_id) as avg_days_in_doing
FROM
stage_intervals
WHERE
from_stage = 'Doing';

The first CTE uses the LAG function to find the previous move of the ticket, which will be the time the ticket entered that stage. Calculating the duration is as simple as subtracting the previous date from the move date.
What you should notice is the use of COALESCE on the previous move date. What that does is: if a ticket doesn't have a previous move, it uses the ticket's creation date instead. This takes care of tickets created directly in the Doing stage, since the query will still correctly calculate the time it took to leave that stage.
This is the result of the first CTE, showing the time spent in each stage. Notice how Ticket 2 has two entries, since it visited the Doing stage on two separate occasions.
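For reference, running just that first CTE against the sample data should give something close to this (values worked out by hand from the inserts above):

ticket_id | from_stage | days_in_stage
1         | New        | 2
1         | Doing      | 4
1         | Review     | 3
2         | New        | 3
2         | Doing      | 4
2         | Review     | 3
2         | Doing      | 5
3         | New        | 6
4         | Doing      | 7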

With this done, it's just a matter of computing the average as the SUM of total days spent in Doing, divided by the distinct number of tickets that ever left the stage. Doing it this way, instead of simply using AVG, makes sure the two rows for Ticket 2 get properly accounted for as a single ticket.
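To make the difference concrete, here is a quick comparison you could run on top of the same stage_intervals CTE. With the sample data, the Doing durations are 4 days (Ticket 1), 4 and 5 days (Ticket 2, two stints), and 7 days (Ticket 4):
--
-- Previous CTE (stage_intervals)
--
SELECT
-- Per-row average: (4 + 4 + 5 + 7) / 4 = 5 days
AVG(days_in_stage) AS avg_per_row,
-- Per-ticket average: (4 + 4 + 5 + 7) / 3, roughly 6.7 days
-- (integer division truncates this to 6 in PostgreSQL; cast to numeric if you want decimals)
SUM(days_in_stage) / COUNT(DISTINCT ticket_id) AS avg_per_ticket
FROM
stage_intervals
WHERE
from_stage = 'Doing';
The per-ticket number is the one the puzzle asks for, since Ticket 2's two stints should count as a single ticket.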
Not so bad, right?
The goal of this second challenge is to find the most recent contract sequence of every employee. A break in the sequence happens when two contracts have a gap of more than one day between them.
In this dataset, there are no contract overlaps, meaning that a contract for the same employee either has a gap or ends exactly one day before the new one starts.
CREATE TABLE contracts (
contract_id integer PRIMARY KEY,
employee_id integer NOT NULL,
start_date date NOT NULL,
end_date date NOT NULL
);
INSERT INTO contracts (contract_id, employee_id, start_date, end_date)
VALUES
-- Employee 1: Two continuous contracts
(1, 1, '2024-01-01', '2024-03-31'),
(2, 1, '2024-04-01', '2024-06-30'),
-- Employee 2: One contract, then a gap of three days, then two contracts
(3, 2, '2024-01-01', '2024-02-15'),
(4, 2, '2024-02-19', '2024-04-30'),
(5, 2, '2024-05-01', '2024-07-31'),
-- Employee 3: One contract
(6, 3, '2024-03-01', '2024-08-31');

As a summary of the data:
- Employee 1: Has two continuous contracts.
- Employee 2: One contract, then a gap of three days, then two contracts.
- Employee 3: One contract.
The expected result, given the dataset, is that all contracts should be included except for the first contract of Employee 2, which is the only one that has a gap.
Before explaining the logic behind the solution, I would like you to think about what operation could be used to join contracts that belong to the same sequence. Focus only on the second row of data: what information do you need to decide whether this contract was a break or not?
I hope it's clear that this is, once again, the perfect scenario for window functions. They are incredibly useful for solving problems like this, and knowing when to use them helps a lot in finding clean solutions.
The first thing to do, then, is to get the end date of the previous contract for the same employee with the LAG function. Having done that, it's simple to compare both dates and check whether there was a break in the sequence.
WITH ordered_contracts AS (
SELECT
*,
LAG(end_date) OVER (PARTITION BY employee_id ORDER BY start_date) AS previous_end_date
FROM
contracts
),
gapped_contracts AS (
SELECT
*,
-- Deals with the case of the first contract, which won't have
-- a previous end date. In this case, it's still the start of a new
-- sequence.
CASE WHEN previous_end_date IS NULL
OR previous_end_date < start_date - INTERVAL '1 day' THEN
1
ELSE
0
END AS is_new_sequence
FROM
ordered_contracts
)
SELECT * FROM gapped_contracts ORDER BY employee_id ASC;
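Given the sample data, gapped_contracts should come out roughly like this (previous_end_date and the flag worked out by hand):

contract_id | employee_id | start_date | end_date   | previous_end_date | is_new_sequence
1           | 1           | 2024-01-01 | 2024-03-31 | NULL              | 1
2           | 1           | 2024-04-01 | 2024-06-30 | 2024-03-31        | 0
3           | 2           | 2024-01-01 | 2024-02-15 | NULL              | 1
4           | 2           | 2024-02-19 | 2024-04-30 | 2024-02-15        | 1
5           | 2           | 2024-05-01 | 2024-07-31 | 2024-04-30        | 0
6           | 3           | 2024-03-01 | 2024-08-31 | NULL              | 1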

An intuitive way to continue the query is to number the sequences of each employee. For example, an employee who has no gaps will always be on their first sequence, while an employee who had five breaks in contracts will be on their fifth sequence. Funnily enough, this is done with yet another window function.
--
-- Previous CTEs
--
sequences AS (
SELECT
*,
SUM(is_new_sequence) OVER (PARTITION BY employee_id ORDER BY start_date) AS sequence_id
FROM
gapped_contracts
)
SELECT * FROM sequences ORDER BY employee_id ASC;
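Again worked out by hand from the sample data, the sequence numbering should look like this (only the relevant columns shown):

contract_id | employee_id | is_new_sequence | sequence_id
1           | 1           | 1               | 1
2           | 1           | 0               | 1
3           | 2           | 1               | 1
4           | 2           | 1               | 2
5           | 2           | 0               | 2
6           | 3           | 1               | 1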

Notice how, for Employee 2, sequence #2 starts right after the first gapped value. To finish the query, I grouped the data by employee, got the value of their most recent sequence, and then did an inner join with the sequences to keep only the most recent one.
--
-- Previous CTEs
--
max_sequence AS (
SELECT
employee_id,
MAX(sequence_id) AS max_sequence_id
FROM
sequences
GROUP BY
employee_id
),
latest_contract_sequence AS (
SELECT
c.contract_id,
c.employee_id,
c.start_date,
c.end_date
FROM
sequences c
JOIN max_sequence m ON c.sequence_id = m.max_sequence_id
AND c.employee_id = m.employee_id
ORDER BY
c.employee_id,
c.start_date
)
SELECT
*
FROM
latest_contract_sequence;

As expected, our final result is basically our starting data, just with the first contract of Employee 2 missing!
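As a side note, and not part of the original solution above, the last two CTEs could also be collapsed into a single window function, which avoids the join entirely. A sketch of that variant, reusing the same sequences CTE:
--
-- Previous CTEs up to "sequences"
--
latest_sequence AS (
SELECT
*,
MAX(sequence_id) OVER (PARTITION BY employee_id) AS max_sequence_id
FROM
sequences
)
SELECT
contract_id,
employee_id,
start_date,
end_date
FROM
latest_sequence
WHERE
sequence_id = max_sequence_id
ORDER BY
employee_id,
start_date;
Both versions return the same rows; it mostly comes down to which one you find easier to read.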
Finally, the last puzzle. I'm glad you made it this far.
For me, this is the most mind-blowing one, as when I first encountered the problem I thought of a completely different solution that would have been a mess to implement in SQL.
For this puzzle, I've changed the context from what I actually had to deal with at my job, as I think it makes it easier to explain.
Imagine you're a data analyst at an event venue, and you're analyzing the talks scheduled for an upcoming event. You want to find the time of day when the highest number of talks will be happening at the same time.
This is what you should know about the schedules:
- Rooms are booked in increments of 30 min, e.g. from 9h to 10h30.
- The data is clean; there are no overbookings of meeting rooms.
- There can be back-to-back meetings in a single meeting room.

Meeting schedule visualized (this is the actual data).
CREATE TABLE meetings (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings (room, start_time, end_time) VALUES
-- Room A meetings
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B meetings
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C meetings
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room C', '2024-10-01 11:30', '2024-10-01 12:00');

The way to solve this is by using what is called a Sweep Line Algorithm, also known as an event-based solution. This last name actually helps in understanding what will be done: instead of dealing with intervals, which is what we have in the original data, we deal with events instead.
To do this, we need to transform every row into two separate events. The first event will be the start of the meeting, and the second event will be the end of the meeting.
WITH events AS (
-- Create an event for the start of each meeting (+1)
SELECT
start_time AS event_time,
1 AS delta
FROM meetings
UNION ALL
-- Create an event for the end of each meeting (-1)
SELECT
-- Small trick to handle back-to-back meetings (explained later)
end_time - INTERVAL '1 minute' AS event_time,
-1 AS delta
FROM meetings
)
SELECT * FROM events;

Take the time to understand what is happening here. To create two events from a single row of data, we're simply unioning the dataset with itself; the first half uses the start time as the timestamp, and the second half uses the end time.
You might have already noticed the delta column and can see where this is going. When an event starts, we count it as +1; when it ends, we count it as -1. You might even be thinking of yet another window function to solve this, and you're right!
But before that, let me explain the trick I used on the end dates. Since I don't want back-to-back meetings to count as two concurrent meetings, I'm subtracting a single minute from every end date. This way, if one meeting ends and another starts at 10h30, it won't look like two meetings are happening at the same time at 10h30.
Okay, back to the query and one more window function. This time, though, the function of choice is a rolling SUM.
--
-- Previous CTEs
--
ordered_events AS (
SELECT
event_time,
delta,
SUM(delta) OVER (ORDER BY event_time, delta DESC) AS concurrent_meetings
FROM events
)
SELECT * FROM ordered_events ORDER BY event_time DESC;
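With the sample data, the running total comes out roughly like this (shown here in ascending time order for readability; the query above lists the most recent events first):

event_time       | delta | concurrent_meetings
2024-10-01 09:00 | +1    | 2
2024-10-01 09:00 | +1    | 2
2024-10-01 09:30 | +1    | 3
2024-10-01 09:59 | -1    | 1
2024-10-01 09:59 | -1    | 1
2024-10-01 10:00 | +1    | 2
2024-10-01 10:59 | -1    | 1
2024-10-01 11:00 | +1    | 2
2024-10-01 11:29 | -1    | 1
2024-10-01 11:30 | +1    | 2
2024-10-01 11:59 | -1    | 0
2024-10-01 11:59 | -1    | 0

(The two 09:00 rows both show 2 because rows that tie in the ORDER BY are treated as peers by the default window frame.)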

The rolling SUM over the delta column essentially walks down every record and finds how many events are active at that moment. For example, at 9 am sharp, it sees two events starting, so it marks the number of concurrent meetings as two!
When the third meeting starts, the count goes up to three. But when it gets to 9h59 (10 am), two meetings end, bringing the counter back down to one. With this data, the only thing missing is to find when the highest value of concurrent meetings happens.
--
-- Previous CTEs
--
max_events AS (
-- Find the maximum concurrent meetings value
SELECT
event_time,
concurrent_meetings,
RANK() OVER (ORDER BY concurrent_meetings DESC) AS rnk
FROM ordered_events
)
SELECT event_time, concurrent_meetings
FROM max_events
WHERE rnk = 1;

That's it! The interval of 9h30–10h is the one with the largest number of concurrent meetings, which checks out with the schedule visualization above!
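As a side note, and not something from the original solution: if you would rather not touch the end times, a common variation of the sweep line keeps the raw timestamps and instead breaks ties in the window's ORDER BY, so that end events are processed before start events that land on the exact same moment:
--
-- Assumes an events CTE built with the raw end_time (no one-minute subtraction)
--
ordered_events AS (
SELECT
event_time,
delta,
-- delta ASC processes the -1 (end) events before the +1 (start) events
-- at the same timestamp, so back-to-back meetings never count as concurrent
SUM(delta) OVER (ORDER BY event_time, delta ASC) AS concurrent_meetings
FROM events
)
SELECT * FROM ordered_events;
Both approaches encode the same rule, that a meeting ending at 10h30 and a meeting starting at 10h30 are not concurrent; one does it in the data, the other in the ordering.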
This solution looks incredibly simple in my opinion, and it works for so many situations. Every time you're dealing with intervals now, you should ask yourself whether the query wouldn't be easier if you thought about it from the perspective of events.
But before you move on, and to really nail down this concept, I want to leave you with a bonus challenge, which is also a common application of the Sweep Line Algorithm. I hope you give it a try!
Bonus challenge
The context for this one is still the same as the last puzzle, but now, instead of looking for the interval with the most concurrent meetings, the objective is to find bad scheduling. It seems there are overlaps in the meeting rooms, which need to be listed so they can be fixed ASAP.
How would you find out whether the same meeting room has two or more meetings booked at the same time? Here are some tips on how to solve it:
- It's still the same algorithm.
- This means you'll still do the UNION, but it will look slightly different.
- You should think from the perspective of each meeting room.
You can use this data for the challenge:
CREATE TABLE meetings_overlap (
room TEXT NOT NULL,
start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL
);
INSERT INTO meetings_overlap (room, start_time, end_time) VALUES
-- Room A meetings
('Room A', '2024-10-01 09:00', '2024-10-01 10:00'),
('Room A', '2024-10-01 10:00', '2024-10-01 11:00'),
('Room A', '2024-10-01 11:00', '2024-10-01 12:00'),
-- Room B meetings
('Room B', '2024-10-01 09:30', '2024-10-01 11:30'),
-- Room C meetings
('Room C', '2024-10-01 09:00', '2024-10-01 10:00'),
-- Overlaps with the previous meeting.
('Room C', '2024-10-01 09:30', '2024-10-01 12:00');
If you're interested in the solution to this puzzle, as well as the rest of the queries, check this GitHub repo.
The main takeaway from this blog post is that window functions are overpowered. Ever since I got more comfortable with them, I feel that my queries have become much simpler and easier to read, and I hope the same happens to you.
If you're interested in learning more about them, you'd probably enjoy this other blog post I've written, where I go over how you can understand and use them effectively.
The second takeaway is that the patterns used in these challenges really do show up in many other places. You might need to find sequences of subscriptions or customer retention, or you might need to find overlapping tasks. There are many situations where you will need to use window functions in a very similar fashion to what was done in these puzzles.
The third thing I want you to remember is this approach of using events instead of dealing with intervals. I've looked back at some problems I solved a long time ago that I could have used this pattern on to make my life easier, and unfortunately, I didn't know about it at the time.
I really do hope you enjoyed this post and gave the puzzles a shot yourself. I'm sure that if you made it this far, you either learned something new about SQL or strengthened your knowledge of window functions!
Thank you so much for reading. If you have questions or just want to get in touch, don't hesitate to contact me at mtrentz.com.
All images by the author unless stated otherwise.