| home | about me | feed RSS feed

This was posted as an answer to the following question:
http://stackoverflow.com/questions/6174451/merge-date-rows-if-episodes-are-in-direct-succession-or-overlapping.
I also wanted to write this slightly more detailed explanation.

Say we have a table that stores start- and end-dates for some entities. That is to say: There can be many start- and end-date-pairs for each entity.

For the purposes of this exercise we would ultimately like to condense overlapping start- and end-date-pairs into one row for each set of overlapping date-ranges; non-overlapping date-ranges should not be condensed. In other words, the final goal is to have one and only one row for each set of however many overlapping start- and end-dates exist, for each entity in the table. You can probably think of several applications for this kind of query.

For a solution to this problem we turn to our good friend: so-called "recursive SQL"; which is actually more of an iterative process.

WITH RECURSIVE t1_rec ( id, "begin", "end", n ) AS (
    SELECT id, "begin", "end", n
      FROM (
        SELECT
            id, "begin", "end",
            CASE 
                WHEN LEAD("begin") OVER (
                PARTITION BY    id
                ORDER BY        "begin") <= 
                    ("end" + interval '2' day) 
                THEN 1 ELSE 0 END AS cl,
            ROW_NUMBER() OVER (
                PARTITION BY    id
                ORDER BY        "begin") AS n
        FROM mytable 
    ) s
    WHERE s.cl = 1
  UNION ALL
    SELECT p1.id, p1."begin", p1."end", a.n
      FROM t1_rec a 
           JOIN mytable p1 ON p1.id = a.id
       AND p1."begin" > a."begin"
       AND (a."begin",  a."end" + interval '2' day) OVERLAPS 
           (p1."begin", p1."end")
)
SELECT t1.id, min(t1."begin"), max(t1."end")
  FROM t1_rec t1
       LEFT JOIN t1_rec t2 ON t1.id = t2.id 
       AND t2."end" = t1."end"
       AND t2.n < t1.n
 WHERE t2.n IS NULL
 GROUP BY t1.id, t1.n
 ORDER BY t1.id, t1.n;

To begin the query we start at the end: We specify what fields will be contained in our final resultset: id, begin, end, and n.

Now for the real beginning: We start with the those begin dates in the table that have a subsequent (temporally) begin date. The LEAD function helps with this task. If you want to see more detail about how this part works you can add this field into the SELECT clause of the (sub)query with the CASE statement:

-- for debugging 
LEAD(beg) OVER (
    PARTITION BY    entity_id
    ORDER BY        beg),

Next we build up our resultset row by row using UNION ALL. You can think of this step like a for loop iterating over your table. The second and third join-conditions are specific to this particular question, and ensure that we append 1) subsequent ranges that 2) have a gap of up to one day between one entity's end-date and the next entity's start- date.

The final step is to reduce the resultset to only the rows that represent entire overlapping date-ranges. What happens in the iterative process is that any overlapping ranges beyond an initial one generate a subset of overlapping ranges. This might be useful in some cases, but not here; so we remove them using an aggregate operation. If self-exclusion joins give you the heebie-jeebies use this instead:

-- correlated subquery alternative to self-exclusion join
NOT EXISTS (
    SELECT 1 
      FROM t1_rec t2 
     WHERE t2.entity_id = t1.entity_id 
       AND t2.fin = t1.fin 
       AND t2.n < t1.n)
created: 2011-06-05 | updated: 2011-06-19


Comments are now closed for this post.