Well, the algorithm is not that complicated...
Steps to do:
1) You need to sort the data by initial_range.
2) Then you need to add two extra fields, where ID1 starts from 0 (zero) and ID2 starts from 1 (one). This part of the job can be done using a sequence number. See:
PySpark: Dataframe Sequence Number
index  initial_range  final_range  ID1  ID2
    1        1000000      5999999    0    1
    2        6000000      6299999    1    2
    3        6300000      6399999    2    3
    4        6400000      6499999    3    4
   12        6500000      6549999    4    5
    5        6600000      6699999    5    6
    6        6700000      6749999    6    7
    7        6750000      6799999    7    8
    8        7000000      7399999    8    9
    9        7600000      7699999    9   10
   10        7700000      7749999   10   11
   11        7750000      7799999   11   12
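Steps 1) and 2) can be sketched in plain Python (a stand-in for the PySpark sort plus row_number; the sample data and column order follow the table above):

```python
# Sample data: (index, initial_range, final_range), deliberately unsorted.
rows = [
    (1, 1000000, 5999999), (2, 6000000, 6299999), (3, 6300000, 6399999),
    (4, 6400000, 6499999), (5, 6600000, 6699999), (6, 6700000, 6749999),
    (7, 6750000, 6799999), (8, 7000000, 7399999), (9, 7600000, 7699999),
    (10, 7700000, 7749999), (11, 7750000, 7799999), (12, 6500000, 6549999),
]

# 1) Sort the data by initial_range.
rows.sort(key=lambda r: r[1])

# 2) Attach the two sequence numbers: ID1 starts at 0, ID2 starts at 1.
numbered = [(idx, lo, hi, i, i + 1) for i, (idx, lo, hi) in enumerate(rows)]

for rec in numbered:
    print(rec)
```

In PySpark itself this would be a Window ordered by initial_range with row_number(), once as-is and once minus one.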
3) Now you have to make two left self-joins based on ID1 and ID2:
a) to the next row (on ID2 = ID1) to identify the end of a group (EOG)
b) to the previous row (on ID1 = ID2) to identify the beginning of a group (BOG).
To number the EOG and BOG rows, use a sequence number again :)
See:
left-self-joins.png
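The two self-joins from step 3) can be sketched like this (plain Python illustrating the join logic; in PySpark these would be df.alias(...).join(...) on the ID columns):

```python
# Rows from step 2): (index, initial_range, final_range, ID1, ID2).
rows = sorted([
    (1, 1000000, 5999999), (2, 6000000, 6299999), (3, 6300000, 6399999),
    (4, 6400000, 6499999), (12, 6500000, 6549999), (5, 6600000, 6699999),
    (6, 6700000, 6749999), (7, 6750000, 6799999), (8, 7000000, 7399999),
    (9, 7600000, 7699999), (10, 7700000, 7749999), (11, 7750000, 7799999),
], key=lambda r: r[1])
numbered = [(idx, lo, hi, i, i + 1) for i, (idx, lo, hi) in enumerate(rows)]

# a) left join to the next row (this row's ID2 = next row's ID1): the row
#    is an end-of-group (EOG) when there is no next row, or the next row
#    does not start right after this row's final_range.
by_id1 = {r[3]: r for r in numbered}
eog = [r for r in numbered
       if (nxt := by_id1.get(r[4])) is None or nxt[1] != r[2] + 1]

# b) left join to the previous row (this row's ID1 = previous row's ID2):
#    the row is a beginning-of-group (BOG) when there is no previous row,
#    or the previous row does not end right before this row starts.
by_id2 = {r[4]: r for r in numbered}
bog = [r for r in numbered
       if (prv := by_id2.get(r[3])) is None or prv[2] + 1 != r[1]]

print([r[0] for r in eog])
print([r[0] for r in bog])
```

Numbering eog and bog in order (their sequence numbers) is what lets step 4) pair them up.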
4) Finally, you have to join a) and b) based on the EOG and BOG.
Result:
index  initial_range  final_range  BOG  EOG
    1        1000000      6549999    1    1
    5        6600000      6799999    2    2
    8        7000000      7399999    3    3
    9        7600000      7799999    4    4
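The whole pipeline, end to end, can be sketched in plain Python (again a stand-in for the PySpark joins; the data is taken from the tables above):

```python
# (index, initial_range, final_range) rows, then steps 1)-4).
rows = sorted([
    (1, 1000000, 5999999), (2, 6000000, 6299999), (3, 6300000, 6399999),
    (4, 6400000, 6499999), (12, 6500000, 6549999), (5, 6600000, 6699999),
    (6, 6700000, 6749999), (7, 6750000, 6799999), (8, 7000000, 7399999),
    (9, 7600000, 7699999), (10, 7700000, 7749999), (11, 7750000, 7799999),
], key=lambda r: r[1])

# Beginning-of-group: no neighbour ends right before this row starts.
bog = [r for i, r in enumerate(rows)
       if i == 0 or rows[i - 1][2] + 1 != r[1]]
# End-of-group: no neighbour starts right after this row ends.
eog = [r for i, r in enumerate(rows)
       if i == len(rows) - 1 or rows[i + 1][1] != r[2] + 1]

# 4) Join the n-th BOG with the n-th EOG: the merged range starts at the
#    BOG row's initial_range and ends at the EOG row's final_range.
result = [(b[0], b[1], e[2], n + 1, n + 1)
          for n, (b, e) in enumerate(zip(bog, eog))]
for rec in result:
    print(rec)
```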