Hi,
I am using Pig 0.20.0 and have two large data sets (log files), each around
2-3 TB in compressed format. I need to join them on a composite key of
userId and timestamp.
UserId is common to both, but since the log files come from two completely
different systems, their timestamps may differ. So the join logic has to
work like this:
Take the timestamp from one log, say timestampA, then consider all records
from the other data set where the userId matches and the timestamp of B
(say timestampB) satisfies: (timestampA - x) <= timestampB <= (timestampA +
x), where x is something like 5 minutes.
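In Pig Latin this kind of band join is usually expressed as an equi-join on
userId alone, followed by a FILTER on the timestamp window. A minimal sketch
(relation names, field names, and millisecond units are all assumptions, not
from my actual scripts):

```pig
-- Hypothetical schemas; adjust to the real log layouts.
logA = LOAD 'logA' AS (userId:chararray, timestampA:long);
logB = LOAD 'logB' AS (userId:chararray, timestampB:long);

-- Equi-join on userId only.
joined = JOIN logA BY userId, logB BY userId;

-- Keep only pairs within x = 5 minutes (300000 ms, assuming ms timestamps).
within = FILTER joined BY
    (logB::timestampB >= logA::timestampA - 300000L) AND
    (logB::timestampB <= logA::timestampA + 300000L);
```

The non-equi condition never enters the JOIN itself; it is applied afterwards,
so the join output can be large for users with many records before the filter
prunes it.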
*Questions:*
1. So far I have used an equi-join, but I am not sure how to do a non-equi
join.
2. Is there any way to optimize the join operation? If I do a secondary
sort on userId + timestamp in both datasets, will it help?
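To make question 2 concrete, one way to get per-user sorted-by-timestamp
processing in Pig is a COGROUP on userId with an in-bag ORDER inside a
nested FOREACH. A hedged sketch only (all relation and field names are
assumptions; the FLATTEN pair produces the per-user cross product, which
would still need the timestamp-window filter afterwards):

```pig
-- Group both logs by userId so each user's records land together.
grouped = COGROUP logA BY userId, logB BY userId;

paired = FOREACH grouped {
    -- Sort each user's bag by timestamp (the "secondary sort").
    a = ORDER logA BY timestampA;
    b = ORDER logB BY timestampB;
    -- Cross the two sorted bags per user; filter on the window downstream.
    GENERATE FLATTEN(a), FLATTEN(b);
};
```

Whether this beats a plain JOIN + FILTER presumably depends on how skewed
the userId distribution is, so I would appreciate guidance here.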
Thanks
Amit