Channel: Apache Timeline

Filter bag with multiple outputs

Hi, I have a rather simple problem, but I can't come up with a nice solution.
Here is my input:
msisdn longitude latitude ts
1 20.30 40.50 123
1 0.0 null 456
2 60.70 34.67 678
2 null null 978

I need to:
group by msisdn
order by ts inside each group
filter the records in each group:
1. put all records where longitude and latitude are valid on one side
2. put all records where longitude/latitude is 0.0 or null on the other side

Here is my Pig pseudo-code:
rawRecords = LOAD '/data' as ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped {
    ordered = ORDER rawRecords BY ts;
    -- do something here to separate valid and not valid records...
}

STORE notValidRecords INTO '/not_valid_data';

someOtherProjection = GROUP validRecords BY msisdn;
-- continue to work with the filtered valid records...

Can I do this in a single Pig script, or do I need to create two scripts:
the first one would filter out the not valid records and store them,
and the second one would continue to process the filtered set of records?
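
One possible way to keep everything in a single script is Pig's SPLIT operator, which routes each record into one of several relations; a script may also contain more than one STORE. The sketch below is only an illustration under assumptions: the schema in the LOAD statement is a guess (the original elides it), a record is treated as not valid if either coordinate is null or 0.0, and the relation name orderedValid is hypothetical. Splitting before grouping means the not valid records can be stored right away, while ordering by ts happens inside each group of valid records.

```pig
-- Assumed schema; the original post does not show it.
rawRecords = LOAD '/data' AS (msisdn:long, longitude:double, latitude:double, ts:long);

-- Route each record to exactly one side. Both conditions are spelled out
-- explicitly so that null coordinates always land in notValidRecords.
SPLIT rawRecords INTO
    validRecords IF (longitude IS NOT NULL AND longitude != 0.0
                     AND latitude IS NOT NULL AND latitude != 0.0),
    notValidRecords IF (longitude IS NULL OR longitude == 0.0
                        OR latitude IS NULL OR latitude == 0.0);

-- First output: the rejected records.
STORE notValidRecords INTO '/not_valid_data';

-- Continue in the same script with the valid records only.
grouped = GROUP validRecords BY msisdn;
orderedValid = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered AS records;
};
```

With the sample input above, the rows with ts 123 and 678 would go to validRecords and the rows with ts 456 and 978 to notValidRecords, assuming the literal text "null" in the data file loads as a null double.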
