Hi, I have a rather simple problem, but I can't come up with a nice solution.
Here is my input:
msisdn longitude latitude ts
1 20.30 40.50 123
1 0.0 null 456
2 60.70 34.67 678
2 null null 978
I need to:
group by msisdn
order by ts inside each group
split the records in each group:
1. put all records where longitude and latitude are valid on one side
2. put all records where longitude or latitude is 0.0 or null on the other side
Here is Pig pseudo-code:
rawRecords = LOAD '/data' AS ...;
grouped = GROUP rawRecords BY msisdn;
validAndNotValidRecords = FOREACH grouped {
    ordered = ORDER rawRecords BY ts;
    -- do something here to separate valid and not valid records...
};
STORE notValidRecords INTO '/not_valid_data';
someOtherProjection = GROUP validRecords BY msisdn;
-- continue to work with the filtered valid records...
Can I do this in a single Pig script, or do I need to create two scripts:
the first one would filter out the not valid records and store them,
the second one would continue to process the filtered set of records?
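For reference, the single-script variant I have in mind might look roughly like the sketch below. It is only a sketch under assumptions: the field types in the LOAD schema are my guess (the real schema is elided above), it relies on the default loader turning unparseable values into nulls, and orderedPerMsisdn is just an illustrative alias. It uses SPLIT before grouping instead of filtering inside the nested FOREACH, and it stores one branch while continuing to work with the other.

```pig
-- Assumed schema: the types are a guess, the original LOAD elides them.
rawRecords = LOAD '/data' AS (msisdn:chararray, longitude:double, latitude:double, ts:long);

-- Route each record to exactly one branch. Both conditions are spelled out
-- explicitly so that null coordinates always land in notValidRecords.
SPLIT rawRecords INTO
    validRecords IF (longitude IS NOT NULL AND latitude IS NOT NULL
                     AND longitude != 0.0 AND latitude != 0.0),
    notValidRecords IF (longitude IS NULL OR latitude IS NULL
                        OR longitude == 0.0 OR latitude == 0.0);

-- Store the rejected records; the script keeps going after this statement.
STORE notValidRecords INTO '/not_valid_data';

-- Continue with the valid records only: group by msisdn, order by ts per group.
grouped = GROUP validRecords BY msisdn;
orderedPerMsisdn = FOREACH grouped {
    ordered = ORDER validRecords BY ts;
    GENERATE group AS msisdn, ordered AS records;
};
```

As far as I understand, Pig allows several STORE statements in one script, so the not valid records would be written out while the valid branch is processed further. Whether this is the recommended way is exactly what I am asking.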