Hello everyone,
I have a FOREACH statement with a nested ORDER BY, and the sorted bag is then passed to a UDF. For example:
logs = LOAD 'raw_data' USING org.apache.hcatalog.pig.HCatLoader();
logs_g = GROUP logs BY (date, site, profile) PARALLEL 2;
service_flavors = FOREACH logs_g {
    t = ORDER logs BY status;
    GENERATE group.date AS dates, group.site AS site, group.profile AS profile,
             FLATTEN(MY_UDF(t)) AS (generic_status);
};
The problem is that I get duplicate results. I understand that MY_UDF runs on the mappers, but shouldn't each mapper take one group from logs_g? Is something wrong with the ORDER BY? I tried adding PARALLEL to the ORDER BY, but I get syntax errors.
The problem goes away if I use GROUP logs BY (date, site, profile) PARALLEL 1; but that is not a scalable solution. Can someone please help? I am using Pig 0.11.
Cheers,
Anastasis