Hi everyone,
I have a question.
As far as I understand from the book "Programming Pig", after a GROUP
all the records with the same key go to the same reducer. So far so
good.
This allows us to write a statement like this:
*foreach grpd generate group, COUNT(input)*
which should count the elements *per* key.
Then comes my issue. I have a script like this:
B = GROUP A BY key PARALLEL p;
C = FILTER B BY NOT IsEmpty(A);
D = FOREACH C GENERATE FLATTEN(MyFunction(A)) AS (mySchema);
If I go through all the tuples in the bag handed to *MyFunction*, I see
elements with different keys (although they are sorted)! Am I doing
something wrong? What am I missing here?
So far, I'm working around this by detecting when the key changes and then
doing my computation on a per-key basis. But I'm not sure whether this is OK
or just a hack.
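To make the workaround concrete, here is a rough sketch of the idea in
Python. It is not my actual UDF: the function name process_per_key, the
tuple layout (key in the first field) and the per-key count are all
placeholders I made up for illustration. It only works if tuples with the
same key are adjacent in the bag, which is what I observe but not something
I know Pig guarantees.

```python
from itertools import groupby
from operator import itemgetter


def process_per_key(bag, key_index=0):
    """Split a key-sorted bag into runs of equal keys and process each run.

    Assumes tuples with the same key are adjacent; groupby() starts a new
    run every time the key changes.
    """
    results = []
    for key, run in groupby(bag, key=itemgetter(key_index)):
        rows = list(run)
        # Placeholder for the real per-key computation: just count the rows.
        results.append((key, len(rows)))
    return results


# Hypothetical bag contents, sorted by key as I see them in MyFunction.
bag = [("a", 1), ("a", 2), ("b", 3), ("c", 4), ("c", 5)]
print(process_per_key(bag))  # [('a', 2), ('b', 1), ('c', 2)]
```

If the bag really held a single key, as I expected after GROUP, the loop
would run exactly once and the key-change check would be unnecessary.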
Thank you!
Rodrigo Ferreira.