Hi,
I'm having trouble with a simple task that I believe someone out there must
have already solved some day.
I'm trying to group and count the frequency of terms for each group in
PigLatin, but I'm having some troubles to figure it out how to do it.
I have a collection of objects with the following schema:
{cluster_id: bytearray,terms: chararray}
And here is some samples
(10, smerter)
(10, graviditeten)
(10, smerter)
(10, smerter)
(10, udemærket)
(20, eis feuer)
(20, herunterladen schau)
(20, download gratis)
(20, download gratis)
(30, anschauen kinofilm)
(30, kauf rechnung)
(30, kauf rechnung)
(30, versandkostenfreie lieferung)
(30, kostenlose)
(30, kostenlose)
(30, kostenlose)
the result I m trying to get is something like this
(10, smerter, 3)
(10, graviditeten, 2)
(10, udemærket, 1)
(20, download gratis, 2)
(20, eis feuer, 1)
(20, herunterladen schau, 1)
(30, kostenlose, 3)
(30, kauf rechnung, 2)
(30, anschauen kinofilm, 1)
(30, versandkostenfreie lieferung, 1)
What would be the best way to do that? The following code groups by id and
count the terms, but I wanted to count the terms for each group.
by_clusters = GROUP sample_data by cluster_id;
by_clusters_terms_count = FOREACH by_clusters GENERATE group as
cluster_id, COUNT($1);
I make the grouping like this I end up with an object with the following
schema
by_clusters: {group: bytearray,sample_data: {(cluster_id:
bytearray,terms: chararray)}}
Now, I get to the point to actually count the terms inside the
'sample_data' tuple. I'm thinking about nested foreach, but I still didn't
get it how could I apply it in this case. The code would be something like
the following:
result = FOREACH by_clusters {
--count terms here, I don't know how
-- compiler gives me an error here
c = GROUP $1 BY terms; --
d = FOREACH c GENERATE COUNT(b), group;
GENERATE cluster_id, d;
Error I get:
ERROR 1200: Syntax error, unexpected symbol at or near '$1
Finally, I think I'm close, but I'm unable to solve it. I don't believe
I'll have to write an UDF in this case.
Arian
I'm having trouble with a simple task that I believe someone out there must
have already solved some day.
I'm trying to group and count the frequency of terms for each group in
PigLatin, but I'm having some troubles to figure it out how to do it.
I have a collection of objects with the following schema:
{cluster_id: bytearray,terms: chararray}
And here is some samples
(10, smerter)
(10, graviditeten)
(10, smerter)
(10, smerter)
(10, udemærket)
(20, eis feuer)
(20, herunterladen schau)
(20, download gratis)
(20, download gratis)
(30, anschauen kinofilm)
(30, kauf rechnung)
(30, kauf rechnung)
(30, versandkostenfreie lieferung)
(30, kostenlose)
(30, kostenlose)
(30, kostenlose)
the result I m trying to get is something like this
(10, smerter, 3)
(10, graviditeten, 2)
(10, udemærket, 1)
(20, download gratis, 2)
(20, eis feuer, 1)
(20, herunterladen schau, 1)
(30, kostenlose, 3)
(30, kauf rechnung, 2)
(30, anschauen kinofilm, 1)
(30, versandkostenfreie lieferung, 1)
What would be the best way to do that? The following code groups by id and
count the terms, but I wanted to count the terms for each group.
by_clusters = GROUP sample_data by cluster_id;
by_clusters_terms_count = FOREACH by_clusters GENERATE group as
cluster_id, COUNT($1);
I make the grouping like this I end up with an object with the following
schema
by_clusters: {group: bytearray,sample_data: {(cluster_id:
bytearray,terms: chararray)}}
Now, I get to the point to actually count the terms inside the
'sample_data' tuple. I'm thinking about nested foreach, but I still didn't
get it how could I apply it in this case. The code would be something like
the following:
result = FOREACH by_clusters {
--count terms here, I don't know how
-- compiler gives me an error here
c = GROUP $1 BY terms; --
d = FOREACH c GENERATE COUNT(b), group;
GENERATE cluster_id, d;
Error I get:
ERROR 1200: Syntax error, unexpected symbol at or near '$1
Finally, I think I'm close, but I'm unable to solve it. I don't believe
I'll have to write an UDF in this case.
Arian