I am stumped by what is clearly my own misunderstanding of how Pig does a JOIN.
Data in 'data' file:
A1 B1
A1 B1
A2 B2
A2 B2
My little program:
A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray);
B1 = foreach A generate var1,var2;
B2 = foreach A generate var1,var2;
C = join B1 by (var1,var2), B2 by (var1,var2);
dump C;
What I expect to get:
(A1,B1,A1,B1)
(A2,B2,A2,B2)
What I actually get:
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
I can't understand why I'm getting 4 results instead of 1 for each pair. Clearly it's a 2*2 thing, so if I had 3 of each, I'm sure I'd end up with 9 per set. My question is: why the Cartesian join? JOIN in Pig must differ in some subtle way from JOIN in regular relational databases.
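For reference, here is a minimal sketch of a variant that I would expect to give one row per pair, assuming the duplicate rows in 'data' are what drives the 2*2 behaviour. The only change from my program is the DISTINCT step; the alias D is just a name I picked for illustration.

```pig
-- Same load as in my program above.
A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray);
-- Assumption: the repeated input rows are the cause, so collapse them first.
D = distinct A;
-- Two separate aliases over the de-duplicated data, as before.
B1 = foreach D generate var1,var2;
B2 = foreach D generate var1,var2;
-- Each key now appears once on each side, so each key matches 1*1 rows.
C = join B1 by (var1,var2), B2 by (var1,var2);
dump C;
```

This only sidesteps the duplicates; it does not explain whether matching every left row with every right row for the same key is the intended JOIN behaviour.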
Cameron Walker
Americas IT | AIG Property Casualty
Senior DBA, Enterprise Application Technology Services
200 South College
13th Floor
Charlotte, NC 28202
(Off): 704-338-7423
(Cell): 980-201-0496
mailto:cameron.walker1 [ at ] aig.com | http://www.aig.com