Channel: Apache Timeline

How does JOIN work?

I am stumped by what is clearly my own misunderstanding of how Pig does a JOIN.

Data in the 'data' file:

A1 B1
A1 B1
A2 B2
A2 B2

My little program:

A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray);

B1 = foreach A generate var1,var2;
B2 = foreach A generate var1,var2;

C = join B1 by (var1,var2), B2 by (var1,var2);

dump C;

What I expect to get:

(A1,B1,A1,B1)
(A2,B2,A2,B2)

What I actually get:

(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A1,B1,A1,B1)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)
(A2,B2,A2,B2)

I can't understand why I'm getting 4 results instead of 1 for each pair. Clearly it's a 2*2 thing, so if I had 3 of each, I'm sure I'd end up with 9 per set. My question is: why the Cartesian join? JOIN in Pig must differ subtly from JOIN in regular relational databases.
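
To make the 2*2 pattern concrete, here is a minimal sketch of a variant of the script, assuming the duplicate rows in 'data' are what multiply the matches. It removes duplicates from each input with DISTINCT before joining, so each key should appear once on each side. The aliases D1 and D2 are names introduced only for this illustration, and I have not confirmed the output.

```pig
A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray);

-- Assumption: the repeated rows in 'data' are the source of the 2*2 matches,
-- so each side of the join is deduplicated first.
D1 = distinct A;
D2 = distinct A;

-- Same join keys as in the original script.
C = join D1 by (var1,var2), D2 by (var1,var2);

dump C;
```

If that assumption holds, each (var1,var2) pair would match exactly one row on the other side, which is the one-row-per-pair result I expected.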

Cameron Walker
Americas IT | AIG Property Casualty
Senior DBA, Enterprise Application Technology Services
200 South College
13th Floor
Charlotte, NC 28202
(Off): 704-338-7423
(Cell): 980-201-0496
mailto:cameron.walker1 [ at ] aig.com | http://www.aig.com
