I have a bug that is killing me: I can't self-join or CROSS a dataset
with itself. It's blocking my work :(
The script is like this:
businesses = LOAD 'yelp_phoenix_academic_dataset/yelp_academic_dataset_business.json'
    using com.twitter.elephantbird.pig.load.JsonLoader() as json:map[];
/* {open=true, neighborhoods={}, review_count=14, stars=4.0, name=Drybar,
business_id=LcAamvosJu0bcPgEVF-9sQ, state=AZ, full_address=3172 E Camelback
Rd
Phoenix, AZ85018, categories={(Hair Salons),(Hair Stylists),(Beauty
Spas)}, longitude=-112.0131927, latitude=33.5107772, type=business,
city=Phoenix} */
locations = FOREACH businesses GENERATE $0#'business_id' AS business_id,
                                        $0#'longitude' AS longitude,
                                        $0#'latitude' AS latitude;
STORE locations INTO 'yelp_phoenix_academic_dataset/locations.tsv';
locations_2 = LOAD 'yelp_phoenix_academic_dataset/locations.tsv'
    AS (business_id:chararray, longitude:double, latitude:double);
location_comparisons = CROSS locations_2, locations;
distances = FOREACH businesses GENERATE locations.business_id AS business_id_1,
                                        locations_2.business_id AS business_id_2,
                                        udfs.haversine(locations.longitude,
                                                       locations.latitude,
                                                       locations_2.longitude,
                                                       locations_2.latitude) AS distance;
STORE distances INTO 'yelp_phoenix_academic_dataset/distances.tsv';
I have also tried converting this to a self-join using JOIN BY '1', and
also locations_2 = locations, and I get the same error:
*org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (rncjoVoEFUJGCUoC1JgnUA,-112.241596,33.581867), 2nd :(0FNFSzCFP_rGUoJx8W7tJg,-112.105933,33.604054)*
at org.apache.pig.impl.builtin.ReadScalars.exec(ReadScalars.java:111)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:336)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:438)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.getNext(PhysicalOperator.java:347)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:378)
at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:298)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.runPipeline(PigGenericMapBase.java:283)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:278)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigGenericMapBase.map(PigGenericMapBase.java:64)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
This makes no sense! What am I to do? I can't self-join :(