*I am loading three data sources with schemas like:*
Source 1 (main data source):
id, id1, userId, type
Source 2 (supporting source used for filtering):
partition_number, id, id1
Source 3 (static set source with all allowed types):
type
*I am using the following Pig script to count the unique userIds by type:*
grains = CROSS source2, source3;
users = JOIN
    grains BY (source2::id, source2::id1, source3::type) LEFT OUTER,
    source1 BY (id, id1, type);
usersGrouped = GROUP users
    BY (grains::source2::partition_number,
        grains::source2::id,
        grains::source2::id1,
        grains::source3::type)
    PARTITION BY MyCustomPartitioner PARALLEL 32;
counts = FOREACH usersGrouped {
    userCount = DISTINCT users.(source1::userId);
    GENERATE FLATTEN(group), COUNT(userCount);
};
STORE counts INTO 'output';
*My partitioner is quite simple (it just maps the input partition number, 1
to 32, to an HDFS output partition, 0 to 31):*
import static com.google.common.base.Preconditions.checkState;

import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.NullableTuple;
import org.apache.pig.impl.io.PigNullableWritable;

public class MyCustomPartitioner
        extends Partitioner<PigNullableWritable, NullableTuple> {
    @Override
    public int getPartition(PigNullableWritable partitionWritable,
                            NullableTuple valueWritable, int numPartitions) {
        // The group key stringifies as "(partitionNumber,...)"; parse the
        // leading number, i.e. the text between '(' and the first ','.
        String partition = partitionWritable.getValueAsPigType().toString();
        int inputPartitionNum =
                Integer.parseInt(partition.substring(1, partition.indexOf(',')));
        // Input partitions are 1-based; reducer partitions are 0-based.
        int hdfsPartitionNum = inputPartitionNum - 1;
        checkState(hdfsPartitionNum >= 0 && hdfsPartitionNum < numPartitions,
                "Invalid partition chosen: " + hdfsPartitionNum);
        return hdfsPartitionNum;
    }
}
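
*For reference, here is a minimal stand-alone sketch (hypothetical names,
and assuming the group key's toString() renders the tuple as
"(partitionNumber,field2,...)") of the mapping the parsing above relies on:*

public class PartitionParseDemo {
    // Same extraction as getPartition(): the text between '(' and the first ','.
    static int parsePartition(String keyString) {
        return Integer.parseInt(keyString.substring(1, keyString.indexOf(',')));
    }

    public static void main(String[] args) {
        String key = "(1,someId,someId1,someType)"; // assumed key format
        int hdfsPartition = parsePartition(key) - 1; // 1-based -> 0-based
        System.out.println(hdfsPartition);           // prints 0 -> part-r-00000
    }
}

*Note that this parsing assumes every key string starts with "(" and
contains a comma; if a key ever serialized differently, the extracted
number would be wrong.*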
So, data from input partition 1 should always end up in the part-r-00000
file, partition 2 data in part-r-00001, and so on. But sometimes partition 1
data lands in part-r-00005 (or some other seemingly random partition). This
does not happen for all data sets, only for some, and even then only
intermittently. I am using Hadoop 2.3 with Pig 0.13. What could be the issue
here?
Thanks,
Shakti