I'm using pig 0.11.2.
I had been processing ASCII files of JSON with this schema: (key:chararray,
columns:bag {column:tuple (timeUUID:chararray, value:chararray,
timestamp:long)})
For what it's worth, this is cassandra data, at a fairly low level.
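The load itself is just a one-liner through my loader; roughly like this (the loader class name and path here are placeholders, not my real ones):
A = LOAD '/data/sstables/part-*' USING com.example.JsonSSTableLoader()
    AS (key:chararray, columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)});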
But this was getting big, so I compressed it all with gzip (my "ETL"
process is already chunking the data into 1GB parts, making the .gz files
~100MB).
As a sanity check, I did a quick pre/post comparison, and the numbers
don't match. Since then I've done a lot of messing around trying to figure
out why, and I'm getting more and more puzzled.
My "quick check" was to get an overall count. It looked like (assuming A
is a LOAD given the schema above):
allGrp = GROUP A ALL;
aCount = FOREACH allGrp GENERATE group, COUNT(A);
DUMP aCount;
Basically, the original data returned a count GREATER than the compressed
data's count (not by a lot, but still...).
Then I uncompressed all of the compressed files and did a size check of
original vs. uncompressed: they were the same. Then I "quick checked" the
uncompressed files, and that count was == the original! So the way in which
pig processes the gzip'ed data is actually somehow different.
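To be explicit about the comparison, I ran the same count against each input; something like this, assuming origA is loaded from the original files and compA from the .gz files (the relation names are just for illustration):
origGrp = GROUP origA ALL;
origCount = FOREACH origGrp GENERATE COUNT(origA);
compGrp = GROUP compA ALL;
compCount = FOREACH compGrp GENERATE COUNT(compA);
DUMP origCount;
DUMP compCount;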
Then I tried to see if there were nulls floating around, so I loaded "orig"
and "comp" and tried to catch the "missing keys" with outer joins:
joined = JOIN orig BY key LEFT OUTER, comp BY key;
filtered = FILTER joined BY (comp::key is null);
And filtered was empty! I then tried the reverse (which makes no sense, I
know, as this was the smaller set), and filtered was still empty!
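To be concrete, by "the reverse" I mean swapping the two relations, roughly like this (relation names are just for illustration):
joined2 = JOIN comp BY key LEFT OUTER, orig BY key;
filtered2 = FILTER joined2 BY (orig::key is null);
DUMP filtered2;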
All of these loads are through a custom UDF that extends LoadFunc, but
there isn't much to that UDF (and it's been in use for many months now).
Basically, the "raw" data is JSON (from Cassandra's sstable2json program),
and I parse the JSON and turn it into the pig structure with the schema
noted above.
Does anything make sense here?
Thanks!
will