I'm using pig 0.11.2.
I had been processing ASCII files of JSON with this schema: (key:chararray,
columns:bag {column:tuple (timeUUID:chararray, value:chararray,
timestamp:long)})
For what it's worth, this is cassandra data, at a fairly low level.
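The load itself is just a one-liner through my loader; roughly like this (the loader class name and path here are placeholders, not my real ones):
A = LOAD '/data/sstables/part-*' USING com.example.JsonSSTableLoader()
    AS (key:chararray, columns:bag {column:tuple (timeUUID:chararray, value:chararray, timestamp:long)});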
But this was getting big, so I compressed it all with gzip (my "ETL"
process is already chunking the data into 1GB parts, making the .gz files
~100MB).
As a sanity check, I did a quick pre/post comparison, and the numbers
don't match. Since then I've done a lot of messing around trying to figure
out why, and I'm getting more and more puzzled.
My "quick check" was to get an overall count. It looked like (assuming A
is a LOAD given the schema above):
allGrp = GROUP A ALL;
aCount = FOREACH allGrp GENERATE group, COUNT(A);
DUMP aCount;
Basically, the original data returned a count GREATER than the compressed
data's count (not by a lot, but still...).
Then I uncompressed all of the compressed files and did a size check of
original vs. uncompressed: they were the same. Then I "quick checked" the
uncompressed files, and that count was == the original! So the way in which
pig processes the gzip'ed data is actually somehow different.
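To be explicit about the comparison, I ran the same count against each input; something like this, assuming origA is loaded from the original files and compA from the .gz files (the relation names are just for illustration):
origGrp = GROUP origA ALL;
origCount = FOREACH origGrp GENERATE COUNT(origA);
compGrp = GROUP compA ALL;
compCount = FOREACH compGrp GENERATE COUNT(compA);
DUMP origCount;
DUMP compCount;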
Then I tried to see if there were nulls floating around, so I loaded "orig"
and "comp" and tried to catch the "missing keys" with outer joins:
joined = JOIN orig BY key LEFT OUTER, comp BY key;
filtered = FILTER joined BY (comp::key is null);
And filtered was empty! I then tried the reverse (which makes no sense, I
know, as this was the smaller set), and filtered was still empty!
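To be concrete, by "the reverse" I mean swapping the two relations, roughly like this (relation names are just for illustration):
joined2 = JOIN comp BY key LEFT OUTER, orig BY key;
filtered2 = FILTER joined2 BY (orig::key is null);
DUMP filtered2;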
All of these loads are through a custom UDF that extends LoadFunc, but
there isn't much to that UDF (and it's been in use for many months now).
Basically, the "raw" data is JSON (from Cassandra's sstable2json program),
and I parse the JSON and turn it into the pig structure with the schema
noted above.
Does anything make sense here?
Thanks!
will