Unique Self Cross Optimization

My input data is a set of aliases, each with many identifying attributes.
There are ~1E8 aliases and ~1E5 distinct attributes. I am attempting to
generate a network of commutative alias-alias pairings that share at least
one attribute. The vast majority of attributes map to a relatively small
number of aliases (~1E3). A few (<1% of attributes), however, map to a
number of aliases on the order of the entire input alias set (~1E8).

I am running into an issue with the tasks that handle these large (<1%)
attributes. The reducers for some of these tasks take many orders of
magnitude longer to complete than the other 99% (many hours versus
minutes). A representation of the script is below (Pig 0.11.2):

SET default_parallel $REDUCERS;
SET pig.schematuple true;
SET pig.exec.mapPartAgg true;
SET output.compression.enabled true;
SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
X = LOAD '$INPUT/user_item' USING PigStorage() AS (alias:chararray, attributeURI:chararray);
A1 = FOREACH X GENERATE *;
A2 = FOREACH X GENERATE *;
A3 = JOIN A1 BY (attributeURI), A2 BY (attributeURI);
A4 = FILTER A3 BY (A1::alias != A2::alias);
-- projection because X contains other fields not shown here
A5 = FOREACH A4 GENERATE A1::alias, A2::alias;
A6 = DISTINCT A5;
STORE A6 INTO '$OUTPUT/network' USING PigStorage();

Here, the reduce steps for A4 and A5 take far longer on a handful of
reducer tasks, likely because of the <1% attributes described above. Is
there a better way to write or optimize this script?
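
To put rough numbers on the skew, here is a back-of-the-envelope sketch in Python (not part of the Pig job). It assumes that a self-join on one attribute shared by n aliases, followed by the alias != alias filter, leaves n * (n - 1) ordered pairs. The function name ordered_pairs is made up for illustration, and the sizes are the approximate figures quoted above.

```python
def ordered_pairs(n_aliases: int) -> int:
    """Rows left for one attribute after the self-join and the alias != alias filter."""
    # Every alias pairs with every other alias, in both directions.
    return n_aliases * (n_aliases - 1)


typical = ordered_pairs(10**3)  # an attribute shared by ~1E3 aliases
skewed = ordered_pairs(10**8)   # an attribute shared by ~1E8 aliases

print(typical)
print(skewed)
print(skewed // typical)        # how many times more rows the skewed key yields
```

Under these assumptions a typical attribute yields about 1E6 pairs while a single skewed attribute yields about 1E16, which would be consistent with a handful of reducers running for hours.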

An example of the input X:
aa, cat
aa, dog
bb, dog
bb, bear
cc, cat
dd, bird

An example of the output A6:
aa, bb
aa, cc
bb, aa
cc, aa
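
For reference, the intended join / filter / distinct logic can be modelled on the sample input with a small in-memory Python sketch. This is only a model of the semantics, not the Pig job and not an optimization; the function name alias_network is hypothetical. Note that dd has only the attribute bird, which no other alias shares, so dd appears in no pair.

```python
from collections import defaultdict
from itertools import permutations


def alias_network(rows):
    """Return sorted, distinct ordered alias pairs that share at least one attribute."""
    # Group aliases by attribute (the role of the JOIN key).
    by_attr = defaultdict(set)
    for alias, attr in rows:
        by_attr[attr].add(alias)

    pairs = set()  # a set gives the DISTINCT behaviour
    for aliases in by_attr.values():
        # permutations of length 2 over distinct aliases never pairs an alias
        # with itself, which mirrors the A1::alias != A2::alias filter.
        pairs.update(permutations(sorted(aliases), 2))
    return sorted(pairs)


sample = [
    ("aa", "cat"), ("aa", "dog"),
    ("bb", "dog"), ("bb", "bear"),
    ("cc", "cat"), ("dd", "bird"),
]

for left, right in alias_network(sample):
    print(f"{left}, {right}")
```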

Many Thanks. -Dan
