Channel: Apache Timeline

Combining small S3 inputs

Hi,
I was comparing performance of a Hadoop job that I wrote in Java to one
that I wrote in Pig. I have ~106,000 small (< 1 MB) input files. In my Java
job, I get one split per file, which is really inefficient. In Pig, the same
input is processed in 49 splits, which is much faster.

How does Pig do this? Is there a part of the source code that you can
point me to? I keep banging my head against how to combine multiple S3
objects into a single split.

Thanks,
Brian
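
For illustration only, here is a minimal sketch of the general idea behind combining many small inputs into fewer splits: walk the file list and keep adding files to the current split until a size cap would be exceeded. This is not Pig's actual code. The class name SplitPacker, the method pack, and the 1 MiB cap are hypothetical names and values chosen for the example. In a real Java job, Hadoop's own CombineFileInputFormat (or its text subclass CombineTextInputFormat) is probably the place to start looking.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitPacker {
    /**
     * Groups file sizes (in bytes) into splits, in input order, so that each
     * split's total stays at or below maxSplitBytes. A single file larger
     * than the cap still gets a split of its own.
     */
    public static List<List<Long>> pack(List<Long> fileSizes, long maxSplitBytes) {
        List<List<Long>> splits = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentBytes = 0;
        for (long size : fileSizes) {
            // Close the current split if this file would push it past the cap.
            if (!current.isEmpty() && currentBytes + size > maxSplitBytes) {
                splits.add(current);
                current = new ArrayList<>();
                currentBytes = 0;
            }
            current.add(size);
            currentBytes += size;
        }
        // Flush the last, partially filled split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Long> sizes = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            sizes.add(400L * 1024); // ten hypothetical 400 KiB files
        }
        // Two 400 KiB files fit under a 1 MiB cap, a third does not.
        System.out.println(pack(sizes, 1024L * 1024).size() + " splits"); // prints "5 splits"
    }
}
```

With one split per file, the ten files above would need ten map tasks. Packing them under a size cap brings that down to five, and a larger cap would reduce it further.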
