Hi,
I was comparing the performance of a Hadoop job I wrote in Java to one
I wrote in Pig. I have ~106,000 small (<1 MB) input files. In my Java
job, I get one split per file, which is really inefficient. In Pig, the
same input is processed in just 49 splits, which is much faster.
How does Pig do this? Can someone point me to the relevant part of the
source code? I've been banging my head on how to combine multiple S3
objects into a single split.
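For anyone searching later: I suspect the relevant knobs are Pig's split-combination properties. This is a sketch from my reading of the docs, not verified against every Pig version (property names assumed; combining was added around Pig 0.8):

```
# Assumed Pig properties for split combination -- check your version's docs
pig.splitCombination=true          # combine small input splits (believed on by default)
pig.maxCombinedSplitSize=134217728 # target size per combined split, in bytes (128 MB here)
```

If that's right, the Java-side equivalent would presumably be something like Hadoop's CombineFileInputFormat, which packs many small files into fewer splits.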
Thanks,
Brian