Hey All,
I’m struggling with the performance of algebraic aggregates. It seems that Pig always bags up tuples between the input and intermediate aggregate stages. For my workload those bags get large and spill to disk. The spilling itself seems to cause a lot of memory pressure, which leads to GC and a slowdown.
The aggregates I am computing are things like MAX, where I would really like to stream the records through the input stage and keep only the current max. Is this possible with Algebraic, Accumulator, or anything else?
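To make concrete what I mean by streaming, here is a rough sketch in plain Java. It is not Pig's API: the class StreamingMax and its methods are made-up names for illustration only. The point is that each record updates a single running value, so nothing is collected into a bag first.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.OptionalLong;

// Hypothetical illustration only; not a Pig interface or class.
public class StreamingMax {
    private boolean seen = false; // true once at least one value has arrived
    private long max;             // the only per-group state that is kept

    // Fold one record into the running max; no buffering of records.
    public void accumulate(long value) {
        if (!seen || value > max) {
            max = value;
            seen = true;
        }
    }

    // Empty if no records were seen, otherwise the largest value so far.
    public OptionalLong getValue() {
        return seen ? OptionalLong.of(max) : OptionalLong.empty();
    }

    // Drain an iterator one record at a time, keeping O(1) state.
    public static OptionalLong maxOf(Iterator<Long> records) {
        StreamingMax m = new StreamingMax();
        while (records.hasNext()) {
            m.accumulate(records.next());
        }
        return m.getValue();
    }

    public static void main(String[] args) {
        Iterator<Long> records = Arrays.asList(3L, -7L, 42L, 5L).iterator();
        System.out.println(maxOf(records).getAsLong()); // prints 42
    }
}
```

Memory use here is constant regardless of how many records pass through, which is the behavior I am hoping to get from the input stage.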
Thanks,
Adam