Hi everybody,
I am implementing a classifier that handles a large amount of data with naive Bayes, using EMR as the Hadoop cluster. By large amount of data I mean that the final models are around 45GB. While the feature extraction step works fine, calling TrainNaiveBayesJob results in the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:491)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:444)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:122)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1875)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2007)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
at com.google.common.collect.AbstractIterator.tryToComputeNext(Unknown Source)
at com.google.common.collect.AbstractIterator.hasNext(Unknown Source)
at com.google.common.collect.Iterators$5.hasNext(Unknown Source)
at com.google.common.collect.ForwardingIterator.hasNext(Unknown Source)
at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:79)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:161)

It took me a little while to realize that the Naive Bayes MR job itself finishes fine on each of the reducers; the problem comes after the reduce step, when the job driver on the master node reads the models produced by the reducers, loads them into memory, validates them and, after validation, serializes the final model. My second thought was to increase the heap size of the namenode (using bootstrap actions in EMR: s3://elasticmapreduce/bootstrap-actions/configure-daemons --namenode-heap-size=60000), but even with that setup I get the same exception.
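For context, the pattern I believe is hitting the heap looks roughly like the sketch below: the driver walks the SequenceFile part files written by the reducers and pulls every vector into memory before anything is validated or serialized. This is only my illustration of the access pattern, not Mahout's actual code; the part-file glob, the class name and the key handling are assumptions on my side:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative sketch of the "load everything" pattern (not Mahout's actual code).
public class ModelLoadSketch {

  public static List<Vector> loadAll(String modelDir, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    List<Vector> inMemory = new ArrayList<Vector>();  // ends up holding ~45GB of model vectors
    for (FileStatus part : fs.globStatus(new Path(modelDir, "part-r-*"))) {  // assumed reducer output layout
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          inMemory.add(value.get());  // every vector stays on the driver heap
        }
      } finally {
        reader.close();
      }
    }
    return inMemory;  // validation and serialization only happen after this point
  }
}

If that is roughly what BayesUtils.readModelFromDir does, the whole 45GB has to fit in a single JVM's heap, which would explain the error.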
Has anybody dealt with a similar problem? Any suggestions (other than trying a bigger master node)?
Also, what is the rationale for loading the whole model into memory on the master? I understand the need for validation, but couldn't it be done in chunks of data instead of over the complete model, to avoid this scalability issue?
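To make the "chunks" idea concrete, what I have in mind is something like the sketch below: stream the same part files one record at a time and keep only running aggregates for the sanity checks, so at most one vector is in memory at any moment. Again, just a sketch under my own assumptions (paths, the zSum-based check, class names), not a proposed patch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative sketch of chunked/streaming validation (assumed checks, not a patch).
public class StreamingValidationSketch {

  public static void validate(String modelDir, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    long vectors = 0;
    double totalWeight = 0.0;
    for (FileStatus part : fs.globStatus(new Path(modelDir, "part-r-*"))) {  // assumed reducer output layout
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          Vector v = value.get();                // only the current vector is held in memory
          double sum = v.zSum();
          if (Double.isNaN(sum) || sum < 0.0) {  // example sanity check, assumed
            throw new IllegalStateException("bad weights in " + part.getPath());
          }
          vectors++;
          totalWeight += sum;
        }
      } finally {
        reader.close();
      }
    }
    System.out.println("validated " + vectors + " vectors, total weight " + totalWeight);
  }
}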
Thanks!