Hi everybody,
I am implementing a classifier that handles a large amount of data with naive Bayes, using EMR as the Hadoop cluster. By large amount of data I mean that the final models are around 45GB. While the feature extraction step works fine, calling TrainNaiveBayesJob results in the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at org.apache.mahout.math.map.OpenIntDoubleHashMap.rehash(OpenIntDoubleHashMap.java:491)
at org.apache.mahout.math.map.OpenIntDoubleHashMap.put(OpenIntDoubleHashMap.java:444)
at org.apache.mahout.math.RandomAccessSparseVector.setQuick(RandomAccessSparseVector.java:139)
at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:122)
at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1875)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:2007)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:95)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:38)
at com.google.common.collect.AbstractIterator.tryToComputeNext(Unknown Source)
at com.google.common.collect.AbstractIterator.hasNext(Unknown Source)
at com.google.common.collect.Iterators$5.hasNext(Unknown Source)
at com.google.common.collect.ForwardingIterator.hasNext(Unknown Source)
at org.apache.mahout.classifier.naivebayes.BayesUtils.readModelFromDir(BayesUtils.java:79)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:161)

It took me a little while to realize that the Naive Bayes MR job itself finishes fine on each of the reducers; the problem comes after the reduce step, when the job driver on the master node reads the models produced by the reducers, loads them into memory, validates them and, after validation, serializes the final model. My second thought was to increase the heap size of the namenode (using bootstrap actions in EMR: s3://elasticmapreduce/bootstrap-actions/configure-daemons --namenode-heap-size=60000), but even with that setup I get the same exception.
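For context, the pattern I believe is hitting the heap looks roughly like the sketch below: the driver walks the SequenceFile part files written by the reducers and pulls every vector into memory before anything is validated or serialized. This is only my illustration of the access pattern, not Mahout's actual code; the part-file glob, the class name and the key handling are assumptions on my side:

import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative sketch of the "load everything" pattern (not Mahout's actual code).
public class ModelLoadSketch {

  public static List<Vector> loadAll(String modelDir, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    List<Vector> inMemory = new ArrayList<Vector>();  // ends up holding ~45GB of model vectors
    for (FileStatus part : fs.globStatus(new Path(modelDir, "part-r-*"))) {  // assumed reducer output layout
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          inMemory.add(value.get());  // every vector stays on the driver heap
        }
      } finally {
        reader.close();
      }
    }
    return inMemory;  // validation and serialization only happen after this point
  }
}

If that is roughly what BayesUtils.readModelFromDir does, the whole 45GB has to fit in a single JVM's heap, which would explain the error.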
Has anybody dealt with a similar problem? Any suggestions (other than trying a bigger master node)?
Also, what is the rationale for loading the whole model into memory on the master? I understand the need for validation, but couldn't it be done in chunks of data instead of over the complete model, to avoid this scalability issue?
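To make the "chunks" idea concrete, what I have in mind is something like the sketch below: stream the same part files one record at a time and keep only running aggregates for the sanity checks, so at most one vector is in memory at any moment. Again, just a sketch under my own assumptions (paths, the zSum-based check, class names), not a proposed patch:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.ReflectionUtils;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Illustrative sketch of chunked/streaming validation (assumed checks, not a patch).
public class StreamingValidationSketch {

  public static void validate(String modelDir, Configuration conf) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    long vectors = 0;
    double totalWeight = 0.0;
    for (FileStatus part : fs.globStatus(new Path(modelDir, "part-r-*"))) {  // assumed reducer output layout
      SequenceFile.Reader reader = new SequenceFile.Reader(fs, part.getPath(), conf);
      try {
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        VectorWritable value = new VectorWritable();
        while (reader.next(key, value)) {
          Vector v = value.get();                // only the current vector is held in memory
          double sum = v.zSum();
          if (Double.isNaN(sum) || sum < 0.0) {  // example sanity check, assumed
            throw new IllegalStateException("bad weights in " + part.getPath());
          }
          vectors++;
          totalWeight += sum;
        }
      } finally {
        reader.close();
      }
    }
    System.out.println("validated " + vectors + " vectors, total weight " + totalWeight);
  }
}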
Thanks!