Hi,
I'm a noob and trying to run the Wikipedia Bayes example on EC2 (using a
CDH 4.5 setup). I've searched the archives and haven't been able to find
info on this, so I apologize if this is a duplicate question.
The Cloudera install comes with Mahout 0.7.
I've run into a few snags on the first step (chunking the data into
pieces). The first was that it couldn't find wikipediaXMLSplitter, but
substituting org.apache.mahout.text.wikipedia.WikipediaXmlSplitter in
the command got past that error (just changing the capitalization
wasn't enough).
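For reference, the command I ended up with looks roughly like this (the dump filename, output path, and chunk size below are placeholders taken from the example docs, not necessarily my exact values):

```shell
export MAHOUT_HEAPSIZE=5000
bin/mahout org.apache.mahout.text.wikipedia.WikipediaXmlSplitter \
  -d enwiki-latest-pages-articles.xml \
  -o wikipedia/chunks \
  -c 64
```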
However, I am now stuck: I'm getting a java.lang.OutOfMemoryError: Java
heap space.
I upped MAHOUT_HEAPSIZE to 5000 and am still getting the same error.
See the full error here: http://pastebin.com/P5PYuR8U (I added a print
statement to bin/mahout just to confirm that my export of
MAHOUT_HEAPSIZE was being picked up, and it was.)
I'm wondering whether some other setting, perhaps one of the Hadoop- or
Cloudera-specific ones, is overriding MAHOUT_HEAPSIZE?
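For context, here is a paraphrased sketch of how I understand the 0.7-era bin/mahout script applies the variable (the CDH-packaged script may differ slightly from this):

```shell
# Paraphrased sketch of the heap logic in a 0.7-era bin/mahout script;
# the CDH-packaged copy may not match this exactly.
MAHOUT_HEAPSIZE=5000                        # the value I exported
JAVA_HEAP_MAX=-Xmx3g                        # script default
if [ "$MAHOUT_HEAPSIZE" != "" ]; then
  JAVA_HEAP_MAX="-Xmx${MAHOUT_HEAPSIZE}m"   # e.g. 5000 becomes -Xmx5000m
fi
```

So with my export, the JVM should be launched with -Xmx5000m, unless something later in the chain replaces it.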
Does anyone have any experience with this or suggestions?
Thank you,
Jessie Wright