wikipedia bayes quickstart example on EC2 (cloudera)

Hi,

I'm a noob trying to run the Wikipedia Bayes example on EC2 (using a
CDH 4.5 setup). I've searched the archives and haven't been able to find
any info on this. I apologize if this is a duplicate question.

The cloudera install comes with Mahout 0.7.

I've run into a few snags on the first step (chunking the data into
pieces). The first was that it couldn't find wikipediaXMLSplitter, but
substituting the fully qualified class name
org.apache.mahout.text.wikipedia.WikipediaXmlSplitter in the command got
past that error (just changing the capitalization wasn't enough).

However, I am now stuck: I'm getting a java.lang.OutOfMemoryError: Java
heap space.
I upped MAHOUT_HEAPSIZE to 5000 and am still getting the same error.
The full error is here: http://pastebin.com/P5PYuR8U (I added a print
statement to the mahout/bin script just to confirm that my export of
MAHOUT_HEAPSIZE was being picked up.)
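
A minimal sketch of that kind of check, assuming the launcher script
converts MAHOUT_HEAPSIZE (in megabytes) into a -Xmx flag and falls back
to a built-in default otherwise. The function name heap_flag and the 3g
default are illustrative assumptions, not copied from the CDH script:

```shell
#!/bin/sh
# Hypothetical helper: echo the JVM heap flag a launcher would derive
# from MAHOUT_HEAPSIZE (interpreted as megabytes).
heap_flag() {
  if [ -n "${MAHOUT_HEAPSIZE:-}" ]; then
    echo "-Xmx${MAHOUT_HEAPSIZE}m"
  else
    echo "-Xmx3g"   # assumed default; check the script on your install
  fi
}

export MAHOUT_HEAPSIZE=5000
# Show which flag the exported value maps to.
echo "MAHOUT_HEAPSIZE=${MAHOUT_HEAPSIZE} -> $(heap_flag)"
```

This only confirms that the variable is exported and what flag it would
become. It does not show that the JVM throwing the error was actually
started with that flag.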

I'm wondering whether some other setting is overriding
MAHOUT_HEAPSIZE. Perhaps one of the Hadoop- or Cloudera-specific ones?
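
One way to narrow this down, sketched under the assumption that the job
is launched through the hadoop command, in which case the client JVM's
heap may come from Hadoop's own environment variables rather than from
MAHOUT_HEAPSIZE. HADOOP_HEAPSIZE, HADOOP_OPTS and HADOOP_CLIENT_OPTS are
standard Hadoop variables and MAHOUT_LOCAL is Mahout's local-mode switch;
whether any of them is the culprit on this CDH setup is unverified. The
helper name show_heap_settings is hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: print every environment variable that could plausibly
# influence the heap of the JVM that runs the splitter.
show_heap_settings() {
  for name in MAHOUT_HEAPSIZE MAHOUT_LOCAL HADOOP_HEAPSIZE HADOOP_OPTS HADOOP_CLIENT_OPTS; do
    # Indirect lookup of the variable named in $name (names are fixed above).
    eval "value=\${$name:-}"
    if [ -n "$value" ]; then
      echo "$name=$value"
    else
      echo "$name is unset or empty"
    fi
  done
}

show_heap_settings
```

If one of the Hadoop variables carries its own -Xmx value, that would be
a candidate for what is taking precedence.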

Does anyone have any experience with this or suggestions?

Thank you,

Jessie Wright