Hi,
the log below shows an issue that started occurring only "recently" (I
hadn't run tests with this somewhat larger dataset (320K documents) for
some time, and when I did today, I got this "all of a sudden").
I'm using Mahout 0.9-cdh5.2.0-SNAPSHOT (yes, it's Cloudera, but as far
as I can tell, that's vanilla Mahout in the community edition I use).
As far as I can tell, the failure happens in the middle of seq2sparse,
and all three - the input, the output, and the MR job - are generated by
Mahout; none of my own code is involved.
It would be great if anyone could point me to the source of this error.
thanks and kind regards
reinis.
SETTINGS OF SEQ2SPARSE
{"--analyzerName", "com.myproj.quantify.ticket.text.TicketTextAnalyzer",
"--chunkSize", "200",
"--output", finalDir,
"--input", ticketTextsOutput.toString,
"--minSupport", "2",
"--minDF", "2",
"--maxDFPercent", "85",
"--weight", "tfidf",
"--minLLR", "50",
"--maxNGramSize", "3",
"--norm", "2",
"--namedVector", "--sequentialAccessVector", "--overwrite"}
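For reference, here is a sketch of the same settings as a `mahout seq2sparse` command-line invocation; `<input>` and `<output>` are placeholders for the `ticketTextsOutput` and `finalDir` variables above, which I haven't expanded here:

```shell
# Equivalent CLI form of the seq2sparse options listed above.
# <input>/<output> are placeholders for ticketTextsOutput and finalDir.
mahout seq2sparse \
  --input <input> \
  --output <output> \
  --analyzerName com.myproj.quantify.ticket.text.TicketTextAnalyzer \
  --chunkSize 200 \
  --minSupport 2 \
  --minDF 2 \
  --maxDFPercent 85 \
  --weight tfidf \
  --minLLR 50 \
  --maxNGramSize 3 \
  --norm 2 \
  --namedVector --sequentialAccessVector --overwrite
```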
LOG
14/07/12 16:46:16 INFO vectorizer.SparseVectorsFromSequenceFiles: Creating Term Frequency Vectors
14/07/12 16:46:16 INFO vectorizer.DictionaryVectorizer: Creating dictionary from /quantify/ticket/text/final/tokenized-documents and saving at /quantify/ticket/text/final/wordcount
14/07/12 16:46:16 INFO client.RMProxy: Connecting to ResourceManager at hadoop1
14/07/12 16:46:17 INFO input.FileInputFormat: Total input paths to process : 1
14/07/12 16:46:17 INFO mapreduce.JobSubmitter: number of splits:2
14/07/12 16:46:17 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1404888747437_0074
14/07/12 16:46:17 INFO impl.YarnClientImpl: Submitted application application_1404888747437_0074
14/07/12 16:46:17 INFO mapreduce.Job: The url to track the job: http://hadoop1:8088/proxy/application_1404888747437_0074/
14/07/12 16:46:17 INFO mapreduce.Job: Running job: job_1404888747437_0074
14/07/12 16:46:30 INFO mapreduce.Job: Job job_1404888747437_0074 running in uber mode : false
14/07/12 16:46:30 INFO mapreduce.Job: map 0% reduce 0%
14/07/12 16:46:41 INFO mapreduce.Job: map 6% reduce 0%
14/07/12 16:46:44 INFO mapreduce.Job: map 10% reduce 0%
14/07/12 16:46:47 INFO mapreduce.Job: map 11% reduce 0%
14/07/12 16:46:48 INFO mapreduce.Job: map 14% reduce 0%
14/07/12 16:46:50 INFO mapreduce.Job: map 15% reduce 0%
14/07/12 16:46:51 INFO mapreduce.Job: map 19% reduce 0%
14/07/12 16:46:53 INFO mapreduce.Job: map 20% reduce 0%
14/07/12 16:46:54 INFO mapreduce.Job: map 23% reduce 0%
14/07/12 16:46:57 INFO mapreduce.Job: map 26% reduce 0%
14/07/12 16:47:00 INFO mapreduce.Job: map 29% reduce 0%
14/07/12 16:47:01 INFO mapreduce.Job: Task Id : attempt_1404888747437_0074_m_000000_0, Status : FAILED
Error: java.lang.IllegalStateException: java.io.IOException: Spill failed
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:140)
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:115)
    at org.apache.mahout.math.map.OpenObjectIntHashMap.forEachPair(OpenObjectIntHashMap.java:185)
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:115)
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper.map(CollocMapper.java:41)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:340)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:168)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1548)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:163)
Caused by: java.io.IOException: Spill failed
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.checkSpillException(MapTask.java:1535)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$300(MapTask.java:853)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1349)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    at org.apache.mahout.vectorizer.collocations.llr.GramKey.write(GramKey.java:91)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:98)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:82)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:1126)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:692)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at org.apache.mahout.vectorizer.collocations.llr.CollocMapper$1.apply(CollocMapper.java:131)
    ... 12 more
Caused by: java.lang.ArrayIndexOutOfBoundsException: 1836016430
    at java.io.ByteArrayInputStream.read(ByteArrayInputStream.java:144)
    at java.io.DataInputStream.readByte(DataInputStream.java:265)
    at org.apache.mahout.math.Varint.readUnsignedVarInt(Varint.java:159)
    at org.apache.mahout.vectorizer.collocations.llr.GramKey.readFields(GramKey.java:78)
    at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:132)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:1245)
    at org.apache.hadoop.util.QuickSort.sortInternal(QuickSort.java:105)
    at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:63)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1575)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$900(MapTask.java:853)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1505)