I've run into a few problems with Mahout that I can't sort out.
The context is that I'm trying to calculate pairwise Euclidean distances
between music tracks based on 6 audio features per track. My input for the
Mahout job is a text file which looks like this:
feature_id,track_id,feature_value
<integer>,<integer>,<double>
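To make the format concrete, here is a minimal Python sketch that reads such lines and groups them into one feature vector per track. It is only an illustration, not part of the Mahout job: the function name load_track_vectors and the n_features parameter are hypothetical, and the check for 6 features per track is an assumption taken from the description above.

```python
from collections import defaultdict


def load_track_vectors(lines, n_features=6):
    """Parse 'feature_id,track_id,feature_value' lines.

    Returns (vectors, incomplete), where vectors maps
    track_id -> {feature_id: value} and incomplete lists the track ids
    that do not have exactly n_features features.
    """
    vectors = defaultdict(dict)
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        parts = line.split(",")
        if len(parts) != 3:
            raise ValueError(f"line {lineno}: expected 3 fields, got {len(parts)}")
        # int() and float() tolerate surrounding whitespace
        feature_id, track_id = int(parts[0]), int(parts[1])
        vectors[track_id][feature_id] = float(parts[2])
    incomplete = sorted(t for t, f in vectors.items() if len(f) != n_features)
    return dict(vectors), incomplete
```

Running it over the input file (for example via open(path) as the lines argument) gives a quick way to spot malformed lines or tracks with missing features before submitting the job.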
This command works locally for fewer than 600 tracks (using
mahout-core-0.7-cdh4.5.0-job.jar):
mahout itemsimilarity --input input/msd_sample/mahout --output
output/mahout --similarityClassname
SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1
With more tracks, however, the output file part-r-0000 is empty. I tried
lowering the --threshold parameter but I still don't get any results.
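As a sanity check that does not depend on Mahout, the nearest track for each track can be computed by brute force on a small sample. The following Python sketch assumes the vectors are already available as a dict mapping track_id to {feature_id: value}; the function names are hypothetical, and treating a missing feature as 0.0 is an assumption. It returns raw Euclidean distances, which are not necessarily the values Mahout writes to its output.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two sparse vectors {feature_id: value}.

    Assumption: a feature missing from one vector counts as 0.0.
    """
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def nearest_neighbour(vectors):
    """For each track, return (closest_other_track, distance).

    Brute force, O(n^2) in the number of tracks, so only suitable
    for small samples.
    """
    result = {}
    for track, vec in vectors.items():
        best = None
        for other, other_vec in vectors.items():
            if other == track:
                continue  # skip the track itself
            d = euclidean(vec, other_vec)
            if best is None or d < best[1]:
                best = (other, d)
        if best is not None:
            result[track] = best
    return result
```

If this produces a neighbour for every track in a sample where the Mahout output is empty, the input values themselves are probably not the cause.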
I also tried launching the job on AWS EMR with the equivalent input for
3000 tracks (using mahout-core-0.8-job.jar):
org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob --input
s3n://hadoop-filrouge/input/msd-sample/mahout --output
s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity
--similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false
--maxSimilaritiesPerItem 1
The job completes successfully, but I get 17 empty part-r-000xx files.
I'm completely stuck right now and running out of ideas for fixing this.
If anybody has even a rough idea of what is going on, it would
really help.
Many thanks,