
Vectorizing data in mapreduce mode

Hi everyone,

My Pig script generates the following output; the results are stored in files part-m-00000 through part-m-00004.

-bash-4.1$ hadoop dfs -ls /scratch/ItemIds

Found 7 items
-rw-r--r-- 1 userid supergroup 0 2013-12-23 11:13 /scratch/ItemIds/_SUCCESS
drwxr-xr-x - userid supergroup 0 2013-12-23 11:12 /scratch/ItemIds/_logs
-rw-r--r-- 1 userid supergroup 276019 2013-12-23 11:12 /scratch/ItemIds/part-m-00000
-rw-r--r-- 1 userid supergroup 272188 2013-12-23 11:12 /scratch/ItemIds/part-m-00001
-rw-r--r-- 1 userid supergroup 252597 2013-12-23 11:12 /scratch/ItemIds/part-m-00002
-rw-r--r-- 1 userid supergroup 236508 2013-12-23 11:12 /scratch/ItemIds/part-m-00003
-rw-r--r-- 1 userid supergroup 270658 2013-12-23 11:12 /scratch/ItemIds/part-m-00004

The output is stored as tab-separated values:

userid1 itemid1 itemid2 itemid3 ......
userid2 itemid1 itemid2 itemid3 ......
......

I have the following questions:

1. Is there a Mahout utility that I can point at /scratch/ItemIds to generate one file out of these 5 part files?

2. What is the recommended way of parsing this tab-separated file in MapReduce mode? I want to vectorize this data and would like to do so in parallel. I already know how to vectorize the data correctly and how to run K-means on it.
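For question 2, here is a minimal sketch of what a vectorizing mapper could look like. The class name, the CARDINALITY bound, and the simple "1.0 at index itemid" encoding are all assumptions for illustration; you would substitute your own vectorization logic.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

// Hypothetical mapper: one input line = "userid<TAB>itemid1<TAB>itemid2..."
// Emits <userid, VectorWritable> pairs; with SequenceFileOutputFormat the
// output can then be fed to Mahout's K-means.
public class ItemVectorizerMapper
    extends Mapper<LongWritable, Text, Text, VectorWritable> {

  // Assumed upper bound on the item-id space; adjust to your data.
  static final int CARDINALITY = 1000000;

  // Pure parsing step, kept separate so it can be tested without a cluster.
  static int[] parseItemIds(String line) {
    String[] fields = line.split("\t");
    int[] ids = new int[fields.length - 1];
    for (int i = 1; i < fields.length; i++) {
      ids[i - 1] = Integer.parseInt(fields[i]);
    }
    return ids;
  }

  @Override
  protected void map(LongWritable offset, Text value, Context ctx)
      throws IOException, InterruptedException {
    String line = value.toString();
    String userId = line.split("\t", 2)[0];
    Vector vec = new RandomAccessSparseVector(CARDINALITY);
    for (int id : parseItemIds(line)) {
      vec.set(id, 1.0); // assumption: item ids are ints usable as indices
    }
    ctx.write(new Text(userId),
        new VectorWritable(new NamedVector(vec, userId)));
  }
}
```

Because each mapper processes its own input split, all 5 part files would be vectorized in parallel without any merging step.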

I have been using the following command to run my clustering algorithm on dummy data; now I want to ingest real data.

hadoop jar /apps/analytics/myanalytics.jar myanalytics.SimpleKMeansClustering \
    -libjars /apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT.jar /:/apps/mahout/trunk/core/target/mahout-core-0.9-SNAPSHOT-job.jar:/apps/mahout/trunk/math/target/mahout-math-0.9-SNAPSHOT.jar

However, I am not sure: if I write the code to vectorize the data in my SimpleKMeansClustering class, will the above command run it in MapReduce mode?
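For what it's worth, my understanding is that the command only runs the vectorization in MapReduce mode if the class actually submits it as a Job; code that just runs in main() executes locally on the client. A hedged sketch of a driver that does submit a map-only vectorization job (the class names and the stub mapper are placeholders, not anything from Mahout):

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.math.VectorWritable;

// Hypothetical driver: submitting via ToolRunner also makes -libjars work,
// since GenericOptionsParser consumes that option before run() sees args.
public class VectorizeDriver extends Configured implements Tool {

  // Stub mapper: replace the body with your vectorization logic.
  public static class VectorizeMapper
      extends Mapper<LongWritable, Text, Text, VectorWritable> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      // ... build a Mahout Vector from the tab-separated line, then:
      // ctx.write(new Text(userId), new VectorWritable(vec));
    }
  }

  @Override
  public int run(String[] args) throws Exception {
    Job job = Job.getInstance(getConf(), "vectorize-itemids");
    job.setJarByClass(VectorizeDriver.class);
    job.setMapperClass(VectorizeMapper.class);
    job.setNumReduceTasks(0); // map-only: no shuffle needed
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(VectorWritable.class);
    TextInputFormat.addInputPath(job, new Path(args[0]));  // e.g. /scratch/ItemIds
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    System.exit(ToolRunner.run(new Configuration(), new VectorizeDriver(), args));
  }
}
```

Input paths accept directories, so pointing the job at /scratch/ItemIds would pick up all 5 part files without merging them first.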
