Channel: Apache Timeline

Mahout for clustering

Hi All,
We are using Apache Pig to build our data pipeline. Our data has the following format:
userid, age, items {code 1, code 2, ….}, few other features...
Each item has a unique alphanumeric code. I would like to use Mahout to cluster this data. Based on my reading so far, I see the following options:
1. Map each alphanumeric item code to a numeric code (AAAAA1 -> 0, AAAAA2 -> 1, AAAAA3 -> 2, etc.), run the clustering algorithm on the reformatted data, and then map the results back onto the real item codes.
2. Represent the item codes as a 1 x M matrix in which each column represents an item (1 if the user has viewed that item, 0 otherwise). This matrix will have millions of columns, so each user will have an id, an age, and this matrix. I am not sure whether this will work.
We also want to do frequent pattern mining etc. on the same data. Any thoughts on data representation and clustering would be greatly appreciated.
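
To make the two options concrete, here is a minimal sketch in plain Python (not Mahout's API, and not our actual pipeline). It assigns each item code a column index as in option 1, then stores each user's 1 x M indicator row from option 2 as a list of its non-zero column indices, so millions of columns cost nothing for items a user never viewed. The sample records and the names `build_item_index` and `to_sparse_row` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[str, int, List[str]]  # (userid, age, item codes)


def build_item_index(users: Iterable[Record]) -> Dict[str, int]:
    """Assign each distinct item code a column index in first-seen order."""
    index: Dict[str, int] = {}
    for _user_id, _age, items in users:
        for code in items:
            if code not in index:
                index[code] = len(index)
    return index


def to_sparse_row(items: List[str], index: Dict[str, int]) -> List[int]:
    """Return the sorted column indices whose value is 1 for this user.

    Every column not listed is implicitly 0, so the full 1 x M row
    is never materialized.
    """
    return sorted({index[code] for code in items})


# Hypothetical sample records, for illustration only.
users: List[Record] = [
    ("u1", 34, ["AAAAA1", "AAAAA2"]),
    ("u2", 27, ["AAAAA2", "AAAAA3"]),
    ("u3", 45, ["AAAAA1", "AAAAA3", "AAAAA2"]),
]

item_index = build_item_index(users)
# Reverse lookup, used to map results back onto the real item codes.
reverse_index = {col: code for code, col in item_index.items()}

rows = {user_id: to_sparse_row(items, item_index) for user_id, _age, items in users}

# Decode one user's row back into item codes.
decoded_u2 = [reverse_index[col] for col in rows["u2"]]
```

Under this sketch the two options are the same representation seen from two sides: the dictionary provides the numeric codes, and the index lists are the sparse rows. The same index lists could in principle also serve as the per-user transactions for frequent pattern mining, although I have not verified what input format Mahout expects for that.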
