Hi,
I'd like to use Mahout for clustering and classification on tens of
terabytes of data stored in Amazon's S3 storage service. Each file in my
data set yields one data point, but I first need to decompress and process
the file before applying machine learning. Is it necessary to have all the
files pre-processed before running Mahout, or is there a straightforward
way to combine the pre-processing with Mahout? For example, I have a script
that does the pre-processing; could I somehow tell Mahout to run that script?
Pre-processing all the files ahead of time is simple, but Amazon charges for
the extra storage space the pre-processed files would use.
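To make the kind of pre-processing step I mean concrete, here is a rough
sketch (the function name and the feature extraction are made up for
illustration): take one compressed file's bytes, decompress them, and reduce
them to a single data point that Mahout could then consume.

```python
import gzip

def file_to_data_point(compressed_bytes):
    """Decompress one gzip-compressed file's contents and reduce it to a
    single data point (here: a toy two-element feature vector)."""
    raw = gzip.decompress(compressed_bytes)
    # Toy feature extraction: file length and mean byte value.
    # A real script would produce whatever features the model needs.
    return [len(raw), sum(raw) / max(len(raw), 1)]

# Example: compress a small payload in memory and pre-process it.
payload = b"example record contents"
point = file_to_data_point(gzip.compress(payload))
```

The question, then, is whether something like this can run inside the Mahout
job itself rather than as a separate pass that writes results back to S3.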
Thanks.
Eric