Performance of RowSimilarityJob

I've been implementing the RowSimilarityJob on our 40-node cluster and have
run into some serious performance issues.

I am trying to run the job on a corpus of just over 2 million documents using
bi-grams. When I get to the pairwise similarity step (CooccurrencesMapper
and SimilarityReducer), I run out of space on HDFS because the job
generates over 5 terabytes of output data.
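
For a rough sense of why this step might blow up, here is a back-of-the-envelope
sketch. It assumes one record is emitted for every pair of documents that share
a bi-gram, which is a simplification and not Mahout's actual code. The function
names, the document frequencies, and the 24-byte record size are made-up
illustration values, not measurements from our cluster.

```python
def cooccurrence_pairs(doc_freq):
    """Unordered document pairs sharing a term that occurs in doc_freq documents."""
    return doc_freq * (doc_freq - 1) // 2


def estimate_output_bytes(doc_freqs, bytes_per_record=24):
    """Estimated bytes if one record is emitted per co-occurring pair, per term.

    bytes_per_record is an assumed size (two ids plus a weight), not a measured one.
    """
    return sum(cooccurrence_pairs(df) for df in doc_freqs) * bytes_per_record


# Hypothetical document frequencies for three bi-grams in a ~2,000,000-document corpus.
example_doc_freqs = [10, 1_000, 200_000]

total_bytes = estimate_output_bytes(example_doc_freqs)
print(f"estimated intermediate output: {total_bytes / 1e12:.2f} TB")
```

Under these assumptions the cost grows with the square of a term's document
frequency, so a handful of very common bi-grams would account for nearly all of
the output.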

Has anybody else run into similar issues? What other info can I provide
that would be helpful?

Thanks,
Burke