Hey,
I want to cluster a set of documents using a bag-of-words approach (e.g.
with K-means). However, my documents (since they are automatically
generated by aggregating text snippets) differ hugely in size.
This means some document vectors have only 50 words with a count greater
than 0 (each of those words having a small count), while a small number of
document vectors have 1,000,000 words with a count greater than 0 (with the
counts following a power-law-like distribution). Even with tf-idf
normalization, the sparse document vectors (small docs) end up with large
scores for a few words, while the densely filled document vectors (large
docs) end up with small scores for a huge number of words.
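To make that concrete, here is a minimal sketch of my setup, assuming a
scikit-learn-style tf-idf + K-means pipeline (the toy documents and term
names are purely illustrative; my real documents range from ~50 to
~1,000,000 distinct terms):

```python
# Minimal sketch of the setup (assuming scikit-learn; toy documents only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A small aggregated doc: few distinct terms, each with count 1.
small_doc = "w1 w2 w3"
# A large aggregated doc: many distinct terms with power-law-like counts
# (w1 occurs 2000 times, w2 1000 times, ..., w2000 once).
large_doc = " ".join(f"w{i}" for i in range(1, 2001) for _ in range(2000 // i))
docs = [small_doc, large_doc]

# L2-normalized tf-idf: the small doc concentrates its mass on a few terms
# (large per-term scores), the large doc spreads it over many terms
# (mostly small per-term scores).
X = TfidfVectorizer(norm="l2").fit_transform(docs)
print(X[0].getnnz(), X[0].max())   # few nonzeros, large max score
print(X[1].getnnz(), X[1].max())   # many nonzeros, small scores for most terms

# Clustering these vectors directly is what I am currently doing.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
```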
I call this a highly heterogeneous input data set (I don't know if that is
the right term) and expect it to be a known problem in existing domains.
For example, in NLP, when clustering similar terms on the basis of a
term-document matrix, some terms occur only a few times in a few documents,
while a small number of terms occur very often across a huge number of
documents.
People have proposed using PPMI and smoothing to get better results;
however, the papers I read do not explicitly discuss the heterogeneity
problem and how it affects the output (e.g. the clusters or the similarity
calculation).
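For clarity, this is the PPMI reweighting I am referring to; a minimal
NumPy sketch, where the smoothing exponent alpha is just one common variant
and not necessarily what those papers use:

```python
import numpy as np

def ppmi(counts, alpha=1.0):
    """PPMI reweighting of a dense (terms x docs) count matrix.

    alpha: context-distribution smoothing exponent (0.75 is a common
           choice; alpha=1.0 gives plain, unsmoothed PPMI).
    """
    total = counts.sum()
    p_td = counts / total                              # joint P(t, d)
    p_t = counts.sum(axis=1, keepdims=True) / total    # marginal P(t)
    p_d = counts.sum(axis=0, keepdims=True) ** alpha   # smoothed P(d)
    p_d = p_d / p_d.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_td / (p_t * p_d))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts -> 0
    return np.maximum(pmi, 0.0)                        # keep positive PMI only
```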
Does someone have a hint as to which normalization/clustering approach is
promising in the presence of such huge heterogeneity, or can point me to
some relevant papers?
I thought about approaches that explicitly try to (a) extract sparse
clusters (Sparse PCA), (b) split large documents by sampling, or (c) smooth
the doc-term matrix before clustering using SVD, PLSA, or LDA (to make the
small docs more densely filled); a sketch of option (c) is below.
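Here is a rough sketch of what I have in mind for option (c), again
assuming scikit-learn and using TruncatedSVD as the smoothing step (PLSA or
LDA would take its place); the toy documents and parameter values are just
placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# Toy stand-ins for the aggregated documents.
docs = [
    "apple banana cherry",
    " ".join(["apple", "fruit", "tree", "garden"] * 500),
    "car engine wheel",
    " ".join(["car", "road", "engine", "driver"] * 500),
]

lsa = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # log-scaled tf dampens power-law counts
    TruncatedSVD(n_components=2),        # low-rank smoothing; a few hundred on real data
    Normalizer(copy=False),              # back to unit length -> cosine-like K-means
)
X = lsa.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
```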
However, I am searching for a more well-founded approach to this kind of
problem, or a good resource. Can anyone point me to a good paper covering
it? That would be of great help.
Thanks a lot,
Chris