Hey,
I want to cluster a set of documents using a bag-of-words approach (e.g.
with K-means). However, my documents (since they are automatically
generated by aggregating text snippets) differ hugely in size.
This means some document vectors have only 50 words with a count greater
than 0 (each of those words having a small count), while a small number of
document vectors have 1,000,000 words with a count greater than 0 (with the
counts following a power-law-like distribution). Even with tf-idf
normalization, the sparse document vectors (small docs) end up with large
scores for a few words, while the densely filled document vectors (large
docs) end up with small scores for a huge number of words.
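To make that concrete, here is a minimal sketch of my setup, assuming a
scikit-learn-style tf-idf + K-means pipeline (the toy documents and term
names are purely illustrative; my real documents range from ~50 to
~1,000,000 distinct terms):

```python
# Minimal sketch of the setup (assuming scikit-learn; toy documents only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# A small aggregated doc: few distinct terms, each with count 1.
small_doc = "w1 w2 w3"
# A large aggregated doc: many distinct terms with power-law-like counts
# (w1 occurs 2000 times, w2 1000 times, ..., w2000 once).
large_doc = " ".join(f"w{i}" for i in range(1, 2001) for _ in range(2000 // i))
docs = [small_doc, large_doc]

# L2-normalized tf-idf: the small doc concentrates its mass on a few terms
# (large per-term scores), the large doc spreads it over many terms
# (mostly small per-term scores).
X = TfidfVectorizer(norm="l2").fit_transform(docs)
print(X[0].getnnz(), X[0].max())   # few nonzeros, large max score
print(X[1].getnnz(), X[1].max())   # many nonzeros, small scores for most terms

# Clustering these vectors directly is what I am currently doing.
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
```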
I call this a highly heterogeneous input data set (I don't know if that is
the right term) and expect it to be a known problem in existing domains.
For example, in NLP, when clustering similar terms on the basis of a
term-document matrix, some terms occur only a few times in a few documents,
while a small number of terms occur very often across a huge number of
documents.
People have proposed using PPMI and smoothing to get better results;
however, the papers I read do not explicitly discuss the heterogeneity
problem and how it affects the output (e.g. the clusters or the similarity
calculation).
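For clarity, this is the PPMI reweighting I am referring to; a minimal
NumPy sketch, where the smoothing exponent alpha is just one common variant
and not necessarily what those papers use:

```python
import numpy as np

def ppmi(counts, alpha=1.0):
    """PPMI reweighting of a dense (terms x docs) count matrix.

    alpha: context-distribution smoothing exponent (0.75 is a common
           choice; alpha=1.0 gives plain, unsmoothed PPMI).
    """
    total = counts.sum()
    p_td = counts / total                              # joint P(t, d)
    p_t = counts.sum(axis=1, keepdims=True) / total    # marginal P(t)
    p_d = counts.sum(axis=0, keepdims=True) ** alpha   # smoothed P(d)
    p_d = p_d / p_d.sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_td / (p_t * p_d))
    pmi[~np.isfinite(pmi)] = 0.0                       # zero counts -> 0
    return np.maximum(pmi, 0.0)                        # keep positive PMI only
```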
Does someone have a hint as to which normalization/clustering approach is
promising in the presence of such huge heterogeneity, or can point me to
some relevant papers?
I thought about approaches that explicitly try to (a) extract sparse
clusters (Sparse PCA), (b) split large documents by sampling, or (c) smooth
the doc-term matrix before clustering using SVD, PLSA, or LDA (to make the
small docs more densely filled); a sketch of option (c) is below.
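Here is a rough sketch of what I have in mind for option (c), again
assuming scikit-learn and using TruncatedSVD as the smoothing step (PLSA or
LDA would take its place); the toy documents and parameter values are just
placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.cluster import KMeans

# Toy stand-ins for the aggregated documents.
docs = [
    "apple banana cherry",
    " ".join(["apple", "fruit", "tree", "garden"] * 500),
    "car engine wheel",
    " ".join(["car", "road", "engine", "driver"] * 500),
]

lsa = make_pipeline(
    TfidfVectorizer(sublinear_tf=True),  # log-scaled tf dampens power-law counts
    TruncatedSVD(n_components=2),        # low-rank smoothing; a few hundred on real data
    Normalizer(copy=False),              # back to unit length -> cosine-like K-means
)
X = lsa.fit_transform(docs)
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
print(labels)
```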
However, I am searching for a more well-founded approach to this kind of
problem, or a good resource. Can anyone point me to a good paper covering
it? That would be of great help.
Thanks a lot,
Chris