I'm clustering a pretty typical use case (news articles), but I keep
running into a problem that ends up ruining the final cluster quality:
noise, or "junk" sentences appended or prepended to the articles by
the news outlet. Removing common noise from a dataset is a problem
common to many domains (news, bioinformatics, etc.), so I figure there
must be some solution to it in existence already. Does anyone know of
any libraries to clean common strings from a set of strings (Java,
preferably)?
I'm scraping pages from news outlets using HTMLUnit and passing the
output to Boilerpipe to extract the article contents. I've noticed
that Boilerpipe doesn't always do a great job: noise often slips
through, and when I cluster the data the results are skewed because
of it.
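For reference, my fetch-and-extract step looks roughly like this (a
minimal sketch assuming a 2.x-era HTMLUnit and Boilerpipe's stock
ArticleExtractor; error handling is omitted):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class ArticleFetcher {
        public static String fetchArticleText(String url) throws Exception {
            WebClient webClient = new WebClient();
            try {
                // News pages are script-heavy; disabling JavaScript and CSS
                // avoids fetch failures that don't affect the article text.
                webClient.getOptions().setJavaScriptEnabled(false);
                webClient.getOptions().setCssEnabled(false);

                HtmlPage page = webClient.getPage(url);

                // Boilerpipe's ArticleExtractor tries to keep only the main
                // article body from the raw markup.
                return ArticleExtractor.INSTANCE.getText(page.asXml());
            } finally {
                webClient.closeAllWindows();
            }
        }
    }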
Examples of common "junk" sentences are as follows:
-“Get Connected! MASNsports.com is your online home for the latest
Orioles and Nationals news, features, and commentary. And now, you can
connect with MASN on every digital level. From web and social media to
our new mobile alert service, MASN has got all the bases covered. Get
social!”
-“Home KKTV firmly believes in freedom of speech for all and we are
happy to provide this forum for the community to share opinions and
facts. We ask that commenters keep it clean, keep it truthful, stay on
topic and be responsible. Comments left here do not necessarily
represent the viewpoint of KKTV 11 News. If you believe that any of
the comments on our site are inappropriate or offensive, please tell
us by clicking “Report Abuse” and answering the questions that follow.
We will review any reported comments promptly.”
-“(TM and © Copyright 2014 CBS Radio Inc. and its relevant
subsidiaries. CBS RADIO and EYE Logo TM and Copyright 2014 CBS
Broadcasting Inc. Used under license. All Rights Reserved. This
material may not be published, broadcast, rewritten, or redistributed.
The Associated Press contributed to this report.)”
-“(© Copyright 2014 The Associated Press. All Rights Reserved. This
material may not be published, broadcast, rewritten or
redistributed.)”
...and so on.
I've played around with a number of methods to clean the dataset
prior to clustering: manually gathering and scrubbing common
substrings, using various Longest Common Subsequence (LCS)
implementations, computing the Levenshtein distance across all
possible substrings, and so on. I've put a significant amount of time
into these approaches without great results, so I figured I'd ask
whether anyone knows of a library that does something along these
lines. Has anyone had any luck finding such a thing?
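To make the substring-scrubbing idea concrete, the simplest variant is
something like the following sketch (the regex sentence splitter and
the document-frequency threshold are naive placeholders, not a real
implementation): split every document into sentences, count how many
documents each sentence occurs in, and drop any sentence shared by
more than a handful of documents.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class BoilerplateScrubber {

        // Naive splitter: break after sentence-ending punctuation followed
        // by whitespace. A real version would use BreakIterator or an NLP
        // sentence detector.
        private static final String SENTENCE_SPLIT = "(?<=[.!?])\\s+";

        /** Drops any sentence that appears in more than maxDocs documents. */
        public static List<String> scrub(List<String> documents, int maxDocs) {
            // Document frequency: in how many documents does each sentence occur?
            Map<String, Integer> docFreq = new HashMap<String, Integer>();
            for (String doc : documents) {
                Set<String> unique = new HashSet<String>();
                for (String s : doc.split(SENTENCE_SPLIT)) {
                    unique.add(s.trim());
                }
                for (String s : unique) {
                    Integer c = docFreq.get(s);
                    docFreq.put(s, c == null ? 1 : c + 1);
                }
            }

            // Rebuild each document, keeping only the rare (article-specific)
            // sentences.
            List<String> cleaned = new ArrayList<String>();
            for (String doc : documents) {
                StringBuilder sb = new StringBuilder();
                for (String s : doc.split(SENTENCE_SPLIT)) {
                    if (docFreq.get(s.trim()) <= maxDocs) {
                        if (sb.length() > 0) {
                            sb.append(' ');
                        }
                        sb.append(s);
                    }
                }
                cleaned.add(sb.toString());
            }
            return cleaned;
        }
    }

This catches exact repeats like the copyright blocks above, but it
falls apart as soon as the junk varies slightly between articles,
which is what pushed me toward the LCS and Levenshtein attempts.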
Many thanks,
-David