I'm clustering a pretty typical use case (news articles), but I keep
running into a problem that ends up ruining the final cluster quality:
noise, or "junk" sentences appended or prepended to the articles by
the news outlet. Removing common noise from a dataset is a problem
common to many domains (news, bioinformatics, etc.), so I figure there
must be some solution to it in existence already. Does anyone know of
any libraries to clean common strings from a set of strings (Java,
preferably)?
I'm scraping pages from news outlets using HTMLUnit and passing the
output to Boilerpipe to extract the article contents. I've noticed
that Boilerpipe doesn't always do a great job: noise often slips
through, and when I cluster the data the results are skewed because
of it.
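For reference, my fetch-and-extract step looks roughly like this (a
minimal sketch assuming a 2.x-era HTMLUnit and Boilerpipe's stock
ArticleExtractor; error handling is omitted):

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;
    import de.l3s.boilerpipe.extractors.ArticleExtractor;

    public class ArticleFetcher {
        public static String fetchArticleText(String url) throws Exception {
            WebClient webClient = new WebClient();
            try {
                // News pages are script-heavy; disabling JavaScript and CSS
                // avoids fetch failures that don't affect the article text.
                webClient.getOptions().setJavaScriptEnabled(false);
                webClient.getOptions().setCssEnabled(false);

                HtmlPage page = webClient.getPage(url);

                // Boilerpipe's ArticleExtractor tries to keep only the main
                // article body from the raw markup.
                return ArticleExtractor.INSTANCE.getText(page.asXml());
            } finally {
                webClient.closeAllWindows();
            }
        }
    }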
Examples of common "junk" sentences are as follows:
-“Get Connected! MASNsports.com is your online home for the latest
Orioles and Nationals news, features, and commentary. And now, you can
connect with MASN on every digital level. From web and social media to
our new mobile alert service, MASN has got all the bases covered. Get
social!”
-“Home KKTV firmly believes in freedom of speech for all and we are
happy to provide this forum for the community to share opinions and
facts. We ask that commenters keep it clean, keep it truthful, stay on
topic and be responsible. Comments left here do not necessarily
represent the viewpoint of KKTV 11 News. If you believe that any of
the comments on our site are inappropriate or offensive, please tell
us by clicking “Report Abuse” and answering the questions that follow.
We will review any reported comments promptly.”
-“(TM and © Copyright 2014 CBS Radio Inc. and its relevant
subsidiaries. CBS RADIO and EYE Logo TM and Copyright 2014 CBS
Broadcasting Inc. Used under license. All Rights Reserved. This
material may not be published, broadcast, rewritten, or redistributed.
The Associated Press contributed to this report.)”
-“(© Copyright 2014 The Associated Press. All Rights Reserved. This
material may not be published, broadcast, rewritten or
redistributed.)”
...and so on.
I've played around with a number of methods to clean the dataset
prior to clustering: manually gathering and scrubbing common
substrings, using various Longest Common Subsequence (LCS)
implementations, computing the Levenshtein distance across all
possible substrings, and so on. I've put a significant amount of time
into these approaches without great results, so I figured I'd ask
whether anyone knows of a library that does something along these
lines. Has anyone had any luck finding such a thing?
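To make the substring-scrubbing idea concrete, the simplest variant is
something like the following sketch (the regex sentence splitter and
the document-frequency threshold are naive placeholders, not a real
implementation): split every document into sentences, count how many
documents each sentence occurs in, and drop any sentence shared by
more than a handful of documents.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class BoilerplateScrubber {

        // Naive splitter: break after sentence-ending punctuation followed
        // by whitespace. A real version would use BreakIterator or an NLP
        // sentence detector.
        private static final String SENTENCE_SPLIT = "(?<=[.!?])\\s+";

        /** Drops any sentence that appears in more than maxDocs documents. */
        public static List<String> scrub(List<String> documents, int maxDocs) {
            // Document frequency: in how many documents does each sentence occur?
            Map<String, Integer> docFreq = new HashMap<String, Integer>();
            for (String doc : documents) {
                Set<String> unique = new HashSet<String>();
                for (String s : doc.split(SENTENCE_SPLIT)) {
                    unique.add(s.trim());
                }
                for (String s : unique) {
                    Integer c = docFreq.get(s);
                    docFreq.put(s, c == null ? 1 : c + 1);
                }
            }

            // Rebuild each document, keeping only the rare (article-specific)
            // sentences.
            List<String> cleaned = new ArrayList<String>();
            for (String doc : documents) {
                StringBuilder sb = new StringBuilder();
                for (String s : doc.split(SENTENCE_SPLIT)) {
                    if (docFreq.get(s.trim()) <= maxDocs) {
                        if (sb.length() > 0) {
                            sb.append(' ');
                        }
                        sb.append(s);
                    }
                }
                cleaned.add(sb.toString());
            }
            return cleaned;
        }
    }

This catches exact repeats like the copyright blocks above, but it
falls apart as soon as the junk varies slightly between articles,
which is what pushed me toward the LCS and Levenshtein attempts.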
Many thanks,
-David