Hi Mahout users!
I'm starting to deal with unstructured text classification, namely
classification of web pages of unknown structure. The number of possible
categories would probably be quite small (for now I believe that three
categories are enough).
Later I would add another level of data processing based on document
structure (existence of meta tags and so on).
Do you have any experience or suggestions? Somehow I don't feel like using
a bag-of-words approach (but maybe I am wrong?).
grzegorz.ewald [ at ] gmail.com