Hi,
I'm training a naive bayes classifier on some structured documents. The
documents have several fields: title, description, body, breadcrumb, etc.
I'd like to weight the tokens from the different fields. For example, let's
say I just use the title and body fields; I'd like tokens from the title to
be weighted three times as much as the body tokens. Right now, when
creating the sequence files, I just concatenate the title string three
times with the body string and let seq2sparse do its work. Is there a
better way?
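
To make the current workaround concrete, here is a minimal sketch of it in
Python. It is an illustration only, not my actual pipeline:
build_weighted_text and the weights dict are made-up names, and the step
that writes the result as a sequence file value is left out.

```python
def build_weighted_text(fields, weights):
    """Repeat each field's text `weight` times and join with spaces.

    fields  -- dict mapping field name to its raw text
    weights -- dict mapping field name to an integer repeat count
               (fields without an entry are included once)
    """
    parts = []
    for name, text in fields.items():
        # Repetition is the only way this approach can express a weight.
        parts.extend([text] * weights.get(name, 1))
    return " ".join(parts)


doc = {"title": "naive bayes", "body": "a simple classifier"}
value = build_weighted_text(doc, {"title": 3})
print(value)
```

The limits are visible here: weights can only be whole numbers, and if
n-grams are enabled downstream, the repeated title would likely produce
spurious n-grams that span the copy boundaries.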
One possibility would be to preserve some structure in the value field
of the sequence file, perhaps using '|' or ';' to separate fields, and then
pass seq2sparse a special analyzer that understands this syntax. Is that a
sensible approach? What do others do in this situation? Thanks in advance.
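
Roughly what I have in mind for the field-aware step, sketched in Python
rather than as a real analyzer class (the real thing would have to be a
Java analyzer handed to seq2sparse). The '|' delimiter, the fixed field
order, and the name weighted_term_counts are all assumptions made for
illustration.

```python
from collections import Counter

FIELD_SEP = "|"  # assumed delimiter; must not occur inside the field text


def weighted_term_counts(value, weights):
    """Split `value` on FIELD_SEP and return term counts scaled per field.

    weights -- one weight per field, in the order the fields appear
    """
    field_texts = value.split(FIELD_SEP)
    if len(field_texts) != len(weights):
        raise ValueError(
            "expected %d fields, got %d" % (len(weights), len(field_texts))
        )
    counts = Counter()
    for text, weight in zip(field_texts, weights):
        # Naive lowercase/whitespace tokenization, standing in for
        # whatever tokenizer the real analyzer would use.
        for token in text.lower().split():
            counts[token] += weight
    return counts


# Title weighted 3, body weighted 1.
counts = weighted_term_counts("Naive Bayes|bayes is a simple classifier", [3, 1])
print(counts["bayes"])  # 3 from the title plus 1 from the body
```

As far as I understand, an analyzer only emits a token stream, so inside
the analyzer itself the weight would still have to be expressed by
emitting title tokens several times. Fractional weights would presumably
have to be applied to the term-frequency vectors afterwards.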
-- Brian