Hi all,
I put together a utility which vectorizes plain old Java objects annotated
with @Feature and @Target via Mahout's vector encoders.
See my GitHub branch:
https://github.com/frankscholten/mahout/tree/annotation-based-vectorizer
and the unit test:
https://github.com/frankscholten/mahout/blob/annotation-based-vectorizer/core/src/test/java/org/apache/mahout/classifier/sgd/AnnotationBasedVectorizerTest.java
Use it like this:
class NewsgroupPost {

  @Target
  private String newsgroup;

  // Field name is illustrative; it must differ from the target field above.
  @Feature(encoder = TextValueEncoder.class)
  private String bodyText;

  // Getters and setters
}

AnnotationBasedVectorizer<NewsgroupPost> vectorizer =
    new AnnotationBasedVectorizer<NewsgroupPost>(new TypeReference<NewsgroupPost>() {});
Here the vectorizer scans the NewsgroupPost's annotations. Then you can do
this:
NewsgroupPost post = ...
Vector vector = vectorizer.vectorize(post);
int target = vectorizer.getTarget(post);
int numFeatures = vectorizer.getNumberOfFeatures();
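For anyone wondering what "scans the annotations" could look like, here is a rough sketch using plain reflection. It is not the code on the branch: the annotations are simplified stand-ins (no encoder attribute), and AnnotationScanSketch, fieldsAnnotatedWith and bodyText are made-up names for illustration.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

final class AnnotationScanSketch {

  // Simplified stand-ins for the branch's annotations (assumption: the real
  // ones carry more attributes, such as the encoder class).
  @Retention(RetentionPolicy.RUNTIME)
  @interface Feature {}

  @Retention(RetentionPolicy.RUNTIME)
  @interface Target {}

  static class NewsgroupPost {
    @Target
    private String newsgroup;

    @Feature
    private String bodyText; // hypothetical field name
  }

  // Collects the names of the declared fields that carry the given annotation.
  static List<String> fieldsAnnotatedWith(Class<?> type,
                                          Class<? extends Annotation> annotation) {
    List<String> names = new ArrayList<>();
    for (Field field : type.getDeclaredFields()) {
      if (field.isAnnotationPresent(annotation)) {
        names.add(field.getName());
      }
    }
    return names;
  }

  public static void main(String[] args) {
    System.out.println(fieldsAnnotatedWith(NewsgroupPost.class, Feature.class)); // [bodyText]
    System.out.println(fieldsAnnotatedWith(NewsgroupPost.class, Target.class));  // [newsgroup]
  }
}
```

The annotations need RUNTIME retention, otherwise reflection cannot see them.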
Note that the vectorize() and getTarget() methods are generically typed and,
thanks to the type token passed to the constructor, we can enforce that only
NewsgroupPost instances are accepted.
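As background on the type token idea, here is a minimal sketch of how an anonymous subclass can capture its type argument at runtime. The TypeReference below is a hand-rolled stand-in written for this example; the class actually used on the branch may differ.

```java
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;

final class TypeTokenSketch {

  // Minimal "super type token": creating an anonymous subclass records T in
  // the subclass's generic superclass, where reflection can read it back.
  abstract static class TypeReference<T> {
    private final Type type;

    protected TypeReference() {
      Type superclass = getClass().getGenericSuperclass();
      if (!(superclass instanceof ParameterizedType)) {
        throw new IllegalStateException("TypeReference needs a type argument");
      }
      this.type = ((ParameterizedType) superclass).getActualTypeArguments()[0];
    }

    Type getType() {
      return type;
    }
  }

  static class NewsgroupPost {}

  // Returns the type captured by an anonymous TypeReference<NewsgroupPost>.
  static Type capturedType() {
    return new TypeReference<NewsgroupPost>() {}.getType();
  }

  public static void main(String[] args) {
    System.out.println(capturedType().equals(NewsgroupPost.class)); // true
  }
}
```

The trailing {} matters: without the anonymous subclass there is no generic superclass to read the type argument from.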
The vectorizer uses a Dictionary for encoding the target.
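To illustrate the idea of dictionary-encoding the target, here is a small self-contained sketch that assigns each distinct label the next free integer. It is a stand-in written for this mail, not Mahout's Dictionary class, and the newsgroup names are only sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class TargetDictionarySketch {

  // Maps each label to the integer code it received when first seen.
  private final Map<String, Integer> codes = new LinkedHashMap<>();

  // Returns the code for the label, assigning the next free code if it is new.
  int intern(String label) {
    Integer code = codes.get(label);
    if (code == null) {
      code = codes.size();
      codes.put(label, code);
    }
    return code;
  }

  int size() {
    return codes.size();
  }

  public static void main(String[] args) {
    TargetDictionarySketch dictionary = new TargetDictionarySketch();
    System.out.println(dictionary.intern("comp.graphics")); // 0
    System.out.println(dictionary.intern("sci.space"));     // 1
    System.out.println(dictionary.intern("comp.graphics")); // 0
  }
}
```

The same label always maps to the same code, so the codes can be used directly as class indices.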
Thoughts?
Cheers,
Frank