Hello,
I'm using Mahout's logistic regression (version 0.9), but when I evaluate
the trained model on the same data set it was trained on, I do not see a
high value for AUC. I would expect it to be very high, since it is the same
data set.
My data set is a CSV file with about 7 million lines and 18 attributes,
some numerical and some categorical.
This is how I create the model for logistic regression (I ignore some of
the attributes):
$ mahout trainlogistic --input train.csv \
--output ./model \
--categories 2 \
--predictors attribute1 ... attribute15 \
--types w w w n n w w w w w w w n n n \
--target is_delayed \
--rate 100 \
--passes 2 \
--features 500000
And then when I check the AUC value using the model on the same data set:
$ mahout runlogistic --input train.csv --model ./model --auc --confusion
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /usr/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /usr/lib/mahout/mahout-examples-0.9-cdh5.3.0-job.jar
AUC = 0.48
confusion: [[1703477.0, 761921.0], [3034369.0, 1137161.0]]
entropy: [[NaN, NaN], [-16.5, -17.4]]
15/01/18 06:50:50 INFO driver.MahoutDriver: Program took 98213 ms (Minutes: 1.6368833333333332)
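As a quick sanity check on the numbers above, the confusion matrix can be summarized in a few lines of Python. This is only a sketch: the variable names are mine, and it assumes the diagonal entries count the correctly classified rows (which gives the same accuracy whichever way the rows and columns are oriented).

```python
# Confusion matrix copied from the runlogistic output above.
confusion = [[1703477.0, 761921.0], [3034369.0, 1137161.0]]

# Total number of rows that were scored.
total = sum(sum(row) for row in confusion)

# Assumption: diagonal cells are the correct predictions.
correct = confusion[0][0] + confusion[1][1]
accuracy = correct / total

print(f"rows scored: {total:.0f}, accuracy: {accuracy:.3f}")
```

Under that assumption, fewer than half of the roughly 6.6 million scored rows are classified correctly, which is consistent with the low AUC.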
I'm really confused about why I only get AUC = 0.48 instead of 1.00 or
something very close to it, given that it is the same data set.
Am I missing something? What should I check first?
I also tried with only a few attributes, but the AUC is still very low,
around 0.47. That means the model is doing no better than random guessing,
or even slightly worse, right?
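To make concrete what I mean by "worse than random", here is a minimal Python sketch of AUC as I understand it: the probability that a randomly chosen positive row gets a higher score than a randomly chosen negative row. The function name and the toy scores and labels are made up for illustration; this is not Mahout's implementation.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs where the positive scores higher.

    Ties count as half a win. Brute force, so only suitable for tiny inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: the scores mostly rank the negatives higher.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.2, 0.4, 0.7, 0.3, 0.6, 0.9]

original = auc(scores, labels)
# Negating the scores reverses the ranking.
flipped = auc([-s for s in scores], labels)

print(original, flipped)
```

With no tied scores, reversing the ranking turns an AUC of a into 1 - a, so a value below 0.5 means the ranking is slightly inverted rather than simply uninformative.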