scala - How to deal with more than one categorical feature in a decision tree?
I read a piece of code for a binary decision tree in a book. It has only one categorical feature, field(3), in the raw data, which is converted to one-of-k (one-hot) encoding.
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    def prepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
      val rawDataWithHeader = sc.textFile("data/train.tsv")
      // Drop the header row, which sits in the first partition.
      val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
      val lines = rawData.map(_.split("\t"))
      // Map each distinct value of field 3 to an index.
      val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
      val labelpointRDD = lines.map { fields =>
        val trFields = fields.map(_.replaceAll("\"", ""))
        // One-hot encode field 3.
        val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
        val categoryIdx = categoriesMap(fields(3))
        categoryFeaturesArray(categoryIdx) = 1
        // Missing numerical values ("?") become 0.0.
        val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
        val label = trFields(fields.size - 1).toInt
        LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
      }
      val Array(trainData, validationData, testData) = labelpointRDD.randomSplit(Array(8, 1, 1))
      return (trainData, validationData, testData, categoriesMap)
    }
I wonder how to revise the code if there are several categorical features in the raw data, let's say field(3), field(5), and field(7) are all categorical features.
I revised the first line:
    def prepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) = ......
Then I converted the other two fields to 1-of-k encoding the same way it was done for field(3):
    val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
    val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
    val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
    val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
    val categoryIdx5 = categoriesMap5(fields(5))
    val categoryIdx7 = categoriesMap7(fields(7))
    categoryFeaturesArray5(categoryIdx5) = 1
    categoryFeaturesArray7(categoryIdx7) = 1
Finally, I revised the LabeledPoint and the return statement like this:
    LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
    return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)
Is this correct?
==================================================
The second problem I encountered is this: following the code in the book, trainModel uses
    DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
Here is the code:
    def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
      val startTime = new DateTime()
      val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
      val endTime = new DateTime()
      val duration = new Duration(startTime, endTime)
      (model, duration.getMillis())
    }
The question is: how do I pass categoricalFeaturesInfo to this method if the data has the three categorical features mentioned previously?
I just want to follow the steps in the book and build a prediction system of my own using a decision tree. To be more specific, the data set I chose has several categorical features:
gender: male, female
education: HS-grad, Bachelors, Master, Ph.D, ......
country: US, Canada, England, Australia, ......
But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and into one single categoricalFeaturesInfo to put into DecisionTree.trainRegressor().
It is not clear to me what exactly you're doing here, but it looks wrong from the beginning.
Ignoring the fact that you're reinventing the wheel by implementing one-hot encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. This is required for linear models, but arguably it doesn't make sense when working with decision trees.
Keeping that in mind, you have two choices:
- Index the categorical fields without encoding and pass the indexed features to categoricalFeaturesInfo.
- One-hot-encode the categorical features and treat these as numerical variables.
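The first choice can be sketched in plain Scala, without Spark, to show how the pieces line up. This is only an illustration under assumptions: the field positions 3, 5 and 7 come from the question, while the object CategoricalIndexing and its helper names are hypothetical. Each categorical field is replaced by a single index, the indexed fields are placed first in the feature array, and the map from feature position to category count is what categoricalFeaturesInfo expects.

```scala
// Sketch only: plain Scala, no Spark dependency.
// The object and helper names are hypothetical, not part of MLlib.
object CategoricalIndexing {

  // Build a value -> index map for one column, in order of first appearance.
  def buildIndex(rows: Seq[Array[String]], col: Int): Map[String, Int] =
    rows.map(_(col)).distinct.zipWithIndex.toMap

  // Replace each categorical field by its index (as a Double) and append the
  // numerical fields. Categorical features occupy positions 0 until catCols.size.
  def toFeatures(
      row: Array[String],
      catCols: Seq[Int],
      indexes: Seq[Map[String, Int]],
      numCols: Seq[Int]): Array[Double] = {
    val categorical = catCols.zip(indexes).map { case (c, idx) => idx(row(c)).toDouble }
    // Same convention as the question: a missing value "?" becomes 0.0.
    val numerical = numCols.map(c => if (row(c) == "?") 0.0 else row(c).toDouble)
    (categorical ++ numerical).toArray
  }

  // Position in the feature vector -> number of distinct categories.
  def categoricalFeaturesInfo(indexes: Seq[Map[String, Int]]): Map[Int, Int] =
    indexes.zipWithIndex.map { case (idx, pos) => pos -> idx.size }.toMap
}
```

In the question's code, the resulting map would take the place of the empty Map[Int, Int]() passed to DecisionTree.trainClassifier, and the feature array would go into Vectors.dense. Note that maxBins has to be at least as large as the largest category count.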
I believe that the former approach is the right approach. The latter one should work in practice, but it just artificially increases dimensionality without providing any benefits. It may also be in conflict with the heuristics used by the Spark implementation.
One way or another, you should consider using ML Pipelines, which provide all the required indexing, encoding, and merging tools.
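As a rough illustration of the ML Pipelines route, the sketch below chains one StringIndexer per categorical column, a VectorAssembler, and a DecisionTreeClassifier. It assumes the data has already been loaded into a DataFrame; the object name TreePipeline, the "_idx" suffix, and the "features" column name are choices made for this example, and the column names are supplied by the caller rather than taken from the original post.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

// Sketch only: TreePipeline and its parameter names are hypothetical.
object TreePipeline {
  def build(categoricalCols: Seq[String], numericCols: Seq[String], labelCol: String): Pipeline = {
    // One StringIndexer per categorical column: string value -> category index.
    val indexers = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }
    // Merge the indexed categorical columns and the numeric columns into one vector column.
    val assembler = new VectorAssembler()
      .setInputCols((categoricalCols.map(_ + "_idx") ++ numericCols).toArray)
      .setOutputCol("features")
    // The tree reads the assembled vector; the label column must be numeric.
    val tree = new DecisionTreeClassifier()
      .setLabelCol(labelCol)
      .setFeaturesCol("features")
    val stages: Seq[PipelineStage] = indexers ++ Seq[PipelineStage](assembler, tree)
    new Pipeline().setStages(stages.toArray)
  }
}
```

With a DataFrame named, say, trainingDF (hypothetical), usage would look like TreePipeline.build(Seq("gender", "education", "country"), numericCols, "label").fit(trainingDF). As far as I know, the indexed columns carry metadata that marks them as categorical, so no separate categoricalFeaturesInfo map is needed here; if a column has more categories than the default number of bins, setMaxBins on the classifier would have to be raised.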