scala - How to deal with more than one categorical feature in a decision tree?


I read a piece of code for a binary decision tree in a book. The raw data has only one categorical feature, field(3), which is converted to one-of-K (one-hot) encoding.

    def prepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int]) = {
      val rawDataWithHeader = sc.textFile("data/train.tsv")
      val rawData = rawDataWithHeader.mapPartitionsWithIndex { (idx, iter) => if (idx == 0) iter.drop(1) else iter }
      val lines = rawData.map(_.split("\t"))
      val categoriesMap = lines.map(fields => fields(3)).distinct.collect.zipWithIndex.toMap
      val labelPointRDD = lines.map { fields =>
        val trFields = fields.map(_.replaceAll("\"", ""))
        val categoryFeaturesArray = Array.ofDim[Double](categoriesMap.size)
        val categoryIdx = categoriesMap(fields(3))
        categoryFeaturesArray(categoryIdx) = 1
        val numericalFeatures = trFields.slice(4, fields.size - 1).map(d => if (d == "?") 0.0 else d.toDouble)
        val label = trFields(fields.size - 1).toInt
        LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ numericalFeatures))
      }
      val Array(trainData, validationData, testData) = labelPointRDD.randomSplit(Array(8, 1, 1))
      return (trainData, validationData, testData, categoriesMap)
    }

I wonder how to revise the code if there are several categorical features in the raw data. Let's say field(3), field(5), and field(7) are all categorical features.

I revised the first line:

    def prepareData(sc: SparkContext): (RDD[LabeledPoint], RDD[LabeledPoint], RDD[LabeledPoint], Map[String, Int], Map[String, Int], Map[String, Int]) = ......

Then I converted the other two fields to 1-of-K encoding, the same way it was done for field(3):

    val categoriesMap5 = lines.map(fields => fields(5)).distinct.collect.zipWithIndex.toMap
    val categoriesMap7 = lines.map(fields => fields(7)).distinct.collect.zipWithIndex.toMap
    val categoryFeaturesArray5 = Array.ofDim[Double](categoriesMap5.size)
    val categoryFeaturesArray7 = Array.ofDim[Double](categoriesMap7.size)
    val categoryIdx5 = categoriesMap5(fields(5))
    val categoryIdx7 = categoriesMap7(fields(7))
    categoryFeaturesArray5(categoryIdx5) = 1
    categoryFeaturesArray7(categoryIdx7) = 1

Finally, I revised the LabeledPoint and the return statement like this:

    LabeledPoint(label, Vectors.dense(categoryFeaturesArray ++ categoryFeaturesArray5 ++ categoryFeaturesArray7 ++ numericalFeatures))
    return (trainData, validationData, testData, categoriesMap, categoriesMap5, categoriesMap7)

Is this correct?

==================================================

The second problem I encountered: following the code in the book, trainModel uses

    DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity, maxDepth, maxBins)

Here is the code:

    def trainModel(trainData: RDD[LabeledPoint], impurity: String, maxDepth: Int, maxBins: Int): (DecisionTreeModel, Double) = {
      val startTime = new DateTime()
      val model = DecisionTree.trainClassifier(trainData, 2, Map[Int, Int](), impurity, maxDepth, maxBins)
      val endTime = new DateTime()
      val duration = new Duration(startTime, endTime)
      (model, duration.getMillis())
    }

The question is: how do I pass categoricalFeaturesInfo to this method if the data has the three categorical features mentioned previously?

I just want to follow the steps in the book to build a prediction system on my own using a decision tree. To be more specific, the data set I chose has several categorical features:

gender: male, female

education: hs-grad, bachelors, master, ph.d, ......

country: us, canada, england, australia, ......

But I don't know how to merge them into one single categoryFeatures ++ numericalFeatures to put into Vectors.dense(), and one single categoricalFeaturesInfo to put into DecisionTree.trainRegressor().

It is not clear to me what exactly you're doing here, but it looks wrong from the beginning.

Ignoring the fact that you're reinventing the wheel by implementing one-hot encoding from scratch, the whole point of encoding is to convert categorical variables to numerical ones. This is required for linear models, but arguably it doesn't make sense when working with decision trees.

Keeping that in mind, you have two choices:

  • Index the categorical fields without encoding, and pass the indexed features to categoricalFeaturesInfo.
  • One-hot-encode the categorical features and treat these as numerical variables.
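
The first choice could be sketched roughly as follows. This is a minimal plain-Scala illustration, not the book's code: the object and helper names (IndexCategoricalFields, buildIndexers, indexRow, categoricalFeaturesInfo) are hypothetical, the column positions 3, 5, and 7 are taken from the question, and an in-memory Seq stands in for the RDD (with Spark, the distinct values would come from something like lines.map(_(col)).distinct.collect).

```scala
object IndexCategoricalFields {
  // Assumption: fields 3, 5 and 7 are the categorical columns, as in the question.
  val categoricalCols: Seq[Int] = Seq(3, 5, 7)

  /** For each categorical column, map every distinct value to an index 0..k-1. */
  def buildIndexers(rows: Seq[Array[String]]): Map[Int, Map[String, Int]] =
    categoricalCols.map { col =>
      col -> rows.map(_(col)).distinct.sorted.zipWithIndex.toMap
    }.toMap

  /** Replace each categorical value by its index: one slot per categorical column. */
  def indexRow(row: Array[String], indexers: Map[Int, Map[String, Int]]): Array[Double] =
    categoricalCols.map(col => indexers(col)(row(col)).toDouble).toArray

  /** Position in the feature vector -> number of categories of that feature. */
  def categoricalFeaturesInfo(indexers: Map[Int, Map[String, Int]]): Map[Int, Int] =
    categoricalCols.zipWithIndex.map { case (col, pos) => pos -> indexers(col).size }.toMap
}
```

Under these assumptions, the feature vector would be built as Vectors.dense(indexRow(fields, indexers) ++ numericalFeatures), so the categorical features occupy positions 0, 1, and 2, and the resulting map would replace the empty Map[Int, Int]() in the call to DecisionTree.trainClassifier. Note that maxBins has to be at least as large as the largest number of categories in any categorical feature.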

I believe that the former approach is the right one. The latter should work in practice, but it just artificially increases dimensionality without providing any benefits. It may also be in conflict with the heuristics used by the Spark implementation.
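
To make the dimensionality point concrete, here is a small plain-Scala sketch of one-hot encoding several categorical fields at once. The object name OneHotWidth and the helper oneHot are hypothetical, and the value-to-index maps are assumed to have been built beforehand (for example with distinct.collect.zipWithIndex.toMap, as in the question).

```scala
object OneHotWidth {
  /** One-hot encode each value with its own indexer and concatenate the blocks. */
  def oneHot(values: Seq[String], indexers: Seq[Map[String, Int]]): Array[Double] =
    values.zip(indexers).flatMap { case (value, indexer) =>
      val block = Array.ofDim[Double](indexer.size) // one slot per category
      block(indexer(value)) = 1.0
      block.toSeq
    }.toArray
}
```

With three fields that have 2, 4, and 4 categories, this yields 10 feature columns, whereas plain indexing yields 3. If the features are one-hot encoded like this, the tree sees them as ordinary numerical columns, so categoricalFeaturesInfo stays empty.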

One way or another, you should consider using ML Pipelines, which provide all the required indexing, encoding, and merging tools.
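
A rough sketch of what such a pipeline might look like, assuming the data has been loaded into a DataFrame. The column names (gender, education, country, age, hours_per_week, label) and the object name TreePipelineSketch are hypothetical placeholders, not taken from the book or the question.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}

object TreePipelineSketch {
  // Hypothetical column names; replace them with the columns of your DataFrame.
  val categoricalCols: Seq[String] = Seq("gender", "education", "country")
  val numericalCols: Seq[String] = Seq("age", "hours_per_week")

  def build(): Pipeline = {
    // One StringIndexer per categorical column: string value -> category index.
    val indexers: Seq[PipelineStage] = categoricalCols.map { c =>
      new StringIndexer().setInputCol(c).setOutputCol(c + "_idx")
    }
    // Merge the indexed and the numerical columns into a single feature vector.
    val assembler = new VectorAssembler()
      .setInputCols((categoricalCols.map(_ + "_idx") ++ numericalCols).toArray)
      .setOutputCol("features")
    // maxBins must be at least the largest number of categories.
    val tree = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxBins(32)
    val stages: Array[PipelineStage] =
      (indexers ++ Seq[PipelineStage](assembler, tree)).toArray
    new Pipeline().setStages(stages)
  }
}
```

Calling TreePipelineSketch.build().fit(trainingDF) on a DataFrame with those columns would then train the model. The indexed columns carry metadata marking them as categorical, so no separate categoricalFeaturesInfo map is needed with this API.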

