Change datatype of a column in a Spark RDD to date and query on it
By default, when loading the data, every column is considered to be of string type. The data looks like:
firstname,lastname,age,doj
dileep,gog,21,2016-01-01
avishek,ganguly,21,2016-01-02
shreyas,t,20,2016-01-03
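(For context, a minimal sketch of how such a CSV is typically loaded with plain Spark 1.x APIs; the file name "people.csv" and the header-skipping step are assumptions for illustration, not code from the question. Every field arrives as a String:)

import sqlContext.implicits._  // needed for toDF outside spark-shell

// Hypothetical loading step: read the file, drop the header, split each line.
val raw = sc.textFile("people.csv")
val header = raw.first()
val loaded = raw.filter(_ != header)
  .map(_.split(","))
  .map(a => (a(0), a(1), a(2), a(3)))   // every field is still a String here
  .toDF("firstname", "lastname", "age", "doj")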
After updating the schema of the RDD, it looks like:
temp.printSchema
 |-- firstname: string (nullable = true)
 |-- lastname: string (nullable = true)
 |-- age: string (nullable = true)
 |-- doj: date (nullable = true)
I registered a temporary table and queried on it:
temp.registerTempTable("temptable")
val temp1 = sqlContext.sql("select * from temptable")
temp1.show()

+---------+--------+---+----------+
|firstname|lastname|age|       doj|
+---------+--------+---+----------+
|   dileep|     gog| 21|2016-01-01|
|  avishek| ganguly| 21|2016-01-02|
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+

val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")
But when I try to see the result, it gives me:
temp2: org.apache.spark.sql.DataFrame = [firstname: string, lastname: string, age: string, doj: date]
And when I do:
temp2.show()
java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
So I have tried your code and it works for me. I suspect the problem is in how you change the schema initially; it looks a bit off to me (granted, it's a little hard to read when you post it in a comment; you should update the question with the code instead).
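For illustration, here is a guess at the kind of schema change that produces exactly this error: stamping a new StructType onto the existing rows without converting the values. Spark stores DateType internally as an integer (days since the epoch), and because evaluation is lazy, nothing fails until show() materializes the rows and the String "2016-01-01" cannot be cast. This is a hypothetical reconstruction, not the asker's actual code:

import org.apache.spark.sql.types._

// Hypothetical mistake: the schema declares doj as DateType,
// but the underlying rows still contain Strings.
val schema = StructType(Seq(
  StructField("firstname", StringType, nullable = true),
  StructField("lastname",  StringType, nullable = true),
  StructField("age",       StringType, nullable = true),
  StructField("doj",       DateType,   nullable = true)  // rows still hold Strings
))
val broken = sqlContext.createDataFrame(df.rdd, schema)  // no value conversion happens here
broken.printSchema()  // the schema looks correct...
broken.show()         // ...but throws ClassCastException when rows are materialized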
Anyway, I have done it this way:
First, simulating your input:
val df = sc.parallelize(List(
  ("dileep", "gog", "21", "2016-01-01"),
  ("avishek", "ganguly", "21", "2016-01-02"),
  ("shreyas", "t", "20", "2016-01-03")
)).toDF("firstname", "lastname", "age", "doj")
Then:
import org.apache.spark.sql.functions._

val temp = df.withColumn("doj", to_date('doj))
temp.registerTempTable("temptable")
val temp2 = sqlContext.sql("select * from temptable where doj > cast('2016-01-02' as date)")
Doing temp2.show() reveals the expected result:
+---------+--------+---+----------+
|firstname|lastname|age|       doj|
+---------+--------+---+----------+
|  shreyas|       t| 20|2016-01-03|
+---------+--------+---+----------+
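As a side note, the same filter can be expressed without registering a table, through the DataFrame API. This is a sketch of an equivalent alternative (the name temp3 is mine), not something the answer above used:

// Equivalent filter using the DataFrame API instead of SQL:
import org.apache.spark.sql.functions._
val temp3 = temp.filter(col("doj") > lit("2016-01-02").cast("date"))
temp3.show()  // same single-row result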