Monday, December 11, 2017

Scala/SPARK date string to timestamp conversion

https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/3696710289009770/4413065072037724/latest.html

https://stackoverflow.com/questions/29844144/better-way-to-convert-a-string-field-into-timestamp-in-spark

Notes:
  1. Dates are checked for validity. 2014/11/31 is an invalid date (November has only 30 days), so to_timestamp returns a null for that row.
  2. Note the year, month, day order in the format pattern. It must match the layout of the date strings in the input, otherwise a null is returned.
  3. $"date" refers to the date column of the dataframe "df".
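Notes 1 and 2 can be illustrated without Spark. This is a hedged sketch using plain java.time (which is not what Spark uses internally, but shows the same strict-validity and field-order behavior); "uuuu" is the proleptic year, required here because STRICT resolving of "yyyy" would also demand an era field:

```scala
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException, ResolverStyle}

// STRICT resolving rejects impossible dates instead of adjusting them
val fmt = DateTimeFormatter.ofPattern("uuuu/MM/dd HH:mm:ss")
  .withResolverStyle(ResolverStyle.STRICT)

// Mimic Spark's behavior of returning null (here: None) on a bad parse
def parse(s: String): Option[LocalDateTime] =
  try Some(LocalDateTime.parse(s, fmt))
  catch { case _: DateTimeParseException => None }

println(parse("2014/11/30 10:12:43")) // Some(2014-11-30T10:12:43)
println(parse("2014/11/31 12:40:32")) // None: November has only 30 days
println(parse("01/01/2014 23:00:01")) // None: field order does not match the pattern
```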

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Sample data: the last row holds an invalid date (November has 30 days)
val df = Seq(
  (1, "2014/01/01 23:00:01"),
  (1, "2016/12/29 09:54:00"),
  (1, "2016/05/09 10:12:43"),
  (1, "2014/11/30 10:12:43"),
  (1, "2014/11/31 12:40:32")).toDF("id", "date")

// to_timestamp is already in scope via functions._
val ts = to_timestamp($"date", "yyyy/MM/dd HH:mm:ss")

df.withColumn("ts", ts).show(5, false)

Now let's replace the date column with the converted timestamp column.

The first attempt below is not quite right: the converted column comes out with the unwieldy generated name to_timestamp(`date`, 'yyyy/MM/dd HH:mm:ss') instead of "date".


scala> val res = df.select($"id").show()
+---+
| id|
+---+
|  1|
|  1|
|  1|
|  1|
|  1|
+---+


scala> val res = df.select($"id", ts)
res: org.apache.spark.sql.DataFrame = [id: int, to_timestamp(`date`, 'yyyy/MM/dd HH:mm:ss'): timestamp]

scala> res.show()
+---+-------------------------------------------+
| id|to_timestamp(`date`, 'yyyy/MM/dd HH:mm:ss')|
+---+-------------------------------------------+
|  1|                        2014-01-01 23:00:01|
|  1|                        2016-12-29 09:54:00|
|  1|                        2016-05-09 10:12:43|
|  1|                        2014-11-30 10:12:43|
|  1|                                       null|
+---+-------------------------------------------+

.alias() is the key here: it renames the new column back to "date".


scala> val ts = to_timestamp($"date", "yyyy/MM/dd HH:mm:ss").alias("date") 
ts: org.apache.spark.sql.Column = to_timestamp(`date`, 'yyyy/MM/dd HH:mm:ss') AS `date`

scala> val res = df.select($"id", ts)
res: org.apache.spark.sql.DataFrame = [id: int, date: timestamp]

scala> res.show()
+---+-------------------+
| id|               date|
+---+-------------------+
|  1|2014-01-01 23:00:01|
|  1|2016-12-29 09:54:00|
|  1|2016-05-09 10:12:43|
|  1|2014-11-30 10:12:43|
|  1|               null|
+---+-------------------+
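As an alternative to select plus .alias(), withColumn can do the replacement directly: when the given column name already exists, withColumn overwrites it in place. A minimal sketch, assuming the same df as above (not run here):

```scala
// withColumn replaces an existing column when given its name,
// so the converted timestamp lands directly in "date"
val res = df.withColumn("date", to_timestamp($"date", "yyyy/MM/dd HH:mm:ss"))
// res: org.apache.spark.sql.DataFrame = [id: int, date: timestamp]
```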

