Wednesday, June 28, 2017

Spark SQL and DataFrame links

http://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/


// construct an RDD[String] holding one JSON record
val events = sc.parallelize(
  """{"action":"create","timestamp":"2016-01-07T00:01:17Z"}""" :: Nil)

// read it into a DataFrame (spark-shell provides sc and sqlContext)
val df = sqlContext.read.json(events)


scala> df.show
+------+--------------------+
|action|           timestamp|
+------+--------------------+
|create|2016-01-07T00:01:17Z|
+------+--------------------+

scala> df.printSchema
root
 |-- action: string (nullable = true)
 |-- timestamp: string (nullable = true)
With an explicit schema, the timestamp column can be parsed as a real TimestampType rather than left as a string:

import org.apache.spark.sql.types._

val schema = (new StructType)
  .add("action", StringType)
  .add("timestamp", TimestampType)

val df = sqlContext.read.schema(schema).json(events)

df.show
// +------+--------------------+
// |action|           timestamp|
// +------+--------------------+
// |create|2016-01-07 01:01:...|
// +------+--------------------+

The same approach works for epoch timestamps. The next example reads them as LongType and throws in a few malformed variants to see how Spark handles them:

val events = sc.parallelize(
  """{"action":"create","timestamp":1452121277}""" ::
  """{"action":"create","timestamp":"1452121277"}""" ::
  """{"action":"create","timestamp":""}""" ::
  """{"action":"create","timestamp":null}""" ::
  """{"action":"create","timestamp":"null"}""" ::
  Nil
)

val schema = (new StructType).add("action", StringType).add("timestamp", LongType)

sqlContext.read.schema(schema).json(events).show

// +------+----------+
// |action| timestamp|
// +------+----------+
// |create|1452121277|
// |  null|      null|
// |create|      null|
// |create|      null|
// |  null|      null|
// +------+----------+
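
Note how the malformed records behave in the output above: a quoted number ("1452121277") or the string "null" nulls out the whole record, while an empty string or a JSON null keeps the action and only nulls the timestamp field.

A possible follow-up (my sketch, not from the linked post): turn the epoch seconds that did parse into a proper timestamp column with from_unixtime from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{col, from_unixtime}

// from_unixtime renders epoch seconds as "yyyy-MM-dd HH:mm:ss";
// casting the result gives a real TimestampType column
val withTs = sqlContext.read.schema(schema).json(events)
  .withColumn("ts", from_unixtime(col("timestamp")).cast(TimestampType))
withTs.show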

https://www.supergloo.com/fieldnotes/spark-sql-json-examples/

Sample data (baby_names.json):

[{
  "Year": "2013",
  "First Name": "DAVID",
  "County": "KINGS",
  "Sex": "M",
  "Count": "272"
}, {
  "Year": "2013",
  "First Name": "JAYDEN",
  "County": "KINGS",
  "Sex": "M",
  "Count": "268"
}, {
  "Year": "2013",
  "First Name": "JAYDEN",
  "County": "QUEENS",
  "Sex": "M",
  "Count": "219"
}, {
  "Year": "2013",
  "First Name": "MOSHE",
  "County": "KINGS",
  "Sex": "M",
  "Count": "219"
}, {
  "Year": "2013",
  "First Name": "ETHAN",
  "County": "QUEENS",
  "Sex": "M",
  "Count": "216"
}]

STEPS

1. Start the spark-shell from the same directory containing the baby_names.json file.
2. Load the JSON using the SparkContext wholeTextFiles method, which produces a PairRDD of (filename, file contents). Use map to create a new RDD from the value portion of each pair.
3. Read this RDD in as JSON and confirm the schema (see the sketch below).
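
A minimal sketch of those three steps, assuming the Spark 1.x shell (sc and sqlContext predefined) and the baby_names.json file above:

// step 2: wholeTextFiles yields (filename, contents) pairs; keep the contents
val babyNamesRdd = sc.wholeTextFiles("baby_names.json").map(_._2)

// step 3: read the RDD[String] as JSON and inspect the inferred schema
val babyNames = sqlContext.read.json(babyNamesRdd)
babyNames.printSchema
babyNames.show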


https://medium.com/@InDataLabs/converting-spark-rdd-to-dataframe-and-dataset-expert-opinion-826db069eb5
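
The Medium link above covers converting RDDs to DataFrames and Datasets; the usual case-class route looks roughly like this (a sketch, relying on the import sqlContext.implicits._ that spark-shell performs automatically):

// a case class gives the DataFrame its column names and types
case class BabyName(year: String, firstName: String, county: String, sex: String, count: String)

import sqlContext.implicits._

val namesDf = sc.parallelize(Seq(
  BabyName("2013", "DAVID", "KINGS", "M", "272"),
  BabyName("2013", "JAYDEN", "KINGS", "M", "268")
)).toDF()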

https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm
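
In the spirit of the DataFrame tutorial linked above, a few basic operations on the babyNames DataFrame built earlier (my sketch, not taken from the linked page):

babyNames.select("First Name", "Count").show
babyNames.filter(babyNames("County") === "KINGS").show
babyNames.groupBy("First Name").count.show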



