Wednesday, June 28, 2017

Spark SQL and DataFrame links

http://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/


// construct an RDD[String] holding one JSON record
val events = sc.parallelize(
  """{"action":"create","timestamp":"2016-01-07T00:01:17Z"}""" :: Nil)

// read it into a DataFrame (spark-shell provides sc and sqlContext)
val df = sqlContext.read.json(events)


scala> df.show
+------+--------------------+
|action|           timestamp|
+------+--------------------+
|create|2016-01-07T00:01:17Z|
+------+--------------------+

scala> df.printSchema
root
 |-- action: string (nullable = true)
 |-- timestamp: string (nullable = true)
With an explicit schema, the timestamp column can be parsed as a real TimestampType rather than left as a string:

import org.apache.spark.sql.types._

val schema = (new StructType)
  .add("action", StringType)
  .add("timestamp", TimestampType)

val df = sqlContext.read.schema(schema).json(events)

df.show
// +------+--------------------+
// |action|           timestamp|
// +------+--------------------+
// |create|2016-01-07 01:01:...|
// +------+--------------------+

The same approach works for epoch timestamps. The next example reads them as LongType and throws in a few malformed variants to see how Spark handles them:

val events = sc.parallelize(
  """{"action":"create","timestamp":1452121277}""" ::
  """{"action":"create","timestamp":"1452121277"}""" ::
  """{"action":"create","timestamp":""}""" ::
  """{"action":"create","timestamp":null}""" ::
  """{"action":"create","timestamp":"null"}""" ::
  Nil
)

val schema = (new StructType).add("action", StringType).add("timestamp", LongType)

sqlContext.read.schema(schema).json(events).show

// +------+----------+
// |action| timestamp|
// +------+----------+
// |create|1452121277|
// |  null|      null|
// |create|      null|
// |create|      null|
// |  null|      null|
// +------+----------+
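
Note how the malformed records behave in the output above: a quoted number ("1452121277") or the string "null" nulls out the whole record, while an empty string or a JSON null keeps the action and only nulls the timestamp field.

A possible follow-up (my sketch, not from the linked post): turn the epoch seconds that did parse into a proper timestamp column with from_unixtime from org.apache.spark.sql.functions:

import org.apache.spark.sql.functions.{col, from_unixtime}

// from_unixtime renders epoch seconds as "yyyy-MM-dd HH:mm:ss";
// casting the result gives a real TimestampType column
val withTs = sqlContext.read.schema(schema).json(events)
  .withColumn("ts", from_unixtime(col("timestamp")).cast(TimestampType))
withTs.show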

https://www.supergloo.com/fieldnotes/spark-sql-json-examples/

Sample data (baby_names.json):

[{
  "Year": "2013",
  "First Name": "DAVID",
  "County": "KINGS",
  "Sex": "M",
  "Count": "272"
}, {
  "Year": "2013",
  "First Name": "JAYDEN",
  "County": "KINGS",
  "Sex": "M",
  "Count": "268"
}, {
  "Year": "2013",
  "First Name": "JAYDEN",
  "County": "QUEENS",
  "Sex": "M",
  "Count": "219"
}, {
  "Year": "2013",
  "First Name": "MOSHE",
  "County": "KINGS",
  "Sex": "M",
  "Count": "219"
}, {
  "Year": "2013",
  "First Name": "ETHAN",
  "County": "QUEENS",
  "Sex": "M",
  "Count": "216"
}]

STEPS

1. Start the spark-shell from the same directory containing the baby_names.json file.
2. Load the JSON using the SparkContext wholeTextFiles method, which produces a PairRDD of (filename, file contents). Use map to create a new RDD from the value portion of each pair.
3. Read this RDD in as JSON and confirm the schema (see the sketch below).
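
A minimal sketch of those three steps, assuming the Spark 1.x shell (sc and sqlContext predefined) and the baby_names.json file above:

// step 2: wholeTextFiles yields (filename, contents) pairs; keep the contents
val babyNamesRdd = sc.wholeTextFiles("baby_names.json").map(_._2)

// step 3: read the RDD[String] as JSON and inspect the inferred schema
val babyNames = sqlContext.read.json(babyNamesRdd)
babyNames.printSchema
babyNames.show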


https://medium.com/@InDataLabs/converting-spark-rdd-to-dataframe-and-dataset-expert-opinion-826db069eb5
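
The Medium link above covers converting RDDs to DataFrames and Datasets; the usual case-class route looks roughly like this (a sketch, relying on the import sqlContext.implicits._ that spark-shell performs automatically):

// a case class gives the DataFrame its column names and types
case class BabyName(year: String, firstName: String, county: String, sex: String, count: String)

import sqlContext.implicits._

val namesDf = sc.parallelize(Seq(
  BabyName("2013", "DAVID", "KINGS", "M", "272"),
  BabyName("2013", "JAYDEN", "KINGS", "M", "268")
)).toDF()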

https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm
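
In the spirit of the DataFrame tutorial linked above, a few basic operations on the babyNames DataFrame built earlier (my sketch, not taken from the linked page):

babyNames.select("First Name", "Count").show
babyNames.filter(babyNames("County") === "KINGS").show
babyNames.groupBy("First Name").count.show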



