http://blog.antlypls.com/blog/2016/01/30/processing-json-data-with-sparksql/
// construct RDD[String]
val events = sc.parallelize(
  """{"action":"create","timestamp":"2016-01-07T00:01:17Z"}""" :: Nil)
// read it
val df = sqlContext.read.json(events)
https://medium.com/@InDataLabs/converting-spark-rdd-to-dataframe-and-dataset-expert-opinion-826db069eb5
https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm
scala> df.show
+------+--------------------+
|action| timestamp|
+------+--------------------+
|create|2016-01-07T00:01:17Z|
+------+--------------------+
scala> df.printSchema
root
|-- action: string (nullable = true)
|-- timestamp: string (nullable = true)
import org.apache.spark.sql.types.{StructType, StringType, TimestampType}

val schema = (new StructType).add("action", StringType).add("timestamp", TimestampType)
val df = sqlContext.read.schema(schema).json(events)
df.show
// +------+--------------------+
// |action| timestamp|
// +------+--------------------+
// |create|2016-01-07 01:01:...|
// +------+--------------------+
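Once the DataFrame has a proper schema, it can be queried with SQL. A minimal follow-on sketch, assuming the Spark 1.x `sqlContext` and the `df` from above (the table name `events` is just an illustrative choice):

```scala
// register the DataFrame as a temporary table so it is visible to SQL
df.registerTempTable("events")

// the timestamp column is a real TimestampType now, so range
// predicates work as expected
sqlContext.sql(
  "SELECT action FROM events WHERE timestamp >= '2016-01-01'"
).show
```

In Spark 2.x the equivalent calls are `df.createOrReplaceTempView("events")` and `spark.sql(...)`.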
https://www.supergloo.com/fieldnotes/spark-sql-json-examples/
[{
"Year": "2013",
"First Name": "DAVID",
"County": "KINGS",
"Sex": "M",
"Count": "272"
}, {
"Year": "2013",
"First Name": "JAYDEN",
"County": "KINGS",
"Sex": "M",
"Count": "268"
}, {
"Year": "2013",
"First Name": "JAYDEN",
"County": "QUEENS",
"Sex": "M",
"Count": "219"
}, {
"Year": "2013",
"First Name": "MOSHE",
"County": "KINGS",
"Sex": "M",
"Count": "219"
}, {
"Year": "2013",
"First Name": "ETHAN",
"County": "QUEENS",
"Sex": "M",
"Count": "216"
}]
STEPS
1. Start the spark-shell from the directory containing the baby_names.json file
2. Load the JSON using the SparkContext's wholeTextFiles method, which produces a PairRDD. Use map to create a new RDD from the value portion of each pair.
3. Read that RDD as JSON and confirm the schema
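The first two steps can be sketched as follows (assuming baby_names.json sits in the directory the shell was launched from):

```scala
// wholeTextFiles returns a PairRDD of (filePath, fileContents);
// keep only the contents so each element is the complete JSON string
val jsonRDD = sc.wholeTextFiles("baby_names.json").map(pair => pair._2)
```

Reading the whole file as a single string matters here because the JSON is a multi-line array; the line-oriented `sc.textFile` would hand Spark fragments of JSON instead.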
scala> val namesJson = sqlContext.read.json(jsonRDD)
namesJson: org.apache.spark.sql.DataFrame = [Count: string, County: string, First Name: string, Sex: string, Year: string]
scala> namesJson.printSchema
root
|-- Count: string (nullable = true)
|-- County: string (nullable = true)
|-- First Name: string (nullable = true)
|-- Sex: string (nullable = true)
|-- Year: string (nullable = true)
scala>
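From here the DataFrame can be queried like any table. A hypothetical query, assuming the `namesJson` DataFrame above (note that every column was inferred as a string, so Count is cast before aggregating, and the space in "First Name" requires backticks):

```scala
// register the DataFrame as a temporary table for SQL access
namesJson.registerTempTable("names")

// total occurrences of each name across counties
sqlContext.sql(
  "SELECT `First Name`, SUM(CAST(Count AS INT)) AS total " +
  "FROM names GROUP BY `First Name`"
).show
```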