programming matrix: SPARK Dataframe filters

https://stackoverflow.com/questions/42951905/spark-dataframe-filter

Creating a test DataFrame (built from an in-memory Seq, not from JSON)

val df = sc.parallelize(Seq((1,"Emailab"), (2,"Phoneab"), (3, "Faxab"),(4,"Mail"),(5,"Other"),(6,"MSL12"),(7,"MSL"),(8,"HCP"),(9,"HCP12"))).toDF("c1","c2")

scala> df.show()
+---+-------+
| c1|     c2|
+---+-------+
|  1|Emailab|
|  2|Phoneab|
|  3|  Faxab|
|  4|   Mail|
|  5|  Other|
|  6|  MSL12|
|  7|    MSL|
|  8|    HCP|
|  9|  HCP12|
+---+-------+

scala> df.filter($"c2".like("HCP")).show()
+---+---+
| c1| c2|
+---+---+
|  8|HCP|
+---+---+

https://stackoverflow.com/questions/35759099/filter-spark-dataframe-on-string-contains

scala> df.filter($"c2".like("HC")).show()
+---+---+
| c1| c2|
+---+---+
+---+---+
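The empty result above is expected: like does SQL pattern matching against the whole string, so a pattern with no wildcards behaves like an equality test. A minimal sketch of the wildcard forms, assuming a local SparkSession (in spark-shell, getOrCreate just returns the existing session). The names prefixIds, suffixIds and oneCharIds are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("like-wildcards").getOrCreate()
import spark.implicits._

val df = Seq((1,"Emailab"), (2,"Phoneab"), (3,"Faxab"), (4,"Mail"), (5,"Other"),
             (6,"MSL12"), (7,"MSL"), (8,"HCP"), (9,"HCP12")).toDF("c1","c2")

// % matches any run of characters (including none): values starting with "HC"
val prefixIds = df.filter($"c2".like("HC%")).orderBy("c1").select("c1").as[Int].collect()

// values ending with "ab"
val suffixIds = df.filter($"c2".like("%ab")).orderBy("c1").select("c1").as[Int].collect()

// _ matches exactly one character: "HC" followed by a single character
val oneCharIds = df.filter($"c2".like("HC_")).orderBy("c1").select("c1").as[Int].collect()

println(prefixIds.mkString(","))   // 8,9
println(suffixIds.mkString(","))   // 1,2,3
println(oneCharIds.mkString(","))  // 8
```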

scala> df.filter($"c2".rlike("HC")).show()
+---+-----+
| c1|   c2|
+---+-----+
|  8|  HCP|
|  9|HCP12|
+---+-----+
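rlike takes a Java regular expression and matches if the pattern is found anywhere in the value, which is why "HC" hits both HCP and HCP12. To match the whole value, anchor the pattern with ^ and $. A small sketch under the same assumptions as the rest of these notes (local SparkSession, same test data); exactIds and digitIds are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("rlike-anchors").getOrCreate()
import spark.implicits._

val df = Seq((1,"Emailab"), (2,"Phoneab"), (3,"Faxab"), (4,"Mail"), (5,"Other"),
             (6,"MSL12"), (7,"MSL"), (8,"HCP"), (9,"HCP12")).toDF("c1","c2")

// Anchored on both ends: only the exact value "HCP"
val exactIds = df.filter($"c2".rlike("^HCP$")).orderBy("c1").select("c1").as[Int].collect()

// "MSL" or "HCP" followed by one or more digits, nothing else
val digitIds = df.filter($"c2".rlike("^(MSL|HCP)\\d+$")).orderBy("c1").select("c1").as[Int].collect()

println(exactIds.mkString(","))  // 8
println(digitIds.mkString(","))  // 6,9
```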

scala> df.filter(df("c2")==="HCP").show()
+---+---+
| c1| c2|
+---+---+
|  8|HCP|
+---+---+

scala> df.filter($"c2".contains("HCP")).show()
+---+-----+
| c1|   c2|
+---+-----+
|  8|  HCP|
|  9|HCP12|
+---+-----+
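contains has two siblings on Column, startsWith and endsWith, and all of these conditions can be combined with && and ||. A hedged sketch on the same test data, assuming a local SparkSession; the helper ids and the result names are hypothetical and only there to keep the example short.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Assumption: local mode; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("string-filters").getOrCreate()
import spark.implicits._

val df = Seq((1,"Emailab"), (2,"Phoneab"), (3,"Faxab"), (4,"Mail"), (5,"Other"),
             (6,"MSL12"), (7,"MSL"), (8,"HCP"), (9,"HCP12")).toDF("c1","c2")

// Hypothetical helper: collect the c1 values of a filtered frame in order
def ids(d: DataFrame): Seq[Int] = d.orderBy("c1").select("c1").as[Int].collect().toSeq

// Literal prefix and suffix tests (no wildcards or regex involved)
val startsMsl = ids(df.filter($"c2".startsWith("MSL")))
val endsAb    = ids(df.filter($"c2".endsWith("ab")))

// Conditions are Column expressions, so they combine with || and &&
val hcpOrMail = ids(df.filter($"c2".contains("HCP") || $"c2" === "Mail"))

println(startsMsl)  // 6, 7
println(endsAb)     // 1, 2, 3
println(hcpOrMail)  // 4, 8, 9
```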

https://www.tutorialspoint.com/spark_sql/spark_sql_dataframes.htm

Put this in employee.json, one JSON record per line. By default Spark's JSON reader expects this line-delimited format (JSON Lines) rather than a single standard JSON document: no commas between records and no square brackets around the list of records.

   {"id" : "1201", "name" : "satish", "age" : "25"}
   {"id" : "1202", "name" : "krishna", "age" : "28"}
   {"id" : "1203", "name" : "amith", "age" : "39"}
   {"id" : "1204", "name" : "javed", "age" : "23"}
   {"id" : "1205", "name" : "prudvi", "age" : "23"}
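To make the difference concrete, here is a sketch that writes both layouts to a scratch directory and reads each one. It assumes a local SparkSession and local file access, and it relies on the multiLine reader option, which as far as I know is available from Spark 2.2 onward. The temp directory, the file name employee_array.json and the variable names are made up for this example, and only two of the records are used to keep it short.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Assumption: local mode; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("json-formats").getOrCreate()

// Hypothetical scratch directory for the two sample files
val dir = Files.createTempDirectory("employee-json")

// Layout 1: one JSON record per line (what spark.read.json expects by default)
val jsonLines = Seq(
  """{"id" : "1201", "name" : "satish", "age" : "25"}""",
  """{"id" : "1202", "name" : "krishna", "age" : "28"}"""
).mkString("\n")
val linesFile = dir.resolve("employee.json")
Files.write(linesFile, jsonLines.getBytes(StandardCharsets.UTF_8))

// Layout 2: a standard JSON document, an array of records spread over several lines
val jsonArray =
  """[
    |  {"id" : "1201", "name" : "satish", "age" : "25"},
    |  {"id" : "1202", "name" : "krishna", "age" : "28"}
    |]""".stripMargin
val arrayFile = dir.resolve("employee_array.json")
Files.write(arrayFile, jsonArray.getBytes(StandardCharsets.UTF_8))

// Default reader for the line-delimited file
val fromLines = spark.read.json(linesFile.toUri.toString)

// multiLine tells the reader to parse each file as one JSON document
val fromArray = spark.read.option("multiLine", true).json(arrayFile.toUri.toString)

fromLines.show()
fromArray.show()
```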

val dfs = spark.read.json("employee.json")

dfs.printSchema()

dfs.select("name").show()

dfs.filter(dfs("age") > 23).show()

dfs.groupBy("age").count().show()

scala> val dfs = spark.read.json("employee.json")
dfs: org.apache.spark.sql.DataFrame = [age: string, id: string ... 1 more field]

scala> dfs.printSchema()
root
 |-- age: string (nullable = true)
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)

scala> dfs.select("name").show()
+-------+
|   name|
+-------+
| satish|
|krishna|
|  amith|
|  javed|
| prudvi|
+-------+

scala> dfs.filter(dfs("age") > 23).show()
+---+----+-------+
|age|  id|   name|
+---+----+-------+
| 25|1201| satish|
| 28|1202|krishna|
| 39|1203|  amith|
+---+----+-------+
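Note that printSchema reports age as a string, because the values are quoted in the file. The filter above still works since Spark implicitly casts the strings to a numeric type for the comparison, but an explicit cast makes the intent clearer. A minimal sketch, assuming a local SparkSession and rebuilding the same five records in memory instead of reading the file; typed, olderNames and ageCounts are illustrative names.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: local mode; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[*]").appName("cast-age").getOrCreate()
import spark.implicits._

// Same records as employee.json, with every field kept as a string
val dfs = Seq(
  ("1201", "satish", "25"), ("1202", "krishna", "28"), ("1203", "amith", "39"),
  ("1204", "javed", "23"), ("1205", "prudvi", "23")
).toDF("id", "name", "age")

// Replace the string column with an integer column of the same name
val typed = dfs.withColumn("age", $"age".cast("int"))
typed.printSchema()

// Numeric comparison on the typed column
val olderNames = typed.filter($"age" > 23).orderBy("id").select("name").as[String].collect()

// Count of employees per age, sorted so the output order is stable
val ageCounts = typed.groupBy("age").count().orderBy("age").as[(Int, Long)].collect()

println(olderNames.mkString(","))  // satish,krishna,amith
println(ageCounts.mkString(","))   // (23,2),(25,1),(28,1),(39,1)
```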

Thursday, June 29, 2017
