Wednesday, October 19, 2016

SPARK and AVRO

SBT reminder




The "Avro Data Source for Apache Spark" package (spark-avro) has examples on its wiki; the examples use an Avro file available for download from the package page:
https://spark-packages.org/package/databricks/spark-avro
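For reference, a minimal read through the spark-avro data source, Spark 1.x style; a sketch, assuming spark-avro is on the classpath (the app name and file path are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("avro-example"))
val sqlContext = new SQLContext(sc)

// spark-avro registers itself under the "com.databricks.spark.avro" format name.
val df = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("path/to/file.avro")
df.printSchema()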



Fix for the last line of the wiki example:
avroRDD.map(l => l._1.datum.get("username").toString).first
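For context, an avroRDD like the one above can be built with newAPIHadoopFile; a sketch, assuming a SparkContext sc as in the snippet earlier (the input path is a placeholder):

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Each element is an (AvroKey[GenericRecord], NullWritable) pair,
// so l._1.datum is the Avro record itself.
val avroRDD = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("path/to/file.avro")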




SPARK, PARQUET, and KRYO


KryoSerializer

KryoRegistrator
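A sketch of wiring these together (the record and registrator class names here are hypothetical):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}

// Hypothetical application class worth registering with Kryo.
case class MyRecord(id: Long, name: String)

// Registering classes up front avoids Kryo writing full class names with each record.
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}

val conf = new SparkConf()
  .setAppName("kryo-parquet-example")
  .set("spark.serializer", classOf[KryoSerializer].getName)
  .set("spark.kryo.registrator", classOf[MyRegistrator].getName)

Parquet itself needs no extra package here: sqlContext.read.parquet(...) and df.write.parquet(...) are built into Spark SQL in the 1.x line.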


SPARK AVRO Packages

MISC.

Maven deps


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
</dependency>

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.4.0</version>
</dependency>
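Since spark-csv is listed above, a minimal read with it (Spark 1.x API, assuming the sqlContext from the Avro example; the path is a placeholder):

val csvDF = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // first line contains column names
  .option("inferSchema", "true")  // guess column types from the data
  .load("path/to/data.csv")
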
Submitting apps to a cluster

val sparkVersion = "1.6.1"  // matches the Maven versions above

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion withSources(),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources()
)
Add this to project/assembly.sbt:

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")

and this to build.sbt:

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
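With the plugin in place, sbt assembly builds a fat jar that spark-submit can ship to the cluster. A sketch of the submit command (the main class, master URL, and jar path are all placeholders):

spark-submit \
  --class com.example.Main \
  --master spark://master-host:7077 \
  target/scala-2.10/my-app-assembly-1.0.jar

One note: the Spark artifacts themselves are usually marked "provided" in build.sbt so they are not bundled into the assembly, since the cluster already supplies them.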

http://stackoverflow.com/questions/34093715/scala-code-not-compiling-in-sbt
Add the following dependency to your build.sbt:

libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0"

and add the following import to your Scala file:

import org.apache.spark.{SparkConf, SparkContext}
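A minimal sketch putting the import to work (the app name and vector values are placeholders):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("mllib-example"))
// spark-mllib supplies the linalg types the compiler was missing.
val v = Vectors.dense(1.0, 2.0, 3.0)
println(v)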