SBT reminder
Has an example: "Avro Data Source for Apache Spark"
The examples on the wiki use an Avro file available for download here:
Fix for the last line of that example:
avroRDD.map(l => l._1.datum.get("username").toString).first
SPARK, PARQUET,
KryoSerializer
KryoRegistrator
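A minimal sketch of how these two classes are usually wired together. The record class User, the registrator UserRegistrator and the helper kryoConf are hypothetical names made up for illustration, not something from these notes. The registrator lists the classes Kryo should know about, and the SparkConf switches the serializer and points at the registrator.

```scala
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical record type, for illustration only.
case class User(username: String, age: Int)

// Hypothetical registrator: tells Kryo which classes to register up front.
class UserRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[User])
  }
}

object KryoSetup {
  // Builds a SparkConf that uses Kryo instead of the default Java serialization.
  def kryoConf(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[UserRegistrator].getName)
}
```

Passing the registrator by classOf[...].getName rather than a hand-typed string keeps the setting correct if the class is later moved to another package.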
MISC.
Maven deps
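<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-core_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.10</artifactId>
  <version>1.6.1</version>
</dependency>
<dependency>
  <groupId>com.databricks</groupId>
  <artifactId>spark-csv_2.10</artifactId>
  <version>1.4.0</version>
</dependency>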
Submitting apps to a cluster
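// sparkVersion must be defined in build.sbt; 1.6.1 is assumed here to match the Maven deps above
val sparkVersion = "1.6.1"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion withSources(),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources()
)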
Add this to project/assembly.sbt:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
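And this to build.sbt:
assemblyMergeStrategy in assembly := {
  // drop META-INF entries, which commonly collide between jars
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // for any other duplicate path, keep the first copy found
  case x => MergeStrategy.first
}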
For MLlib, add the following dependency to build.sbt (the version should match the spark-core version you build against):
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0"
And add the following import to your Scala file:
import org.apache.spark.{SparkConf, SparkContext}
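A minimal sketch of an application built around that import, the kind of thing that would be packaged with sbt assembly and handed to spark-submit. The object name UsernameLengths, the helper totalLength and the sample data are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical application object; names and data are for illustration only.
object UsernameLengths {
  // Sums the lengths of the given names using an RDD.
  def totalLength(sc: SparkContext, names: Seq[String]): Int =
    sc.parallelize(names).map(_.length).reduce(_ + _)

  def main(args: Array[String]): Unit = {
    // The master is deliberately not set here so that spark-submit can supply it.
    val conf = new SparkConf().setAppName("UsernameLengths")
    val sc = new SparkContext(conf)
    try {
      val total = totalLength(sc, Seq("alice", "bob", "carol"))
      println(s"total characters: $total")
    } finally {
      // Always release the context, even if the job fails.
      sc.stop()
    }
  }
}
```

Leaving setMaster out of the code means the same jar can run locally or on a cluster without a rebuild; for a quick local run, a master such as local[2] can be passed on the spark-submit command line.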