Saturday, October 29, 2016
Spark DataFrame encoders
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset
http://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row
Independent of the Spark version you can simply use the DataFrame API:

import org.apache.spark.sql.functions.{when, lower}

val df = Seq(
  (2012, "Tesla", "S"),
  (1997, "Ford", "E350"),
  (2015, "Chevy", "Volt")
).toDF("year", "make", "model")

df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use a statically typed Dataset:

import spark.implicits._

case class Record(year: Int, make: String, model: String)

df.as[Record].map {
  case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
  case rec => rec
}
or at least return an object which will have an implicit encoder:

df.map {
  case Row(year: Int, make: String, model: String) =>
    (year, if (make.toLowerCase == "tesla") "S" else make, model)
}
Finally, if for some completely crazy reason you really want to map over Dataset[Row] you have to provide the required encoder:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("make", StringType),
  StructField("model", StringType)
))

val encoder = RowEncoder(schema)

df.map {
  case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
    Row(year, "S", model)
  case row => row
}(encoder)
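As the comment in the snippet above notes, the encoder can also be built from the DataFrame's own schema instead of a redeclared one. A minimal sketch of that variant (the name schemaEncoder is just for illustration; df is the DataFrame from above):

import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Reuse the DataFrame's own schema instead of redeclaring it field by field
val schemaEncoder = RowEncoder(df.schema)
df.map(row => row)(schemaEncoder)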
Friday, October 28, 2016
How to fix pip install SSL certificate problems
On OS X, pip installs in a previously created Python virtualenv fail with certificate errors (see below) after upgrading Python. The solution is to move the old virtualenv aside as a backup and recreate it.
https://hackercodex.com/guide/python-development-environment-on-mac-osx/
Then pip install will work in the new virtualenv.
Don't waste time downloading certificates.
$ cd ~
mkdir .virtualenvs/
cd .virtualenvs/
virtualenv python.7.6.11/
workon python.7.6.11
pip install certifi
Downloading/unpacking certifi
Downloading certifi-2016.9.26.tar.gz (374kB): 374kB downloaded
Running setup.py egg_info for package certifi
Installing collected packages: certifi
Running setup.py install for certifi
Successfully installed certifi
Cleaning up...
(python.7.6.11)
If that does not work, then add the digital certificate to your system as described in the Stack Overflow post below.
http://stackoverflow.com/questions/25981703/pip-install-fails-with-connection-error-ssl-certificate-verify-failed-certi/28724886#28724886
$ cd ~
$ curl -sO http://cacerts.digicert.com/DigiCertHighAssuranceEVRootCA.crt
$ openssl x509 -inform DER -in DigiCertHighAssuranceEVRootCA.crt -out DigiCertHighAssuranceEVRootCA.pem -text
$ export PIP_CERT=`pwd`/DigiCertHighAssuranceEVRootCA.pem
# Test pip in your local env by installing the package that was failing
# to install without the certs
pip --cert $PIP_CERT install keras
# The package should now install correctly

# Add PIP_CERT to the appropriate startup file depending on your OS.
# Add this line, replacing $HOME with the full path to the directory
# where you moved the cert file:
# export PIP_CERT=$HOME/DigiCertHighAssuranceEVRootCA.pem
# Linux
vi ~/.bashrc
source ~/.bashrc

# Mac OS X
vi ~/.bash_profile
source ~/.bash_profile
# check that the correct path to the cert is set in the env var
ls $PIP_CERT
pip --cert $PIP_CERT install keras
# the package should have installed correctly
Thursday, October 27, 2016
Grafana, InfluxDB, and statsd
https://github.com/etsy/statsd/wiki
https://docs.influxdata.com/telegraf/v1.0/
https://docs.influxdata.com/influxdb/v1.0/concepts/key_concepts/
https://www.datadoghq.com/blog/statsd/
Influxdata downloads: download the Docker containers
https://www.influxdata.com/downloads/
Docker network connect
http://stackoverflow.com/questions/18460016/connect-from-one-docker-container-to-another#20217593
https://docs.docker.com/engine/userguide/networking/work-with-networks/#connect-containers
Grafana
http://grafana.org/
Choose a statsd server:
Docker images:
https://hub.docker.com/r/samuelebistoletti/docker-statsd-influxdb-grafana/
Docker files:
https://github.com/kamon-io/docker-grafana-influxdb
Example of mapping ports between the server and the Docker container:
https://hub.docker.com/r/samuelebistoletti/docker-statsd-influxdb-grafana/
Composing a Graphite server with Docker, showing how to map the ports in the Dockerfile:
https://thepracticalsysadmin.com/composing-a-graphite-server-with-docker/
Wednesday, October 19, 2016
Set up the Apache SPARK keys and verify your SPARK download
1. Go to this page, select your version of SPARK, and download it from a mirror
https://spark.apache.org/downloads.html
2. Download the MD5 and KEYS file associated with your version of SPARK
The links are in step 5 on the Apache SPARK download page.
https://www.apache.org/dist/spark/KEYS
Example MD5:
spark-2.0.1-bin-hadoop2.7.tgz.md5
3. md5 the downloaded tar.gz file and compare the sum to the expected md5 value
4. Import the KEYS file and use it to verify the tar.gz file
https://www.apache.org/dyn/closer.cgi/spark
$ gpg --import KEYS.txt
gpg: key 15E06093: public key "Andrew Or" imported
gpg: key 82667DC1: public key "Xiangrui Meng (CODE SIGNING KEY)" imported
gpg: key 00799F7E: public key "Patrick Wendell" imported
gpg: key FC8ED089: public key "Patrick Wendell" imported
gpg: key 87FD1A97: public key "Tathagata Das (CODE SIGNING KEY)" imported
gpg: key 9E4FE3AF: public key "Patrick Wendell" imported
gpg: Total number processed: 6
gpg: imported: 6 (RSA: 6)
VERIFY THE INTEGRITY OF THE FILES
It is essential that you verify the integrity of the downloaded file using the PGP signature (.asc file) or a hash (.md5 or .sha file). Please read Verifying Apache Software Foundation Releases for more information on why you should verify our releases.
The PGP signature can be verified using PGP or GPG. First download the KEYS as well as the .asc signature file for the relevant distribution. Make sure you get these files from the main distribution site, rather than from a mirror. Then verify the signatures using
% gpg --import KEYS
% gpg --verify downloaded_file.asc downloaded_file
or
% pgpk -a KEYS
% pgpv downloaded_file.asc
or
% pgp -ka KEYS
% pgp downloaded_file.asc
Alternatively, you can verify the MD5 hash on the file. A unix program called md5 or md5sum is included in many unix distributions. It is also available as part of GNU Textutils. Windows users can get binary md5 programs from here, here, or here.
More help on verifying signatures:
SPARK and AVRO
SBT reminder
Has an example: "Avro Data Source for Apache Spark"
The examples on the wiki use an Avro file available for download here:
Fix for last line:
avroRDD.map(l => { new String(l._1.datum.get("username").toString()) } ).first
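For context, here is a sketch of how such an avroRDD could be constructed with the Hadoop input format API before applying the fixed line above. The file name users.avro is hypothetical, and sc is an existing SparkContext:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read the Avro file as (AvroKey[GenericRecord], NullWritable) pairs
val avroRDD = sc.newAPIHadoopFile(
  "users.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// The fixed last line then extracts a field from the first record
avroRDD.map(l => new String(l._1.datum.get("username").toString)).first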
SPARK, PARQUET, KryoSerializer, KryoRegistrator
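A minimal sketch of wiring these two together, i.e. enabling KryoSerializer and registering application classes through a custom KryoRegistrator; the MyRecord and MyRegistrator names are hypothetical:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class to serialize with Kryo
case class MyRecord(id: Int, name: String)

// Custom registrator that tells Kryo about application classes
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")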
MISC.
Maven deps:
org.apache.spark:spark-core_2.10:1.6.1
org.apache.spark:spark-sql_2.10:1.6.1
com.databricks:spark-csv_2.10:1.4.0
Submitting apps to a cluster
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion withSources(),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources()
)
add this to project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
and this to build.sbt
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
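One common companion to this setup (an assumption here, not from the original notes) is to mark the Spark artifacts as provided so the assembly jar does not bundle Spark itself, since the cluster already supplies it:

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion % "provided",
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion % "provided"
)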
You should add to your build.sbt the following dependency:
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0"
And in your Scala file add the following import:
import org.apache.spark.{SparkConf, SparkContext}
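A minimal sketch tying the dependency and the import together; the app name and local master are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

val conf = new SparkConf().setAppName("mllib-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// A trivial MLlib call to confirm the dependency resolves
println(Vectors.dense(1.0, 2.0, 3.0))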