Saturday, October 29, 2016
Spark DataFrame encoders
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset
http://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row
Independent of the Spark version you can simply use the DataFrame API:

import org.apache.spark.sql.functions.{when, lower}

val df = Seq(
  (2012, "Tesla", "S"),
  (1997, "Ford", "E350"),
  (2015, "Chevy", "Volt")
).toDF("year", "make", "model")

df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use a statically typed Dataset:

import spark.implicits._

case class Record(year: Int, make: String, model: String)

df.as[Record].map {
  case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
  case rec => rec
}
or at least return an object which will have an implicit encoder:

df.map {
  case Row(year: Int, make: String, model: String) =>
    (year, if (make.toLowerCase == "tesla") "S" else make, model)
}
Finally, if for some completely crazy reason you really want to map over Dataset[Row] you have to provide the required encoder:

import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("make", StringType),
  StructField("model", StringType)
))

val encoder = RowEncoder(schema)

df.map {
  case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
    Row(year, "S", model)
  case row => row
}(encoder)
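As the comment in the snippet above notes, the encoder can also be built from the DataFrame's own schema instead of a redeclared one. A minimal sketch of that variant (the name schemaEncoder is just for illustration; df is the DataFrame from above):

import org.apache.spark.sql.catalyst.encoders.RowEncoder

// Reuse the DataFrame's own schema instead of redeclaring it field by field
val schemaEncoder = RowEncoder(df.schema)
df.map(row => row)(schemaEncoder)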
Friday, October 28, 2016
How to fix pip install SSL certificate problems
On OS X, pip installs in a previously created Python virtualenv fail with certificate errors (see below) after upgrading Python. The solution is to move the old virtualenv aside as a backup and recreate it.
https://hackercodex.com/guide/python-development-environment-on-mac-osx/
Then pip install will work in the new virtualenv.
Don't waste time downloading certificates.
$ cd ~
mkdir .virtualenvs/
cd .virtualenvs/
virtualenv python.7.6.11/
workon python.7.6.11
pip install certifi
Downloading/unpacking certifi
Downloading certifi-2016.9.26.tar.gz (374kB): 374kB downloaded
Running setup.py egg_info for package certifi
Installing collected packages: certifi
Running setup.py install for certifi
Successfully installed certifi
Cleaning up...
(python.7.6.11)
If that does not work, then add the digital certificate to your system as described in the Stack Overflow post below.
http://stackoverflow.com/questions/25981703/pip-install-fails-with-connection-error-ssl-certificate-verify-failed-certi/28724886#28724886
$ cd ~
$ curl -sO http://cacerts.digicert.com/DigiCertHighAssuranceEVRootCA.crt
$ openssl x509 -inform DER -in DigiCertHighAssuranceEVRootCA.crt -out DigiCertHighAssuranceEVRootCA.pem -text
$ export PIP_CERT=`pwd`/DigiCertHighAssuranceEVRootCA.pem
# Test pip in your local env by installing the package that was failing
# to install without the certs
pip --cert $PIP_CERT install keras
# The package should now install correctly

# Add PIP_CERT to the appropriate startup file depending on your OS.
# Add this line, replacing $HOME with the full path to the directory
# where you moved the cert file:
# export PIP_CERT=$HOME/DigiCertHighAssuranceEVRootCA.pem
# Linux
vi ~/.bashrc
source ~/.bashrc

# Mac OS X
vi ~/.bash_profile
source ~/.bash_profile
# check that the correct path to the cert is set in the env var
ls $PIP_CERT
pip --cert $PIP_CERT install keras
# the package should have installed correctly
Thursday, October 27, 2016
Grafana, InfluxDB, and statsd
https://github.com/etsy/statsd/wiki
https://docs.influxdata.com/telegraf/v1.0/
https://docs.influxdata.com/influxdb/v1.0/concepts/key_concepts/
https://www.datadoghq.com/blog/statsd/
Influxdata downloads: download the Docker containers
https://www.influxdata.com/downloads/
Docker network connect
http://stackoverflow.com/questions/18460016/connect-from-one-docker-container-to-another#20217593
https://docs.docker.com/engine/userguide/networking/work-with-networks/#connect-containers
Grafana
http://grafana.org/
Choose a statsd server:
Docker images:
https://hub.docker.com/r/samuelebistoletti/docker-statsd-influxdb-grafana/
Docker files:
https://github.com/kamon-io/docker-grafana-influxdb
Example of mapping ports between the server and the Docker container:
https://hub.docker.com/r/samuelebistoletti/docker-statsd-influxdb-grafana/
Composing a Graphite server with Docker, showing how to map the ports in the Dockerfile:
https://thepracticalsysadmin.com/composing-a-graphite-server-with-docker/
Wednesday, October 19, 2016
Set up the Apache SPARK keys and verify your SPARK download
1. Go to this page, select your version of SPARK, and download it from a mirror
https://spark.apache.org/downloads.html
2. Download the MD5 and KEYS file associated with your version of SPARK
The links are in step 5 on the Apache SPARK download page.
https://www.apache.org/dist/spark/KEYS
Example MD5:
spark-2.0.1-bin-hadoop2.7.tgz.md5
3. md5 the downloaded tar.gz file and compare the sum to the expected md5 value
4. Import the KEYS file and use it to verify the tar.gz file
https://www.apache.org/dyn/closer.cgi/spark
$ gpg --import KEYS.txt
gpg: key 15E06093: public key "Andrew Or" imported
gpg: key 82667DC1: public key "Xiangrui Meng (CODE SIGNING KEY)" imported
gpg: key 00799F7E: public key "Patrick Wendell" imported
gpg: key FC8ED089: public key "Patrick Wendell" imported
gpg: key 87FD1A97: public key "Tathagata Das (CODE SIGNING KEY)" imported
gpg: key 9E4FE3AF: public key "Patrick Wendell" imported
gpg: Total number processed: 6
gpg: imported: 6 (RSA: 6)
VERIFY THE INTEGRITY OF THE FILES
It is essential that you verify the integrity of the downloaded file using the PGP signature (.asc file) or a hash (.md5 or .sha file). Please read Verifying Apache Software Foundation Releases for more information on why you should verify our releases.
The PGP signature can be verified using PGP or GPG. First download the KEYS as well as the .asc signature file for the relevant distribution. Make sure you get these files from the main distribution site, rather than from a mirror. Then verify the signatures using
% gpg --import KEYS
% gpg --verify downloaded_file.asc downloaded_file
or
% pgpk -a KEYS
% pgpv downloaded_file.asc
or
% pgp -ka KEYS
% pgp downloaded_file.asc
Alternatively, you can verify the MD5 hash on the file. A unix program called md5 or md5sum is included in many unix distributions. It is also available as part of GNU Textutils. Windows users can get binary md5 programs from here, here, or here.
More help on verifying signatures:
SPARK and AVRO
SBT reminder
Has an example: "Avro Data Source for Apache Spark"
The examples on the wiki use an Avro file available for download here:
Fix for last line:
avroRDD.map(l => { new String(l._1.datum.get("username").toString()) } ).first
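For context, here is a sketch of how such an avroRDD could be constructed with the Hadoop input format API before applying the fixed line above. The file name users.avro is hypothetical, and sc is an existing SparkContext:

import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyInputFormat
import org.apache.hadoop.io.NullWritable

// Read the Avro file as (AvroKey[GenericRecord], NullWritable) pairs
val avroRDD = sc.newAPIHadoopFile(
  "users.avro",
  classOf[AvroKeyInputFormat[GenericRecord]],
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable])

// The fixed last line then extracts a field from the first record
avroRDD.map(l => new String(l._1.datum.get("username").toString)).first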
SPARK, PARQUET, KryoSerializer, KryoRegistrator
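A minimal sketch of wiring these two together, i.e. enabling KryoSerializer and registering application classes through a custom KryoRegistrator; the MyRecord and MyRegistrator names are hypothetical:

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator

// Hypothetical application class to serialize with Kryo
case class MyRecord(id: Int, name: String)

// Custom registrator that tells Kryo about application classes
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[MyRecord])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "MyRegistrator")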
MISC.
Maven deps:
org.apache.spark:spark-core_2.10:1.6.1
org.apache.spark:spark-sql_2.10:1.6.1
com.databricks:spark-csv_2.10:1.4.0
Submitting apps to a cluster
libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion withSources(),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources()
)
add this to project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
and this to build.sbt
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}
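One common companion to this setup (an assumption here, not from the original notes) is to mark the Spark artifacts as provided so the assembly jar does not bundle Spark itself, since the cluster already supplies it:

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion % "provided",
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion % "provided"
)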
You should add to your build.sbt the following dependency:
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0"
And in your Scala file add the following import:
import org.apache.spark.{SparkConf, SparkContext}
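A minimal sketch tying the dependency and the import together; the app name and local master are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

val conf = new SparkConf().setAppName("mllib-example").setMaster("local[*]")
val sc = new SparkContext(conf)

// A trivial MLlib call to confirm the dependency resolves
println(Vectors.dense(1.0, 2.0, 3.0))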