Monday, October 31, 2016

MkDocs

Sunday, October 30, 2016

NVidia GPU comparison for deep learning.

Saturday, October 29, 2016

Spark Dataframe encoders

http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset

http://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row

Independent of the version, you can simply use the DataFrame API:
import spark.implicits._  // for toDF and the $"..." column syntax (sqlContext.implicits._ on Spark 1.x)
import org.apache.spark.sql.functions.{when, lower}

val df = Seq(
  (2012, "Tesla", "S"), (1997, "Ford", "E350"),
  (2015, "Chevy", "Volt")
).toDF("year", "make", "model")

df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map, you should use a statically typed Dataset:
import spark.implicits._

case class Record(year: Int, make: String, model: String)

df.as[Record].map {
  case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
  case rec => rec
}
or at least return an object which will have an implicit encoder:
df.map {
  case Row(year: Int, make: String, model: String) => 
    (year, if(make.toLowerCase == "tesla") "S" else make, model)
}
Finally, if you really want to map over a Dataset[Row], you have to provide the required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
  StructField("year", IntegerType),
  StructField("make", StringType),
  StructField("model", StringType)
))

val encoder = RowEncoder(schema)

df.map {
  case Row(year, make: String, model) if make.toLowerCase == "tesla" => 
    Row(year, "S", model)
  case row => row
} (encoder)

Friday, October 28, 2016

How to fix pip install SSL certificate problems

On OS X, pip installs in a previously created Python virtualenv fail with certificate problems (see the error output below) after upgrading Python. The solution is to move the old virtualenv aside as a backup and recreate it.

https://hackercodex.com/guide/python-development-environment-on-mac-osx/

Then pip install will work in the new virtualenv. 

Don't waste time downloading certificates.

mkdir .virtualenvs/
cd .virtualenvs/
# recreate the virtualenv, then activate it (workon comes from virtualenvwrapper)
virtualenv python.7.6.11/
workon python.7.6.11

pip install certifi
Downloading/unpacking certifi
  Downloading certifi-2016.9.26.tar.gz (374kB): 374kB downloaded
  Running setup.py egg_info for package certifi
    
Installing collected packages: certifi
  Running setup.py install for certifi
    
Successfully installed certifi
Cleaning up...

(python.7.6.11)

If that does not work, then add the digital certificate to your system as described in the Stack Overflow post below.

http://stackoverflow.com/questions/25981703/pip-install-fails-with-connection-error-ssl-certificate-verify-failed-certi/28724886#28724886



$ cd ~

$ curl -sO http://cacerts.digicert.com/DigiCertHighAssuranceEVRootCA.crt 

$ openssl x509 -inform DER -in DigiCertHighAssuranceEVRootCA.crt -out DigiCertHighAssuranceEVRootCA.pem -text

$ export PIP_CERT=`pwd`/DigiCertHighAssuranceEVRootCA.pem

# test pip with the --cert argument

# Test pip in your local env by installing the package that was failing to install 
# without the certs

pip --cert $PIP_CERT install keras

# the package should have installed correctly

# add PIP_CERT to the appropriate file depending on your OS
# Add this line, replacing $HOME with the full path to the directory where you moved the cert file
export PIP_CERT=$HOME/DigiCertHighAssuranceEVRootCA.pem

# Linux
vi ~/.bashrc
source ~/.bashrc

# Mac OS X
vi ~/.bash_profile
source ~/.bash_profile

# check that the correct path to the cert is set in the env var
ls $PIP_CERT

pip --cert $PIP_CERT install keras

# the package should have installed correctly





Wednesday, October 19, 2016

Set up the Apache SPARK keys and verify your SPARK download

1. Go to this page, select your version of SPARK, and download it from a mirror
https://spark.apache.org/downloads.html

2. Download the MD5 and KEYS files associated with your version of SPARK
The links are in step 5 on the Apache SPARK download page.

https://www.apache.org/dist/spark/KEYS

Example MD5:
spark-2.0.1-bin-hadoop2.7.tgz.md5 


3. Run md5 (or md5sum) on the downloaded .tgz file and compare the sum to the value in the .md5 file

4. Import the KEYS file and use it to verify the tar.gz file
https://www.apache.org/dyn/closer.cgi/spark



$ gpg --import KEYS.txt
gpg: key 15E06093: public key "Andrew Or " imported
gpg: key 82667DC1: public key "Xiangrui Meng (CODE SIGNING KEY) " imported
gpg: key 00799F7E: public key "Patrick Wendell " imported
gpg: key FC8ED089: public key "Patrick Wendell " imported
gpg: key 87FD1A97: public key "Tathagata Das (CODE SIGNING KEY) " imported
gpg: key 9E4FE3AF: public key "Patrick Wendell " imported
gpg: Total number processed: 6
gpg:               imported: 6  (RSA: 6)

VERIFY THE INTEGRITY OF THE FILES

It is essential that you verify the integrity of the downloaded file using the PGP signature (.asc file) or a hash (.md5 or .sha file). Please read Verifying Apache Software Foundation Releases for more information on why you should verify our releases.
The PGP signature can be verified using PGP or GPG. First download the KEYS as well as the asc signature file for the relevant distribution. Make sure you get these files from the main distribution site, rather than from a mirror. Then verify the signatures using
% gpg --import KEYS
% gpg --verify downloaded_file.asc downloaded_file
or
% pgpk -a KEYS
% pgpv downloaded_file.asc
or
% pgp -ka KEYS
% pgp downloaded_file.asc
Alternatively, you can verify the MD5 hash on the file. A unix program called md5 or md5sum is included in many unix distributions. It is also available as part of GNU Textutils. Windows users can get binary md5 programs from third-party sites.

More help on verifying signatures is available on the Apache release verification page.



SPARK and AVRO

SBT reminder




The spark-avro package page has an example, "Avro Data Source for Apache Spark".
The examples on the wiki use an Avro file available for download here:
https://spark-packages.org/package/databricks/spark-avro
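
A minimal sketch of reading Avro into a DataFrame with the spark-avro data source, assuming a spark-shell session started with --packages com.databricks:spark-avro_2.10:2.0.1 (to match the Spark 1.6 / Scala 2.10 dependencies used further down); the file path and the username column are placeholders:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read the sample Avro file into a DataFrame via the spark-avro data source.
// Replace the path with wherever the downloaded sample file lives.
val users = sqlContext.read
  .format("com.databricks.spark.avro")
  .load("/tmp/users.avro")

// Same field the RDD-based fix below pulls out of each datum.
users.select("username").show()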



Fix for last line:
avroRDD.map(l => { new String(l._1.datum.get("username").toString()) } ).first




SPARK, PARQUET, and Kryo
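
A minimal Parquet round-trip sketch, assuming a spark-shell session (sc already defined); the output path is a placeholder:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Write a small DataFrame out as Parquet, then read it back and filter it.
val cars = Seq((2012, "Tesla", "S"), (2015, "Chevy", "Volt")).toDF("year", "make", "model")
cars.write.parquet("/tmp/cars.parquet")

val reloaded = sqlContext.read.parquet("/tmp/cars.parquet")
reloaded.filter($"year" > 2012).show()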


KryoSerializer

KryoRegistrator
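
A minimal sketch of wiring up KryoSerializer with a custom KryoRegistrator; the SensorReading class, app name, and local master are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.{KryoRegistrator, KryoSerializer}
import com.esotericsoftware.kryo.Kryo

// Hypothetical domain class, only here to have something to register.
case class SensorReading(id: Long, value: Double)

// Registering classes lets Kryo write compact class IDs instead of full class names.
class MyKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[SensorReading])
  }
}

object KryoExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("kryo-example")
      .setMaster("local[*]")
      .set("spark.serializer", classOf[KryoSerializer].getName)
      .set("spark.kryo.registrator", classOf[MyKryoRegistrator].getName)

    val sc = new SparkContext(conf)
    val readings = sc.parallelize(Seq(SensorReading(1L, 0.5), SensorReading(2L, 1.5)))
    println(readings.map(_.value).sum())
    sc.stop()
  }
}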


SPARK AVRO Packages

MISC.

Maven deps


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.6.1</version>
</dependency>

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.10</artifactId>
    <version>1.6.1</version>
</dependency>

<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.4.0</version>
</dependency>
Submitting apps to a cluster

val sparkVersion = "1.6.1"  // match the Spark version used in the Maven deps above

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % sparkVersion withSources(),
  "org.apache.spark" % "spark-sql_2.10" % sparkVersion withSources()
)
add this to project/assembly.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
and this to build.sbt
assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

http://stackoverflow.com/questions/34093715/scala-code-not-compiling-in-sbt
You should add the following dependency to your build.sbt:
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "1.4.0"
And in your Scala file, add the following import:
import org.apache.spark.{SparkConf, SparkContext}
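
As a quick check that spark-core and spark-mllib resolve and compile together, a minimal sketch (the object name and local master are illustrative):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors

object MllibSmokeTest {
  def main(args: Array[String]): Unit = {
    // Local SparkContext, just to exercise the dependencies end to end.
    val conf = new SparkConf().setAppName("mllib-smoke-test").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A trivial RDD of MLlib vectors exercises the spark-mllib dependency.
    val vectors = sc.parallelize(Seq(Vectors.dense(1.0, 2.0), Vectors.dense(3.0, 4.0)))
    println(vectors.count())

    sc.stop()
  }
}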