Wednesday, December 28, 2016
Apache Spark gpg
Browse to http://www.apache.org/dist/spark/spark-2.1.0/ (or the version you want) and download the .tgz and the corresponding .asc signature file:
$ wget http://www.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz
$ wget http://www.apache.org/dist/spark/spark-2.1.0/spark-2.1.0-bin-hadoop2.7.tgz.asc
$ wget http://www.apache.org/dist/spark/KEYS
$ gpg --import KEYS
$ gpg --verify spark-2.1.0-bin-hadoop2.7.tgz.asc spark-2.1.0-bin-hadoop2.7.tgz
gpg: Signature made Thu Dec 15 18:18:33 2016 PST using RSA key ID FC8ED089
gpg: Good signature from "Patrick Wendell " [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg: There is no indication that the signature belongs to the owner.
Primary key fingerprint: EEDA BD1C 71C5 48D6 F006 61D3 7C6C 105F FC8E D089
$ gpg --keyserver pgpkeys.mit.edu --recv-key FC8ED089
gpg: requesting key FC8ED089 from hkp server pgpkeys.mit.edu
gpg: key FC8ED089: "Patrick Wendell " not changed
gpg: Total number processed: 1
gpg: unchanged: 1
$ gpg --verify spark-2.1.0-bin-hadoop2.7.tgz.asc spark-2.1.0-bin-hadoop2.7.tgz
Monday, December 19, 2016
Adding an nltk stop word filter
$ python
>>> from nltk.corpus import stopwords
>>> nltk_stopwords = stopwords.words('english')
  except LookupError:
    raise e
LookupError:
**********************************************************************
  Resource u'corpora/stopwords' not found.  Please use the NLTK
  Downloader to obtain the resource:  >>> nltk.download()
  Searched in:
    - '/Users/depappas/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
**********************************************************************
Process finished with exit code 1
Install NLTK Data
This will take a few minutes if your Internet download speed is around 15 Mbps.
http://www.nltk.org/data.html
$ python
>>> import nltk
>>> nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
Click on the Corpora tab and select the stopwords package you want. Wait for it to download.
Now this should work...
$ python
Python 2.7.12 (default, Oct 11 2016, 05:20:59)
[GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import nltk
>>> nltk.download()
showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
True
>>> from nltk.corpus import stopwords
>>> nltk_stopwords = stopwords.words('english')
>>>
Done!
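The post stops once the stop word list loads; here is a minimal sketch of the actual filter (my own example, continuing the same session), dropping any token that appears in the NLTK English stop word list:
>>> nltk_stopwords = set(stopwords.words('english'))
>>> sentence = "this is a simple example of filtering out the stop words"
>>> [w for w in sentence.split() if w not in nltk_stopwords]
['simple', 'example', 'filtering', 'stop', 'words']
For real text you would tokenize and lowercase first, but the membership test against the set is the whole trick.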
Sunday, December 18, 2016
New Mac setup for deep learning, Golang, Scala, and Spark
http://brew.sh
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
Brew Install
brew install tree
brew install htop
brew install python
brew install wget
Apple command line tools (gcc)
https://developer.apple.com/download/more/
After the installation completes, run "gcc -v" in the terminal again; if everything is fine, the gcc version details will be displayed.
Dmg Downloads
brew install gnupg
gpg --keyserver pgpkeys.mit.edu --recv-key 83135D45
gpg --verify KeePassX-2.0-beta2.dmg.sig KeePassX-2.0-beta2.dmg
Verify downloads
All downloads and Git tags are signed with the key 164C70512F7929476764AB56FE22C6FD83135D45
Python Virtual Setup
http://www.marinamele.com/2014/05/install-python-virtualenv-virtualenvwrapper-mavericks.html
Languages
Python
http://www.marinamele.com/2014/05/install-python-virtualenv-virtualenvwrapper-mavericks.html
brew install python
pip install virtualenv
pip install virtualenvwrapper
# on OS X if you can’t install virtualenv with pip then use this workaround
pip install --index-url=http://pypi.python.org/simple/ --trusted-host pypi.python.org virtualenv
pip install --index-url=http://pypi.python.org/simple/ --trusted-host pypi.python.org virtualenvwrapper
# for the Homebrew installed path
~/.bash_profile : export PATH=/usr/local/share/python:$PATH
# Python virtualenv workon setup
export WORKON_HOME=~/.virtualenvs
source /usr/local/bin/virtualenvwrapper.sh
#Setup the certificates
http://programmingmatrix.blogspot.com/2016/10/pip-install-fails-with-connection-error.html
$ source ~/.bash_profile
$ cd
$ mkdir .virtualenvs
$ cd .virtualenvs
$ virtualenv test
$ workon test
# now you are using the pip in the .virtualenvs/test tree
# if pip install does not work upgrade pip
easy_install --upgrade pip
Deep Learning
http://programmingmatrix.blogspot.com/2016/12/osx-how-to-install-tensorflow-upgrade.html
https://www.tensorflow.org/get_started/os_setup
wget https://storage.googleapis.com/tensorflow/mac/cpu/tensorflow-0.12.0rc1-py2-none-any.whl
pip install --upgrade tensorflow-0.12.0rc1-py2-none-any.whl
pip install keras
pip install numpy
pip install scipy
pip install matplotlib
pip install gensim
pip install ioutils
pip install Cython
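A quick smoke test (my own suggestion, run inside the virtualenv you installed into) is to import each package and print its version; if any import fails, that install needs another look:
$ python
>>> import numpy, scipy, matplotlib, tensorflow, keras, gensim
>>> for m in (numpy, scipy, matplotlib, tensorflow, keras, gensim):
...     print(m.__name__ + " " + m.__version__)
...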
Java
http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html
https://docs.oracle.com/javase/8/docs/technotes/guides/install/mac_jdk.html
http://stackoverflow.com/questions/1348842/what-should-i-set-java-home-to-on-osx
~/.bash_profile : export JAVA_HOME="`/usr/libexec/java_home -v '1.8*'`"
Scala
http://scala-ide.org
https://www.scala-lang.org/download/
~/.bash_profile:
export SCALA_HOME=$HOME/scala-2.12.1
export SCALA_BIN=$SCALA_HOME/bin
export PATH=$PATH:$SCALA_BIN
Spark
http://spark.apache.org/downloads.html
http://programmingmatrix.blogspot.com/2016/10/set-up-apache-spark-keys-and-verifying.html
cd
mv Downloads/spark-2.0.2-bin-hadoop2.7.tar .
tar xvf spark-2.0.2-bin-hadoop2.7.tar
~/.bash_profile:
export SPARK_HOME=$HOME/spark
export SPARK_BIN=$SPARK_HOME/bin
export PATH=$PATH:$SPARK_BIN
ln -s spark-2.0.2-bin-hadoop2.7 spark
Golang
https://golang.org/dl/
Atom Editor
https://atom.io
http://programmingmatrix.blogspot.com/2016/03/atom-text-editor.html
Thursday, December 15, 2016
OSX: how to install Tensorflow: upgrade pip and use virtualenv
After you have set up and activated your virtualenv, if you run into this error on your Mac while trying to install TensorFlow, upgrade pip and retry the install.
Get the latest .whl file version from the virtualenv section of the following page:
https://www.tensorflow.org/get_started/os_setup#virtualenv_installation
pip install --upgrade tensorflow-0.12.0rc0-py2-none-any.whl
Unpacking ./tensorflow-0.12.0rc0-py2-none-any.whl
Downloading/unpacking protobuf==3.1.0 (from tensorflow==0.12.0rc0)
Could not find a version that satisfies the requirement protobuf==3.1.0 (from tensorflow==0.12.0rc0) (from versions: 3.0.0b4, 3.0.0, 3.0.0b2.post2, 3.0.0a2, 3.0.0b2, 2.6.1, 2.0.3, 2.0.0beta, 2.5.0, 2.4.1, 2.6.0, 3.0.0b2.post1, 3.0.0b3, 3.0.0b1.post2, 3.0.0b2.post2, 3.0.0b2.post1, 2.3.0, 3.0.0a3, 3.1.0.post1)
Cleaning up...
No distributions matching the version for protobuf==3.1.0 (from tensorflow==0.12.0rc0)
Now upgrade pip and reinstall tensorflow.
easy_install --upgrade pip
pip install --upgrade tensorflow
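Once the upgrade succeeds, a quick import check (my own suggestion, not part of the original post) confirms the wheel installed cleanly into the virtualenv:
$ python
>>> import tensorflow as tf
>>> print(tf.__version__)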
Monday, December 12, 2016
cuDNN not available: how to fix on Linux
Download cuDNN from NVIDIA
$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is enabled with initial size: 95.0% of memory, cuDNN not available)
Download the cuDNN for your OS
https://developer.nvidia.com/cudnn
http://deeplearning.net/software/theano/library/sandbox/cuda/dnn.html
Alternatively, on Linux, you can set the environment variables LD_LIBRARY_PATH, LIBRARY_PATH and CPATH to the directory extracted from the download. If needed, separate multiple directories with ":" as in the PATH environment variable. Example:
export LD_LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
export CPATH=/home/user/path_to_CUDNN_folder/include:$CPATH
export LIBRARY_PATH=/home/user/path_to_CUDNN_folder/lib64:$LD_LIBRARY_PATH
sudo mkdir /usr/local/cudnn
sudo cp -r cuda/* /usr/local/cudnn
add the following to your ~/.bashrc
export CUDNN_ROOT=/usr/local/cudnn
export LD_LIBRARY_PATH=$CUDNN_ROOT/lib64:$LD_LIBRARY_PATH
export CPATH=$CUDNN_ROOT/include:$CPATH
export LIBRARY_PATH=$CUDNN_ROOT/lib64:$LIBRARY_PATH
Fixed:
$ python
Python 2.7.12 (default, Jul 1 2016, 15:12:24)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import theano
Using gpu device 0: GeForce GTX 1060 6GB (CNMeM is enabled with initial size: 95.0% of memory, cuDNN 5105)
>>>
Tuesday, November 15, 2016
Parallelizing HTML Page Downloaders
Threading a Python HTML downloader does not result in a significant performance improvement.
Use Python asyncio (a.k.a. Tulip) or Twisted instead.
https://pawelmhm.github.io/asyncio/python/aiohttp/2016/04/22/asyncio-aiohttp.html
http://krondo.com/an-introduction-to-asynchronous-programming-and-twisted/
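The aiohttp article linked above boils down to something like the following sketch (my own minimal version, assuming Python 3.5+ and aiohttp installed; the URLs are placeholders). All requests share one client session and run concurrently on the event loop instead of blocking per page:
import asyncio
import aiohttp

urls = ["http://example.com/page1", "http://example.com/page2"]  # placeholder URLs

async def fetch(session, url):
    # download one page without blocking the event loop
    async with session.get(url) as response:
        return await response.text()

async def fetch_all(urls):
    # one shared session; gather schedules all downloads concurrently
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

pages = asyncio.get_event_loop().run_until_complete(fetch_all(urls))
print([len(p) for p in pages])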
Monday, November 14, 2016
Stochastic Gradient Descent: Momentum and Nesterov Momentum
http://www.christianherta.de/lehre/dataScience/machineLearning/neuralNetworks/nesterov-momentum.php
https://blogs.princeton.edu/imabandit/2013/02/05/orf523-advanced-optimization-introduction/
https://blogs.princeton.edu/imabandit/2015/06/30/revisiting-nesterovs-acceleration/
https://arxiv.org/pdf/1405.4980v2.pdf
https://en.wikipedia.org/wiki/Convex_hull
http://mathworld.wolfram.com/ConvexHull.html
https://en.wikipedia.org/wiki/Convex_combination
https://en.wikipedia.org/wiki/Tensor
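For reference, the two update rules the links above cover, written as a small numpy sketch (my own illustration; the learning rate, momentum value, and gradient function are placeholders):
import numpy as np

def momentum_step(w, v, grad, lr=0.01, mu=0.9):
    # classical momentum: accumulate a velocity, then step along it
    v = mu * v - lr * grad(w)
    return w + v, v

def nesterov_step(w, v, grad, lr=0.01, mu=0.9):
    # Nesterov momentum: evaluate the gradient at the look-ahead point w + mu*v
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

# toy example: minimize f(w) = ||w||^2, whose gradient is 2w
w, v = np.array([1.0, -2.0]), np.zeros(2)
for _ in range(100):
    w, v = nesterov_step(w, v, lambda x: 2 * x)
print(w)  # approaches [0, 0]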
Sunday, November 13, 2016
Installing CUDA on Ubuntu 16.04
Before you start the installation process, make sure you have ssh working so that you can ssh to the machine from another machine if the graphics UI gets stuck in a loop or a black screen appears.
===============================================================
http://kislayabhi.github.io/Installing_CUDA_with_Ubuntu/
alternatively:
https://gist.github.com/wangruohui/df039f0dc434d6486f5d4d098aa52d07#check-the-installation
REMEMBER: DO NOT FOLLOW THE NVIDIA INSTALLATION INSTRUCTIONS
Note that as of Nov. 2016 the latest version of CUDA is 8.0...
Ignore the error when the Nvidia installer complains and shows the abort message.
Accept the sym link creation option
Set the LD_LIBRARY_PATH in .bashrc to use the sym link /usr/local/cuda and not the versioned directory
There may be some missing lib errors during the compilation of the /usr/local/cuda/samples directory
Add the missing libs.
===============================================================
Before reinstalling CUDA remove the old versions
cd ~
rm -fr cuda
If you used apt-get/rpm to install a previous version, find the corresponding removal instructions elsewhere.
Use the following command to uninstall a Toolkit runfile installation:
$ sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl
Use the following command to uninstall a Driver runfile installation:
$ sudo /usr/bin/nvidia-uninstall
sudo apt-get purge nvidia*
# Note this might remove your cuda installation as well
sudo apt-get autoremove
=======================================================================
# under construction
# download cuda and cudaNN
===============================================================
Read more at: http://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html
Install cuDNN
https://developer.nvidia.com/rdp/cudnn-download
Download cuDNN v5.1 Library for Linux
Follow the installation instructions:
# from https://groups.google.com/forum/#!topic/theano-users/4qKbh5C_9e4
# First download the file. Example version: cudnn-7.5-linux-x64-v5.0-rc.tgz
Extract it to your home directory and set LD_LIBRARY_PATH to the extracted directory,
then follow the steps below, assuming that /usr/local/cuda is soft linked to the correct CUDA version:
sudo cp $HOME/cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp $HOME/cuda/lib64/libcudnn* /usr/local/cuda/lib64/
======================================================================================
sudo /usr/local/cuda-X.Y/bin/uninstall_cuda_X.Y.pl
sudo /usr/bin/nvidia-uninstall
cd ~
rm -fr cuda
sudo apt-get --purge remove nvidia-*
sudo apt-get purge nvidia*
sudo apt-get autoremove
sudo service lightdm stop
sudo apt-get install linux-headers-$(uname -r)
cd Downloads
./cuda_8.0.61_375.26_linux.run -extract=~/Downloads/nvidia_installers;
cd nvidia_installers
sudo ./NVIDIA-Linux-x86_64-367.48.run --no-opengl-files
sudo ./cuda-linux64-rel-8.0.44-21122537.run
sudo ./cuda-samples-linux-8.0.44-21122537.run
sudo update-initramfs -u
sudo reboot
sudo service lightdm start
========================================================================
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
sudo make
./deviceQuery
# you should see something like this
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060 6GB"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6072 MBytes (6367150080 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Result = PASS
=======================================================================
# Using Python virtual environment setup for deep learning
workon deep_learning
cd ~/Dropbox/cuda
THEANO_FLAGS='mode=FAST_RUN,device=gpu,floatX=float32,optimizer_including=cudnn' python gpu_test.py
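gpu_test.py itself is not shown in the post; a plausible stand-in is the standard GPU check from the Theano documentation, which times an elementwise exp() and reports whether the computation ran on the GPU (reproduced here as a sketch, not the author's exact file):
from theano import function, config, shared
import theano.tensor as T
import numpy
import time

vlen = 10 * 30 * 768  # 10 x #cores x # threads per core
iters = 1000

rng = numpy.random.RandomState(22)
x = shared(numpy.asarray(rng.rand(vlen), config.floatX))
f = function([], T.exp(x))
t0 = time.time()
for i in range(iters):
    r = f()
t1 = time.time()
print("Looping %d times took %f seconds" % (iters, t1 - t0))
# if any op in the compiled graph is a plain (CPU) Elemwise, the GPU was not used
if numpy.any([isinstance(node.op, T.Elemwise) for node in f.maker.fgraph.toposort()]):
    print('Used the cpu')
else:
    print('Used the gpu')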
=======================================================================
There may be some missing lib errors during the compilation of the /usr/local/cuda/samples directory
Add the missing libs.
sudo apt-get install freeglut3-dev
/usr/local/cuda/samples$ sudo make
[...]
Finished building CUDA samples
/usr/local/cuda/samples$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 367.48 Sat Sep 3 18:21:08 PDT 2016
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4)
/usr/local/cuda/samples$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2016 NVIDIA Corporation
Built on Sun_Sep__4_22:14:01_CDT_2016
Cuda compilation tools, release 8.0, V8.0.44
/usr/local/cuda/samples$ cd 1_Utilities/deviceQuery
/usr/local/cuda/samples/1_Utilities/deviceQuery$ ./deviceQuery
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 1060 6GB"
CUDA Driver Version / Runtime Version 8.0 / 8.0
CUDA Capability Major/Minor version number: 6.1
Total amount of global memory: 6069 MBytes (6363873280 bytes)
(10) Multiprocessors, (128) CUDA Cores/MP: 1280 CUDA Cores
GPU Max Clock rate: 1785 MHz (1.78 GHz)
Memory Clock rate: 4004 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 1572864 bytes
Maximum Texture Dimension Size (x,y,z) 1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
Maximum Layered 1D Texture Size, (num) layers 1D=(32768), 2048 layers
Maximum Layered 2D Texture Size, (num) layers 2D=(32768, 32768), 2048 layers
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
Max dimension size of a grid size (x,y,z): (2147483647, 65535, 65535)
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
Device supports Unified Addressing (UVA): Yes
Device PCI Domain ID / Bus ID / location ID: 0 / 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce GTX 1060 6GB
Saturday, October 29, 2016
Spark Dataframe encoders
http://stackoverflow.com/questions/36648128/how-to-store-custom-objects-in-a-dataset
http://stackoverflow.com/questions/39433419/encoder-error-while-trying-to-map-dataframe-row-to-updated-row
Independent of version you can simply use the DataFrame API:
import org.apache.spark.sql.functions.{when, lower}
val df = Seq(
(2012, "Tesla", "S"), (1997, "Ford", "E350"),
(2015, "Chevy", "Volt")
).toDF("year", "make", "model")
df.withColumn("make", when(lower($"make") === "tesla", "S").otherwise($"make"))
If you really want to use map you should use a statically typed Dataset:
import spark.implicits._
case class Record(year: Int, make: String, model: String)
df.as[Record].map {
case tesla if tesla.make.toLowerCase == "tesla" => tesla.copy(make = "S")
case rec => rec
}
or at least return an object which will have an implicit encoder:
df.map {
case Row(year: Int, make: String, model: String) =>
(year, if(make.toLowerCase == "tesla") "S" else make, model)
}
Finally, if for some completely crazy reason you really want to map over Dataset[Row], you have to provide the required encoder:
import org.apache.spark.sql.catalyst.encoders.RowEncoder
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
// Yup, it would be possible to reuse df.schema here
val schema = StructType(Seq(
StructField("year", IntegerType),
StructField("make", StringType),
StructField("model", StringType)
))
val encoder = RowEncoder(schema)
df.map {
case Row(year, make: String, model) if make.toLowerCase == "tesla" =>
Row(year, "S", model)
case row => row
} (encoder)