Thursday, April 20, 2017

SPARK RDD operations

How to print the contents of an RDD?

https://stackoverflow.com/questions/23173488/how-to-print-the-contents-of-rdd


If you want to view the contents of an RDD, one way is to use collect():
myRDD.collect().foreach(println)
That's not a good idea, though, when the RDD has billions of lines, because collect() brings every element back to the driver. Use take() to fetch just a few to print out:
myRDD.take(n).foreach(println)
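
For reference, the two calls above can be put into a small standalone program. This is only a sketch under some assumptions: it runs Spark in local mode, and the names PrintRddSketch and preview, along with the sample data, are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PrintRddSketch {
  // Hypothetical helper: fetches at most n elements to the driver.
  def preview[T](rdd: RDD[T], n: Int): Array[T] = rdd.take(n)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 2 threads, just for trying this out.
    val conf = new SparkConf().setAppName("print-rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val myRDD = sc.parallelize(1 to 1000000)

      // Only 5 elements travel to the driver, regardless of the RDD's size.
      preview(myRDD, 5).foreach(println)

      // By contrast, collect() would materialize all 1,000,000 elements
      // in driver memory before printing:
      // myRDD.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that calling myRDD.foreach(println) directly on a cluster would print on the executors rather than on the driver, which is why the answer goes through collect() or take() first.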

How to save the contents of an RDD to a single file?


If you want to save to a single file, you can coalesce your RDD into one partition before calling saveAsTextFile, but this may cause issues on large data, because a single task then has to write everything.
I think the best option is to write multiple files to HDFS, then use hdfs dfs -getmerge to merge them into one local file – Oussama Jul 21 '15 at 16:10
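
The coalesce approach from the comment might look roughly like the sketch below. It is an illustration under some assumptions, not a recommended implementation: Spark runs in local mode, output goes to a temporary directory on the local filesystem, and the names SingleFileSketch and saveAsSinglePart are hypothetical.

```scala
import java.io.File
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object SingleFileSketch {
  // Hypothetical helper: squeeze the RDD into one partition, then save it.
  // saveAsTextFile creates outDir itself and fails if it already exists.
  def saveAsSinglePart(rdd: RDD[String], outDir: String): Unit =
    rdd.coalesce(1).saveAsTextFile(outDir)

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, writing to the local filesystem.
    val conf = new SparkConf().setAppName("single-file-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      val myRDD = sc.parallelize(Seq("alpha", "beta", "gamma"), numSlices = 3)
      val outDir = Files.createTempDirectory("rdd-out").resolve("single").toString

      saveAsSinglePart(myRDD, outDir)

      // The result is still a directory; the data sits in one part file.
      new File(outDir).listFiles().map(_.getName).sorted.foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Even with one partition, saveAsTextFile produces a directory that holds a single part file plus a _SUCCESS marker, not a bare file. For the alternative in the comment, the RDD is saved without coalesce and the part files are merged afterwards with hdfs dfs -getmerge, which takes the HDFS source directory and a local destination file as arguments.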