Write Excel document using Spark 1.x

Jörn Franke edited this page Sep 30, 2018 · 4 revisions

This is a Spark 1.x application that demonstrates some of the capabilities of the hadoopoffice library. It takes a set of CSV files as input and writes them out as an Excel file (.xlsx). It has been tested successfully with the HDP Sandbox VM 2.5, but other Hadoop distributions that support Spark should work equally well. You need at least Spark 1.5.

Getting an example CSV

You can create a CSV file yourself using Excel, LibreOffice or a text editor. Alternatively, you can download the CSV file that is used for unit testing of the hadoopoffice library by executing the following command:

wget --no-check-certificate https://raw.githubusercontent.com/ZuInnoTe/hadoopoffice/master/examples/scala-spark-exceloutput/src/it/resources/simplecsv.csv
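If you prefer to create the file by hand, a minimal shell sketch could look like the following. The file name `mysample.csv` and its values are hypothetical and not taken from the hadoopoffice repository, and the sketch assumes that the example accepts plain comma-separated values without quoting.

```shell
# Write a small CSV file with three rows and four columns.
# File name and values are made up for illustration only.
cat > mysample.csv <<'EOF'
1,2,3,4
5,6,7,8
9,10,11,12
EOF

# Print the file to check its content before uploading it.
cat mysample.csv
```

A file created this way can be uploaded in the same way as the downloaded one.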

You can put it on your HDFS cluster by executing the following commands:

hadoop fs -mkdir -p /user/spark/office/csv/input

hadoop fs -put ./simplecsv.csv /user/spark/office/csv/input

Once the file has been copied, you are ready to use the example.

Building the example

Note that the HadoopOffice library is available on Maven Central.

Clone the repository by executing:

git clone https://github.com/ZuInnoTe/hadoopoffice.git hadoopoffice

You can build the application by changing to the directory hadoopoffice/examples/scala-spark-exceloutput and using the following command:

sbt +clean +it:test +assembly

Running the example

Make sure that the output directory does not contain results from a previous run by removing it:

hadoop fs -rm -R /user/spark/office/excel/output
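Note that `hadoop fs -rm -R` reports an error if the path does not exist, for example on a first run. A minimal sketch of a guard, assuming a POSIX shell, is shown here. The function name `clean_output_dir` and the variable `OUTPUT_DIR` are hypothetical helpers introduced for illustration and are not part of the example application.

```shell
# Output directory used in this example.
OUTPUT_DIR=/user/spark/office/excel/output

# Hypothetical helper: remove the given HDFS directory only if it exists.
clean_output_dir() {
  # "hadoop fs -test -d" returns 0 if the path exists and is a directory.
  if hadoop fs -test -d "$1"; then
    hadoop fs -rm -R "$1"
  fi
}

# Usage: clean_output_dir "$OUTPUT_DIR"
```

Calling `clean_output_dir "$OUTPUT_DIR"` before submitting the job keeps the step safe to repeat.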

Execute the following command (make sure that you use the spark-submit of Spark 1.x):

spark-submit --class org.zuinnote.spark.office.example.excel.SparkScalaExcelOut ./example-ho-spark-scala-excelout.jar /user/spark/office/csv/input /user/spark/office/excel/output

After the Spark 1.x job has completed, you can download the Excel file to your local filesystem and view it in Excel or LibreOffice:

hadoop fs -copyToLocal /user/spark/office/excel/output/part-r-00000.xlsx
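Because an .xlsx file is a ZIP container, a quick sanity check is possible before opening it in a spreadsheet application. The following is a minimal sketch that assumes the `unzip` tool is installed; the function name `check_xlsx` is a hypothetical helper. It only verifies that the archive is readable and says nothing about the correctness of the cell content.

```shell
# Hypothetical helper: check that a file is a readable ZIP container,
# which every valid .xlsx file is.
check_xlsx() {
  # "unzip -tq" tests the archive integrity without extracting anything.
  if unzip -tq "$1" > /dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "not a readable xlsx/zip file: $1" >&2
    return 1
  fi
}

# Usage: check_xlsx part-r-00000.xlsx
```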

Other features

The HadoopOffice library documentation describes further configuration options, such as encryption, decryption, locale, metadata filter, linked workbooks and filtering by sheets.