In recent years, Apache Spark has emerged as one of the leading distributed data processing and analytics engines, operating seamlessly across both cloud and on-premise environments. Its highly scalable architecture and in-memory computing capabilities have made it a cornerstone for handling large-scale datasets in near real time, enabling businesses to derive insights quickly.
While Spark is widely celebrated for tackling complex tasks such as large-scale machine learning, graph processing, and real-time data streaming, it also excels at more routine, yet essential, data operations. From ETL pipelines and data cleansing to reporting and ad hoc analysis, Spark’s versatility ensures that it can handle tasks across the spectrum of data complexity.
Below, we’ll explore an example of how Apache Spark can be used to build a robust pipeline for generating CSV files containing pricing and product information for retail businesses.
This process assumes that the final output will require data integration from two different sources. Spark supports reading data from a wide array of formats, including CSV, JSON, Parquet, and text files, as well as accessing structured and semi-structured data stored in relational databases, Hive tables, and NoSQL systems like MongoDB or Cassandra. Spark’s DataFrame API simplifies data ingestion by providing a unified interface for handling these diverse data sources, enabling seamless extraction, transformation, and loading (ETL) workflows.
In our example we’ll take on the role of a warehouse creating product and pricing files for a number of retailers. Let’s assume that the input data will require a “retailers” table stored in a PostgreSQL relational database and two Hive tables – “product_pricing” and “product_information” – stored on a Hadoop cluster.
When it comes to Hive tables, Spark can use them seamlessly, provided Hive support is enabled in the SparkSession and the Hive metastore is configured correctly. We can then create a DataFrame from the selected table contents:
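Here is a minimal Scala sketch; the “warehouse” database name and the SparkSession settings are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Build a SparkSession with Hive support so Spark can reach the Hive metastore.
val spark = SparkSession.builder()
  .appName("RetailerPricingFiles")
  .enableHiveSupport()
  .getOrCreate()

// Read the two Hive tables into DataFrames.
// The "warehouse" database name is illustrative; use whichever database holds the tables.
val productPricingDF = spark.table("warehouse.product_pricing")
val productInformationDF = spark.table("warehouse.product_information")
```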

Accessing data from SQL databases such as PostgreSQL is also straightforward using the jdbc method. Provided the PostgreSQL JDBC driver is available on the classpath, we can configure the connection properties:
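A sketch of the connection setup; the URL, credentials, and database name are placeholders to adapt to your environment:

```scala
import java.util.Properties

// JDBC URL for the PostgreSQL instance; host, port and database are placeholders.
val jdbcUrl = "jdbc:postgresql://db-host:5432/warehouse"

// Connection properties, including the PostgreSQL driver class.
val connectionProperties = new Properties()
connectionProperties.setProperty("user", "warehouse_user")
connectionProperties.setProperty("password", "********")
connectionProperties.setProperty("driver", "org.postgresql.Driver")
```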

And then create a DataFrame with the table contents:
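Assuming the table is simply named “retailers”:

```scala
// Load the "retailers" table from PostgreSQL into a DataFrame.
val retailersDF = spark.read.jdbc(jdbcUrl, "retailers", connectionProperties)
```

For larger tables, the jdbc reader also accepts partitioning options so the read itself can be parallelized across the cluster.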

These straightforward steps give us access to all the input data required to generate the desired files.
With the data now ingested and prepared, we can leverage the full capabilities of Apache Spark, particularly Spark SQL, to construct the final dataset. By utilizing Spark’s distributed and parallel processing architecture, we can efficiently perform complex transformations, aggregations, and joins on large-scale datasets.
In our example we will first perform a simple join between productPricingDF and productInformationDF to create a DataFrame that holds all the product-specific information required for our file.
We’ll start by aliasing the two DataFrames to avoid ambiguity in the join:
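For example:

```scala
// Alias both DataFrames so identically named columns can be referenced unambiguously.
val pricing = productPricingDF.alias("pricing")
val information = productInformationDF.alias("information")
```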

This allows us to perform the join:
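A sketch of the join; the product_id key and the selected columns are assumptions about the table schemas:

```scala
import org.apache.spark.sql.functions.col

// Join pricing and product information on the assumed product_id key.
val productDF = pricing
  .join(information, col("pricing.product_id") === col("information.product_id"), "inner")
  .select(
    col("pricing.product_id"),
    col("information.product_name"),        // assumed column
    col("information.product_description"), // assumed column
    col("pricing.price")                    // assumed column
  )
```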

Next, we’ll perform a cross join between this DataFrame and our retailers DataFrame to create a DataFrame that contains every product_id for each retailer_id:
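Assuming we only need retailer_id from the retailers table:

```scala
// Pair every retailer with every product via a cross join.
val retailerProductsDF = retailersDF
  .select(col("retailer_id"))
  .crossJoin(productDF)
```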

This brings us to the final stage of the pipeline, where we can leverage Apache Spark’s powerful partitioning capabilities to efficiently generate separate CSV output for each retailer. By using the partitionBy method during the write operation, Spark automatically organizes the output data into distinct directories based on the specified partition column, in this case “retailer_id”.
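A sketch of the write, with a placeholder output path:

```scala
// Write CSV output partitioned by retailer_id; each retailer gets its own directory.
retailerProductsDF
  .write
  .mode("overwrite")
  .option("header", "true")
  .partitionBy("retailer_id")
  .csv("/data/output/retailer_pricing_files")
```

Note that Spark may still produce multiple part files within each retailer directory; repartitioning the DataFrame by retailer_id before the write is a common way to end up with a single file per retailer.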

This approach not only simplifies downstream processing but also improves query performance when reading the data later, as each retailer’s data is stored in its own partition. Additionally, Spark’s distributed architecture ensures that the partitioning process is both scalable and efficient, even when working with large datasets, by distributing the workload across the cluster nodes. The resulting files are well-structured, making them ideal for direct use in reporting, analytics, or further integration into other systems.