PySpark Job Optimization Techniques - Part I

Apache Spark is one of the most widely adopted cluster computing frameworks for efficiently processing large volumes of complex data, enabling organizations to handle intricate data processing tasks quickly.

In this article, we will explore several job optimization techniques that enhance the performance of PySpark applications:

  1. Repartition and Coalesce

    Storing data in partitions helps optimize queries and lets Spark process the data in parallel across executors. When dealing with bigger datasets we often need to increase the number of partitions, which is done with repartition(); to reduce the number of partitions we use coalesce(). repartition() performs a full shuffle, while coalesce() merges existing partitions and avoids one, making it the cheaper choice when shrinking the partition count.
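
    A minimal sketch of both calls, assuming a SparkSession named spark and a hypothetical input file events.parquet:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("repartition-demo").getOrCreate()

    # Hypothetical input; any DataFrame behaves the same way.
    df = spark.read.parquet("events.parquet")
    print(df.rdd.getNumPartitions())   # current number of partitions

    # repartition() can increase (or decrease) the partition count; it triggers a full shuffle.
    df_wide = df.repartition(200)

    # coalesce() only reduces partitions and avoids a full shuffle,
    # which makes it cheaper when compacting data, e.g. before writing out.
    df_narrow = df_wide.coalesce(10)

    df_narrow.write.mode("overwrite").parquet("events_compacted.parquet")
    ```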

  2. Predicate PushDown and Projection PushDown

    When working with a large amount of data and performing time-consuming processing, it's important to be aware of techniques that can reduce the time required. Predicate Pushdown is one such technique, where "predicate" refers to the "where clause."

    Imagine a scenario with both a join and a where condition. In this case, two tables with a substantial amount of data will first be joined, and then the filtering specified in the where clause will be applied to that extensive dataset. This process can consume a significant amount of time and adversely affect query performance.

    In Predicate Pushdown, however, the filtering is executed before the join operation, which reduces the volume of data entering the expensive steps. Predicate Pushdown specifically optimizes the filtering process: filters are pushed down to the data source, so the entire dataset doesn't need to be scanned during query execution. This minimizes the amount of data that must be read from disk, transmitted across the network, or loaded into memory.

    While Predicate Pushdown is concerned with filtering rows, Projection Pushdown deals with optimizing the selection of columns, so only the columns a query actually needs are read from the source. This is especially effective with columnar formats such as Parquet and ORC.
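
    A small sketch of how this looks in practice, using a hypothetical Parquet dataset orders.parquet and hypothetical column names. Placing filter() and select() early lets Spark push them down to the Parquet reader, which you can typically verify in the physical plan:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("pushdown-demo").getOrCreate()

    orders = spark.read.parquet("orders.parquet")   # hypothetical dataset

    # Predicate pushdown: the filter is applied while reading the source,
    # instead of after the full table has been loaded (or joined).
    # Projection pushdown: select() limits the columns read from the Parquet files.
    result = (
        orders
        .filter(F.col("country") == "IN")
        .select("order_id", "customer_id", "amount")
    )

    # The physical plan usually shows PushedFilters and a pruned ReadSchema
    # when the filters and columns were pushed down to the data source.
    result.explain(True)
    ```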

  3. Persist and Cache

    Spark follows lazy evaluation, which means transformations on a DataFrame are only applied when we perform an action on it, such as viewing the DataFrame or writing it out. In some cases this is a disadvantage: every action re-executes the chain of transformations, so if that chain is complex, repeated access results in significant time overhead.

    By using the persist() and cache() functions, however, we can store intermediate computations and reuse them in subsequent actions. cache() stores the data at the default storage level, while persist() lets us choose the storage level explicitly, for example memory only, memory and disk, or disk only.
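
    A short sketch, assuming a hypothetical input file and an expensive chain of transformations:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.storagelevel import StorageLevel

    spark = SparkSession.builder.appName("cache-demo").getOrCreate()

    # Hypothetical expensive chain of transformations.
    df = (
        spark.read.parquet("events.parquet")   # hypothetical input
        .filter("status = 'ACTIVE'")
        .groupBy("country")
        .count()
    )

    # Option 1: cache() uses the default storage level.
    df.cache()
    df.count()     # the first action materializes the cached data
    df.show(5)     # later actions reuse it instead of recomputing the chain

    df.unpersist()   # release the storage when it is no longer needed

    # Option 2: persist() lets us pick the storage level explicitly.
    df.persist(StorageLevel.MEMORY_AND_DISK)
    df.count()
    df.unpersist()
    ```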

  4. Correct file format

    An ETL pipeline typically runs many Spark jobs: one job reads a file, another writes intermediate results for subsequent processing. When selecting a file format for this intermediate data, it's important to choose one that performs well. For example, the Avro and Parquet file formats perform better than CSV, text, or JSON.

    For more information on the performance of different file formats, check this link:

    Big Data Formats: Parquet vs ORC vs AVRO
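
    A small illustration of writing intermediate data as Parquet instead of CSV, using hypothetical paths and a hypothetical amount column; Parquet is columnar and compressed, so the downstream job reads far less data:

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("fileformat-demo").getOrCreate()

    # Hypothetical raw source.
    raw = spark.read.option("header", True).csv("raw/input.csv")

    cleaned = raw.dropDuplicates().filter("amount IS NOT NULL")

    # Write the intermediate result as Parquet rather than CSV:
    # the columnar layout and compression mean the next job scans less data
    # and can benefit from predicate/projection pushdown.
    cleaned.write.mode("overwrite").parquet("staging/cleaned.parquet")

    # The downstream job picks it up directly.
    staged = spark.read.parquet("staging/cleaned.parquet")
    ```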

  5. DataFrames over RDDs

    When working with PySpark jobs, prefer DataFrames over RDDs: DataFrames are faster because of Spark's Catalyst Optimizer. When you use DataFrames, Spark still executes the work on RDDs internally, but only after analyzing the query and building an optimized execution plan.
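
    A small comparison sketch of the same aggregation expressed with the RDD API and with the DataFrame API (the data and column names are hypothetical); the DataFrame version goes through the Catalyst Optimizer:

    ```python
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("df-vs-rdd").getOrCreate()

    data = [("IN", 10.0), ("US", 25.0), ("IN", 5.0)]

    # RDD API: hand-written map/reduce logic, no query optimization.
    rdd_totals = (
        spark.sparkContext.parallelize(data)
        .map(lambda row: (row[0], row[1]))
        .reduceByKey(lambda a, b: a + b)
    )
    print(rdd_totals.collect())

    # DataFrame API: the same aggregation, but Catalyst analyzes the query
    # and builds an optimized physical plan before executing it on RDDs.
    df = spark.createDataFrame(data, ["country", "amount"])
    df_totals = df.groupBy("country").agg(F.sum("amount").alias("total"))
    df_totals.show()
    df_totals.explain()   # inspect the optimized plan
    ```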

These are just a few techniques for PySpark optimization. We will cover some more in the second part of this article.

Happy Optimizing !!