Modes of handling corrupt data

Today, we deal with substantial amounts of data, and we cannot assume that every record is free from corruption. When reading text-based sources such as CSV and JSON, PySpark gives us three parse modes for handling corrupted records.

Let's delve into it!

  1. Permissive Mode -

    In this mode, PySpark keeps every row and assigns null values to the malformed fields of corrupted records while reading. This is suitable for scenarios where a few corrupted records will not hinder your ability to gain insights; a fuller sketch follows below.

     spark.read.option("mode", "permissive").csv("testData.csv")
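
    A useful companion feature in permissive mode is capturing the raw text of each malformed row for later inspection. Here is a hedged sketch: the columnNameOfCorruptRecord option and the _corrupt_record column are standard Spark features, while the file name and the id/name schema are illustrative assumptions.

     from pyspark.sql import SparkSession
     from pyspark.sql.types import StructType, StructField, IntegerType, StringType

     spark = SparkSession.builder.appName("permissive-demo").getOrCreate()

     # The schema includes a string column to hold the raw text of malformed rows.
     schema = StructType([
         StructField("id", IntegerType(), True),
         StructField("name", StringType(), True),
         StructField("_corrupt_record", StringType(), True),
     ])

     df = (spark.read
           .schema(schema)
           .option("mode", "PERMISSIVE")
           .option("columnNameOfCorruptRecord", "_corrupt_record")
           .csv("testData.csv"))

     # Cache before inspecting: Spark disallows queries on a raw CSV/JSON
     # source that reference only the internal corrupt-record column.
     df.cache()

     # Malformed rows have null data fields and the raw line preserved here.
     df.filter(df["_corrupt_record"].isNotNull()).show(truncate=False)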
    
  2. Drop Malformed Mode -

    This mode suits situations with stringent data-quality requirements where losing the corrupted rows is acceptable: PySpark silently drops any row containing a malformed record during the reading process. The sketch below shows how to measure what was dropped.

     spark.read.option("mode", "dropMalformed").json("testData.json")
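
    Because the drop happens silently, it is worth checking how many rows were lost. A minimal sketch, assuming a hypothetical testData.json read twice against the same illustrative schema:

     from pyspark.sql import SparkSession
     from pyspark.sql.types import StructType, StructField, IntegerType, StringType

     spark = SparkSession.builder.appName("dropmalformed-demo").getOrCreate()

     schema = StructType([
         StructField("id", IntegerType(), True),
         StructField("name", StringType(), True),
     ])

     # Read the same file in both modes; the difference in counts is the
     # number of malformed records that DROPMALFORMED silently discarded.
     permissive_df = (spark.read.schema(schema)
                      .option("mode", "PERMISSIVE")
                      .json("testData.json"))
     dropped_df = (spark.read.schema(schema)
                   .option("mode", "DROPMALFORMED")
                   .json("testData.json"))

     print(permissive_df.count() - dropped_df.count())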
    
  3. FailFast Mode -

    When we cannot afford to process any corrupted data, this mode fails fast: PySpark throws an exception as soon as it encounters a malformed record, surfacing the problem immediately instead of letting it propagate. A sketch of catching that failure follows below.

     spark.read.option("mode", "FAILFAST").csv("testData.csv")
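
    Since Spark reads lazily, the failure typically surfaces when an action forces the file to be parsed. A minimal sketch of catching it, with the file name and schema as illustrative assumptions:

     from pyspark.sql import SparkSession
     from pyspark.sql.types import StructType, StructField, IntegerType, StringType

     spark = SparkSession.builder.appName("failfast-demo").getOrCreate()

     schema = StructType([
         StructField("id", IntegerType(), True),
         StructField("name", StringType(), True),
     ])

     try:
         (spark.read.schema(schema)
          .option("mode", "FAILFAST")
          .csv("testData.csv")
          .count())  # the action triggers parsing; a malformed row raises here
     except Exception as e:
         print(f"Aborted on corrupt data: {e}")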
    

By default, PySpark is configured in Permissive mode, but we have the flexibility to select the appropriate mode based on our specific requirements.

Happy handling!