Spark is an absolute necessity for files of this size. It is built to handle large datasets; with only base Python and pandas, we would not only run into memory issues, but the analysis would also take a very long time. Typically, pandas starts to struggle once observations reach the 1 to 10 million range. For this analysis, we are processing 233 million observations and roughly 30 GB of uncompressed data.
The file size and observation count for each month are visible below.
Given the gamut of changes and intricacies within Spark, and PySpark specifically, a lot of research and troubleshooting was needed to complete this analysis. One of the major changes to Spark in recent years was the unification of its entry points and improved accessibility. To use Spark, we first initiate the 'SparkSession', which we then use to gain access to the other APIs, such as the SQLContext or SparkContext. In older versions, these were disparate parts. These APIs are visible below.
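As a minimal sketch of that unified entry point (the application name here is a placeholder, not the exact value used in this analysis):

```python
from pyspark.sql import SparkSession

# One SparkSession is now the single entry point into Spark.
spark = (
    SparkSession.builder
    .appName("ecommerce-events")  # placeholder name
    .getOrCreate()
)

# The previously separate entry points are reachable from the session:
sc = spark.sparkContext            # the underlying SparkContext
df = spark.sql("SELECT 1 AS x")    # SQL queries, formerly via a separate SQLContext
```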
Before starting the analysis, we need to do some setup programming. This requires importing dependencies like pandas, NumPy, seaborn, and PySpark. The color scheme and the initiation of the Spark session are visible below.
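A sketch of what that setup cell might look like; the palette choice and memory setting are illustrative assumptions rather than the exact values used:

```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

# Plot color scheme (palette and style are assumptions for illustration).
sns.set_palette("viridis")
sns.set_style("whitegrid")

# Initiate the Spark session; the driver memory value is a placeholder
# and should be tuned to the machine handling the 30 GB workload.
spark = (
    SparkSession.builder
    .appName("ecommerce-behavior-analysis")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)
```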
The data I imported for this analysis comes in four separate files. I decided to import each into its own dataframe so that I could review the information on a per-month basis. Each import is visible in this section.
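A sketch of those four imports, assuming one CSV file per month; the file paths and the header/schema-inference options are assumptions:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read each month into its own dataframe so it can be reviewed independently.
# File names and paths are assumptions for illustration.
df_oct = spark.read.csv("data/2019-Oct.csv", header=True, inferSchema=True)
df_nov = spark.read.csv("data/2019-Nov.csv", header=True, inferSchema=True)
df_dec = spark.read.csv("data/2019-Dec.csv", header=True, inferSchema=True)
df_jan = spark.read.csv("data/2020-Jan.csv", header=True, inferSchema=True)
```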
When looking at the header information for each data file, it is important to note that each row represents an event by a user. Each event is recorded with a timestamp and an event classification; the possibilities are view, cart, or purchase. For this analysis, I am mainly concerned with the purchase event type.
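Continuing from the imports above, a sketch of inspecting the header and isolating purchases; the column name 'event_type' is assumed from the file header:

```python
# Inspect the schema and the first few rows of one month.
df_oct.printSchema()
df_oct.show(5, truncate=False)

# Keep only purchase events, since those are the focus of the analysis.
purchases_oct = df_oct.filter(df_oct.event_type == "purchase")
```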
In this section I quickly review the total record count for each month. The output is ordered by month (Oct-Nov-Dec-Jan). The counts range from roughly 42 million to 67 million events, with October having the fewest and December the most.
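A sketch of that per-month count, using the dataframes defined above and keeping the same Oct-Nov-Dec-Jan order:

```python
# Total event counts per month, printed in the order discussed.
monthly_dfs = {"Oct": df_oct, "Nov": df_nov, "Dec": df_dec, "Jan": df_jan}
for month, df in monthly_dfs.items():
    print(f"{month}: {df.count():,} total events")
```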