# Spark Log Parser

The Spark Log Parser processes unmodified Apache Spark history server event logs, extracting their information into a compact format that can more readily be used to generate Sync predictions. See the user guides for information on where to find event logs. Related tools and their documentation, such as client_tools, may also be helpful.

Parsed logs contain metadata about your Apache Spark application's execution: the run time of each task, the amount of data read and written, the amount of memory used, and so on. These logs do not contain sensitive information such as the data your Apache Spark application is processing. Below is an example of the log parser's output.

[Image: Output of Log Parser]
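To illustrate the kind of task-level metadata described above, here is a simplified sketch, not the parser's actual implementation. The function name and output shape are assumptions for illustration; the JSON keys, however, are the ones Apache Spark's JSON event logs actually use for task-end events.

```python
import json

def extract_task_metrics(event_log_lines):
    """Hypothetical sketch: keep only non-sensitive task metrics
    (run time, bytes read/written, peak memory) from a Spark
    history server event log, one JSON event per line."""
    tasks = []
    for line in event_log_lines:
        event = json.loads(line)
        # Only task-completion events carry task metrics.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        info = event.get("Task Info", {})
        metrics = event.get("Task Metrics", {})
        tasks.append({
            "task_id": info.get("Task ID"),
            "run_time_ms": metrics.get("Executor Run Time"),
            "bytes_read": metrics.get("Input Metrics", {}).get("Bytes Read"),
            "bytes_written": metrics.get("Output Metrics", {}).get("Bytes Written"),
            "peak_memory": metrics.get("Peak Execution Memory"),
        })
    return tasks
```

Note that nothing from the application's actual data passes through: only timing, I/O sizes, and memory figures are retained.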

## Installation

Install the package in this repository into your Python 3 environment, e.g.

```shell
pip3 install https://github.com/synccomputingcode/spark_log_parser/archive/main.tar.gz
```

## Parsing your Spark logs

### Step 0: Generate the appropriate Apache Spark history server event log

If you have not already done so, follow the instructions to download the Apache Spark event log.

### Step 1: Parse the log to strip away sensitive information

1. To process a log file, run the `spark-log-parser` command with the path to the log file and a directory in which to store the result:

   ```shell
   spark-log-parser -l <log file location> -r <result directory>
   ```

   The parsed file, `parsed-<log file name>`, will appear in the result directory.

2. Send the parsed log to Sync Computing

   Email the parsed event log to Sync Computing, or upload it to the Sync Auto-tuner.