Hadoop Workflow Example


Part of the HPC Workloads series

Hadoop is a scalable, distributed computing framework provided by Apache. Like a queuing system, it farms work out across nodes, allowing large data sets to be processed in parallel.

This example sets up Hadoop on a single node and processes an example data set.

Installing & Running Hadoop

This example uses the flight environment, which must be activated before any environments can be created. Either run flight start in your session or configure your shell to activate the flight environment automatically.

If automatic activation has been configured in ~/.bashrc, start a new session for the change to take effect; otherwise the commands below will not be found.
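The steps might look like the following. This is a sketch, not a site-specific recipe: the Hadoop release number and download URL are examples only (check hadoop.apache.org for the current release), and your cluster may instead provide Hadoop through a module or package environment.

```shell
# Activate the flight environment (skip if it auto-activates via ~/.bashrc)
flight start

# Fetch and unpack a Hadoop binary release -- version/URL are examples
wget https://downloads.apache.org/hadoop/common/hadoop-3.3.6/hadoop-3.3.6.tar.gz
tar xzf hadoop-3.3.6.tar.gz

# Put the Hadoop tools on the PATH for this session
export HADOOP_HOME="$PWD/hadoop-3.3.6"
export PATH="$HADOOP_HOME/bin:$PATH"

# Confirm the install
hadoop version
```

Adding the two export lines to ~/.bashrc makes the install available in future sessions as well.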

Downloading the Hadoop Job

These steps set up the Hadoop environment and download a spreadsheet of data that Hadoop will aggregate into sales units per region.
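A sketch of this step is below. The download address is a placeholder (the real data set's URL will come from your course or site), so the block also generates a tiny stand-in CSV of region,units rows, letting the rest of the walkthrough be tried end to end without the original file.

```shell
# Hypothetical URL -- substitute the real address of the data set:
# wget https://example.com/sales-data.csv -O sales.csv

# Tiny stand-in data set in the same region,units shape
cat > sales.csv <<'EOF'
North,10
South,5
North,7
East,12
EOF
```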

Preparing the Hadoop Job
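One simple way to prepare a job like this is Hadoop Streaming, where the mapper and reducer are ordinary executables reading stdin and writing stdout. The script names and the awk implementations below are illustrative, assuming input rows of the form region,units; the final pipeline is a local dry run that simulates the MapReduce shuffle with sort(1).

```shell
# Mapper (hypothetical): turn "region,units" into tab-separated key/value pairs
cat > mapper.sh <<'EOF'
#!/bin/sh
awk -F, '{print $1 "\t" $2}'
EOF

# Reducer (hypothetical): sum the units for each region key
cat > reducer.sh <<'EOF'
#!/bin/sh
awk -F'\t' '{sum[$1] += $2} END {for (r in sum) print r "\t" sum[r]}'
EOF

chmod +x mapper.sh reducer.sh

# Local dry run on sample data; sort stands in for the MapReduce shuffle
printf 'North,10\nSouth,5\nNorth,7\n' | ./mapper.sh | sort | ./reducer.sh | sort
```

Because streaming scripts are plain filters, they can be debugged entirely outside Hadoop this way before being submitted to the cluster.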

Loading Data into Hadoop
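With a single-node HDFS running, loading data is a copy from the local filesystem into HDFS. The directory layout and the sales.csv filename below are illustrative; in Hadoop's standalone (local) mode, which reads local files directly, this step can be skipped.

```shell
# Create an input directory in HDFS for the current user
hdfs dfs -mkdir -p input

# Copy the local data file (hypothetical name) into HDFS
hdfs dfs -put sales.csv input/

# Confirm the file arrived
hdfs dfs -ls input
```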

Running the Hadoop Job
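A streaming job of this shape could be launched as follows. This assumes HADOOP_HOME points at the Hadoop install, executable mapper.sh and reducer.sh scripts (hypothetical names) sit in the working directory, and the input data has been loaded into HDFS under input/.

```shell
# Submit the streaming job; -files ships the scripts to the worker tasks
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files mapper.sh,reducer.sh \
    -input input/sales.csv \
    -output output \
    -mapper mapper.sh \
    -reducer reducer.sh

# Inspect the per-region totals written by the reducers
hdfs dfs -cat output/part-*
```

Note that Hadoop refuses to overwrite an existing output directory, so remove it (hdfs dfs -rm -r output) before re-running the job.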