Hadoop on Google Cloud Platform
  • Apache Hadoop
    • The open source standard runs smoothly on Google Compute Engine instances.
    • Run Hadoop along with your favorite tools from the vibrant Hadoop community.
  • Quick startup times
  • Run at scale
    • Add more CPUs to your computations without worrying about breaking the bank. Per-minute billing lets you optimize for scale and speed.
Try it now
  1. Sign up

    1. If you don't already have one, sign up for a Google account.
    2. Create a project with Google Compute Engine and Google Cloud Storage enabled via the Google Developers Console; make sure the Google Cloud Storage JSON API is also enabled.
  2. Install the Cloud SDK

    System requirements:

    • Python 2.6.x or 2.7.x.
    • Cygwin [Windows only].

    gcloud is distributed as part of the Cloud SDK, which contains tools and libraries for managing resources on Google Cloud Platform.

    Installing on Linux or Mac OS X


    1. Download and install the Cloud SDK.

      You can download and install the Cloud SDK using the following command:

      $ curl https://sdk.cloud.google.com | bash

      Alternatively, if you do not want to use curl, you can download and unzip the package manually:

      1. Download google-cloud-sdk.zip
      2. Unzip the file:
        $ unzip google-cloud-sdk.zip
      3. Run the installation script:
        $ ./google-cloud-sdk/install.sh

      Follow the prompts to complete the setup. When prompted if you would like to update your system path, select y.

    2. Restart your terminal to allow changes to your PATH to take effect.

      You can also run source ~/.<bash-profile-file> if you want to avoid restarting your terminal.

    3. Authenticate to the Cloud Platform by running:
      $ gcloud auth login
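
    Before moving on, it can help to confirm that the installed tools are reachable from your shell. The following is a minimal sketch, assuming a POSIX shell; require_tool is a hypothetical helper name introduced here and is not part of the Cloud SDK.

```shell
# Hypothetical helper: report whether a command is available on PATH.
require_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "found: $1"
  else
    echo "missing: $1" >&2
    return 1
  fi
}

# Example: check the tools this guide relies on.
# require_tool gcloud && require_tool gsutil
```

    If either tool is reported missing, restarting the terminal or re-sourcing your profile file, as described in step 2, is the first thing to try.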

    Installing on Windows with Cygwin


    1. Download and install Cygwin.

      Cygwin's website contains installation instructions. While installing Cygwin, be sure to select openssh, curl, and the latest 2.6.x or 2.7.x version of python from the package selection screen.

    2. Start Cygwin.

      By default, you can launch Cygwin by going to Start -> All Programs -> Cygwin -> Cygwin Terminal.

    3. Download the Cloud SDK and install it.

      You can download and install the Cloud SDK by issuing the following commands from Cygwin:

      $ curl https://sdk.cloud.google.com | bash

      Alternatively, if you don't want to use curl, you can always download and unzip the package manually:

      1. Download google-cloud-sdk.zip.
      2. Unzip the file by right-clicking on it and selecting Extract all.
      3. Run the installation script by clicking on the install.bat file.

      Follow the prompts to complete the setup. When prompted if you would like to update your system path, select y.

    4. Restart Cygwin (or cmd).
    5. Authenticate to the Cloud Platform by running:
      $ gcloud auth login
  3. Obtain the setup scripts

    Download the setup scripts in zip format or tar.gz format, unzip/untar the archive, and navigate to the newly-created bdutil-* directory.

    For example, on Linux or Mac OS X, the following commands will accomplish this.

    $ curl https://storage.googleapis.com/hadoop-tools/bdutil/bdutil-latest.tar.gz > $HOME/bdutil-latest.tar.gz
    $ tar xfz $HOME/bdutil-latest.tar.gz -C $HOME
    $ cd $HOME/bdutil-*
  4. Create a Google Cloud Storage bucket

    1. Choose a unique name per the bucket naming guidelines.
    2. Type gsutil mb -p <PROJECT> gs://<BUCKET> on the command line to create the bucket.

      To determine your project ID, do the following:

      1. Go to the Google Developers Console.
      2. Find and select your project in the table on the main landing page.
      3. The project ID appears at the top of the project overview page.
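
    A rejected bucket name is a common first stumbling block, so a rough pre-check before calling gsutil may save a round trip. The sketch below is an assumption-laden simplification: it checks only length (3 to 63 characters), the allowed character set (lowercase letters, digits, dashes, underscores, dots), and that the name starts and ends with a letter or digit. The bucket naming guidelines remain authoritative, and is_plausible_bucket_name is a hypothetical helper name.

```shell
# Hypothetical helper: rough pre-check of a bucket name.
# Covers only a subset of the naming guidelines (length and characters).
is_plausible_bucket_name() {
  # first char + 1..61 middle chars + last char = 3..63 characters total
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example usage; PROJECT and BUCKET are placeholders you set yourself.
# is_plausible_bucket_name "$BUCKET" && gsutil mb -p "$PROJECT" "gs://$BUCKET"
```

    Passing this check does not guarantee that the name is available, since bucket names must also be unique.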
  5. Optional: Set gcloud compute configuration properties

    Several gcloud compute commands require a target --zone or --region flag. If you configure default properties or local environment variables, you can omit these flags when you run gcloud compute commands. See Setting default gcloud configuration properties for additional information.

    The gcloud compute ssh and gcloud compute copy-files commands let you connect to instances; they handle authentication and the mapping of instance names to IP addresses. However, if you wish to use ssh or scp directly, you can run the gcloud compute config-ssh command to generate an SSH configuration file that contains host aliases for your instances along with the authentication configuration. See Using SSH-based programs directly for additional information.
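
    As an illustration of storing default properties, the sketch below wraps two gcloud config set calls in a small function. The function name set_compute_defaults and the zone and region values in the usage comment are assumptions made for this example; substitute the zone and region you actually intend to use.

```shell
# Hypothetical wrapper: store a default zone and region so that later
# gcloud compute commands can omit --zone and --region.
set_compute_defaults() {
  zone=$1
  region=$2
  gcloud config set compute/zone "$zone" &&
    gcloud config set compute/region "$region"
}

# Example (zone and region values are illustrative):
# set_compute_defaults us-central1-a us-central1
# gcloud config list   # review the stored properties
```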
  6. Run the setup script

    When ready to deploy the cluster, type ./bdutil deploy --bucket <BUCKET> on the command line. Replace <BUCKET> with the bucket name you specified in step 4. Deployment can take up to a few minutes. The script outputs "Deployment complete" on the command line once the cluster is set up.
  7. Validate the cluster

    Type the following commands on the command line to execute a validation script, included as part of the original setup scripts. The script runs a sample Hadoop job that uses the TeraSort algorithm.

    • ./bdutil --bucket <BUCKET> shell < hadoop-validate-setup.sh

    After a few minutes, the script outputs teragen, terasort, teravalidate passed.
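
    If you run the validation unattended, one option is to save its output and look for the success message afterwards. This is a minimal sketch: validation_passed is a hypothetical helper, validate.log is an arbitrary file name, and the match string is simply the message quoted above.

```shell
# Hypothetical helper: succeed only if the given log file contains the
# success message quoted in this guide.
validation_passed() {
  grep -q 'teragen, terasort, teravalidate passed' "$1"
}

# Example usage; validate.log is an arbitrary file name.
# ./bdutil --bucket "$BUCKET" shell < hadoop-validate-setup.sh 2>&1 | tee validate.log
# validation_passed validate.log && echo "cluster OK"
```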

  8. Shut down the cluster

    1. Type ./bdutil --bucket <BUCKET> delete on the command line.
    2. Optionally, you can type gsutil rm -R gs://<BUCKET> to clean up Google Cloud Storage resources used for this tutorial.
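
    The two teardown commands can be combined into one function so that the optional bucket cleanup only runs after the cluster has been deleted. This is a sketch under stated assumptions: teardown_cluster and the BDUTIL variable are hypothetical names introduced here, and bdutil may still prompt for confirmation when it runs.

```shell
# Hypothetical wrapper around the two teardown commands from step 8.
# BDUTIL is an assumed variable pointing at the bdutil script.
BDUTIL=${BDUTIL:-./bdutil}

teardown_cluster() {
  bucket=$1
  remove_bucket=${2:-no}   # pass "yes" to also delete the bucket contents
  "$BDUTIL" --bucket "$bucket" delete || return 1
  if [ "$remove_bucket" = yes ]; then
    gsutil rm -R "gs://$bucket"
  fi
}

# Example usage:
# teardown_cluster "$BUCKET"        # delete the cluster only
# teardown_cluster "$BUCKET" yes    # also remove the bucket
```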
  9. What next?

Apache, Hadoop and the elephant logo are trademarks of the Apache Software Foundation.