Hadoop on Google Cloud Platform — Google Cloud Platform

Hadoop on Google Cloud Platform

Apache Hadoop
- The open source standard runs smoothly on Google Compute Engine instances.
- Run Hadoop along with your favorite tools from the vibrant Hadoop community.
Quick startup times
- Compute Engine instances start in seconds.
- Eliminate file system startup time when using the Google Cloud Storage connector for Hadoop.
Run at scale
- Add more CPUs to your computations without worrying about breaking the bank. Per-minute billing lets you optimize for scale and speed.

Try it now

Sign up
1. If you don't already have one, sign up for a Google account.
2. Create a project with Google Compute Engine and Google Cloud Storage enabled via the Google Developers Console; make sure the Google Cloud Storage JSON API is also enabled.
Install the Cloud SDK

System requirements:
- Python 2.6.x or 2.7.x.
- Cygwin [Windows only].
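As a minimal sketch, the Python requirement above can be checked with a small shell function. The function name `python_version_supported` is illustrative and not part of the Cloud SDK; it only inspects a version string that you pass in.

```shell
# Return success if the given version string is Python 2.6.x or 2.7.x,
# the versions listed in the system requirements above.
# (python_version_supported is a hypothetical helper, not an SDK command.)
python_version_supported() {
  case "$1" in
    2.6|2.7|2.6.*|2.7.*) return 0 ;;
    *)                   return 1 ;;
  esac
}

# Example: check the interpreter on your PATH.
# Python 2 prints its version on stderr, hence the 2>&1.
# version=$(python --version 2>&1 | awk '{print $2}')
# python_version_supported "$version" || echo "Unsupported Python: $version"
```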
gcloud is distributed as part of the Cloud SDK, which contains tools and libraries for managing resources on Google Cloud Platform.
Installing on Linux or Mac OS X
1. Download and install the Cloud SDK.
  You can download and install the Cloud SDK using the following command:
```
$ curl https://sdk.cloud.google.com | bash
```
  Alternatively, if you do not want to use curl, you can download and unzip the package manually:
  1. Download google-cloud-sdk.zip
  2. Unzip the file:
    $ unzip google-cloud-sdk.zip
  3. Run the installation script:
    $ ./google-cloud-sdk/install.sh
  Follow the prompts to complete the setup. When prompted if you would like to update your system path, select y.
2. Restart your terminal to allow changes to your PATH to take effect.
  You can also run source ~/.<bash-profile-file> if you want to avoid restarting your terminal.
3. Authenticate to the Cloud Platform by running:
```
$ gcloud auth login
```
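If `gcloud auth login` reports that the command cannot be found, the PATH change from step 2 has probably not taken effect. The following sketch shows one way to confirm that a tool is reachable; `require_on_path` is an illustrative name, not something the Cloud SDK provides.

```shell
# Report whether a command is reachable through the current PATH.
# (require_on_path is a hypothetical helper, not part of the Cloud SDK.)
require_on_path() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: found"
  else
    echo "$1: not found on PATH" >&2
    return 1
  fi
}

# Usage: only attempt to authenticate once gcloud is visible.
# require_on_path gcloud && gcloud auth login
```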
Installing on Windows with Cygwin
1. Download and install Cygwin.
  Cygwin's website contains installation instructions. While installing Cygwin, be sure to select openssh, curl, and the latest 2.6.x or 2.7.x version of python from the package selection screen.
  
  Caution: Due to a bug in Python on 64-bit Cygwin, please install the 32-bit version.
2. Start Cygwin.
  By default, you can launch Cygwin by going to Start -> All Programs -> Cygwin -> Cygwin Terminal.
3. Download the Cloud SDK and install it.
  You can download and install the Cloud SDK by issuing the following commands from Cygwin:
```
$ curl https://sdk.cloud.google.com | bash
```
  Alternatively, if you don't want to use curl, you can always download and unzip the package manually:
  1. Download google-cloud-sdk.zip.
  2. Unzip the file by right-clicking on it and selecting Extract all.
  3. Run the installation script by clicking on the install.bat file.
  Follow the prompts to complete the setup. When prompted if you would like to update your system path, select y.
4. Restart Cygwin (or cmd).
5. Authenticate to the Cloud Platform by running:
```
$ gcloud auth login
```
  Note: If you run into issues with the above command, you might need to reset your PATH. Run the following command, providing the full path to your google-cloud-sdk folder.
  $ setx PATH "%PATH%;C:\full\path\to\google-cloud-sdk\bin"
Obtain the setup scripts

Download the setup scripts in zip format or tar.gz format, unzip/untar the archive, and navigate to the newly-created bdutil-* directory.

For example, on Linux or Mac OS X, the following commands will accomplish this.
```
$ curl https://storage.googleapis.com/hadoop-tools/bdutil/bdutil-latest.tar.gz > $HOME/bdutil-latest.tar.gz
$ tar xfz $HOME/bdutil-latest.tar.gz -C $HOME
$ cd $HOME/bdutil-*
```
Create a Google Cloud Storage bucket
1. Choose a unique name per the bucket naming guidelines.
2. Type gsutil mb -p <PROJECT> gs://<BUCKET> on the command line to create the bucket.
  To determine your project ID, do the following:
  1. Go to the Google Developers Console.
  2. Find and select your project in the table on the main landing page.
  3. The project ID appears at the top of the project overview page.
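Before running `gsutil mb`, a candidate bucket name can be screened locally. The sketch below covers only a simplified subset of the bucket naming guidelines (length, allowed characters, first and last character, and the reserved `goog` prefix); the guidelines remain the authoritative reference, and `valid_bucket_name` is an illustrative name.

```shell
# Simplified check of a bucket name against a subset of the naming rules.
# (valid_bucket_name is a hypothetical helper; it does not contact
# Cloud Storage and cannot tell whether the name is already taken.)
valid_bucket_name() {
  case "$1" in
    goog*) return 1 ;;  # names may not begin with the "goog" prefix
  esac
  # 3-63 characters; lowercase letters, digits, '.', '_' and '-';
  # the first and last character must be a letter or digit.
  printf '%s\n' "$1" |
    LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Usage:
# valid_bucket_name "my-hadoop-bucket-01" && echo "name looks plausible"
```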
Optional: Set gcloud compute configuration properties

Several gcloud compute commands require a target --zone or --region flag. If you configure default properties or local environment variables, you can omit these flags when you run gcloud compute commands. See Setting default gcloud configuration properties for additional information.
The gcloud compute ssh and gcloud compute copy-files commands let you connect to instances, handling authentication and the mapping of instance names to IP addresses for you. However, if you wish to use ssh or scp directly, you can run the gcloud compute config-ssh command to generate an SSH configuration file that contains host aliases for your instances along with the authentication configuration. See Using SSH-based programs directly for additional information.
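A sketch of setting the default properties follows. It uses `gcloud config set` with the `compute/zone` and `compute/region` properties; the `run` and `set_compute_defaults` helpers and the zone and region values are illustrative assumptions. By default the helper only prints the commands so that they can be reviewed first.

```shell
# Print each command; execute it only when DRY_RUN is set to 0.
# (run and set_compute_defaults are hypothetical helpers.)
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Store a default zone and region so --zone/--region can be omitted later.
set_compute_defaults() {
  zone=$1
  region=$2
  run gcloud config set compute/zone "$zone"
  run gcloud config set compute/region "$region"
}

# Usage (example values only; pick the zone and region you actually use):
# set_compute_defaults us-central1-a us-central1            # preview
# DRY_RUN=0 set_compute_defaults us-central1-a us-central1  # apply
```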
Run the setup script
When ready to deploy the cluster, type ./bdutil deploy --bucket <BUCKET> on the command line. Replace <BUCKET> with the name of the bucket you created in the "Create a Google Cloud Storage bucket" section. Deployment can take up to a few minutes. The script outputs "Deployment complete" on the command line once the cluster is set up.
Validate the cluster

Type the following command on the command line to execute a validation script, which is included with the setup scripts. The script runs a sample Hadoop job that uses the TeraSort algorithm.
- ./bdutil --bucket <BUCKET> shell < hadoop-validate-setup.sh
After a few minutes, the script outputs teragen, terasort, teravalidate passed.
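If you capture the validation output in a file, the success message can be checked automatically. This is a minimal sketch that assumes the script prints the message exactly as quoted above; `validation_passed` and the `validate.log` file name are illustrative.

```shell
# Succeed if the text on stdin contains the success message quoted above.
# (validation_passed is a hypothetical helper; the exact message text is
# assumed to match what the validation script prints.)
validation_passed() {
  grep -qF 'teragen, terasort, teravalidate passed'
}

# Usage, run from the bdutil directory:
# ./bdutil --bucket <BUCKET> shell < hadoop-validate-setup.sh | tee validate.log
# validation_passed < validate.log && echo "cluster validated"
```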
Shut down the cluster
1. Type ./bdutil --bucket <BUCKET> delete on the command line.
2. Optionally, you can type gsutil rm -R gs://<BUCKET> to clean up Google Cloud Storage resources used for this tutorial.
What next?
- Review the command-line setup page for customizing your cluster.
- Sign up for announcements on the latest updates, features, and products related to Hadoop on Google Cloud Platform.
- Contact us at gcp-hadoop-contact@google.com. All questions and comments are welcome!

Apache, Hadoop and the elephant logo are trademarks of the Apache Software Foundation.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our Site Policies.
