-
Apache Hadoop
- The open source standard runs smoothly on Google Compute Engine instances.
- Run Hadoop along with your favorite tools from the vibrant Hadoop community.
-
Quick startup times
- Compute Engine instances start in seconds.
- Eliminate file system startup time when using the Google Cloud Storage connector for Hadoop.
-
Run at scale
- Add more CPUs to your computations without worrying about breaking the bank. Per-minute billing lets you optimize for scale and speed.
-
Sign up
- If you don't already have one, sign up for a Google account.
- In the Google Developers Console, create a project with Google Compute Engine and Google Cloud Storage enabled, and make sure the Google Cloud Storage JSON API is enabled.
-
Install the Cloud SDK
System requirements:
- Python 2.6.x or 2.7.x.
- Cygwin [Windows only].
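The Python requirement above can be checked before installing. The following is a minimal sketch, not part of the Cloud SDK: the function name python_version_supported is hypothetical, and it only inspects a version string that you pass in.

```shell
# Return success when a version string belongs to the 2.6.x or 2.7.x series.
# The function name is hypothetical; it is not provided by the Cloud SDK.
python_version_supported() {
  case "$1" in
    2.6|2.6.*|2.7|2.7.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example use (assumes a `python` binary is on your PATH; Python 2 prints
# its version to stderr, hence the 2>&1):
#   version=$(python -V 2>&1 | awk '{print $2}')
#   python_version_supported "$version" && echo "Python version is supported"
```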
gcloud is distributed as part of the Cloud SDK, which contains tools and libraries for managing resources on Google Cloud Platform.
Installing on Linux or Mac OS X
-
Download and install the Cloud SDK.
You can download and install the Cloud SDK using the following command:
$ curl https://sdk.cloud.google.com | bash
Alternatively, if you do not want to use curl, you can download and unzip the package manually:
- Download google-cloud-sdk.zip.
-
Unzip the file:
$ unzip google-cloud-sdk.zip
-
Run the installation script:
$ ./google-cloud-sdk/install.sh
Follow the prompts to complete the setup. When asked whether you would like to update your system path, select y.
-
Restart your terminal to allow changes to your PATH to take effect.
You can also run source ~/.<bash-profile-file> if you want to avoid restarting your terminal.
-
Authenticate to the Cloud Platform by running:
$ gcloud auth login
Installing on Windows with Cygwin
- Download and install Cygwin.
Cygwin's website contains installation instructions. While installing Cygwin, be sure to select openssh, curl, and the latest 2.6.x or 2.7.x version of python from the package selection screen.
-
Start Cygwin.
By default, you can launch Cygwin by going to Start -> All Programs -> Cygwin -> Cygwin Terminal.
-
Download the Cloud SDK and install it.
You can download and install the Cloud SDK by running the following command from Cygwin:
$ curl https://sdk.cloud.google.com | bash
Alternatively, if you don't want to use curl, you can download and unzip the package manually:
- Download google-cloud-sdk.zip.
- Unzip the file by right-clicking on it and selecting Extract all.
- Run the installation script by clicking on the install.bat file.
Follow the prompts to complete the setup. When asked whether you would like to update your system path, select y.
- Restart Cygwin (or cmd).
-
Authenticate to the Cloud Platform by running:
$ gcloud auth login
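After either installation path, it can help to confirm that the SDK tools are reachable from your shell before continuing. This is a minimal sketch under the assumption that the installer added gcloud and gsutil to your PATH; the function name require_tools is hypothetical.

```shell
# Report any of the named tools that cannot be found on PATH.
# Returns 0 when all are present, 1 otherwise. The function name is
# hypothetical; it is not part of the Cloud SDK.
require_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example use after restarting your terminal:
#   require_tools gcloud gsutil && echo "Cloud SDK tools found"
```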
-
Obtain the setup scripts
Download the setup scripts in zip format or tar.gz format, unzip/untar the archive, and navigate to the newly-created bdutil-* directory.
For example, on Linux or Mac OS X, the following commands will accomplish this.
$ curl https://storage.googleapis.com/hadoop-tools/bdutil/bdutil-latest.tar.gz > $HOME/bdutil-latest.tar.gz
$ tar xfz $HOME/bdutil-latest.tar.gz -C $HOME
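Because the extracted directory name includes a version suffix, you may want to locate it rather than type it. The sketch below assumes the archive was extracted under $HOME as in the commands above; the function name find_bdutil_dir is hypothetical.

```shell
# Print the first directory matching bdutil-* under the given parent
# (defaults to $HOME). Plain files such as the downloaded archive are
# skipped. The function name is hypothetical.
find_bdutil_dir() {
  parent="${1:-$HOME}"
  for candidate in "$parent"/bdutil-*; do
    if [ -d "$candidate" ]; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  echo "no bdutil-* directory under $parent" >&2
  return 1
}

# Example use:
#   cd "$(find_bdutil_dir "$HOME")"
```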
-
Create a Google Cloud Storage bucket
- Choose a unique name per the bucket naming guidelines.
- Type gsutil mb -p <PROJECT> gs://<BUCKET> on the command line to create the bucket.
To determine your project ID, do the following:
- Go to the Google Developers Console.
- Find and select your project in the table on the main landing page.
- The project ID appears at the top of the project overview page.
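A quick local check can catch some invalid bucket names before you run gsutil mb. The sketch below is deliberately partial: it only checks length (3 to 63 characters), the allowed characters (lowercase letters, digits, dots, dashes, underscores), and that the name starts and ends with a letter or digit. The remaining rules in the bucket naming guidelines, and global uniqueness, are not covered. The function name bucket_name_looks_valid is hypothetical.

```shell
# Partial, local check of a bucket name. This is an illustrative sketch,
# not a replacement for the bucket naming guidelines.
# LC_ALL=C keeps the a-z range limited to ASCII lowercase letters.
bucket_name_looks_valid() {
  printf '%s' "$1" |
    LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]$'
}

# Example use (BUCKET and PROJECT are placeholders you set yourself):
#   if bucket_name_looks_valid "$BUCKET"; then
#     gsutil mb -p "$PROJECT" "gs://$BUCKET"
#   fi
```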
-
Optional: Set gcloud compute configuration properties
Several gcloud compute commands require a target --zone or --region flag. If you configure default properties or local environment variables, you can omit these flags when you run gcloud compute commands. See Setting default gcloud configuration properties for additional information.
The gcloud compute ssh and gcloud compute copy-files commands let you connect to instances, handling authentication and the mapping of instance names to IP addresses. However, if you wish to use ssh or scp directly with instances, you can use the gcloud compute config-ssh command to generate an SSH configuration file that contains host aliases for your instances with authentication configuration. See Using SSH-based programs directly for additional information.
-
Run the setup script
When ready to deploy the cluster, type ./bdutil deploy --bucket <BUCKET> on the command line. Replace <BUCKET> with the name of the bucket you created earlier. Deployment can take up to a few minutes. The script outputs "Deployment complete" on the command line once the cluster is set up.
-
Validate the cluster
Type the following command on the command line to run a validation script that is included with the setup scripts. The script runs a sample Hadoop job that uses the TeraSort algorithm.
./bdutil --bucket <BUCKET> shell < hadoop-validate-setup.sh
After a few minutes, the script outputs teragen, terasort, teravalidate passed.
-
Shut down the cluster
- Type ./bdutil --bucket <BUCKET> delete on the command line.
- Optionally, you can type gsutil rm -R gs://<BUCKET> to clean up Google Cloud Storage resources used for this tutorial.
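Since both teardown commands are destructive, one cautious approach is to print them for review before running anything. The sketch below only prints the two commands from this step with your bucket name filled in, and refuses an empty name so that a missing variable cannot produce a bare gs:// path. The function name print_teardown_commands is hypothetical.

```shell
# Print (do not run) the teardown commands for a given bucket.
# The function name is hypothetical; the printed commands are the ones
# described in this step.
print_teardown_commands() {
  bucket="$1"
  if [ -z "$bucket" ]; then
    echo "usage: print_teardown_commands <BUCKET>" >&2
    return 1
  fi
  printf './bdutil --bucket %s delete\n' "$bucket"
  printf 'gsutil rm -R gs://%s\n' "$bucket"
}

# Example use:
#   print_teardown_commands my-bucket
```

Review the printed lines, then run them yourself from the bdutil directory.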
-
What next?
- Review the command-line setup page for customizing your cluster.
- Sign up for announcements on the latest updates, features, and products related to Hadoop on Google Cloud Platform.
- Contact us at gcp-hadoop-contact@google.com. All questions and comments are welcome!
