Google Dataproc Cluster

This page shows how to write Terraform for a Dataproc Cluster and how to write it securely.

google_dataproc_cluster (Terraform)

The Cluster in Dataproc can be configured in Terraform with the resource name google_dataproc_cluster. The following sections describe 4 examples of how to use the resource and its parameters.
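
Before the GitHub excerpts below, here is a minimal sketch of the resource. The cluster name, region, and label values are hypothetical; with no cluster_config block, GCP falls back to its defaults (auto-generated staging bucket, default machine types).

resource "google_dataproc_cluster" "example" {
  name   = "example-cluster"   # hypothetical name
  region = "us-central1"       # pick the region that matches your other resources

  labels = {
    env = "dev"
  }
}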

Example Usage from GitHub

dataproc.tf#L1
resource "google_dataproc_cluster" "mycluster_kkms" {
  provider                      = google-beta
  name                          = "mycluster"
  region                        = "us-central1"
  graceful_decommission_timeout = "120s"

dataproc_cluster.tf#L1
resource "google_dataproc_cluster" "good" {
  depends_on = [google_project_service.dataproc]
  project    = google_project.project.project_id
  name       = "good"
  region     = "us-central1"

main.tf#L16
resource "google_dataproc_cluster" "wsb_cluster" {
  name   = var.cluster_name
  region = var.gcp_region
  labels = {
    env = var.label_env
  }
main.tf#L1
resource "google_dataproc_cluster" "dataproc-cluster" {
  #provider = google-beta       # In order for "ednpoint_config to work google-beta must be the provider"
  name     = var.cluster_name
  region   = var.region
  project = var.project_id

Review your Terraform file for Google best practices

Shisho Cloud, our free checker to make sure your Terraform configuration follows best practices, is available (beta).

Parameters

  • graceful_decommission_timeout optional - string

The timeout duration which allows graceful decommissioning when you change the number of worker nodes directly through a terraform apply.

  • id optional computed - string
  • labels optional computed - map from string to string

The list of labels (key/value pairs) to be applied to instances in the cluster. GCP generates some itself, including goog-dataproc-cluster-name, which is the name of the cluster.

  • name required - string

The name of the cluster, unique within the project and zone.

  • project optional computed - string

The ID of the project in which the cluster will exist. If it is not provided, the provider project is used.

  • region optional - string

The region in which the cluster and associated nodes will be created. Defaults to global.

  • cluster_config list block
    • bucket optional computed - string

    The name of the cloud storage bucket ultimately used to house the staging data for the cluster. If staging_bucket is specified, it will contain this value, otherwise it will be the auto generated name.

    • staging_bucket optional - string

    The Cloud Storage staging bucket used to stage files, such as Hadoop jars, between client machines and the cluster. Note: If you don't explicitly specify a staging_bucket then GCP will auto create / assign one for you. However, you are not guaranteed an auto generated bucket which is solely dedicated to your cluster; it may be shared with other clusters in the same region/zone also choosing to use the auto generation option. A bucket configuration sketch appears after this parameter list.

    • temp_bucket optional computed - string

    The Cloud Storage temp bucket used to store ephemeral cluster and jobs data, such as Spark and MapReduce history files. Note: If you don't explicitly specify a temp_bucket then GCP will auto create / assign one for you.

    • autoscaling_config list block
      • policy_uri required - string

      The autoscaling policy used by the cluster (a combined autoscaling and encryption sketch appears after this parameter list).

    • encryption_config list block
      • kms_key_name required - string

      The Cloud KMS key name to use for PD disk encryption for all instances in the cluster.

    • gce_cluster_config list block
      • internal_ip_only optional - bool

      By default, clusters are not restricted to internal IP addresses, and will have ephemeral external IP addresses assigned to each instance. If set to true, all instances in the cluster will only have internal IP addresses. Note: Private Google Access (also known as privateIpGoogleAccess) must be enabled on the subnetwork that the cluster will be launched in. A gce_cluster_config sketch appears after this parameter list.

      • metadata optional - map from string to string

      A map of the Compute Engine metadata entries to add to all instances

      • network optional computed - string

      The name or self_link of the Google Compute Engine network the cluster will be part of. Conflicts with subnetwork. If neither is specified, this defaults to the "default" network.

      • service_account optional - string

      The service account to be used by the Node VMs. If not specified, the "default" service account is used.

      • service_account_scopes optional computed - set of string

      The set of Google API scopes to be made available on all of the node VMs under the service_account specified. These can be either FQDNs, or scope aliases.

      • subnetwork optional - string

      The name or self_link of the Google Compute Engine subnetwork the cluster will be part of. Conflicts with network.

      • tags optional - set of string

      The list of instance tags applied to instances in the cluster. Tags are used to identify valid sources or targets for network firewalls.

      • zone optional computed - string

      The GCP zone where your data is stored and used (i.e. where the master and the worker nodes will be created in). If region is set to 'global' (default) then zone is mandatory, otherwise GCP is able to make use of Auto Zone Placement to determine this automatically for you. Note: This setting additionally determines and restricts which computing resources are available for use with other configs such as cluster_config.master_config.machine_type and cluster_config.worker_config.machine_type.

    • initialization_action list block
      • script required - string

      The script to be executed during initialization of the cluster. The script must be a GCS file with a gs:// prefix. An initialization_action sketch appears after this parameter list.

      • timeout_sec optional - number

      The maximum duration (in seconds) which the script is allowed to take to execute its action. GCP will default to a predetermined computed value if not set (currently 300).

    • master_config list block
      • image_uri optional computed - string

      The URI for the image to use for this master/worker.

      • instance_names optional computed - list of string

      List of master/worker instance names which have been assigned to the cluster.

      • machine_type optional computed - string

      The name of a Google Compute Engine machine type to create for the master/worker (see the master/worker sketch after this parameter list).

      • min_cpu_platform optional computed - string

      The name of a minimum generation of CPU family for the master/worker. If not specified, GCP will default to a predetermined computed value for each zone.

      • num_instances optional computed - number

      Specifies the number of master/worker nodes to create. If not specified, GCP will default to a predetermined computed value.

      • accelerators set block

        • accelerator_count required - number

        The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

        • accelerator_type required - string

        The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.

      • disk_config list block

        • boot_disk_size_gb optional computed - number

        Size of the primary disk attached to each node, specified in GB. The primary disk contains the boot volume and system libraries, and the smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

        • boot_disk_type optional - string

        The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

        • num_local_ssds optional computed - number

        The number of local SSD disks that will be attached to each master cluster node. Defaults to 0.

    • preemptible_worker_config list block

      • instance_names optional computed - list of string

      List of preemptible instance names which have been assigned to the cluster.

      • num_instances optional computed - number

      Specifies the number of preemptible nodes to create. Defaults to 0. A preemptible worker sketch appears after this parameter list.

      • disk_config list block

        • boot_disk_size_gb optional computed - number

        Size of the primary disk attached to each preemptible worker node, specified in GB. The smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

        • boot_disk_type optional - string

        The disk type of the primary disk attached to each preemptible worker node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

        • num_local_ssds optional computed - number

        The number of local SSD disks that will be attached to each preemptible worker node. Defaults to 0.

    • security_config list block
      • kerberos_config list block

        • cross_realm_trust_admin_server optional - string

        The admin server (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

        • cross_realm_trust_kdc optional - string

        The KDC (IP or hostname) for the remote trusted realm in a cross realm trust relationship.

        • cross_realm_trust_realm optional - string

        The remote realm the Dataproc on-cluster KDC will trust, should the user enable cross realm trust.

        • cross_realm_trust_shared_password_uri optional - string

        The Cloud Storage URI of a KMS encrypted file containing the shared password between the on-cluster Kerberos realm and the remote trusted realm, in a cross realm trust relationship.

        • enable_kerberos optional - bool

        Flag to indicate whether to Kerberize the cluster (a Kerberos sketch appears after this parameter list).

        • kdc_db_key_uri optional - string

        The Cloud Storage URI of a KMS encrypted file containing the master key of the KDC database.

        • key_password_uri optional - string

        The Cloud Storage URI of a KMS encrypted file containing the password to the user provided key. For the self-signed certificate, this password is generated by Dataproc.

        • keystore_password_uri optional - string

        The Cloud Storage URI of a KMS encrypted file containing the password to the user provided keystore. For the self-signed certificate, this password is generated by Dataproc.

        • keystore_uri optional - string

        The Cloud Storage URI of the keystore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

        • kms_key_uri required - string

        The URI of the KMS key used to encrypt various sensitive files.

        • realm optional - string

        The name of the on-cluster Kerberos realm. If not specified, the uppercased domain of hostnames will be the realm.

        • root_principal_password_uri required - string

        The Cloud Storage URI of a KMS encrypted file containing the root principal password.

        • tgt_lifetime_hours optional - number

        The lifetime of the ticket granting ticket, in hours.

        • truststore_password_uri optional - string

        The Cloud Storage URI of a KMS encrypted file containing the password to the user provided truststore. For the self-signed certificate, this password is generated by Dataproc.

        • truststore_uri optional - string

        The Cloud Storage URI of the truststore file used for SSL encryption. If not provided, Dataproc will provide a self-signed certificate.

    • software_config list block

      • image_version optional computed - string

      The Cloud Dataproc image version to use for the cluster - this controls the sets of software versions installed onto the nodes when you create clusters. If not specified, defaults to the latest version (a software_config sketch appears after this parameter list).

      • optional_components optional - set of string

      The set of optional components to activate on the cluster.

      • override_properties optional - map from string to string

      A list of override and additional properties (key/value pairs) used to modify various aspects of the common configuration files used when creating a cluster.

      • properties optional computed - map from string to string

      A list of the properties used to set the daemon config files. This will include any values supplied by the user via cluster_config.software_config.override_properties.

    • worker_config list block

      • image_uri optional computed - string

      The URI for the image to use for this master/worker.

      • instance_names optional computed - list of string

      List of master/worker instance names which have been assigned to the cluster.

      • machine_type optional computed - string

      The name of a Google Compute Engine machine type to create for the master/worker.

      • min_cpu_platform optional computed - string

      The name of a minimum generation of CPU family for the master/worker. If not specified, GCP will default to a predetermined computed value for each zone.

      • num_instances optional computed - number

      Specifies the number of master/worker nodes to create. If not specified, GCP will default to a predetermined computed value.

      • accelerators set block

        • accelerator_count required - number

        The number of the accelerator cards of this type exposed to this instance. Often restricted to one of 1, 2, 4, or 8.

        • accelerator_type required - string

        The short name of the accelerator type to expose to this instance. For example, nvidia-tesla-k80.

      • disk_config list block

        • boot_disk_size_gb optional computed - number

        Size of the primary disk attached to each node, specified in GB. The primary disk contains the boot volume and system libraries, and the smallest allowed disk size is 10GB. GCP will default to a predetermined computed value if not set (currently 500GB). Note: If SSDs are not attached, it also contains the HDFS data blocks and Hadoop working directories.

        • boot_disk_type optional - string

        The disk type of the primary disk attached to each node. One of "pd-ssd" or "pd-standard". Defaults to "pd-standard".

        • num_local_ssds optional computed - number

        The number of local SSD disks that will be attached to each worker cluster node. Defaults to 0.

  • timeouts single block
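
A sketch of the bucket-related parameters described above, assuming the referenced google_storage_bucket resources already exist elsewhere in your configuration (the bucket and cluster names are hypothetical):

resource "google_dataproc_cluster" "bucket_example" {
  name   = "bucket-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    # Explicit buckets; omit both and GCP auto-creates (possibly shared) buckets instead.
    staging_bucket = google_storage_bucket.dataproc_staging.name   # assumed to exist elsewhere
    temp_bucket    = google_storage_bucket.dataproc_temp.name      # assumed to exist elsewhere
  }
}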
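
A sketch of gce_cluster_config covering the network, service account, tags, and internal-IP settings listed above; the subnetwork reference, service account email, and zone are placeholders:

resource "google_dataproc_cluster" "network_example" {
  name   = "network-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    gce_cluster_config {
      zone             = "us-central1-a"
      subnetwork       = google_compute_subnetwork.dataproc.self_link   # assumed subnetwork resource
      internal_ip_only = true   # requires Private Google Access on the subnetwork
      service_account  = "dataproc-sa@my-project.iam.gserviceaccount.com"   # placeholder
      service_account_scopes = ["cloud-platform"]
      tags             = ["dataproc", "allow-internal"]

      metadata = {
        enable-oslogin = "true"
      }
    }
  }
}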
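
A sketch combining autoscaling_config and encryption_config; it assumes a google_dataproc_autoscaling_policy and a KMS crypto key are defined elsewhere in the configuration:

resource "google_dataproc_cluster" "autoscaling_example" {
  name   = "autoscaling-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    autoscaling_config {
      policy_uri = google_dataproc_autoscaling_policy.example.id   # assumed policy resource
    }

    encryption_config {
      kms_key_name = google_kms_crypto_key.dataproc.id   # assumed CMEK key for PD disk encryption
    }
  }
}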
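
A sketch of initialization_action; the gs:// path is a placeholder for a script you host yourself:

resource "google_dataproc_cluster" "init_example" {
  name   = "init-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    initialization_action {
      script      = "gs://my-init-bucket/scripts/install-deps.sh"   # placeholder GCS path
      timeout_sec = 500                                             # overrides the 300s default
    }
  }
}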
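
A sketch of master_config and worker_config, including disk_config and an accelerator on the workers; machine types, counts, and the graceful_decommission_timeout value are illustrative only:

resource "google_dataproc_cluster" "sized_example" {
  name                          = "sized-example-cluster"   # hypothetical
  region                        = "us-central1"
  graceful_decommission_timeout = "120s"   # lets workers drain when num_instances is lowered

  cluster_config {
    master_config {
      num_instances = 1
      machine_type  = "n1-standard-4"

      disk_config {
        boot_disk_type    = "pd-ssd"
        boot_disk_size_gb = 100
      }
    }

    worker_config {
      num_instances = 2
      machine_type  = "n1-standard-4"

      disk_config {
        boot_disk_size_gb = 100
        num_local_ssds    = 1
      }

      accelerators {
        accelerator_type  = "nvidia-tesla-k80"
        accelerator_count = 1
      }
    }
  }
}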
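
A sketch of preemptible_worker_config added next to a regular worker pool; the counts and disk size are illustrative:

resource "google_dataproc_cluster" "preemptible_example" {
  name   = "preemptible-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    worker_config {
      num_instances = 2
    }

    preemptible_worker_config {
      num_instances = 4   # preemptible pool; machine type follows worker_config

      disk_config {
        boot_disk_size_gb = 50
      }
    }
  }
}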
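
A sketch of software_config showing image_version, optional_components, and override_properties; the chosen image version and property values are just examples:

resource "google_dataproc_cluster" "software_example" {
  name   = "software-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    software_config {
      image_version       = "2.0-debian10"   # example image; pick a current one
      optional_components = ["JUPYTER", "ZEPPELIN"]

      override_properties = {
        "dataproc:dataproc.allow.zero.workers" = "true"
      }
    }
  }
}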
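
A sketch of security_config with kerberos_config; the KMS key and the GCS object holding the encrypted root principal password are assumed to exist already:

resource "google_dataproc_cluster" "kerberos_example" {
  name   = "kerberos-example-cluster"   # hypothetical
  region = "us-central1"

  cluster_config {
    security_config {
      kerberos_config {
        enable_kerberos             = true
        kms_key_uri                 = google_kms_crypto_key.dataproc.id            # assumed CMEK key
        root_principal_password_uri = "gs://my-secrets-bucket/kerberos-root.enc"   # placeholder encrypted file
      }
    }
  }
}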

Explanation in Terraform Registry

Manages a Cloud Dataproc cluster resource within GCP.

  • API documentation
  • How-to Guides
    • Official Documentation

Warning: Due to limitations of the API, all arguments except labels, cluster_config.worker_config.num_instances and cluster_config.preemptible_worker_config.num_instances are non-updatable. Changing others will cause recreation of the whole cluster!

Frequently asked questions

What is Google Dataproc Cluster?

Google Dataproc Cluster is a resource for Dataproc of Google Cloud Platform. Settings can be written in Terraform.

Where can I find the example code for the Google Dataproc Cluster?

For Terraform, the anaik91/tfe, GoogleCloudPlatform/gcpdiag and Kacperek0/wsb-dataproc-infra source code examples are useful. See the Terraform Example section for further details.


Automate config file reviews on your commits

Fix issues in your infrastructure as code with auto-generated patches.