Google Dataproc Job

This page shows how to write Terraform for a Dataproc Job and how to write it securely.

google_dataproc_job (Terraform)

A Dataproc Job can be configured in Terraform with the resource name google_dataproc_job. The following sections describe two examples of how to use the resource and its parameters.

Example Usage from GitHub

yuyatinnefeld/gcp

main.tf#L30

resource "google_dataproc_job" "spark" {
  region       = google_dataproc_cluster.mycluster.region
  force_delete = true
  placement {
    cluster_name = google_dataproc_cluster.mycluster.name
  }
  ...
}

Gvkrishna/Terraform_GCP

dataproc_cluster_autoscale_4job_main.tf#L60

resource "google_dataproc_job" "spark" {
  region       = google_dataproc_cluster.mycluster.region
  force_delete = true
  placement {
    cluster_name = google_dataproc_cluster.mycluster.name
  }
  ...
}

Parameters

driver_controls_files_uri optional computed - string

Output-only. If present, the location of miscellaneous control files which may be used as part of job setup and handling. If not present, control files may be placed in the same location as driver_output_resource_uri.

driver_output_resource_uri optional computed - string

Output-only. A URI pointing to the location of the stdout of the job's driver program.

force_delete optional - bool

By default, you can only delete inactive jobs within Dataproc. Setting this to true, and calling destroy, will ensure that the job is first cancelled before issuing the delete.

id optional computed - string

labels optional - map from string to string

Optional. The labels to associate with this job.

project optional computed - string

The project in which the cluster can be found and jobs subsequently run against. If it is not provided, the provider project is used.

region optional - string

The Cloud Dataproc region. This essentially determines which clusters are available for this job to be submitted to. If not specified, defaults to global.
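
The top-level arguments described so far (force_delete, labels, project, region) can be set together on one resource. The following is a minimal sketch only: the project ID, cluster name, bucket path, and label values are hypothetical placeholders, not values taken from the examples on this page.

```hcl
resource "google_dataproc_job" "labeled" {
  # Hypothetical project ID; if omitted, the provider project is used.
  project = "example-project-id"

  # Assumed to be the region of the target cluster.
  region = "us-central1"

  # Cancel the job first if it is still active when `terraform destroy` runs.
  force_delete = true

  # Hypothetical labels to associate with the job.
  labels = {
    team = "data-eng"
    env  = "dev"
  }

  placement {
    cluster_name = "example-cluster" # hypothetical cluster name
  }

  pyspark_config {
    main_python_file_uri = "gs://example-bucket/jobs/hello.py" # hypothetical path
  }
}
```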

status optional computed - list of object

The status of the job.

- details - string
- state - string
- state_start_time - string
- substate - string

hadoop_config list block
- archive_uris optional - list of string
HCFS URIs of archives to be extracted in the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
- args optional - list of string
The arguments to pass to the driver.
- file_uris optional - list of string
HCFS URIs of files to be copied to the working directory of Hadoop drivers and distributed tasks. Useful for naively parallel tasks.
- jar_file_uris optional - list of string
HCFS URIs of jar files to add to the CLASSPATHs of the Hadoop driver and tasks.
- main_class optional - string
The class containing the main method of the driver. Must be in a provided jar or jar that is already on the classpath. Conflicts with main_jar_file_uri.
- main_jar_file_uri optional - string
The HCFS URI of the jar file containing the main class. Conflicts with main_class.
- properties optional - map from string to string
A mapping of property names to values, used to configure Hadoop. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site and classes in user code.
- logging_config list block
  - driver_log_levels required - map from string to string
  The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.
hive_config list block
- continue_on_failure optional - bool
Whether to continue executing queries if a query fails. Setting to true can be useful when executing independent parallel queries. Defaults to false.
- jar_file_uris optional - list of string
HCFS URIs of jar files to add to the CLASSPATH of the Hive server and Hadoop MapReduce (MR) tasks. Can contain Hive SerDes and UDFs.
- properties optional - map from string to string
A mapping of property names and values, used to configure Hive. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site.xml, /etc/hive/conf/hive-site.xml, and classes in user code.
- query_file_uri optional - string
HCFS URI of file containing Hive script to execute as the job. Conflicts with query_list
- query_list optional - list of string
The list of Hive queries or statements to execute as part of the job. Conflicts with query_file_uri
- script_variables optional - map from string to string
Mapping of query variable names to values (equivalent to the Hive command: SET name="value";).
pig_config list block
- continue_on_failure optional - bool
Whether to continue executing queries if a query fails. Setting to true can be useful when executing independent parallel queries. Defaults to false.
- jar_file_uris optional - list of string
HCFS URIs of jar files to add to the CLASSPATH of the Pig Client and Hadoop MapReduce (MR) tasks. Can contain Pig UDFs.
- properties optional - map from string to string
A mapping of property names to values, used to configure Pig. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site.xml, /etc/pig/conf/pig.properties, and classes in user code.
- query_file_uri optional - string
HCFS URI of file containing Pig script to execute as the job. Conflicts with query_list.
- query_list optional - list of string
The list of Pig queries or statements to execute as part of the job. Conflicts with query_file_uri.
- script_variables optional - map from string to string
Mapping of query variable names to values (equivalent to the Pig command: name=[value]).
- logging_config list block
  - driver_log_levels required - map from string to string
  The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.
placement list block
- cluster_name required - string
The name of the cluster where the job will be submitted
- cluster_uuid optional computed - string
Output-only. A cluster UUID generated by the Cloud Dataproc service when the job is submitted
pyspark_config list block
- archive_uris optional - list of string
HCFS URIs of archives to be extracted in the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
- args optional - list of string
The arguments to pass to the driver. Do not include arguments, such as --conf, that can be set as job properties, since a collision may occur that causes an incorrect job submission.
- file_uris optional - list of string
HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks. Useful for naively parallel tasks.
- jar_file_uris optional - list of string
HCFS URIs of jar files to add to the CLASSPATHs of the Python driver and tasks.
- main_python_file_uri required - string
The HCFS URI of the main Python file to use as the driver. Must be a .py file.
- properties optional - map from string to string
A mapping of property names to values, used to configure PySpark. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/spark/conf/spark-defaults.conf and classes in user code.
- python_file_uris optional - list of string
HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
- logging_config list block
  - driver_log_levels required - map from string to string
  The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.
reference list block
- job_id optional computed - string
The job ID, which must be unique within the project. The job ID is generated by the server upon job submission or provided by the user as a means to perform retries without creating duplicate jobs
scheduling list block
- max_failures_per_hour required - number
Maximum number of times per hour a driver may be restarted as a result of the driver exiting with a non-zero code before the job is reported failed.
- max_failures_total required - number
Maximum number of times in total a driver may be restarted as a result of the driver exiting with a non-zero code before the job is reported failed.
spark_config list block
- archive_uris optional - list of string
HCFS URIs of archives to be extracted in the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
- args optional - list of string
The arguments to pass to the driver.
- file_uris optional - list of string
HCFS URIs of files to be copied to the working directory of Spark drivers and distributed tasks. Useful for naively parallel tasks.
- jar_file_uris optional - list of string
HCFS URIs of jar files to add to the CLASSPATHs of the Spark driver and tasks.
- main_class optional - string
The class containing the main method of the driver. Must be in a provided jar or jar that is already on the classpath. Conflicts with main_jar_file_uri.
- main_jar_file_uri optional - string
The HCFS URI of the jar file containing the main class. Conflicts with main_class.
- properties optional - map from string to string
A mapping of property names to values, used to configure Spark. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/spark/conf/spark-defaults.conf and classes in user code.
- logging_config list block
  - driver_log_levels required - map from string to string
  The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.
sparksql_config list block
- jar_file_uris optional - list of string
HCFS URIs of jar files to be added to the Spark CLASSPATH.
- properties optional - map from string to string
A mapping of property names to values, used to configure Spark SQL's SparkConf. Properties that conflict with values set by the Cloud Dataproc API may be overwritten.
- query_file_uri optional - string
The HCFS URI of the script that contains SQL queries. Conflicts with query_list
- query_list optional - list of string
The list of SQL queries or statements to execute as part of the job. Conflicts with query_file_uri
- script_variables optional - map from string to string
Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";).
- logging_config list block
  - driver_log_levels required - map from string to string
  The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.
timeouts single block
- create optional - string
- delete optional - string
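
The reference, scheduling, and timeouts blocks listed above are not shown in the examples on this page. The following is a minimal sketch of how they might be combined, assuming a hypothetical cluster name and job ID; the restart limits and the duration strings are illustrative values, not recommendations.

```hcl
resource "google_dataproc_job" "restartable" {
  region = "us-central1" # assumed to be the region of the target cluster

  placement {
    cluster_name = "example-cluster" # hypothetical cluster name
  }

  # A user-provided job ID, which must be unique within the project.
  reference {
    job_id = "example-sparkpi-001" # hypothetical ID
  }

  # Both arguments are required once the scheduling block is present.
  scheduling {
    max_failures_per_hour = 1
    max_failures_total    = 5
  }

  spark_config {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["1000"]
  }

  # How long Terraform waits for job creation and deletion (illustrative values).
  timeouts {
    create = "30m"
    delete = "10m"
  }
}
```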

>> from Terraform Registry

Explanation in Terraform Registry

Manages a job resource within a Dataproc cluster within GCE. For more information see the official Dataproc documentation.

Note: This resource does not support 'update' and changing any attributes will cause the resource to be recreated.

resource "google_dataproc_job" "spark" {
  region       = google_dataproc_cluster.mycluster.region
  force_delete = true
  placement {
    cluster_name = google_dataproc_cluster.mycluster.name
  }

  spark_config {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["1000"]

    properties = {
      "spark.logConf" = "true"
    }

    logging_config {
      driver_log_levels = {
        "root" = "INFO"
      }
    }
  }
}

resource "google_dataproc_job" "pyspark" {
  region       = google_dataproc_cluster.mycluster.region
  force_delete = true
  placement {
    cluster_name = google_dataproc_cluster.mycluster.name
  }

  pyspark_config {
    main_python_file_uri = "gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py"
    properties = {
      "spark.logConf" = "true"
    }
  }
}

output "spark_status" {
  value = google_dataproc_job.spark.status[0].state
}

output "pyspark_status" {
  value = google_dataproc_job.pyspark.status[0].state
}

The pyspark_config block supports:
resource "google_dataproc_job" "pyspark" {
  ...
  pyspark_config {
    main_python_file_uri = "gs://dataproc-examples-2f10d78d114f6aaec76462e3c310f31f/src/pyspark/hello-world/hello-world.py"
    properties = {
      "spark.logConf" = "true"
    }
  }
}
For configurations requiring Hadoop Compatible File System (HCFS) references, the options below are generally applicable:
- GCS files with the gs:// prefix
- HDFS files on the cluster with the hdfs:// prefix
- Local files on the cluster with the file:// prefix
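
The three HCFS prefixes can appear side by side in one job. The following is a sketch under stated assumptions: the bucket, the HDFS path, and the cluster name are hypothetical, and the referenced files are assumed to exist at those locations.

```hcl
resource "google_dataproc_job" "hcfs_prefixes" {
  region = "us-central1" # assumed to be the region of the target cluster

  placement {
    cluster_name = "example-cluster" # hypothetical cluster name
  }

  pyspark_config {
    # gs:// prefix: a file in Cloud Storage (hypothetical bucket and path).
    main_python_file_uri = "gs://example-bucket/jobs/etl.py"

    # hdfs:// prefix: a file in the cluster's HDFS (hypothetical path).
    file_uris = ["hdfs:///tmp/lookup.csv"]

    # file:// prefix: a file on the cluster's local disk.
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
  }
}
```
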
main_python_file_uri - (Required) The HCFS URI of the main Python file to use as the driver. Must be a .py file.
args - (Optional) The arguments to pass to the driver.
python_file_uris - (Optional) HCFS file URIs of Python files to pass to the PySpark framework. Supported file types: .py, .egg, and .zip.
jar_file_uris - (Optional) HCFS URIs of jar files to add to the CLASSPATHs of the Python driver and tasks.
file_uris - (Optional) HCFS URIs of files to be copied to the working directory of Python drivers and distributed tasks. Useful for naively parallel tasks.
archive_uris - (Optional) HCFS URIs of archives to be extracted in the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
properties - (Optional) A mapping of property names to values, used to configure PySpark. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/spark/conf/spark-defaults.conf and classes in user code.
logging_config.driver_log_levels - (Required) The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.

The spark_config block supports:
resource "google_dataproc_job" "spark" {
  ...
  spark_config {
    main_class    = "org.apache.spark.examples.SparkPi"
    jar_file_uris = ["file:///usr/lib/spark/examples/jars/spark-examples.jar"]
    args          = ["1000"]
    properties = {
      "spark.logConf" = "true"
    }
    logging_config {
      driver_log_levels = {
        "root" = "INFO"
      }
    }
  }
}
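
The example above starts the driver through main_class. Since main_class and main_jar_file_uri conflict with each other, a job whose entry point is packaged in its own jar would set only main_jar_file_uri. A minimal sketch, assuming a hypothetical bucket, jar, argument list, and cluster name:

```hcl
resource "google_dataproc_job" "spark_from_jar" {
  region = "us-central1" # assumed to be the region of the target cluster

  placement {
    cluster_name = "example-cluster" # hypothetical cluster name
  }

  spark_config {
    # Set main_jar_file_uri instead of main_class; the two conflict.
    main_jar_file_uri = "gs://example-bucket/jars/example-spark-app.jar" # hypothetical jar

    # Hypothetical arguments understood by the application in that jar.
    args = ["--input", "gs://example-bucket/input/"]

    logging_config {
      driver_log_levels = {
        "root" = "INFO"
      }
    }
  }
}
```
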
main_class - (Optional) The class containing the main method of the driver. Must be in a provided jar or jar that is already on the classpath. Conflicts with main_jar_file_uri.
main_jar_file_uri - (Optional) The HCFS URI of the jar file containing the main class. Conflicts with main_class.
args - (Optional) The arguments to pass to the driver.
jar_file_uris - (Optional) HCFS URIs of jar files to add to the CLASSPATHs of the Spark driver and tasks.
file_uris - (Optional) HCFS URIs of files to be copied to the working directory of Spark drivers and distributed tasks. Useful for naively parallel tasks.
archive_uris - (Optional) HCFS URIs of archives to be extracted in the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
properties - (Optional) A mapping of property names to values, used to configure Spark. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/spark/conf/spark-defaults.conf and classes in user code.
logging_config.driver_log_levels - (Required) The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.

The hadoop_config block supports:
resource "google_dataproc_job" "hadoop" {
  ...
  hadoop_config {
    main_jar_file_uri = "file:///usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar"
    args = [
      "wordcount",
      "file:///usr/lib/spark/NOTICE",
      "gs://${google_dataproc_cluster.basic.cluster_config[0].bucket}/hadoopjob_output",
    ]
  }
}
main_class - (Optional) The name of the driver's main class. The jar file containing the class must be in the default CLASSPATH or specified in jar_file_uris. Conflicts with main_jar_file_uri.
main_jar_file_uri - (Optional) The HCFS URI of the jar file containing the main class. Examples: 'gs://foo-bucket/analytics-binaries/extract-useful-metrics-mr.jar' 'hdfs:/tmp/test-samples/custom-wordcount.jar' 'file:///home/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar'. Conflicts with main_class.
args - (Optional) The arguments to pass to the driver. Do not include arguments, such as -libjars or -Dfoo=bar, that can be set as job properties, since a collision may occur that causes an incorrect job submission.
jar_file_uris - (Optional) HCFS URIs of jar files to add to the CLASSPATHs of the Hadoop driver and tasks.
file_uris - (Optional) HCFS URIs of files to be copied to the working directory of Hadoop drivers and distributed tasks. Useful for naively parallel tasks.
archive_uris - (Optional) HCFS URIs of archives to be extracted in the working directory. Supported file types: .jar, .tar, .tar.gz, .tgz, and .zip.
properties - (Optional) A mapping of property names to values, used to configure Hadoop. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site and classes in user code.
logging_config.driver_log_levels - (Required) The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.

The hive_config block supports:
resource "google_dataproc_job" "hive" {
  ...
  hive_config {
    query_list = [
      "DROP TABLE IF EXISTS dprocjob_test",
      "CREATE EXTERNAL TABLE dprocjob_test(bar int) LOCATION 'gs://${google_dataproc_cluster.basic.cluster_config[0].bucket}/hive_dprocjob_test/'",
      "SELECT * FROM dprocjob_test WHERE bar > 2",
    ]
  }
}
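
The statements in the example above depend on one another, so the default behavior of stopping at the first failure suits them. For independent queries, continue_on_failure can be set to true. A minimal sketch, assuming hypothetical table names and a hypothetical cluster name:

```hcl
resource "google_dataproc_job" "hive_independent" {
  region = "us-central1" # assumed to be the region of the target cluster

  placement {
    cluster_name = "example-cluster" # hypothetical cluster name
  }

  hive_config {
    # Keep executing the remaining queries even if one of them fails.
    continue_on_failure = true

    # Hypothetical tables; neither query depends on the other.
    query_list = [
      "SELECT COUNT(*) FROM example_table_a",
      "SELECT COUNT(*) FROM example_table_b",
    ]
  }
}
```
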
query_list - (Optional) The list of Hive queries or statements to execute as part of the job. Conflicts with query_file_uri.
query_file_uri - (Optional) HCFS URI of file containing Hive script to execute as the job. Conflicts with query_list.
continue_on_failure - (Optional) Whether to continue executing queries if a query fails. Setting to true can be useful when executing independent parallel queries. Defaults to false.
script_variables - (Optional) Mapping of query variable names to values (equivalent to the Hive command: SET name="value";).
properties - (Optional) A mapping of property names and values, used to configure Hive. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site.xml, /etc/hive/conf/hive-site.xml, and classes in user code.
jar_file_uris - (Optional) HCFS URIs of jar files to add to the CLASSPATH of the Hive server and Hadoop MapReduce (MR) tasks. Can contain Hive SerDes and UDFs.

The pig_config block supports:
resource "google_dataproc_job" "pig" {
  ...
  pig_config {
    query_list = [
      "LNS = LOAD 'file:///usr/lib/pig/LICENSE.txt' AS (line)",
      "WORDS = FOREACH LNS GENERATE FLATTEN(TOKENIZE(line)) AS word",
      "GROUPS = GROUP WORDS BY word",
      "WORD_COUNTS = FOREACH GROUPS GENERATE group, COUNT(WORDS)",
      "DUMP WORD_COUNTS",
    ]
  }
}
query_list - (Optional) The list of Pig queries or statements to execute as part of the job. Conflicts with query_file_uri.
query_file_uri - (Optional) HCFS URI of file containing Pig script to execute as the job. Conflicts with query_list.
continue_on_failure - (Optional) Whether to continue executing queries if a query fails. Setting to true can be useful when executing independent parallel queries. Defaults to false.
script_variables - (Optional) Mapping of query variable names to values (equivalent to the Pig command: name=[value]).
properties - (Optional) A mapping of property names to values, used to configure Pig. Properties that conflict with values set by the Cloud Dataproc API may be overwritten. Can include properties set in /etc/hadoop/conf/*-site.xml, /etc/pig/conf/pig.properties, and classes in user code.
jar_file_uris - (Optional) HCFS URIs of jar files to add to the CLASSPATH of the Pig Client and Hadoop MapReduce (MR) tasks. Can contain Pig UDFs.
logging_config.driver_log_levels - (Required) The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.

The sparksql_config block supports:
resource "google_dataproc_job" "sparksql" {
  ...
  sparksql_config {
    query_list = [
      "DROP TABLE IF EXISTS dprocjob_test",
      "CREATE TABLE dprocjob_test(bar int)",
      "SELECT * FROM dprocjob_test WHERE bar > 2",
    ]
  }
}
query_list - (Optional) The list of SQL queries or statements to execute as part of the job. Conflicts with query_file_uri.
query_file_uri - (Optional) The HCFS URI of the script that contains SQL queries. Conflicts with query_list.
script_variables - (Optional) Mapping of query variable names to values (equivalent to the Spark SQL command: SET name="value";).
properties - (Optional) A mapping of property names to values, used to configure Spark SQL's SparkConf. Properties that conflict with values set by the Cloud Dataproc API may be overwritten.
jar_file_uris - (Optional) HCFS URIs of jar files to be added to the Spark CLASSPATH.
logging_config.driver_log_levels - (Required) The per-package log levels for the driver. This may include 'root' package name to configure rootLogger. Examples: 'com.google = FATAL', 'root = INFO', 'org.apache = DEBUG'.
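
As an alternative to an inline query_list, the queries can live in a script referenced by query_file_uri, with values supplied through script_variables. The following is a sketch only: the bucket, the script path, the variable name, and the cluster name are hypothetical, and the script is assumed to already exist at that location.

```hcl
resource "google_dataproc_job" "sparksql_from_file" {
  region = "us-central1" # assumed to be the region of the target cluster

  placement {
    cluster_name = "example-cluster" # hypothetical cluster name
  }

  sparksql_config {
    # Set query_file_uri instead of query_list; the two conflict.
    query_file_uri = "gs://example-bucket/sql/report.sql" # hypothetical script

    # Equivalent to running: SET min_bar="2";
    script_variables = {
      "min_bar" = "2"
    }

    logging_config {
      driver_log_levels = {
        "root" = "INFO"
      }
    }
  }
}
```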

>> from Terraform Registry

Other Related Google Dataproc Resources

Google Dataproc Autoscaling Policy

Google Dataproc Cluster

Google Dataproc Cluster IAM

Google Dataproc Job IAM

Google Dataproc Workflow Template

Frequently asked questions

What is Google Dataproc Job?

Google Dataproc Job is a resource for Dataproc of Google Cloud Platform. Settings can be written in Terraform.

Where can I find the example code for the Google Dataproc Job?

For Terraform, the yuyatinnefeld/gcp and Gvkrishna/Terraform_GCP source code examples are useful. See the Example Usage from GitHub section for further details.
