Posted 9 Oct 2018


AWS Analyze Big Data with Terraform

Updated 5 Jun 2019 · CPOL · 2 min read
Following 'Infrastructure as Code' principles, we build a real sample project from scratch that deploys an EMR cluster and runs a Hive script on it. It is based on the Analyze Big Data with Hadoop project from the AWS 'Learn to Build' section.


It's important to describe your infrastructure as code. Terraform can help us with that.


Don't forget to create a variables file (for example, variables.tf) in your project root directory where you should set 3 variables:

  • region - where all your infrastructure will be deployed
  • access_key and secret_key for your user, which can be generated via AWS IAM (examples are below)

variable "region" {
    default = "us-east-2"
}

variable "access_key" {
    default = "JFSKLGD8...UFDJKGJS"
}

variable "secret_key" {
    default = "sdfs8d9fgEG33VE...343rVFDV3vdfevr"
}

Step by Step Scripts

After passing the AWS Solutions Architect Associate exam, in order not to forget the material, I looked at the projects AWS suggests in their Getting Started section and decided to implement them one by one. I chose Analyze Big Data with Hadoop as the first one. For fun, I decided to describe this project with Terraform scripts.

I'd like to share this experience because I faced a couple of non-trivial issues.

  1. First of all, we need to set up the Terraform AWS provider:
    provider "aws" {
        access_key = "${var.access_key}"
        secret_key = "${var.secret_key}"
        region     = "${var.region}"
    }
  2. Here, we should create an S3 Bucket and an EC2 Key Pair. Both are quite simple and straightforward steps.
    resource "aws_s3_bucket" "s3_bucket" {
        bucket = "tf-big-data"
    }

    resource "aws_key_pair" "emr_key_pair" {
        key_name   = "tf-big-data"
        public_key = "ssh-rsa A...w== rsa-key-20180822"
    }
  3. Creating an EMR cluster via the console takes 5-7 clicks, choosing a couple of options and leaving the rest at their defaults. It looks as easy as pie, but in fact a lot of actions happen behind the scenes. So we have to take care of the roles and policies for EMR and its EC2 instances. For each of them, we have to create 2 data objects (aws_iam_policy and aws_iam_policy_document) and 2 resources (aws_iam_role_policy_attachment and aws_iam_role). These roles are kept in a separate module.
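    As a sketch, the service-role half of that wiring might look like the following. The resource names (emr_assume_role, emr_service_role) are assumptions for illustration; the project's actual module may differ:
    data "aws_iam_policy_document" "emr_assume_role" {
        # Allow the EMR service to assume this role
        statement {
            actions = ["sts:AssumeRole"]
            principals {
                type        = "Service"
                identifiers = ["elasticmapreduce.amazonaws.com"]
            }
        }
    }

    resource "aws_iam_role" "emr_service_role" {
        name               = "tf-emr-service-role"
        assume_role_policy = "${data.aws_iam_policy_document.emr_assume_role.json}"
    }

    resource "aws_iam_role_policy_attachment" "emr_service_role" {
        role       = "${aws_iam_role.emr_service_role.name}"
        policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonElasticMapReduceRole"
    }
    A similar pair of role and attachment is needed for the EC2 instances of the cluster.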
  4. Another important section is about network and security. Here, we're creating 6 resources:
    • aws_vpc;
    • aws_subnet and aws_internet_gateway at this vpc;
    • aws_route_table at this vpc which has a route via created internet gateway;
    • aws_main_route_table_association which connects our vpc and route table;
    • aws_security_group at our vpc which depends on created subnet.
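    Those six resources might be wired together as sketched below; the resource names and CIDR blocks are illustrative assumptions, not the project's actual values:
    resource "aws_vpc" "main" {
        cidr_block           = "10.0.0.0/16"
        enable_dns_hostnames = true
    }

    resource "aws_subnet" "main" {
        vpc_id     = "${aws_vpc.main.id}"
        cidr_block = "10.0.0.0/24"
    }

    resource "aws_internet_gateway" "gw" {
        vpc_id = "${aws_vpc.main.id}"
    }

    resource "aws_route_table" "r" {
        vpc_id = "${aws_vpc.main.id}"
        # Route all outbound traffic through the internet gateway
        route {
            cidr_block = "0.0.0.0/0"
            gateway_id = "${aws_internet_gateway.gw.id}"
        }
    }

    resource "aws_main_route_table_association" "a" {
        vpc_id         = "${aws_vpc.main.id}"
        route_table_id = "${aws_route_table.r.id}"
    }

    resource "aws_security_group" "emr_sg" {
        vpc_id     = "${aws_vpc.main.id}"
        depends_on = ["aws_subnet.main"]
        egress {
            from_port   = 0
            to_port     = 0
            protocol    = "-1"
            cidr_blocks = ["0.0.0.0/0"]
        }
    }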
  5. We also need an aws_iam_instance_profile, which is kept at the end of the module.
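    A minimal sketch of that resource, assuming a hypothetical EC2 role named emr_ec2_role defined in the roles module:
    resource "aws_iam_instance_profile" "emr_profile" {
        name = "tf-emr-instance-profile"
        # The role the cluster's EC2 instances will run under
        role = "${aws_iam_role.emr_ec2_role.name}"
    }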
  6. Finally, we can create the EMR cluster itself. We should describe here all required properties such as: name, release_label, applications, service_role (from step 3), log_uri (from step 2), ec2_attributes (from steps 2, 4, 5), and one or more instance groups. I also added a 'step' section there where I put the Hive script to execute.
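    Putting it together, a hedged sketch of the cluster resource follows. The names of the role, subnet, security group, and instance profile from steps 3-5 are assumptions here, and the release label, instance types, and script path are illustrative:
    resource "aws_emr_cluster" "cluster" {
        name          = "tf-big-data"
        release_label = "emr-5.16.0"
        applications  = ["Hive"]
        service_role  = "${aws_iam_role.emr_service_role.arn}"
        log_uri       = "s3://${aws_s3_bucket.s3_bucket.bucket}/logs/"

        ec2_attributes {
            key_name                          = "${aws_key_pair.emr_key_pair.key_name}"
            subnet_id                         = "${aws_subnet.main.id}"
            emr_managed_master_security_group = "${aws_security_group.emr_sg.id}"
            emr_managed_slave_security_group  = "${aws_security_group.emr_sg.id}"
            instance_profile                  = "${aws_iam_instance_profile.emr_profile.arn}"
        }

        instance_group {
            instance_role  = "MASTER"
            instance_type  = "m4.large"
            instance_count = 1
        }

        instance_group {
            instance_role  = "CORE"
            instance_type  = "m4.large"
            instance_count = 2
        }

        # Run a Hive script stored in the S3 bucket as a cluster step
        step {
            name              = "Run Hive script"
            action_on_failure = "TERMINATE_CLUSTER"
            hadoop_jar_step {
                jar  = "command-runner.jar"
                args = ["hive-script", "--run-hive-script", "--args", "-f", "s3://${aws_s3_bucket.s3_bucket.bucket}/script.q"]
            }
        }
    }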

The full code of the project is here.

I would really appreciate any comments or suggestions on how this script could be simplified.

Points of Interest

It's not obvious how many resources are actually created behind the scenes when you click the button to create an EMR cluster in the AWS Console, but it's useful to know in order to understand what is really happening there.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By
Software Developer (Senior) Intetics
Ukraine
AWS Solutions Architect Associate
