
Composing poésie concrète with AWS Step Functions

28 Feb 2024 · CPOL · 7 min read
Creating a distributed map-reduce workflow with AWS Step Functions to write a poem that automatically regenerates itself.
This article highlights interesting points in building a system that scrapes a search engine, analyzes the found article content with ML, and serves the result to the web, all within the AWS ecosystem.

Concept

Observing signs of Russo-Ukrainian war fatigue, I decided to write a poem called “Is the war over?”. I’ve labeled it poésie concrète, drawing an analogy with musique concrète, a genre in which music is composed of non-musical pieces of sound.

In the same way, my poem is composed of search engine results for “Russian war crimes” from the past week, which are then processed by a sentiment analysis model to extract the sentences that highlight Russian atrocities most strongly. After ML processing, the sentences are assembled into the poem. Each Sunday, the poem is regenerated with newly found war crimes.

The war will be over on the day the search engine returns no results and the poem is blank.

One may argue that the search engine might return results long after hostilities end. This is exactly the point. Many people in Ukraine will have to live with the aftermath of the war for their entire lives. Consider war veterans, traumatized children, and families who lost loved ones.

You may access the page that leads to the poem here.

High-level Architecture

I’ve decided to proceed with serverless offerings since they allow me to pay per execution, and execution counts are low for this project. While my experience is mostly with the .NET stack, for this project I went with JavaScript since its web-related capabilities exceed those of any other language I’m familiar with.

The high-level architecture diagram looks as below:

Image 1

The entire process is launched by EventBridge Scheduler, which kicks off a chain of Lambdas, each with its own responsibility:

  1. Crawl the search engine.
  2. Extract article content from each web page.
  3. Analyze the sentiment of each sentence inside the article.
  4. Assemble the poem from the sentences with the strongest sentiment and put it in an S3 bucket that is served to the client via CloudFront.
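Since the poem regenerates each Sunday, the trigger is a weekly cron schedule. As a minimal sketch (the schedule name, time of day, and both ARNs are placeholders of mine, not values from the project), the trigger could be created programmatically like this:

JavaScript
// Sketch only: the name, cron time, and ARNs below are placeholders.
import { SchedulerClient, CreateScheduleCommand } from "@aws-sdk/client-scheduler";

const scheduler = new SchedulerClient({ region: "eu-central-1" });

await scheduler.send(new CreateScheduleCommand({
    Name: "poeme-concrete-weekly",
    ScheduleExpression: "cron(0 6 ? * SUN *)",   // every Sunday at 06:00 UTC
    FlexibleTimeWindow: { Mode: "OFF" },
    Target: {
        Arn: "<state machine arn>",
        RoleArn: "<scheduler role arn>"
    }
}));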

Since the source code is stored in GitHub, I decided to deploy the functions via GitHub Actions. In the sections below, we’ll focus on some points of interest found in the code.

Crawling the Search Engine

The algorithm behind the Google Crawler service is:

  1. Make a request to https://www.google.com/search?q=russian+war+crimes&tbs=qdr:w
  2. Traverse the HTML page for links to web pages.

When approaching this task, I was under the impression that I’d leverage the document API to traverse the HTML. However, it relies on a browser, which is not available in Lambda. JSDOM came to my rescue.

With its help, extracting necessary values is as simple as:

JavaScript
const jsdom = require("jsdom");

const dom = new jsdom.JSDOM(data);
// Google search result links carry a data-ved attribute
const anchors = dom.window.document.querySelectorAll('a[data-ved]');
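Putting both steps together, a minimal sketch of the crawler handler might look as follows. The href filtering is my assumption about how to discard Google’s internal links; it is not taken from the project’s source.

JavaScript
const jsdom = require("jsdom");

exports.handler = async () => {
    // tbs=qdr:w restricts results to the past week
    const res = await fetch("https://www.google.com/search?q=russian+war+crimes&tbs=qdr:w");
    const data = await res.text();

    const dom = new jsdom.JSDOM(data);
    const anchors = dom.window.document.querySelectorAll('a[data-ved]');

    // Keep only absolute links that lead outside Google itself (assumption)
    return [...anchors]
        .map(a => a.getAttribute('href'))
        .filter(href => href && href.startsWith('http') && !href.includes('google.'));
};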

First Deploy

I decided to design the project for deployment from the start, so the next step was to introduce a continuous build on GitHub.

name: Google Crawler Build

on:
  push:
    branches: [ "*" ]
  pull_request:
    branches: [ "master" ]

jobs:
  build:

    runs-on: ubuntu-latest

    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/

    steps:
    - uses: actions/checkout@v3
    - name: Use Node.js ${{ matrix.node-version }}
      uses: actions/setup-node@v3
      with:
        node-version: ${{ matrix.node-version }}
        cache: 'npm'
        cache-dependency-path: "./src/google-crawler/package-lock.json"
    - run: cd ./src/google-crawler && npm ci
    - run: cd ./src/google-crawler && npm test
    - run: cd ./src/google-crawler && npm run lint

I think the code is pretty self-explanatory; however, let’s look through some points.

Here, we rely on the ubuntu-latest environment and node-version: [20.x].

First of all, we check out the source code with actions/checkout@v3. For npm caching to work correctly, we have to specify the path to the package-lock.json file with cache-dependency-path: "./src/google-crawler/package-lock.json". Apart from restoring packages with npm ci, we also run unit tests and the linter, which are necessary quality gates for our codebase.

The deployment looks as follows:

name: Google Crawler Deploy

on:
  push:
    branches: [ "master" ]
jobs:
  lambda:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/

    steps:
      - uses: actions/checkout@v3
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
          cache-dependency-path: "./src/google-crawler/package-lock.json"
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - run: cd ./src/google-crawler && npm ci
      - run: cd ./src/google-crawler && zip -r lambda1.zip ./
      - run: cd ./src/google-crawler && aws lambda update-function-code --function-name=google-crawler --zip-file=fileb://lambda1.zip

It looks pretty similar to the build job; however, we also zip the code and deploy it via the aws lambda update-function-code command. Note that update-function-code assumes the function already exists: it only replaces the code of the google-crawler function.

Extracting Article Content

To extract article content from the web page, I’ve used the Readability package. Here’s how I download the article content from the web page and split it into sentences (a sketch of splitIntoSentences follows the snippet).

JavaScript
const jsdom = require("jsdom");
const readability = require("@mozilla/readability");

const res = await fetch(url);
const html = await res.text();
const doc = new jsdom.JSDOM(html);
const reader = new readability.Readability(doc.window.document);
const article = reader.parse();
const sentences = splitIntoSentences(article.textContent);
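The article doesn’t show splitIntoSentences, so here is a naive regex-based sketch of what such a helper could look like; it is my assumption, not the author’s actual implementation:

JavaScript
// Naive splitter: breaks on ., ! or ? followed by whitespace.
// Illustrative assumption only; production text needs smarter handling
// of abbreviations, quotes, and decimal numbers.
const splitIntoSentences = (text) =>
    text
        .split(/(?<=[.!?])\s+/)
        .map(s => s.trim())
        .filter(s => s.length > 0);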

Calling One Lambda From Another

There is plenty of advice on the internet on how to synchronously call one Lambda from another. AWS documentation, however, is more prohibitive on that matter, and for a good reason:

Quote:

While this synchronous flow may work within a single application on a server, it introduces several avoidable problems in a distributed serverless architecture:

Cost: with Lambda, you pay for the duration of an invocation. In this example, while the Create invoice functions runs, two other functions are also running in a wait state, shown in red on the diagram.

Error handling: in nested invocations, error handling can become much more complex. Either errors are thrown to parent functions to handle at the top-level function, or functions require custom handling. For example, an error in Create invoice might require the Process payment function to reverse the charge, or it may instead retry the Create invoice process.

Tight coupling: processing a payment typically takes longer than creating an invoice. In this model, the availability of the entire workflow is limited by the slowest function.

Scaling: the concurrency of all three functions must be equal. In a busy system, this uses more concurrency than would otherwise be needed.

One of the alternatives is to use AWS Step Functions to orchestrate the execution of the Lambdas. It turns out that my problem is a nice example of a distributed map. Consider:

  1. We extract all necessary links.
  2. We map over each link in parallel, extracting its content and analyzing its sentiment.
  3. We reduce the results into a single S3 bucket.

Here’s the definition of the entire system.

JSON
{
  "Comment": "A Step Functions workflow that processes an array of strings concurrently",
  "StartAt": "Extract links from google",
  "States": {
    "Extract links from google": {
      "Type": "Task",
      "Resource": "<google crawler arn>",
      "ResultPath": "$",
      "Next": "ProcessArray",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ]
    },
    "ProcessArray": {
      "Type": "Map",
      "ItemsPath": "$",
      "MaxConcurrency": 10,
      "Iterator": {
        "StartAt": "Extract article content",
        "States": {
          "Extract article content": {
            "Type": "Task",
            "Resource": "<article extractor arn>",
            "InputPath": "$",
            "Next": "Analyze sentiment",
            "Retry": [
              {
                "ErrorEquals": [
                  "States.ALL"
                ],
                "IntervalSeconds": 1,
                "MaxAttempts": 2,
                "BackoffRate": 2
              }
            ],
            "Catch": [
              {
                "ErrorEquals": [
                  "States.ALL"
                ],
                "Next": "Analyze sentiment"
              }
            ]
          },
          "Analyze sentiment": {
            "Type": "Task",
            "Resource": "<sentiment analyzer arn>",
            "InputPath": "$",
            "End": true,
            "Retry": [
              {
                "ErrorEquals": [
                  "States.ALL"
                ],
                "IntervalSeconds": 1,
                "MaxAttempts": 2,
                "BackoffRate": 2
              }
            ]
          }
        }
      },
      "Next": "Reducer",
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ],
      "Catch": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "Next": "Reducer"
        }
      ]
    },
    "Reducer": {
      "Type": "Task",
      "Resource": "<reducer arn>",
      "InputPath": "$",
      "ResultPath": "$",
      "End": true,
      "Retry": [
        {
          "ErrorEquals": [
            "States.ALL"
          ],
          "IntervalSeconds": 1,
          "MaxAttempts": 2,
          "BackoffRate": 2
        }
      ]
    }
  }
}

Here, we enter the map phase by executing the step with "Type": "Map"; the Article extractor and the Sentiment analyzer together make up the Iterator. Once the map phase is done, we enter the reduce phase via "Next": "Reducer".

Another thing worth mentioning is increasing the reliability of our system by adding error handling. The most obvious way is adding retries. With the settings below, a failed invocation is retried after 1 second and then after 2 more seconds (BackoffRate doubles the interval between attempts) before the error propagates:

JSON
"Retry": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "IntervalSeconds": 1,
    "MaxAttempts": 2,
    "BackoffRate": 2
  }
]

We also use another tactic: letting a single map iteration fail without failing the entire map phase. The Catch below routes the error onward to the next state instead of aborting the workflow.

JSON
"Catch": [
  {
    "ErrorEquals": [
      "States.ALL"
    ],
    "Next": "Analyze sentiment"
  }
]

Using Layers to Optimize Monorepo Structure

At this point, the structure of our repository looks suboptimal, with a package.json and a separate build step for each function. What’s more, a separate package.json means a separate node_modules folder, which wastes a lot of disk space since many modules are duplicated.

Image 2

This won’t scale once we add more functions. There is, however, a way to build and package all the dependencies at once using Lambda layers. This approach lets us package all the dependencies into a separate layer and treat it as a common runtime for our functions.

We’ll reorganize our repository to look like this:

Image 3

Let’s have a look at a separate action that deploys the layer:

name: Deploy Modules Layer

on:
  workflow_call:
    secrets:
      AWS_ACCESS_KEY_ID:
        required: true
      AWS_SECRET_ACCESS_KEY:
        required: true

jobs:
  layer:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/

    steps:
      - uses: actions/checkout@v3
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
          cache-dependency-path: "./src/package-lock.json"
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - run: cd ./src && npm ci
      - run: cd ./src && zip -r layer.zip node_modules
      - run: cd ./src && aws lambda publish-layer-version --layer-name poeme-concrete-modules --zip-file fileb://layer.zip

Nothing special is happening here apart from the fact that we are now using aws lambda publish-layer-version. Note that each run publishes a new layer version, which is why the function deployment below has to look up the latest version explicitly. Now let’s jump to consuming the deployed layer when we deploy our functions.

name: Article Extractor Deploy

on:
  push:
    branches: [ "master" ]
jobs:
  layer:
    uses: ./.github/workflows/modules-layer-deploy.yml
    secrets: inherit

  lambda:
    runs-on: ubuntu-latest
    needs: layer
    strategy:
      matrix:
        node-version: [20.x]
        # See supported Node.js release schedule at https://nodejs.org/en/about/releases/

    steps:
      - uses: actions/checkout@v3
      - name: Use Node.js ${{ matrix.node-version }}
        uses: actions/setup-node@v3
        with:
          node-version: ${{ matrix.node-version }}
          cache: 'npm'
          cache-dependency-path: "./src/package-lock.json"
      - uses: aws-actions/configure-aws-credentials@v2
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: eu-central-1
      - run: cd ./src && npm ci
      - run: cd ./src/article-extractor && zip -r lambda1.zip ./
      - run: cd ./src/article-extractor && aws lambda update-function-code --function-name=article-extractor --zip-file=fileb://lambda1.zip
      - run: echo "layer-arn=$(aws lambda list-layer-versions --layer-name poeme-concrete-modules --region eu-central-1 --query 'LayerVersions[0].LayerVersionArn' --output text)" >> $GITHUB_ENV
      - run: aws lambda update-function-configuration --function-name=article-extractor --layers="${{ env.layer-arn }}"

Here, a couple of things are worth noting.

The first is how we rely on the layer deployment job.

jobs:
  layer:
    uses: ./.github/workflows/modules-layer-deploy.yml
    secrets: inherit

The thing that might be unobvious to newcomers is the use of secrets: inherit to pass secrets down to the layer deploy workflow. One might naturally assume that the called workflow reads secrets from GitHub storage on its own; however, this is not true, and the child workflow receives its secrets from the parent workflow.

Another important thing is forcing the newly deployed function to use the latest version of the published layer. We achieve this in two steps:

  1. Querying for the latest layer version (list-layer-versions returns versions newest-first; --output text strips the JSON quotes so the raw ARN is stored) and saving it in an environment variable:
    echo "layer-arn=$(aws lambda list-layer-versions --layer-name poeme-concrete-modules --region eu-central-1 --query 'LayerVersions[0].LayerVersionArn' --output text)" >> $GITHUB_ENV
  2. Using the stored value to update the function configuration:
    aws lambda update-function-configuration --function-name=article-extractor --layers="${{ env.layer-arn }}"

Accessing Secrets

When it comes to choosing a sentiment analysis engine, the natural choice is Amazon Comprehend. Why didn’t I stick with it? I didn’t like the results.

Image 4

Instead, I’ve chosen the text2data service. At the end of the day, it’s like calling any other third-party service via HTTP, so in this section, I’ll briefly cover retrieving the secrets needed to call this API.

JavaScript
import { SecretsManagerClient, GetSecretValueCommand } from "@aws-sdk/client-secrets-manager";

async function getSentimentAnalysisApiKey() {
    const secret_name = "SENTIMENT_ANALYSIS_API_KEY";

    const client = new SecretsManagerClient({
        region: "eu-central-1",
    });

    let response;

    try {
        response = await client.send(
            new GetSecretValueCommand({
                SecretId: secret_name,
                VersionStage: "AWSCURRENT"
            })
        );
    } catch (error) {
        console.log(error);
        throw error;
    }

    return response.SecretString;
}
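With the key in hand, calling the service is an ordinary HTTP POST. The endpoint URL and payload shape below are my assumptions for illustration; they are not taken from the article and are not verified against text2data’s documentation:

JavaScript
// Hypothetical sketch: the endpoint and payload shape are assumptions.
async function analyzeSentiment(sentence) {
    const apiKey = await getSentimentAnalysisApiKey();

    const response = await fetch("https://api.text2data.com/v3/analyze", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ DocumentText: sentence, PrivateKey: apiKey })
    });

    return response.json();
}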

Writing Down the Result to S3

CloudFront serves the HTML content from an S3 bucket. So, in order for the poem to be published, we need to generate the HTML and store it inside the bucket.

To generate the HTML, we insert the sentences into a Mustache template:

JavaScript
const fs = require('fs');
const Mustache = require('mustache');

// Defined before use: const arrow functions are not hoisted
const renderTemplate = (poem) => {
  const template = fs.readFileSync('./template.html', 'utf8');

  return Mustache.render(template, {
    poem: poem
  });
}

// Wrap each sentence in a paragraph tag before rendering
const formatted =
        poem
          .map(p => `<p>${p}</p>`)
          .join("\n");
const html = renderTemplate(formatted);

The point of interest in the template is that we have to use triple curly brackets so that the inserted HTML is not escaped.

HTML
<html>
    <!-- omitted for brevity -->
    <body>
        <article>
{{{poem}}}
        </article>
    </body>
</html>

Now we can store the HTML in S3 with the code below:

JavaScript
// AWS SDK v2 client, matching the .promise() call style below
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

const putParams = {
  Bucket: 'poeme-concrete',
  Key: 'index.html',
  Body: html,
  ContentType: 'text/html',
};

await s3.putObject(putParams).promise();
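As a side note, if you prefer to stay on AWS SDK v3 (which the Secrets Manager snippet above already uses), an equivalent sketch would be:

JavaScript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "eu-central-1" });
await s3.send(new PutObjectCommand(putParams));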

Conclusion

Usually, at this point, I write a summary of the technologies touched on in the article. This time, however, I encourage you to read through the poem and revisit it occasionally. You may brush it off as something too disturbing, but for many people in Ukraine, it’s a grim reality. I am no exception, since I have spent the last two years battling to cure my now-4-year-old son’s PTSD and speech disorder. Still, I’m in a luckier position because I have a roof over my head and live in a relatively peaceful region of the country.

And if the course of the last 30 years has taught us anything, it’s that you can never stop the aggressor by leaving him unpunished. Russia didn’t stop after Transnistria, Ichkeria, Abkhazia, Crimea, and parts of Donetsk and Luhansk regions, and won’t stop now if it feels that it can wage this war unpunished.

History

  • 28th February, 2024 - Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

