Click here to Skip to main content
15,748,477 members
Articles / Artificial Intelligence
Posted 19 Sep 2022


6 bookmarked

Human Action Recognition with OpenVINO™ Toolkit

Rate me:
Please Sign up or sign in to vote.
0.00/5 (No votes)
19 Sep 2022CPOL6 min read
In this article, you will learn how to work with live human action recognition using the OpenVINO™ AI toolkit in a synchronized schedule.

This article is a sponsored article. Articles such as these are intended to provide you with information on products and services that we consider useful and of value to developers

Image 1

It’s been a few months since I started my journey here at Intel, and I’m so excited to share with you all what I have been working on. Today, I will walk you through my first notebook on human action recognition. I hope you enjoy it and can apply it in your ongoing developments.

In this blog, you will learn how to work with live human action recognition using the OpenVINO™ AI toolkit in a synchronized schedule.

Human action recognition is an AI capability that can find and categorize an extensive set of activities within a recorded or live video. For example, if you have a large family video collection, and you want to find a specific memory, human action recognition is the simplest and fastest way to do so. Traditional methods would require you to spend a lot of manual effort and time reviewing every video you have until you find the right one. Using human action recognition, you can train AI models to automatically categorize and organize your videos by their recorded activities for you, making it easier to find and access your most cherished memories in a matter of seconds.

This action can also be applied to businesses like manufacturing. For instance, providing human workers with solutions that can recognize their performed tasks, gesture for feedback, and keep them safe by recognizing and alerting managers of any hazards.

But these are just a few examples of what human action recognition can do. Over the next few years, I expect to see a lot more new and exciting use cases in this field. Let me know what other areas you imagine could benefit from this AI capability after running through this notebook. But for now, let’s get started.

About this notebook

For this notebook, I am using the DeepMind Kinetics-400 human action video dataset, which contains 400 actions in total, including Person Actions (e.g., writing, drinking, laughing), Person-Person Actions (e.g., hugging, shaking hands, playing poker), and Person-Object Actions (riding a scooter, doing laundry, blowing a balloon). You can also distinguish a group of Parent-Child interactions, such as braiding or brushing hair, salsa or robot dancing, and playing the violin or the guitar (Figure 1). For more information about the labels and the dataset, see “The Kinetics Human Action Video Dataset” research paper.

Image 2

Figure 1. Human Action Recognition with OpenVINO™ toolkit

You can run this notebook using your general-purpose computer, no hardware accelerators are required. The great thing about using the AI toolkit OpenVINO is that it is designed to work at the edge, so GPUs, CPUs, and VPUs can be optimized for it to run your AI inference models efficiently. But again, those are not necessary. Various video sources can be used, such as a clip coming from a URL, a locally stored file, or a webcam feed.

I will also be using the Action Recognition model from Open Model Zoo, which provides a wide variety of pre-trained deep learning models and demo applications. The model I am using is based on Video Transformer, with a ResNet34 architecture (Figure 2). It contains two models:

  • The Encoder, based on the PyTorch framework, with the input shape of [1x3x224x224] — 1 batch size, 3 color channels, and image dimensions of 224 by 224 pixels; and an output shape of [1x512x1x1], representing embedding of the processed frame.
  • The Decoder, also based on the PyTorch framework, with the input shape of [1x16x512] — 1 batch size, 16 frames for duration of the clip in one second, and 512 dimensions of embedding.

Image 3

Figure 2. Pipeline of the human action recognition notebook.

I am selecting 16 frames per second to be analyzed — that is the number of frames the Kinetics-400 authors averaged to find the class score. The frames are preprocessed to analyze just the center-cropped image as you can see in the GIF in Figure 1.

Both models create a sequence-to-sequence (Seq2Seq) system to identify the human activities for the Kinetics-400 dataset. Due to a non-exhaustive annotation, the model performance is the best, but it could help us to understand the pipeline.

You can start recognizing your own videos by:

  1. Preparing your installation using OpenVINO Notebooks.
  2. Preparing your video source, a webcam or video files with your common activities that you want to detect. Consider the action names to be detected by checking the dataset labels.
  3. Opening a Jupyter notebook on your computer. The notebook can run under Windows, MacOS, and Ubuntu, over different internet browsers.

Live Action Recognition with OpenVINO™

Now, I will show you some highlights of the notebook:

1. Downloading the models

We are working with Open Model Zoo tools, such as omz_downloader. It is a command line tool that automatically creates a directory structure and downloads the selected model. In this case, it is the “action-recognition-0001” model from Open Model Zoo.

if not os.path.exists(model_path_decoder) or not os.path.exists(model_path_encoder):
    download_command = f"omz_downloader " \
                       f"--name {model_name} " \
                       f"--precision {precision} " \
                       f"--output_dir {base_model_dir}"
    ! $download_command

2. Model initialization

To start inference, initialize the inference engine, read the network and weight from files, load the model on the chosen device — in my case, the CPU — and get input and output nodes.

# Initialize inference engine
ie_core = Core()

def model_init(model_path: str) -> Tuple:
    Read the network and weights from file, load the 
    model on the CPU and get input and output names of nodes
    :param: model: model architecture path *.xml
             compiled_model: Compiled model
             input_key: Input node for model
             output_key: Output node for model
    # Read the network and corresponding weights from file
    model = ie_core.read_model(model=model_path)
    # compile the model for the CPU (you can use GPU or MYRIAD as well)
    compiled_model = ie_core.compile_model(model=model, device_name="CPU")
    #Get input and output names of nodes
    input_keys = compiled_model.input(0)
    output_keys = compiled_model.output(0)
    return input_keys, output_keys, compiled_model

3. Helper functions

You need a lot of code to prepare and visualize your results. Create a crop-centered ROI, resize the image, and put text info in each frame.

4. AI Functions

Here is where the magic happens.

a. Preprocessing the frame before running the Encoder (preprocessing)

  • Before passing the frame through the encoder, prepare the image — scale it to its shortest dimension, to the chosen size, by cropping, centering, and squaring it so that both width and height are equal. The frame must be transposed from Height-Width-Channels (HWC) to Channels-Height-Width (CHW).
def preprocessing(frame: np.ndarray, size: int) -> np.ndarray:
    Preparing frame before Encoder.
    The image should be scaled to its shortest dimension at "size"
    and cropped, centered, and squared so that both width and 
    height have lengths "size". Frame must be transposed from
    Height-Width-Channels (HWCs) to Channels-Height-Width (CHW).
    :param frame: input frame
    :param size: input size to encoder model
    :returns: resized and cropped frame
    # Adapative resize 
    preprocessed = adaptive_resize(frame, size)
    # Center_crop
    (preprocessed, roi) = center_crop(preprocessed)
    # Transpose frame HWC -> CHW
    preprocessed = preprocessed.transpose((2, 0, 1))[None,] # HWC -> CHW
    return preprocessed, roi

b. Encoder Inference per frame (encoder)

  • This function calls the network previously configured for the encoder model (compiled_model), extracts the data from the output node, and appends it in an array to be used by the decoder.
def encoder(
    preprocessed: np.ndarray,
    compiled_model: CompiledModel
) -> List:
    Encoder Inference per frame. This function calls the network previously
    configured for the encoder model (compiled_model), extracts the data
    from the output node, and appends it in an array to be used by the decoder.
    :param: preprocessed: preprocessing frame
    :param: compiled_model: Encoder model network
    :returns: encoder_output: embedding layer that is appended with each arriving frame 
    output_key_en = compiled_model.output(0)
    # Get results on action-recognition-0001-encoder model
    infer_result_encoder = compiled_model([preprocessed])[output_key_en]
    return infer_result_encoder

c. Decoder inference per set of frames (decoder)

  • This function concatenates the embedding layer from the encoder output and transposes the array to match the decoder input size. It calls the network previously configured for the decoder model (compiled_model_de), extracts the logits (yes, logits are a real thing; you can find out more here) and normalizes them to get confidence values along the specified axis. It decodes top probabilities into corresponding label names.
def decoder(encoder_output: List, compiled_model_de: CompiledModel) -> List:
    Decoder inference per set of frames. This function concatenates the embedding layer
    forms the encorder output, transpose the array to match with the decoder input size.
    Calls the network previously configured for the decoder model (compiled_model_de), extracts
    the logits and normalize those to get confidence values along specified axis.
    Decodes top probabilities into corresponding label names
    :param: encoder_output: embedding layer for 16 frames
    :param: compiled_model_de: Decoder model network
    :returns: decoded_labels: The k most probable actions from the labels list
              decoded_top_probs: confidence for the k most probable actions
    # Concatenate sample_duration frames in just one array
    decoder_input = np.concatenate(encoder_output, axis=0)
    # Organize input shape vector to the Decoder (shape: [1x16x512]]
    decoder_input = decoder_input.transpose((2, 0, 1, 3))
    decoder_input = np.squeeze(decoder_input, axis=3)
    output_key_de = compiled_model_de.output(0)
    # Get results on action-recognition-0001-decoder model
    result_de = compiled_model_de([decoder_input])[output_key_de]
    # Normalize logits to get confidence values along specified axis
    probs = softmax(results_de - np.max(result_de))
    # Decodes top probabilities into corresponding label names
    decoded_labels, decoded_top_probs = decode_output(probs, labels, top_k=3)
    return decoded_labels, decoded_top_probs

Run the complete notebook pipeline

Now, let’s see the notebook in action.

  1. Select the video for which you would like to run the complete workflow.
    video_file = ""
    run_action_recognition(source=video_file, flip=False, use_popup=False, skip_first_frames=600)
  2. Select the webcam and run the complete workflow again.
    run_action_recognition(source=0, flip=False, use_popup=False, skip_first_frames=0)

Congrats! You’ve done it. I hope you found this topic interesting and useful for your application development. 😉

To learn more about the OpenVINO toolkit and what it can do, visit For more hands-on AI training, check out our AI Dev Team Adventures.


Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By
United States United States
Hi, all! My name is Paula Ramos. I have been an AI enthusiast and worked with Computer Vision since the early 2000s. Developing novel integrated engineering technologies is my passion. I love to deploy solutions that can be used by real people to solve their equally real problems. At the end of this blog post, you can find some links to my previous work. If you’d like to share your ideas on how we could improve our community content, just drop me a line! 😉 I will be happy to hear your feedback.

Comments and Discussions

-- There are no messages in this forum --