Small computers are becoming ubiquitous. Today, cell phones, smart watches, and fitness trackers are everywhere. They are useful for monitoring ourselves and our surroundings so they can notify us when something happens that we might want to be aware of. Smart watches started off primarily as pedometers, but have evolved to provide early detection for serious issues such as atrial fibrillation. We’re only at the beginning of the micro-monitoring movement.
Small computing devices are often constrained by limited computing power, frequent disconnection from the internet, and small storage capacities. Although these smart devices can take many observations in a short period of time, storage constraints often make it infeasible to keep all of them for later syncing while the device is offline. By the same token, sending vast amounts of data over a wireless connection consumes limited battery power.
To make the most use of the signals being captured, it is critical that the signal-processing step gets pushed to the edge devices themselves.
Machine learning (ML) has advanced enormously in the past decade, improving the accuracy of many signal-processing tasks such as object detection in images, gesture detection in video and speech recognition. Today we’re just scratching the surface of what’s possible. There are countless other ways to make people’s lives better through the use of machine learning when used in small devices.
What else is possible?
In this article we’re going to explore the idea of a dog mood detector. We’ll build a device that listens to the ambient sound around it and, if a dog is present, attempts to determine which type of sound the dog is making: a friendly bark, a scared whimper, or an aggressive growl.
Based on the preferences of the user, the device vibrates when it thinks they should check on their dog. This could be used, perhaps, to help owners keep tabs on their dog when they're out of earshot. Of course, this is just a prototype, and the results of this idea have not been tested in real-world scenarios.
To prototype this device, we’ll use an Arm-based Raspberry Pi, which is my go-to platform for implementing ML on edge devices. Arm processors are not just used in Raspberry Pis; they also power numerous cell phones, mobile gaming consoles, and a plethora of other devices. They pack a lot of computing power into an energy-efficient processor, and you can pick one up for an affordable price just about anywhere.
For the training, we’re going to take a look at Google AudioSet, which is a large collection of 10-second audio clips from YouTube videos. The data is provided in a preprocessed format that is compatible with the YouTube-8M starter kit, which we’ll use to train a model that can classify the audio clips.
The training for this model can take some time, so we’ll offload the processing to Google Cloud AI Platform and download the resulting model when it’s finished. When we have all the pieces, we’ll transfer the model over to the Raspberry Pi. We'll also create a Python script to capture the input from a connected microphone and attempt to make predictions about any dog sounds it identifies once per second.
Let’s get that model built
First, let's make a folder somewhere to house all of the work we’re about to do.
To build this model, we’ll need to download the dataset — it's available through a link under the “Features dataset” heading. It’s easiest to download the single gzip tarball file to a local machine.
Next, unzip it and extract the files. In this package there are three folders, one containing a balanced training set, one for the evaluation set, and one with an unbalanced training set. Each folder contains just over 4,000 files.
The TFRecord files contain the preprocessed features. The file names start with the first two characters of the video ID on YouTube. Since the video IDs are case-sensitive, be careful when extracting the files if the local file system is case-insensitive, like Windows. (Handy tip: we used 7zip to extract these feature files. 7zip supports command-line options that let you automatically rename files that already exist, making sure that files get renamed instead of overwritten.)
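Before extracting, it can help to check whether any file names in the archive differ only by letter case. The following is a minimal sketch of that check, not part of the starter kit; the file names in the listing are made up for illustration, and in practice the names could come from something like `tarfile.open(path).getnames()`.

```python
from collections import defaultdict


def find_case_collisions(file_names):
    """Group names that differ only by letter case.

    On a case-insensitive file system each group would collapse into a
    single file, so every group returned here is a potential overwrite.
    """
    groups = defaultdict(list)
    for name in file_names:
        groups[name.lower()].append(name)
    return [sorted(names) for _, names in sorted(groups.items()) if len(names) > 1]


# Hypothetical archive listing; the real names come from the tarball.
listing = ["aB.tfrecord", "Ab.tfrecord", "ab.tfrecord", "zz.tfrecord"]
collisions = find_case_collisions(listing)
print(collisions)
```

If the list comes back empty, a plain extraction should be safe. Otherwise, extract with a tool that renames instead of overwriting, as described in the tip above.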
Once we have the dataset correctly extracted, we’ll clone the YouTube-8M GitHub repository, which contains the code to train the model. We recommend cloning it to the same folder you created for the extracted dataset.
Next, update the readers.py file in the YouTube-8M folder to support the old AudioSet TFRecord files. This involves two steps:
- Change all the uses of "id" to the ID feature key that the AudioSet TFRecord files use.
- Change the default num_classes parameter to 527, which is the number of different categories in the audio dataset.
There are five places where the ID needs to be changed and two places where num_classes needs to be changed.
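To double-check that no spot was missed, a small helper can list the lines that mention either name. This is only a sketch: the sample text below is a made-up stand-in for readers.py, the pattern only catches the double-quoted form "id", and the real file will look different.

```python
import re


def find_edit_sites(source_text):
    """Return (line_number, line) pairs that mention "id" or num_classes."""
    pattern = re.compile(r'"id"|\bnum_classes\b')
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if pattern.search(line)
    ]


# Made-up stand-in for readers.py; the real file's contents will differ.
sample = '''
def __init__(self, num_classes=1000):
    pass
features = {"id": None, "labels": None}
other = 1
'''
sites = find_edit_sites(sample)
for number, line in sites:
    print(number, line)
```

To use it on the real file, read it with `open("youtube-8m/readers.py").read()` and pass the text in, then compare the count against the five and two places mentioned above.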
To run the program, spin up a new Python 3.6+ virtual environment and install the tensorflow==1.14 package. Now is also a good time to install the requirements for the inference script we’re going to build in the next step. Although each package is pinned to a specific version number, the only hard requirement is tensorflow 1.14; for the rest, it is safe to install the latest release.
At this point we are ready to train the model. Run the training script locally first to test it out, which should not take very long on the balanced training set. Open a command line, browse to the folder created in the first step of this section, and enter the following command (note that this is all one line):
python youtube-8m/train.py \ --train_data_pattern=./audioset_v1_embeddings/bal_train
Also note that, while the \ line breaks will work fine on Linux systems, you'll have to replace them with a ^ character to work on Windows.
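One way to sidestep the line-continuation difference altogether is to launch the training script from Python with an argument list. Here is a minimal sketch, assuming the flag names used by the commands in this article; the paths are placeholders to be replaced with the real data pattern and output folder.

```python
import sys


def build_train_command(script, flags):
    """Build an argument list suitable for subprocess.run from a dict of flags."""
    command = [sys.executable, script]
    for name, value in flags.items():
        command.append(f"--{name}={value}")
    return command


# Placeholder values: substitute the real data pattern and output folder.
flags = {
    "train_data_pattern": "path/to/train/*.tfrecord",
    "model": "FrameLevelLogisticModel",
    "train_dir": "path/to/output",
}
command = build_train_command("youtube-8m/train.py", flags)
print(" ".join(command))
# To launch it: import subprocess; subprocess.run(command, check=True)
```

Because the arguments are passed as a list, no shell is involved, so the same script should behave the same on Linux and Windows.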
At 100 epochs, this is going to run until it reaches roughly step 8,500. The FrameLevelLogisticModel will max out at an accuracy of roughly 58-59%. On our test system it took just under 20 minutes to complete.
There are other models included in this starter kit, including LstmModel. These achieve near-perfect accuracy on the training data, but they severely overfit on the balanced training set when tested against the evaluation set.
Let’s train this model in the cloud
An alternative is to train on the full collection of sounds using the unbalanced dataset. This requires significantly more processing time, but the GPUs available on the Google Cloud AI Platform shorten it considerably. The simple logistic model will achieve roughly 88% accuracy on the unbalanced training set.
To run this in the cloud, register and log into a Google Cloud AI Platform account, enable billing and download the command line tools, which is explained in more detail here.
Once everything is set up, go to the Cloud Console, create a new project, and create a new storage bucket. The name of the storage bucket must be globally unique, which is easiest if it includes the user's account name. Upload the entire audioset_v1_embeddings and youtube-8m folders to this storage bucket.
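As a rough illustration of picking a bucket name, the sketch below derives a candidate from an account name and checks it against a simplified subset of the Cloud Storage naming rules. The suffix and the account name are hypothetical, the check is deliberately stricter and simpler than the full rules, and passing it says nothing about whether the name is still available.

```python
import re

# Simplified subset of the bucket naming rules (an assumption of this sketch):
# 3-63 characters, lowercase letters, digits, dashes and underscores,
# starting and ending with a letter or digit.
BUCKET_NAME_PATTERN = re.compile(r"[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]")


def make_bucket_name(account_name, suffix="yt8m-train"):
    """Derive a candidate bucket name from an account name."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", account_name.lower()).strip("-")
    name = f"{cleaned}-{suffix}"
    if not BUCKET_NAME_PATTERN.fullmatch(name):
        raise ValueError(f"not a valid bucket name: {name!r}")
    return name


bucket = make_bucket_name("Jane.Doe")  # hypothetical account name
print(bucket)
```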
If this is all done correctly, we should be able to open up the Google Cloud SDK Shell and run the following commands to kick off the training. Make sure to replace your-project-name and your-storage-bucket-name with the appropriate account values. This is written for a Unix-based system; adjust accordingly for a Windows system.
gsutil mb -p your-project-name $BUCKET_NAME
gcloud --verbosity=debug ml-engine jobs submit training $JOB_NAME --python-version 3.5 --package-path=youtube-8m --module-name=youtube-8m.train --staging-bucket=$BUCKET_NAME --region=us-east1 --config=youtube-8m/cloudml-gpu.yaml -- --train_data_pattern='gs://your-storage-bucket-name/audioset_v1_embeddings/unbal_train/*.tfrecord' --model=FrameLevelLogisticModel --train_dir=$BUCKET_NAME/yt8m_train_frame_level_logistic_model
Again, note that the last gcloud call is one long command with configuration options.
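The commands expect $BUCKET_NAME and $JOB_NAME to be set beforehand. Job names need to differ from one submission to the next, so a timestamp suffix is a convenient choice. The sketch below shows one way to generate such a name; the prefix is hypothetical, and the character check (a leading letter followed by letters, digits, and underscores) is our assumption about what the service accepts.

```python
import re
from datetime import datetime


def make_job_name(prefix="yt8m_train", now=None):
    """Return a job name made unique by a timestamp suffix."""
    now = now or datetime.now()
    name = f"{prefix}_{now:%Y%m%d_%H%M%S}"
    # Assumed rule: starts with a letter, then letters, digits, underscores.
    if not re.fullmatch(r"[A-Za-z][A-Za-z0-9_]*", name):
        raise ValueError(f"unexpected characters in job name: {name!r}")
    return name


print(make_job_name())
```

The printed value can be assigned to JOB_NAME in the shell before submitting the job.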
This is going to take upwards of half a day to complete. When it’s all said and done, download the model output from your Cloud Storage bucket.
Running on the Raspberry Pi
We're demonstrating this application on an Arm-based Raspberry Pi 4 running Raspbian with Python 3 installed. Install PyAudio on the device. (If any problems occur, this answer should help.)
Plug in a USB microphone (with an optional headset to output audio for testing). At this point it’s easiest to set up the microphone as the default device. Log into the Raspbian desktop and click on the sound icon next to the clock in the upper-right corner, then select the microphone to use.
The last major step here is to get the tools that process the raw audio into the same 128-dimensional feature embedding that is provided by AudioSet. The tool to do this is part of the TensorFlow models GitHub repository linked earlier. Run the exact same install procedure on the Pi, making sure to install to the Python 3 instance. Also, clone this repository into the same folder that the dataset and YouTube-8M repository were cloned into.
Run the vggish_smoke_test.py script to make sure everything installed correctly.
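To get a feel for how much audio sits behind each embedding, the sketch below splits a buffer of samples into fixed-length windows. It assumes 16 kHz mono input and 0.96-second windows, which is our understanding of the VGGish defaults. It only illustrates the framing; the real tool also computes a spectrogram and runs the network on each window.

```python
SAMPLE_RATE_HZ = 16000   # assumed input rate for the VGGish tools
WINDOW_SECONDS = 0.96    # assumed length of audio behind each embedding


def split_into_windows(samples, sample_rate=SAMPLE_RATE_HZ,
                       window_seconds=WINDOW_SECONDS):
    """Split samples into non-overlapping windows, dropping a short tail."""
    window_size = int(round(sample_rate * window_seconds))
    usable = len(samples) - len(samples) % window_size
    return [samples[start:start + window_size]
            for start in range(0, usable, window_size)]


one_second = [0.0] * SAMPLE_RATE_HZ   # one second of silence
windows = split_into_windows(one_second)
print(len(windows), len(windows[0]))
```

Under these assumptions, one second of captured audio yields a single complete window, which lines up with making one prediction per second.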
Now copy the model downloaded from Google Cloud Platform into the same folder, as well as the microphone listening script.
Run this script. It will start listening on the default device and write out the predictions to the console.
If it's not possible to set up the desired device as the default device, then run "python model-run.py list" to display a list of all devices by index. Find the device index, then run the command again, passing that index in. For example:
python model-run.py 3
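The argument handling described above could look roughly like the following. This is a hypothetical sketch rather than the actual contents of model-run.py: with no argument it falls back to the default device, "list" requests the device listing, and anything else must be an integer index.

```python
def parse_device_argument(argv):
    """Interpret the optional argument: nothing, "list", or a device index."""
    if len(argv) < 2:
        return ("listen", None)   # None means: use the default input device
    if argv[1] == "list":
        return ("list", None)
    try:
        return ("listen", int(argv[1]))
    except ValueError:
        raise SystemExit(f"expected 'list' or a device index, got {argv[1]!r}")


print(parse_device_argument(["model-run.py", "3"]))
```

In a PyAudio-based script, the index would typically be passed along as the `input_device_index` argument when opening the input stream.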
Copy the entire contents of this folder to the Raspberry Pi and run the script again. Once per second we should see predictions of what dog noise it thinks is happening! The output step can be replaced by whatever mechanism is most appropriate for the device and the target user.
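As one possible shape for that output step, the sketch below compares each prediction against per-sound thresholds chosen by the user and reports which sounds should trigger an alert. The label names, scores, and thresholds are hypothetical and only serve as an illustration.

```python
def labels_to_alert(scores, thresholds):
    """Return the labels whose score meets the user's threshold for them."""
    return sorted(
        label
        for label, threshold in thresholds.items()
        if scores.get(label, 0.0) >= threshold
    )


# Hypothetical labels and numbers for illustration only.
preferences = {"growl": 0.6, "whimper": 0.5}   # this user ignores barks
prediction = {"bark": 0.9, "growl": 0.7, "whimper": 0.1}
triggered = labels_to_alert(prediction, preferences)
if triggered:
    print("check on the dog:", ", ".join(triggered))
```

Sounds the user did not set a threshold for are simply ignored, so the print call is the only part that would need to change to drive a vibration motor or any other notifier.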
Today we explored one possible audio-based ML application enabled by the power of an Arm-based mobile device. This concept would need to be proven out in much greater detail before reaching market, but the capacity to run an arbitrary audio detection model on a mobile device is here.
The AudioSet data includes 527 labels with a robust ontology of urban sounds. There are also opportunities to better process the sound prior to passing it to our predictor, such as applying a cocktail party algorithm and passing each audio source through the VGGish filter.
Running a dog mood detector on a Raspberry Pi with an Arm microprocessor is pretty cool. To make it even cooler, we can use the tools provided by TensorFlow to convert and quantize the model, then run it on a low-cost, low-power Arm microcontroller with TensorFlow Lite for Microcontrollers.
Sounds interesting? Experiment and see what problem you can solve with this approach. You never know how much of an impact you might make on someone else’s life.
Check out the Arm solutions page to learn more about what we are doing in AI and ML. If you are a business that wants to partner with Arm, check out our Arm AI Partner Program.
Nick is a Principal Data Engineer at Stack Overflow, where he has been working since 2011. He has a master's degree in Machine Learning from Georgia Tech and a passion for mentoring developers of all experience levels.