iOS Object Detection with Live Camera Preview

Jarek Szczegielniak

Nov 27, 2020

CPOL

In this article, we’ll build an iOS application that captures the live camera feed and shows it in a preview. The next article will extend this application to use our YOLO v2 model.

Download iOS Preview - 64.1 KB

Introduction

This series assumes that you are familiar with Python, Conda, and ONNX, and that you have some experience developing iOS applications in Xcode. You are welcome to download the source code for this project. We’ll run the code using macOS 10.15+, Xcode 11.7+, and iOS 13+.

Handling a live camera feed in an iOS application can be a little overwhelming. We’ll try to keep things as simple as possible, favoring code readability over performance. Also, to reduce the number of scaling options to consider, we’ll use a fixed portrait orientation.

The code for this article was initially inspired by this application.

This demo app was written using Xcode 11.7 and should work reasonably well on any iPhone from the iPhone 7 onward, running iOS 13 or later.

Application Layout

From the storyboard perspective, our application is extremely simple. It contains a single View Controller with a lone Preview View control on it. We’ll use this view for the live camera feed.

Capturing Camera Feed

All the code responsible for handling camera input and video preview is in the Controllers/VideoCapture class, which implements AVCaptureVideoDataOutputSampleBufferDelegate.

The following members store its setup:

private let captureSession = AVCaptureSession()
private var videoPreviewLayer: AVCaptureVideoPreviewLayer! = nil
private let videoDataOutput = AVCaptureVideoDataOutput()
private let videoDataOutputQueue = DispatchQueue(label: "VideoDataOutput", qos: .userInitiated, attributes: [], autoreleaseFrequency: .workItem)
private var videoFrameSize: CGSize = .zero

The setupPreview method, called by the class initializer, binds all the elements together.

First, it obtains the first available back camera as input device:

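var deviceInput: AVCaptureDeviceInput!
let videoDevice = AVCaptureDevice.DiscoverySession(deviceTypes: [.builtInWideAngleCamera], mediaType: .video, position: .back).devices.first
do {
    deviceInput = try AVCaptureDeviceInput(device: videoDevice!) // assumed completion of the truncated listing
} catch {
    print("Could not create video device input: \(error)")
    return
}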

Next, it starts the configuration process, forcing 640 x 480 frames from the camera. This resolution is sufficient for our YOLO v2 model because it uses images scaled to 416 x 416 pixels anyway. Note that, due to the fixed portrait orientation, we store 480 x 640 as the input dimensions in the videoFrameSize variable for future use:

captureSession.beginConfiguration()
captureSession.sessionPreset = .vga640x480
self.videoFrameSize = CGSize(width: 480, height: 640)
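
To see why 640 x 480 is enough, we can compare the portrait frame size with the model input size. The following is a minimal, framework-free sketch; the helper names portraitSize and scaleFactors are hypothetical, and it assumes the frame is simply stretched to a 416 x 416 square, which may differ from the resizing used later in the series:

```swift
import Foundation

/// Swaps the dimensions of a landscape capture preset to get the portrait frame size.
func portraitSize(presetWidth: Int, presetHeight: Int) -> (width: Int, height: Int) {
    return (width: presetHeight, height: presetWidth)
}

/// Per-axis factor needed to resize a frame to a square model input.
/// Assumption: a plain stretch to the square, with no cropping or padding.
func scaleFactors(frameWidth: Int, frameHeight: Int, modelSide: Int) -> (x: Double, y: Double) {
    return (x: Double(modelSide) / Double(frameWidth),
            y: Double(modelSide) / Double(frameHeight))
}

// 640 x 480 preset viewed in portrait orientation becomes 480 x 640.
let portrait = portraitSize(presetWidth: 640, presetHeight: 480)

// 416 is the YOLO v2 input side mentioned in the article.
let factors = scaleFactors(frameWidth: portrait.width,
                           frameHeight: portrait.height,
                           modelSide: 416)

print("portrait frame:", portrait.width, "x", portrait.height)
print("scale factors:", factors.x, factors.y)
```

Both factors are below 1, so the frame is only ever downscaled on its way to the model, and a higher capture resolution would add no detail that the model could use.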

The configuration then adds the input and the video data output to the session. Setting the alwaysDiscardsLateVideoFrames flag to true effectively limits the queue of frames to process to a single element: while the current frame is being processed, any frames that arrive in the meantime are discarded.

captureSession.addInput(deviceInput)
if captureSession.canAddOutput(videoDataOutput) {
    captureSession.addOutput(videoDataOutput)
    videoDataOutput.alwaysDiscardsLateVideoFrames = true
    videoDataOutput.videoSettings = [kCVPixelBufferPixelFormatTypeKey as String: Int(kCVPixelFormatType_420YpCbCr8BiPlanarFullRange)]
    videoDataOutput.setSampleBufferDelegate(self, queue: videoDataOutputQueue)
} else {
    print("Could not add video data output to the session")
    captureSession.commitConfiguration()
    return
}

let captureConnection = videoDataOutput.connection(with: .video)
captureConnection?.isEnabled = true
captureConnection?.videoOrientation = .portrait
captureSession.commitConfiguration()

The fixed portrait orientation will make handling and drawing object detection predictions a little easier.
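
The effect of alwaysDiscardsLateVideoFrames can be illustrated with a small simulation that does not touch AVFoundation at all. This is only a simplified model of the behavior described above, not the framework's actual scheduling; the function deliveredFrames and the timing values are assumptions chosen for illustration:

```swift
import Foundation

/// Simulates an output that discards frames arriving while the previous
/// frame is still being processed. All times are in milliseconds.
/// Returns the indices of the frames that would reach the delegate.
func deliveredFrames(frameCount: Int, frameIntervalMs: Int, processingMs: Int) -> [Int] {
    var delivered: [Int] = []
    var busyUntil = 0  // time at which processing of the current frame ends

    for index in 0..<frameCount {
        let arrival = index * frameIntervalMs
        if arrival >= busyUntil {
            // The processor is idle, so this frame is handed to the delegate.
            delivered.append(index)
            busyUntil = arrival + processingMs
        }
        // Otherwise the frame is late and is dropped.
    }
    return delivered
}

// Assumed numbers: roughly 30 fps (33 ms between frames), 80 ms of work per frame.
let kept = deliveredFrames(frameCount: 10, frameIntervalMs: 33, processingMs: 80)
print(kept)  // prints [0, 3, 6, 9]
```

With these numbers only every third frame is processed. The preview layer is not affected by this, because it is fed by the capture session directly rather than by our delegate.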

Camera Feed Preview

Next, the setupPreview method creates a videoPreviewLayer instance and adds it as a sublayer to our application’s view (viewLayer below):

self.videoPreviewLayer = AVCaptureVideoPreviewLayer(session: captureSession)
self.videoPreviewLayer.videoGravity = .resizeAspectFill
        
videoPreviewLayer.frame = viewLayer.bounds
viewLayer.addSublayer(videoPreviewLayer)

Using the .resizeAspectFill value for videoGravity ensures that the video fills the entire available screen. Because no iPhone has a screen with an aspect ratio of 1.33:1 (implied by the 640 x 480 resolution), each frame will be cropped on both sides in the portrait view. If we used .resizeAspect instead, the whole frame would be visible, but with empty bars above and below it.
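
The difference between the two modes comes down to which scale factor is picked. Here is a minimal sketch of that arithmetic in plain Swift, using a hypothetical Size struct instead of CGSize and an assumed 375 x 667 point screen as the example; it is not how AVCaptureVideoPreviewLayer is implemented, only the geometry it implies:

```swift
import Foundation

/// Plain width/height pair, used instead of CGSize to keep the sketch framework-free.
struct Size {
    let width: Double
    let height: Double
}

/// Aspect fill picks the larger ratio, so the view is completely covered.
func aspectFillScale(frame: Size, view: Size) -> Double {
    return max(view.width / frame.width, view.height / frame.height)
}

/// Aspect fit picks the smaller ratio, so the whole frame stays visible.
func aspectFitScale(frame: Size, view: Size) -> Double {
    return min(view.width / frame.width, view.height / frame.height)
}

/// Fraction of the frame width that is still visible after aspect-fill cropping.
func visibleWidthFraction(frame: Size, view: Size) -> Double {
    let scale = aspectFillScale(frame: frame, view: view)
    return min(1.0, view.width / (frame.width * scale))
}

let frame = Size(width: 480, height: 640)   // portrait camera frame from the article
let screen = Size(width: 375, height: 667)  // assumed example screen size, in points

let fill = aspectFillScale(frame: frame, view: screen)
let fit = aspectFitScale(frame: frame, view: screen)

print("fill scale:", fill, "fit scale:", fit)
print("visible width fraction:", visibleWidthFraction(frame: frame, view: screen))
print("bar height with aspect fit:", screen.height - frame.height * fit)
```

For this example screen, aspect fill keeps roughly three quarters of the frame width, while aspect fit would leave 167 points of empty bars in total above and below the picture.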

Finalising Camera Preview Configuration

We need three more methods in the VideoCapture class:

public func captureOutput(_ output: AVCaptureOutput, didOutput sampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // We will handle frame(s) here
}

public func captureOutput(_ captureOutput: AVCaptureOutput, didDrop didDropSampleBuffer: CMSampleBuffer, from connection: AVCaptureConnection) {
    // Dropped frame(s) can be handled here
}

public func startCapture() {
    if !captureSession.isRunning {
        captureSession.startRunning()
    }
}

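The first two methods are the callbacks declared by AVCaptureVideoDataOutputSampleBufferDelegate, which our VideoCapture class implements. The protocol marks both as optional, but we’ll need the first one to receive frames, and the second is a convenient place to react to dropped frames. For now, empty implementations are fine. We need the last method, startCapture, to start processing the video feed.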

With the VideoCapture implementation complete, we link the preview view from Main.storyboard to a cameraView outlet in the main ViewController and add an instance variable to hold the VideoCapture object. We then create the VideoCapture instance and start capturing in the viewDidLoad method:

self.videoCapture = VideoCapture(self.cameraView.layer)
self.videoCapture.startCapture()

Conclusion

We now have a simple iOS application configured to capture and preview a live camera stream. In the next, and last, article in this series, we’ll extend the application to use our YOLO v2 model for object detection.