ARM Neon Optimization for image interleaving and deinterleaving

pi19404, 12 Aug 2014
ARM Neon Optimization InterLeaving/De-Interleaving

    Introduction

  • This article looks at basic interleaving and de-interleaving operations using ARM NEON intrinsics and evaluates the performance improvement on an Android-based mobile device in comparison with the standard OpenCV code.

    ARM Neon

  • ARM's NEON technology is a 64/128-bit hybrid SIMD architecture designed to accelerate multimedia and signal-processing code.
  • SIMD (single instruction, multiple data) technology processes multiple data elements with one instruction, saving time for other computations. A set of pixels is processed at a time.
  • One way to use it is to write assembly code, which has a steep learning curve and requires knowledge of the processor architecture, instruction set, etc.
  • Instead of using low-level instructions directly, one can use special functions called intrinsics, which are called like regular functions but operate on several input data elements simultaneously.
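As a rough illustration of the intrinsic style described above, the sketch below brightens a buffer of 8-bit pixels, 8 at a time. The function name `brighten_u8` and the `USE_NEON_PATH` macro are made up for this example and do not come from the article's code; the intrinsics used (`vdup_n_u8`, `vld1_u8`, `vqadd_u8`, `vst1_u8`) are standard `arm_neon.h` calls. A scalar loop handles the tail and doubles as the fallback when NEON is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON_PATH 1   /* hypothetical macro, only for this sketch */
#else
#define USE_NEON_PATH 0
#endif

/* Hypothetical helper: dst[i] = min(255, src[i] + delta) for n bytes. */
void brighten_u8(const uint8_t *src, uint8_t *dst, size_t n, uint8_t delta)
{
    size_t i = 0;
    assert(src != NULL && dst != NULL);
#if USE_NEON_PATH
    uint8x8_t vdelta = vdup_n_u8(delta);      /* broadcast delta to 8 lanes */
    for (; i + 8 <= n; i += 8) {
        uint8x8_t v = vld1_u8(src + i);       /* load 8 pixels */
        v = vqadd_u8(v, vdelta);              /* 8 saturating adds at once */
        vst1_u8(dst + i, v);                  /* store 8 pixels */
    }
#endif
    for (; i < n; ++i) {                      /* scalar tail and fallback */
        unsigned sum = (unsigned)src[i] + delta;
        dst[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The intrinsic version reads like ordinary C, while each call in the NEON loop touches 8 pixels per iteration.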

    Deinterleaving and Interleaving channels of Image

  • NEON structure loads read data from memory into 64-bit NEON registers, with optional de-interleaving. Stores work similarly, re-interleaving data from registers before writing it to memory.
  • A set of NEON intrinsics is provided for de-interleaving data.
  • They simultaneously pull data from memory and separate it into different registers. This is called de-interleaving.
  • Structure loads and stores de-interleave or interleave elements based on the element size specified in the instruction.
  • The OpenCV functions split and merge are ported to ARM NEON, and their performance is compared with the OpenCV code.

    De-InterLeaving

  • De-interleaving separates adjacent elements in memory into separate registers.
  • The BGR values of a pixel are stored in adjacent memory locations. The VLD3 instruction de-interleaves the BGR channels of the image into 3 different registers.
  • The result of the vld3 load is then written out with vst1, one register per destination channel buffer.
<script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[
#include <arm_neon.h>
#include <stdint.h>

/* vld3_u8: loads 24 consecutive bytes and de-interleaves them with a
   stride of 3. The result is three 64-bit registers holding 8 elements
   each, one register per channel. It is used when the pointer refers to
   8-bit unsigned data; vld3_s8 is the signed variant. */
/* vst1_u8: stores the contents of a 64-bit register to the given memory
   location. The 8 elements (8x8 = 64 bits) are written at once. */

/* Despite its name, this function de-interleaves d3 into the planes
   r0, r1 and r2. It processes 8 pixels per iteration, so the trailing
   (width*height) % 8 pixels are not handled. */
void neon_interlace(uint8_t * __restrict d3, uint8_t * __restrict r0, uint8_t * __restrict r1, uint8_t * __restrict r2, int width, int height)
{
    int i;
    uint8_t *s3 = (uint8_t *)d3;
    for (i = 0; i < (width * height) / 8; i++) {
        uint8x8x3_t loaded = vld3_u8(s3);   /* 8 BGR pixels -> 3 registers */
        vst1_u8(r0, loaded.val[0]);
        vst1_u8(r1, loaded.val[1]);
        vst1_u8(r2, loaded.val[2]);
        s3 = s3 + 3 * 8;
        r0 = r0 + 8;
        r1 = r1 + 8;
        r2 = r2 + 8;
    }
}
</script>
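The loop above covers only whole groups of 8 pixels. A minimal sketch of one way to also handle the remaining pixels is given below; it is an illustration, not the article's code. The name `split_bgr_u8` and the `USE_NEON_PATH` macro are assumptions introduced here, and the scalar loop also serves as a fallback when the code is built without NEON.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON_PATH 1   /* hypothetical macro, only for this sketch */
#else
#define USE_NEON_PATH 0
#endif

/* Hypothetical helper: split `pixels` interleaved BGR pixels into 3 planes. */
void split_bgr_u8(const uint8_t *bgr, uint8_t *b, uint8_t *g, uint8_t *r,
                  size_t pixels)
{
    size_t i = 0;
    assert(bgr != NULL && b != NULL && g != NULL && r != NULL);
#if USE_NEON_PATH
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px = vld3_u8(bgr + 3 * i);  /* 24 bytes -> 3 registers */
        vst1_u8(b + i, px.val[0]);
        vst1_u8(g + i, px.val[1]);
        vst1_u8(r + i, px.val[2]);
    }
#endif
    for (; i < pixels; ++i) {                   /* tail pixels and fallback */
        b[i] = bgr[3 * i];
        g[i] = bgr[3 * i + 1];
        r[i] = bgr[3 * i + 2];
    }
}
```

Taking a pixel count rather than width and height keeps the sketch independent of how the image rows are laid out; it assumes the rows are stored contiguously without padding.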

    NDK BUILD

  • Since the application is being developed for Android, the Android NDK toolchain is used for cross compilation.
  • The ndk-build utility is used to build the application.
  • The ndk-build utility requires that the base build directory contain a directory called jni.
  • The jni directory contains all the source files as well as Android.mk and Application.mk, which are the makefiles for the build process.
  • In the present application the jni directory contains the source files neon.cpp, helloneon-intrinsics.c and helloneon-intrinsics.h.
  • To initiate the build process in verbose mode, execute the command <script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[ ndk-build V=1 all </script>
  • This generates the helloneon binary in the libs/armeabi-v7a directory.
  • The binary is transferred to the directory /data/local/tmp/NEON_TEST on the Android mobile device <script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[ adb push libs/armeabi-v7a/helloneon /data/local/tmp/NEON_TEST </script>
  • Files created under the /data/local/tmp directory can be given execute permission. I could not find any other sub-directory in the file system that allowed binaries to be executed or to be given execute permission.
  • The script a.ksh called below exports basic variables and then executes the binary.
<script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[
cd /data/local/tmp/NEON_TEST
./helloneon
adb shell /data/local/tmp/NEON_TEST/a.ksh
</script>
  • The performance of the NEON intrinsic function is compared with the standard OpenCV split function
    OPENCV : 15ms
    NEON : 11ms
  • The improvement due to the NEON intrinsics is not very significant.
  • Many references suggest, and the disassembly output of the compiler confirms, that the main reason is that the ARM compiler is not able to generate optimized assembly code from the intrinsics.
  • The compiler generates heavily unoptimized code that takes more cycles than required.
  • The compilation commands were taken from the ndk-build verbose build output, and the -c flag was replaced with -S to generate the assembly code
<script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[
/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc \
 -MMD -MP -MF /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o.d \
 -fpic -ffunction-sections -funwind-tables -fstack-protector \
 -D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
 -Wno-psabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb -Os -fomit-frame-pointer \
 -fno-strict-aliasing -finline-limit=64 -mfpu=neon -I/usr/local/include \
 -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision/ \
 -I/opt/android-ndk-r7/sources//android/cpufeatures \
 -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/include \
 -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/libs/armeabi-v7a/include \
 -I/home/pi19404/ARM//jni -DANDROID -DHAVE_NEON -fPIC -DANDROID \
 -I/usr/local/include/opencv -I/usr/local/include -I/OpenVision \
 -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision -fPIC -DHAVE_NEON=1 \
 -ftree-vectorize -mfpu=neon -O3 -mfloat-abi=softfp -ffast-math \
 -Wa,--noexecstack -O3 -DNDEBUG \
 -I/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/include \
 /home/pi19404/ARM//jni/helloneon-intrinsics.c -S
</script>
  • The above command generates the file helloneon-intrinsics.s in the present directory.
  • A lot of unnecessary instructions can be observed in the assembly code.
  • The assembly code corresponding to the functions was optimized by hand and compiled.
  • For compilation, the verbose build output from the ndk-build process was again modified, so that the helloneon-intrinsics.o object file is compiled from helloneon-intrinsics.s and the helloneon binary is linked from all the object files.
<script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[
/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-gcc \
 -MMD -MP -MF /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o.d \
 -fpic -ffunction-sections -funwind-tables -fstack-protector \
 -D__ARM_ARCH_5__ -D__ARM_ARCH_5T__ -D__ARM_ARCH_5E__ -D__ARM_ARCH_5TE__ \
 -Wno-psabi -march=armv7-a -mfloat-abi=softfp -mfpu=vfp -mthumb -Os -fomit-frame-pointer \
 -fno-strict-aliasing -finline-limit=64 -mfpu=neon -I/usr/local/include \
 -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision/ \
 -I/opt/android-ndk-r7/sources//android/cpufeatures \
 -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/include \
 -I/opt/android-ndk-r7/sources/cxx-stl/gnu-libstdc++/libs/armeabi-v7a/include \
 -I/home/pi19404/ARM//jni -DANDROID -DHAVE_NEON -fPIC -DANDROID \
 -I/usr/local/include/opencv -I/usr/local/include -I/OpenVision \
 -I/media/UBUNTU/repository/OpenVisionLibrary/OpenVision -fPIC -DHAVE_NEON=1 \
 -ftree-vectorize -mfpu=neon -O3 -mfloat-abi=softfp -ffast-math \
 -Wa,--noexecstack -O3 -DNDEBUG \
 -I/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/include \
 -c /home/pi19404/ARM/jni/helloneon-intrinsics.s \
 -o /home/pi19404/ARM/obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o \
 --sysroot=/opt/android-ndk-r7/platforms/android-14/arch-arm/

/opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/arm-linux-androideabi-g++ \
 -Wl,--gc-sections -Wl,-z,nocopyreloc \
 --sysroot=/opt/android-ndk-r7/platforms/android-8/arch-arm \
 /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/neon.o \
 /home/pi19404/ARM//obj/local/armeabi-v7a/objs/helloneon/helloneon-intrinsics.o \
 /home/pi19404/ARM//obj/local/armeabi-v7a/libcpufeatures.a \
 /home/pi19404/ARM//obj/local/armeabi-v7a/libgnustl_static.a \
 /opt/android-ndk-r7/toolchains/arm-linux-androideabi-4.4.3/prebuilt/linux-x86/bin/../lib/gcc/arm-linux-androideabi/4.4.3/libgcc.a \
 -Wl,--fix-cortex-a8 -Wl,--no-undefined -Wl,-z,noexecstack \
 -L/opt/android-ndk-r7/platforms/android-8/arch-arm/usr/lib \
 -fPIC -llog -ldl -lm -lz -lm -lc -lgcc -Wl,-rpath,'libs/armeabi-v7a' \
 -L/home/pi19404/ARM//jni/../libs/armeabi -llog -Llibs/armebi -Llibs/armeabi-v7a \
 -lopencv_core -lopencv_imgproc -lopencv_highgui -lopencv_flann -lc -lm \
 -o /home/pi19404/ARM//obj/local/armeabi-v7a/helloneon

cp /home/pi19404/ARM//obj/local/armeabi-v7a/helloneon libs/armeabi-v7a
</script>
  • The results of the optimization process are as follows
    OPENCV : 15ms
    NEON : 8ms
    NEON OPTIMIZED : 6 ms
  • Thus a speedup factor of 1.4 and a total performance improvement of 2.5x over OpenCV (15 ms / 6 ms) were observed after optimizing the assembly code.
  • This still does not motivate the use of assembly-level coding, since the development effort may outweigh the optimization benefits.
<script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[
    push {r4, r5, r6, r7, r8, r9, sl, fp}   @ store registers on stack
    .save {r4, r5, r6, r7, r8, r9, sl, fp}
.LCFI0:
    .pad #64
    sub sp, sp, #64                         @ pointer to top of stack
.LCFI1:
    mov r7, r0                              @ r7 = interleaved source pointer
    ldr r4, [sp, #96]                       @ load function argument into r4: 64 + 8*4
    ldr r5, [sp, #100]                      @ load function argument into r5: 64 + 9*4
    mul r6, r4, r5
    asr r6, r6, #3                          @ divide loop count by 8
.loop:
    @ load 8 pixels:
    vld3.8 {d0-d2}, [r7]                    @ load and de-interleave pixels
    vst1.8 {d0}, [r1]                       @ store de-interleaved channels
    vst1.8 {d1}, [r2]
    vst1.8 {d2}, [r3]
    adds r7, r7, #24                        @ advance source pointer by 8 pixels
    adds r1, r1, #8                         @ advance each destination pointer
    adds r2, r2, #8
    adds r3, r3, #8
    subs r6, r6, #1                         @ decrement and check loop counter
    bne .loop
    add sp, sp, #64
    pop {r4, r5, r6, r7, r8, r9, sl, fp}
    bx lr
</script>
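The article does not show how the millisecond figures in this section were measured. The sketch below is one possible harness, under the assumption that averaging CPU time from the standard `clock()` function over several runs is good enough; all names in it (`workload_t`, `fill_workload`, `average_ms`, `speedup`) are hypothetical and introduced only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the routine being timed. */
typedef struct { uint8_t *data; size_t n; } workload_t;

void fill_workload(void *ctx)
{
    workload_t *w = (workload_t *)ctx;
    for (size_t i = 0; i < w->n; ++i)
        w->data[i] = (uint8_t)i;
}

/* Average CPU time of one call to fn(ctx), in milliseconds. */
double average_ms(void (*fn)(void *), void *ctx, int repeats)
{
    assert(fn != NULL && repeats > 0);
    clock_t start = clock();
    for (int k = 0; k < repeats; ++k)
        fn(ctx);
    clock_t end = clock();
    return 1000.0 * (double)(end - start) / CLOCKS_PER_SEC / repeats;
}

/* Speedup of an optimized run relative to a baseline run. */
double speedup(double baseline_ms, double optimized_ms)
{
    assert(optimized_ms > 0.0);
    return baseline_ms / optimized_ms;
}
```

With the figures quoted above, `speedup(15.0, 6.0)` gives the 2.5x overall improvement of the hand-optimized assembly over OpenCV.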

    InterLeaving

  • The interleaving operation combines 3 independent channels of an image into a multi-channel image.
  • Each element of the independent channels is stored in adjacent locations in the multi-channel image. <script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[ void neon_interleave(uint8_t * __restrict d3,uint8_t * __restrict r0,uint8_t * __restrict r1,uint8_t * __restrict r2,int width,int height) { int i; uint8x8x3_t v; for(i=0;i
  • The performance is as follows :
    OPENCV : 9ms
    NEON OPTIMIZED : 3 ms
  • The interleaving process shows a performance improvement of about 3x.
  • Thus by using NEON intrinsics we can achieve performance improvements with respect to standard C code, and by optimizing the assembly code further performance benefits can be achieved.
  • It should be noted that the OpenCV build used for comparison may itself have been compiled with SIMD optimizations (SSE is x86-only, so on ARM this would mean NEON), hence the speedup relative to plain C code may be higher.
  • However, optimizing the assembly code did not give a large additional speedup in the interleaving and de-interleaving operations.
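The interleaving step described in this section can be sketched with the `vst3_u8` intrinsic, which stores three registers with a stride of 3. This is an independent illustration, not a reconstruction of the `neon_interleave` function whose listing is cut off above; the name `merge_bgr_u8` and the `USE_NEON_PATH` macro are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define USE_NEON_PATH 1   /* hypothetical macro, only for this sketch */
#else
#define USE_NEON_PATH 0
#endif

/* Hypothetical helper: merge 3 planes into `pixels` interleaved BGR pixels. */
void merge_bgr_u8(const uint8_t *b, const uint8_t *g, const uint8_t *r,
                  uint8_t *bgr, size_t pixels)
{
    size_t i = 0;
    assert(b != NULL && g != NULL && r != NULL && bgr != NULL);
#if USE_NEON_PATH
    for (; i + 8 <= pixels; i += 8) {
        uint8x8x3_t px;
        px.val[0] = vld1_u8(b + i);     /* 8 blue values  */
        px.val[1] = vld1_u8(g + i);     /* 8 green values */
        px.val[2] = vld1_u8(r + i);     /* 8 red values   */
        vst3_u8(bgr + 3 * i, px);       /* write 24 interleaved bytes */
    }
#endif
    for (; i < pixels; ++i) {           /* tail pixels and fallback */
        bgr[3 * i]     = b[i];
        bgr[3 * i + 1] = g[i];
        bgr[3 * i + 2] = r[i];
    }
}
```

The structure mirrors the de-interleaving case: three plain loads followed by one structure store instead of one structure load followed by three plain stores.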

    Code

  • The code can be found in the git repository https://github.com/pi19404/OpenVision in the POC/ARM subdirectory.
  • The jni subdirectory contains the source files as well as the makefiles.
  • The script generate_assembly.ksh generates the helloneon-intrinsics.s file in the ARM directory. After modifying the file, copy it to the jni sub-directory.
  • The script compile_assembly.ksh compiles helloneon-intrinsics.s and also builds the binary.
  • The binary requires the OpenCV library files, which need to be transferred to the Android mobile device <script class="brush: cpp" type="syntaxhighlighter"> <![CDATA[ adb push libs/armeabi-v7a/ /data/local/tmp/NEON_TEST </script>
  • The link "Execution Cycle computation" shows the number of execution cycles taken by ARM assembly code, which can be used to check the performance of compiler-generated and optimized code.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

pi19404
Student IIT Bombay
India

Last Updated 13 Aug 2014
Article Copyright 2014 by pi19404