Posted 4 Oct 2018
Licenced MIT

Mhook Enhancements: 10x Speed Improvement and Other Fixes

Learn how we increased mhook's performance, enhanced its capabilities, and eliminated certain bugs.

Introduction

Mhook is an open source API hooking library for intercepting function calls (setting hooks). In other words, it's a library for embedding code into application threads. This library comes in handy when you need to monitor and log system function calls or run your own code before, after, or instead of a system call. To intercept a function call, mhook overwrites the first five bytes of the target function with an unconditional jump (jmp #addr) to the hook function. The five overwritten bytes are moved to a specially allocated area called a trampoline. When the hook function has done its work, it can make an unconditional jump to the trampoline, which executes the five stored bytes and then jumps back into the intercepted code. To learn more about how to use mhook for API hooking, you can read our mhook tutorial.
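
To make the five-byte patch more concrete, here is a minimal sketch of how an x86 jmp rel32 instruction (opcode E9 followed by a 32-bit offset) can be encoded. The function name encodeJmpRel32 is hypothetical and isn't part of mhook; the sketch only builds the bytes and doesn't write them into memory.

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <optional>

// Builds the 5-byte instruction `jmp rel32` that jumps from `from` (the address
// where the jump is written) to `to` (for example, the hook function).
// Returns std::nullopt if the target is outside the +/-2 GB reach of rel32.
// NOTE: illustrative helper, not an mhook function.
std::optional<std::array<std::uint8_t, 5>> encodeJmpRel32(std::uint64_t from, std::uint64_t to)
{
    // The offset is counted from the end of the 5-byte jump instruction.
    const std::int64_t rel = static_cast<std::int64_t>(to - (from + 5));
    if (rel < std::numeric_limits<std::int32_t>::min() ||
        rel > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;

    const std::uint32_t bits = static_cast<std::uint32_t>(static_cast<std::int32_t>(rel));
    std::array<std::uint8_t, 5> code{};
    code[0] = 0xE9; // opcode of jmp rel32
    for (int i = 0; i < 4; ++i) // the offset is stored in little-endian order
        code[1 + i] = static_cast<std::uint8_t>((bits >> (8 * i)) & 0xFF);
    return code;
}
```

For example, encodeJmpRel32(0x1000, 0x2000) yields the bytes E9 FB 0F 00 00, because the offset is 0x2000 - 0x1005 = 0xFFB.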

Table of contents:

Problems with the Mhook Library

Increasing Performance: Case #1

Increasing Performance: Case #2

Getting Project Files for Different IDEs

Hooking a Function with a Conditional Jump in the First 5 Bytes

Bug: Infinite Recursion

Bug: Deadlock

Bug: Hook Leads to the Wrong Function

Conclusion

Problems with the Mhook Library

We often use mhook to solve tasks within projects related to cybersecurity and reverse engineering. When using mhook, we've faced the following issues:

  • Poor performance with a large number of system threads and when setting multiple hooks in a row;
  • The need to manually create project files for each integrated development environment (IDE);
  • The inability to hook functions whose first five bytes aren't suitable for writing the jump to the hook;
  • Infinite recursion (bug);
  • Deadlock (bug);
  • Hooks leading to the wrong function (bug).

To overcome these issues, we've improved the original version of mhook and made our updated version public. In this article, we'll describe the problems we faced during our work with the original mhook, and how we solved them with our own mhook enhancements.

Increasing Performance: Case #1

We increased the performance of the mhook library using the NtQuerySystemInformation function.

Issue Description

Mhook starts working very slowly with a large number of system threads.

Causes

When setting a hook, mhook uses information about processes and threads to suspend all threads of the current process except the calling thread while it redirects the function to the address specified by the developer. Although taking a thread snapshot with CreateToolhelp32Snapshot is fast, the Thread32Next function becomes very slow as the number of threads in the system grows. Microsoft doesn't publish its source code, but you can find similar methods in the ReactOS project. It seems that each Thread32Next call triggers NtMapViewOfSection, which is a rather resource-intensive operation.

Solution

Instead of using CreateToolhelp32Snapshot, Thread32First, and Thread32Next from tlhelp32.h, we used the NtQuerySystemInformation function.

The tests showed that with CreateToolhelp32Snapshot, calling Thread32Next took about 10 times more resources than getting the snapshot. With NtQuerySystemInformation, getting the snapshot was cheap enough (cheaper than in the initial implementation), and the thread iterations were almost free (about 10 times cheaper than the snapshot), basically coming down to calculating pointers. In general, the NtQuerySystemInformation-based approach is about 10 times faster than the CreateToolhelp32Snapshot-based one. In a system with about 3000 threads, setting one hook takes about 0.02 seconds, while the original method could take as long as 0.14 seconds per hook.
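
To illustrate why iteration "comes down to calculating pointers", here is a portable sketch of walking a buffer of variable-length records linked by a next-entry offset. This is the general shape of the data that NtQuerySystemInformation returns for process and thread information, where each process entry carries a NextEntryOffset field. The ProcRecord layout and the buildSnapshot and threadsOfProcess helpers are simplified stand-ins invented for this example; they are not the real Windows structures and not mhook code.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <utility>
#include <vector>

// Simplified stand-in for one process entry (NOT the real Windows layout).
// The thread IDs of the process follow the record directly in the buffer.
struct ProcRecord {
    std::uint32_t nextEntryOffset; // bytes to the next record, 0 for the last one
    std::uint32_t threadCount;     // number of thread IDs that follow
    std::uint32_t processId;
};

using Process = std::pair<std::uint32_t, std::vector<std::uint32_t>>; // pid, thread IDs

// Packs the processes into one contiguous buffer, mimicking a snapshot.
std::vector<std::uint8_t> buildSnapshot(const std::vector<Process>& processes)
{
    std::vector<std::uint8_t> buf;
    for (std::size_t i = 0; i < processes.size(); ++i) {
        const auto& tids = processes[i].second;
        const std::size_t size = sizeof(ProcRecord) + tids.size() * sizeof(std::uint32_t);
        ProcRecord rec{};
        rec.nextEntryOffset = (i + 1 < processes.size()) ? static_cast<std::uint32_t>(size) : 0;
        rec.threadCount = static_cast<std::uint32_t>(tids.size());
        rec.processId = processes[i].first;
        const std::size_t pos = buf.size();
        buf.resize(pos + size);
        std::memcpy(buf.data() + pos, &rec, sizeof(rec));
        if (!tids.empty())
            std::memcpy(buf.data() + pos + sizeof(rec), tids.data(),
                        tids.size() * sizeof(std::uint32_t));
    }
    return buf;
}

// Finds a process by ID and returns its thread IDs. No system calls are made
// while iterating: moving to the next record is just an addition.
std::vector<std::uint32_t> threadsOfProcess(const std::vector<std::uint8_t>& snapshot,
                                            std::uint32_t pid)
{
    std::vector<std::uint32_t> result;
    std::size_t pos = 0;
    while (pos + sizeof(ProcRecord) <= snapshot.size()) {
        ProcRecord rec;
        std::memcpy(&rec, snapshot.data() + pos, sizeof(rec));
        if (rec.processId == pid) {
            const std::size_t bytes = rec.threadCount * sizeof(std::uint32_t);
            if (pos + sizeof(rec) + bytes > snapshot.size())
                break; // malformed buffer, stop without reading past the end
            result.resize(rec.threadCount);
            if (bytes != 0)
                std::memcpy(result.data(), snapshot.data() + pos + sizeof(rec), bytes);
            break;
        }
        if (rec.nextEntryOffset == 0)
            break; // last record reached
        pos += rec.nextEntryOffset;
    }
    return result;
}
```

The whole snapshot is obtained with one call, after which every step of the loop is plain arithmetic on the buffer. That is the property the measurements above rely on.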

Figure: Speed of setting and removing one hook

Here's the code that was measured:

#include <windows.h>
#include <atomic>
#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include <tlhelp32.h>
#include "mhook-lib/mhook.h"

using namespace std;
using namespace chrono_literals;

auto TrueSystemMetrics = GetSystemMetrics;

// This is the function that will replace GetSystemMetrics once the hook is in place
int WINAPI HookGetSystemMetrics(IN int index)
{
  MessageBoxW(nullptr, L"test", L"test", 0);
  return TrueSystemMetrics(index);
}

// Measures how long it takes to set and then remove one hook
void testPerformance()
{
  auto startTime = chrono::high_resolution_clock::now();

  Mhook_SetHook((PVOID*)&TrueSystemMetrics, HookGetSystemMetrics);
  Mhook_Unhook((PVOID*)&TrueSystemMetrics);

  auto timePassed = chrono::duration_cast<chrono::duration<double>>(chrono::high_resolution_clock::now() - startTime);

  cout << "Time passed: " << timePassed.count() << endl;
}

int main()
{
  // count threads in system
  DWORD initialThreadCount = 0;

  HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPTHREAD, 0);

  if (snap != INVALID_HANDLE_VALUE)
  {
      THREADENTRY32 te;
      te.dwSize = sizeof(te);

      if (Thread32First(snap, &te))
      {
          do
          {
              ++initialThreadCount;
          }
          while (Thread32Next(snap, &te));
      }

      CloseHandle(snap);
  }

  cout << "Initial threads count: " << initialThreadCount << endl;

  testPerformance();

  vector<thread> threadsToTest;

  const int kThreadsCount = 1000;
  const int kThreadsCountStep = 100;

  // atomic, because the flag is written by main and read by the worker threads
  atomic<bool> testFinished{false};

  for (int k = kThreadsCountStep; k <= kThreadsCount; k += kThreadsCountStep)
  {
      // add another batch of idle threads, then measure again
      for (int i = 0; i < kThreadsCountStep; ++i)
      {
          threadsToTest.push_back(thread([&]()
          {
              while (!testFinished)
              {
                  this_thread::sleep_for(10ms);
              }
          }));
      }
      cout << "Start Threads count increased by " << k << endl;
      testPerformance();
  }

  testFinished = true;

  for (int i = 0; i < kThreadsCount; ++i)
  {
      threadsToTest[i].join();
  }

  cout << "End" << endl;
  cin.get();

  return 0;
}

Increasing Performance: Case #2

We increased the performance of the mhook library by adding the Mhook_SetHookEx method.

Issue Description

Setting multiple hooks in a row is slow.

Causes

You have to suspend all threads of the current process to set a hook. If you set 100 hooks in a row, then you have to suspend threads 100 times and restart them 100 times, which is obviously inefficient.

Solution

The Mhook_SetHookEx method was added to set several hooks during a single thread suspension. It takes an array of HOOK_INFO structures containing the same information that used to be passed to Mhook_SetHook.

Here's the interface of Mhook_SetHookEx and its counterpart, Mhook_UnhookEx:

struct HOOK_INFO
{
  PVOID *ppSystemFunction;  // pointer to pointer to function to be hooked
  PVOID pHookFunction;      // hook function
};
  
// returns number of successfully set hooks
int Mhook_SetHookEx(HOOK_INFO* hooks, int hookCount);
int Mhook_UnhookEx(PVOID** hooks, int hookCount);

This modification provides a substantial performance increase, as compared to setting the same hooks consecutively.

For example, here's a comparison of the performance when setting three hooks with both methods:

Figure: Comparison of performance when setting three hooks

On average, setting three hooks with the Mhook_SetHookEx method is about 2.8 times faster than setting the same three hooks one by one with the traditional Mhook_SetHook method.

The code for this test is basically the same as for the previous one. All you need to do is set several hooks in the testPerformance function, once using the Mhook_SetHook method and once using the Mhook_SetHookEx method.

Getting Project Files for Different IDEs

We made it possible to generate project files for different IDEs instead of relying on a single .sln file.

Issue Description

Project files had to be created manually for every integrated development environment (IDE) other than Visual Studio. In addition, it was difficult to work with different versions of Visual Studio.

Causes

Mhook only has a .sln file for Visual Studio 2010. Furthermore, there's no project auto-generation system.

Solution

We added support for CMake, a popular cross-platform build automation tool. CMake allows us to easily generate project files for different IDEs without using Visual Studio.

Hooking a Function with a Conditional Jump in the First 5 Bytes

We needed to be able to hook functions that contain no suitable first five bytes for a hook.

Issue Description

Some functions don't have a suitable first five bytes for writing the jump to the hook. For example, when built with Visual Studio 2015 as an x64 release with the /MT switch, the free function doesn't have a suitable first five bytes:

00007FF680497214 48 85 C9           test      rcx,rcx 
00007FF680497217 74 37              je        _free_base+3Ch (07FF680497250h) 
00007FF680497219 53                 push      rbx 
00007FF68049721A 48 83 EC 20        sub       rsp,20h 
00007FF68049721E 4C 8B C1           mov       r8,rcx 
00007FF680497221 33 D2              xor       edx,edx 


Causes

This situation occurs when the first five bytes of a function's machine code contain a conditional or unconditional jump, or a call to another Windows API function. Mhook can't simply copy such code to its layer (the trampoline), because relative jump offsets depend on the address of the instruction and would point to the wrong place after the move. Mhook can handle unconditional jumps but not conditional ones.

Solution

We solved this issue using the free function, which has a je instruction near its start, as the test case. A conditional jump has to be transferred to the mhook layer, and then the instruction and its jump offset have to be changed so that the jump points to the same location as before the transfer.

The free function uses a short je jump, which stores a one-byte offset from the current position. The mhook layer can be located farther away than a one-byte offset can reach. That's why we replaced the jump instruction with je taking a rel32 argument (a 32-bit offset from the current position).

The new offset is calculated by subtracting the address of the end of the relocated jump instruction in the layer from the address where the jump used to lead.

This solution works for short je and short jne since their opcodes and the opcodes of the corresponding rel32 jumps are almost the same.
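
The conversion can be sketched as follows. On x86, a short conditional jump is encoded as one opcode byte in the range 70..7F followed by an 8-bit offset, and the matching rel32 form is 0F followed by the same opcode plus 0x10 and a 32-bit offset (so je 74 becomes 0F 84, and jne 75 becomes 0F 85). The function relocateShortJcc below is a hypothetical helper written for this article, not the code used in mhook, and it only builds the new bytes.

```cpp
#include <array>
#include <cstdint>
#include <limits>
#include <optional>

// Re-encodes a 2-byte short conditional jump located at `oldAddr` as a 6-byte
// rel32 conditional jump to be placed at `newAddr`, keeping the jump target.
// Returns std::nullopt if the opcode isn't a short conditional jump or if the
// target can't be reached with a 32-bit offset.
// NOTE: illustrative helper, not an mhook function.
std::optional<std::array<std::uint8_t, 6>> relocateShortJcc(
    std::uint8_t opcode, std::uint8_t rel8Byte, std::uint64_t oldAddr, std::uint64_t newAddr)
{
    if (opcode < 0x70 || opcode > 0x7F)
        return std::nullopt;

    // Original target: end of the 2-byte instruction plus the signed 8-bit offset.
    const std::int64_t rel8 = static_cast<std::int8_t>(rel8Byte);
    const std::uint64_t target = oldAddr + 2 + static_cast<std::uint64_t>(rel8);

    // New offset: counted from the end of the 6-byte instruction at newAddr.
    const std::int64_t rel = static_cast<std::int64_t>(target - (newAddr + 6));
    if (rel < std::numeric_limits<std::int32_t>::min() ||
        rel > std::numeric_limits<std::int32_t>::max())
        return std::nullopt;

    const std::uint32_t bits = static_cast<std::uint32_t>(static_cast<std::int32_t>(rel));
    return std::array<std::uint8_t, 6>{
        0x0F, static_cast<std::uint8_t>(opcode + 0x10), // two-byte opcode of the rel32 form
        static_cast<std::uint8_t>(bits & 0xFF),         // offset in little-endian order
        static_cast<std::uint8_t>((bits >> 8) & 0xFF),
        static_cast<std::uint8_t>((bits >> 16) & 0xFF),
        static_cast<std::uint8_t>((bits >> 24) & 0xFF)};
}
```

Taking the je from the listing above (bytes 74 37 at 00007FF680497217, target 07FF680497250h) and assuming, purely for illustration, that the layer is at 00007FF680400000, the result is 0F 84 4A 72 09 00: the new offset is 7FF680497250h - 7FF680400006h = 9724Ah.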

Bug: Infinite Recursion

We eliminated an issue with infinite recursion.

Issue Description

When trying to set hooks for certain system functions, various issues occurred, such as call stack overflow.

Causes

Mhook called system functions after writing the jump to the hook into the system function but before modifying the layer that leads back to the system function.

Solution

We moved the code that writes the layer earlier, so that no system functions are called between setting the jump and modifying the layer.

Bug: Deadlock

We eliminated deadlocks.

Issue Description

After migrating to NtQuerySystemInformation, deadlocks appeared in mhook.

Causes

When migrating to NtQuerySystemInformation, we started allocating a dynamic buffer on the heap to store thread information. CreateToolhelp32Snapshot handles this internally and returns only a HANDLE.

Here's how the whole process works:

  1. A buffer is allocated to get information about the threads.
  2. All threads are suspended.
  3. The hook is set.
  4. The buffer with information about the threads is freed.
  5. The threads are resumed.

This sequence contains a hard-to-detect bug. If any thread holds the heap lock used by free at the moment it's suspended, our attempt to free the buffer results in a deadlock because the thread that holds the lock isn't running.

To reproduce this bug, you can create several threads that allocate and free memory in the heap while another separate thread sets and removes hooks:  

#include <windows.h>
#include <atomic>
#include <vector>
#include <thread>
#include <chrono>
#include <iostream>
#include "mhook-lib/mhook.h"

using namespace std;
using namespace chrono_literals;

auto TrueSystemMetrics = GetSystemMetrics;

// This is the function that will replace GetSystemMetrics once the hook is in place
int WINAPI HookGetSystemMetrics(IN int index)
{
  MessageBoxW(nullptr, L"test", L"test", 0);
  return TrueSystemMetrics(index);
}

int main()
{
  vector<thread> threadsToTest;

  const int kThreadsCount = 100;

  // atomic, because the flag is written by main and read by the worker threads
  atomic<bool> testFinished{false};

  // worker threads keep taking the heap lock by allocating and freeing memory
  for (int i = 0; i < kThreadsCount; ++i)
  {
      threadsToTest.push_back(thread([&]()
      {
          while (!testFinished)
          {
              free(malloc(100));
              this_thread::sleep_for(10ms);
          }
      }));
  }

  // meanwhile, the main thread sets and removes the hook over and over
  const int kTriesCount = 1000;
  for (int i = 0; i < kTriesCount; ++i)
  {
      Mhook_SetHook((PVOID*)&TrueSystemMetrics, HookGetSystemMetrics);
      Mhook_Unhook((PVOID*)&TrueSystemMetrics);

      this_thread::sleep_for(10ms);
      cout << "No deadlocks, go stage " << i + 1 << endl;
  }

  testFinished = true;

  for (int i = 0; i < kThreadsCount; ++i)
  {
      threadsToTest[i].join();
  }

  cout << "Test passed" << endl;

  return 0;
}

Solution

We found several solutions to this problem. First, we simply moved freeing the buffer until after all threads had been resumed. But then we decided to use VirtualAlloc/VirtualFree instead of malloc/free. Since memory is allocated only a few times when installing a hook (and outside the loop), it doesn't lead to any measurable performance loss.

Bug: Hook Leads to the Wrong Function

We eliminated a bug whereby different hooks led to the same handler.

Issue Description

When you install hooks for functions from different modules and the distance in memory between these modules exceeds 2GB, the addresses of hook handlers are recorded incorrectly. For example, let's set a hook for a function from module 1, then for a function from module 2 (located in memory at a distance of more than 2GB from the first hook). Let's then install two more hooks for functions from the first module. As a result, the last two hooks will lead to the same handler, which they shouldn't.

Causes

In the BlockAlloc function, when a new memory block is added for a module, the allocated memory is added to a circular list. The original code doesn't set the pointer to the previous element for the list head; it remains zero.

After adding a new memory block, the following happens:

  1. When searching for a free memory block to set the hook, since the pointer to the previous element is zero, the pointer to the previous element isn’t overwritten by the current element.
  2. The code adds the current item to the list of memory blocks used for hooks. However, this element remains in the list of free memory blocks. The next pointer still points to this element from the previous element.
  3. Every time you try to set a hook in the current module, the first memory block you'll find is the one from point 1, although it's already assigned to a hook.

Thus, all other hooks in this module will lead to the handler of the last hook set in this module.

Solution

The pointer to the previous element of the list head now points to the last element of the list, as it should in a circular list.
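
The invariant behind this fix can be shown with a generic circular doubly linked list. The Block type and the pushFront, unlink, and count helpers below are hypothetical names used only for this sketch; they don't reproduce mhook's internal structures. The point is that when the head's prev pointer refers to the last element, unlinking any node, including the head, updates both neighbors, so a block taken for a hook can't stay in the free list.

```cpp
#include <cstddef>

// A node of a circular doubly linked list (illustrative type, not from mhook).
struct Block {
    int id;
    Block* prev = nullptr;
    Block* next = nullptr;
    explicit Block(int i) : id(i) {}
};

// Inserts `node` at the front of the list and returns the new head.
// The head's prev always points to the last element.
Block* pushFront(Block* head, Block* node)
{
    if (head == nullptr) {
        node->prev = node->next = node; // a single node points to itself
        return node;
    }
    Block* last = head->prev;
    node->next = head;
    node->prev = last;
    last->next = node;
    head->prev = node;
    return node;
}

// Removes `node` from the list and returns the new head (nullptr if empty).
// Because prev is never null, no special case is needed for the head.
Block* unlink(Block* head, Block* node)
{
    if (node->next == node) { // the only element
        node->prev = node->next = nullptr;
        return nullptr;
    }
    node->prev->next = node->next;
    node->next->prev = node->prev;
    Block* newHead = (head == node) ? node->next : head;
    node->prev = node->next = nullptr;
    return newHead;
}

// Counts the elements by walking the ring once.
std::size_t count(const Block* head)
{
    if (head == nullptr)
        return 0;
    std::size_t n = 1;
    for (const Block* b = head->next; b != head; b = b->next)
        ++n;
    return n;
}
```

If the head's prev were left as zero instead, the unlink step would have no previous element to update, and the last element's next pointer would keep referring to the removed block, which matches the symptom described above.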

Conclusion

These improvements allowed us to increase mhook's performance tenfold and the speed of the hook setting process nearly threefold. In addition, we can now easily get project files for different IDEs without using Visual Studio and hook functions that don't contain a suitable first five bytes for writing a jump to the hook. Furthermore, we eliminated bugs that led to deadlock, infinite recursion, and hooks leading to the wrong function.

You can download the improved version of mhook here: https://github.com/apriorit/mhook

License

This article, along with any associated source code and files, is licensed under The MIT License


About the Authors

Apriorit Inc
Chief Technology Officer Apriorit Inc.
United States
ApriorIT is a software research and development company specializing in cybersecurity and data management technology engineering. We work for a broad range of clients from Fortune 500 technology leaders to small innovative startups building unique solutions.

As Apriorit offers integrated research and development services for software projects in such areas as endpoint security, network security, data security, embedded systems, and virtualization, we have strong kernel and driver development skills, extensive system programming expertise, and are real fans of research projects.

Our specialty is reverse engineering, which we apply to security testing and security-related projects.

A separate department of Apriorit works on large-scale business SaaS solutions, handling tasks from business analysis, data architecture design, and web development to performance optimization and DevOps.

Official site: https://www.apriorit.com
Clutch profile: https://clutch.co/profile/apriorit


Artur Bulakaiev
Software Developer (Senior) Apriorit
Ukraine

Last Updated 5 Oct 2018
Article Copyright 2018 by Apriorit Inc, Artur Bulakaiev