Searching for opcode bytes in PE file

Question

Hi,

My task is to search a PE file for opcode bytes and check whether a specified opcode byte sequence (constant and predefined) is present in it. I have come across numerous examples online, but the solutions are mostly in C# or Python, whereas I need to do this in C.

Please tell me how I can check and compare opcode byte values in a PE file with a simple C program. Any help will be greatly appreciated.

Thanks.

Posted 20-Apr-13 9:27am

muneeb131

Comments

Richard MacCutchan 20-Apr-13 15:29pm

It would not be a 'simple' program. You need to read up the documentation on the PE format and write the code to access the code sections.

muneeb131 20-Apr-13 23:35pm

I already have code that displays information about all sections. What I am unclear about is how to search for the byte sequence in the sections.

Richard MacCutchan 21-Apr-13 5:15am

You just need to scan the code section(s) byte by byte looking for the sequences that you are interested in. You can either do it iteratively like that or maybe use the memcmp function to locate the specific strings.
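As a rough illustration of the byte-by-byte scan with memcmp described in this comment (a minimal sketch, not code from the thread): it assumes the section bytes have already been read into a buffer, and the helper name find_bytes is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first occurrence of `needle` inside `haystack`,
   or (size_t)-1 if the sequence is not present.
   find_bytes is a hypothetical helper name used only for illustration. */
static size_t find_bytes(const unsigned char *haystack, size_t haystack_len,
                         const unsigned char *needle, size_t needle_len)
{
    assert(haystack != NULL && needle != NULL);
    if (needle_len == 0 || needle_len > haystack_len)
        return (size_t)-1;
    /* Try every start position where the whole pattern still fits. */
    for (size_t i = 0; i + needle_len <= haystack_len; i++) {
        if (memcmp(haystack + i, needle, needle_len) == 0)
            return i;
    }
    return (size_t)-1;
}
```

You would call it once per executable section, passing a pointer to that section's raw bytes and the number of bytes actually read from the file. The plain O(n*m) scan is usually fine for short opcode patterns.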

pasztorpisti 20-Apr-13 19:03pm

As the topic is quite large I can not start teaching you, but I can give a somewhat better starting point. Here is one of my tips: http://www.codeproject.com/Tips/133747/Checking-for-exported-symbols-functions-in-a-DLL-w
This tip contains some code that loads the PE headers and does useful things with it - this tip contains a lot of code you will definitely need.

muneeb131 20-Apr-13 23:30pm

Thanks, the article you referenced is quite helpful. The main confusion I'm facing is how to scan the PE file (relevant executable sections) for the particular opcode bytes.
P.S. I already have a PE parser code available that displays information about all sections.

Sergey Alexandrovich Kryukov 20-Apr-13 21:36pm

I wonder why? First of all, it can be useless for detecting some characteristic code, a virus or something similar, because nearly identical code can be interlaced with different addresses and immediate constants. You really need a disassembler.
—SA

muneeb131 20-Apr-13 23:49pm

I have used a disassembler to find the opcode bytes. But I want to do the same in a C program that parses sections of the PE file.

Sergey Alexandrovich Kryukov 21-Apr-13 0:55am

Of course, I meant to say that you need the source code of the disassembler...
—SA

1 solution

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

pasztorpisti · Accepted Answer · 2013-04-21T00:47:00

In order to proceed you have to understand that there are 3 different offset types used in conjunction with PE files:
1. "physical" raw offset inside the exe file
2. Relative Virtual Address (RVA)
3. Virtual Address (VA)

I think #1 is obvious: it's really just an offset inside the file. "Virtual" addresses are used to address things in memory after the PE has been loaded. It's very important to mention the "Base Address" here: this is an address (a pointer) inside the memory space of your process, and it points to the first byte of the memory area where your PE file was loaded. Note that this Base Address can be different on each program startup because DLL files are usually (but not necessarily!) relocatable. When you call LoadLibrary() to load a DLL and the load succeeds, the return value is an HMODULE/HINSTANCE handle that holds the value of the Base Address. If you examine the bytes in memory at this location you will indeed find the bytes of your DOS/PE header there! The relation between a VA and an RVA that point to the same bytes of your PE file is: VA == (BaseAddress + RVA).

If you are just examining PE files without loading them, then VAs are not your concern. The real VA is known only after the PE has been loaded at a particular address, so PE files contain almost no VAs (the few that do appear, such as the preferred ImageBase and addresses that get fixed up by relocations, assume the preferred base address). The PE file contains mainly file offsets and RVAs, and mostly RVAs, because file offsets are useful only before loading the PE and all other offsets are used after loading (for example to relocate a DLL if it couldn't be loaded at the preferred base address).

Why do file offsets and RVAs differ? The sections inside the PE file are aligned both in the file and in memory after loading, but these alignment values usually differ. In the PE file the beginnings of sections are usually aligned to 512-byte (disk sector size) or 4K boundaries to allow efficient loading, and in memory the same sections are placed at start addresses that are usually aligned to 4K boundaries (memory page size). Often 4K is used for both file and memory alignment to make loading even easier. Let's say a linker decided to use 512-byte alignment in the PE file and 4K in memory.
In this case when you load the sections from the file it can happen that the gap between your sections is zero inside the file, but after loading this gap grows because the loader must satisfy the 4K memory alignment.

Example: your compiler compiles a hello world program that contains a few bytes of code, let's say 100 bytes, and a few bytes of data: "Hello World!". After this the linker combines the output of your compiler into an executable. Note that this can be done in many different ways, but I describe the usual, standard one. The linker will probably put the data and the code into different sections. Please note that the header of your PE file is also a "section" that is always placed at file offset zero and RVA zero, but it doesn't have its own section header. Still, I always treat it as the "minus first" section :-). This is very important because it implies that your first "real" section that contains code/data cannot reside at offset zero in the file or in memory!

Let's say your linker does a great job and assembles a small header for your PE (~384 bytes). In this case the next section must be placed at file offset 512 and RVA 4096 because of the 512-byte file alignment and the 4K memory alignment. This means that there will be some gap between the header of your PE and the first section both in the file and in memory after loading, but the gap size is different before and after loading! This is why it's important to use both file offsets and Relative Virtual Addresses (RVAs). Of course the linker could decide to put your first section at higher offsets, but it usually doesn't do that. The RVA of your first section could be any multiple of the memory alignment (k*4096 where k is an integer and k>0). I mention this because with a hex editor I could forge very strange PE files that your program couldn't analyze if it handles only the output of some popular linkers.
So in case of our hello world program one possible output can be the following:
Header (file_offset=0, RVA=0)
Code (file_offset=512, RVA=4096)
Data (file_offset=1024, RVA=8192)
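The numbers in this layout follow from rounding up to the alignment. Here is a minimal sketch of that arithmetic, assuming (as in the example) 512-byte file alignment, 4K memory alignment, a ~384-byte header and 100 bytes of code. The helper name align_up is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounds `value` up to the next multiple of `alignment`.
   `alignment` must be a power of two (512 and 4096 both are).
   align_up is a hypothetical helper name used only for illustration. */
static uint32_t align_up(uint32_t value, uint32_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (value + alignment - 1) & ~(alignment - 1);
}
```

With these assumptions, align_up(384, 512) gives the file offset of the code section (512) and align_up(384, 4096) gives its RVA (4096). The data section then starts at align_up(512 + 100, 512) = 1024 in the file and at align_up(4096 + 100, 4096) = 8192 in memory.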
Why on earth do we need to split the whole thing into sections? Because the memory protection flags of each section can be different. The compiler and linker can decide to put exec and readonly flags on your code section and noexec/readonly, or maybe noexec/readwrite, flags on your data sections. Memory protection can be set only per 4K page on x86 platforms.

All sections (except the header section) have an IMAGE_SECTION_HEADER entry in the PE header. This IMAGE_SECTION_HEADER contains info about a section: name, file offset, RVA, and Characteristics (flags that define memory protection; note that many Characteristics flags map to the same memory protection flags). Note that the name of the section is perfectly useless for our purposes: section names are used mainly by the linker and ignored by the Windows PE loader. For example MS linkers use ".text" as the name of the code section, but what if I forge a PE and give ".text" as the name of my data section? (By the way: PE anti-debug programs zero out the section names...)

Note that the header section doesn't have a corresponding IMAGE_SECTION_HEADER, so its flags are set by Windows. On many Windows versions the header section is also executable (that is a security hole). For example, the CIH virus infected files by putting itself into the gap between the header section and the first section in the file. This was possible because most linkers used 4K as the file alignment and the header is usually much smaller than 4K, so there is a fairly large gap in the file after the header. That gap is loaded along with the header and gets the same memory protection as the header if you adjust the SizeOfHeaders field in the header!
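To pick the sections worth scanning you can test the Characteristics flags instead of the name. A minimal sketch follows. The constants mirror the values of IMAGE_SCN_CNT_CODE, IMAGE_SCN_MEM_EXECUTE, IMAGE_SCN_MEM_READ and IMAGE_SCN_MEM_WRITE from winnt.h, but they are redefined under local names so the snippet compiles without the Windows headers. The function name is hypothetical.

```c
#include <stdint.h>

/* Local copies of the winnt.h section flag values (assumed to match
   IMAGE_SCN_CNT_CODE, IMAGE_SCN_MEM_EXECUTE, IMAGE_SCN_MEM_READ and
   IMAGE_SCN_MEM_WRITE). */
#define SCN_CNT_CODE    0x00000020u
#define SCN_MEM_EXECUTE 0x20000000u
#define SCN_MEM_READ    0x40000000u
#define SCN_MEM_WRITE   0x80000000u

/* Returns nonzero if the section is marked as code or as executable.
   The section name is deliberately ignored because it can be forged.
   section_may_contain_code is a hypothetical name for illustration. */
static int section_may_contain_code(uint32_t characteristics)
{
    return (characteristics & (SCN_CNT_CODE | SCN_MEM_EXECUTE)) != 0;
}
```

A typical linker-produced code section has Characteristics 0x60000020 (code, execute, read), so it passes this test. A read/write data section with 0xC0000040 does not. Keep in mind that this only tells you what the flags claim, which is exactly the "trust the linker" case.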

As a final step let's make things a bit clearer, so here is my brain dump on how I perceive these things in general. The PE file contains sections with small gaps between them (including the PE headers as a special very first section). After loading the PE these gaps MAY grow because of rounding to memory page boundaries. Basically the only file offsets you will use are probably the IMAGE_SECTION_HEADER.PointerToRawData fields, which are offsets in the file to the beginning of sections. After loading, file offsets are useless, so most of the other header fields that point to something are expressed as RVAs. A good example is IMAGE_OPTIONAL_HEADER.AddressOfEntryPoint, which specifies where the execution of your program begins. You can use the section headers to convert between file offsets and RVAs like I did in my example program.

Note that any section that has the right memory protection flags (executable) can contain code (including the headers on some Windows versions)! And of course you can change the protection flags of 4K memory pages programmatically, or you can copy some parts of a data section to an executable memory area, or whatever... Of course if we speak of executables put together by nice linkers, then executable code, readonly data and read/write data are well separated, but if you want to be prepared for anything then you must take the extreme cases into account as well, for example:
1. The entry point pointing into the header section! (like in the case of the CIH virus!)
2. Large gaps between the sections after loading (my "manually added" section can reside at file offset 1024 while its RVA is 0x10000 or larger, leaving a big gap in memory between this and the previous section).
Unfortunately you cannot tell exactly which bytes are executable and which ones aren't. If you trust the linker and the memory flags then you can tell...
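The conversion between RVAs and file offsets through the section headers can be sketched roughly like this. It is only an illustration, not the code from my tip: the SectionInfo struct is a hypothetical stand-in that holds the four relevant IMAGE_SECTION_HEADER fields (VirtualAddress, VirtualSize, PointerToRawData, SizeOfRawData), and RVAs that fall into the header "section" are not handled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the IMAGE_SECTION_HEADER fields we need. */
typedef struct {
    uint32_t virtual_address;      /* RVA of the section start */
    uint32_t virtual_size;         /* size in memory after loading */
    uint32_t pointer_to_raw_data;  /* file offset of the section start */
    uint32_t size_of_raw_data;     /* number of bytes stored in the file */
} SectionInfo;

/* Converts an RVA to a file offset. Returns 1 and stores the result in
   *file_offset on success, or 0 if no section has file bytes for that RVA. */
static int rva_to_file_offset(const SectionInfo *sections, size_t count,
                              uint32_t rva, uint32_t *file_offset)
{
    for (size_t i = 0; i < count; i++) {
        const SectionInfo *s = &sections[i];
        if (rva < s->virtual_address)
            continue;
        uint32_t delta = rva - s->virtual_address;
        /* Only bytes that are really stored in the file have an offset. */
        if (delta < s->size_of_raw_data) {
            *file_offset = s->pointer_to_raw_data + delta;
            return 1;
        }
    }
    return 0;
}
```

With the hello world layout from above (code at file offset 512 / RVA 4096, data at file offset 1024 / RVA 8192), an entry point RVA of 4096 maps to file offset 512. An RVA that lands in a gap between sections maps to nothing, which is the reason for the separate success flag.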
Good exe examiners (like IDA Pro) start out from IMAGE_OPTIONAL_HEADER.AddressOfEntryPoint, examine all executable bytes and all possible flows of control, and disassemble every possibly reachable byte on the fly. You won't find all executable bytes even in this case. Let's say I compute an address into the eax register and then I say "jmp eax". Even if the address is always the same because I just add two constant numbers, you need a very good static analyzer built into your program to find out the jump target. I can put a small twist on it, for example by reading an int value from a config file that can be either 0 or 1 and factoring it into my jump target address. In this case you have no chance of finding out the target. In IDA Pro you can manually add hints and say "Hey IDA, examine these bytes here as code. Examine these bytes as an array of 5 XYZ structs."

It just came to my mind that the size of a section in memory can be larger than its size in the file. For example some linkers put your initialized read/write data into a section (let's say you have 4096 bytes of initialized global data) and specify 8192 as IMAGE_SECTION_HEADER.VirtualSize. That means that the file contains just the 4K of initialized data, but after loading 8K of memory is allocated. This is a nice trick for compilers to allocate space for your uninitialized global variables without wasting space in your PE file!
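For a scanner that works on the file this means you should not read VirtualSize bytes blindly. A minimal sketch of how the number of bytes to scan per section could be chosen, assuming you only care about bytes that really exist in the file: the helper name is hypothetical, and treating a VirtualSize of 0 as "use SizeOfRawData" is my assumption for files written by unusual linkers.

```c
#include <stdint.h>

/* Returns how many bytes of a section are both stored in the file and
   mapped into memory: the smaller of VirtualSize and SizeOfRawData.
   Assumption: a VirtualSize of 0 means the raw size should be used.
   scannable_bytes is a hypothetical helper name. */
static uint32_t scannable_bytes(uint32_t virtual_size, uint32_t size_of_raw_data)
{
    if (virtual_size == 0)
        return size_of_raw_data;
    return virtual_size < size_of_raw_data ? virtual_size : size_of_raw_data;
}
```

For the data section above (VirtualSize 8192, SizeOfRawData 4096) this yields 4096, and for a 100-byte code section padded to 512 bytes in the file it yields 100, so the alignment padding is skipped.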

I've never seen these things described together in one place, but I truly believe that they are the foundations of PE file hacking, and that most people starting out in the topic progress slowly because they are not clear on them. I would also mention that you've started writing a program that is probably far beyond your current knowledge, so you either have to read and UNDERSTAND hundreds of pages on the related tricks you need, or give up on the idea of writing the next IDA Pro. Another important thing is that your question is not specific and involves a lot of difficult topics, which is why no one will really answer it. My long comments are just the intro to the topic (the very beginning), so they can help you decide whether to invest the significant amount of time or not. And definitely check out IDA Pro if you don't know it. That tool has a free version (5.x if I remember right).