WmlScript Disassembler

liangml

Nov 17, 2005

A useful tool for WmlScript disassembling.

DWmlsc UI Screen-shot

Introduction

About WMLScript

Today, more and more cell phones (and other mobile clients) support WAP browsing. To extend what the browser can do on the client side, many WAP browsers now support WMLScript. The WMLScript language is based on ECMAScript [ECMA262], but it has been modified to better support low-bandwidth communication and thin clients. WMLScript can be used not only with WML but also as a standalone tool.

One of the main differences between ECMAScript and WMLScript is that WMLScript has a defined bytecode and an interpreter reference architecture. This can give better performance in narrowband, small-memory environments. To make the language smaller and easier to compile into bytecode, many advanced features of ECMAScript have been dropped. For example, WMLScript is a procedural language, and it supports locally installed standard libraries.

What DWmlsc can do

Since WMLScript is compiled into bytecode (usually stored with the .wmlsc file extension), we sometimes need to decompile the bytecode to see what a script does. I wrote a tool named "DWmlsc" to do this job.

Background

Resources about WML and WMLScript can be found at:

www.wapForum.org

Implementation

DWmlsc is an MFC SDI program. When a user opens a .wmlsc file, this function is called:

void CDWmlscDoc::Serialize(CArchive& ar)
{
    if(ar.IsStoring() == FALSE)
    {
        // Loading: read the whole .wmlsc file into memory
        m_codeLen = ar.GetFile()->GetLength();
        m_binCodeBuf = new BYTE[m_codeLen];
        ar.Read(m_binCodeBuf,m_codeLen);

        m_result_code.RemoveAll();    // discard any previous result

        if(DeCompile(m_binCodeBuf,m_codeLen,m_result_code) == false)
        {
            MessageBox(NULL,"Sorry, decompile failed!","Error",MB_OK|MB_ICONSTOP);
            return;
        }

        SetModifiedFlag(TRUE);
    }
    //...
}

In the function Serialize, I read the whole file into a buffer, and then call the core function.

bool DeCompile(BYTE *bin_code,int len,CList<CString,CString&> &result);

The output parameter result will be used to store the de-compilation result.

The WMLScript bytecode consists of the following sections: HeadInfo, ConstantPool, PragmaPool and FunctionPool (see the WMLScript specification for details).

The function DeCompile reads and parses the file into these parts:

The information read from ConstantPool is stored in a list g_ConstTable.
The information of PragmaPool is almost ignored.
The information of FunctionPool is stored in a list g_FuncTable.
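
As an illustration of the first parsing step, here is a minimal sketch of reading the header section. It assumes, based on my reading of the WMLScript specification, that the header is one version byte (major version minus 1 in the upper four bits, minor version in the lower four) followed by the code size as a multi-byte integer. The HeaderInfo struct and parse_header function are hypothetical names for this sketch and are not taken from DWmlsc.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical result type for this sketch (not part of DWmlsc).
struct HeaderInfo {
    int ver_major;           // major version (stored as major - 1; assumed layout)
    int ver_minor;           // minor version (low four bits; assumed layout)
    uint32_t code_size;      // size of the rest of the bytecode, in bytes
    std::size_t header_len;  // number of bytes the header occupies
};

// Returns false if the buffer ends before the header is complete.
bool parse_header(const uint8_t* data, std::size_t len, HeaderInfo& out) {
    assert(data != nullptr);
    if (len < 2) return false;
    out.ver_major = (data[0] >> 4) + 1;
    out.ver_minor = data[0] & 0x0F;

    // Decode the multi-byte integer that follows the version byte.
    uint32_t size = 0;
    std::size_t pos = 1;
    for (;;) {
        if (pos >= len) return false;          // truncated integer
        const uint8_t b = data[pos++];
        size = (size << 7) | (b & 0x7Fu);      // append seven payload bits
        if ((b & 0x80) == 0) break;            // continuation flag clear
    }
    out.code_size = size;
    out.header_len = pos;
    return true;
}
```

Under these assumptions, the three bytes 0x01 0x81 0x20 would describe version 1.1 with a code size of 0xA0 bytes.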

Now we can start to decompile the bytecode of the individual functions. Each entry in g_FuncTable is described by the following structure:

struct Function
{
    BYTE findex;
    CString func_name;
    BYTE arg_num;    //arguments num
    BYTE lvar_num;    //local variable num
    unsigned int func_size;
    BYTE *CodeArray;
};

For each function, the following loop walks through its bytecode and calls TransCode for every instruction:

    //...
    i = 0;
    while(i < func.func_size)
    {
        int n = TransCode(func.CodeArray + i, i,func.arg_num);
        if(n < 0)
        {
            return false;
        }

        CString code;
        //...
    }
    //...

TransCode translates one bytecode instruction into its textual form and has the following prototype:

int TransCode(BYTE *data,int addr,int arg_num)
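
The traversal above can be sketched as a standalone program fragment. This is only an illustration of the pattern (decode one instruction, advance by its length, stop on error); decode_one is a hypothetical stand-in for TransCode, and its toy encoding rule is an assumption made for the sketch, not the real WMLScript instruction set.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical stand-in for TransCode: decodes the instruction at data[0]
// and returns its length in bytes, or -1 if it is malformed.
// Toy rule for this sketch only: a byte with the high bit set is a one-byte
// instruction; any other byte is an opcode followed by one operand byte.
int decode_one(const uint8_t* data, int remaining, int addr, std::string& text) {
    assert(remaining > 0);
    char buf[48];
    if (data[0] & 0x80) {
        std::snprintf(buf, sizeof buf, "%04X: OP_%02X",
                      static_cast<unsigned>(addr), static_cast<unsigned>(data[0]));
        text = buf;
        return 1;
    }
    if (remaining < 2) return -1;  // the operand byte is missing
    std::snprintf(buf, sizeof buf, "%04X: OP_%02X %u",
                  static_cast<unsigned>(addr), static_cast<unsigned>(data[0]),
                  static_cast<unsigned>(data[1]));
    text = buf;
    return 2;
}

// Walks the whole code array and collects one text line per instruction.
bool disassemble(const std::vector<uint8_t>& code, std::vector<std::string>& lines) {
    const int size = static_cast<int>(code.size());
    int i = 0;
    while (i < size) {
        std::string text;
        const int n = decode_one(code.data() + i, size - i, i, text);
        if (n < 0) return false;   // give up on the first malformed instruction
        lines.push_back(text);
        i += n;                    // advance by the instruction's length
    }
    return true;
}
```

Passing the remaining byte count to the decoder lets it reject an instruction whose operands would run past the end of the function body.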

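To make the bytecode smaller, WMLScript uses the "inline parameters" technique: for the most frequently used instructions, the parameter is packed into the opcode byte itself. In the signatures below, X marks an opcode bit and P marks a parameter bit.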

Signature	Available Instructions	Used for
1XXPPPPP	4	`JUMP_FW_S`, `JUMP_BW_S`, `TJUMP_FW_S`, `LOAD_VAR_S`
010XPPPP	2	`STORE_VAR_S`, `LOAD_CONST_S`
011XXPPP	4	`CALL_S`, `CALL_LIB_S`, `INCR_VAR_S`
00XXXXXX	63	The rest of the instructions

TransCode parses these "Inline parameter" instructions with an "if/else..." statement.
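
A minimal sketch of such an if/else chain is shown below. It only separates the opcode bits from the parameter bits according to the four signatures in the table; it does not map them to instruction names. InlineSplit and split_inline are hypothetical names for this sketch and do not come from the DWmlsc source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type for this sketch (not part of DWmlsc).
struct InlineSplit {
    int param_bits;   // number of P bits; 0 means no inline parameter
    uint8_t opcode;   // the byte with its parameter bits cleared
    uint8_t param;    // the inline parameter value
};

InlineSplit split_inline(uint8_t b) {
    int bits = 0;                                 // 00XXXXXX: no inline parameter
    if (b & 0x80)                 bits = 5;       // 1XXPPPPP
    else if ((b & 0xE0) == 0x40)  bits = 4;       // 010XPPPP
    else if ((b & 0xE0) == 0x60)  bits = 3;       // 011XXPPP

    const uint8_t mask = static_cast<uint8_t>((1u << bits) - 1);
    InlineSplit s;
    s.param_bits = bits;
    s.param  = static_cast<uint8_t>(b & mask);
    s.opcode = static_cast<uint8_t>(b & ~mask);
    assert((s.opcode | s.param) == b);            // both parts rebuild the byte
    return s;
}
```

For example, the byte 0xE3 matches the 1XXPPPPP signature, so it splits into the opcode bits 0xE0 and the parameter 3.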

The other 63 instructions will be parsed by indexing the array: Instruction InArray[].

    //...
    const int ins_count = sizeof(InArray)/sizeof(InArray[0]);
    if(op_code >= ins_count)
    {
        return -1;
    }

    Instruction *ip = InArray + op_code;
    if(ip->parser == NULL)
    {
        sprintf(tmp,"%s",ip->ins_name);
    }
    else
    {
        int n = ip->parser(data,addr,arg_num);
        i = i + n - 1;
    }
    //...

What, then, is "InArray"? It is a table with one entry per opcode:

Instruction InArray[] =
{
    {"",NULL},    //0
    {"JUMP_FW", JUMP_FW},    //1
    {"JUMP_FW_W", JUMP_FW_W},    //2
    {"JUMP_BW", JUMP_BW},    //3
    {"JUMP_BW_W", JUMP_BW_W},    //4
    {"TJUMP_FW", TJUMP_FW},    //5
    {"TJUMP_FW_W", TJUMP_FW_W},    //6
    {"TJUMP_BW", TJUMP_BW},    //7
    {"TJUMP_BW_W",TJUMP_BW_W},    //8
    {"CALL", CALL},    //9
    {"CALL_LIB", CALL_LIB},    //10
    //...
};

JUMP_FW, JUMP_FW_W, etc. are all parser functions whose addresses match the pointer type parser_t (see decompiler.h):

typedef int (* parser_t)(BYTE *data,int addr,int arg_num);

The program looks up the parsing function for an instruction by indexing "InArray". Simple, and very fast.
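
The table lookup can be illustrated with a small self-contained sketch. The table below is a toy with three entries, the parser signature differs from the real parser_t, and the assumption that JUMP_FW is followed by a single offset byte is made only for this example; NOP_DEMO is an invented placeholder for an instruction without operands.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

typedef uint8_t BYTE;

// Hypothetical parser signature for this sketch: writes the text form to
// `out` and returns the instruction length in bytes, or -1 on error.
typedef int (*demo_parser_t)(const BYTE* data, int remaining, std::string& out);

struct DemoInstruction {
    const char* ins_name;
    demo_parser_t parser;   // NULL when the instruction has no operands
};

// Assumed for illustration: the opcode is followed by one offset byte.
static int JUMP_FW(const BYTE* data, int remaining, std::string& out) {
    if (remaining < 2) return -1;
    char buf[32];
    std::snprintf(buf, sizeof buf, "JUMP_FW %u", static_cast<unsigned>(data[1]));
    out = buf;
    return 2;
}

static const DemoInstruction kDemoTable[] = {
    {"", NULL},             // 0: unused slot, as in InArray
    {"JUMP_FW", JUMP_FW},   // 1
    {"NOP_DEMO", NULL},     // 2: invented operand-less instruction
};

int decode(const BYTE* data, int remaining, std::string& out) {
    assert(data != NULL && remaining > 0);
    const int count = sizeof(kDemoTable) / sizeof(kDemoTable[0]);
    const BYTE op = data[0];
    if (op >= count) return -1;                  // opcode outside the table
    const DemoInstruction& ins = kDemoTable[op];
    if (ins.parser == NULL) {                    // no operands: name only
        out = ins.ins_name;
        return 1;
    }
    return ins.parser(data, remaining, out);     // delegate to the parser
}
```

The lookup costs one bounds check and one array index regardless of how many instructions the table holds, which is why this layout scales better than a long if/else chain.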

Points of Interest

Multi-byte Integer Format

In many places, the bytecode uses the "Multi-byte Integer Format" to represent an integer.

A multi-byte integer consists of a series of octets, where the most significant bit is the continuation flag and the remaining seven bits are a scalar value. The continuation flag is used to indicate that an octet is not the end of the multibyte sequence. A single integer value is encoded into a sequence of N octets. The first N-1 octets have the continuation flag set to a value of one (1). The final octet in the series has a continuation flag value of zero.

The remaining seven bits in each octet are encoded in big-endian order, i.e., the most significant bit first. The octets themselves are also arranged in big-endian order, i.e., the most significant seven bits are transmitted first. If the initial octet carries fewer than seven bits of value, all unused bits must be set to zero (0).

For example, the integer value 0xA0 would be encoded with the two-byte sequence 0x81 0x20. The integer value 0x60 would be encoded with the one-byte sequence 0x60.
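
To make the format concrete, the encoding direction can be sketched as follows. encode_mb_uint is a hypothetical helper written for this article's examples; DWmlsc itself only needs the decoding direction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encodes a 32-bit value in the multi-byte integer format.
std::vector<uint8_t> encode_mb_uint(uint32_t value) {
    std::vector<uint8_t> out;
    // Collect the 7-bit groups, least significant group first.
    do {
        out.push_back(static_cast<uint8_t>(value & 0x7F));
        value >>= 7;
    } while (value != 0);
    assert(out.size() <= 5);   // 32 bits never need more than five octets

    // Switch to big-endian order: most significant group first.
    std::reverse(out.begin(), out.end());

    // Set the continuation flag on every octet except the last one.
    for (std::size_t i = 0; i + 1 < out.size(); ++i) {
        out[i] |= 0x80;
    }
    return out;
}
```

For 0xA0 the two groups are 0x01 and 0x20; setting the continuation flag on the first one yields 0x81 0x20, matching the example above. The value 0x60 fits in a single group and stays 0x60.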

The function get_mb_uint helps us to decode the "Multi-byte Integer".

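unsigned int get_mb_uint(BYTE *data,int len,int &k)
{
    unsigned int r = 0;
    int used = 0;    // number of octets consumed so far
    while(used < len)
    {
        BYTE b = data[used++];
        r = (r << 7) | (b & 0x7F);    // append the low seven bits
        if( (b & 0x80)==0 )    // continuation flag clear: last octet
        {
            break;
        }
    }

    // Advance the caller's offset by the octets actually read
    // (never past the end of the buffer, even for truncated input)
    k = k + used;

    return r;
}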

Name Translation of WMLScript Standard Libraries

WMLScript bytecode uses "lib index" and "func index" to identify which standard library function is to be called.

char * make_call_name(int lindex,int findex);

make_call_name looks up the library index and the function index in an internal string table and returns the readable name. Users can then read the library and function names directly in the disassembled output instead of looking them up in the documentation.
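
A lookup of this kind might be sketched as below. The table is a small excerpt only: the indices shown for the Lang and Float libraries are given as I recall them from the WMLScript Standard Libraries specification and should be verified against it. The names call_name, LibEntry and the fallback "libN.funcM" format are hypothetical and are not taken from DWmlsc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical table layout for this sketch (not DWmlsc's internal table).
struct LibEntry {
    const char* name;
    const char* const* funcs;
    std::size_t count;
};

// Excerpt only; indices are assumptions to be checked against the specification.
static const char* const kLangFuncs[]  = {"abs", "min", "max", "parseInt", "parseFloat"};
static const char* const kFloatFuncs[] = {"int", "floor", "ceil"};

static const LibEntry kLibs[] = {
    {"Lang",  kLangFuncs,  sizeof(kLangFuncs)  / sizeof(kLangFuncs[0])},    // lib index 0
    {"Float", kFloatFuncs, sizeof(kFloatFuncs) / sizeof(kFloatFuncs[0])},   // lib index 1
};

std::string call_name(int lindex, int findex) {
    const int libs = sizeof(kLibs) / sizeof(kLibs[0]);
    if (lindex >= 0 && lindex < libs) {
        const LibEntry& lib = kLibs[lindex];
        assert(lib.funcs != NULL);
        if (findex >= 0 && static_cast<std::size_t>(findex) < lib.count) {
            return std::string(lib.name) + "." + lib.funcs[findex];
        }
    }
    // Unknown pair: fall back to a numeric form instead of failing.
    return "lib" + std::to_string(lindex) + ".func" + std::to_string(findex);
}
```

Falling back to a numeric name keeps the disassembly usable even when the bytecode calls a library that the table does not cover.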

Summary

Currently, DWmlsc can only decompile the bytecode into "WMLScript Assembly Language". In the future, I will enhance it to decompile bytecode back into WMLScript, so that it becomes a real "Decompiler" :).