
How to encode/decode URLs to the UTF8 format (with %20 and so)

By Daniel Cohen Gindi, 17 Jul 2007
 

Introduction

This article shows how to encode and decode URLs using UTF-8 percent-encoding. If you are writing an application with web support, for example one that navigates a WebBrowser ActiveX control to a certain URL, you have to encode the URL first, because many characters (e.g., Hebrew, accented Latin, spaces, and so on) cannot appear in a URL as-is.

I have written a class that does all the work, and it is very simple to use. Enjoy!

Background

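URLs can contain only a small set of ASCII characters literally (letters, digits, and a handful of punctuation marks). Every other character is first converted to its UTF-8 bytes, and each byte is then written in the %XX form, where XX is the byte's value in hexadecimal.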
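
As a minimal sketch of the %XX rule described above, the following portable C++ function percent-encodes a string that is already in UTF-8. It is not the CUrlEncode class from this article: the name PercentEncodeUtf8 is hypothetical, and the set of characters left untouched is assumed to be the "unreserved" set from RFC 3986 (letters, digits, '-', '.', '_', '~').

```cpp
#include <string>

// Hypothetical helper, not part of CUrlEncode.
// Assumes the input is already UTF-8 and percent-encodes every byte that is
// not an RFC 3986 "unreserved" character.
std::string PercentEncodeUtf8(const std::string& utf8)
{
    static const char kHex[] = "0123456789ABCDEF";
    std::string out;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += kHex[c >> 4];     // high nibble
            out += kHex[c & 0x0F];   // low nibble
        }
    }
    return out;
}
```

With this sketch, a space becomes %20, and the Hebrew letter alef (UTF-8 bytes D7 90) becomes %D7%90, one %XX group per byte.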

For more information about the main rules of URL encoding, you can have a look here.

Using the Code

I have included the source code in this article, and you can use it without any effort:

CUrlEncode cEncoder;
CString strEncoded = cEncoder.Encode(_T("http://www.google.com/search?q=my search"));
// strEncoded is now http://www.google.com/search?q=my%20search
CString strDecoded = cEncoder.Decode(_T("http://www.google.com/search?q=%22my%20search%22"));
// strDecoded is now http://www.google.com/search?q="my search"

This class can deal with much more than spaces; this is just a simple example.

The usage for the functions is as follows:

CString Encode(CString strURL, BOOL bEncodeReserved/*=FALSE*/);
CString Decode(CString strURL);

Here, bEncodeReserved controls whether the reserved characters are encoded too. Setting it to TRUE is dangerous for full URLs, because it also encodes characters like '/' and will break your URL. But if you are encoding a single component, such as search keywords, you should set this parameter to TRUE.
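
To illustrate the effect of such a flag, here is a rough standalone sketch in portable C++. It does not reproduce the class's internals: the function name EncodeUrlSketch is hypothetical, and the reserved set is assumed to be the one listed in RFC 3986, which may differ from the set the class actually uses.

```cpp
#include <string>

// Hypothetical sketch, not the CUrlEncode implementation.
// encodeReserved == false: reserved characters such as '/' and '?' pass through.
// encodeReserved == true : they are percent-encoded like any other byte.
std::string EncodeUrlSketch(const std::string& utf8, bool encodeReserved)
{
    static const char kHex[] = "0123456789ABCDEF";
    // Assumption: reserved set taken from RFC 3986 (gen-delims + sub-delims).
    static const std::string kReserved = ":/?#[]@!$&'()*+,;=";
    std::string out;
    for (std::string::size_type i = 0; i < utf8.size(); ++i) {
        const unsigned char c = static_cast<unsigned char>(utf8[i]);
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        const bool reserved =
            kReserved.find(static_cast<char>(c)) != std::string::npos;
        if (unreserved || (reserved && !encodeReserved)) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += kHex[c >> 4];
            out += kHex[c & 0x0F];
        }
    }
    return out;
}
```

Calling the sketch with false on a full URL keeps the "http://" and "?q=" parts intact and only turns the space into %20. Calling it with true on the same URL turns "http://" into "http%3A%2F%2F", which is why that mode only makes sense for individual components such as a search keyword.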

That's about it, hope I helped.

History

  • 17th July, 2007: Initial post.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

About the Author

Daniel Cohen Gindi
Software Developer (Senior)
Israel

Comments and Discussions

Korean/Chinese/Japanese support (Taein Kim, 4 Sep '07 - 3:53)
Thank you Daniel Cohen Gindi for this great article.
 
For those of you who are having a problem using this class to encode Korean/Chinese/Japanese characters, do the following:
 
The very first thing to do is to make your Visual Studio 6 project Unicode compatible.
By default, Visual Studio 6 projects are not built as Unicode.
To use Unicode,
go to Project Settings -> C/C++ tab and, under preprocessor definitions, add
_UNICODE
 
In the same window, go to the 'Link' tab and enter
wWinMainCRTStartup
as the Entry-point symbol.
If you don't have mfc42ud.dll, you will need to install it. Do a Google search on this topic.
 

 
Add the following macros at the beginning of URLEncode.cpp:
 
#define MAKEWORD2(a, b, c) ( (DWORD) ( ((BYTE) (a)) | ((DWORD) ((BYTE) (b))) << 8 | ((DWORD) ((BYTE) (c))) << 16 ) )
#define HIBYTE2(w) ((BYTE) (((DWORD) (w) >> 16) & 0xFF))
#define MDBYTE2(w) ((BYTE) (((DWORD) (w) >> 8) & 0xFF))
#define LOBYTE2(w) ((BYTE) (w))
 
You need these because, unlike Hebrew, Korean/Chinese/Japanese characters require 3 bytes each in UTF-8 encoding. (If I am wrong, please correct me. I am no expert in this area.)
 
Change
return MAKEWORD(mb[1],mb[0]);
in toUTF8 to
return MAKEWORD2(mb[2],mb[1],mb[0]);
 
In the CURLEncode::Encode function, change
w=toUTF8(tc);
nc=toHex(HIBYTE(w));
nc+=(toHex(LOBYTE(w)));
 
to
 
w=toUTF8(tc);
nc=toHex(HIBYTE2(w));
nc+=(toHex(MDBYTE2(w)));
nc+=(toHex(LOBYTE2(w)));
 
and change the data type of the variable w from WORD to DWORD.
 
That should fix the problem.
Have a great day!
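
As a rough illustration of why the byte count differs between scripts, here is a minimal portable C++ sketch that encodes a single Unicode code point as UTF-8. It is independent of the toUTF8 function in URLEncode.cpp (whose internals are not shown here), and the name CodePointToUtf8 is hypothetical.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and for values above U+10FFFF.
std::string CodePointToUtf8(unsigned long cp)
{
    std::string out;
    if (cp < 0x80) {                        // 1 byte: ASCII
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // 2 bytes: e.g. Hebrew
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // 3 bytes: e.g. most CJK
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return out;                     // surrogate halves are not characters
        }
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes: outside the BMP
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For example, U+05D0 (Hebrew alef) yields the two bytes D7 90, while U+4E2D (a common Chinese character) yields the three bytes E4 B8 AD. Because the function returns exactly as many bytes as the character needs, the caller can loop over the result instead of assuming a fixed count of two or three.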
 


Re: Korean/Chinese/Japanese support (Kenny Zhao, 29 Sep '07 - 20:04)
Yes, it works for Chinese now.
Thanks for your fix, and thanks to Daniel as well.
:)
 
...

Re: Korean/Chinese/Japanese support (chris_cppteam, 26 Nov '08 - 15:12)
Hello,
How about the decode part for Chinese?
Would you mind giving me some ideas?
Thanks a lot! :)
Re: Korean/Chinese/Japanese support (Never Winter, 20 Jul '08 - 19:04)
I ran into a problem while encoding Vietnamese characters (two bytes, not three bytes like Chinese/Japanese), so the code below
 
w=toUTF8(tc);
nc=toHex(HIBYTE2(w));
nc+=(toHex(MDBYTE2(w)));
nc+=(toHex(LOBYTE2(w)));
 
should be changed to:
 
w=toUTF8(tc);
nc=toHex(HIBYTE2(w));
nc+=(toHex(MDBYTE2(w)));
//This will enable support for both two/three bytes encoding
if (LOBYTE2(w) != 0)
{
nc+=(toHex(LOBYTE2(w)));
}
 
Does anyone have any comments? Pls let me know.
 
Regards,
Winter
Re: Korean/Chinese/Japanese support (Daniel Cohen Gindi, 20 Jul '08 - 20:58)
I really did not take into account the situation of 3-byte characters, and I don't know even one character in Vietnamese...
Your fix should work, but make sure that the results are correct, and fix the decoding too.
I'll make the changes later on when I find the time!
 
In the meantime, have a GREAT day!
 
-----
Daniel Cohen Gindi
danielgindi (at) gmail dot com
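
On the decoding side raised in the replies above, one possible approach is to turn every %XX group back into a raw byte first, and only then interpret the whole byte sequence as UTF-8, so that two-byte and three-byte characters need no separate handling. The following is a minimal portable C++ sketch under that assumption; the names HexValue and PercentDecodeToUtf8 are hypothetical and are not part of the class.

```cpp
#include <string>

// Returns 0..15 for a hexadecimal digit, or -1 for any other character.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: replaces each valid %XX group with the byte it encodes.
// The result is a UTF-8 byte string; malformed groups are copied unchanged.
std::string PercentDecodeToUtf8(const std::string& in)
{
    std::string out;
    for (std::string::size_type i = 0; i < in.size(); ++i) {
        if (in[i] == '%' && i + 2 < in.size()) {
            const int hi = HexValue(in[i + 1]);
            const int lo = HexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>((hi << 4) | lo);
                i += 2;  // skip the two hex digits just consumed
                continue;
            }
        }
        out += in[i];
    }
    return out;
}
```

The sketch stops at UTF-8 bytes. In a Unicode MFC build, converting those bytes to a wide string would be a separate step; on Windows that is usually done with MultiByteToWideChar and CP_UTF8, though how it fits into the class's Decode function is not shown here.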

Last Updated 17 Jul 2007
Article Copyright 2007 by Daniel Cohen Gindi
Everything else Copyright © CodeProject, 1999-2013