Reading the Internet Explorer Cache
Apr 26, 2006
An article on using two different methods to return information stored in the IE cache.
Introduction
There are two basic ways to read the cache files that Internet Explorer produces. One is to use the WinInet cache functions. The other is to use a custom-built solution that reads the cache files directly. Neither method has a clear advantage over the other, except perhaps that one relies on Microsoft's API and the other does not. In this article, I will present both methods of reading the cache.
Cache Structure
The cache file begins with a 28-byte header tag that identifies the cache version: Client UrlCache MMF Ver 5.2. At offset 0x48 from the beginning of the file is a two-byte value containing the number of folders. Immediately following are the 8-byte folder names, followed by a 4-byte value whose purpose is unknown. From the end of the folder list up to about 0x6000 is a region of unknown data. Then come the entries, which are typically one of four types: Leak, Redr, URL, and Hash. It is unknown what the Hash entries are for, so we just read them, discard them, and continue. The Leak and URL entries appear to have the same structure.
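The fixed header fields described above can be checked with a few lines of portable C++. This is only a sketch under stated assumptions: the file is already loaded into a byte buffer, the two-byte folder count is little-endian, and the helper names `HasCacheSignature` and `ReadFolderCount` are hypothetical (they are not part of the article's source). It also accepts only the exact "Ver 5.2" tag quoted above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Signature text quoted in the article; with its terminating NUL it fills 28 bytes.
static const char kCacheSignature[] = "Client UrlCache MMF Ver 5.2";
// Offset of the folder count, as described in the article.
static const std::size_t kFolderCountOffset = 0x48;

// Returns true when the buffer starts with the expected 28-byte header tag.
bool HasCacheSignature(const std::vector<std::uint8_t>& file)
{
    static_assert(sizeof(kCacheSignature) == 28, "tag plus NUL should be 28 bytes");
    return file.size() >= sizeof(kCacheSignature) &&
           std::memcmp(file.data(), kCacheSignature, sizeof(kCacheSignature)) == 0;
}

// Reads the two-byte folder count at 0x48 (assumed little-endian).
// Returns 0 when the buffer is too short to contain the field.
std::uint16_t ReadFolderCount(const std::vector<std::uint8_t>& file)
{
    if (file.size() < kFolderCountOffset + 2)
        return 0;
    return static_cast<std::uint16_t>(file[kFolderCountOffset] |
                                      (file[kFolderCountOffset + 1] << 8));
}
```

A caller would typically reject the file when `HasCacheSignature` returns false or `ReadFolderCount` returns 0, before attempting to read any entries.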
//this is the basic structure for the url entries
typedef struct UrlEntry
{
    //=URL_ID
    TCHAR szRecordType[4];
    //="02 00 00 00" :
    // ActualSize = dwRecordSize* (128 Bytes or 0x80)
    DWORD dwRecordSize;
    FILETIME modifieddate;
    FILETIME accessdate;
    DWORD dwUnsure1;
    DWORD dwUnsure2;
    DWORD wFileSizeLow;
    DWORD wFileSizeHigh;//???
    BYTE uBlank[8];//expire time?
#ifdef __IE40__
    DWORD dwExtra;//Extra one here if its IE4
#endif
    DWORD uSame; //= "60 00 00 00"
    DWORD dwCookieOffset;//="68 00 00 00"
    BYTE uFolderNumber;//="FE" FE=No Folder
    BYTE unknown[3];//="00 10 10"
    DWORD uFilenameOffset;
    DWORD dwCacheEntryType; //= "01 00 10 00"
    DWORD unSure;//="00 00 00 00"
    DWORD dwHeaderSize;// 00 00 00 00
    DWORD dwUnknown;//"00 00 00 00"
    DWORD dwUnsure3;//???
    DWORD wHitCount;
    DWORD dwUseCount;//00 00 00 00
    DWORD dwData2;//??
    BYTE uMiscExtraData[8];
    //this will contain the url, filename,
    //http response, and user with 0x00,
    //0xF0, 0xAD, and 0x0B as separating characters.
    BYTE lpText[1];
    //lpText containing:
    //Format:
    //WebUrl
    //LocalFileName
    // The order of the following changes and is optional
    //HTTP 1.1 / OK
    //Pragma: no cache
    //Content Type
    //ETag
    //Content Length
    //~U : username
}URLENTRY, *LPURLENTRY;
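Because the variable-length text at the end of a URL entry is a run of strings divided by separator bytes, it can be split into fields with a small helper. The following is a minimal sketch, assuming the separators are exactly the four byte values named in the struct comment (0x00, 0xF0, 0xAD, 0x0B) and that the text fields themselves never contain those bytes. `IsSeparator` and `SplitEntryText` are hypothetical names, not functions from the article's source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Separator bytes named in the article's struct comment.
static bool IsSeparator(std::uint8_t b)
{
    return b == 0x00 || b == 0xF0 || b == 0xAD || b == 0x0B;
}

// Splits the variable-length tail of an entry into its non-empty text fields.
// Runs of consecutive separators are collapsed, so padding produces no fields.
std::vector<std::string> SplitEntryText(const std::uint8_t* text, std::size_t length)
{
    std::vector<std::string> fields;
    std::string current;
    for (std::size_t i = 0; i < length; ++i)
    {
        if (IsSeparator(text[i]))
        {
            if (!current.empty())
            {
                fields.push_back(current);
                current.clear();
            }
        }
        else
        {
            current.push_back(static_cast<char>(text[i]));
        }
    }
    if (!current.empty())
        fields.push_back(current);
    return fields;
}
```

With this approach the first field would be the web URL and the second the local file name, following the order listed in the struct comment. The remaining fields are optional and vary in order, so they need to be inspected individually.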
The Redr has the following structure:
//a redr entry
typedef struct RedrEntry
{
    //=REDR_ID
    TCHAR szRecordType[4];
    //="01 00 00 00" : ActualSize =
    // dwRecordSize* (128 Bytes (0x80 Bytes))
    DWORD dwRecordSize;
    FILETIME dwNotSur;
    BYTE lpWebUrl[1];//Url till end
}REDRENTRY, *LPREDRENTRY;
And the Hash structure is:
//a hash entry
typedef struct HashEntry
{
    //=HASH_ID
    TCHAR szRecordType[4];
    //="20 00 00 00" : ActualSize =
    // dwRecordSize* (128 Bytes (0x80 Bytes))
    DWORD dwRecordSize;
    BYTE lpHashText[1];
}HASHENTRY,*LPHASHENTRY;
Entries continue in this way until the end of the file. The size of each entry is given in blocks, and one data block is 0x80 bytes. Most entries are two or three blocks long. They do not appear to be ordered by type; they may be ordered by date, but I have not looked that deeply into them.
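The block arithmetic can be illustrated by walking a byte buffer and listing each entry's offset and size. This is a sketch rather than the article's implementation, and it rests on several assumptions: the four record tags are taken to be the ASCII strings "URL ", "LEAK", "REDR", and "HASH"; the block count is read as a little-endian 32-bit value right after the tag; and anything that does not look like an entry header is skipped one block at a time. `EntryInfo`, `ReadU32`, and `ListEntries` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

static const std::size_t kBlockSize = 0x80;  // one data block, per the article

struct EntryInfo
{
    std::string type;    // four-character record tag, e.g. "URL "
    std::size_t offset;  // where the entry starts in the buffer
    std::size_t size;    // block count * 0x80
};

// Reads a little-endian 32-bit value.
static std::uint32_t ReadU32(const std::uint8_t* p)
{
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Lists the entries found from `start` to the end of the buffer.
std::vector<EntryInfo> ListEntries(const std::vector<std::uint8_t>& file,
                                   std::size_t start)
{
    std::vector<EntryInfo> entries;
    std::size_t offset = start;
    while (offset + 8 <= file.size())
    {
        const std::uint8_t* p = file.data() + offset;
        const std::string tag(reinterpret_cast<const char*>(p), 4);
        const bool known = tag == "URL " || tag == "LEAK" ||
                           tag == "REDR" || tag == "HASH";
        const std::uint32_t blocks = ReadU32(p + 4);
        if (!known || blocks == 0)
        {
            offset += kBlockSize;  // not an entry header: try the next block
            continue;
        }
        const std::size_t size = static_cast<std::size_t>(blocks) * kBlockSize;
        if (size > file.size() - offset)
            break;                 // entry claims more data than the buffer holds
        EntryInfo info;
        info.type = tag;
        info.offset = offset;
        info.size = size;
        entries.push_back(info);
        offset += size;            // the next entry starts after this one's blocks
    }
    return entries;
}
```

For a real cache file, `start` would be the offset where the entries begin, which the text above places at about 0x6000. An entry with a block count of 2 occupies 2 * 0x80 = 0x100 bytes.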
Using the code
The first way to read the cache files is using custom built functions to read the file structures. The functions to accomplish this are:
//this opens the cache file
HANDLE OpenCacheFile(TCHAR* szCacheFilePath);

//deal with the cache folders
WORD GetCacheFolderNum(HANDLE hFile);
void GetCacheFolders(HANDLE hFile, WORD wFolders, LPCACHEFOLDERS& pFolders);
CString GetFolderName(LPCACHEFOLDERS pFolders, WORD wFolderNum);

//these get the cache entries
DWORD GetFirstCacheEntry(HANDLE hFile, DWORD* lpdwOffset);
//Takes a current offset, returns new offset
DWORD GetNextCacheEntry(HANDLE hFile, DWORD* lpdwOffset);

//these read the various types of entries
void ReadCacheEntry(HANDLE hFile, DWORD* lpdwOffset, LPURLENTRY& lpData);
void ReadCacheLeakEntry(HANDLE hFile, DWORD* lpdwOffset, LPLEAKENTRY& lpData);
void ReadCacheRedrEntry(HANDLE hFile, DWORD* lpdwOffset, LPREDRENTRY& lpData);
void ReadCacheHashEntry(HANDLE hFile, DWORD* lpdwOffset, LPHASHENTRY& lpData);
Using these functions is somewhat involved. The code loops through the file, calling GetNextCacheEntry and then the matching ReadCache*Entry function. The GetNext call merely moves the file pointer to the correct position for the next entry and returns the entry type, while the Read calls actually read the data, fill the structure, and set the pointer to the end of the block. You could call these functions with arbitrary file positions, but they break quickly if the positions are incorrect. An example of using these functions is:
//set cursor to busy
HCURSOR hCur = SetCursor(LoadCursor(NULL, IDC_WAIT));
CString str;
//get the path to the cache to open
m_path.GetWindowText(str);
//open the cache specified
HANDLE hFile = OpenCacheFile(str.GetBuffer(str.GetLength()));
str = _T("This is a custom View ")
      _T("reading bytes from the DAT files.\r\n");
m_test.SetWindowText(str);
//get the number of cache folders
WORD wNum = GetCacheFolderNum(hFile);
// if the cache folders=0 then we
// did not read the file correctly so exit
if (wNum == 0)
{
    CloseHandle(hFile);//close cache file before leaving
    SetCursor(hCur);//return cursor to normal
    return;
}
CacheFolders* pFolders;
//get the list of cache folders
GetCacheFolders(hFile, wNum, pFolders);
//loop the list and write out the folder names
for (int n = 1; n <= wNum; n++)
{
    CacheFolder lpFolder = (pFolders->folders[n-1]);
    str.Format(_T("Folder: %s\r\n"), lpFolder.szFolderName);
    //do something with folder name
}
//we do not delete the list here
//because we will reference these later
DWORD dwOffset = 0;
//retrieve only first 50 entries because
//they will be very large text on the screen
int nEntries = 50;
DWORD dwType = GetFirstCacheEntry(hFile, &dwOffset);//get first entry
do
{
    if ((dwOffset >= 0xB5C00))
        dwType = dwType;//no-op (self-assignment); has no effect
    if (dwType == URL_ID)//if its a url do this
    {
        URLENTRY *url;
        ReadCacheEntry(hFile, &dwOffset, url);
        //process data here
        CoTaskMemFree(url);//free url info
    }
    else if (dwType == LEAK_ID)//if its a leak do this
    {
        LEAKENTRY *url;
        ReadCacheLeakEntry(hFile, &dwOffset, url);
        //process data here
        CoTaskMemFree(url);//free leak info
    }
    else if (dwType == REDR_ID)//if its a redr do this
    {
        REDRENTRY *url;
        ReadCacheRedrEntry(hFile, &dwOffset, url);
        //process data here
        CoTaskMemFree(url);//free redr info
    }
    else if (dwType == HASH_ID)//if its a hash do this
    {
        HASHENTRY *url;
        ReadCacheHashEntry(hFile, &dwOffset, url);
        //we dont know what the hash stuff
        //is so we just read it and do not display it.
        CoTaskMemFree(url);//free hash info
    }
    dwType = GetNextCacheEntry(hFile, &dwOffset);
}
//while (dwType != 0);
//This would read the entire file,
//takes about 20-25 minutes for a 8.5MB File
//This is limited to 50 because otherwise
//the edit box and strings run out of memory
//(pre-decrement so the body runs at most 50 times)
while ((dwType != 0) && (--nEntries > 0));
//free folders info
::CoTaskMemFree(pFolders->folders);
//free cache folders holder
::CoTaskMemFree((void*)pFolders);
CloseHandle(hFile);//close cache file
SetCursor(hCur);//return cursor to normal
Using WinInet to read the cache
The other method of reading the cache is to use the WinInet functions. This is a rather simple method, and it returns pretty much the same information. A few function calls are needed, but they can easily be wrapped into two functions: the first gets the first cache entry, and the second gets each subsequent entry. These functions do not appear to return results in the same order as the information in the file, and they do not allow you to read data from a location other than the default Internet Explorer cache. The two functions are:
//these two functions use the WinInet
//functions for dealing with caches
//the LPINTERNET_CACHE_ENTRY_INFO must be
//allocated space before these functions are called.
//Ideally you can allocate only one structure,
//and then re-use it for each call...
HANDLE GetFirstInetCacheEntry(LPINTERNET_CACHE_ENTRY_INFO lpCacheEntry,
                              DWORD &dwEntrySize/*=4096*/);

//It is important to note that the dwEntrySize is
//modified from within the function calls to represent
//the size of the data actually returned.
//Therefore if you are using one
//variable as the size, you need to
//reset the variable to the actual allocated
//entry size before the next call to the function.
BOOL GetNextInetCacheEntry(HANDLE &hCacheDir,
                           LPINTERNET_CACHE_ENTRY_INFO lpCacheEntry,
                           DWORD &dwEntrySize/*=4096*/);
And a sample of using them is:
//this does not use the cache path,
//it automatically uses the default IE cache
//I have not figured out how
//to open another cache file yet.
LPINTERNET_CACHE_ENTRY_INFO lpCacheEntry;
DWORD MAX_CACHE_ENTRY_INFO_SIZE = 4096;
DWORD dwEntrySize = MAX_CACHE_ENTRY_INFO_SIZE;
HANDLE hCacheDir;
int nCount = 0;//set count to 0
//init cache entry holder
lpCacheEntry = (LPINTERNET_CACHE_ENTRY_INFO) new char[dwEntrySize];
lpCacheEntry->dwStructSize = dwEntrySize;
//get first cache entry using winInet functions
hCacheDir = GetFirstInetCacheEntry(lpCacheEntry, dwEntrySize);
//process entry here
nCount++;//increase count
do
{
    //reset entry size because it was changed in the cache call
    dwEntrySize = MAX_CACHE_ENTRY_INFO_SIZE;
    //attempt to get the cache entry
    if (!GetNextInetCacheEntry(hCacheDir, lpCacheEntry, dwEntrySize))
        break;
    //process entry here
    nCount++;
}
//while (TRUE);
//This would read the entire file,
//takes about 20-25 minutes for a 8.5MB File
//loop for first 100 strings only because
//otherwise you get a cstring format error.
while (nCount < 100);
//delete the cache entry
delete [] lpCacheEntry;
//close cache if not already closed.
FindCloseUrlCache(hCacheDir);
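The sample above uses a fixed 4096-byte buffer and resets the size before every call. One way to handle entries that do not fit is to reset the size each time and grow the buffer when the call reports that it needs more room. The following portable sketch shows that pattern with a generic callable standing in for the WinInet call; `FetchWithResize` and its callable contract are assumptions made for illustration, not part of the article's code or of WinInet.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Calls fetch(data, size) with `size` reset to the buffer's real capacity.
// Assumed contract for `fetch` (hypothetical, modelled on in/out size parameters):
//  - success: returns true and stores the number of bytes used in `size`;
//  - buffer too small: returns false and stores the required size in `size`;
//  - any other failure: returns false and leaves `size` <= the capacity.
template <typename Fetch>
bool FetchWithResize(std::vector<char>& buffer, Fetch fetch, unsigned long& used)
{
    for (int attempt = 0; attempt < 2; ++attempt)
    {
        // Reset the in/out size on every call, since the callee overwrites it.
        unsigned long size = static_cast<unsigned long>(buffer.size());
        if (fetch(buffer.data(), size))
        {
            used = size;
            return true;
        }
        if (size <= buffer.size())
            return false;     // failed for a reason other than buffer size
        buffer.resize(size);  // grow to the requested size, then retry once
    }
    return false;
}
```

With WinInet, the callable would wrap FindFirstUrlCacheEntry or FindNextUrlCacheEntry and treat a GetLastError() value of ERROR_INSUFFICIENT_BUFFER as the "buffer too small" case. Keeping the buffer in a std::vector also removes the need for the manual delete [] at the end.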
History
- v1.0 - Tested and worked fine with IE4 using Win98 and VS 6.0. Then I got XP and .NET 2003, and it would not work anymore, so I updated the code to work with IE6, XP, and .NET 2003. I do not have any other systems, so I cannot test on other platforms.