Introduction

The MSN Desktop Search application is an evolutionary version of the standard Indexing Service that has shipped as part of Windows since Windows 2000. They both make use of the same [...]

I often find myself searching for a web page that I remember reading a couple of months ago. If I then use Google or MSN to search for the page, it can take a long time to find it again, because the results are swamped with other pages containing the same search term, and many of those are sales pages rather than the page I want. The ideal thing would be to perform an indexed search based on some terms I remember reading in the page, but limited to pages I have actually visited on the web. In other words, if you had a complete copy of every web page you've visited, as opposed to just a small cached subset, you could run a local desktop search against this complete browser history.

The MyLifeBits project takes this to the next level by keeping an electronic copy of everything you've visited or received, including paper copies. Channel9 also has an interview with a couple of people involved in the MyLifeBits project.

Implementation Details

There were a couple of options I could have taken to store a copy of every web page visited. One approach is to implement a proxy server that sees every response coming back to the browser; the other is to enumerate the browser's cache periodically. The advantage of the proxy server approach is that it is browser independent, whereas enumerating the browser cache is browser specific. I went with the simpler approach of periodically enumerating the browser's cache using the [...]

In order to enumerate the browser cache, the [...]

Next we check the MIME type to determine whether we are interested in making a copy of this particular cache entry. The current set of MIME types that I check for is the following:
text/html
application/pdf
application/msword
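
The MIME check over the three types listed above could be sketched roughly as follows. This is only an illustration, not the article's actual code: the helper name ShouldArchiveMimeType is hypothetical, and stripping parameters such as "; charset=..." and ignoring case are assumptions about how the Content-Type value might look in a cache entry.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical helper (not from the original application): returns true when
// a cache entry with this Content-Type value is one of the three MIME types
// the article copies.
bool ShouldArchiveMimeType(const std::string& contentType)
{
    // Assumption: drop any parameters such as "; charset=utf-8".
    std::string type = contentType.substr(0, contentType.find(';'));

    // Trim surrounding whitespace and lower-case the remainder so that
    // " Text/HTML " and "text/html" compare equal.
    const auto isSpace = [](unsigned char c) { return std::isspace(c) != 0; };
    type.erase(type.begin(), std::find_if_not(type.begin(), type.end(), isSpace));
    type.erase(std::find_if_not(type.rbegin(), type.rend(), isSpace).base(), type.end());
    std::transform(type.begin(), type.end(), type.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // The three MIME types named in the article.
    static const char* const kArchivedTypes[] = {
        "text/html", "application/pdf", "application/msword"};
    for (const char* candidate : kArchivedTypes) {
        if (type == candidate) {
            return true;
        }
    }
    return false;
}
```

Keeping the accepted types in one small table means a new document type can be added without touching the comparison logic.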

So I only make a copy of the text associated with a web page and not of the associated images. In addition, I make copies of any PDF and Word documents that I have read in the browser.

Next I determine where to store the cache entry and create a file name for the copy we're going to make. An example of a cache entry's name and location is shown below:
My Documents\WebCache\2005\4\10\1c582c0-cab8d650-18be.html
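
As a rough sketch of the year\month\day layout in the example path above, the destination could be assembled like this. The names VisitDate, ExtensionForMimeType and BuildCachePath are hypothetical, the file stem is taken as an input because the article's naming scheme is not reproduced here, and the ".pdf" and ".doc" extensions are assumptions (only ".html" appears in the example).

```cpp
#include <string>

// Hypothetical date holder for the day the URL was visited.
struct VisitDate {
    int year;
    int month;
    int day;
};

// Assumed mapping from the archived MIME types to file extensions.
// Only ".html" is confirmed by the article's example path.
std::string ExtensionForMimeType(const std::string& mimeType)
{
    if (mimeType == "text/html") return ".html";
    if (mimeType == "application/pdf") return ".pdf";
    if (mimeType == "application/msword") return ".doc";
    return "";
}

// Builds <root>\<year>\<month>\<day>\<fileStem><extension>, with no zero
// padding of month or day, matching the example "2005\4\10".
std::string BuildCachePath(const std::string& root,
                           const VisitDate& visited,
                           const std::string& fileStem,
                           const std::string& mimeType)
{
    return root + "\\" + std::to_string(visited.year)
                + "\\" + std::to_string(visited.month)
                + "\\" + std::to_string(visited.day)
                + "\\" + fileStem + ExtensionForMimeType(mimeType);
}
```

Calling BuildCachePath with the root "My Documents\WebCache", the date 2005-4-10, the stem "1c582c0-cab8d650-18be" and "text/html" reproduces the example path. Creating the intermediate directories is left out of the sketch.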

I store my WebCache history under the "My Documents" folder so that the contents are automatically indexed by MSN Desktop Search. The sub-directory tree is based on the date the URL was visited and the filename is the [...]

If the MIME type is HTML, a header similar to the headers displayed by Google and MSN is added to the top of the HTML file so that you can easily load the current version of the URL when viewing the cached version. The last modification made to the file being copied is to create a property set on the file and set the [...]
path:webcache keywords:cnn "space shuttle launch"
This will limit the query to items that have 'webcache' in their path name, i.e. only files in our webcache directory and not files in any other document locations, email messages, etc. In addition, the query will be limited to files that contain 'cnn' in the [...]

The [...]

    void AddURLKeywordProperty(LPCWSTR pszFileName, LPWSTR pszURL)
    {
        IPropertySetStorage *pPropSetStg = NULL;
        IPropertyStorage *pPropStg = NULL;

        // Open the property set storage of the copied file.
        HRESULT hr = StgOpenStorageEx(
            pszFileName,
            STGM_SHARE_EXCLUSIVE|STGM_READWRITE,
            STGFMT_FILE,
            0, NULL, 0,
            IID_IPropertySetStorage,
            reinterpret_cast<void**>(&pPropSetStg) );
        if(SUCCEEDED(hr))
        {
            // Create the SummaryInformation property set on the file.
            hr = pPropSetStg->Create(
                FMTID_SummaryInformation,
                NULL,
                PROPSETFLAG_DEFAULT,
                STGM_CREATE|STGM_READWRITE|STGM_SHARE_EXCLUSIVE,
                &pPropStg );
            if(SUCCEEDED(hr))
            {
                PROPSPEC propspec;
                PROPVARIANT propvarWrite;

                // Write the URL as the Keywords property.
                propspec.ulKind = PRSPEC_PROPID;
                propspec.propid = PIDSI_KEYWORDS;
                propvarWrite.vt = VT_LPWSTR;
                propvarWrite.pwszVal = pszURL;
                hr = pPropStg->WriteMultiple(1, &propspec, &propvarWrite, PID_FIRST_USABLE);
                pPropStg->Release();
            }
            pPropSetStg->Release();
        }
    }

Conclusion

I've been using this application for just over 3 months now, and my WebCache contains 8126 files with a total size of 368 MB, which compresses to 218 MB using NTFS file compression. At this rate my WebCache will consume just under 1 GB per year of browsing. In conjunction with MSN Desktop Search, it has made it a lot easier to find web pages that I visited in the past and need to look up again later.