Thumbsuck Introductory Image

Introduction

This sample client-side C# .NET program is a specialized front-end for Google's Image Search™. Google ordinarily returns search results as a set of pages, each with a 4-column by 5-row grid of thumbnails. Thumbsuck instead presents the search results as a tiny slide show, one thumbnail at a time. The client provides two views: the main app, which presents a series of floating images with optional captioning, and a "Search and Settings" view. The latter, typically hidden when not needed, combines a property page with an embedded microbrowser viewer. The viewer offers a second, alternative representation of each image and its caption, more similar to what you'd see in Google's normal image search results.

The program does its work by screen scraping (that is, parsing the returned HTML) rather than by using special Google APIs. This makes it more fragile, but avoids any special licensing. Because the HTML being parsed follows a fairly rigid pattern, simple string manipulation suffices; no document object model is needed.

The program supports some of the options found in Google's "Preferences" and (indirectly) "Advanced Search", and, for fun, an additional "unsafe only" choice on image filtering.

This was my effort, as an experienced C++/MFC developer, to try my hand at building a modest, real-world C# and Windows Forms application. I'm pretty much done with this now; there are suggestions throughout where any interested reader could extend such code.

Using the Program

Getting Started

There is no installer, and no use of the registry. Extract the .exe and other files from Thumbsuck_demo.zip to some folder on a machine with the .NET Framework (e.g., 1.1) installed, double-click the .exe, and you're running. The distribution includes a few JPEGs and a custom DLL for round buttons (see the References for more about that). If you're building from Thumbsuck_src.zip, include the demo's DLLs in your project. The distribution also includes versions of standard files such as Microsoft.mshtml.dll, Interop.SHDocVw.dll, and AxInterop.SHDocVw.dll, tested on XP SP2. If Thumbsuck doesn't run on your flavor of Windows, you may need to replace these DLLs and rebuild; see the article by "smallguy78".

Thumbsuck's main window consists mostly of a floating image with a bound caption beneath it. When first started, both of these direct you to right-click, as shown below. Before doing that, drag on either the image or caption to position it where you'd like it.

Thumbsuck at startup

Right-click, then select "Search and Settings".

The Search and Settings Window

Within this window, seen below, start by entering a search term in the "Search term(s)" field. Then adjust any other search options (discussed later), and hit the button with the arrow on it. After loading, the search results should appear, at a rate controlled by the value in the "Image dwell time" field. The example screenshot here shows a paused moment during one such playback:

While the fields found in Google's Advanced Search are largely absent here, you can use the special syntax quietly provided by Google's standard image search to accomplish these search goals:

All words - simply enumerate them.
Any word - separate them by " OR ".
Exact phrase - place in ASCII double quotes ("); only one phrase allowed.
None of the words - precede each word by minus (-).

In addition, you can include at the end of the search string, the phrase:

site: some site

to restrict the search to a given site.

A short note below the search edit box reminds you of these possibilities. Parentheses are also supported (to the extent Google supports them).
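The syntax rules above can be sketched as a small string builder. This is only an illustration: SearchTermBuilder and its parameters are hypothetical names, not part of Thumbsuck, which simply passes along whatever you type. The sketch assumes the site operator is written without a space after the colon, as Google's site: operator is normally typed, and it wraps the OR group in parentheses.

```csharp
using System;
using System.Text;

// Hypothetical helper (not in Thumbsuck): composes a search string
// from the pieces described in the list above.
public static class SearchTermBuilder
{
    // Pass empty arrays (not null) for unused word lists;
    // pass null or "" for an unused phrase or site.
    public static string Build(string[] allWords, string[] anyWords,
                               string exactPhrase, string[] noneWords,
                               string site)
    {
        StringBuilder sb = new StringBuilder();

        // All words: simply enumerate them.
        Append(sb, string.Join(" ", allWords));

        // Any word: separate them by " OR ".
        if (anyWords.Length > 0)
            Append(sb, "(" + string.Join(" OR ", anyWords) + ")");

        // Exact phrase: a single phrase in ASCII double quotes.
        if (exactPhrase != null && exactPhrase.Length > 0)
            Append(sb, "\"" + exactPhrase + "\"");

        // None of the words: precede each word by minus.
        foreach (string word in noneWords)
            Append(sb, "-" + word);

        // Site restriction goes at the end of the search string.
        if (site != null && site.Length > 0)
            Append(sb, "site:" + site);

        return sb.ToString();
    }

    // Adds one part, separated from earlier parts by a single space.
    private static void Append(StringBuilder sb, string part)
    {
        if (part.Length == 0) return;
        if (sb.Length > 0) sb.Append(' ');
        sb.Append(part);
    }
}
```

The resulting string is what you would type into the "Search term(s)" field.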

Some of the other fields will be discussed further below. The [X] in upper right closes this window by hiding it.

Playback Controls

At the top of the main window is a bar (shown blue here due to the desktop scheme, and called the PlayBar internally), with a row of round buttons. At any moment, some buttons are hidden. The image below shows the Form1.cs design view, so you can see them all. Images, of various sizes and aspect ratios as provided by Google, appear within the central white rectangle, against the top edge and centered horizontally. The white background, and red beneath that, are transparent at runtime.

Main window in design view

From left to right, the buttons:

Go back one image when paused
Pause
Go forward one image when paused
Play at requested normal rate
Fast forward at maximum speed
Skip over remaining "More images" from this site (explained more fully later)
Minimize
Exit

Alternatively, you can right-click on the floating image or its caption and select the equivalent commands from the context menu. Also, the choice of default navigation control is dynamically managed. Consequently, the <Enter> key can often be used to toggle between Paused and Play.

Pausing on an image makes it convenient to right-click on the image in the "Search and Settings" microbrowser, for the additional functionality provided by the Windows browser control: Save as, print, email-to, and so on.

Filtering

For image filtering, the "Search and Settings" pull-down offers four choices:

"Strict safe" - Image filtering as provided by Google's "Strict safe" option.
"Moderate safe" - Google's default option. Google claims that only web text searches differ between this and "Strict safe", and that the image results should be the same. (But the parameter passed on the URL differs, and the number of images coming back can differ.)
"Unsafe" - Unfiltered. Thumbsuck's default.
"Unsafe only (2 pass)" - This additional filter option fetches both the first ("Strict safe") and third ("Unsafe") categories, then keeps only the images that appear in the unfiltered results but not in the strict ones.

Reading an Optional Search-Term File

You can create a list of sequential searches to produce, in effect, a channel of your favorite search results. With a text editor such as Notepad, create a .txt file. Put each search request on a separate line, in the same format as in the "Search term(s)" box. Then enter that filename in the "Load search lines from file" box, or browse to it with the neighboring […] button. The first line of the file will appear in the "Search term(s)" box. Hit the arrow button as usual to begin that search. When the first search ends, the second line from the file appears in the "Search term(s)" box, and that search begins automatically. This continues indefinitely, wrapping back to the first line after the last. To end the file-based search, first clear the filename box. You may then enter a new search term manually and hit the arrow button as before, which ends the old search and begins the new one.

Handling of "More" Images

With regular browser-based Google image search, when not restricted to one site, some returned captions may include a "[More images from this site]" line. Clicking on this jumps to a set of extra thumbnail pages, which I'll call "More" pages here. Thumbsuck has a checkbox to expand these additional pages automatically. For presentation, such pages are queued in order, following the page on which their reference occurs. Because of this page-oriented processing, a set of "More" images seldom appears in the slide show immediately after the reference image.

If such an expansion is enabled, during play or fast-forward, the "Skip More Images" control will appear whenever the image being shown comes from a "More" page. You can skip the remaining "More" images from that site by clicking this control, or choosing the corresponding context-menu item. The control will not necessarily disappear when clicked, since there may be multiple "More" sites on a parent page; then each click jumps to the start of the next group.

There are subtleties. Google introduces the "More" line under the second image from a site, when there is at least one more image. To prevent duplication, Thumbsuck suppresses playback of the first two images in a set of "More" pages. (Of course, there are lots of other reasons why Google might deliver duplicates or near-duplicates in the results.) This may cause the total number of images reported by Thumbsuck to be less than that returned by Google Image. Also, Google Image often reports an enormous numerical figure for results found, but doesn't actually return much more than 40 main pages, plus "More" pages, which in aggregate can run into hundreds of pages.
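The duplicate suppression described above amounts to dropping the first two entries of each "More" group. A minimal sketch, assuming the group has already been collected into an array of URLs; MoreImages and its members are hypothetical names, and the real code works on page objects rather than a flat array:

```csharp
using System;

// Hypothetical helper (not in Thumbsuck): drops the images that
// already appeared on the parent page before the "More" link.
public static class MoreImages
{
    // Google shows the "More" line under the second image from a site,
    // so the first two images of the expanded set are repeats.
    public const int DuplicatedLeadImages = 2;

    public static string[] WithoutLeadDuplicates(string[] moreGroup)
    {
        // Groups of two or fewer images yield an empty result.
        int keep = Math.Max(0, moreGroup.Length - DuplicatedLeadImages);
        string[] result = new string[keep];
        Array.Copy(moreGroup, moreGroup.Length - keep, result, 0, keep);
        return result;
    }
}
```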

Data Structures and Other Points of Interest

Thumbsuck's main work is done in the class GoogImg, which allocates two fixed arrays, googpages and googpagesx, each with 1000 GoogPage "page objects". (The second array is used just with the unsafe-only mode, for the second pass.) Each GoogPage can hold up to 20 images in 4 columns by 5 rows, the standard Google format. Only the slightly-modified URLs to images are stored, not the images themselves. This page-oriented hierarchy is convenient for an application like this one, where manipulation of the image presentation order is not planned. (Otherwise, a flatter table would be better. The reader can also imagine ways to avoid a fixed, pre-allocated array.)
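The layout described above might look roughly like the following sketch. Only the names GoogImg, GoogPage, googpages, and googpagesx come from the actual program; GoogImage and every other member name here is an assumption made for illustration.

```csharp
using System;

// Hypothetical record for one thumbnail. Only URLs and text are kept,
// never the image data itself.
public class GoogImage
{
    public string AnchorUrl;  // the slightly modified URL
    public string Caption;
    public string MoreLink;   // link to "More" pages, or null if none
}

// One page of results in Google's standard 4-column by 5-row format.
public class GoogPage
{
    public const int Columns = 4;
    public const int Rows = 5;
    public const int MaxImages = Columns * Rows;  // 20 per page

    public readonly GoogImage[] Images = new GoogImage[MaxImages];
    public int Count;

    // Returns false once the page already holds 20 images.
    public bool Add(GoogImage image)
    {
        if (Count >= MaxImages) return false;
        Images[Count++] = image;
        return true;
    }
}

public class GoogImg
{
    public const int MaxPages = 1000;

    // Main results, and the second pass used only by unsafe-only mode.
    public readonly GoogPage[] googpages = new GoogPage[MaxPages];
    public readonly GoogPage[] googpagesx = new GoogPage[MaxPages];

    public GoogImg()
    {
        // Pre-allocate every page object up front (fixed-size design).
        for (int i = 0; i < MaxPages; i++)
        {
            googpages[i] = new GoogPage();
            googpagesx[i] = new GoogPage();
        }
    }
}
```

Playback then becomes a nested walk over pages and the images within each page, which is why reordering images would be awkward with this design.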

In operation, all pages of a new search are fetched and stored first, before any images are displayed. This makes the user wait, of course. The reader may wish to revise this towards more sophistication, by introducing producer/consumer threads so that display and fetch are concurrent... and greatly complicating both play control and processing of the unsafe-only mode.

URLs are slightly modified as they are stored. Relative URLs are converted to absolute. For the main URL (shown in the "Image Anchor with URL" text box), a "target=_blank" phrase is added. Clicking on the image in the microbrowser links to the Google cache/context window for the image (as with normal Google image search), and the added phrase makes that window open separately at full size, not in the too-small microbrowser. If a caption has a "[More images from this site]" link, it is removed from the caption, and the link string itself is stored as a separate field in the image data structure, for expanding into separate "More" pages.
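The two URL adjustments can be sketched as follows. This is not the program's actual code, which uses plain string manipulation throughout; the sketch leans on System.Uri for the relative-to-absolute step, and UrlFixups with its method names is hypothetical. It also assumes the anchor tag begins with "<a ".

```csharp
using System;

// Hypothetical helpers (not in Thumbsuck) for the fix-ups described above.
public static class UrlFixups
{
    // Resolves a relative href against the URL of the page it came from.
    public static string ToAbsolute(string pageUrl, string href)
    {
        return new Uri(new Uri(pageUrl), href).AbsoluteUri;
    }

    // Inserts target=_blank right after the opening "<a " so the link
    // opens in a separate window instead of the microbrowser.
    public static string AddBlankTarget(string anchorTag)
    {
        const string open = "<a ";
        if (!anchorTag.StartsWith(open, StringComparison.OrdinalIgnoreCase))
            return anchorTag;  // not an anchor tag: leave untouched
        if (anchorTag.IndexOf("target=", StringComparison.OrdinalIgnoreCase) >= 0)
            return anchorTag;  // already has a target attribute

        return anchorTag.Substring(0, open.Length)
             + "target=_blank "
             + anchorTag.Substring(open.Length);
    }
}
```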

The unsafe-only mode works by finding the most pertinent substring of each URL and doing a brute-force equality comparison between passes, which seems fast enough. The resulting winnowing of images is often plausible, but sometimes just highlights the "noise" in Google's unsafe tagging and the odd failure of my perhaps-too-simple substring-finding code.
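In outline, the two-pass comparison could look like the sketch below. The names are hypothetical, and the key-extraction rule is purely an assumption for illustration (it takes the value of an "imgurl=" parameter when present); the article does not spell out which substring the real code uses.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch (not Thumbsuck's code) of the unsafe-only winnowing.
public static class UnsafeOnly
{
    // ASSUMPTION: the "most pertinent substring" is the imgurl= value.
    // Falls back to the whole URL when the marker is absent.
    public static string PertinentKey(string url)
    {
        const string marker = "imgurl=";
        int start = url.IndexOf(marker, StringComparison.Ordinal);
        if (start < 0) return url;
        start += marker.Length;
        int end = url.IndexOf('&', start);
        return end < 0 ? url.Substring(start)
                       : url.Substring(start, end - start);
    }

    // Keeps the unfiltered URLs whose key matches nothing in the strict
    // pass. Deliberately brute force: every pair is compared.
    public static string[] Difference(string[] unfiltered, string[] strict)
    {
        List<string> kept = new List<string>();
        foreach (string candidate in unfiltered)
        {
            string key = PertinentKey(candidate);
            bool alsoInStrict = false;
            foreach (string safeUrl in strict)
            {
                if (PertinentKey(safeUrl) == key)
                {
                    alsoInStrict = true;
                    break;
                }
            }
            if (!alsoInStrict) kept.Add(candidate);
        }
        return kept.ToArray();
    }
}
```

With at most a few hundred pages of 20 images each, the quadratic comparison stays small enough that a hash table is not needed.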


As indicated above, the main floating image viewer uses the transparent key feature of Windows Forms (for which "Red" was chosen as the key color at design time), with a borderless window. The thumbnails will vary in size and aspect ratio, just as in Google, and the code centers them horizontally. As originally coded, when images are partially obscured by another window, moderately-persistent artifacts occur, namely, chunks of images not properly repainted. This was solved by invalidating the entire form panel in which the picture box lives. A right-click menu item, "Refresh", also invokes this manually.

Search terms read from a file are limited to 100 non-blank lines by the fixed size of the string array.
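A fixed-size list of this kind, combined with the wrap-around playback described earlier, can be sketched as follows. SearchLineList and its members are hypothetical names. The sketch assumes that lines are trimmed and that any lines past the first 100 non-blank ones are silently ignored; the original's behavior in those cases is not documented here.

```csharp
using System;
using System.IO;

// Hypothetical sketch (not Thumbsuck's code) of the search-line list.
public class SearchLineList
{
    public const int MaxLines = 100;  // fixed size of the string array

    private readonly string[] lines = new string[MaxLines];
    private int count;  // number of lines actually loaded
    private int next;   // index of the line NextLine() returns next

    public int Count { get { return count; } }

    // Reads non-blank lines until the reader ends or the array is full.
    public void Load(TextReader reader)
    {
        count = 0;
        next = 0;
        string line;
        while (count < MaxLines && (line = reader.ReadLine()) != null)
        {
            line = line.Trim();
            if (line.Length > 0) lines[count++] = line;
        }
    }

    // Returns the next search line, wrapping back to the first line
    // after the last. Returns null if nothing was loaded.
    public string NextLine()
    {
        if (count == 0) return null;
        string line = lines[next];
        next = (next + 1) % count;
        return line;
    }
}
```

To read from disk, pass a StreamReader opened on the .txt file to Load, then call NextLine each time a search finishes.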

Known Problems and Further Ideas for Improvements

This program is reasonably reliable, but not perfect.

Image Transitions and Transparency

Some experiments were done with fading between images, but, given the heterogeneous image sizes, this caused exposure of the red transparency key background and artifacts thereof. (If you really require sophisticated transitions, this simple approach won't cut it... you'll probably have to struggle with DirectX or Win32.) Similarly, experiments with having a transparent background behind the caption text showed rather jagged letters edged with the key color... not pretty.

Degree of Persistence

For better or worse, Thumbsuck neither reads nor writes to disk explicitly, other than reading a requested search-term file. (There could be implicit disk activity, e.g., virtual memory page swaps and, I suppose, IE-control caching.) Thus, Thumbsuck doesn't remember where its windows were placed or what your search settings were, nor does it consult any Google cookies for settings.

Navigation

Play controls could be further improved. One small bug: whenever the "Skip More Images" control appears, it gets the focus, in spite of the code to prevent it. An idea for extending this control would be to make it visible and workable during pausing/single-stepping.

Ideas for additional controls include rewind, play backwards, jump to a particular image number, and jump to start or end. Or you might wish to add VCR-style micro-controls to the "Search and Settings" window. Another change might be to make the caption under the floating image into a separate window, independently positionable. (An experiment to do this was only partially successful, exhibiting dead-response and repaint problems. This wasn't a surprise; Windows frameworks often seem to get unhappy with multiple sibling windows open, unless using the MDI pattern.)

The Search-Term File

When using the search list from a file, pausing and stepping backwards can't step to a previous search line correctly.

Ideas for improvement to the search-term file feature might include:

Allow image size, filtering, etc., to be optionally specified per line, using a comma-separated format.
Make wrapping back to the start at the end of the search list optional.

Other Display Issues

If you try to close the "Search and Settings" window at the moment the browser and related fields are being updated, the close request will be ignored, and you must re-click.

The actual display time for an image will be longer than the nominal display time if it takes a while to fetch the next image. More clever time management could help smooth this out.
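One possible form of such time management, offered only as an idea and not as anything Thumbsuck does: fetch the next image while the current one is still on screen, measure how long the fetch took, and then wait only for whatever remains of the dwell time. DwellTimer and its parameter names are hypothetical.

```csharp
using System;

// Hypothetical helper (not in Thumbsuck): computes how long to keep
// waiting after the next image has already been fetched.
public static class DwellTimer
{
    // dwellMs: the requested "Image dwell time" in milliseconds.
    // fetchElapsedMs: time already spent fetching the next image
    // while the current one was displayed.
    public static int RemainingDelayMs(int dwellMs, int fetchElapsedMs)
    {
        int remaining = dwellMs - fetchElapsedMs;
        return remaining > 0 ? remaining : 0;  // never wait a negative time
    }
}
```

With this approach an image stays up for the longer of the dwell time and the fetch time, instead of their sum.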

Thumbsuck's threading and synchronization are relatively unsophisticated. Readers interested in this, or in routinely displaying full-size images, should consider Dan Fernandez's work, referenced below.

There's a slow expansion in the amount of paged kernel memory needed during the slide show. I tried to fix this (but probably didn't) with a System.GC.Collect call every 3000 images displayed. There may be machine problems if you run Thumbsuck for hours.

Run with It

You have my permission to use this code in any way you'd like. I've not asserted a copyright privilege, but would welcome your citing this site.

References and Credits

For a different take on presenting Google Image results (discovered after this work was well underway), with better threading but far less comprehensive screen-scraping, see "Dan Fernandez's Blog: Pulling Images from Google using C# Express".

Tricks for modifying this project's references to AxInterop.SHDocVw.dll, Interop.SHDocVw.dll, and Microsoft.mshtml.dll are mentioned in smallguy78's "Extended web browser control for .NET 1.0/1.1".

Information about Saikat Sen's splendid round buttons DLL can be found here. Thanks, Saikat, for allowing me to include the DLL here.

My first effort to read a Google Image web page was as a simple throw-away project, broadly adapted from the sample code by Eric Gunnerson ("A Programmer's Introduction to C# (Second Edition)" : publ. Apress L.P., ISBN: 1-893115-62-3, 2000, Chapter 32 - .NET Frameworks Overview\Reading Web Pages). However, there's not much trace left of that sample now.

Another worthwhile reference is "Using the WebBrowser control in .NET" by Nikhil Dabas.

A thorough treatment of building a custom web browser can be found in "Chapter 8: Creating Front Ends with the WebBrowser Component" of Ted Faison's book "Component-Based Development with Visual C#", Wiley Tech Publ., ISBN 0-7645-4914-6, April 2002.

The .NET SDK also has a sample for using the web browser control (available here).

The unsafe-only mode was broadly inspired by this article, which does the corresponding "unsafe search" on the server-side for regular Google text searches. See also this page about its creator.

The displaying of images in the main window was influenced by the sample code from "How to Load/Display images with C#" by Rakesh Rajan.

Some experiments were separately undertaken to try fading between images, but these suffered from artifacts when used with heterogeneous image sizes and transparent backgrounds. The experiments were based on the FormFader sample from Bob Powell (an MS MVP). For an overview of transparency, see Joe Pardue's "Transparency Tutorial with C#", particularly Part 3.

The method for getting the HTML source retrieved from the browser object was adapted from Steven Willem's "Printing HTML to C#" example. Steven's function to print HTML is present, but not actually exercised at this time.

The technique for dragging a borderless window was adapted from the VB code posted by Jacob Grass of Abiliti Solutions, available here.

History

As of July 12, 2006, the downloadable files are updated. This is because Google changed the HTML slightly (probably several months earlier), with the consequence that images were not being harvested from most pages. As the main article states, screen scraping is a fragile art.

The only file changed is Form1.cs. The code that searches for HTML items at the bottom of the page, such as the "Next" control, to determine whether there are more pages to come had to be revised.

The code used to look for "<div class=n>" and "<img src="nav_next.gif". It now looks for "<div id=navbar class=n>" and "<img src=/intl/" + langabbr + "/nav_next.gif", where langabbr is the 2-letter language abbreviation.
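A minimal sketch of that kind of check, using only string searches. The two marker strings are the ones quoted above; NextPageDetector, HasNextPage, and the way the two tests are combined (the image must follow the navigation bar) are assumptions, not the actual Form1.cs logic.

```csharp
using System;

// Hypothetical sketch (not the actual Form1.cs code) of detecting
// the "Next" control by plain string matching.
public static class NextPageDetector
{
    // langabbr is the 2-letter language abbreviation, e.g. "en".
    public static bool HasNextPage(string html, string langabbr)
    {
        string navMarker = "<div id=navbar class=n>";
        string nextMarker = "<img src=/intl/" + langabbr + "/nav_next.gif";

        // Find the navigation bar first.
        int nav = html.IndexOf(navMarker, StringComparison.Ordinal);
        if (nav < 0) return false;

        // ASSUMPTION: the "Next" image must appear after the bar starts.
        return html.IndexOf(nextMarker, nav, StringComparison.Ordinal) >= 0;
    }
}
```

A markup change as small as the one described here silently turns the result to false, which is exactly the fragility of screen scraping noted in the main article.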

There's one other minor change: the generated URL string for each page now includes "&ndsp=20", to indicate the number of thumbs displayed per page, and drops "&svnum=10".