License: The Code Project Open License (CPOL)
An API for Google Image Search
By Ilan Assayag
Querying images from Google programmatically.
C#, .NET 2.0, WinXP, VS2005, Dev

As part of the research that I am doing for my thesis, I had to perform a lot of image search queries against the most popular search engines. The Yahoo! guys provide us with the Yahoo! SDK API, which also supports image search and was really useful to me. Google, however, for some obscure reason, provides only an API for the regular search engine and does not provide anything for image search. A few weeks ago, I came across a very simple implementation of a Google web translation service by Peter Bromberg. So I thought, why not do something similar for Google image search? Well, I did that, and here it is. The regular expressions used are far more complex than the ones used in the translation service, but in the end, the approach is basically the same.
Note: The article has been translated to Chinese, and the translation is available here.
The source files include two projects. The Ilan.Google.API project includes a DLL which can be used to query Google image search programmatically. The Ilan.Test.Google.API project includes a simple application that enables you to run a query and display all the resulting images dynamically on the form. When you double-click a thumbnail the full original image is displayed. This application is aimed at showing how simple it is to use the API. I wouldn't recommend it as a real searching tool (at least as is) because:
If you're not interested in how it works, but just want to use the library, this section is especially for you :-)
Add the following using directive:
using Ilan.Google.API.ImageSearch;
Format the query so that whitespace is replaced with "+", for example with a string.Replace call. You could also use the regular expression that I have used in my demo project:
string formattedQuery =
Regex.Replace(nonFormattedQuery, @"\s{1,}", "+");
Call the SearchService.SearchImages method:
SearchResponse response =
SearchService.SearchImages(formattedQuery, 1, 50, true);
The response object holds all of the first 50 results for the given query (or fewer if there are no more results). For example, you can retrieve the URL of the first image through the response.Results array:
string firstImageUrl = response.Results[0].ImageUrl;
The parameters of the SearchService.SearchImages method are:
(string) query - The query to be sent.
(int) startPosition - The index of the first item to be retrieved (must be positive).
(int) resultsRequested - The number of results to be retrieved (must be between 1 and (1000 - startPosition)).
(bool) filterSimilarResults - Set to 'true' if you want Google to automatically omit similar entries. Set to 'false' if you want to retrieve every matching image.
safeSearch - Indicates what level of SafeSearch to use.
... then I hope this section will be of help to you.
The SearchResponse and SearchResult classes are pretty straightforward. A query returns one SearchResponse, which holds the total number of available results for the query as well as an array of SearchResult objects, each representing a separate image returned by Google. Each SearchResult object holds the URL of the thumbnail of the image (located somewhere at Google) and the URL of the actual image (at its source).
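To make the shape of these two classes concrete, here is a minimal sketch. Only the Results array and the ImageUrl member appear in this article; ThumbnailUrl, TotalResultsAvailable and the ResponseDemo helper are names assumed for illustration, so the real library may differ.

```csharp
// Hypothetical shapes for the two result classes. Only Results and ImageUrl
// are named in the article; the other member names are assumptions.
public class SearchResult
{
    public string ThumbnailUrl; // thumbnail hosted by Google (assumed name)
    public string ImageUrl;     // original image at its source
}

public class SearchResponse
{
    public int TotalResultsAvailable; // total reported by Google (assumed name)
    public SearchResult[] Results = new SearchResult[0];
}

public static class ResponseDemo
{
    // Collects the original-image URLs from a response, in result order.
    public static string[] GetImageUrls(SearchResponse response)
    {
        string[] urls = new string[response.Results.Length];
        for (int i = 0; i < urls.Length; i++)
        {
            urls[i] = response.Results[i].ImageUrl;
        }
        return urls;
    }
}
```

A caller would typically loop over Results like this to download or display each image.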
After sending a few queries to Google using Google Image Search, it turns out that you can run a simple query for "apple cake" with this URL.
Digging a little further, you can fetch results at a certain position by adding "&start=". So, if you want to fetch results from 21 to 40 (i.e. the second page if you were using their web site), the URL should be this.
Note that the index of the images is 0-based, so to start with the 21st result you must specify "&start=20". Then, I found out that there is a default filter that omits results if they resemble previous results. If you want to disable this filter, you need to add "&filter=0". A quick test will show you that "&filter=1" turns the filter on. To see the effect of the filter, I suggest you run the following two queries, which return the results starting at result number 900:
Finally, pedrito68 pointed out to me that you can also choose the SafeSearch mode by adding "&safe=...". Google's SafeSearch blocks web pages containing explicit sexual content from appearing in search results. There are three options: "active" (filter both explicit text and explicit images), "moderate" (filter explicit images only - default behavior) and "off" (do not filter the search results).
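The three SafeSearch options map naturally onto an enum. The sketch below assumes an enum named SafeSearchFiltering with members Active, Moderate and Off (only the Moderate member is shown in this article) and a hypothetical helper that turns a member into the lowercase value expected after "&safe=".

```csharp
// Assumed enum layout: only SafeSearchFiltering.Moderate appears in the article.
public enum SafeSearchFiltering
{
    Active,   // filter both explicit text and explicit images
    Moderate, // filter explicit images only (default behavior)
    Off       // do not filter the search results
}

public static class SafeSearchUrl
{
    // Returns the lowercase value for the "&safe=" parameter,
    // e.g. SafeSearchFiltering.Off becomes "off".
    public static string ToQueryValue(SafeSearchFiltering level)
    {
        return level.ToString().ToLowerInvariant();
    }
}
```

Lowercasing matters here because Enum.ToString() yields "Moderate", while the options listed above are all lowercase.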
I tried to find a way to define the number of returned results, but didn't succeed. I keep getting 20 results at a time. More on this later...
So, to sum up the querying part, building the URL for a query, given the query, start position and filter can be done as follows:
string requestUrl =
    string.Format("http://images.google.com/images?" +
        "q={0}&start={1}&filter={2}&safe={3}",
        query, startPosition,
        filterSimilarResults ? "1" : "0",
        // The SafeSearch values are lowercase: "active", "moderate" or "off".
        SafeSearchFiltering.Moderate.ToString().ToLowerInvariant());
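For reference, the URL-building step can also be wrapped in a small self-contained helper. This is only a sketch: the ImageQueryUrl class and its Build method are hypothetical names, and escaping each search term with Uri.EscapeDataString is an addition of this sketch, not something the library is known to do.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

public static class ImageQueryUrl
{
    // Builds a request URL from a raw query such as "apple cake".
    // 'start' is the 0-based index of the first result; 'safe' is one of
    // "active", "moderate" or "off".
    public static string Build(string query, int start,
                               bool filterSimilarResults, string safe)
    {
        // Split on runs of whitespace, escape each term, then join with "+".
        string[] terms = Regex.Split(query.Trim(), @"\s+");
        for (int i = 0; i < terms.Length; i++)
        {
            terms[i] = Uri.EscapeDataString(terms[i]);
        }
        string formattedQuery = string.Join("+", terms);

        return string.Format(CultureInfo.InvariantCulture,
            "http://images.google.com/images?q={0}&start={1}&filter={2}&safe={3}",
            formattedQuery, start, filterSimilarResults ? "1" : "0", safe);
    }
}
```

For example, Build("apple cake", 20, false, "moderate") asks for the second page of results with the similarity filter disabled.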
Sending the query and retrieving the HTML file returned is rather simple and pretty common:
// Requires: using System.IO; using System.Net;
HttpWebRequest request =
    (HttpWebRequest)WebRequest.Create(requestUrl);
string resultPage = string.Empty;
// The using blocks dispose of the response, stream and reader when done.
using (HttpWebResponse httpWebResponse =
    (HttpWebResponse)request.GetResponse())
{
    using (Stream responseStream =
        httpWebResponse.GetResponseStream())
    {
        using (StreamReader reader =
            new StreamReader(responseStream))
        {
            // Read the whole HTML page into a single string.
            resultPage = reader.ReadToEnd();
        }
    }
}
Here comes the ugly part. We have to parse the HTML and extract the number of available results for the query, as well as information about each one of the retrieved images. After analyzing the HTML I got from Google, I managed to find a recurring pattern that tells you accurately where each of these interesting bits of information is located in the HTML. Needless to say, if Google changes the format of the returned HTML, the parsing will fail!!! Of course, I relied on regular expressions to parse the text. Following are the different patterns used in the API:
[Editor comment: Line breaks used to avoid scrolling.]
1. Regex imagesRegex = new Regex(
       @"(\x3Ca\s+href=/imgres\x3Fimgurl=)" +
       @"(?<imgurl>http[^&>]*)([>&]{1})" +
       @"([^>]*)(>{1})(<img\ssrc\x3D)" +
       @"(""{0,1})(?<images>/images[^""\s>]*)" +
       @"([\s])+(width=)(?<width>[0-9,]*)" +
       @"\s+(height=)(?<height>[0-9,]*)");
This pattern is used to retrieve information about each image. The URL of the original image is captured into the "imgurl" group, the URL of the thumbnail is captured into the "images" group and the width and height of the thumbnail image are captured in the "width" and "height" groups respectively.
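As an illustration of how the named groups are read, here is a small sketch that runs the same pattern against a made-up HTML fragment. The fragment, the ImagesRegexDemo class and its members are hypothetical; real Google output at the time may have looked different.

```csharp
using System.Text.RegularExpressions;

public static class ImagesRegexDemo
{
    // Same pattern as above, split across lines at different points.
    public static readonly Regex ImagesRegex = new Regex(
        @"(\x3Ca\s+href=/imgres\x3Fimgurl=)(?<imgurl>http[^&>]*)([>&]{1})" +
        @"([^>]*)(>{1})(<img\ssrc\x3D)(""{0,1})(?<images>/images[^""\s>]*)" +
        @"([\s])+(width=)(?<width>[0-9,]*)\s+(height=)(?<height>[0-9,]*)");

    // Made-up fragment shaped like the markup the pattern expects.
    public const string SampleHtml =
        "<a href=/imgres?imgurl=http://example.com/cake.jpg" +
        "&imgrefurl=http://example.com/page.html>" +
        "<img src=/images?q=tbn:abc123 width=120 height=90>";

    // Returns the first image match; check Success before reading groups.
    public static Match FirstImage(string html)
    {
        return ImagesRegex.Match(html);
    }
}
```

After calling FirstImage(SampleHtml), the original image URL is available as match.Groups["imgurl"].Value and the thumbnail path as match.Groups["images"].Value. To walk over every image on a page, Regex.Matches can be used instead of Match.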
2. Regex dataRegex = new Regex(@"([^>]*)(>)\s{0,1}(<br>){0,1}\s{0,1}" +
@"(?<width>[0-9,]*)\s+x\s+(?<height>[0-9,]*)" +
@"\s+pixels\s+-\s+(?<size>[0-9,]*)(k)");
This pattern is used to retrieve additional information about each image - the original images' widths, heights and sizes (in groups "width", "height" and "size" respectively). I didn't find a way to use the same pattern for all the images' information - I guess there is but I gave up searching after a while...
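The captured values may contain thousands separators (the groups accept digits and commas), so they need a little cleanup before being used as numbers. The sketch below shows one way to do that. The ImageDataDemo class, its method names and the sample text in the usage note are assumptions made for illustration.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class ImageDataDemo
{
    // Same pattern as above.
    public static readonly Regex DataRegex = new Regex(
        @"([^>]*)(>)\s{0,1}(<br>){0,1}\s{0,1}" +
        @"(?<width>[0-9,]*)\s+x\s+(?<height>[0-9,]*)" +
        @"\s+pixels\s+-\s+(?<size>[0-9,]*)(k)");

    // Converts a captured number such as "1,024" to an int.
    public static int ParseNumber(string captured)
    {
        return int.Parse(captured.Replace(",", ""), CultureInfo.InvariantCulture);
    }

    // Returns { width, height, sizeInKilobytes }, or null when the text
    // does not match or one of the numbers was captured as empty.
    public static int[] ParseDimensions(string html)
    {
        Match m = DataRegex.Match(html);
        if (!m.Success ||
            m.Groups["width"].Length == 0 ||
            m.Groups["height"].Length == 0 ||
            m.Groups["size"].Length == 0)
        {
            return null;
        }
        return new int[]
        {
            ParseNumber(m.Groups["width"].Value),
            ParseNumber(m.Groups["height"].Value),
            ParseNumber(m.Groups["size"].Value)
        };
    }
}
```

For a made-up input such as "cake.jpg<br>1,024 x 768 pixels - 75k", ParseDimensions yields a width of 1024, a height of 768 and a size of 75.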
3. Regex totalResultsRegex = new Regex(@"(?<lastResult>" +
       @"[0-9,]*)(\s*</b>\s*)(of)(\s)" +
       @"+(about){0,1}(\s*<b>\s*)" +
       @"(?<totalResultsAvailable>[0-9,]*)");
This pattern is used to retrieve the total number of results available for the query (can be found on the upper right portion of the HTML when you look at the result of a query). I have also extracted the last result index - to find when there are no more results.
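Here is a sketch of how the two captured numbers can be used to decide whether more results remain. It uses a variant of the pattern in which the "of" group is closed explicitly, because .NET rejects a pattern with unbalanced parentheses. The TotalsDemo class, its method names and the sample header text in the usage note are assumptions.

```csharp
using System.Globalization;
using System.Text.RegularExpressions;

public static class TotalsDemo
{
    // Variant of the pattern with balanced parentheses around "of".
    public static readonly Regex TotalResultsRegex = new Regex(
        @"(?<lastResult>[0-9,]*)(\s*</b>\s*)(of)(\s)+(about){0,1}" +
        @"(\s*<b>\s*)(?<totalResultsAvailable>[0-9,]*)");

    // Strips thousands separators and parses the remaining digits.
    private static int ParseNumber(string captured)
    {
        return int.Parse(captured.Replace(",", ""), CultureInfo.InvariantCulture);
    }

    // Returns true when the index of the last result on the page is lower
    // than the total reported by Google; false when the header is not
    // found or a number is missing.
    public static bool HasMoreResults(string html)
    {
        Match m = TotalResultsRegex.Match(html);
        if (!m.Success ||
            m.Groups["lastResult"].Length == 0 ||
            m.Groups["totalResultsAvailable"].Length == 0)
        {
            return false;
        }
        int last = ParseNumber(m.Groups["lastResult"].Value);
        int total = ParseNumber(m.Groups["totalResultsAvailable"].Value);
        return last < total;
    }
}
```

With a made-up header such as "Results <b>1</b> - <b>20</b> of about <b>1,230,000</b>", the last result is 20 and the total is 1,230,000, so HasMoreResults returns true. Keep in mind that the total is only an estimate whenever the header says "about".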
Since I'm not a regular expressions pro, if you want more information about them and a better understanding of how the patterns work, I suggest reading the following: Regex Language Reference, Introduction to Regular Expressions, and of course you must check out The Regulator.
Retrieving more than 20 results is straightforward. Once you know the "start=" portion of the URL, you can run the queries in a loop until you reach the requested number of results.
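The loop boils down to computing the successive "start=" values. The sketch below assumes that Google returns at most 20 results per request, as observed earlier, and uses a hypothetical Paging helper that works with 0-based offsets.

```csharp
using System;
using System.Collections.Generic;

public static class Paging
{
    // Assumption: each request returns at most 20 image results.
    public const int PageSize = 20;

    // Returns the 0-based "start=" values needed to cover resultsRequested
    // results, beginning at the 0-based offset firstIndex.
    public static List<int> GetStartOffsets(int firstIndex, int resultsRequested)
    {
        if (firstIndex < 0)
        {
            throw new ArgumentOutOfRangeException("firstIndex");
        }
        if (resultsRequested < 1)
        {
            throw new ArgumentOutOfRangeException("resultsRequested");
        }

        List<int> offsets = new List<int>();
        int end = firstIndex + resultsRequested;
        for (int start = firstIndex; start < end; start += PageSize)
        {
            offsets.Add(start);
        }
        return offsets;
    }
}
```

Requesting 50 results from offset 0 gives the offsets 0, 20 and 40. The last page still returns up to 20 results, so the caller has to discard whatever exceeds the requested count, and should stop early once a page reports that the last result has been reached.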
Retrieving more than 1000 results is another matter. Hmmm, sorry. I didn't find a way to work around that one. I assume, however, that in virtually all applications 1000 results should be more than enough. Besides, by the time you get to the last results, most of them are totally irrelevant to the actual query anyway...
I wish to thank Roy Osherove and his Regulator. I have used regular expressions a few times in the past, but mostly with very simple patterns. The expressions used here are by far the most complicated ones I have ever written, and I wouldn't have managed to write and test them without the help of "The Regulator". Which brings me to the apologies - there is a good chance that the patterns I'm using could be simplified. If you find a way to simplify them (with no performance penalty), then please let me know, and I'll update the code. Finally, I would like to thank "pedrito68", who has provided me with very useful comments (and code) based on the first version of this API, which I have added in the current version (see History section).
The Google Image Search API is essentially a tool that you can use if you need to perform an image search against Google programmatically. Since it parses the HTML returned by Google, if the format of this HTML changes, the library's implementation will have to change accordingly. The implementation is rather simple. It shows a simple example of how to send a URL to a web server (using the HttpWebRequest object) and retrieve the HTML returned by the web server. It also uses regular expressions (using the System.Text.RegularExpressions.Regex class) with some pretty complicated patterns to extract the interesting data from the HTML. Finally, the demo application shows how to use the API.
On a personal note - I have been using this API for the past few days to run over 40,000 single-word queries. It has proven to be very accurate, and the regular expressions never broke. One very interesting feature is that it does not suffer from any quantity limit, unlike the regular SDKs. For instance, Google's web search API won't let you run more than 1000 queries with the same key in a single day (24 hours). Similarly, Yahoo! has a limit of 5000 queries per day. It might be good to adapt this API to provide regular search capabilities and work around Google's 1000-queries-per-day limitation, or adapt it to Yahoo! and work around their limitation...
If this article was useful to you, please don't forget to vote. I'd like it to get out of the 'unedited' section as soon as possible. Also, you're welcome to visit my blog.
SafeSearch.
SearchService.LoadRegexStrings() function. I've added a button in the sample application that does just that, so it's easy to see how it works. I put the regexes in a text file and not directly inside the config file, in order to keep the regexes simple and not have to make them even more complicated to comply with the XML format.
Last Updated: 11 Mar 2007. Editor: Chris Maunder.
Copyright 2005 by Ilan Assayag. Everything else Copyright © CodeProject, 1999-2009.