
Introduction
As part of the research for my thesis, I had to run a lot of image search queries against the most popular search engines. The Yahoo! guys provide the Yahoo! SDK API, which also supports image search and was really useful to me. Google, however, for some obscure reason, provides an API only for the regular search engine and nothing for image search. A few weeks ago, I came across a very simple implementation of a Google web translation service by Peter Bromberg. So I thought, why not do something similar for Google image search? Well, I did that, and here it is. The regular expressions used are far more complex than the one used in the translation service, but in the end, the approach is basically the same.
Note: The article has been translated to Chinese, and the translation is available here.
What's in the source files?
The source files include two projects. The Ilan.Google.API project includes a DLL which can be used to query Google image search programmatically. The Ilan.Test.Google.API project includes a simple application that enables you to run a query and display all the resulting images dynamically on the form. When you double-click a thumbnail the full original image is displayed. This application is aimed at showing how simple it is to use the API. I wouldn't recommend it as a real searching tool (at least as is) because:
- It fetches all images (thumbnails) on a single thread. Although it lets you view the full image before all the thumbnails have been downloaded, a "real" application would download several thumbnails at the same time to significantly boost performance.
- Only simple queries are supported (space-separated words). Special characters are not handled. If you need to support more complicated queries, you'll have to parse the query and transform it to a format that complies with the URL.
QuickStart - How do I use the API?
If you're not interested in how it works, but just want to use the library, this section is especially for you :-)
- Add a reference to the Ilan.Google.API library in your project.
- Add the using directive:
using Ilan.Google.API.ImageSearch;
- When you need to run a query, make sure it conforms to the URL format supported by Google. To do this, I suggest you try the query on Google Image Search and look at how they build the URL. For instance, if you need to support only simple queries of space-separated words, you just need to transform the query into a list of words separated by the plus (+) sign. For example, the query "apple cake" must be transformed to "apple+cake". Notice that a run of several space characters must be transformed to a single + sign, so I wouldn't recommend a simple string.Replace call. You could use the regular expression that I have used in my demo project:
string formattedQuery = Regex.Replace(nonFormattedQuery, @"\s{1,}", "+");
- To run the query (say for the first 50 results), use the SearchService.SearchImages method:
SearchResponse response = SearchService.SearchImages(formattedQuery, 1, 50, true);
- The response object holds *all* of the first 50 results for the given query (or fewer if there are no more results). For example, you can retrieve the URL of the first image through the response.Results array:
string firstImageUrl = response.Results[0].ImageUrl;
- The parameters for the SearchService.SearchImages method are:
- (string) query - The query to be sent.
- (int) startPosition - The index of the first item to be retrieved (must be positive).
- (int) resultsRequested - The number of results to be retrieved (must be between 1 and (1000 - startPosition)).
- (bool) filterSimilarResults - Set to 'true' if you want Google to automatically omit similar entries. Set to 'false' if you want to retrieve every matching image.
- [optional: (SafeSearchFiltering) safeSearch - Indicates what level of SafeSearch to use.]
- Well, I think that should be it. Just remember that Google does not return results beyond the first 1000 results for a query, so if you're trying to get a result that exceeds the first 1000 I'll throw you an exception...
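The query-formatting step above can be sketched as a small helper that also deals with special characters. This is only an illustration: QueryFormatter is a hypothetical class, not part of the Ilan.Google.API library, and it assumes the library sends the query string to Google as-is, without encoding it a second time.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper (assumption): not part of the Ilan.Google.API library.
public static class QueryFormatter
{
    // Trims the query, URL-encodes each word, and joins the words with '+'.
    public static string Format(string rawQuery)
    {
        if (rawQuery == null) throw new ArgumentNullException("rawQuery");

        // Split on any run of whitespace so repeated spaces never yield "++".
        string[] words = Regex.Split(rawQuery.Trim(), @"\s+");

        for (int i = 0; i < words.Length; i++)
        {
            // Escape characters such as '&' or '#' that would otherwise break the URL.
            words[i] = Uri.EscapeDataString(words[i]);
        }

        return string.Join("+", words);
    }
}
```

With this helper, "apple   cake" becomes "apple+cake" and "  fish & chips " becomes "fish+%26+chips". The result would then be passed as the first argument of SearchService.SearchImages.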
And if you want to know why and how it works...
... then I hope this section will be of help to you.
Returned objects
The SearchResponse and SearchResult classes are pretty straightforward. A query returns one SearchResponse, which holds the total number of available results for the query as well as an array of SearchResult objects, each representing a separate image returned by Google. The SearchResult objects hold the URL of the thumbnail of the image (located somewhere at Google) and the URL of the actual image (at its source).
Building the query URL
After sending a few queries to Google using Google Image Search, it turns out that you can run a simple query for "apple cake" with this URL.
Digging a little further, you can fetch results at a certain position by adding "&start=". So, if you want to fetch results from 21 to 40 (i.e. the second page if you were using their web site), the URL should be this.
Note that the index of the images is 0-based, so to start with the 21st result you must mention "&start=20". Then, I found out that there is a default filter that omits results if they resemble the previous results. If you want to disable this filter you need to add "&filter=0". A quick test will show you that "&filter=1" turns the filter on. To see the effect of the filter I suggest you run the following two queries, which return the results starting at result nr. 900:
Finally, pedrito68 pointed out to me that you can also choose the SafeSearch mode by adding "&safe=...". Google's SafeSearch blocks web pages containing explicit sexual content from appearing in search results. There are three options: "active" (filter both explicit text and explicit images), "moderate" (filter explicit images only - default behavior) and "off" (do not filter the search results).
I have tried to find out a way to define the number of returned results, but didn't succeed. I keep getting 20 results at a time. More on this later...
So, to sum up the querying part, building the URL for a query, given the query, start position and filter can be done as follows:
string requestUrl =
    string.Format("http://images.google.com/images?" +
        "q={0}&start={1}&filter={2}&safe={3}",
        query, startPosition,
        filterSimilarResults ? "1" : "0",
        // Google expects the lower-case values "active", "moderate" or "off".
        SafeSearchFiltering.Moderate.ToString().ToLower());
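As a self-contained illustration of the URL construction, here is a minimal sketch. SafeSearchMode is a local stand-in for the library's SafeSearchFiltering enumeration (its exact definition is an assumption), RequestUrlBuilder is a hypothetical name, and the value is lower-cased on the assumption that Google expects the "active", "moderate" and "off" spellings listed above.

```csharp
using System;
using System.Globalization;

// Local stand-in for the library's SafeSearchFiltering enumeration (assumed values).
public enum SafeSearchMode { Active, Moderate, Off }

// Hypothetical helper (assumption): not the library's actual implementation.
public static class RequestUrlBuilder
{
    // 'query' must already be formatted (e.g. "apple+cake");
    // 'start' is the 0-based index that Google expects in "&start=".
    public static string Build(string query, int start,
        bool filterSimilarResults, SafeSearchMode safeSearch)
    {
        return string.Format(
            CultureInfo.InvariantCulture,
            "http://images.google.com/images?q={0}&start={1}&filter={2}&safe={3}",
            query,
            start,
            filterSimilarResults ? "1" : "0",
            safeSearch.ToString().ToLowerInvariant());
    }
}
```

For example, Build("apple+cake", 20, false, SafeSearchMode.Moderate) yields "http://images.google.com/images?q=apple+cake&start=20&filter=0&safe=moderate".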
Sending the query and retrieving the result
Sending the query and retrieving the HTML file returned is rather simple and pretty common:
// Requires: using System.IO; using System.Net;
HttpWebRequest request =
    (HttpWebRequest)WebRequest.Create(requestUrl);
string resultPage = string.Empty;
// Each "using" block disposes its resource even if reading fails.
using (HttpWebResponse httpWebResponse =
    (HttpWebResponse)request.GetResponse())
{
    using (Stream responseStream =
        httpWebResponse.GetResponseStream())
    {
        using (StreamReader reader =
            new StreamReader(responseStream))
        {
            // Read the whole HTML page into memory.
            resultPage = reader.ReadToEnd();
        }
    }
}
Extracting information from the retrieved HTML
Here comes the ugly part. We have to parse the HTML and extract the number of available results for the query as well as information for each of the retrieved images. After analyzing the HTML I got from Google, I managed to find a recurring pattern that reliably tells you where each of these bits of information is located in the HTML. Needless to say, if Google changes the format of the returned HTML, the parsing will fail! Of course, I relied on regular expressions to parse the text. Following are the different patterns used in the API:
[Editor comment: Line breaks used to avoid scrolling.]
1. Regex imagesRegex = new Regex(@"(\x3Ca\s+href=/imgres\" +
@"x3Fimgurl=)(?<imgurl>http" +
@"[^&>]*)([>&]{1})" +
@"([^>]*)(>{1})(<img\ssrc\" +
@"x3D)(""{0,1})(?<images>/images" +
@"[^""\s>]*)([\s])+(width=)" +
@"(?<width>[0-9,]*)\s+(height=)" +
@"(?<height>[0-9,]*)");
This pattern is used to retrieve information about each image. The URL of the original image is captured into the "imgurl" group, the URL of the thumbnail is captured into the "images" group and the width and height of the thumbnail image are captured in the "width" and "height" groups respectively.
2. Regex dataRegex = new Regex(@"([^>]*)(>)\s{0,1}(<br>){0,1}\s{0,1}" +
@"(?<width>[0-9,]*)\s+x\s+(?<height>[0-9,]*)" +
@"\s+pixels\s+-\s+(?<size>[0-9,]*)(k)");
This pattern is used to retrieve additional information about each image - the original images' widths, heights and sizes (in groups "width", "height" and "size" respectively). I didn't find a way to use the same pattern for all the images' information - I guess there is but I gave up searching after a while...
3. Regex totalResultsRegex = new Regex(@"(?<lastResult>" +
    @"[0-9,]*)(\s*</b>\s*)(of)(\s)" +
    @"+(about){0,1}(\s*<b>\s*)" +
    @"(?<totalResultsAvailable>[0-9,]*)");
This pattern is used to retrieve the total number of results available for the query (can be found on the upper right portion of the HTML when you look at the result of a query). I have also extracted the last result index - to find when there are no more results.
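To show how the named groups are read back, here is a minimal sketch built around the third pattern. Two things are assumptions: the "of" group is written with balanced parentheses, and the sample HTML fragment in the usage note is made up to resemble the result header, not copied from a real Google page. TotalsParser is a hypothetical name.

```csharp
using System;
using System.Globalization;
using System.Text.RegularExpressions;

// Hypothetical helper (assumption): illustrates reading the named groups only.
public static class TotalsParser
{
    private static readonly Regex TotalResultsRegex = new Regex(
        @"(?<lastResult>[0-9,]*)(\s*</b>\s*)(of)(\s)+(about){0,1}" +
        @"(\s*<b>\s*)(?<totalResultsAvailable>[0-9,]*)");

    // Returns false when the pattern is not found or the numbers cannot be parsed.
    public static bool TryParse(string html, out int lastResult, out int totalResults)
    {
        lastResult = 0;
        totalResults = 0;

        Match match = TotalResultsRegex.Match(html);
        if (!match.Success) return false;

        // The captured numbers may contain thousands separators, e.g. "1,230,000".
        const NumberStyles style = NumberStyles.AllowThousands;
        return int.TryParse(match.Groups["lastResult"].Value, style,
                   CultureInfo.InvariantCulture, out lastResult)
            && int.TryParse(match.Groups["totalResultsAvailable"].Value, style,
                   CultureInfo.InvariantCulture, out totalResults);
    }
}
```

Given the made-up fragment "Results <b>1</b> - <b>20</b> of about <b>1,230,000</b> for <b>apple cake</b>", TryParse reports 20 as the last result and 1230000 as the total. Returning false instead of throwing makes it easy to detect that the HTML format has changed.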
Since I'm not a regular expressions pro, if you want more information and a better understanding of how the patterns work, I suggest reading the following: Regex Language Reference, Introduction to Regular Expressions, and of course you must check out The Regulator.
What about the 20-results-per-query limit?
That's straightforward. Once you know the "start=" portion of the URL, you can run the queries in a loop until you reach the requested number of results.
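The loop can be sketched as follows. This is a simplified illustration, not the library's code: it only computes the "start=" values to request, assumes a fixed page size of 20, and takes the 0-based index that Google expects. Paging and PageStarts are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (assumption): computes which "start=" values to request.
public static class Paging
{
    // Google returned 20 images per page at the time of writing.
    public const int PageSize = 20;

    // firstIndex is the 0-based index of the first wanted result.
    public static List<int> PageStarts(int firstIndex, int resultsRequested)
    {
        List<int> starts = new List<int>();

        // One request per block of 20 results until enough have been covered.
        for (int fetched = 0; fetched < resultsRequested; fetched += PageSize)
        {
            starts.Add(firstIndex + fetched);
        }

        return starts;
    }
}
```

Requesting 50 results from index 0 gives the start values 0, 20 and 40. The last page then returns more results than needed, so the caller has to trim the surplus, and it should stop early once a page reports no further results.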
And the 1000 results limit?
Hmmm, sorry. I didn't find a way to work around that one. I assume, however, that in virtually all applications 1000 results should be more than enough. Besides, by the time you get to the last results, most of them are totally irrelevant to the actual query anyway...
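Since the limit cannot be bypassed, a caller can at least check its arguments before sending anything. The sketch below follows the parameter rules given in the QuickStart section (startPosition positive, resultsRequested between 1 and 1000 - startPosition). RangeCheck is a hypothetical name, and the exception type thrown by the real library is not stated in the article.

```csharp
using System;

// Hypothetical helper (assumption): mirrors the documented parameter rules.
public static class RangeCheck
{
    // Google does not return results beyond the first 1000.
    public const int MaxResults = 1000;

    public static void Validate(int startPosition, int resultsRequested)
    {
        if (startPosition < 1)
        {
            throw new ArgumentOutOfRangeException(
                "startPosition", "The start position must be positive.");
        }

        if (resultsRequested < 1 || resultsRequested > MaxResults - startPosition)
        {
            throw new ArgumentOutOfRangeException(
                "resultsRequested",
                "The number of results must be between 1 and (1000 - startPosition).");
        }
    }
}
```

Validate(1, 50) passes, while Validate(990, 50) throws because the request would reach past the first 1000 results.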
Thanks and apologies
I wish to thank Roy Osherove and his Regulator. I have used regular expressions a few times in the past, but mostly with very simple patterns. The expressions used here are by far the most complicated ones ever written by me, and I wouldn't have tested it and successfully written it without the help of "The Regulator". Which brings me to the apologies - there is a good chance that the patterns I'm using could be simplified. If you find a way to simplify it (with no performance penalty), then please let me know, and I'll update the code. Finally, I would like to thank "pedrito68", who has provided me with very useful comments (and code) based on the first version of this API, which I have added in the current version (see History section).
Conclusions
The Google Image Search API is essentially a tool that you can use if you need to perform an image search against Google programmatically. Since it parses the HTML returned by Google, if the format of this HTML changes, the library's implementation will have to change accordingly. The implementation is rather simple. It shows a simple example of how to send a URL to a web server (using the HttpWebRequest object) and retrieve the HTML returned by the web server. It also uses regular expressions (using the System.Text.RegularExpressions.Regex class) with some pretty complicated patterns to extract the interesting data from the HTML. Finally, the demo application shows how to use the API.
On a personal note - I have been using this API for the past few days to run over 40,000 single-word queries. It has proven to be very accurate, and the regular expressions never broke. One very interesting feature is that, unlike the regular SDKs, it does not suffer from any quantity limit. For instance, Google's web search API won't let you run more than 1000 queries with the same key in a single day (24 hours). Similarly, Yahoo! has a limit of 5000 queries per day. It might be good to adapt this API to provide regular search capabilities and work around Google's 1000-queries-per-day limitation, or adapt it to Yahoo! and work around their limitation...
If this article was useful to you, please don't forget to vote. I'd like it to get out of the 'unedited' section as soon as possible. Also, you're welcome to visit my blog.
History
- October 5th, 2005 - First version.
- October 9th, 2005 - Added changes recommended by pedrito68 and some bug fixes:
- Extract more information about each image: file size/width/height/name/extension, thumbnail width/height.
- Get thumbnails on separate threads.
- Double-clicking a thumbnail pops up the full image.
- Support for SafeSearch.
- Bug fix - when a query has only a few results and more are requested, we get the same result multiple times. For example, if a query returns only three images and we request 100, we would get 15 results (the same 3 results repeated 5 times).
- January 4th, 2006 – Updated the second regular expression due to changes in the format of the HTML returned by Google.
- December 27th, 2006 – Updated source code:
- Works under .NET 2.0 instead of .NET 1.1 (and VS2005 accordingly).
- The three regular expressions used by the API are now loaded from an external text file, whose name is read from the config file. So now you can change the regular expressions without needing to recompile or even rerun the application; just update the regex in the text file. To reload the regular expressions, call the new SearchService.LoadRegexStrings() function. I've added a button in the sample application that does just that, so it's easy to see how it works. I put the regexes in a text file rather than directly inside the config file in order to keep them simple, without having to complicate them further to comply with the XML format.
- January 28, 2007 - Updated source code.
- Support for the new format of the results returned by Google Image Search
- Thumbnails are downloaded on separate threads (test application)
- Better UI thread handling (test application)
- March 11, 2007
- Updated to comply with new format of results returned by Google Image Search
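The external regex file introduced in the December 2006 update can be illustrated with a small loader. This is a sketch under stated assumptions, not the library's LoadRegexStrings implementation: the real file layout is not described in the article, so one pattern per non-empty line is assumed, and RegexFileLoader is a hypothetical name. Resolving a relative file name against the application's base directory, rather than the current working directory, is a design choice that matters for hosts such as web applications, where the working directory is not the application folder.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper (assumption): the real file layout may differ.
public static class RegexFileLoader
{
    // Resolves a relative file name against the application's base directory
    // instead of the process's current working directory.
    public static string ResolvePath(string fileName)
    {
        if (Path.IsPathRooted(fileName)) return fileName;
        return Path.Combine(AppDomain.CurrentDomain.BaseDirectory, fileName);
    }

    // Assumes one pattern per line; blank lines are skipped.
    // Note that trimming would drop significant leading or trailing whitespace.
    public static List<string> LoadPatterns(string fileName)
    {
        List<string> patterns = new List<string>();

        foreach (string line in File.ReadAllLines(ResolvePath(fileName)))
        {
            string trimmed = line.Trim();
            if (trimmed.Length > 0) patterns.Add(trimmed);
        }

        return patterns;
    }
}
```

A caller would read the file name from its configuration and pass it to LoadPatterns, then build one Regex per returned string.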
|
Cool API, I just needed proxy configuration:
SearchService, line 125:
if (UseProxy)
{
    request.Proxy = new WebProxy(adresse, port);
    request.Proxy.Credentials = new NetworkCredential(login, pwd, domain);
}
|
Something is wrong with the library now because I can't get any images. The program starts, I send a query, and after some time it throws an exception: "Unable to connect to the remote server".
Am I doing something wrong? Do I need to take any additional steps? Or is there something wrong with the code?
Thanks anyway, the application looks really nice and useful.
|
Could you provide more details of how you're trying to use it? It works for me, so I assume there is something wrong with your configuration.
|
The Debugger keeps complaining about the search service not being initialized properly. I tried to use the following code:
Dim URL As String
Dim FaceSearch As SearchService
Dim Response As New SearchResponse
Response = FaceSearch.SearchImages(URL, 1, 5, True)
What am I doing wrong?
|
I'm sorry to read that. Your code seems OK, assuming you are putting a valid URL in the URL variable. Did you try running the sample application on the same machine?
|
Thanks a lot for the quick reply. The sample application is running properly on the same PC. I am trying to use VB since I haven't done any C# programming so far.
Here is my VB Declaration:
Imports Ilan.Google.API.ImageSearch
Dim ImgSearch As SearchService
Dim Response As New SearchResponse
Dim ResultSet As SearchResult
Here's the code for querying. I am testing it with a hardcoded query.
URL = "http://images.google.com/images?q=apple+cake&start=900&filter=0"
Response = ImgSearch.SearchImages(URL, 1, 5, True)
MsgBox(Response.TotalResultsAvailable)
I also tried
Response = SearchService.SearchImages(URL, 1, 5, True)
I will try to play around with it a bit more. The API is great, thanks a lot.
|
OK, now I see your problem. The URL you are using contains the "start" and the "filter" fields. I'm taking care of these two fields as part of the API (second and fourth parameters of the SearchImages method). So when you add them, you end up sending the same fields twice (and what's more, with different values). Also, you have written the complete URL, including the first portion of it, which I'm also taking care of.
What you need to provide is ONLY the query, so in your case you should only provide the string: "apple+cake"
So your code should read:
Dim Response As New SearchResponse
Dim query As String
query = "apple+cake"
Response = SearchService.SearchImages(query, 1, 5, True)
Following is a description of each of the parameters of the SearchImages method, for your convenience:
query - The query to be sent.
startPosition - The index of the first item to be retrieved (must be positive).
resultsRequested - The number of results to be retrieved (must be between 1 and (1000 - startPosition)).
filterSimilarResults - Set to 'true' if you want Google to automatically omit similar entries. Set to 'false' if you want to retrieve every matching image.
safeSearch - Enumeration, indicates what level of SafeSearch to use. Values are: Active, Moderate or Off. If not supplied, the default is Moderate (which is Google's default in IE).
|
Thanks a lot, it works after the changes. Again, this is a great API; the Google AJAX APIs are not this flexible.
|
Hey Ilan,
I used your API successfully in my own C# search application, which uses various sources (flickr, yahoo, live, youtube - and your Google solution).
However, I would like to add web search next to image search on Google as well. And... I have no idea how these regular expressions work.
Is there a chance you will be creating a similar API for web search? I don't want to use Google's SOAP API because of its limitations.
Thanks in advance. With kind regards, Roger
|
Hello Roger,
I'm happy this API was helpful. I don't have plans of adding websearch anytime soon. I think that, in general, using the published API is healthier (i.e. changes in the returned HTML format won't break your system). I only wish there was such an API for image search as well. As for the limitations, the main limitation is the amount of queries per day. If you need more, you're probably doing something commercial and should contact Google to avoid legal issues. In any case - it's easy to work around it by generating a set of keys. I think it's easier to manage a set of keys rather than deal with a broken program due to changes in the HTML format. The former requires a one-time effort. The latter requires every once in a while very urgent dealing with complicated regex... Anyway, if in the end I do decide to add such a support, I'll make sure to update the article accordingly.
Thanks,
Ilan
|
Is there a chance you could look at how to do a search in a specific language? Right now it returns the English results. But if you add the query string "hl=nl", it should return the Dutch results. I can't get this to work. In your code you have this piece:
string requestUri = string.Format(
    "http://images.google.com/images?q={0}&ndsp={1}&start={2}&filter={3}&safe={4}",
    query, RESULTS_PER_QUERY.ToString(), (startPosition + i).ToString(),
    (filterSimilarResults) ? "1" : "0", safeSearchStr);
And I add this piece to it.
requestUri += "&hl=nl";
After compiling I get the error message
Parsing of query ok failed - collections count mismatch
Any ideas how I can get this to work? Is there anything in the regex that's not good?
Thanks in advance!
BRSF
|
It doesn't matter. Any query crashes. Some Dutch words: fiets (bike), tas (bag)
It seems that the Data Regex doesn't work in other languages. It does get the images, but not the data..
|
I'll try to take a look, but I hope it's not urgent, it might take me a while to find the time... I'll send another reply when I'm done. Ilan
|
Although I put the file in the location mentioned, and also in the root of my project, in the directory of the class using it, and in the same location as the DLL you provided, I keep getting this exception. What am I doing wrong?
I am using the DLL from the Bin directory... In the class using the DLL I added:
using System.Text.RegularExpressions;
using Ilan.Google.API.ImageSearch;
It's an ASP.NET 2.0 project, and the web.config has been adapted with the appSettings lines.
...
I tried your demo project and everything works fine there. One difference, mine is a web app.
Exception:
[FileNotFoundException: Could not find file 'C:\Program Files\Microsoft Visual Studio 8\Common7\IDE\imagesSearchRegex.txt'.]
   System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) +1971213
   System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy) +998
   System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share, Int32 bufferSize, FileOptions options) +115
   System.IO.StreamReader..ctor(String path, Encoding encoding, Boolean detectEncodingFromByteOrderMarks, Int32 bufferSize) +85
   System.IO.StreamReader..ctor(String path) +112
   Ilan.Google.API.ImageSearch.SearchService.LoadRegexStrings() +149
[Exception: Image Search API Could not load Regex file.]
   Ilan.Google.API.ImageSearch.SearchService.LoadRegexStrings() +600
   Ilan.Google.API.ImageSearch.SearchService..cctor() +38
[TypeInitializationException: The type initializer for 'Ilan.Google.API.ImageSearch.SearchService' threw an exception.]
   Ilan.Google.API.ImageSearch.SearchService.SearchImages(String query, Int32 startPosition, Int32 resultsRequested, Boolean filterSimilarResults) +0
   SearchGoogle.SearchByTags(String tags, Int32 pageNr) in c:\Projects\tags\App_Code\Photos\SearchGoogle.cs:22
   SearchPhotos.SearchByTags(String tags, Int32 pageNr) in c:\Projects\tags\App_Code\Photos\SearchPhotos.cs:97
   _Default.btnSearch_Click(Object sender, EventArgs e) in c:\Projects\tags\Default.aspx.cs:20
   System.Web.UI.WebControls.Button.OnClick(EventArgs e) +105
   System.Web.UI.WebControls.Button.RaisePostBackEvent(String eventArgument) +107
   System.Web.UI.WebControls.Button.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument) +7
   System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument) +11
   System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData) +33
   System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +5102
Thanx in advance... u did a great job.
|
Hi Friend,
I am using your code like this:
SafeSearchFiltering ss = SafeSearchFiltering.Moderate;
SearchResponse sr = SearchService.SearchImages("apple", 0, 3, true);
I got the following error:
System.TypeInitializationException was unhandled by user code
Message="The type initializer for 'Ilan.Google.API.ImageSearch.SearchService' threw an exception."
Source="Ilan.Google.API"
TypeName="Ilan.Google.API.ImageSearch.SearchService"
StackTrace:
   at Ilan.Google.API.ImageSearch.SearchService.SearchImages(String query, Int32 startPosition, Int32 resultsRequested, Boolean filterSimilarResults)
   at Movil_Movil.Page_Load(Object sender, EventArgs e) in e:\Proyecto\Movil.aspx.cs:line 24
   at System.Web.Util.CalliHelper.EventArgFunctionCaller(IntPtr fp, Object o, Object t, EventArgs e)
   at System.Web.Util.CalliEventHandlerDelegateProxy.Callback(Object sender, EventArgs e)
   at System.Web.UI.Control.OnLoad(EventArgs e)
   at System.Web.UI.MobileControls.MobilePage.OnLoad(EventArgs e)
   at System.Web.UI.Control.LoadRecursive()
   at System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint)
What's the problem? Am I doing something wrong?
Excellent
|
Hi there!
I want to use your neat API, but at the moment I can't get it running, because it complains about not finding the regex file, which in turn is referenced in the app.config file. My question now is: how do I use a DLL that needs to access such a config file from a new project?
Sorry, I am new to C# and not familiar with this concept in detail and what i found on the net so far wasn't answering my question.
cheers, stefan
|
Is there a way to get the full-size images, and not the thumbnailed ?!?!?!
Thanks !
knoledge is like a journey of millions miles, it start but with one question...
|
Nevermind...
PictureBox fullImage = new PictureBox();
fullImage.Image = image;
fullImage.Height = image.Height;
fullImage.Width = image.Width;
Didn't catch that double-click thing...
|
Just for completeness, in case other people have the same question: The original image's URL is available in the SearchResult.ImageUrl property. By double-clicking on a thumbnail in the test application you can see how the original image is loaded.
Note: In some cases, you may have a thumbnail, but the original image does not exist. This is because Google keeps a cache of the thumbnails, but the original site may have changed and the image may be gone since the last time Google's bot went through the site. I found it doesn't happen often, though.
Ilan Assayag
|
Just to tell others:
This API really works like Google.
Google allows searching a specific web site with the syntax site:www.website.com searchcriteria
So, to search images from, say, codeproject.com with your API :
site:codeproject.com google
And your image is the first result !!!
|
Hi there,
Again the API does not work. I think Google changed the structure of the results page less than two months after the last change. This certainly breaks the API. I hope you can update the patterns, and it would be great to discuss the way you write the patterns so that it becomes easier to update the API.
For now !!! No Google Image API 
Hope to solve this soon
Cheers, B.