Stats: 142.3K views, 67 bookmarked
Posted 20 Jul 2003
Comments and Discussions
|
Hi all,
I am a VC++ programmer. I made a webpage extractor using MSHTML::IHTMLDocument2Ptr, InternetOpenUrl and InternetReadFile, but before saving the data I have to navigate to the page with the Navigate2 function in order to retrieve the web links inside the webpage. Is there any function or trick that lets me extract the web links without navigating? It takes too much time on graphics-heavy websites.
Thanks...
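One approach, assuming the HTML has already been downloaded (for example with InternetOpenUrl/InternetReadFile, as you are doing), is to create a bare MSHTML document with CoCreateInstance, push the markup into it with IHTMLDocument2::write, and then walk the document's links collection; no Navigate2 call and no rendering are involved. The following is only a rough sketch of that idea (error handling trimmed; link with ole32.lib, oleaut32.lib and uuid.lib), not code from the article:

// Sketch: extract anchors from an already-downloaded HTML string without navigating.
// Assumes COM has been initialized with CoInitialize.
#include <windows.h>
#include <ocidl.h>
#include <mshtml.h>
#include <stdio.h>

void ExtractLinks(const wchar_t* html)
{
    IHTMLDocument2* pDoc = NULL;
    if (FAILED(CoCreateInstance(CLSID_HTMLDocument, NULL, CLSCTX_INPROC_SERVER,
                                IID_IHTMLDocument2, (void**)&pDoc)))
        return;

    // Some MSHTML versions want the bare document initialized before write().
    IPersistStreamInit* pInit = NULL;
    if (SUCCEEDED(pDoc->QueryInterface(IID_IPersistStreamInit, (void**)&pInit)))
    {
        pInit->InitNew();
        pInit->Release();
    }

    // IHTMLDocument2::write takes a SAFEARRAY of VARIANTs holding BSTRs.
    SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    VARIANT* pVar = NULL;
    SafeArrayAccessData(psa, (void**)&pVar);
    pVar->vt = VT_BSTR;
    pVar->bstrVal = SysAllocString(html);
    SafeArrayUnaccessData(psa);
    pDoc->write(psa);
    pDoc->close();

    // Walk the links collection (A and AREA elements carrying an href).
    IHTMLElementCollection* pLinks = NULL;
    if (SUCCEEDED(pDoc->get_links(&pLinks)) && pLinks)
    {
        long count = 0;
        pLinks->get_length(&count);
        for (long i = 0; i < count; ++i)
        {
            VARIANT idx; idx.vt = VT_I4; idx.lVal = i;
            VARIANT empty; VariantInit(&empty);
            IDispatch* pDisp = NULL;
            if (FAILED(pLinks->item(idx, empty, &pDisp)) || !pDisp) continue;

            IHTMLAnchorElement* pAnchor = NULL;
            if (SUCCEEDED(pDisp->QueryInterface(IID_IHTMLAnchorElement, (void**)&pAnchor)))
            {
                BSTR href = NULL;
                if (SUCCEEDED(pAnchor->get_href(&href)) && href)
                {
                    wprintf(L"%s\n", href);
                    SysFreeString(href);
                }
                pAnchor->Release();
            }
            pDisp->Release();
        }
        pLinks->Release();
    }

    SafeArrayDestroy(psa);   // also frees the BSTR stored inside the VARIANT
    pDoc->Release();
}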
|
Hi all,
Please, I need help with how to extract/read data/properties from a Tomcat server (using a Java servlet) with SSO (single sign-on).
It's urgent, please.
|
Hi,
I would like to know about website extractors. What techniques can I follow to build a website extractor?
|
I am new to crawlers. I want to write an application that will search the index pages of all the websites on the internet. Can anybody tell me where I can get a list of all the websites on the internet?
If not, what is the best approach that leading search engines use?
Satya
|
http://newyork.craigslist.org/mnh/eng/255305939.html
Search Engine Expert: Building Web 3.0 Social Networking People Search Engine, Do You Want To Be The King Of The Internet?
Company Description:
PeoplePeople.com: The Superior MySpace Alternative .....
PeoplePeople.com is combining proprietary Natural Language Extraction, Artificial Intelligence Algorithms and Information Integration logic to build a Social Networking Search Engine.
Using Natural Language Extraction tools, our programs are able to read English sentences and understand what they mean. PeoplePeople.com then extracts relevant pieces of information about people, such as the companies they work for and their job titles or a social networking page like a person's page on MySpace.
Artificial Intelligence Algorithms allow our computers to analyze a Web site and extract information based on an understanding of how the Web site is constructed. PeoplePeople.com can deduce that a specific paragraph describes a company, or a social networking page like a person's page on MySpace.
Position Purpose:
This person will work with the PeoplePeople.com Search Technology Team to develop the core search engine and web crawlers. This individual will be the search engineer on the design and implementation of a large scale crawling, processing and serving system. Tasks include implementing search algorithms, data mining, improving relevancy of search results, managing terabytes of data and scaling algorithms to work on very large data sets, and serving search results using a large network of Windows 2003 Servers.
Accountabilities:
This position is an integral part of PeoplePeople.com's core technology team involving the design, development and implementation of PeoplePeople.com's search engine: the crawling, indexing and ranking of billions of documents on the Internet. As such, this person will be expected to make a significant contribution to this effort by designing innovative technical solutions to this significant challenge.
Requirements/Qualities:
*Must have experience in building a search engine crawler and indexer
*Configuring crawlers and indexing content
*Must have the desire and commitment to build a leading-edge search technology
*Must have extensive programming experience in Microsoft C#.NET and SQL Server.
*Must have experience with search engine relevance and information retrieval techniques
*Must have a minimum of 5 years experience in software development in either an academic or corporate environment
*Must be able to communicate and work with both technical and non-technical people
This position is open to telecommuting, consulting or full time work.
Send your resume to searchengine@GeeWHIZConsulting.com, attention Leo Loiacono.
Best,
Leo
201-923-9595
|
Hi all,
Please, I need help: how can I extract data from an HTML table using C# or VB.NET?
It's urgent, please.
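The article works in C++, so here is the idea sketched in C++ to stay consistent; from C# or VB.NET the same MSHTML DOM walk is available through the mshtml COM interop assembly, or you can use a managed HTML parser instead. The sketch assumes the HTML has already been loaded into an IHTMLDocument2 (for example with IHTMLDocument2::write as discussed elsewhere in this thread); for row/column structure you would query IHTMLTable and IHTMLTableRow instead of dumping raw TD text. Names below are illustrative only:

// Sketch: print the text of every <TD> cell of an already-populated MSHTML document.
#include <windows.h>
#include <mshtml.h>
#include <stdio.h>

void DumpTableCells(IHTMLDocument2* pDoc)      // pDoc already holds the parsed HTML
{
    IHTMLDocument3* pDoc3 = NULL;
    if (FAILED(pDoc->QueryInterface(IID_IHTMLDocument3, (void**)&pDoc3)))
        return;

    IHTMLElementCollection* pCells = NULL;
    BSTR tag = SysAllocString(L"TD");
    pDoc3->getElementsByTagName(tag, &pCells);
    SysFreeString(tag);

    long n = 0;
    if (pCells) pCells->get_length(&n);
    for (long i = 0; i < n; ++i)
    {
        VARIANT idx; idx.vt = VT_I4; idx.lVal = i;
        VARIANT empty; VariantInit(&empty);
        IDispatch* pDisp = NULL;
        if (FAILED(pCells->item(idx, empty, &pDisp)) || !pDisp) continue;

        IHTMLElement* pCell = NULL;
        if (SUCCEEDED(pDisp->QueryInterface(IID_IHTMLElement, (void**)&pCell)))
        {
            BSTR text = NULL;
            if (SUCCEEDED(pCell->get_innerText(&text)) && text)
            {
                wprintf(L"%s\n", text);
                SysFreeString(text);
            }
            pCell->Release();
        }
        pDisp->Release();
    }
    if (pCells) pCells->Release();
    pDoc3->Release();
}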
|
Where can I get the .h and .lib files needed? I do not see them in my VS.
|
I have the Platform SDK that comes with VS.NET 2002, but I do not see winhttp.h and winhttp5.lib there. Maybe something else is needed?
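For reference, recent Platform SDK releases ship both the header and an import library; the library was called winhttp5.lib in the old WinHTTP 5.0 SDK and winhttp.lib in later SDKs. Assuming such an SDK is on the include/lib path, the setup is just:

// Sketch of the build setup, assuming a Platform SDK that includes WinHTTP.
#include <windows.h>
#include <winhttp.h>                  // from the SDK include directory

#pragma comment(lib, "winhttp.lib")   // newer SDKs; the WinHTTP 5.0 SDK used winhttp5.lib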
|
Can anybody send these two (winhttp.h and .lib) via e-mail? Please.
|
Hi,
I'm trying to read the refresh header.
I have a site where this code is included:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head >
<title>My Site</title>
<meta http-equiv="refresh" content="0; URL=http://www.somedide.com">
<meta name="keywords" content="test">
<meta name="My Website" content="My Website">
<meta name="robots" content="follow">
</head>
<body>
</body>
</html>
When I use WinHTTP I can't get the content of this site; I'm automatically forwarded to http://www.somedide.com.
I can't figure out how to stop the redirecting, because WinHttpSetOption seems not to work, or I'm doing something wrong.
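For what it is worth, WinHTTP by itself only follows HTTP 3xx redirects; a meta refresh tag like the one above is acted on by the browser or the WebBrowser/MSHTML layer, not by the HTTP stack. If it is an HTTP-level redirect you need to stop, automatic redirection can be switched off on the request handle before WinHttpSendRequest. A rough sketch (the host and path are placeholders, error handling omitted):

// Sketch: fetch a page with automatic HTTP redirects disabled so the original
// response, including any Location/Refresh headers, can be inspected.
#include <windows.h>
#include <winhttp.h>
#pragma comment(lib, "winhttp.lib")

void FetchWithoutRedirects()
{
    HINTERNET hSession = WinHttpOpen(L"MyAgent/1.0", WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                                     WINHTTP_NO_PROXY_NAME, WINHTTP_NO_PROXY_BYPASS, 0);
    HINTERNET hConnect = WinHttpConnect(hSession, L"www.example.com",
                                        INTERNET_DEFAULT_HTTP_PORT, 0);
    HINTERNET hRequest = WinHttpOpenRequest(hConnect, L"GET", L"/", NULL,
                                            WINHTTP_NO_REFERER,
                                            WINHTTP_DEFAULT_ACCEPT_TYPES, 0);

    // Turn off automatic redirect handling for this request.
    DWORD dwDisable = WINHTTP_DISABLE_REDIRECTS;
    WinHttpSetOption(hRequest, WINHTTP_OPTION_DISABLE_FEATURE, &dwDisable, sizeof(dwDisable));

    WinHttpSendRequest(hRequest, WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                       WINHTTP_NO_REQUEST_DATA, 0, 0, 0);
    WinHttpReceiveResponse(hRequest, NULL);
    // ... WinHttpQueryHeaders / WinHttpReadData as usual ...

    WinHttpCloseHandle(hRequest);
    WinHttpCloseHandle(hConnect);
    WinHttpCloseHandle(hSession);
}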
|
Hey Guys
There seems to be loads of info on using this control to download files, but I need to upload files to a server with it, and I will have to provide a username and password.
Does anyone have any examples of how this is done?
I would really appreciate it!
Regards
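Assuming "this control" means the WinHTTP API the article uses, a rough sketch of an authenticated upload follows. The server, path, user name, password and the UploadBuffer name are placeholders, and a complete implementation would call WinHttpQueryAuthSchemes after the 401 rather than assume basic authentication:

// Sketch: upload a buffer over HTTP, supplying credentials when challenged.
#include <windows.h>
#include <winhttp.h>
#pragma comment(lib, "winhttp.lib")

BOOL UploadBuffer(const void* data, DWORD len)
{
    HINTERNET hSession = WinHttpOpen(L"MyAgent/1.0", WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                                     WINHTTP_NO_PROXY_NAME, WINHTTP_NO_PROXY_BYPASS, 0);
    HINTERNET hConnect = WinHttpConnect(hSession, L"www.example.com",
                                        INTERNET_DEFAULT_HTTP_PORT, 0);
    HINTERNET hRequest = WinHttpOpenRequest(hConnect, L"PUT", L"/upload/target.bin", NULL,
                                            WINHTTP_NO_REFERER,
                                            WINHTTP_DEFAULT_ACCEPT_TYPES, 0);

    BOOL ok = WinHttpSendRequest(hRequest, WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                                 (LPVOID)data, len, len, 0) &&
              WinHttpReceiveResponse(hRequest, NULL);

    DWORD status = 0, size = sizeof(status);
    WinHttpQueryHeaders(hRequest, WINHTTP_QUERY_STATUS_CODE | WINHTTP_QUERY_FLAG_NUMBER,
                        WINHTTP_HEADER_NAME_BY_INDEX, &status, &size, WINHTTP_NO_HEADER_INDEX);

    if (status == 401)   // challenged: attach credentials and send the body again
    {
        WinHttpSetCredentials(hRequest, WINHTTP_AUTH_TARGET_SERVER, WINHTTP_AUTH_SCHEME_BASIC,
                              L"username", L"password", NULL);
        ok = WinHttpSendRequest(hRequest, WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                                (LPVOID)data, len, len, 0) &&
             WinHttpReceiveResponse(hRequest, NULL);
    }

    WinHttpCloseHandle(hRequest);
    WinHttpCloseHandle(hConnect);
    WinHttpCloseHandle(hSession);
    return ok;
}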
|
You are calling SysFreeString() on the _bstr_t, which causes your code to crash when freeing the SAFEARRAY in the subsequent call.
If you are going to cut-and-paste the MSFT example from their docs, as this code is, then use a regular BSTR like they do.
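In other words, the ownership rule the MSDN sample follows looks roughly like this (a sketch, not the article's exact code): once the BSTR is stored in the SAFEARRAY's VARIANT, the SAFEARRAY owns it and SafeArrayDestroy frees it, so nothing else should free the same string.

// Sketch: populate a document with write() using a raw BSTR, freed exactly once.
void WriteHtml(IHTMLDocument2* pDoc, const wchar_t* html)
{
    BSTR bstr = SysAllocString(html);

    SAFEARRAY* psa = SafeArrayCreateVector(VT_VARIANT, 0, 1);
    VARIANT* pVar = NULL;
    SafeArrayAccessData(psa, (void**)&pVar);
    pVar->vt = VT_BSTR;
    pVar->bstrVal = bstr;        // ownership passes to the SAFEARRAY here
    SafeArrayUnaccessData(psa);

    pDoc->write(psa);
    pDoc->close();

    SafeArrayDestroy(psa);       // frees the VARIANT and the BSTR with it; do NOT also
                                 // call SysFreeString(bstr), and do not hand over a
                                 // _bstr_t's pointer, or its destructor frees it again.
}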
|
Obviously I am not promoting mindless copying and pasting; otherwise I could have provided the whole code. The whole point of that part was using the write method of IHTMLDocument to populate the object for any further manipulation, which the MSFT docs won't give you. The MSDN page for the write method says it "Writes one or more HTML expressions to a document in the specified window", which is misleading, because in fact it can simply be an object without a window. Why don't you simply handle the exception, and copy how to do that from the MSFT docs too?
And God said "Let there be light." But then the program crashed because he was trying to access the 'light' property of a NULL universe pointer
|
I am trying to download webpages from URLs containing https in VC++. Can we download pages from URLs containing https in any language?
|
I am trying to download webpages from URLs containing https in VC++. Can we download pages from URLs containing https?
|
Yes, you can: just use "https://" (no quotes) instead of "http://" in your request.
I'm not sure, but if the site certificate is in some state other than valid (expired, revoked, etc.) and you still want to get the pages, you will have to set some options regarding the "acceptance" of invalid certificates before you send the request.
Take a quick look at the MSDN documentation on WinHTTP.
If you don't need all the power of the raw WinHTTP C++ API, use the WinHTTP COM object from VC++.
Best Regards
RFL
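To make that concrete, here is a rough sketch of the WinHTTP calls (the host and path are placeholders; the security-flags option is only needed if you really want to accept a bad certificate):

// Sketch: GET a page over HTTPS with WinHTTP, optionally ignoring certificate errors.
#include <windows.h>
#include <winhttp.h>
#pragma comment(lib, "winhttp.lib")

void FetchHttps()
{
    HINTERNET hSession = WinHttpOpen(L"MyAgent/1.0", WINHTTP_ACCESS_TYPE_DEFAULT_PROXY,
                                     WINHTTP_NO_PROXY_NAME, WINHTTP_NO_PROXY_BYPASS, 0);
    HINTERNET hConnect = WinHttpConnect(hSession, L"www.example.com",
                                        INTERNET_DEFAULT_HTTPS_PORT, 0);
    HINTERNET hRequest = WinHttpOpenRequest(hConnect, L"GET", L"/", NULL,
                                            WINHTTP_NO_REFERER,
                                            WINHTTP_DEFAULT_ACCEPT_TYPES,
                                            WINHTTP_FLAG_SECURE);          // HTTPS

    // Only if you must accept expired/untrusted/mismatched certificates:
    DWORD dwFlags = SECURITY_FLAG_IGNORE_UNKNOWN_CA |
                    SECURITY_FLAG_IGNORE_CERT_DATE_INVALID |
                    SECURITY_FLAG_IGNORE_CERT_CN_INVALID |
                    SECURITY_FLAG_IGNORE_CERT_WRONG_USAGE;
    WinHttpSetOption(hRequest, WINHTTP_OPTION_SECURITY_FLAGS, &dwFlags, sizeof(dwFlags));

    if (WinHttpSendRequest(hRequest, WINHTTP_NO_ADDITIONAL_HEADERS, 0,
                           WINHTTP_NO_REQUEST_DATA, 0, 0, 0) &&
        WinHttpReceiveResponse(hRequest, NULL))
    {
        char buffer[4096];
        DWORD dwRead = 0;
        while (WinHttpReadData(hRequest, buffer, sizeof(buffer), &dwRead) && dwRead)
        {
            // ... append buffer[0..dwRead) to the downloaded page ...
        }
    }

    WinHttpCloseHandle(hRequest);
    WinHttpCloseHandle(hConnect);
    WinHttpCloseHandle(hSession);
}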
|
Hello,
Your article is interesting.
But I'm not sure it can solve all the problems you run into when doing massive crawling against thousands of different sites.
For example, by using WinHTTP you are relying on something better than WinInet, I agree, but you don't get a cache system or a cookie system; WinInet gives you access to the cache and cookies IE uses.
Another example: when you read a page using WinHTTP and then send it to MSHTML to access the DOM, what happens if the page is made of frames? What happens with the JavaScript code that can be executed in the page and sometimes generates large parts of it?
Believe me, in a large real-world application you have to stay close to what a browser like IE is able to do. For more and more sites, getting links from pages is not as easy as looking for anchors and hrefs. That solution was good 3 years ago; today you miss the whole thing if you don't load the frames, don't execute the JavaScript, and so on.
Unfortunately the Web is not just about HTML anymore, and this job is far more demanding today than it was. A good example is search engines: they were built just like you describe, loading pages and parsing anchors, and they evolved. Today a lot of pages are not indexed because search engines can't find links generated by JavaScript on a page, or because some pages rely on Flash menus. Some sites are not indexed at all.
However, your solution is good for a number of tasks, and WinHTTP scales better than WinInet, so your article deserves attention. Just don't conclude too quickly that you can easily turn your solution into a full-featured automated agent.
Regards,
R. LOPES
Just programmer.
|
Hi,
Thanks for your useful insight. My main purpose in publishing this article was that there was no useful information on CodeProject or associated sites about WinHTTP or URL cracking, and to give a basic overview of crawling using C++.
I have developed a fully featured agent system which takes care of most of the issues you have mentioned.
The answers to some of your questions are as follows:
--You are right that WinInet has a cache, but in crawlers we don't usually employ a cache while crawling (it is useful for showing dead-link/cached results in search engines). Regarding cookies, WinInet does support persistent cookies and has just one call to get the value of a cookie, but WinHTTP internally maintains session cookies too, as per MSDN.
--If the page consists of frames, you can have a small piece of code which detects a frame in the higher-level DOM nodes, brings in the HTML data for the individual frames and merges it into a single page, replacing each FRAME with the actual HTML page inside the parent FRAMESET (a rough sketch of the detection step follows below).
--For JavaScript-generated links, we can extract URLs from the script blocks along with a sandboxed script-execution module consisting of the web browser control for code generation, but application speed suffers.
--The range of documents that you can index on the web depends directly on the range of parsers you have for different content formats like Flash (I remember making a Flash file parser to obtain URLs from the action and data tags in a Flash file), PDF, etc.
WinHTTP was designed to obtain the best HTTP transfer performance, and I think it does that job noticeably better than WinInet.
Thanks and Regards.
-Prabhdeep
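A rough sketch of the frame-detection step described above, assuming the page HTML is already loaded into an MSHTML document: collect the FRAME (and IFRAME) elements, read each src attribute, then download and splice in that frame's HTML with whatever fetch routine you already use. The function name is illustrative only:

// Sketch: list the src URLs of FRAME/IFRAME elements in a parsed document.
#include <windows.h>
#include <mshtml.h>
#include <stdio.h>

void ListFrameSources(IHTMLDocument2* pDoc, const wchar_t* tagName)   // L"FRAME" or L"IFRAME"
{
    IHTMLDocument3* pDoc3 = NULL;
    if (FAILED(pDoc->QueryInterface(IID_IHTMLDocument3, (void**)&pDoc3)))
        return;

    IHTMLElementCollection* pFrames = NULL;
    BSTR tag = SysAllocString(tagName);
    pDoc3->getElementsByTagName(tag, &pFrames);
    SysFreeString(tag);

    long n = 0;
    if (pFrames) pFrames->get_length(&n);
    for (long i = 0; i < n; ++i)
    {
        VARIANT idx; idx.vt = VT_I4; idx.lVal = i;
        VARIANT empty; VariantInit(&empty);
        IDispatch* pDisp = NULL;
        if (FAILED(pFrames->item(idx, empty, &pDisp)) || !pDisp) continue;

        IHTMLElement* pElem = NULL;
        if (SUCCEEDED(pDisp->QueryInterface(IID_IHTMLElement, (void**)&pElem)))
        {
            BSTR attr = SysAllocString(L"src");
            VARIANT src; VariantInit(&src);
            if (SUCCEEDED(pElem->getAttribute(attr, 0, &src)) &&
                src.vt == VT_BSTR && src.bstrVal)
            {
                wprintf(L"frame source: %s\n", src.bstrVal);
                // ... fetch this URL and splice its HTML in place of the frame ...
            }
            VariantClear(&src);
            SysFreeString(attr);
            pElem->Release();
        }
        pDisp->Release();
    }
    if (pFrames) pFrames->Release();
    pDoc3->Release();
}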
|
Hi,
I'm always glad to find people working on automated agents.
I have been working on this subject for more than 3 years and have developed different solutions in various languages like Java, C++ and C#.
I struggled for a long time to achieve a result close to what IE gets.
Today I have a solution working pretty nicely, but I'm always exploring new options. For example, I recently looked at the WebRequest classes of .NET, which are very close to what WinHTTP provides. Actually I think those .NET classes are built around WinHTTP, but I'm not 100% sure. One thing for sure is that they provide the same level of capability and scalability, something WinInet lacks, I agree.
To do my job I had to develop a cache system; for a long time I have used my own. For cookies I did the same. For JavaScript I took pieces of code from the Mozilla project. For Flash I never did complete parsing, but I experimented with the Macromedia SDK.
All the solutions you present here are good, but sometimes not enough:
- Caching page elements can be a way to save bandwidth (especially if you crawl the same page several times a day). At the least you have to build your own cache.
- Cookies may have to be saved locally. That is my case: I have to keep cookies for new crawls days later. You can build your own store or rely on WinInet.
- For FRAMEs I like your solution, but I have had cases of frames using variables from other frames to build something like a banner ad. Don't smile, it really happened. And sometimes you have frames refreshing other pages or even generating other frames. If you join all the frames you break the scripts' behaviour. Trust me, I would like to get rid of those frames!
- For JavaScript, I agree it kills efficiency, but I also have to take care of it. This is maybe the most difficult and time-consuming part, but I have always tried to balance it with a good multithreaded solution.
- And you just can't have a parser for every kind of document, I agree, and I'm not sure any program needs all this stuff. I handle HTML, should do better with Flash, and I know I can get almost 100% of all links.
Nevertheless, things are getting more and more complicated in the crawling field, and I wonder if I will ever be happy with a solution. Until then I keep exploring. I'm glad to have had this opportunity to exchange views with you.
Best regards,
R. LOPES
Just programmer.
|
What about pages that use relative links, etc.? If you don't tell MSHTML somehow what the URL of the page is, then the URLs are garbage if they are relative, etc.
How do you handle that?
|
Hi Arun,
How I deal with the problem of relative URLs is something like this:
- Detection of a relative URL: when you call WinHttpCrackUrl on a relative link it gives you a blank hostname, which is an indication that it is not an absolute URL but a relative one.
- Conversion from relative to absolute: then you remember the current URL (store it in a global or pass it to the crawling function) and simply concatenate the relative part you got from the hyperlink in the page onto that base URL.
- Run WinHttpCrackUrl again, see if you get something intelligible, and resume the extraction process.
You probably need a small routine to build absolute URLs from a given base URL and relative URL (a sketch follows below). All the nuances of relative URLs are listed here and might be helpful to you:
http://www.webreference.com/html/tutorial2/3.html
-prabhdeep
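For the small routine mentioned above, one way to sketch it: crack the candidate link, and if it has no hostname treat it as relative and combine it with the base URL. UrlCombine from shlwapi already implements the ./.. and fragment rules covered in that tutorial, so it saves hand-rolling them; this is an illustration, not the article's code:

// Sketch: turn a possibly-relative link into an absolute URL against a base URL.
#include <windows.h>
#include <winhttp.h>
#include <shlwapi.h>
#include <string>
#pragma comment(lib, "winhttp.lib")
#pragma comment(lib, "shlwapi.lib")

std::wstring MakeAbsoluteUrl(const std::wstring& baseUrl, const std::wstring& link)
{
    URL_COMPONENTS uc = { 0 };
    uc.dwStructSize     = sizeof(uc);
    uc.dwSchemeLength   = (DWORD)-1;   // ask for pointers into the original string
    uc.dwHostNameLength = (DWORD)-1;
    uc.dwUrlPathLength  = (DWORD)-1;

    // If the link cracks and carries a hostname, it is already absolute.
    if (WinHttpCrackUrl(link.c_str(), (DWORD)link.length(), 0, &uc) &&
        uc.dwHostNameLength > 0)
        return link;

    // Otherwise resolve it against the base URL ("./", "../", "#" and so on are handled).
    wchar_t combined[2084];            // INTERNET_MAX_URL_LENGTH-sized buffer
    DWORD cch = sizeof(combined) / sizeof(combined[0]);
    if (SUCCEEDED(UrlCombineW(baseUrl.c_str(), link.c_str(), combined, &cch, 0)))
        return combined;

    return std::wstring();             // could not resolve
}

// Example: MakeAbsoluteUrl(L"http://www.example.com/internet/index.asp", L"../useritems/a.asp")
//          yields L"http://www.example.com/useritems/a.asp".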
|
Ah. It sounds like a hack, but I don't know of any better way for dealing with MSHTML.
|
Hello,
Who says that you want to cache, use cookies or run JavaScript? Most of the time JavaScript does some weird crap like colors flying across the page. This is an introduction to the subject, not a software package with 10 years of development behind it.
What the sh*t dude.
|
Interesting conversation. Although I am a pioneer in this area, my question is: why wouldn't we just let IE handle all this tricky stuff (handling JavaScript, relative URLs, frames, etc.) and take advantage of the IWebBrowser2 interface? I know it is slower, but the question is: is it really that much slower? And is there any way to programmatically instruct IWebBrowser2 not to download images (in case we need just the text)?
Any comments?
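On the image question: the WebBrowser control asks its host for the ambient DLCONTROL property, and images, video and sounds are downloaded only if the corresponding DLCTL_* bits are returned, so leaving out DLCTL_DLIMAGES is what skips image downloads (scripts keep running unless DLCTL_NO_SCRIPTS is added). Below is a sketch of just that ambient-property handler; the names come from mshtmdid.h, but wiring this IDispatch up as the control's client site (and calling IOleControl::OnAmbientPropertyChange when it changes) depends on how the control is hosted, so treat it as an outline rather than drop-in code:

// Sketch: ambient-property handler that tells the hosted WebBrowser/MSHTML not to
// download images, while leaving scripts and the rest of the page alone.
#include <windows.h>
#include <mshtmdid.h>    // DISPID_AMBIENT_DLCONTROL and the DLCTL_* flags

class CAmbientDownloadControl : public IDispatch
{
    LONG m_ref;
public:
    CAmbientDownloadControl() : m_ref(1) {}

    // IUnknown
    STDMETHODIMP QueryInterface(REFIID riid, void** ppv)
    {
        if (riid == IID_IUnknown || riid == IID_IDispatch) { *ppv = this; AddRef(); return S_OK; }
        *ppv = NULL;
        return E_NOINTERFACE;
    }
    STDMETHODIMP_(ULONG) AddRef()  { return InterlockedIncrement(&m_ref); }
    STDMETHODIMP_(ULONG) Release() { ULONG r = InterlockedDecrement(&m_ref); if (!r) delete this; return r; }

    // IDispatch: only the ambient download-control property matters here.
    STDMETHODIMP GetTypeInfoCount(UINT* pctinfo) { *pctinfo = 0; return S_OK; }
    STDMETHODIMP GetTypeInfo(UINT, LCID, ITypeInfo**) { return E_NOTIMPL; }
    STDMETHODIMP GetIDsOfNames(REFIID, LPOLESTR*, UINT, LCID, DISPID*) { return E_NOTIMPL; }
    STDMETHODIMP Invoke(DISPID dispid, REFIID, LCID, WORD, DISPPARAMS*,
                        VARIANT* pVarResult, EXCEPINFO*, UINT*)
    {
        if (dispid == DISPID_AMBIENT_DLCONTROL && pVarResult)
        {
            // Media is fetched only when its DLCTL_* bit is present; DLCTL_DLIMAGES is
            // deliberately omitted here, which is what suppresses image downloads.
            pVarResult->vt   = VT_I4;
            pVarResult->lVal = DLCTL_VIDEOS | DLCTL_BGSOUNDS;
            return S_OK;
        }
        return DISP_E_MEMBERNOTFOUND;
    }
};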