|
Indeed a good theory, and I am sure they could limit me to a subset of just 850k domain names, but in general when they catch you and think you are up to no good (try using Google from Tor), Google will send you to a CAPTCHA screen where you need to type in a number.
They cannot just block based on the number of searches at such slow rates, because most workplaces sit behind a router, so everyone in the office shares the same public IP address, and the code I use changes the User-Agent now and then to give them that impression.
Sometimes I also switch over to a VPN so that the requests come from various locations around the world.
Google are good, very good, with massive data centers that have more security than Fort Knox, but they still need to use web farms, and the IP address for Google.com gets changed all the time: you get a more local address depending on where you request the DNS lookup from.
I don't need to spread the requests to take advantage of this, but I would if I had to.
|
|
|
|
|
Dr Gadgit wrote: and have reached about 850k domains in the database ...I read once that the internet contains about a billion websites...I am seeing does not look anywhere near the 5-20 million that I was originally expecting.
Expecting it why? Google uses algorithms to return 'best' matches. They vary results for various reasons, but that doesn't mean that the 20 millionth site is going to suddenly show up at the number one spot. Especially since the queries only return a fixed, much smaller size anyway. So more likely that set is the one with varying results.
|
|
|
|
|
"Expecting it why? Google uses algorithms to return 'best' matches"
No they don't, and instead always try to sell you something.
Don't take my word for it; read up on bidding for Google AdWords, or read comments from people who complain about how Google returns results.
I did 1000 hand-coded search requests and took the first 1000 links from each, and towards the end I found that I had already recorded each host name in 99.999% of cases, so I must have just about maxed Google out.
Not convinced by this, six months later I decided to try 27,000 unique search terms to see if I could tease any more names out of Google. I did manage another 150k or so, but in the same period about 150k names had been removed (404, no DNS entry).
I am working on the logic that if I play enough hands of cards then sooner or later I will come to learn just how many cards there are in the pack.
Maybe the only Google search known to man that will return abc123def456ghi.com is to type that domain into the Google search box, but having tried 27,000 search terms only to get no new results in 99.99999% of cases says to me that I am about home for English results.
Please advise of a better method if you know of one, or a dictionary I should be using, and I will give it a try.
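The "hands of cards" logic above can be sketched as a running count of how many host names in each batch of results have not been seen before. This is only a minimal illustration: the function name and the sample host names are made up, not the poster's actual code, and a count that falls to zero shows only that this method has stopped surfacing new names, not that no others exist.

```python
def new_hosts_per_batch(batches):
    """For each batch of host names, count those not seen in any earlier batch."""
    seen = set()
    counts = []
    for batch in batches:
        # Host names are case-insensitive, so normalise before comparing.
        fresh = {host.lower() for host in batch} - seen
        counts.append(len(fresh))
        seen |= fresh
    return counts, seen


# Hypothetical sample data: three batches of search results.
batches = [
    ["a.example", "b.example", "c.example"],
    ["b.example", "c.example", "d.example"],
    ["A.example", "d.example"],
]
counts, seen = new_hosts_per_batch(batches)
print(counts)     # [3, 1, 0]
print(len(seen))  # 4
```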
|
|
|
|
|
Dr Gadgit wrote: No they don't and instead always try to sell you something.
Since the vast majority of sites do not have any financial relationship with Google, then by your very own assumption that would explain why there are so few results. But Google doesn't solely base results on paid relationships.
Dr Gadgit wrote: Don't take my word for it and read up on bidding for Google AdWords
You do of course realize that the vast majority of those 1 billion sites do not in fact have anything to do with AdWords? So of course that is meaningless. Except for the fact that those that do pay will occupy higher spots and thus guarantee that some slots way down the list will be pushed out.
Dr Gadgit wrote: Please advise of a better method
Facts are
1- 1 billion sites
2- Results that are sorted based on some 'best' criteria
3- Results are limited to 1000.
4- Certainly some factor that many of those sites are not in english.
The above together would seem to ensure, to me, that there is no way one is going to get close.
Dr Gadgit wrote: I decided to try 27,000 unique search terms
Oxford says there are 170,000 English words. If each of those words returned 1000 unique sites, then that would be 170,000,000 sites.
Naturally of course there is no way you are going to get unique sets for each. Thus 170 million is the absolute theoretical maximum and common sense would suggest it will be far lower.
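The arithmetic behind that ceiling, as a small sketch. The figures are the ones quoted in this thread (170,000 words, a 1000-result cap, 1 billion sites); the function name is made up, and the bound assumes one single-word query per dictionary word with no overlap between result sets.

```python
def max_unique_sites(num_queries, results_per_query=1_000):
    """Upper bound on distinct sites if no two queries ever share a result."""
    return num_queries * results_per_query


CLAIMED_SITES = 1_000_000_000  # the "1 billion sites" figure quoted above

dictionary_bound = max_unique_sites(170_000)  # one query per English word
experiment_bound = max_unique_sites(27_000)   # the 27,000 search terms tried

print(dictionary_bound)                           # 170000000
print(experiment_bound)                           # 27000000
print(f"{dictionary_bound / CLAIMED_SITES:.0%}")  # 17%
```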
|
|
|
|
|
If you make enough requests and read 1000 links each time, then sooner or later you will cover every site that Google is not hiding, and no, you don't need to pay Google money to be on page ten of the search results.
Google never used to be like this, and I think that the nanny state is working to limit just what you can read on the internet, or where else have the other 90% plus of sites all gone?
|
|
|
|
|
Okay, for argument's sake let's say there are 1 billion sites out there (unlikely). If only the 850k sites in the search results are really accessible, then that's all users ever see anyway. No one would ever find the others, right?
Interesting.
|
|
|
|
|
newton.saber wrote: Okay, for argument's sake let's say there are 1 billion sites out there (unlikely)
That is however what they say there are.
http://www.internetlivestats.com/total-number-of-websites/
newton.saber wrote: No one would ever find the others, right?
No one? Obviously someone created it so presumably that person can find it.
And some sites are intentionally not supposed to be findable of course.
The topic here isn't of course whether one can get to it at all but rather whether one can get to it via a search engine. And per my other reply that seems unlikely (on average.)
|
|
|
|
|
If I run a query on the 850k domains with GROUP BY IP and a count of domains, then the winner is WordPress with 1485 domains on the same IP address, 192.0.78.16
using
SELECT Host, IP, ASN, DateScan
FROM dbo.Hosts
WHERE (Host LIKE '%WordPress.%')
ORDER BY Host
This gets me 3235 results, with the first host being '02varvara.wordpress.com' and the last one being 'zxksinglegaymendating.wordpress.com' << SORRY DO NOT VIEW
(The above two domains might not be on 192.0.78.16)
If you check out http://www.my-ip-neighbors.com/?domain=192.0.78.16
for that address, you get 263 results, whereas I found 1485 domain names on it, so I must be getting close with my research by having nearly six times that number of domains hosted on that single IP address.
I will eat my hat if anyone can find a host name that sorts before "02varvara" and ends in ".wordpress.com" on any Google search results page that works in English.
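The GROUP BY described at the top of this post is not shown, so here is a rough sketch of what such a query could look like. It uses Python's built-in sqlite3 with an in-memory table instead of the SQL Server database implied by dbo.Hosts; only the Host and IP column names come from the post, and the rows are invented placeholder data.

```python
import sqlite3


def domains_per_ip(rows):
    """Return (IP, number of hosts) pairs, busiest IP first."""
    conn = sqlite3.connect(":memory:")
    # Cut-down stand-in for the Hosts table; ASN and DateScan are omitted.
    conn.execute("CREATE TABLE Hosts (Host TEXT, IP TEXT)")
    conn.executemany("INSERT INTO Hosts (Host, IP) VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT IP, COUNT(*) AS Domains FROM Hosts "
        "GROUP BY IP ORDER BY Domains DESC, IP"
    ).fetchall()
    conn.close()
    return result


# Hypothetical rows using reserved documentation addresses.
rows = [
    ("one.example", "192.0.2.10"),
    ("two.example", "192.0.2.10"),
    ("three.example", "192.0.2.10"),
    ("four.example", "192.0.2.20"),
]
top = domains_per_ip(rows)
print(top)  # [('192.0.2.10', 3), ('192.0.2.20', 1)]
```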
|
|
|
|
|
What kind of hat? I just want to be sure it is worth the effort.
|
|
|
|
|
|
Google search algorithm ranks websites mainly by their number of links from other external domains.
I think the explanation for the limited number of sites that you found is that not all of the billion websites in existence have enough links pointing at them to gain a ranking that justifies (for Google) their inclusion in search results.
|
|
|
|
|
"Google search algorithm ranks websites mainly by their number of links from other external domains."
To some degree you must be right, because lots of people banging on about SEO who earn a living from it spend their nights spamming sites ("organic growth", they call it), but nothing beats paying Google some money to hit the top of the page.
I was reading 1000 links per search term, and I don't think I would have got many more unique URLs even if I had read 10,000 per search; they just ran out of domains, as far as I could see, at about 800k.
I am sure I could get lots more domains from alexa.com, and faster, by running a web bot, but the point I want to make is that Google now hides most of the internet: goodbye little guys, hello controlled opposition or paying customers.
Gone are the days where "Little Mrs Smith" will help you knit a pair of socks and pay for her site using a few adverts. No sir, it's all on eBay, Wikipedia or Facebook; it's all you need to know in life.
Google is in court every other week and having to pay fines, so it's not like we can trust them to give us the truth. They are part of a monopoly, we all know the names of the next six players in the game, and I for one don't think that this is a good thing for any of us.
|
|
|
|
|
I wouldn't be surprised to find that Google also trims the tail. Any site with only one or two links probably isn't interesting to anybody, so why waste the disk space?
We can program with only 1's, but if all you've got are zeros, you've got nothing.
|
|
|
|
|
Quote: the 5-20 million that I was originally expecting
Well, there's your problem! I see no reason to expect anything of the sort, especially when you've conducted the search in such an arbitrary fashion. How many sites are intentionally excluded from Google? How many are referenced by their IP address only? Casting your net in the Google Sea is no more likely to identify every domain than casting it in the Indian Ocean is to get you every species of fish.
|
|
|
|
|
"How many sites are intentionally excluded from Google"
Lots, I would say, but it never used to be that way and was more of a level playing field. Today you are forced to play by Google's rules: Google's special meta tags in HTML pages, Google spyware scripts on the site, and then not sticking your neck out too much.
Censorship in my book but most people don't see it.
Well done, nail on the head
|
|
|
|
|
To continue the fishing analogy, this is casting the net 27,000 times in different locations within the sea. I'd say it should return a decent reflection of the overall fish in the Google ocean.
Good experiment, I always doubted the billions upon billions of web pages supposedly searched to return my results in a fraction of a second.
|
|
|
|
|
I love this experiment in reality vs. theory! I agree with your premise and expected results. A simple .com test should return results in the tens of millions.
http://www.internetlivestats.com/total-number-of-websites/
Is there any way to take your same experiment and test against Bing and Ask? I expect that you are scraping the google results DOM or something like that, so maybe not feasible.
No matter how "big data" you are, there just isn't enough computing power to iterate through billions of *raw* records for the 5.7 billion daily searches. You transform data, then search on the transformed result.
This very well might be a result of the google data transformation layer filtering out the millions of sites that don't have a high confidence score for your search terms.
I think you are finding out how powerful search engines are... they define our internet reality. Companies go out of business when Google drops their URL. Just in the past few weeks, Windows developers have complained about their Google ad revenue dropping by up to 95%.
Must feel good to have the power to nuke entire companies and industries by dropping them from search results.
Robert
|
|
|
|
|
A first-grade teacher, Ms. Brooks, was having trouble with one of her students. The teacher asked, “Harry, what’s your problem?” Harry answered, “I’m too smart for the 1st grade. My sister is in the 3rd grade and I’m smarter than she is! I think I should be in the 3rd grade too!”
Ms. Brooks had had enough. She took Harry to the principal’s office.
While Harry waited in the outer office, the teacher explained to the principal what the situation was. The principal told Ms. Brooks he would give the boy a test. If he failed to answer any of his questions he was to go back to the 1st grade and behave. She agreed.
Harry was brought in and the conditions were explained to him and he agreed to take the test.
Principal: “What is 3 x 3?”
Harry: “9.”
Principal: “What is 6 x 6?”
Harry: “36.”
And so it went with every question the principal thought a 3rd grader should know.
The principal looks at Ms. Brooks and tells her, “I think Harry can go to the 3rd grade.”
Ms. Brooks says to the principal, “Let me ask him some questions.”
The principal and Harry both agreed.
Ms. Brooks asks, “What does a cow have four of that I have only two of?”
Harry, after a moment: “Legs.”
Ms. Brooks: “What is in your pants that you have but I do not have?”
The principal wondered why would she ask such a question!
Harry replied: “Pockets.”
Ms. Brooks: “What does a dog do that a man steps into?”
Harry: “Pants.”
The principal was trembling.
Ms. Brooks: “What word starts with an ‘F’ and ends in ‘K’ that means a lot of heat and excitement?”
Harry: “Firetruck.”
The principal breathed a sigh of relief and told the teacher, “Put Harry in the fifth-grade, I got the last four questions wrong myself.”
I'll get my coat.
Once you lose your pride the rest is easy.
In the end, only three things matter: how much you loved, how gently you lived, and how gracefully you let go of things not meant for you. – Buddha
Simply Elegant Designs JimmyRopes Designs
|
|
|
|
|
|
Thanks for pointing that out. And only 3 years ago. How could I have missed that!
|
|
|
|
|
A highly dangerous virus called "Weekly Overload Recreational Killer" (WORK) is currently going around, indiscriminately attacking CP members worldwide. If you come into contact with the WORK virus, you should go immediately to the nearest "Biological Anxiety Relief" (BAR) center, to take antidotes known as "Work Isolating Neutralizer Extract" (WINE), Radioactive UnWORK Medicine (RUM), "Bothersome Employer Elimination Rebooter" (BEER) or "Vaccine Official Depression Killing Antigen" (VODKA). Don't wait! Protect yourself! Do it now!
How do we preserve the wisdom men will need,
when their violent passions are spent?
- The Lost Horizon
|
|
|
|
|
I was fully vaccinated with copious quantities of those (and Work Hating Immunization Serum Keeping Employees Younger) when I was a mere stripling!
Bad command or file name. Bad, bad command! Sit! Stay! Staaaay...
|
|
|
|
|
Aaaaaah! Whiskey: The right stuff to pickle any liver and make it last forever!
|
|
|
|
|
I had hoped to stumble across a presentation as visually beautiful and sensual as last year's Polish entry; instead, all I saw was:
1. feral amazons dressed-to-kill snarling through Las Vegas scream-a-thons.
2. young-folks from countries where dancing is, evidently, either an unknown art, or a form of martial art involving using spastic movements to dis-orient opponents.
3. the usual bo-toxed talking-head commentators stoned on methamphetamine and/or steroids, engaging in glossolalia with shards of trivia barely able to crawl through.
4. horrible, crude, CGI backdrop animations of machinery a la steam-punk spawning cheap knock-off Giger-style nightmare-techno.
The only enjoyable moment my half-hour meandering about the final produced was when I realized ... afterwards ... I had not seen Ms. EuroVision 2014, Conchita Wurst, simpering through "Droop like a Phoenix."
Yes: why did I bother ? Poland where are you when I need you ?
«I want to stay as close to the edge as I can without going over. Out on the edge you see all kinds of things you can't see from the center» Kurt Vonnegut.
|
|
|
|
|
Hey come on Bill! They let Australia into Europe this year!
(I think the organisers were secretly hoping the Ozzies would win, and it would never involve Europe again...)
I can't comment on the performances, as I had prior engagements and couldn't watch it. Cutting my toenails, watching grass grow, paint dry - that sort of thing.
|
|
|
|