|
|
|
|
|
|
|
I've been scanning Google search results for months now, using only Google's .com, .be and .co.uk sites, but for the life of me I cannot seem to get more than about 850k unique domain names.
At first I used about a thousand hand-coded search terms and read the first ten pages returned from Google, at about a hundred results per page. That soon got me to about half a million domains, but then it slowed to a stop as it reached 700k.
Six months later I ran a scan to see how many of the 700k sites had gone 404 or had their DNS entries removed, and found that the number of sites up and running was dropping like flies (off the top of my head, about 20% gone).
Now I have a whopping twenty-seven thousand search terms pulled from metadata and have been throwing these (very slowly) at Google. I have reached about 850k domains in the database, and it has all but come to a stop again; that 850k includes the roughly 20% found to be dead six months previously.
I read once that the internet contains about a billion websites, not that I believed those numbers. Many of the domains I have collected point to foreign sites in places like China or Korea, so it is a bit of a mixed bag of results, and I also understand that many domains have been parked (lots are linked back to fake sites sharing the same IP and running AdWords; Google does not mind). But the 850k number I am seeing does not look anywhere near the 5-20 million I was originally expecting.
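The dead-domain re-scan mentioned above boils down to something like this (a rough sketch of the idea, not my exact code; the domain at the bottom is just a placeholder):

import socket
import urllib.error
import urllib.request

def is_dead(domain: str) -> bool:
    # A domain counts as dead if its DNS entry is gone or the site 404s.
    try:
        socket.gethostbyname(domain)          # DNS entry removed?
    except socket.gaierror:
        return True
    try:
        req = urllib.request.Request("http://" + domain, method="HEAD")
        urllib.request.urlopen(req, timeout=10)
        return False
    except urllib.error.HTTPError as err:
        return err.code == 404                # server up, site gone
    except Exception:
        return True                           # unreachable counts as dead here

print(is_dead("example.com"))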
modified 24-May-15 8:59am.
|
|
|
|
|
Dr Gadgit wrote: foreign sites in places like China or Korea
So sites in other countries are not foreign?
|
|
|
|
|
Most results from Google.com or .co.uk would be English sites, and English is quite common for websites located in countries where English is not the official language.
A finger-in-the-air guess from me would be that a third of the internet uses English, or has the option of being viewed in English, and would be indexed by Google and returned when searching via Google.com or the UK site.
Pushing the boat out, I still don't think I could get above 4 million sites from Google even if I scanned every one of its country-based servers.
I can tell you for a fact that some domain parks run about 20,000 sites each, and they host sites that relay Google AdWords adverts, but I would not know just how many Google filters out of its results.
If these fake park sites didn't get any hits then they would not do it.
See http://ww25.krvkr.com/[^]
or
http://ww2.bangalorewalkin.com/[^]
Richard, I did not know you were running a search engine:
http://www.searchinguncovered.com/?pid=7PO3Q136C&dn=RichardMacCutchan.com&rpid=7POL08WI8[^]
The fake ones I am talking about work a bit like this, but just work off the parked domain name.
modified 24-May-15 10:09am.
|
|
|
|
|
Would some of the 'fake' ones be cyber-squatted ones too?
(Perhaps the Bangalore walk-in site you mention is one such).
|
|
|
|
|
I think we get a bit of a mix, but in general Google does manage to filter most of them out.
Big domain parks pointing lots of domains at the same IP would show up, but I don't see anything that stands out.
|
|
|
|
|
Fantastic information. Great analysis you are doing. Really cool stuff.
|
|
|
|
|
|
Dr Gadgit wrote: I've been scanning Google search results for months now using only Google's .com, .be and .co.uk
How?
AFAIK, you'd need help from Google to go beyond a certain number of requests.
Bastard Programmer from Hell
If you can't read my code, try converting it here[^]
|
|
|
|
|
No, I don't need any help from Google, but I must admit that it's a bit of an art to fool Google into not blocking the searches.
The first trick is going slowly, but also varying the delays between requests (30-60 seconds); the second trick is to read the names and values of the HTML input boxes and buttons from the form to make up the next request URL, for a perfect forgery.
If you don't send the cookies back then they will block you in time, and it's best to use HTTPS because Google redirects to SSL after a while; it helps them hide spyware scripts from most people.
What I am doing might not work for much longer, because Google is removing all traces of domain names from its search results on mobile devices, and if no one gets upset about that then, at a guess, they will do the same to all search results.
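The pacing side of it looks roughly like this (a simplified sketch, not my exact code; the search terms are placeholders and the form-forgery part is left out):

import random
import time
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar

jar = CookieJar()                             # keep and return Google's cookies
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]

for term in ["knitting patterns", "sock wool"]:   # placeholder search terms
    url = "https://www.google.com/search?" + urllib.parse.urlencode({"q": term})
    with opener.open(url, timeout=30) as resp:    # HTTPS, as described above
        html = resp.read()                        # result links get parsed from this
    time.sleep(random.uniform(30, 60))            # vary the delay between requests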
|
|
|
|
|
I think you will find that Google is detecting your IP address scraping information, and is consequently limiting what is being returned to you.
“That which can be asserted without evidence, can be dismissed without evidence.”
― Christopher Hitchens
|
|
|
|
|
Indeed a good theory, and I am sure they could limit me to a subset of just 850k domain names, but in general when they catch you and think you are up to no good (try using Google from Tor), Google will send you to a CAPTCHA screen where you need to type in a number.
They cannot just block based on the number of searches at such slow rates, because most workplaces sit behind a router, so everyone in the office shares the same public IP address, and the code I use changes the User-Agent now and then to give them that impression.
Sometimes I also switch over to a VPN so that the requests come from various locations around the world.
Google are good, very good, with massive data centres that have more security than Fort Knox, but they still need to use web farms, and the IP address for Google.com gets changed all the time; you get a more local address depending on where you request the DNS lookup from.
I don't need to spread the requests to take advantage of this, but I would if I had to.
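The User-Agent rotation and the CAPTCHA check amount to something like this (again a simplified sketch; the browser strings are just examples):

import random
import urllib.request

USER_AGENTS = [                               # a few plausible browser strings
    "Mozilla/5.0 (Windows NT 6.1; rv:38.0) Gecko/20100101 Firefox/38.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10) AppleWebKit/537.36",
]

def fetch(url: str) -> bytes:
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    req = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(req, timeout=30) as resp:
        # Google bounces suspected bots to an interstitial under /sorry/
        if "/sorry/" in resp.geturl():
            raise RuntimeError("CAPTCHA page - back off and slow down")
        return resp.read()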
|
|
|
|
|
Dr Gadgit wrote: and have reached about 850k domains in the database ... I read once that the internet contains about a billion websites ... does not look anywhere near the 5-20 million that I was originally expecting.
Expecting it why? Google uses algorithms to return 'best' matches. They vary the results for various reasons, but that doesn't mean the 20-millionth site is suddenly going to show up in the number-one spot. Especially since each query only returns a fixed, much smaller set anyway. So more likely that set is the one with the varying results.
|
|
|
|
|
"Expecting it why? Google uses algorithms to return 'best' matches"
No they don't and instead always try to sell you something.
Don't take my word for it and read up on bidding for Goiogle add-words or read comments from people who complain about how google returns results.
I did 1000 hand coded search request and took the first 1000 links and towards the end i found that i had recorded each host name already in 99.999% of cases so i must had about maxed google out.
Not convinced by this, 6 months later I decided to try 27,000 unique search terms to see if i could tease any more names from google and did manage another 150k ish but in the same period of time 150k names had been removed, 404, no DNS entry.
I am working on the logic that if i play enought hands of cards then sooner or later I will come to learn just how many cards their are in the pack.
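Put another way, the two scans give a capture-recapture style estimate of the size of the pack (the numbers below are illustrative, not my real overlap counts):

# Lincoln-Petersen capture-recapture estimate: if two scans find n1 and n2
# domains with m in common, the pool is roughly n1 * n2 / m.
# Illustrative numbers only, not my real counts.
n1 = 700_000       # domains found by the first scan
n2 = 850_000       # domains found by the second scan
m = 680_000        # domains found by both scans
print(round(n1 * n2 / m))   # 875000 - close to what I am already seeing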
Maybe the only Google search known to man that will return abc123def456ghi.com is to type that domain into the Google search box, but having tried 27,000 search terms only to get no new results in 99.99999% of cases says to me that I am about home for English results.
Please advise of a better method if you know of one, or a dictionary I should be using, and I will give it a try.
|
|
|
|
|
Dr Gadgit wrote: No they don't; instead they always try to sell you something.
Since the vast majority of sites do not have any financial relationship with Google then, by your very own assumption, that would explain why there are so few results. But Google doesn't base results solely on paid relationships.
Dr Gadgit wrote: Don't take my word for it: read up on bidding for Google AdWords
You do of course realize that the vast majority of those 1 billion sites do not in fact have anything to do with AdWords? So of course that is meaningless. Except for the fact that those that do pay will occupy higher spots and thus guarantee that some slots way down the list get pushed out.
Dr Gadgit wrote: Please advise of a better method
The facts are:
1- There are 1 billion sites.
2- Results are sorted by some 'best' criteria.
3- Results are limited to 1,000 per query.
4- Certainly some fraction of those sites are not in English.
Taken together, the above would seem to ensure, to me, that there is no way one is going to get close.
Dr Gadgit wrote: I decided to try 27,000 unique search terms
Oxford says there are about 170,000 English words in current use. If each of those words returned 1,000 unique sites, then that would be 170,000,000 sites.
Naturally there is no way you are going to get unique sets for each. Thus 170 million is the absolute theoretical maximum, and common sense suggests the real figure will be far lower.
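To put rough numbers on the saturation you reported (a toy model that treats each query's 1,000 results as uniform random draws from a fixed pool, which Google's ranking certainly is not):

# Ceiling: 170,000 words x 1,000 results per query.
words = 170_000
per_query = 1_000
print(words * per_query)          # 170,000,000 absolute maximum

# Toy model: expected unique domains after k queries, each returning
# per_query uniform random draws from a pool of N sites.
N = 850_000                       # pool size suggested by the scan
k = 27_000                        # number of search terms tried
expected = N * (1 - (1 - per_query / N) ** k)
print(round(expected))            # ~850,000 - a pool this small saturates almost at once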
|
|
|
|
|
If you make enough requests and read 1,000 links each time, then sooner or later you will cover every site that Google is not hiding, and no, you don't need to pay Google money to be on page ten of the search results.
Google never used to be like this, and I think the nanny state is working to limit just what you can read on the internet. Where else have the other 90-plus percent of sites all gone?
|
|
|
|
|
Okay, for argument's sake let's say there are 1 billion sites out there (unlikely). If only the 850k sites are really accessible through search results, then that's all users ever see anyway. No one would ever find the others, right?
Interesting.
|
|
|
|
|
newton.saber wrote: Okay, for argument's sake let's say there are 1 billion sites out there (unlikely)
That is, however, what they say there are.
http://www.internetlivestats.com/total-number-of-websites/[^]
newton.saber wrote: No one would ever find the others, right?
No one? Obviously someone created each of them, so presumably that person can find it.
And some sites are intentionally not supposed to be findable, of course.
The topic here isn't whether one can get to a site at all, but rather whether one can get to it via a search engine. And per my other reply, that seems unlikely (on average).
|
|
|
|
|
If I run a query on the 850k domains and GROUP BY IP with a count of domains, then the winner is WordPress, with 1,485 domains on the same IP address, 192.0.78.16.
using
SELECT Host, IP, ASN, DateScan
FROM dbo.Hosts
WHERE (Host LIKE '%WordPress.%')
ORDER BY Host
This gets me 3,235 results, with the first host being '02varvara.wordpress.com' and the last 'zxksinglegaymendating.wordpress.com' << SORRY, DO NOT VIEW.
(The above two domains might not be on 192.0.78.16)
If you check out http://www.my-ip-neighbors.com/?domain=192.0.78.16[^]
for that address, you get 263 results, whereas I found 1,485 domain names on it, so with nearly six times that number of domains on a single IP address I must be getting close with my research.
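For reference, the aggregate I mention at the top of this post looks something like this (same dbo.Hosts table as the query above):

-- Count how many distinct domains share each IP address;
-- the top rows are the big domain parks.
SELECT IP, COUNT(DISTINCT Host) AS DomainCount
FROM dbo.Hosts
GROUP BY IP
ORDER BY DomainCount DESC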
I will eat my hat if anyone can find a host name that sorts below '02varvara' and ends in '.wordpress.com' on any Google search results page that works in English.
|
|
|
|
|
What kind of hat? I just want to be sure it is worth the effort.
|
|
|
|
|
|
Google's search algorithm ranks websites mainly by their number of links from other external domains.
I think the explanation for the limited number of sites you found is that not all of the billion websites in existence have enough links pointing at them to gain a ranking that justifies (for Google) their inclusion in search results.
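A toy illustration of that idea, in the PageRank style (a sketch of link-based ranking in general, not Google's actual algorithm; the four-site graph is made up):

# Toy link-based ranking: sites with few inbound links end up with scores
# so low that any cutoff would drop them from results entirely.
links = {                        # who links to whom (hypothetical graph)
    "a.com": ["b.com", "c.com"],
    "b.com": ["c.com"],
    "c.com": ["a.com"],
    "d.com": ["c.com"],          # d.com has no inbound links at all
}
sites = list(links)
rank = {s: 1.0 / len(sites) for s in sites}
damping = 0.85
for _ in range(50):              # power iteration until the scores settle
    new = {s: (1 - damping) / len(sites) for s in sites}
    for src, outs in links.items():
        for dst in outs:
            new[dst] += damping * rank[src] / len(outs)
    rank = new
print(sorted(rank.items(), key=lambda kv: -kv[1]))
# d.com lands at the bottom; a ranking cutoff would hide it completely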
|
|
|
|
|
"Google search algorithm ranks websites mainly by their number of links from other external domains."
To some degree you must be right, because lots of people banging on about SEO, who earn a living from it, spend their nights spamming sites ('organic growth' they call it), but nothing beats paying Google some money to hit the top of the page.
I was reading 1,000 links per search term, and I don't think I would have got many more unique URLs even if I had read 10,000 per search; they just ran out of domains, as far as I could see, at about 800k.
I am sure I could get lots more domains from alexa.com, and faster, by running a web bot, but the point I want to make is that Google now hides most of the internet. Goodbye little guys; hello controlled opposition or paying customers.
Gone are the days where 'Little Mrs Smith' will help you knit a pair of socks and pay for her site with a few adverts. No sir, it's all eBay, Wikipedia or Facebook; that's all you need to know in life.
Google is in court every other week and having to pay fines, so it's not like we can trust them to give us the truth, and they are part of a monopoly; we all know the names of the next six players in the game, and I for one don't think this is a good thing for any of us.
|
|
|
|
|
I wouldn't be surprised to find that Google also trims the tail. Any site with only one or two links probably isn't interesting to anybody, so why waste the disk space?
We can program with only 1's, but if all you've got are zeros, you've got nothing.
|
|
|
|