Hello. We have some Python code that is part of a larger program. It receives a URL from the user, then searches to a depth of 2 from that URL and extracts email addresses. The goal is to remove the depth limit and search all subdomains and links reachable from the given URL without restriction. Please guide me and provide the modified code.

What I have tried:

import re
import urllib.request

from bs4 import BeautifulSoup

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

def extractUrl(url):
    print("Searching, please wait...")
    print("This operation may take several minutes")
    try:
        count = 0
        listUrl = []  # emails found so far, used to skip duplicates

        conn = urllib.request.urlopen(url)
        html = conn.read().decode('utf-8')

        emails = EMAIL_RE.findall(html)
        print("Searching in " + url)

        for email in emails:
            if email not in listUrl:
                count += 1
                print(str(count) + " - " + email)
                listUrl.append(email)

        soup = BeautifulSoup(html, "lxml")
        links = soup.find_all('a')

        for tag in links:
            link = tag.get('href', None)
            if link is not None:
                try:
                    print("Searching in " + link)
                    if link.startswith('http'):
                        f = urllib.request.urlopen(link)
                        s = f.read().decode('utf-8')
                        emails = EMAIL_RE.findall(s)
                        for email in emails:
                            if email not in listUrl:
                                count += 1
                                print(str(count) + " - " + email)
                                listUrl.append(email)
                                # searchEmail/insertEmail are defined elsewhere
                                # in the larger program
                                if searchEmail("EmailCrawler.db", email, "Especific Search") == 0:
                                    insertEmail("EmailCrawler.db", email, "Especific Search", url)
                except Exception:
                    pass  # skip links that fail to load or decode
    except Exception as e:
        print("Could not open " + url + ": " + str(e))
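To drop the fixed depth of 2, a common approach is a breadth-first crawl with a visited set: every in-scope link is queued, and the loop runs until the queue is empty (or a safety cap on page count is hit). The sketch below is illustrative, not part of the original program; `crawl`, `same_site`, and the `fetch` callback are hypothetical names, and the page cap is an assumed safeguard. Subdomains are kept in scope by comparing hostnames against the seed's host.

```python
import re
import urllib.parse

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,4}")

def same_site(seed_host, link):
    """True if link points at seed_host or one of its subdomains."""
    host = urllib.parse.urlparse(link).hostname or ""
    return host == seed_host or host.endswith("." + seed_host)

def crawl(seed, fetch, max_pages=500):
    """Breadth-first crawl with no depth limit.

    fetch(url) must return (html_text, list_of_absolute_links).
    The visited set prevents revisiting pages; max_pages is a
    safety cap so the crawl cannot run forever.
    """
    seed_host = urllib.parse.urlparse(seed).hostname
    queue = [seed]
    visited = set()
    emails = set()
    while queue and len(visited) < max_pages:
        url = queue.pop(0)
        if url in visited:
            continue
        visited.add(url)
        try:
            html, links = fetch(url)
        except Exception:
            continue  # skip pages that fail to load
        emails.update(EMAIL_RE.findall(html))
        for link in links:
            if same_site(seed_host, link) and link not in visited:
                queue.append(link)
    return emails
```

In the real program, `fetch` would wrap `urllib.request.urlopen` plus the existing BeautifulSoup link extraction (resolving relative hrefs with `urllib.parse.urljoin`), and each newly found email would go through the existing `searchEmail`/`insertEmail` calls.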
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)