I'm not very familiar with Python. I'm scraping a school site for each school's address, phone number, email, and link. I scraped the JSON data, and one of its keys holds all of these values, as shown below.

This is the output I receive under the key "address":

<br>title='École privée'/>
17 rue Jean Gallart  <br>49650 ALLONNES  #This is the address
<br>Téléphone : <a href="tel:0241528201">0241528201</a>   #Phone no
<br>Adresse de courriel : <a href=""></a>  # Email 
<br><br><a href="./etablissement/Allonnes/ECOLE-PRIMAIRE-PRIVEE-SAINT-DOUCELIN/0491164Z.html">  #Link for school

Everything was on a single line; I formatted it to make it clearer and removed the unnecessary font tags.

I want to extract these items in the following format:

Address : 17 rue Jean Gallart 49650 ALLONNES
Telephone: 0241528201
Link for school = https://etablissement/Allonnes/ECOLE-PRIMAIRE-PRIVEE-SAINT-DOUCELIN/0491164Z.html

What I have tried:

I tried extracting the email using a regex:
import re
emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", address)
print(emails)

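I tried extracting the href attributes:

from bs4 import BeautifulSoup
soup = BeautifulSoup(address, 'lxml')
for anchor in soup.find_all("a"):
    # Print each anchor's attribute dictionary, e.g. {'href': '...'}
    print(anchor.attrs)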

I got output that is close to what I need:

{'href': 'tel:0241528201'}
{'href': ''}
{'href': './etablissement/Allonnes/ECOLE-PRIMAIRE-PRIVEE-SAINT-DOUCELIN/0491164Z.html'}

How can I extract these items one by one into separate variables so that I can easily save them to a CSV file?
Thanks in advance.
OriginalGriff 5-Aug-21 7:41am
Are you sure about that? That doesn't look like valid HTML. You seem to be mixing a "title" attribute into a "br" tag for starters, and that implies that your whole HTML could well be wrong.
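
Since the fragment is not well-formed HTML, one possible approach is to pull the pieces out with plain string handling instead of relying on a parser. The following is only a minimal sketch, assuming every "address" value follows the same layout as the sample in the question. It uses the standard library (re, urllib.parse, csv) rather than BeautifulSoup. BASE_URL, SAMPLE, and parse_address are hypothetical names made up for this illustration, and the base URL is a placeholder for wherever the JSON was actually fetched from.

```python
import csv
import io
import re
from urllib.parse import urljoin

# Hypothetical base URL: replace it with the page the JSON was fetched from.
BASE_URL = "https://example.org/annuaire/"

# The fragment from the question, joined back into a single line.
SAMPLE = (
    "<br>title='École privée'/>17 rue Jean Gallart  <br>49650 ALLONNES  "
    '<br>Téléphone : <a href="tel:0241528201">0241528201</a>'
    '<br>Adresse de courriel : <a href=""></a>'
    '<br><br><a href="./etablissement/Allonnes/'
    'ECOLE-PRIMAIRE-PRIVEE-SAINT-DOUCELIN/0491164Z.html">'
)

# Same pattern as in the question, made case-insensitive.
EMAIL_RE = re.compile(r"[a-z0-9.\-+_]+@[a-z0-9.\-+_]+\.[a-z]+", re.IGNORECASE)


def parse_address(fragment, base_url=BASE_URL):
    """Split one "address" value into address, telephone, email and link."""
    hrefs = re.findall(r'<a\s[^>]*?href="([^"]*)"', fragment)
    phone = next((h[len("tel:"):] for h in hrefs if h.startswith("tel:")), "")
    # The school page is the first non-empty href that is not tel:/mailto:.
    page = next(
        (h for h in hrefs if h and not h.startswith(("tel:", "mailto:"))), ""
    )
    match = EMAIL_RE.search(fragment)

    # Address = text before the "Téléphone" label, minus the stray
    # "title='...'/>" remnant and any remaining tags.
    head = fragment.split("Téléphone")[0].split("/>")[-1]
    address = " ".join(re.sub(r"<[^>]+>", " ", head).split())

    return {
        "address": address,
        "telephone": phone,
        "email": match.group(0) if match else "",
        "link": urljoin(base_url, page) if page else "",
    }


row = parse_address(SAMPLE)

# Written to an in-memory buffer here; pass an open file instead to save it.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Each value ends up under its own dictionary key, so a loop over all schools can call parse_address once per record and pass the result to writer.writerow. The email stays empty for this sample because the href is empty. If the real data contains self-closing tags such as `<br/>` inside the address, the split on "/>" would need adjusting.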

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
