I am using a Bash script

Bash
#!/bin/bash
# Define variables for the URL and browser
sGDomain="idealista"
sGCitta="fucecchio-firenze"
sGTypo="vendita-case"
iGPagina=1

# Start of the loop
while :; do

    # Build the URL with the iGPagina variable
    url="https://www.$sGDomain.it/$sGTypo/$sGCitta/lista-$iGPagina.htm"
    #echo "$url"
    
    # Get the HTML content of the page
    html_content=$(curl -s -L -A "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0" "$url")

    echo "$html_content" > htmlcompleto.txt
    
    # Stop when the page has no "Successiva" (next page) link
    if [[ ! $html_content =~ "Successiva" ]]; then
        break  # No next-page link found: exit the loop
    fi
    
    # Use xidel to extract the ads, feeding it the downloaded HTML on stdin
    xidel_output=$(printf '%s\n' "$html_content" | xidel --silent --xpath '
        //div[contains(@class, "item-info-container")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )
    ' -)

    # Check if the temporary file exists and delete it if present
    if [ -f "temp.txt" ]; then
        rm temp.txt
    fi

    # Replace special characters from "desc=" to the end of each line in semi.txt
    echo "$xidel_output" | sed -e "s/desc=\(.*\)\(['\"]\)/desc=\1 /g" > semi.txt

    sed -i 's/\([0-9]\{1,\}\)\.\([0-9]\{1,\}\),[0-9]\{2\}/\1\2/g' semi.txt
    sed -i 's/m²//g' semi.txt

    # Concatenate semi.txt with debugtxt.txt for debugging purposes
    cat semi.txt >> debugtxt.txt

    # Connect to the SQLite database
    db_file="immo.db"
 
    # Loop through the lines and insert them into the SQLite database
    while IFS= read -r line; do
        # Extract price, size, link, and description values from the lines using awk
        prezzo=$(echo "$line" | awk -F 'price=' '{print $2}' | awk -F 'size=' '{print $1}')
        size=$(echo "$line" | awk -F 'size=' '{print $2}' | awk -F 'link=' '{print $1}')
        link=$(echo "$line" | awk -F 'link=' '{print $2}' | awk -F 'desc=' '{print $1}')
        descrizione=$(echo "$line" | awk -F 'desc=' '{print $2}')

        # Determine if the description contains "asta"
        if [[ $descrizione =~ "asta" ]]; then
            asta=1
        else
            asta=0
        fi

        # Insert the data into the SQLite database
        sqlite3 "$db_file" "INSERT INTO $sGDomain (prezzo, link, descrizione, metratura, asta) VALUES ('$prezzo', '$link', '$descrizione', '$size', $asta)"
    done < semi.txt

    # Increment the iGPagina variable for the next iteration
    ((iGPagina++))
done
to search a web page with a specific XPath expression. Although I believe the XPath is correct, the script fails to find anything on the page.

Web page URL: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm

XPath expression used: the one passed to xidel in the script above.
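
The inner loop of the script splits each record with chained awk calls, which leaves the separating tab attached to each value, and it places the values into the INSERT statement unescaped. A minimal sketch of an alternative, assuming two helper functions named parse_record and sql_quote (both hypothetical, not part of the original script) and a made-up sample record:

```shell
#!/bin/bash
# Sketch only: parse_record and sql_quote are hypothetical helpers,
# and the sample record below is invented for illustration.

# Split one tab-separated record into the variables used by the script.
# Every field carries a "name=" prefix, so no field is ever empty and
# splitting on tabs with read does not drop anything.
parse_record() {
    local f_price f_size f_link f_desc
    IFS=$'\t' read -r f_price f_size f_link f_desc <<< "$1"
    prezzo=${f_price#price=}
    size=${f_size#size=}
    link=${f_link#link=}
    descrizione=${f_desc#desc=}
}

# Double every single quote so the value can sit inside an SQL string literal.
sql_quote() {
    local q="'"
    printf '%s' "${1//$q/$q$q}"
}

# Usage with an invented record; the statement is only printed, not executed.
sample=$'price=270000\tsize=120 m2\tlink=/immobile/1/\tdesc=Casa all\'asta'
parse_record "$sample"
echo "INSERT INTO idealista (prezzo, link, descrizione, metratura) VALUES ('$(sql_quote "$prezzo")', '$(sql_quote "$link")', '$(sql_quote "$descrizione")', '$(sql_quote "$size")');"
```

Doubling the quote is the standard way to embed a single quote in an SQL string literal, so a description containing an apostrophe no longer breaks the statement.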

Expected result: I expect to extract the price, the listing link, the description, and the square meters from each listing on the web page.

I also tried the same xidel call with the container class changed to "items-container items-list":

        //div[contains(@class, "items-container items-list")] ! string-join(
            (
                ( "price=" || normalize-space(.//span[contains(@class, "item-price")]/text()[1]) ),
                ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
                ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
                ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
            ),
            codepoints-to-string(9)
        )

This also returned nothing.

What I have tried:

I also tried passing the URL directly to xidel, restricting the search to the main element and matching the full class names:
xidel_output=$(xidel  --xpath '
     //main//div[contains(@class, "item-info-container ")] ! string-join(
          (
              ( "price=" || normalize-space(.//span[contains(@class, "item-price h2-simulated")]/text()[1]) ),
              ( "size="  || normalize-space(.//span[contains(@class, "item-detail") and contains(text(), "m2")]) ),
              ( "link="  || normalize-space(.//a[contains(@class, "item-link")]/@href) ),
              ( "desc="  || normalize-space(.//p[contains(@class, "ellipsis")]) )
          ),
          codepoints-to-string(9)
      )
  ' "$url")
but this returned nothing either.
Comments
Richard Deeming 10-Jun-24 7:15am    
That URL just returns a 403 Forbidden error. We can't tell you why your XPath isn't working without seeing the relevant parts of the source you're trying to evaluate it against.
GiulioRig 10-Jun-24 9:31am    
For me https://www.idealista.it/vendita-case/fucecchio-firenze/lista-18.htm works, but you can also use another page for testing: https://www.idealista.it/vendita-case/fucecchio-firenze/lista-1.htm
Richard Deeming 10-Jun-24 9:37am    
Exactly the same response - 403 Forbidden.

We can't help you scrape a site that we can't access!
GiulioRig 10-Jun-24 9:38am    
That is very strange. Do you get the same result if you request just the domain, idealista.it?
Richard Deeming 10-Jun-24 9:40am    
Another 403 response, with a "Please enable JS and disable any ad blocker" message.

You cannot scrape a site that only works in a browser.
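
With a 403 response there is no listing markup for any XPath expression to match, and the script above stops silently because the page lacks the "Successiva" link. A minimal sketch of how the script could report the status instead, assuming two helper functions named is_success_status and fetch_page (both hypothetical). It only detects the refusal; it does not get around it:

```shell
#!/bin/bash
# Sketch only: is_success_status and fetch_page are hypothetical helpers.

# User agent string taken from the curl call in the question.
UA="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) Gecko/20100101 Firefox/89.0"

# Return 0 when the given HTTP status code is in the 2xx range.
is_success_status() {
    [[ $1 =~ ^2[0-9][0-9]$ ]]
}

# fetch_page URL OUTFILE
# Save the response body to OUTFILE and fail on any non-2xx status.
fetch_page() {
    local url=$1 out=$2 status
    # -w '%{http_code}' prints the final status code after the transfer;
    # curl reports 000 when no HTTP response was received at all.
    status=$(curl -s -L -A "$UA" -o "$out" -w '%{http_code}' "$url")
    if ! is_success_status "$status"; then
        echo "Request to $url failed with HTTP status $status" >&2
        return 1
    fi
}

# Possible use inside the loop of the original script:
#   fetch_page "$url" htmlcompleto.txt || break
#   html_content=$(< htmlcompleto.txt)
```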

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
