I'm trying to write a program that will download my emails and save them as PDF.

I've encountered a problem with encoding.

I'm using the email and imaplib modules. When I write the file with part.get_payload(decode=True), the resulting HTML file contains literal \u2013 sequences and � characters.

Writing the raw email as HTML works and doesn't show any �, but it also includes the headers of the email message. When I try to get rid of the headers, the � returns. I've also tried changing the encoding to ISO-8859-1, which removes the � but gives me \u2020 and \u2013 instead.
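One way to drop the headers without touching the encoding is to parse the raw bytes returned by IMAP and decode only the HTML part with the charset that part declares. A literal \u2013 in the output is consistent with get_payload(decode=True) being called on a message that was parsed from an already-decoded str: when the payload cannot be encoded as ASCII, CPython's email package falls back to the raw-unicode-escape codec. The following is a minimal sketch under those assumptions, not a confirmed fix; `extract_html` and the sample message are hypothetical names made up for illustration, and the fallback to UTF-8 is an assumption.

```python
import email
from email.message import EmailMessage


def extract_html(raw_bytes):
    """Return the first text/html part as str, decoded with its declared charset."""
    # Parse bytes directly instead of decoding the whole message to str first.
    msg = email.message_from_bytes(raw_bytes)
    for part in msg.walk():
        if part.get_content_type() != "text/html":
            continue
        payload = part.get_payload(decode=True)  # bytes, transfer encoding undone
        charset = part.get_content_charset() or "utf-8"  # assumption: default to UTF-8
        return payload.decode(charset, errors="replace")
    return None


# Hypothetical sample message standing in for data[0][1] from m.fetch().
sample = EmailMessage()
sample["Subject"] = "Your order"
sample.set_content("plain text body")
sample.add_alternative("<p>Delivery 1\u20133 days</p>", subtype="html")

html = extract_html(sample.as_bytes())
print(html)
```

The returned string contains only the HTML body, so it can be written with open(path, "w", encoding="utf-8") without the message headers.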

Removing this line from the HTML solved the problem, until I converted it to PDF:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:asp="remove"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"></meta><meta name="format-detection" content="telephone=no, date=no, address=no, email=no, url=no"></meta><style type="text/css">

When I converted it to PDF, Â and â started appearing in the document.
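Stray Â and â characters are the pattern that typically appears when UTF-8 bytes are read as a single-byte encoding such as Windows-1252. The sketch below reproduces that effect in isolation. It assumes, without confirmation from the code in this post, that the PDF step reads the UTF-8 file with the wrong encoding; the variable names are made up for illustration.

```python
# UTF-8 bytes misread as Windows-1252 produce the kind of characters seen in the PDF.
en_dash = "\u2013"  # EN DASH, three bytes in UTF-8
nbsp = "\u00a0"     # NO-BREAK SPACE, two bytes in UTF-8

misread_dash = en_dash.encode("utf-8").decode("cp1252")  # begins with 'â'
misread_nbsp = nbsp.encode("utf-8").decode("cp1252")     # begins with 'Â'

print(repr(misread_dash))
print(repr(misread_nbsp))

# Reversing the wrong decoding step recovers the original character.
repaired = misread_dash.encode("cp1252").decode("utf-8")
print(repaired == en_dash)
```

If this is what is happening, the file and the converter need to agree on UTF-8 rather than the text being edited after the fact.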

This is the code I wrote:


What I have tried:

import email
import imaplib
import mimetypes
import os

import pdfkit

m = imaplib.IMAP4_SSL('imap.mail.yahoo.com')
m.login('xxxx@yahoo.com', 'xxxxx')

m.select('IL', readonly=True)
resp, data = m.search(None, '(SINCE "01-Jul-2019" BEFORE "29-Oct-2020" SUBJECT "Your order")')

messages = data[0].split()

for item in messages:
    typ, data = m.fetch(item, '(RFC822)')
    # The raw message is decoded to str here and then parsed from that string.
    raw_email = data[0][1].decode("utf-8")
    email_message = email.message_from_string(raw_email)
    to_ = email_message['To']
    from_ = email_message['From']
    subject_ = email_message['Subject']
    date_ = email_message['date']

    # Use the save_path variable; r'save_path' was a literal directory name.
    save_path = os.path.join(os.getcwd(), "emails", date_, subject_)
    if not os.path.exists(save_path):
        print(save_path)
        os.makedirs(save_path)

    counter = 1
    for part in email_message.walk():
        # Multipart containers have no payload of their own to save.
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        content_type = part.get_content_type()
        if not filename:
            ext = mimetypes.guess_extension(content_type)
            if not ext:
                ext = '.bin'
            filename = 'msg-part-%08d%s' % (counter, ext)
        counter += 1
        # Write each part inside the loop so every part is saved, not only the last.
        with open(os.path.join(save_path, filename), 'wb') as fp:
            fp.write(part.get_payload(decode=True))
    pdfkit.from_file('msg-part-00000001.htm', 'test.pdf')
Comments
Gerry Schmitz 29-Oct-20 14:27pm
   
It's punctuation; check your character set.
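To check the character set as suggested, it can help to list what each part of the message actually declares before writing anything to disk. This is a small diagnostic sketch, assuming the raw message is available as bytes; `describe_parts` and the sample message are hypothetical and exist only for illustration.

```python
import email
from email.message import EmailMessage


def describe_parts(raw_bytes):
    """List (content type, declared charset, transfer encoding) for each leaf part."""
    msg = email.message_from_bytes(raw_bytes)
    rows = []
    for part in msg.walk():
        if part.get_content_maintype() == "multipart":
            continue
        rows.append((
            part.get_content_type(),
            part.get_content_charset(),  # None if the part declares no charset
            part.get("Content-Transfer-Encoding"),
        ))
    return rows


# Hypothetical message used only to exercise the helper.
sample = EmailMessage()
sample.set_content("<p>caf\u00e9 \u2013 menu</p>", subtype="html", charset="utf-8")

rows = describe_parts(sample.as_bytes())
for row in rows:
    print(row)
```

If the declared charset differs from the one used when decoding or writing the payload, punctuation such as dashes and curly quotes is usually the first thing to break.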

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
