I'm trying to write a program that will download my emails and save them as PDF.
I've encountered a problem with encoding.
I'm using the email and imaplib modules. When I write the file with part.get_payload(decode=True), I get an HTML file with \u2013 and � in it.
Writing the raw email as HTML works and doesn't show any �, but it also includes the headers of the email message, and stripping the headers makes the � come back. I've tried changing the encoding to ISO-8859-1, which removes the � but gives me \u2020 and \u2013 instead.
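For reference, here is a minimal sketch of the behaviour I think is involved. It assumes the part declares a charset other than UTF-8 (the windows-1252 message below is made up for illustration, not one of my real emails). get_payload(decode=True) only undoes the transfer encoding and returns bytes; the charset declared on the part still has to be applied when turning those bytes into text.

```python
from email import message_from_bytes

# Hypothetical message: an HTML body declared as windows-1252, quoted-printable encoded.
raw = (
    b'Content-Type: text/html; charset="windows-1252"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>10 =96 20</p>\r\n"
)

part = message_from_bytes(raw)

# decode=True reverses quoted-printable/base64 and returns raw bytes.
payload = part.get_payload(decode=True)

# The charset comes from the Content-Type header; fall back to UTF-8 if it is missing.
charset = part.get_content_charset() or "utf-8"

# Guessing the wrong charset yields the replacement character.
wrong = payload.decode("utf-8", errors="replace")

# Using the declared charset yields the intended en dash.
right = payload.decode(charset)

print(repr(wrong))
print(repr(right))
```

If this is what happens in my case, the saved HTML would need to be re-encoded to match the charset named in its own meta tag, for example by writing right.encode("utf-8") when the page declares UTF-8.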
Removing this line from the html solved the problem, until I converted it to PDF: <html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:asp="remove"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8"><meta name="viewport" content="width=device-width, initial-scale=1.0"></meta><meta name="format-detection" content="telephone=no, date=no, address=no, email=no, url=no"></meta><style type="text/css">
When I converted it to PDF, Â and â started appearing in the document.
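The Â and â look like what you get when UTF-8 bytes are read as windows-1252 (or ISO-8859-1). The snippet below is only a sketch of that effect with a made-up string; it is an assumption that this is what the PDF converter does with my file.

```python
# Sample text with non-breaking spaces (U+00A0) around an en dash (U+2013).
text = "price\u00a0\u2013\u00a0total"
utf8_bytes = text.encode("utf-8")

# Reading UTF-8 bytes as windows-1252 turns every multi-byte character
# into two or three separate characters, starting with Â or â.
misread = utf8_bytes.decode("cp1252")
print(misread)

# The damage is reversible as long as no bytes were dropped along the way.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == text)
```

If that is the cause, telling the converter the input encoding may help. pdfkit takes an options dictionary that is passed through to wkhtmltopdf, so something like pdfkit.from_file('msg-part-00000001.htm', 'test.pdf', options={'encoding': 'UTF-8'}) could be worth trying; I have not confirmed that it fixes this case.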
This is the code I wrote:
What I have tried:
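<pre lang="Python">
import email
import imaplib
import mimetypes
import os
import re

import pdfkit


def safe_name(value):
    # Header values contain characters such as ':' and ',' that are not valid in folder names.
    return re.sub(r'[^\w\- ]', '_', str(value)).strip()


m = imaplib.IMAP4_SSL('imap.mail.yahoo.com')
m.login('xxxx@yahoo.com', 'xxxxx')
m.select('IL', readonly=True)
resp, data = m.search(None, '(SINCE "01-Jul-2019" BEFORE "29-Oct-2020" SUBJECT "Your order")')
messages = data[0].split()

for item in messages:
    typ, data = m.fetch(item, '(RFC822)')
    # Parse the raw bytes directly; decoding the whole message as UTF-8 first
    # can corrupt parts that use a different charset.
    email_message = email.message_from_bytes(data[0][1])
    to_ = email_message['To']
    from_ = email_message['From']
    subject_ = email_message['Subject']
    date_ = email_message['date']

    # One folder per message, named after its date and subject.
    save_path = os.path.join(os.getcwd(), "emails", safe_name(date_), safe_name(subject_))
    if not os.path.exists(save_path):
        print(save_path)
        os.makedirs(save_path)

    counter = 1
    for part in email_message.walk():
        # Multipart containers only hold other parts and have no payload of their own.
        if part.get_content_maintype() == "multipart":
            continue
        filename = part.get_filename()
        content_type = part.get_content_type()
        if not filename:
            ext = mimetypes.guess_extension(content_type)
            if not ext:
                ext = '.bin'
            filename = 'msg-part-%08d%s' % (counter, ext)
        counter += 1
        # decode=True undoes base64/quoted-printable and returns bytes in the part's own charset.
        with open(os.path.join(save_path, filename), 'wb') as fp:
            fp.write(part.get_payload(decode=True))

# Convert one saved HTML part to PDF (assumes the file is in the current directory).
pdfkit.from_file('msg-part-00000001.htm', 'test.pdf')
</pre>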
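One more thing that may matter for the folder names: the Subject header can arrive RFC 2047 encoded (for example =?utf-8?q?...?=) rather than as plain text. Below is a rough sketch of decoding it before building the path. The helper names decoded_header and folder_name are hypothetical, and the sample subject is made up.

```python
import re
from email.header import decode_header, make_header


def decoded_header(value):
    """Turn an RFC 2047 encoded header value into a normal string."""
    return str(make_header(decode_header(value)))


def folder_name(value):
    """Replace anything that is not a word character, hyphen or space."""
    return re.sub(r"[^\w\- ]", "_", value).strip()


# Hypothetical encoded subject containing an en dash.
raw_subject = "=?utf-8?q?Your_order_=E2=80=93_shipped?="

subject = decoded_header(raw_subject)
folder = folder_name(subject)

print(subject)
print(folder)
```

The whitelist in folder_name is deliberately strict; it swaps the en dash for an underscore, which is lossy but keeps the path valid on Windows as well.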