Hi all

I use the iTextSharp library to convert HTML to PDF.
My users write Persian sentences in their HTML files, and the library can't convert Persian words.

To resolve this and the right-to-left problem, I use the code below:

Document document = new Document(PageSize.A4, 80, 50, 30, 65);
PdfWriter.GetInstance(document, new FileStream(strPDFpath, FileMode.Create));
document.Open();
ArrayList objects;
document.NewPage();

var stream = new StreamReader(strHTMLpath, Encoding.Default).ReadToEnd();
objects = iTextSharp.text.html.simpleparser.
    HTMLWorker.ParseToList(new StreamReader(strHTMLpath, Encoding.UTF8), styles);
BaseFont bf = BaseFont.CreateFont("c:\\windows\\fonts\\Tahoma.ttf",
                                  BaseFont.IDENTITY_H, true);
for (int k = 0; k < objects.Count; k++)
{
    PdfPTable table = new PdfPTable(1);
    table.RunDirection = PdfWriter.RUN_DIRECTION_RTL;
    var els = (IElement)objects[k];
    foreach (Chunk el in els.Chunks)
    {
        #region set persian font
        iTextSharp.text.Font f2 = new iTextSharp.text.Font(bf, el.Font.Size,
                                                           el.Font.Style, el.Font.Color);
        el.Font = f2;
        #endregion set persian font
        #region Set right to left for persian words
        PdfPCell cell = new PdfPCell(new Phrase(10, el.Content, el.Font));
        cell.BorderWidth = 0;
        table.AddCell(cell);
        #endregion Set right to left for persian words
    }
    //document.Add((IElement)objects[k]);
    document.Add(table);
}
document.Close();
Response.Write(strPDFpath);
Response.ClearContent();
Response.ClearHeaders();
Response.AddHeader("Content-Disposition", "attachment; filename=" + strPDFpath);
Response.ContentType = "application/octet-stream";
Response.WriteFile(strPDFpath);
Response.Flush();
Response.Close();
if (File.Exists(strPDFpath))
{
    File.Delete(strPDFpath);
}


This resolved the right-to-left and Persian conversion problems, but there is another problem.

My code can't parse and convert the content of a table tag used in the HTML file.

For example, here is an HTML file whose content is in Persian:

<html>
<head>
<meta name="charset" content="utf-8" />
</head>
<body>

<p style="text-align: right;"><span style="font-family: tahoma;">سلام<br />
<br />
نامه شماره 1<br />
<br />
<br />
<table cellspacing="1" cellpadding="1" align="center">
    <tbody>
        <tr>
            <td>شماره شناسنامه SHSH</td>
            <td>نام خانوادگيFamily</td>
            <td>نامName</td>
        </tr>
        <tr>
            <td>123456789</td>
            <td>حيدربزرگHeidarbozorg</td>
            <td>سعيدSaeed</td>
        </tr>
        <tr>
            <td>258</td>
            <td>رضاييRezaee</td>
            <td>عليAli</td>
        </tr>
        <tr>
            <td>654987</td>
            <td>علي مردان خانAliMardanKhan</td>
            <td>رضاReza</td>
        </tr>
    </tbody>
</table>
<br />
<br />
مشخصات بالا را دريافت کردم</span></p>

</body></html>



Now the question is: how can I parse an HTML file that has table, div and paragraph tags containing Persian sentences, and convert it to PDF?

There are a few things to check.

What is the charset in the HTML? It should be something like this:
HTML
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />


It's not mandatory to have a text-file BOM matching this charset, but do you have one? (Your text editor should have options such as "Save as UTF-8" or "Save as Unicode"; see the Unicode standard for BOMs.) The constructor of the class System.IO.StreamReader has a parameter detectEncodingFromByteOrderMarks; if it is true, the reader looks at the BOM at the beginning of the file.
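
To illustrate the BOM check, here is a minimal sketch. The class name, the helper ReadAllText and the temporary file name are made up for this example; only StreamReader, File and Encoding come from .NET. It writes a small file as UTF-16 with a BOM, then opens it with UTF-8 as the fallback encoding and lets the reader pick the encoding from the BOM:

```csharp
using System;
using System.IO;
using System.Text;

public static class BomDetectionDemo
{
    // Hypothetical helper: reads a text file, letting a BOM (if present)
    // override the fallback encoding passed to the reader.
    public static string ReadAllText(string path, out Encoding detected)
    {
        using (var reader = new StreamReader(path, Encoding.UTF8,
                                             detectEncodingFromByteOrderMarks: true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only reliable after the first read.
            detected = reader.CurrentEncoding;
            return text;
        }
    }

    public static void Main()
    {
        // Assumed scratch file, used only for this demonstration.
        string path = Path.Combine(Path.GetTempPath(), "bom-demo.html");
        string html = "<p>سلام</p>";

        // Encoding.Unicode is UTF-16 LE; File.WriteAllText emits its BOM.
        File.WriteAllText(path, html, Encoding.Unicode);

        Encoding detected;
        string text = ReadAllText(path, out detected);

        Console.WriteLine(detected.WebName);  // encoding chosen from the BOM
        Console.WriteLine(text == html);      // whether the Persian text survived

        File.Delete(path);
    }
}
```

If the file has no BOM, the reader simply falls back to the encoding given in the constructor, so passing Encoding.UTF8 there is a reasonable default for UTF-8 HTML.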

Why do you read this stream with the default encoding? Look at your line:

C#
var stream = new StreamReader(strHTMLpath, Encoding.Default).ReadToEnd();


This could be a mistake.
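
To show why the encoding matters, here is a small sketch. The class and method names are invented for illustration, and ISO-8859-1 is only a stand-in for a single-byte ANSI code page, which is what Encoding.Default is on the classic .NET Framework. The same UTF-8 bytes are decoded twice, once correctly and once with the wrong encoding:

```csharp
using System;
using System.Text;

public static class EncodingMismatchDemo
{
    // Hypothetical helper: decodes raw bytes with the given encoding.
    public static string Decode(byte[] bytes, Encoding encoding)
    {
        return encoding.GetString(bytes);
    }

    public static void Main()
    {
        string original = "سلام";
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);

        // Stand-in for a single-byte code page (an assumption for this demo).
        Encoding singleByte = Encoding.GetEncoding("iso-8859-1");

        // Decoding with the encoding the bytes were written in round-trips.
        Console.WriteLine(Decode(utf8Bytes, Encoding.UTF8) == original);  // True

        // Decoding with a single-byte encoding turns each byte into its own
        // character, so the Persian text is lost.
        Console.WriteLine(Decode(utf8Bytes, singleByte) == original);     // False
    }
}
```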

Persian is covered by Unicode just like most other languages; processing Persian usually doesn't cause any problems.

—SA
 
Comments
Henry Minute 8-Feb-11 14:36pm    
@SA I just corrected a typo. You had 'charser' instead of 'charset' :)
Sergey Alexandrovich Kryukov 8-Feb-11 14:38pm    
Thank you very much, Henry,
--SA
Sergey Alexandrovich Kryukov 19-Jul-12 11:16am    
Thank you very much, Henry.
--SA
Thank you for your response.
I changed my code to this:

C#
var stream = new StreamReader(strHTMLpath, Encoding.UTF8).ReadToEnd();


and added this header to my HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />


but it still doesn't work :((

My problem is: the data inside the table tag isn't parsed and converted to PDF.
 
 
Hi, I have exactly the same problem.
Could you help me if your problem has been solved?
I'm from Iran.
 
 
Anybody here?
Please help me.
 
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)