UTF-8 encoded XML file/stream processing

Jun 3, 2004

Process a UTF-8 encoded XML file or stream: read group and attribute values; write and delete groups, attributes, values and comments.

Sample Image

Introduction

This DLL provides routines to manipulate UTF-8 encoded XML files. The set is not all-singing-and-dancing, but it is a small, useful collection. Several co-operating executables that share a common UTF-8 encoded XML file can read their own operating parameters from it and set parameters for one another.

Background

Initially, the read functions were implemented to avoid the large overhead of a proprietary interface. From this grew a certain understanding of the mechanism. Write and delete routines were added next, followed by stream routines that let the user program supply and recover the UTF-8 encoded XML data without using disk files, and finally some 'super' (i.e. over-arching) routines to shrink the user's code.

Using the code

VC 6.0 projects: Place XM8DLL.dll in a directory on your PATH. Add the library XM8DLL.lib to the project resources. Add the header XM8calls.h to the project and use the routines it declares.

VB 6.0 projects: Register the XM8DLL.dll with regsvr32. Add the module XM8DLL.bas to the project. Use the public routines therein.

//
// Sample source to produce the above file
//
  XM8_newFile("Order");

  XM8_getFrstGroup("Order",0);
  XM8_newAttPutVal("number","1234");

  XM8_pokeNewGrpPutVal("Date","2000/1/1");
  XM8_newGrpPutVal("Customer","Acme < & > \" ' Ltd");
  XM8_newAttPutVal("ID","1234A");

  XM8_getFrstGroup("Order",0);
  XM8_newGroup("ITEM");
  XM8_newGroup("ITEM");

  XM8_getFrstGroup("ITEM",1);
  XM8_newAttPutVal("ID","01");
  XM8_newGrpPutVal("Part-number","E16-25A");
  XM8_newAttPutVal("warehouse","Warehouse11");
  XM8_getFrstGroup("ITEM",1);
  XM8_pokeNewGrpPutVal("Description","Production-Class Widget A");
  XM8_newGrpPutVal("Quantity","16");

  XM8_getLastGroup("ITEM",1);
  XM8_newAttPutVal("ID","02");
  XM8_newGrpPutVal("Part-number","E23-45B");
  XM8_newAttPutVal("warehouse","Warehouse11");
  XM8_getLastGroup("ITEM",1);
  XM8_pokeNewGrpPutVal("Description","Production-Class Widget B");
  XM8_newGrpPutVal("Quantity","12");

  XM8_writeFile(fileName);
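
The Customer value in the sample contains all five characters that XML reserves (<, &, >, " and '), so it has to be escaped on write and restored on read. The following is only a minimal sketch of that idea, assuming the five predefined XML entities are used. The helper names escapeXmlValue and unescapeXmlValue are hypothetical and are not part of XM8DLL, and the code needs a C++11 compiler rather than VC 6.0.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not an XM8DLL routine): replace the five characters
// that XML reserves with their predefined entity references.
std::string escapeXmlValue(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) {
        switch (c) {
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '&':  out += "&amp;";  break;
            case '"':  out += "&quot;"; break;
            case '\'': out += "&apos;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}

// Hypothetical inverse: turn the five predefined entities back into
// characters in a single left-to-right pass. Other entities and numeric
// character references are copied through unchanged.
std::string unescapeXmlValue(const std::string& text) {
    static const struct { const char* entity; char ch; } table[] = {
        {"&lt;", '<'}, {"&gt;", '>'}, {"&amp;", '&'},
        {"&quot;", '"'}, {"&apos;", '\''}
    };
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        if (text[i] == '&') {
            for (const auto& e : table) {
                const std::size_t len = std::strlen(e.entity);
                if (text.compare(i, len, e.entity) == 0) {
                    out += e.ch;
                    i += len;
                    matched = true;
                    break;
                }
            }
        }
        if (!matched) out += text[i++];
    }
    return out;
}
```

With these helpers, the Customer value above would be stored as Acme &lt; &amp; &gt; &quot; &apos; Ltd and read back unchanged.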

Points of Interest

  • Throughout this article, the acronym UTF means UTF-8.
  • Four conversion routines are also supplied, in two pairs: XM8_UTFtoUCS and XM8_UCStoUTF; XM8_UTF8toUTF16 and XM8_UTF16toUTF8. XM8DLL does not use them internally.
  • After installing the relevant character sets on Windows 2000, I was able to display the Japanese streams.
  • For C/C++ only users, a static library can be built using the workspace and project files provided.
  • The private routines in the XM8DLL.bas module work around C/C++ <-> VB differences:
    • The representation of 'true' (C/C++ 1, VB -1); 'false' is 0 in both.
    • Passing VB string addresses to C/C++ routines.
    • The VB return-string parameter is handled in the DLL.
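
To illustrate the kind of work a UTF-8 <-> UCS conversion pair does, here is a minimal sketch of converting between UTF-8 bytes and 32-bit code points. It is not the XM8DLL implementation: the names appendUtf8, ucsToUtf8 and utf8ToUcs are hypothetical, the signatures of the real XM8_UTFtoUCS and XM8_UCStoUTF routines are not reproduced here, and the decoder assumes well-formed 1-4 byte input.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not an XM8DLL routine): append one code point as
// 1-4 byte UTF-8. Assumes cp fits in 21 bits.
void appendUtf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Encode a sequence of code points as a UTF-8 byte string.
std::string ucsToUtf8(const std::vector<std::uint32_t>& codePoints) {
    std::string out;
    for (std::uint32_t cp : codePoints) appendUtf8(out, cp);
    return out;
}

// Decode UTF-8 into code points. The lead byte gives the sequence length;
// each continuation byte contributes its low 6 bits. Malformed input is
// not detected in this sketch.
std::vector<std::uint32_t> utf8ToUcs(const std::string& bytes) {
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i++]);
        std::uint32_t cp;
        int extra;
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        for (; extra > 0 && i < bytes.size(); --extra, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For example, the two Japanese characters U+65E5 and U+672C encode to the six bytes E6 97 A5 E6 9C AC, and decoding those bytes returns the same two code points.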

History

  • 1.9 Corrections to XMJ_deProfundis.
  • 1.8 XM8_sNew.cpp bug fixed in putThing.
  • 1.7 Encryption using TinyEncryptionAlgorithm (TEA).
    • XM8_crypt_vb.zip - demonstration of TEA applied to XML files.
    • Four encryption routines to implement TEA: XMLteaCryptKey, XMLteaEncrypt, XMLteaEncryptVal and XMLteaDecrypt.
  • 1.6 Default is now 1-4 byte UTF-8, 21-bit Unicode usage.
    • New routine XM8_fullCODE reverts to 1-6 byte UTF-8, 31-bit usage.
  • 1.5 XM8_sNew.cpp new loop routine XM8_deProfundis.
    • Third VB demo. XLS files to XML files.
  • 1.4 XM8_sNew.cpp bug fixed in XM8_newStream.
  • 1.3 handles <, &, >, " and ' within values; both read & write.
    • XM8DLL.bas bug fixed in XM8_UTF8toUTF16.
  • 1.2 handles group to attribute & attribute to attribute white space.
    • What took 661 ms now takes 231 ms.
  • 1.1 XM8 handles ASCII encoded XML files because they are a sub-set of UTF-8. Therefore, XMJ may be replaced by XM8. Because XM8 works internally in UCS, it is about 30% slower than XMJ. Any observations on the code that might recover this loss will be much appreciated.
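
The difference between the 1-4 byte and 1-6 byte forms mentioned under version 1.6 comes down to payload bits: a four-byte sequence carries 3 + 6 + 6 + 6 = 21 bits, and a six-byte sequence carries 1 + 5 × 6 = 31 bits. The sketch below shows that arithmetic only. The function utf8Length and its allowExtended flag are hypothetical and say nothing about how XM8_fullCODE is actually implemented.

```cpp
#include <cstdint>

// Hypothetical helper (not an XM8DLL routine): number of UTF-8 bytes needed
// to store a value, or 0 if the value does not fit in the selected form.
//   allowExtended == false : 1-4 byte form, up to 21 payload bits
//   allowExtended == true  : original 1-6 byte form, up to 31 payload bits
int utf8Length(std::uint32_t value, bool allowExtended) {
    if (value < 0x80)        return 1;  // 7 bits
    if (value < 0x800)       return 2;  // 11 bits
    if (value < 0x10000)     return 3;  // 16 bits
    if (value < 0x200000)    return 4;  // 21 bits
    if (!allowExtended)      return 0;  // too large for the 1-4 byte form
    if (value < 0x4000000)   return 5;  // 26 bits
    if (value < 0x80000000u) return 6;  // 31 bits
    return 0;                           // more than 31 bits never fits
}
```

Note that current Unicode stops at U+10FFFF, which already fits in four bytes; the five- and six-byte forms only matter for data written under the older definition.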