Hello,
I have 200 XML files. Each one describes pathways (something like network information), and each pathway consists of entities with some attributes. How can I create a text file for each XML file that contains only the name attributes of all the entities inside that XML file? The XML files are in this format:

XML
<?xml version="1.0" ?>
<!DOCTYPE pathway (View Source for full doctype...)>
<!--  Creation date: Oct 7, 2014 11:01:31 +0900 (GMT+09:00) -->
<pathway name="path:gmx00010" org="gmx" number="00010" title="Glycolysis / Gluconeogenesis">
  <entry id="13" name="gmx:100527532 gmx:100775844 gmx:100778363 gmx:100786504 gmx:100792394 gmx:100795446 gmx:100798677 gmx:100802732 gmx:100815070 gmx:100818383 gmx:100818915 gmx:547751" type="gene">
  </entry>
  <entry id="37" name="gmx:100777399 gmx:100778722 gmx:100782019 gmx:100783726 gmx:100784210 gmx:100786773 gmx:100798020 gmx:100798892 gmx:100800699 gmx:100803104 gmx:100808513 gmx:100809812 gmx:100811186 gmx:100811501 gmx:100811891 gmx:100816594 gmx:100817701 gmx:100819197 gmx:547717" type="gene">
  </entry>
  <entry id="38" name="ko:K01905" type="ortholog">
  </entry>
  <entry id="39" name="ko:K00129" type="ortholog">
  </entry>


I want to write a program in Visual C++ that creates a text file with the same name as the XML file. The text file should contain the name attribute values (e.g. gmx:100527532 gmx:100775844 gmx:100778363 gmx:100786504 gmx:100792394 gmx:100795446 gmx:100798677 gmx:100802732 gmx:100815070 gmx:100818383 gmx:100818915 gmx:547751) of all entities with type="gene" and ignore entities of any other type.

Thanks.
Updated 5-Dec-14 22:37pm
Comments
PIEBALDconsult 6-Dec-14 1:04am    
XSLT?
http://www.w3schools.com/xsl/default.asp
barneyman 6-Dec-14 2:03am    
completely!

http://stackoverflow.com/questions/34093/how-to-apply-an-xslt-stylesheet-in-c-sharp


1 solution

There are several ways you can go with this. The most flexible approach is to use XSLT, so that when your requirements change (and they will) you can modify your XSL file and apply it to the data.

A less flexible approach is to use an XmlReader to read the XML a node at a time and hard-code what you are searching for. It's good enough for one-off requirements, but if you can spare the time, learning the basics of XSL and XPath is likely to prove a time saver over the long run. See the links in the comments.

If you are working with gene mapping I suspect that your files are going to be very large, so you will want to avoid solutions that require the whole file to be loaded into memory.

The example code below is C#, but if you're using Visual C++ in .NET the translation should be straightforward.


The XSL route:
There are a number of ways you can apply XSL transforms in .NET. For smaller files you can use code like the following, which takes XML as a string, applies the transform and returns the result as a string. To use it, open your XML file, read the content into a string, apply the transform and write the result to the output file.

C#
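// xmlPath, xslTemplate and outputPath are file paths supplied by the caller.
string xmlToTransform = File.ReadAllText(xmlPath);
string output = ApplyTransform(xmlToTransform, xslTemplate);
// StreamWriter with append = false overwrites an existing file, so there is
// no need to delete it first; "using" flushes and closes the writer.
using (StreamWriter writer = new StreamWriter(outputPath, false, Encoding.Unicode))
{
  writer.Write(output);
}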


ApplyTransform is a method something like the one shown below. There are two problems with this route:
1 - You have to learn XSL and XPath. It's not too difficult, but it does take time and it does have a few gotchas.
2 - Everything is done in memory. This limits the size of file you can work with and can be very slow for larger XML documents.


C#
// Needs these using directives at the top of the file:
// System.IO, System.Text, System.Xml, System.Xml.XPath, System.Xml.Xsl

/// <summary>
/// Apply an XSL transform to a well formed XML string
/// returning the transform output as a string.
/// </summary>
/// <param name="xmlToTransform">Well formed XML as a string.</param>
/// <param name="xslTemplate">Full path to an XSL template file.</param>
/// <returns>The output of the transform.</returns>
public static string ApplyTransform(string xmlToTransform,
                                    string xslTemplate)
{

  XmlReader reader = null;
  XmlWriter writer = null;
  StringWriter sw = new StringWriter();

  try
  {

    // Using a reader allows us to use stylesheets with embedded DTD.
    // DtdProcessing (.NET 4.0+) replaces the obsolete ProhibitDtd property.
    XmlReaderSettings readSettings = new XmlReaderSettings();
    readSettings.DtdProcessing = DtdProcessing.Parse;
    reader = XmlReader.Create(xslTemplate, readSettings);

    // We want the output indented by tag.
    XmlWriterSettings writeSettings = new XmlWriterSettings();
    writeSettings.OmitXmlDeclaration = true;
    writeSettings.ConformanceLevel = ConformanceLevel.Fragment;
    writeSettings.CloseOutput = true;
    writeSettings.Indent = true;
    writeSettings.IndentChars = "  ";
    writeSettings.NewLineChars = System.Environment.NewLine;
    writeSettings.Encoding = Encoding.Unicode;
    writeSettings.CheckCharacters = false;
    writer = XmlWriter.Create(sw, writeSettings);

    // Turn the incoming string into something we can apply
    // a transform to.
    XmlDocument document = new XmlDocument();
    document.LoadXml(xmlToTransform);
    XPathNavigator xpath = document.CreateNavigator();

    // Apply the transform.
    XslCompiledTransform styleSheet = new XslCompiledTransform(true);
    styleSheet.Load(reader);
    styleSheet.Transform(xpath, null, writer, null);

  }
  catch(System.Exception)
  {
    #if DEBUG
    System.Diagnostics.Debugger.Break();
    #endif
    // A bare "throw" rethrows without resetting the stack trace.
    throw;
  }
  finally
  {
    if (reader != null) reader.Close();
    if (writer != null) writer.Close();
  }

  return sw.ToString();

}
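
For the question as asked, the stylesheet itself can be quite small. The following is only a sketch: the stylesheet text, the class name GeneNameTransform and the method name Extract are made up for illustration, and it assumes .NET 4.0 or later for DtdProcessing. It selects every entry element with type="gene" and writes its name attribute on a line of its own. To keep the example self-contained the stylesheet is embedded as a string and the result goes to a TextWriter rather than an XmlWriter, so that the stylesheet's text output method is honoured.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class GeneNameTransform
{
    // Hypothetical stylesheet: one line of output per <entry type="gene">,
    // holding the value of that entry's name attribute.
    public const string Stylesheet = @"<?xml version='1.0'?>
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select=""//entry[@type='gene']"">
      <xsl:value-of select='@name'/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Applies the stylesheet to an XML string and returns the text output.
    public static string Extract(string xml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xslReader);
        }

        // Assumption: the DOCTYPE in the input is not needed, so skip it
        // instead of letting the reader reject or resolve it.
        XmlReaderSettings inputSettings = new XmlReaderSettings();
        inputSettings.DtdProcessing = DtdProcessing.Ignore;

        using (XmlReader input = XmlReader.Create(new StringReader(xml), inputSettings))
        using (StringWriter output = new StringWriter())
        {
            transform.Transform(input, null, output);
            return output.ToString();
        }
    }
}
```

Calling GeneNameTransform.Extract(File.ReadAllText(path)) should return the gene name lists and leave out the ortholog entries. Like ApplyTransform above, this builds the whole document in memory, so the same size caveat applies.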


The "hard coded" route.
This can be as simple as the following:

C#
ExtractToFile(@"c:\someDirectory\geneInfo.xml",
              @"c:\someDirectory\geneInfo.txt",
              "entry", "type", "gene", "name");

ExtractToFile looks like this:

C#
/// <summary>
/// Extract the value of the specified attribute for elements of the
/// specified name where a search attribute has a specific value.
/// </summary>
/// <param name="inFile">full path to source xml</param>
/// <param name="outFile">full path spec of file to create</param>
/// <param name="elementToFind">The element to find in the XML</param>
/// <param name="attributeName">The search/filter attribute</param>
/// <param name="attributeValue">The required search/filter attribute value.</param>
/// <param name="attributeOut">The attribute for which we want the value.</param>
public static void ExtractToFile(string inFile,
                                 string outFile,
                                 string elementToFind,
                                 string attributeName,
                                 string attributeValue,
                                 string attributeOut) {

  // XML is case sensitive, but we're not.
  StringComparison ignoreCase = StringComparison.InvariantCultureIgnoreCase;

  // Decide how often we're going to dump output from buffer to disk.
  int rowCount   = 0;
  int flushCount = 1000;

  // The sample files carry a DOCTYPE, which the default reader settings
  // reject, so tell the reader to skip it (DtdProcessing needs .NET 4.0+).
  XmlReaderSettings settings = new XmlReaderSettings();
  settings.DtdProcessing = DtdProcessing.Ignore;

  // StreamWriter(path) overwrites an existing file, so no delete is needed.
  // We assume the input file exists and that the contents are valid XML.
  using (StreamWriter output = new StreamWriter(outFile))
  using (XmlReader fileReader = XmlReader.Create(inFile, settings)) {

    // An XmlReader instance will work through all the nodes in the XML from the
    // start to the end. All we do is sit and wait for the elements we're
    // interested in to come floating past and deal with them as they do.
    while (fileReader.Read()) {
      if (fileReader.NodeType == XmlNodeType.Element &&
          fileReader.Name.Equals(elementToFind, ignoreCase) &&
          fileReader.HasAttributes) {

        string _find = fileReader.GetAttribute(attributeName);
        string _out  = fileReader.GetAttribute(attributeOut);

        // GetAttribute returns null when the attribute is missing, so use the
        // static Equals and skip elements that lack the output attribute.
        if (_out != null && string.Equals(_find, attributeValue, ignoreCase)) {
          output.WriteLine(_out);
          // Count the rows written so the periodic flush actually fires.
          if (++rowCount == flushCount) {
            rowCount = 0;
            output.Flush();
          }
        }
      }
    }
  }
}
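
Since the question involves 200 files, each needing a text file of the same name, the single call shown earlier could be wrapped in a loop over a directory. This is a rough sketch only: the class name BatchExtract, the method name ConvertAll and the directory path are invented for illustration, and the per-file conversion is passed in as a delegate so the loop does not depend on any particular extraction method.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class BatchExtract
{
    // Runs "convert" once per *.xml file in the directory, pairing each input
    // with an output path that differs only in its .txt extension.
    // Returns the list of output paths that were passed to the converter.
    public static List<string> ConvertAll(string directory,
                                          Action<string, string> convert)
    {
        List<string> outputs = new List<string>();

        foreach (string inFile in Directory.GetFiles(directory, "*.xml"))
        {
            // Same folder, same file name, .txt instead of .xml.
            string outFile = Path.ChangeExtension(inFile, ".txt");
            convert(inFile, outFile);
            outputs.Add(outFile);
        }

        return outputs;
    }
}
```

With the ExtractToFile method above in scope, a call might look like BatchExtract.ConvertAll(@"c:\someDirectory", (inFile, outFile) => ExtractToFile(inFile, outFile, "entry", "type", "gene", "name")).
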
I believe that XmlReader is limited to 2GB files. If your files are larger than this you will have to consider alternative solutions. This article might be a good place to start:

Parse XML at SAX Speed without DOM or SAX
Comments
Member 11290013 8-Dec-14 0:27am    
Thank you very much it works. I appreciate your help thanks :)

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)