There are several ways you can go with this. The most flexible approach is to use XSLT, so that when your requirements change (and they will) you can modify your XSL file and reapply it to the data.
A less flexible approach is to use an XmlReader to read the XML a node at a time and hard-code what you are searching for. It's good enough for one-off requirements, but if you can spare the time, learning the basics of XSL and XPath is likely to prove a time saver over the long run. See the links in the comments.
If you are working with gene mapping, I suspect that your files are going to be very large, so you're going to want to avoid solutions that require the whole file to be loaded into memory.
The example code below is C#, but if you're using Visual C++ in .NET the translation should be straightforward.
The XSL route:
There are a number of ways you can apply XSL transforms in .NET. For smaller files you can use code like the following, which takes the XML as a string, applies the transform and returns the result as a string. To use it, you would open your XML file, read its content into a string, and then write the result out:
// xmlToTransform holds the XML content; xslTemplate is the path (URI) of the XSL file.
string output = ApplyTransform(xmlToTransform, xslTemplate);
// Passing false for append overwrites any existing file, so no File.Delete is needed.
using (StreamWriter writer = new StreamWriter(outputPath, false, Encoding.Unicode))
{
    writer.Write(output);
}
This relies on a method something like the one shown below to apply the transform. There are two drawbacks:
1 - You have to learn XSL and XPath. It's not too difficult, but it does take time and there are a few gotchas.
2 - Everything is done in memory. This limits the size of file you can work with and can be very slow for larger XML documents.
// Requires: using System.IO; using System.Text; using System.Xml; using System.Xml.XPath; using System.Xml.Xsl;
public static string ApplyTransform(string xmlToTransform,
                                    string xslTemplate)
{
    XmlReader reader = null;
    XmlWriter writer = null;
    StringWriter sw = new StringWriter();
    try
    {
        // xslTemplate is treated as a path/URI to the stylesheet, not as its content.
        XmlReaderSettings readSettings = new XmlReaderSettings();
        // On .NET 4.0 and later ProhibitDtd is obsolete;
        // use readSettings.DtdProcessing = DtdProcessing.Parse instead.
        readSettings.ProhibitDtd = false;
        reader = XmlReader.Create(xslTemplate, readSettings);
        XmlWriterSettings writeSettings = new XmlWriterSettings();
        writeSettings.OmitXmlDeclaration = true;
        writeSettings.ConformanceLevel = ConformanceLevel.Fragment;
        writeSettings.CloseOutput = true;
        writeSettings.Indent = true;
        writeSettings.IndentChars = " ";
        writeSettings.NewLineChars = System.Environment.NewLine;
        writeSettings.Encoding = Encoding.Unicode;
        writeSettings.CheckCharacters = false;
        writer = XmlWriter.Create(sw, writeSettings);
        // The whole input document is loaded into memory here.
        XmlDocument dbSchema = new XmlDocument();
        dbSchema.LoadXml(xmlToTransform);
        XPathNavigator xpath = dbSchema.CreateNavigator();
        XslCompiledTransform styleSheet = new XslCompiledTransform(true);
        styleSheet.Load(reader);
        styleSheet.Transform(xpath, null, writer, null);
    }
    catch (System.Exception)
    {
#if DEBUG
        System.Diagnostics.Debugger.Break();
#endif
        // Rethrow with "throw;" so the original stack trace is preserved.
        throw;
    }
    finally
    {
        if (reader != null) reader.Close();
        if (writer != null) writer.Close();
    }
    return sw.ToString();
}
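For a feel of what the stylesheet itself might look like, here is a minimal sketch. It assumes the input uses elements like <entry type="gene" name="..."/> (inferred from the arguments passed to ExtractToFile further down, not from any real schema), and the class and method names are made up for illustration. Unlike ApplyTransform above, it keeps the stylesheet in a string rather than loading it from a file, purely so the example is self-contained.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

// Hypothetical helper, for illustration only.
public static class GeneNameXsltSketch
{
    // Assumed stylesheet: writes the "name" attribute of every
    // <entry type="gene"> element as plain text, one per line.
    // Note that XPath matching is case-sensitive.
    private const string Stylesheet = @"<?xml version=""1.0""?>
<xsl:stylesheet version=""1.0""
    xmlns:xsl=""http://www.w3.org/1999/XSL/Transform"">
  <xsl:output method=""text""/>
  <xsl:template match=""/"">
    <xsl:for-each select=""//entry[@type='gene']"">
      <xsl:value-of select=""@name""/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    public static string ExtractGeneNames(string xml)
    {
        XslCompiledTransform transform = new XslCompiledTransform();
        using (XmlReader xslReader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            transform.Load(xslReader);
        }

        StringWriter result = new StringWriter();
        using (XmlReader xmlReader = XmlReader.Create(new StringReader(xml)))
        {
            // Writing to a TextWriter lets the stylesheet's
            // <xsl:output method="text"/> control the output format.
            transform.Transform(xmlReader, null, result);
        }
        return result.ToString();
    }
}
```

Calling GeneNameXsltSketch.ExtractGeneNames with a small document containing a mix of entry types should return only the names of the gene entries. When the requirement changes, say to a different attribute or element, only the select expressions in the stylesheet need to change.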
The "hard coded" route.
This can be as simple as the following:
ExtractToFile(@"c:\someDirectory\geneInfo.xml",
              @"c:\someDirectory\geneInfo.txt",
              "entry", "type", "gene", "name");
where ExtractToFile looks like this:
// Requires: using System; using System.IO; using System.Xml;
public static void ExtractToFile(string inFile,
                                 string outFile,
                                 string elementToFind,
                                 string attributeName,
                                 string attributeValue,
                                 string attributeOut) {
    StringComparison ignoreCase = StringComparison.InvariantCultureIgnoreCase;
    int rowCount = 0;
    int flushCount = 1000;
    // StreamWriter(path) overwrites an existing file, so no explicit delete is needed.
    using (StreamWriter output = new StreamWriter(outFile))
    using (XmlReader fileReader = XmlReader.Create(inFile)) {
        // Read one node at a time; the file is never loaded as a whole.
        while (fileReader.Read()) {
            if (fileReader.NodeType == XmlNodeType.Element &&
                fileReader.Name.Equals(elementToFind, ignoreCase) &&
                fileReader.HasAttributes) {
                // GetAttribute returns null when the attribute is missing,
                // so compare with the null-safe static string.Equals.
                string _find = fileReader.GetAttribute(attributeName);
                string _out = fileReader.GetAttribute(attributeOut);
                if (string.Equals(_find, attributeValue, ignoreCase)) {
                    output.WriteLine(_out);
                    // Flush every flushCount rows written.
                    if (++rowCount == flushCount) {
                        rowCount = 0;
                        output.Flush();
                    }
                }
            }
        }
    }
}
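If you don't need the case-insensitive matching, the same forward-only idea can be written a little more compactly with XmlReader.ReadToFollowing. The following is only a sketch: the class and method names are made up, the element and attribute names are the ones from the call above, and it reads from a TextReader and returns a list so that it can be tried on a small in-memory sample rather than a real file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, for illustration only.
public static class GeneReaderSketch
{
    public static List<string> ReadGeneNames(TextReader xml)
    {
        List<string> names = new List<string>();
        using (XmlReader reader = XmlReader.Create(xml))
        {
            // ReadToFollowing advances to the next <entry> element
            // (case-sensitive) and returns false at the end of the input.
            while (reader.ReadToFollowing("entry"))
            {
                if (reader.GetAttribute("type") == "gene")
                {
                    // GetAttribute returns null if "name" is missing.
                    string name = reader.GetAttribute("name");
                    if (name != null) names.Add(name);
                }
            }
        }
        return names;
    }
}
```

For a real file you would pass a StreamReader over the file and write each name out as it is found instead of collecting them in a list, so that memory use stays flat however large the input is.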
I believe that XmlReader is limited to 2 GB files. If your files are larger than this, you are going to have to consider alternative solutions. Here might be a good place to start:
Parse XML at SAX Speed without DOM or SAX