Download Cliver.Parser_usage_sample.zip - 163.14 KB
Annotation
This article can help those who need to solve complex parsing tasks that require more than one regular expression.
Problem background: superposed regexes
Often it is impossible to write a single regex that parses exactly what you need from a text. Even when a single regex can be written, it may take a grotesque form that is hard to write and even harder to work with. Hence, to solve such a parsing task, you have to write several regexes that are applied in turn, one after another, to the text or to the results of the previous regex. This leads to the issues listed below.
Anyone who has written code for such parsing knows that debugging superposed regex constructions is a dull pastime. Although there are many regex debugging tools, they are not very handy for superposed regexes because they can debug only one regex at a time. That means you have to intercept the captures of the previous regex in order to debug the regex applied after it. Since debugging should be performed on many matches to gain confidence in a regex, this becomes a real headache. The same problem arises again when the parser has to be updated because new peculiarities were found in the input text, which happens often enough.
Another problem is that the parsing code becomes unreadable and intricate because of the many regexes and the superposed operations on parsing results. The code around the regexes has to conform to the logic they dictate, so in most cases you cannot change the regexes without changing the code, and vice versa. As a result, you lose time and end up with obscure code that is a maintenance nightmare and cannot be reused, because it is designed to do one very specific thing that is bound to change at any moment.
Terminology
First, a little terminology.
Parsed data tree is a tree-like structure returned by
The solution essence
The general objectives of the solution are:
2) Keep the regexes and the parsing process separate from the code where the parsed data is used. It is done with
RegexTreeer
Developing a parser with RegexTreeer involves the following general steps:
Generally, after you have built a regex tree with RegexTreeer, there are 3 ways to use the regexes, as shown in the diagram:
We'll consider the way of using
Cliver.Parser
The general objective of
Example
To understand better how it works, let's consider the following example. Suppose we want to parse company names, addresses and all the information for each staff member (name, phones, mobile, email, etc.) from the text below as separate fields.
Company:
Orange Hotel Inc.
5823 Orange Beach, Honolulu, HI 54365
Staff:
Maria Bronte
Phone 808.373.4559
Mobile 808.306.5183
maria@orangebeach.zzz
John Thompson
Phone 888.343.4259, 888.343.4258
Mobile 888.292.5180
john@orangebeach.zzz
Company:
New Technologies Co.
43 Light River, San Francisco , CA 33456
www.newtechnologies.zzz
Staff:
Benjamin J. Jonson
Mobile 777.233.6367
johnson@newtechnologies.zzz
Victoria Gramm
Phone 777.546.1353
Mobile 777.754.9645
<...>
As we can picture in advance, the parsed data that we want to obtain will be a tree-like structure, which can be represented in a JSON-like form as shown below:
{
  Company:{
    CompanyName:"Orange Hotel Inc.",
    CompanyAddress:"5823 Orange Beach, Honolulu, HI 54365",
    Employee:{
      EmployeeName:"Maria Bronte",
      EmployeePhone:"808.373.4559",
      EmployeeMobile:"808.306.5183",
      EmployeeEmail:"maria@orangebeach.zzz"
    },
    Employee:{
      EmployeeName:"John Thompson",
      EmployeePhone:["888.343.4259","888.343.4258"],
      EmployeeMobile:"888.292.5180",
      EmployeeEmail:"john@orangebeach.zzz"
    }
  },
  Company:{
    CompanyName:"New Technologies Co.",
    CompanyAddress:"43 Light River, San Francisco , CA 33456",
    CompanySite:"www.newtechnologies.zzz",
    Employee:{
      EmployeeName:"Benjamin J. Jonson",
      EmployeeMobile:"777.233.6367",
      EmployeeEmail:"johnson@newtechnologies.zzz"
    },
    Employee:{
      EmployeeName:"Victoria Gramm",
      EmployeePhone:"777.546.1353",
      EmployeeMobile:"777.754.9645"
    }
  }
  <...>
}
To obtain data structured this way, we'll have to apply several superposed regexes (i.e. a regex tree) to the text.
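To make the idea of superposed regexes concrete, here is a minimal hand-written sketch that uses only the standard .NET Regex class, without RegexTreeer or Cliver.Parser. The patterns and the names SuperposedRegexSketch and PhonesByCompany are assumptions made up for this illustration; they are not the regexes stored in Companies.rgx. Level 1 cuts the text into company blocks, level 2 is applied to each block, and level 3 is applied to the captures of level 2.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Illustration only: hypothetical patterns, not the ones from Companies.rgx.
static class SuperposedRegexSketch
{
    // Level 1: one match per company block (the text between "Company:" markers).
    static readonly Regex CompanyBlock = new Regex(
        @"Company:[ \t]*\r?\n(?<Company>.*?)(?=^Company:|\z)",
        RegexOptions.Singleline | RegexOptions.Multiline);

    // Level 2: applied to a company block; the first line is taken as the name.
    static readonly Regex CompanyName = new Regex(@"\A(?<CompanyName>[^\r\n]+)");

    // Level 2: applied to a company block; every line that starts with "Phone".
    static readonly Regex PhoneLine = new Regex(
        @"^Phone[ \t]+(?<Numbers>[^\r\n]+)", RegexOptions.Multiline);

    // Level 3: applied to a phone line; each individual number on it.
    static readonly Regex PhoneNumber = new Regex(@"\d{3}\.\d{3}\.\d{4}");

    // Returns the phone numbers found in each company block, keyed by company name.
    public static Dictionary<string, List<string>> PhonesByCompany(string text)
    {
        var result = new Dictionary<string, List<string>>();
        foreach (Match company in CompanyBlock.Matches(text))
        {
            string block = company.Groups["Company"].Value;
            string name = CompanyName.Match(block).Groups["CompanyName"].Value.Trim();
            var phones = new List<string>();
            foreach (Match line in PhoneLine.Matches(block))
                foreach (Match number in PhoneNumber.Matches(line.Groups["Numbers"].Value))
                    phones.Add(number.Value);
            result[name] = phones;
        }
        return result;
    }

    static void Main()
    {
        string text =
            "Company:\nOrange Hotel Inc.\n5823 Orange Beach, Honolulu, HI 54365\nStaff:\n" +
            "Maria Bronte\nPhone 808.373.4559\nMobile 808.306.5183\n" +
            "John Thompson\nPhone 888.343.4259, 888.343.4258\n" +
            "Company:\nNew Technologies Co.\n43 Light River, San Francisco , CA 33456\nStaff:\n" +
            "Victoria Gramm\nPhone 777.546.1353\n";

        foreach (KeyValuePair<string, List<string>> pair in PhonesByCompany(text))
            Console.WriteLine(pair.Key + ": " + string.Join(", ", pair.Value));
    }
}
```

Even in this small sketch the loops have to mirror the nesting of the regexes, so changing one regex tends to force a change in the surrounding code. That coupling is what a regex tree kept outside the code is meant to remove.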
Regex tree
The needed regex tree can be built using RegexTreeer. We'll not review the RegexTreeer interface here because it is simple enough. After a brief look at the RegexTreeer Help, you can quickly learn how to build regex trees there. So let's imagine the regex tree has already been created and saved in a regex tree file named Companies.rgx. You can see the RegexTreeer screenshot with the regex tree that was built for our example:
The regex tree is shown in the TreeView control in RegexTreeer's window. The regex engine used is .NET's, so refer to MSDN for the regex syntax.
As we can see from the regex tree diagram, the parsed data will be a tree of named values. This observation leads us to the next section.
Parsed data tree
The view of the regex tree suggests that it would be good to obtain the parsing results as a tree-like structure, so that we can handle the parsed data in our code in a clear and vivid manner. Thus, while iterating through the array of companies, we would get a company's name like Company[i].CompanyName, or an employee's phone like Company[i].Employee[j].EmployeePhone[k].
To clarify this, let's consider the parsed data tree that results from parsing our example text. Below it is represented in a JSON-like form:
{
  Company:[
    {
      value:<capture #1 of Company group>,
      CompanyName:[
        {
          value:<capture #1 of CompanyName group>
        }
        <…GroupCaptures for the remaining captures of CompanyName group…>
      ],
      CompanyAddress:[
        {
          value:<capture #1 of CompanyAddress group>
        }
        <…GroupCaptures for the remaining captures of CompanyAddress group…>
      ],
      CompanySite:[
        {
          value:<capture #1 of CompanySite group>
        }
        <…GroupCaptures for the remaining captures of CompanySite group…>
      ],
      Employee:[
        {
          value:<capture #1 of Employee group>,
          EmployeeName:[
            {
              value:<capture #1 of EmployeeName group>
            }
            <…GroupCaptures for the remaining captures of EmployeeName group…>
          ],
          EmployeePhone:[
            {
              value:<capture #1 of EmployeePhone group>
            }
            <…GroupCaptures for the remaining captures of EmployeePhone group…>
          ],
          EmployeeMobile:[
            {
              value:<capture #1 of EmployeeMobile group>
            }
            <…GroupCaptures for the remaining captures of EmployeeMobile group…>
          ],
          EmployeeEmail:[
            {
              value:<capture #1 of EmployeeEmail group>
            }
            <…GroupCaptures for the remaining captures of EmployeeEmail group…>
          ]
        }
        <…GroupCaptures for the remaining captures of Employee group…>
      ]
    }
    <…GroupCaptures for the remaining captures of Company group…>
  ]
}
In this JSON structure, any element denoted as {…} is a GroupCapture object. Together, the GroupCapture objects form a parsed data tree.
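The shape of such a tree can be sketched with a small class of our own. The CaptureNode class below is a hypothetical stand-in written for this illustration; it is not the real Cliver.GroupCapture and says nothing about how that class is implemented. Each node holds one captured value plus, for every named group, the list of captures found inside it.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for one node of a parsed data tree.
// This is NOT the real Cliver.GroupCapture class.
sealed class CaptureNode
{
    private readonly Dictionary<string, List<CaptureNode>> children =
        new Dictionary<string, List<CaptureNode>>();

    public CaptureNode(string value) { Value = value; }

    // The text captured by the group this node represents.
    public string Value { get; private set; }

    // Adds a child capture under the given group name and returns it.
    public CaptureNode Add(string groupName, string value)
    {
        List<CaptureNode> list;
        if (!children.TryGetValue(groupName, out list))
        {
            list = new List<CaptureNode>();
            children[groupName] = list;
        }
        var child = new CaptureNode(value);
        list.Add(child);
        return child;
    }

    // All captures of the named group under this node; empty when there are none.
    public List<CaptureNode> this[string groupName]
    {
        get
        {
            List<CaptureNode> list;
            return children.TryGetValue(groupName, out list) ? list : new List<CaptureNode>();
        }
    }
}

static class CaptureNodeDemo
{
    static void Main()
    {
        var root = new CaptureNode("");
        CaptureNode company = root.Add("Company", "<company block text>");
        company.Add("CompanyName", "Orange Hotel Inc.");
        CaptureNode employee = company.Add("Employee", "<employee block text>");
        employee.Add("EmployeeName", "John Thompson");
        employee.Add("EmployeePhone", "888.343.4259");
        employee.Add("EmployeePhone", "888.343.4258");

        // Counterpart of Company[0].Employee[0].EmployeePhone[1] from the text.
        Console.WriteLine(root["Company"][0]["Employee"][0]["EmployeePhone"][1].Value);
        // prints 888.343.4258
    }
}
```

Returning an empty list for a missing group, as this sketch does, lets calling code loop over optional data such as CompanySite without a special case.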
Code with Cliver.Parser
Now, we have only to see how
Cliver.Parser company_parser = new Parser("../../_config_files/Companies.rgx");

/// <summary>
/// Process the page by Cliver.Parser
/// </summary>
/// <param name="page">text to be parsed</param>
void process_company_list(string page)
{
    Cliver.GroupCapture gc = company_parser.Parse(page);
    foreach (Cliver.GroupCapture company in gc["Company"])
    {
        Console.WriteLine("\n\n>>>>>>>Company:>>>>>>>");
        Console.WriteLine("Name: " + company.FirstValueOf("CompanyName"));
        Console.WriteLine("Address: " + company.FirstValueOf("CompanyAddress"));
        Console.WriteLine("Site: " + company.FirstValueOf("CompanySite"));
        foreach (Cliver.GroupCapture employee in company["Employee"])
        {
            Console.WriteLine("\n-------Employee:-------");
            Console.WriteLine("Name: " + employee.FirstValueOf("EmployeeName"));
            // an employee can have more than one phone in our sample, so we enumerate them in a loop
            foreach (string phone in employee.ValuesOf("EmployeePhone"))
            {
                Console.WriteLine("Phone: " + phone);
            }
            Console.WriteLine("Mobile: " + employee.FirstValueOf("EmployeeMobile"));
            Console.WriteLine("Email: " + employee.FirstValueOf("EmployeeEmail"));
        }
    }
}
As you can see, any parsed value can be accessed by a name path, such as a certain employee's phone:
Using unnamed groups
Note that a parsed data tree contains only captures of named groups. Captures of unnamed groups are not placed in the parsed data tree, although they still participate in the parsing process. That means that if you leave a certain group unnamed, the captures of the next regex, which is applied to the captures of the unnamed group, are collected into one array. In our example, leaving the groups of regex #1.1 unnamed means that all captures of regex #1.1.1 will be placed into one array, without distinguishing which capture of group $1 of regex #1.1 they were parsed from. (Of course, the same can be said about regex #1.1.2.) We can do so because the captures of regex #1.1 are not end data used in the code and, as expected, regex #1.1 has only one match within each company. Thus, by leaving its groups unnamed, we only made the reference path to the data shorter by one name.
Tip: use simple regexes
Regular expressions are a flexible and powerful enough language that in many cases one regex can be written instead of two or more. However, such effort usually leads to unreadable, uneditable code that is hypersensitive to deviations in the parsed text. So do not try to use complex regexes; instead, use a tree of regexes that are as simple as possible. This approach gives a clear logic of data hierarchy that will save development time. In most cases it also gives the best performance.
The conclusion
This article is only an outline of using RegexTreeer+ For those who are interested, the sources of RegexTreeer and
Using the Code
In the attached code you can find:
The latest
Issues
If anybody wants to help with any of these issues, please contact me. Be happy!