|
Hi all,
I am having a little trouble with Regex.
Basically I have a text file that I am trying to parse.
In this file I know there will be a number of names. Each name always occupies 100 bytes; if the name, in plain ASCII, is shorter than 100 bytes, the remainder of the space is filled with 0xFF.
My regex code is as follows.
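using System.IO;
using System.Text.RegularExpressions;

// Read the whole file into a string.
StreamReader sr = new StreamReader("file.txt");
string text = sr.ReadToEnd();
sr.Close();
// Pattern under test: alphanumerics, or a run of 0xFF padding.
string pattern = @"[A-Za-z0-9]|\xFF{100}";
Regex regex = new Regex(pattern);
foreach (Match match in regex.Matches(text))
{
}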
I think the problem is with the \xFF part of the pattern, as a simple regex such as @"\xFF" returns nowhere near as many 0xFF values as are present in the file.
Thanks for any help.
If only MySelf.Visible was more than just a getter...
A person can produce over 5 times there own body weight in excrement each year... please re-read your questions before posting
|
|
|
|
|
How about something like:
@"[A-Za-z0-9\xFF]{100}";
(Untested)
And what about whitespace?
|
|
|
|
|
Thanks for your reply,
I have tried your suggestion too and get the same results. Whitespace characters and basic symbols may be present as well, but for now I'm just looking to get this one working, as I am certain the hex value is what is causing me problems.
|
|
|
|
|
Hi,
ReadToEnd reads data as text, i.e. it applies an encoding to interpret the bytes and returns a Unicode string. Are you sure your text still holds characters with the value \u00FF?
I suggest you print out a few hundred chars in hex and check.
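A minimal sketch of such a hex dump, in case it helps. The class and method names (CharDump, ToHex) are made up for illustration; the idea is simply to print each char of the decoded string as a 4-digit hex value so you can see what ReadToEnd really produced.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of the original code.
static class CharDump
{
    // Formats the first `count` chars of `text` as 4-digit hex code units,
    // separated by spaces.
    public static string ToHex(string text, int count)
    {
        var sb = new StringBuilder();
        int n = Math.Min(count, text.Length);
        for (int i = 0; i < n; i++)
        {
            if (i > 0) sb.Append(' ');
            sb.Append(((int)text[i]).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

Something like Console.WriteLine(CharDump.ToHex(text, 300)); right after the ReadToEnd call would show whether the padding arrived as 00FF or as something else.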
Luc Pattyn [Forum Guidelines] [My Articles]
- before you ask a question here, search CodeProject, then Google
- the quality and detail of your question reflects on the effectiveness of the help you are likely to get
- use the code block button (PRE tags) to preserve formatting when showing multi-line code snippets
modified on Sunday, June 12, 2011 8:43 AM
|
|
|
|
|
Oh, yeah, I hadn't thought of that.
|
|
|
|
|
Thanks Luc, correct as always.
I printed the chars out as ints and found they were much higher than I was expecting. The thing I found unusual, at least in my opinion, was that FF became 65333, as did a number of other hex values over D8 (maybe lower values too, but I didn't see any others).
Is there any reason why FF should not translate to 00FF?
Anyway, my resulting modification was to loop over the BaseStream and append each byte read to a StringBuilder. It works just as I expect now.
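Roughly, the byte-by-byte approach looks like the sketch below. The names (RawText, FromStream) are made up for illustration, and it assumes each byte should become the char with the same numeric value, which is what makes \xFF in the pattern match again.

```csharp
using System.IO;
using System.Text;

// Hypothetical helper illustrating the byte-by-byte workaround.
static class RawText
{
    // Maps every byte to the char with the same numeric value (0x00-0xFF),
    // so a 0xFF byte in the stream becomes '\u00FF' in the string.
    public static string FromStream(Stream stream)
    {
        var sb = new StringBuilder();
        int b;
        while ((b = stream.ReadByte()) != -1)
        {
            sb.Append((char)b);
        }
        return sb.ToString();
    }
}
```

Usage would be along the lines of: using (FileStream fs = File.OpenRead("file.txt")) { string text = RawText.FromStream(fs); }. Decoding the bytes from File.ReadAllBytes with Encoding.GetEncoding("iso-8859-1") should give the same string in one call, since that encoding also maps each byte to the char of the same value.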
While on the subject, do you know if it is possible to perform a Regex search on the file itself?
I understand that a full string is required to ensure patterns are not split, but it just seems bad to load the whole file into memory, even if it is small in relation to RAM.
Initially I had my own file search that used keywords, with wildcards allowed, directly against the BaseStream, but this only works when there are constant values to identify.
I would have a keyword class and a method that takes a byte value. If that value matches the keyword at index 0, it increments a counter so the next byte passed in is checked against index 1. If the count equals the keyword length, the function returns true (a match) and I can process it.
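As a rough sketch of that keyword class: the name KeywordMatcher, the Push method, the use of -1 as a wildcard, and the restart-on-mismatch behaviour are all assumptions made for the example, not the actual code.

```csharp
using System;

// Hypothetical class sketching the byte-at-a-time keyword matcher.
sealed class KeywordMatcher
{
    private readonly int[] pattern; // byte values; -1 is a single-byte wildcard (assumed convention)
    private int index;              // how many keyword bytes have matched so far

    public KeywordMatcher(int[] pattern)
    {
        if (pattern == null || pattern.Length == 0)
            throw new ArgumentException("pattern must not be empty");
        this.pattern = pattern;
    }

    private bool Matches(int position, byte value)
    {
        return pattern[position] == -1 || pattern[position] == value;
    }

    // Feed one byte; returns true when this byte completes the keyword.
    public bool Push(byte value)
    {
        if (Matches(index, value))
        {
            index++;
        }
        else
        {
            // Naive restart: retry this byte against the start of the keyword.
            index = Matches(0, value) ? 1 : 0;
        }

        if (index == pattern.Length)
        {
            index = 0;
            return true;
        }
        return false;
    }
}
```

Because the caller feeds the bytes, a progress bar can be updated between calls. The naive restart can miss a match that overlaps a failed partial match (for example, searching for "aab" in "aaab"); a proper failure table as in Knuth-Morris-Pratt would be needed to cover that case.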
I just think it would be good for Regex to have a similar option, because at the minute I cannot update a progress bar until the regex has found its matches. Possibly Regex was not meant for file searching and is more for validating string values.
Anyway lol, thanks again.
|
|
|
|
|
Hi musefan,
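When you read text from a file, an encoding is used. Most methods (e.g. File.ReadAllText) let you specify one explicitly; otherwise you get a default implicitly. Note that StreamReader and File.ReadAllText fall back to UTF-8 (after checking for a byte order mark), whereas on the .NET Framework Encoding.Default is the system ANSI code page, which depends on regional settings.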
I expect the regional setting ("code page") to be such that a lot of special characters get encoded in the byte value range [0x80, 0xFF], in order to save bytes in a file (and make it region-dependent!).
As an example, in Western Europe the default code page is 1252, which puts the Euro sign at 0x80, although there also is a Unicode character for it (0x20AC). So by default, reading a file containing 0x80 will result in a string with a 0x20AC at that position.
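To see the effect of the encoding on a 0xFF byte, here is a small sketch. The helper name (DecodeDemo.CodeUnits) is made up; it decodes the same bytes with a given encoding and returns the resulting char values as ints.

```csharp
using System;
using System.Text;

// Hypothetical helper for comparing how encodings decode the same bytes.
static class DecodeDemo
{
    // Decodes `data` with `encoding` and returns each resulting char as an int.
    public static int[] CodeUnits(byte[] data, Encoding encoding)
    {
        string s = encoding.GetString(data);
        int[] result = new int[s.Length];
        for (int i = 0; i < s.Length; i++)
        {
            result[i] = s[i];
        }
        return result;
    }
}
```

For the bytes { 0x41, 0xFF }, Encoding.UTF8 yields 65 and 65533: 0xFF is never valid in UTF-8, so the decoder substitutes U+FFFD, the replacement character, which is probably the large value reported earlier in the thread. Encoding.GetEncoding("iso-8859-1") yields 65 and 255, i.e. the 00FF that was expected.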
No, Regex does not work on files or streams, it needs a string.
No, progress indication is not really possible when you ask for a bulk operation such as File.ReadAllLines, a Regex operation, an SQL database operation, etc. Progress indication is available only when you implement it, which implies there are many steps in the job, possibly forcing you to cut the job in small steps (and forego the big methods such as ReadAllLines, and all Regex stuff).
I'm not a very big fan of Regex for your needs; I would have coded that with direct byte or char manipulations on the stream. I'm not sure whether I would choose bytes or chars though, probably bytes, since your file does not really qualify as text due to the 0xFF stuff.
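A sketch of what the byte-based version could look like for one name field. It assumes you already know the offset where a 100-byte field starts (the thread does not say how the fields are located), and the names NameField and Read are made up for the example.

```csharp
using System;
using System.Text;

// Hypothetical helper; field size and padding value come from the original post.
static class NameField
{
    public const int FieldSize = 100;
    public const byte Padding = 0xFF;

    // Decodes the ASCII name stored in the 100-byte field starting at `offset`,
    // stopping at the first 0xFF padding byte.
    public static string Read(byte[] data, int offset)
    {
        if (data == null) throw new ArgumentNullException("data");
        if (offset < 0 || offset + FieldSize > data.Length)
            throw new ArgumentOutOfRangeException("offset");

        int end = Array.IndexOf(data, Padding, offset, FieldSize);
        int length = end < 0 ? FieldSize : end - offset;
        return Encoding.ASCII.GetString(data, offset, length);
    }
}
```

With the bytes from File.ReadAllBytes("file.txt") in hand, NameField.Read(bytes, offset) returns the name without the padding, and no text encoding is applied to the 0xFF bytes at all.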
Luc Pattyn [Forum Guidelines] [My Articles]
modified on Sunday, June 12, 2011 8:43 AM
|
|
|
|
|
Thanks for all the info.
I have not really found a need for Regex in the past; I usually resort to manual manipulation for more control. But this time I just thought I should look into the 'out of the box' functions and start getting used to them.
I find that every time I post a theory of how to do something manually, I get knocked for not using ready-made classes and functions, so maybe I need to start. But at the end of the day I will do what I think is best for the task at hand, which may involve changing to manually handling the file as you suggested. The only issue is that my OP example wasn't the complete regex expression; my requirement is more complex, but I can hand code it if it comes to that.
Thanks
|
|
|
|
|
musefan wrote: i get knocked for being wrong for not using ready made class and functions
Yes, there are lots of classes and methods that offer a compact solution for small problems, but in the end they typically aren't the best solution because they may take long to execute, consume much memory, and provide no feedback.
Regex is a powerful tool, but it still is a tool; you have to know it exists and use it when you feel it is the right approach for you. If the regex expression is so complex you can hardly understand it, then IMO you are better off coding something yourself, so you know what it does, how it does it, and you can debug and watch it.
Luc Pattyn [Forum Guidelines] [My Articles]
modified on Sunday, June 12, 2011 8:44 AM
|
|
|
|
|
Try using a binary reader instead of a text reader?
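For example, a quick way to check how many 0xFF bytes the file really holds, without any text decoding getting in the way. This is only a sketch; the names ByteCounter and Count and the 4096-byte chunk size are arbitrary choices for the example.

```csharp
using System.IO;

// Hypothetical helper that counts raw byte values with a BinaryReader.
static class ByteCounter
{
    // Counts how often `target` occurs in the stream, reading it in chunks.
    // The stream is closed when the reader is disposed.
    public static long Count(Stream stream, byte target)
    {
        long count = 0;
        using (var reader = new BinaryReader(stream))
        {
            byte[] buffer;
            // ReadBytes returns an empty array once the end of the stream is reached.
            while ((buffer = reader.ReadBytes(4096)).Length > 0)
            {
                foreach (byte b in buffer)
                {
                    if (b == target) count++;
                }
            }
        }
        return count;
    }
}
```

ByteCounter.Count(File.OpenRead("file.txt"), 0xFF) then gives a number to compare against what the regex finds.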
|
|
|
|
|