The subject line is pretty much the question. I have a regular expression that works in some cases, but at times it returns extra, undesired text. I am hoping there is a better or more precise expression I can use to capture the SSDEEP string in question.


Here is the HTML I want to capture the string from:

HTML
<div class="floated-field-value">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>


The regular expression I am working on looks like this:

VB
Dim SSDEEP As New Regex("(?<=<div class=""floated-field-value"">)([^\""]+)(</div>)", RegexOptions.IgnoreCase)



The closest I can get still leaves
HTML
</div>
on the end of the string, so I tried to exclude the closing div from the string with some code:

VB
For X = 0 To RichTextBox3.Lines.Length - 2
    Dim MyString As String = RichTextBox3.Lines(X).ToString
    Label28.Text = MyString
Next



I hope this is enough for someone to help me.
Thank you in advance!
Comments
ledtech3 8-Aug-13 21:37pm    
Are you trying to get the whole line or just the value ?
I forgot about this article:
http://www.codeproject.com/Articles/9099/The-30-Minute-Regex-Tutorial
It has a listing in it for HTML tags.
Draco2013 9-Aug-13 14:29pm    
I am trying to only get the SSDEEP string from the html tag
ledtech3 9-Aug-13 14:38pm    
I was trying last night to get one to work and the best I got so far was to return the entire string.
The sample provided that is supposed to only return what is between the tags is not returning anything when dropped into a sample application.
I'm still trying to see what works.
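
For what it's worth, one way to return only what is between the tags is to keep the pattern from the question and read the first capture group instead of the whole match. The lookbehind keeps the opening tag out of the match, but the closing tag is still matched by the second group, so `Match.Value` ends in `</div>`. The sketch below assumes the text holds a single such div; the module name and the `ValueBetweenTags` helper are made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module TagCaptureDemo
    ' Pattern from the question, unchanged. Group 1 is the value, group 2 is "</div>".
    Private Const TagPattern As String = "(?<=<div class=""floated-field-value"">)([^\""]+)(</div>)"

    ' Hypothetical helper: returns only capture group 1, or "" when nothing matches.
    Function ValueBetweenTags(html As String) As String
        Dim m As Match = Regex.Match(html, TagPattern, RegexOptions.IgnoreCase)
        Return If(m.Success, m.Groups(1).Value, String.Empty)
    End Function

    Sub Main()
        ' Input line copied from the question.
        Dim html As String = "<div class=""floated-field-value"">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>"

        Dim m As Match = Regex.Match(html, TagPattern, RegexOptions.IgnoreCase)
        Console.WriteLine(m.Value)                 ' whole match, still ends in </div>
        Console.WriteLine(ValueBetweenTags(html))  ' group 1 only, without the tag
    End Sub
End Module
```

If a line can hold more than one div, the greedy `[^\"]+` may run past the first closing tag, so this is only a starting point.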

1 solution

Ok got it.
Instead of trying to get the tags, get the pattern of the data.

Input string:
<div class="floated-field-value">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>


Regx1 used:
((\d{3}):(\w*)\+(\w*):(\w*))

Regx2 used:
((\d{3}):(\w*):(\w*)|(\d{3}):(\w*)\+(\w*):(\w*))

Regx3 used:
((\d*):(\w*):(\w*)|(\d*):(\w*)\+(\w*):(\w*))


Output:
768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo
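
A small test of Regx1 against the input string might look roughly like this. It is only a sketch of a console app; the module name and the `FindWithRegx1` helper are invented for the example and are not from the original test app.

```vb
Imports System
Imports System.Text.RegularExpressions

Module Regx1Demo
    ' Regx1 exactly as listed. A VB string literal needs no backslash doubling.
    Private Const Regx1 As String = "((\d{3}):(\w*)\+(\w*):(\w*))"

    ' Hypothetical helper: returns the first match, or "" when nothing matches.
    Function FindWithRegx1(text As String) As String
        Dim m As Match = Regex.Match(text, Regx1)
        Return If(m.Success, m.Value, String.Empty)
    End Function

    Sub Main()
        ' Input line copied from the question.
        Dim html As String = "<div class=""floated-field-value"">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>"

        Dim m As Match = Regex.Match(html, Regx1)
        If m.Success Then
            Console.WriteLine(m.Value)            ' the whole SSDEEP string, no tags
            Console.WriteLine(m.Groups(2).Value)  ' the three leading digits
            Console.WriteLine(m.Groups(5).Value)  ' the part after the last colon
        Else
            Console.WriteLine("no match")
        End If
    End Sub
End Module
```

In .NET, `Match.Value` is always the full match, so the whole string comes back whether or not the pattern is wrapped in an extra pair of parentheses; the outer group only makes the same text available as `Groups(1)`.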



The two outer "()" contain the whole search pattern. I am not sure whether they are needed when parsing a site.

"(\d{3})" looks for three digits
":" that character next
"(\w*)" a word of any length (letters, digits, or underscore)
"\+" escapes the plus, so it looks for a literal plus sign next
"(\w*)" a word of any length (letters, digits, or underscore)
":" that character next
"(\w*)" last word to extract

That's it. Like I said, I'm not sure how it would work on a real site.
It should work as long as all data values contain a "+"; otherwise it would need to be modified for that type, for example with an "Or" alternative that doesn't use the "+" but keeps almost everything else the same.
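
The "Or" idea can be sketched with the `|` alternation that Regx2 uses: one branch without the "+", one with it. The second input below is made up purely to exercise the branch without a plus sign; it is not a real SSDEEP value, and the helper name is hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module OptionalPlusDemo
    ' Regx2 as listed: the first branch has no "+", the second branch has one.
    Private Const Regx2 As String = "((\d{3}):(\w*):(\w*)|(\d{3}):(\w*)\+(\w*):(\w*))"

    ' Hypothetical helper: returns the first match, or "" when nothing matches.
    Function FindWithRegx2(text As String) As String
        Dim m As Match = Regex.Match(text, Regx2)
        Return If(m.Success, m.Value, String.Empty)
    End Function

    Sub Main()
        ' Real input from the question (contains a plus sign).
        Dim withPlus As String = "<div class=""floated-field-value"">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>"
        ' Made-up input without a plus sign, for illustration only.
        Dim withoutPlus As String = "<div class=""floated-field-value"">768:abcDEF123:xyz789</div>"

        Console.WriteLine(FindWithRegx2(withPlus))     ' matched by the second branch
        Console.WriteLine(FindWithRegx2(withoutPlus))  ' matched by the first branch
    End Sub
End Module
```

The regex engine tries both branches at each starting position, so the order of the two branches does not stop the second one from matching when the first fails.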

It does work in a small test app.
I hope this is not your homework :)
EDIT:
After looking up what SSDEEP is, I tested the other two Regx patterns I added.
The second one catches the value whether the "+" is there or not.
The third one: after reviewing SSDEEP, the first section could be longer than 3 characters, so I changed it to accept any number of digits.
As best I can tell, the two outer "()" would need to be there to match the entire pattern.
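
One more possible variation, offered only as a sketch: the three patterns above allow at most one "+" and no "/" because "\w" does not match either character. Assuming an SSDEEP value is a run of digits followed by two colon-separated parts drawn from letters, digits, "+" and "/" (that is my understanding of the format, so treat it as an assumption), a character class can cover all of those cases in one pattern. The pattern, the helper name, and the second test string are all made up for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BroadPatternDemo
    ' Assumed shape: digits, colon, base64-style text, colon, base64-style text.
    Private Const BroadPattern As String = "\d+:[A-Za-z0-9+/]+:[A-Za-z0-9+/]+"

    ' Hypothetical helper: returns the first match, or "" when nothing matches.
    Function FindBroad(text As String) As String
        Dim m As Match = Regex.Match(text, BroadPattern)
        Return If(m.Success, m.Value, String.Empty)
    End Function

    Sub Main()
        ' Real input from the question.
        Dim html As String = "<div class=""floated-field-value"">768:pHC0p5mwel+twV39TD8mRF5rKJZsF6No2:o0p5mwelJ9TD8mv5ImGo</div>"
        ' Made-up value with several "+" and "/" characters, for illustration only.
        Dim madeUp As String = "hash=12288:ab+cd/ef+gh:ij/kl;"

        Console.WriteLine(FindBroad(html))
        Console.WriteLine(FindBroad(madeUp))
    End Sub
End Module
```

The match stops at the "<" of the closing tag because "<" is not in the character class, so no tag text is picked up.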
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


