Though not in our day to day programming, but sometime we may need to parse the HTML text for extracting the data. Recently, I encountered the same and henceforth I thought of sharing this with you all.
BODY tags or there may not be any
BODY tags. In either case, only portions from within the body tags need to be parsed.
<html><head><title></title><body><a href="#">Extract text from here</a></body></html>
<a href="#">Extract text from here</a>
Output: Extract text from here
Program / Sql Query
DECLARE @OriginalStr VARCHAR(MAX)
DECLARE @NewStr VARCHAR(MAX)
DECLARE @StartIndex INT
DECLARE @EndIndex INT
DECLARE @Noise TABLE(Noise VARCHAR(100),ReplaceChars VARCHAR(10))
An example of parsing html at <a href="%22http://www.codeproject.com%22">codeproject.com</a>' -- original data
SET @OriginalStr = '
Click the button <input type="button" value="some button" onclick="test();">
I am followed by a BLANK SPACE and then a terminator
It''s a new line
Fonts are nice features
INSERT INTO @Noise(Noise,ReplaceChars)
SELECT ' ',space(1) UNION ALL SELECT '>',space(1) UNION ALL
SELECT '<',space(1) UNION ALL SELECT '
',space(2) UNION ALL
SELECT ' ',space(2)
-- End of initialisation
--Step 1: Extract text between and
SELECT @StartIndex= CHARINDEX('', @OriginalStr)
SELECT @EndIndex= CHARINDEX('',@OriginalStr)
SELECT @OriginalStr =
CASE WHEN(@StartIndex = 0 or @EndIndex = 0) THEN @OriginalStr
((@EndIndex - @StartIndex) + 7 )) END-- Adding 7 since length of = 7
--Step 2: Extract the relevant text between the tags
--Generate a number table
;WITH Num_Cte AS
SELECT 1 AS rn
SELECT rn +1 AS rn
WHERE rn <= LEN(@OriginalStr)
-- Shred into individual characters
, Get_Individual_Chars_Cte AS
rn AS Id
CROSS APPLY( SELECT SUBSTRING(@OriginalStr,rn,1) AS chars) SplittedChars
--Combine the characters again to obtain the extracted text between the tags
SELECT @NewStr = ExtractedText FROM (
SELECT CAST(chars AS VARCHAR(MAX)) FROM (
--Get the characters between the tags
SELECT a.* FROM Get_Individual_Chars_Cte a
--Find the position of the characters between the tags
SELECT rn FROM Num_Cte EXCEPT
FROM Num_Cte n
-- Find the start and end range of characters
-- between < and > for all tags
Id AS StartRange
,(SELECT TOP 1 Id
WHERE Id > a.Id
ORDER BY Id) AS EndRange
FROM Get_Individual_Chars_Cte a
ON n.rn BETWEEN X.StartRange AND X.EndRange)
ON a.Id = X.rn
FOR XML PATH('')
--Remove the noises
SELECT @NewStr = REPLACE(@NewStr, Noise, ReplaceChars) FROM @Noise
SELECT ParsedText = @NewStr
Click the button hello world I am followed by a BLANK SPACE and then a terminator It's a new line Fonts are nice features Nice link...... www.codeproject.com
First we need to check if the HTML given has any or not and that will be the input source.
The very next step after that is to pick up the relevant text.
For accomplishing this, we will take the help of a number table that has been created on the fly using a recursive cte.
Next split the data into individual characters. e.g. CROSS APPLY( SELECT SUBSTRING(@OriginalStr,rn,1) AS chars) SplittedChars.
The next step is to find the start and end range of characters between < and > for all tags.
Once this task is accomplished, the next step is to find the position of the characters between the tags and combine the characters again (by using FOR XML PATH('')) to obtain the extracted text between the tags.
After combining the characters, we will encounter some junk characters like ' ','>' etc.
For that reason, we will keep a noise table where the noises (let's call the junk characters so) will be replaced by one space. Once done, we will achieve our goal.
Points of Interest
This article showed a way of parsing data using the SET BASED approach. Also it shows us how we can replace data in one single Replace statement instead of multiple. The usage of FOR XML PATH in combining the characters is really handy as substantiated from this article.
- 12th January, 2010: Initial post