Click here to Skip to main content
15,879,239 members
Articles / Database Development / SQL Server

HTML Parser In SQL SERVER 2005 (Set Based Approach)

Rate me:
Please Sign up or sign in to vote.
4.50/5 (8 votes)
12 Jan 2010CPOL2 min read 21.9K   10   1
An HTML parser for parsing only the HTML body content

Introduction

Though not in our day to day programming, but sometime we may need to parse the HTML text for extracting the data. Recently, I encountered the same and henceforth I thought of sharing this with you all. 

Background

Recently one of my colleagues ran into a problem. His requirement is that he is getting some HTML text from where he needs to extract the relevant text. After Googling, I found some answers SQL SERVER – 2005 – UDF – User Defined Function to Strip HTML – Parse HTML – No Regular Expression but that won't meet the requirement. Henceforth, I have created one for him. At the same time, he has one more challenge. The HTML text can be anything i.e. it can have JavaScript functions, style tags, etc. Moreover, it can be embedded within the BODY tags or there may not be any BODY tags. In either case, only portions from within the body tags need to be parsed. 

e.g.

Input(A):

<html><head><title></title><body><a href="#">Extract text from here</a></body></html>

Input(B) 

<a href="#">Extract text from here</a> 

Output: Extract text from here

Program / Sql Query

SQL
--Variable Declarations

DECLARE @OriginalStr VARCHAR(MAX)
DECLARE @NewStr VARCHAR(MAX)
DECLARE @StartIndex INT
DECLARE @EndIndex INT
DECLARE @Noise TABLE(Noise VARCHAR(100),ReplaceChars VARCHAR(10))

--End of Declaration

--Variable initialisation

--SET @OriginalStr = 'Niladri Biswas  at www.codeproject.com. 
An example of parsing html at <a href="%22http://www.codeproject.com%22">codeproject.com</a>' -- original data
SET @OriginalStr = '
 
 Click the button <input type="button" value="some button"  onclick="test();">
 hello world  
I am followed by a BLANK SPACE   and then a terminator

It''s a new line

Fonts are nice features

Nice link......
<a href="%22http://www.codeproject.com%22">www.codeproject.com</a>
 
'

INSERT INTO @Noise(Noise,ReplaceChars)
SELECT ' ',space(1) UNION ALL SELECT '>',space(1) UNION ALL
SELECT '<',space(1) UNION ALL SELECT '

',space(2) UNION ALL
SELECT ' ',space(2)

-- End of initialisation

--Step 1: Extract text between  and 
SELECT  @StartIndex= CHARINDEX('', @OriginalStr)
SELECT  @EndIndex= CHARINDEX('',@OriginalStr)
SELECT @OriginalStr =
	CASE WHEN(@StartIndex = 0 or @EndIndex = 0) THEN @OriginalStr
	ELSE 
	SUBSTRING(@OriginalStr,@StartIndex ,_
	((@EndIndex - @StartIndex) + 7 )) END-- Adding 7 since length of  = 7
--Step 2: Extract the relevant text between the tags

--Generate a number table
;WITH Num_Cte AS
(     
      SELECT 1 AS rn
      UNION ALL
      SELECT rn +1 AS rn 
      FROM Num_Cte 
      WHERE rn <= LEN(@OriginalStr)
)
-- Shred into individual characters
, Get_Individual_Chars_Cte AS
( 
      SELECT  
			rn AS Id
			,chars 			     
      FROM Num_Cte
      CROSS APPLY( SELECT SUBSTRING(@OriginalStr,rn,1)  AS chars) SplittedChars       
)
--Combine the characters again to obtain the extracted text between the tags
SELECT @NewStr = ExtractedText FROM (
	SELECT CAST(chars AS VARCHAR(MAX)) FROM (
		--Get the characters between the tags
		SELECT a.* FROM Get_Individual_Chars_Cte a
		INNER JOIN
		(

			--Find the position of the characters between the tags
			SELECT rn FROM Num_Cte EXCEPT
			(SELECT n.rn
			 FROM Num_Cte n 
			 JOIN(
				-- Find the start and end range of characters 
                                     -- between < and > for all tags
				SELECT
					 Id AS StartRange
					,(SELECT TOP 1 Id 
					  FROM Get_Individual_Chars_Cte 
					  WHERE Id > a.Id 
					  AND chars='>' 
					  ORDER BY Id) AS EndRange
				FROM Get_Individual_Chars_Cte a
				WHERE chars='<')X(StartRange,EndRange)
		ON n.rn BETWEEN X.StartRange AND X.EndRange)
	)X(rn)
	ON a.Id = X.rn	
)X(id,chars)
FOR XML PATH('')
)X(ExtractedText)
OPTION(MAXRECURSION 0)

--Remove the noises
SELECT @NewStr = REPLACE(@NewStr, Noise, ReplaceChars) FROM @Noise
SELECT ParsedText = @NewStr

Output

ParsedText

Click the button hello world I am followed by a BLANK SPACE and then a terminator It's a new line Fonts are nice features Nice link...... www.codeproject.com

Explanation

First we need to check if the HTML given has any or not and that will be the input source.

The very next step after that is to pick up the relevant text. 

For accomplishing this, we will take the help of a number table that has been created on the fly using a recursive cte.

Next split the data into individual characters. e.g. CROSS APPLY( SELECT SUBSTRING(@OriginalStr,rn,1) AS chars) SplittedChars.

The next step is to find the start and end range of characters between < and > for all tags. 

Once this task is accomplished, the next step is to find the position of the characters between the tags and combine the characters again (by using FOR XML PATH('')) to obtain the extracted text between the tags. 

After combining the characters, we will encounter some junk characters like '&nbsp;','>' etc. 

For that reason, we will keep a noise table where the noises (let's call the junk characters so) will be replaced by one space. Once done, we will achieve our goal.

Points of Interest

This article showed a way of parsing data using the SET BASED approach. Also it shows us how we can replace data in one single Replace statement instead of multiple. The usage of FOR XML PATH in combining the characters is really handy as substantiated from this article.

History

  • 12th January, 2010: Initial post

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



Comments and Discussions

 
Generalhoho Pin
Jipat12-Jan-10 21:23
Jipat12-Jan-10 21:23 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.