Click here to Skip to main content
15,894,646 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I need take out top 10 words from SQL Server table columns which are repeated most of times.

Col1_Row1 = first columon and first row

Eg.
Col1_Row1 "This is good bike.",
Col1_Row2 "Well, I like bike rides.",
Col2_Row1 "Only Harley can makes real good bike."


Here bike is repeated 3 times, good 2 times and other words 1 times.
So i need out put in table form which contains word and repeat count in it.

May data will grow according to time. Please suggest best approach.
Posted

In addition to Solution 1

- Picking up on your statement
Quote:
May data will grow according to time
There is an article at sqlservercentral that discusses the performance of various string splitting alternatives[^]

- Picking up on Maciej spotting the full stop in the sentence this post[^] suggests a nice method of getting rid of all not alpha characters

- I frequently use the following function to split data as it is reasonably peformant - Note it's not my own work and unfortunately I can't find the original at the moment(if I do, or someone else does then I will accredit the work here)
CREATE FUNCTION [dbo].[fnSplitString] 
( 
    @string NVARCHAR(MAX), 
    @delimiter CHAR(1) 
) 
RETURNS @output TABLE(splitdata NVARCHAR(MAX) 
) 
BEGIN 
    DECLARE @start INT, @end INT 
    SELECT @start = 1, @end = CHARINDEX(@delimiter, @string) 
    WHILE @start < LEN(@string) + 1 BEGIN 
        IF @end = 0  
            SET @end = LEN(@string) + 1
       
        INSERT INTO @output (splitdata)  
        VALUES(SUBSTRING(@string, @start, @end - @start)) 
        SET @start = @end + 1 
        SET @end = CHARINDEX(@delimiter, @string, @start)
        
    END 
    RETURN 
END


- I went down the route of using temporary tables which might not be best for large data tables. I'm including my efforts here only to demonstrate the calls to the functions I've suggested (It also demonstrates an alternative to using CURSOR if you ever get tempted down that route)
create table #temp
(
	rownum int,
	datacol varchar(max)
) ON [PRIMARY]	

insert into #temp
SELECT rownum = ROW_NUMBER() OVER (order by id), datacol
from topwords

DECLARE @wordlist table (word varchar(max))

declare @maxi int 
SELECT @maxi = COUNT(*) FROM topwords

DECLARE @loopCount int = 1
DECLARE @txt varchar(max)
WHILE @loopCount <= @maxi
BEGIN
	SELECT @txt = datacol from #temp1 WHERE rownum = @loopCount
	INSERT INTO @wordlist SELECT [dbo].fnRemoveNonAlpha(splitdata) FROM [dbo].fnSplitString(@txt,' ')
	SET @loopCount = @loopCount + 1
END

select word, count(*) from @wordlist
group by word
order by count(*) desc
 
Share this answer
 
Comments
Maciej Los 3-Jun-15 16:07pm    
+5, Caroline ;)
If you expect data to grow and performance is an issue you should consider adding a full text index on the table.

Then you can query it using either one of these procedures: sys.dm_fts_index_keywords[^], sys.dm_fts_index_keywords_by_document[^] or sys.dm_fts_index_keywords_by_property[^]
 
Share this answer
 
Comments
Maciej Los 3-Jun-15 16:10pm    
+5 Jörgen!
CHill60 3-Jun-15 17:35pm    
+5 - fits the OP requirement of "best approach"
All you need to do is to split sentence into words via using CTE or custom function.

Try this (read comments):
SQL
DECLARE @tmp TABLE (col1 VARCHAR(255), col2 VARCHAR(255))

INSERT INTO @tmp (col1, col2)
VALUES('This is good bike.', 'Only Harley can makes real good bike.'),
('Well, I like bike rides.', NULL)

;WITH CTE AS
(
	--initial part
	--get first word from col1 As word and the rest as remainder
	SELECT LEFT(col1, CHARINDEX(' ', col1)-1) AS word, RIGHT(col1, LEN(col1) - CHARINDEX(' ', col1)) AS remainder
	FROM @tmp
	WHERE CHARINDEX(' ', col1)>0
	UNION ALL
	--get first word from col2 As word and the rest as remainder
	SELECT LEFT(col2, CHARINDEX(' ', col2)-1) AS word, RIGHT(col2, LEN(col2) - CHARINDEX(' ', col2)) AS remainder
	FROM @tmp
	WHERE CHARINDEX(' ', col2)>0
	UNION ALL
	--recursive part
	-- get another words
	SELECT LEFT(remainder, CHARINDEX(' ', remainder)-1) AS word, RIGHT(remainder, LEN(remainder) - CHARINDEX(' ', remainder)) AS remainder
	FROM CTE
	WHERE CHARINDEX(' ', remainder)>0
	UNION ALL
	SELECT remainder AS word, NULL AS remainder
	FROM CTE
	WHERE CHARINDEX(' ', remainder)=0
)
--remove dot and count words ;)
SELECT REPLACE(word, '.', '') As word, COUNT(word) AS CountOfWord
FROM CTE
GROUP BY REPLACE(word, '.', '')
ORDER BY COUNT(word) DESC


For further information, please see: Using Common Table Expressions[^]
 
Share this answer
 
v2

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900