How to get top ten words from SQL Table columns which are occurred many times.

Question

0.00/5 (No votes)

See more:

I need take out top 10 words from SQL Server table columns which are repeated most of times.

Col1_Row1 = first columon and first row

Eg.
Col1_Row1 "This is good bike.",
Col1_Row2 "Well, I like bike rides.",
Col2_Row1 "Only Harley can makes real good bike."

Here bike is repeated 3 times, good 2 times and other words 1 times.
So i need out put in table form which contains word and repeat count in it.

May data will grow according to time. Please suggest best approach.

Posted 3-Jun-15 3:08am

Bajirao_

Add a Solution

3 solutions

Solution 2

In addition to Solution 1

- Picking up on your statement

Quote:
May data will grow according to time

There is an article at sqlservercentral that discusses the performance of various string splitting alternatives[^]

- Picking up on Maciej spotting the full stop in the sentence this post[^] suggests a nice method of getting rid of all not alpha characters

- I frequently use the following function to split data as it is reasonably peformant - Note it's not my own work and unfortunately I can't find the original at the moment(if I do, or someone else does then I will accredit the work here)

CREATE FUNCTION [dbo].[fnSplitString] 
( 
    @string NVARCHAR(MAX), 
    @delimiter CHAR(1) 
) 
RETURNS @output TABLE(splitdata NVARCHAR(MAX) 
) 
BEGIN 
    DECLARE @start INT, @end INT 
    SELECT @start = 1, @end = CHARINDEX(@delimiter, @string) 
    WHILE @start < LEN(@string) + 1 BEGIN 
        IF @end = 0  
            SET @end = LEN(@string) + 1
       
        INSERT INTO @output (splitdata)  
        VALUES(SUBSTRING(@string, @start, @end - @start)) 
        SET @start = @end + 1 
        SET @end = CHARINDEX(@delimiter, @string, @start)
        
    END 
    RETURN 
END

- I went down the route of using temporary tables which might not be best for large data tables. I'm including my efforts here only to demonstrate the calls to the functions I've suggested (It also demonstrates an alternative to using CURSOR if you ever get tempted down that route)

create table #temp
(
	rownum int,
	datacol varchar(max)
) ON [PRIMARY]	

insert into #temp
SELECT rownum = ROW_NUMBER() OVER (order by id), datacol
from topwords

DECLARE @wordlist table (word varchar(max))

declare @maxi int 
SELECT @maxi = COUNT(*) FROM topwords

DECLARE @loopCount int = 1
DECLARE @txt varchar(max)
WHILE @loopCount <= @maxi
BEGIN
	SELECT @txt = datacol from #temp1 WHERE rownum = @loopCount
	INSERT INTO @wordlist SELECT [dbo].fnRemoveNonAlpha(splitdata) FROM [dbo].fnSplitString(@txt,' ')
	SET @loopCount = @loopCount + 1
END

select word, count(*) from @wordlist
group by word
order by count(*) desc

Posted 3-Jun-15 6:02am

CHill60

Comments

Maciej Los 3-Jun-15 16:07pm

+5, Caroline ;)

Solution 3

If you expect data to grow and performance is an issue you should consider adding a full text index on the table.

Then you can query it using either one of these procedures: sys.dm_fts_index_keywords[^], sys.dm_fts_index_keywords_by_document[^] or sys.dm_fts_index_keywords_by_property[^]

Posted 3-Jun-15 9:43am

Jörgen Andersson

Comments

Maciej Los 3-Jun-15 16:10pm

+5 Jörgen!

CHill60 3-Jun-15 17:35pm

+5 - fits the OP requirement of "best approach"

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Maciej Los · Accepted Answer · 2015-06-03T05:24:00

All you need to do is to split sentence into words via using CTE or custom function.

Try this (read comments):

SQL

DECLARE @tmp TABLE (col1 VARCHAR(255), col2 VARCHAR(255))

INSERT INTO @tmp (col1, col2)
VALUES('This is good bike.', 'Only Harley can makes real good bike.'),
('Well, I like bike rides.', NULL)

;WITH CTE AS
(
	--initial part
	--get first word from col1 As word and the rest as remainder
	SELECT LEFT(col1, CHARINDEX(' ', col1)-1) AS word, RIGHT(col1, LEN(col1) - CHARINDEX(' ', col1)) AS remainder
	FROM @tmp
	WHERE CHARINDEX(' ', col1)>0
	UNION ALL
	--get first word from col2 As word and the rest as remainder
	SELECT LEFT(col2, CHARINDEX(' ', col2)-1) AS word, RIGHT(col2, LEN(col2) - CHARINDEX(' ', col2)) AS remainder
	FROM @tmp
	WHERE CHARINDEX(' ', col2)>0
	UNION ALL
	--recursive part
	-- get another words
	SELECT LEFT(remainder, CHARINDEX(' ', remainder)-1) AS word, RIGHT(remainder, LEN(remainder) - CHARINDEX(' ', remainder)) AS remainder
	FROM CTE
	WHERE CHARINDEX(' ', remainder)>0
	UNION ALL
	SELECT remainder AS word, NULL AS remainder
	FROM CTE
	WHERE CHARINDEX(' ', remainder)=0
)
--remove dot and count words ;)
SELECT REPLACE(word, '.', '') As word, COUNT(word) AS CountOfWord
FROM CTE
GROUP BY REPLACE(word, '.', '')
ORDER BY COUNT(word) DESC

For further information, please see: Using Common Table Expressions[^]