Click here to Skip to main content
15,889,992 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
Good Day

I've tried a few scenarios and googled a lot, but still can't find a solution.

I have a table of users that can look something like this

|UserName  |
|:--------:|
|Cakes420  |
|18Jack01  |
|18Jack04  |
|16Jack22  |
|22Jack16  |
|Mapple7609|
|Chrom44   |
|chrom22   |
|chrom77   |
|013Cake   |
|016Cake   |
|122Cake   |
|123Cake87 |


So I need a query that checks for all records that share 4 or more (in sequence) characters in the table.

So I need to return something like :

|Characters|Times Used|Names Sharing|
|:--------:|:--------:|:-----------:|
|Cake      |    5     |Cakes420, 013Cake, 016Cake, 122Cake, 123Cake87|
|Chro      |    3     |Chrom44, chrom22, chrom77|


or anything similar as I'd prefer not to repeat patterns, but hey, at this stage if it reurns the values properly, I don't mind.

The shared characters can naturally appear in any place in the string, which is what makes this so difficult.

What I have tried:

Stuff For XML
Subquery
except
with
Posted
Updated 21-Jan-21 23:30pm
v3

For substrings of precisely four characters, you can use a variation on the trigram approach described here:
Trigram Wildcard String Search in SQL Server - SQLPerformance.com[^]
SQL
WITH cteN0 As -- 10 values
(
    SELECT 0 As N
    UNION ALL SELECT 0 
    UNION ALL SELECT 0 
    UNION ALL SELECT 0 
    UNION ALL SELECT 0
    UNION ALL SELECT 0 
    UNION ALL SELECT 0 
    UNION ALL SELECT 0 
    UNION ALL SELECT 0
    UNION ALL SELECT 0
),
cteN1 As -- 100 values
(
    SELECT 0 As N
    FROM cteN0 As X
    CROSS JOIN cteN0 As Y
),
cteN2 As -- 10000 values
(
    SELECT 0 As N
    FROM cteN1 As X
    CROSS JOIN cteN1 As Y
),
cteTally As
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 0)) As N
    FROM cteN2
),
cteGrams As
(
    SELECT
        U.Username,
        G.word
    FROM
        @T As U
        CROSS APPLY
        (
            SELECT TOP (CASE WHEN Len(U.Username) > 4 THEN Len(U.Username) - 4 ELSE 0 END)
                Substring(U.Username, N.N, 4) As word
            FROM
                cteTally As N
            ORDER BY
                N.N
        ) As G
)
SELECT
    word As [Characters],
    Count(1) As [Times Used],
    STUFF(
        (SELECT ', ' + G2.Username
        FROM cteGrams As G2
        WHERE G2.word = G.word
        ORDER BY G2.Username
        FOR XML PATH(''), TYPE).value('.', 'varchar(max)')
    , 1, 2, '') As [Names Sharing]
FROM
    cteGrams As G
GROUP BY
    word
HAVING
    Count(1) > 1
;
(Adjust the tally table to suit the maximum length of your input strings.)

For your sample input, this produces:
Characters | Times Used | Names Sharing
------------------------------------------------------------------
18Ja       |          2 | 18Jack01, 18Jack04
3Cak       |          2 | 013Cake, 123Cake87
8Jac       |          2 | 18Jack01, 18Jack04
ack0       |          2 | 18Jack01, 18Jack04
Cake       |          2 | 123Cake87, Cakes420
chro       |          3 | chrom22, Chrom44, chrom77
hrom       |          3 | chrom22, Chrom44, chrom77
Jack       |          4 | 16Jack22, 18Jack01, 18Jack04, 22Jack16
You may be able to extend this approach to suit your requirements.
 
Share this answer
 
Comments
Maciej Los 20-Jan-21 13:21pm    
Got the same, but i've used different method ;)
Member 9374423 28-Jan-21 8:15am    
This is very close, but all the "cake" need to appear under Cake, all the "3Cak" need to appear there as well,
This is funny, because many years ago, i was looking for something like that (even if my requirements were bit different).

Take a look at below code:
SQL
CREATE TABLE AllNames
(
  [UserName] varchar(30)
);
INSERT INTO AllNames([UserName])
VALUES('Cakes420'),
('18Jack01'),
('18Jack04'),
('16Jack22'),
('22Jack16'),
('Mapple7609'),
('Chrom44'),
('chrom22'),
('chrom77'),
('013Cake'),
('016Cake'),
('122Cake'),
('123Cake87');

;WITH FourChars AS
(
  --initial part
  SELECT 1 LoopNo, ROW_NUMBER() OVER(ORDER BY LOWER([UserName])) rn, LOWER([UserName]) OrigName, SUBSTRING(LOWER([UserName]), 1, 4) Part
  FROM AllNames
  WHERE  LEN([UserName])>4
  --recursive part
  UNION ALL
  SELECT LoopNo +1 LoopNo, rn, OrigName, SUBSTRING(OrigName, LoopNo +1, 4) Part
  FROM FourChars
  WHERE  LEN(OrigName)>= LoopNo+4
)
SELECT *
FROM
(
SELECT Part, STRING_AGG(OrigName, ', ') ComNames
FROM FourChars
GROUP BY Part
) a
WHERE CHARINDEX(',', a.ComNames)>0;



Result:
18ja 	18jack01, 18jack04
3cak 	123cake87, 013cake
8jac 	18jack04, 18jack01
ack0 	18jack04, 18jack01
cake 	cakes420, 122cake, 123cake87, 016cake, 013cake
chro 	chrom22, chrom44, chrom77
hrom 	chrom44, chrom22, chrom77
jack 	22jack16, 18jack04, 18jack01, 16jack22


db_fiddle[^]

Good luck!
 
Share this answer
 
Comments
Member 9374423 22-Jan-21 2:40am    
Thank you so very much!
Maciej Los 22-Jan-21 3:40am    
You're very welcome.
I got it working with this, thank you all so much!

with rcte as
(
  select n.Name,
         convert(nvarchar(4), substring(n.Name, 1, 4)) as Part,
         1 as PartFrom
  from Names n
  where len(n.Name) >= 4
    union all
  select r.Name,
         convert(nvarchar(4), substring(r.Name, r.PartFrom+1, r.PartFrom+4)),
         r.PartFrom+1
  from rcte r
  where len(r.Name) >= r.PartFrom+4
),
cte_count as
(
  select r.Part,
         count(1) as PartCount
  from rcte r
  where r.Part not like '%[0-9]%' -- exclude parts with numbers in them
  group by r.Part
  having count(1) > 1
)
select c.Part,
       c.PartCount,
       string_agg(r.Name, ', ') as Names
from cte_count c
join rcte r
  on r.Part = c.Part
group by c.Part,
         c.PartCount
order by c.Part;
 
Share this answer
 
Comments
Maciej Los 22-Jan-21 5:35am    
5ed!
 
Share this answer
 
Comments
Member 9374423 20-Jan-21 8:10am    
Nope, The characters shared can be anywhere, and I don't know the 4 characters before hand, the query must find it, so Lik won't work. Thank you
Richard MacCutchan 20-Jan-21 8:15am    
With no information at all it is going to be very difficult to find what you want. You could try a regex that looks for sequences of four characters, but that may still not find everything you want, or even include things you do not want.
Member 9374423 20-Jan-21 8:20am    
Regex in SQL? I've never heard of this, let me google a bit
Member 9374423 20-Jan-21 8:21am    
So what would this regex look like?
Richard MacCutchan 20-Jan-21 8:56am    
You would need to extract all the names from the database and process them in whatever language your code is written in. That would be a much simpler way to get what you want.
The chances are that you aren't going to find a "nice" pure SQL solution (or indeed "any" pure SQL solution) - SQL string handling can charitably be described as "poor".

I would strongly suggest you would be better off doing this in a presentation language, whichever one(s) you use - it'll be a lot easier, and a heck of a lot more maintainable!
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900