15,963,836 members
See more:
I've two tables t1 and t2.
From t1:-
total records sel_col-distinct
28564502 12257389

From t2:-
total records sel_col-distinct
11217091 10890077

I am executing the below queries:-
SQL
```select count( distinct t1.sel_col ) distinct, count(sel_col) total
from
t1(nolock) join t2(nolock)
on t1.sel_col = CONVERT(VARBINARY(max),t2.sel_col,2)```
distinct total
10600479 25666519
time:-00:50 mins

SQL
```SELECT count(distinct t1.sel_col) total, count(t1.sel_col) total FROM
t1(nolock)
where t1.sel_col in
(select CONVERT(VARBINARY(max),t2.sel_col,2) from t2(nolock))```
distinct total
10600479 24835271
time:-12:54mins

The distinct value is coming same, but there is difference between total. The 2nd query is taking huge time. Is there any way to get the desired result? Also why there is so much difference in the total count?

Posted
Updated 2-Jan-14 10:15am
v2
chaau 2-Jan-14 15:56pm
I guess you have rows in t2 with duplicate sel_col

## Solution 2

It's because the join has the ability to multiply rows. Look at the following tables:

```TableA
Cust    Value
1        1
1        2
2        1

TableB
Cust
1
1
2```

When you join the two you get

```a.Cust b.Cust Value
1      1      1
1      1      1
1      1      2
1      1      2
2      2      1```

So if you count these values you get 5 where are using the in you get that same dataset as TableA so the count is 3.

If you run this SQL:

SQL
`SELECT SUM(COUNT(*) - 1) diff, COUNT(*) total, sel_col FROM t2 GROUP BY sel_col HAVING COUNT(*) > 1`

If this returns any results it means your count is being multiplied by the join. The diff column in the result will equal the difference between the two results.

If you need a join with the same result as the in, try the following.

SQL
`SELECT COUNT(t1.sel_col) FROM t1 (NOLOCK) INNER JOIN (SELECT DISTINCT sel_col FROM t2 (NOLOCK)) t2`

Joining to the sub query when the sub query is distinct will reduce duplicates and the multiplying effect of the join.

Christian Graus 2-Jan-14 17:28pm
Cool - thanks for taking the time to make this clearer than I did. Sadly, the OP does not appear to be replying.
Member 10499886 5-Jan-14 0:36am
Thanks a lot for such a detailed explanation.

## Solution 1

The reason they take different times is that the second uses a subquery, which is slow. Comparing varbinary(MAX) sounds like you're comparing files, which is always slow. Consider storing a hash of the data in your varbinary, for faster comparisons. You can then do a deeper comparison where the hashes match.

You are throwing the DISTINCT keyword about a bit. This often means that your SQL has other issues which you are hiding.

Are you saying the second result is correct ? What is the table structure ?

OK, I see now. You have put the totals there. Your first select is the same, it's a distinct. The second, is not. If you have the same value twice in t2, an inner join will create 2 copies of that row in your results, even if it's in t1, once. Doing a subquery, is not a join, so it would not create duplicate rows. This is, I assume, why you're using DISTINCT. Instead, you should fix your database structure.

v2
Member 10499886 5-Jan-14 0:39am
Thanks Christian for your solution. And yes there is issue with DB structure, in future I'll fix it.
Christian Graus 5-Jan-14 0:41am
I explain the issue in more depth in my latest article : http://www.codeproject.com/Articles/705911/SQL-Wizardry-Part-Five-when-not-to-use-DISTINCT