Best practice to avoid duplicate data entries in a mysql DB

Question

0.00/5 (No votes)

See more:

I've looked around on various websites in search for a "best practice" method when trying to avoid duplicate entries in a MySQL DB.

I did find basic information on how to detect duplicate entries, but I still fail to find substantial information on how to implement a complete solution to detect for duplication and then control the execution process within a stored procedure.

I believe this is where I will now have to explain myself better:

I collect patient demographic information and insert that into a MySQL database. The information is specific to:

Patient
|
-- Patient Details (table)
|
-- Patient Contact Details (table)
|
-- Patient Address Details - Street & Postal (table)
|
-- Patient Medical Insurance Details (table)
|
-- Patient Next of Kin Details (table)
|
-- Patient Employer Details (table)

Validation is done on "Patient Details" (First Name, Last Name, SSN, DOB, etc) with the following SP:

SQL

DELIMITER $$

CREATE DEFINER=`rootuser`@`%` PROCEDURE `SP_patAddDetails`(
/*IN pFileNo 		VARCHAR(11),*/
IN pKey 		VARCHAR(1),
IN pTitle 		VARCHAR(15),
IN pFirstName 		VARCHAR(30),
IN pInitials 		VARCHAR(5),
IN pSurname 		VARCHAR(30),
IN pIDNo 		VARCHAR(15),
IN pDOB 		VARCHAR(10),
IN pGender  		VARCHAR(8),
IN pLanguage 		VARCHAR(15),
IN pOccupation 		VARCHAR(30),
IN pDependantCode 	VARCHAR(2),
IN pRelationship 	VARCHAR(15),
IN pMemNo 		VARCHAR(20),
IN pDateOpened      VARCHAR(10)
)
BEGIN
START TRANSACTION;

INSERT INTO patDetails(patKey, patTitle, patFirstName, patInitials, 
                       patSurname, patIDNo, patDOB, patGender, patLanguage,
                       patOccupation, patDependantCode, patRelationship, 
                       patMemNo, patDateOpened, patUpdated) 
 
 /*
INSERT INTO patDetails(patFileNo, patKey, patTitle, patFirstName, patInitials, 
                       patSurname, patIDNo, patDOB, patGender, patLanguage,
                       patOccupation, patDependantCode, patRelationship, 
                       patMemNo, patUpdated, patDateOpened)
*/
  
  
 /*                      
to avoid duplicate entries during the patient add process 
   check to see if patient already exist in the patDetails table
*/		


       
        SELECT * FROM (SELECT pKey, pTitle, pFirstName, pInitials, 
                              pSurname, pIDNo, pDOB, pGender, pLanguage, 
                              pOccupation, pDependantCode, pRelationship, 
                              pMemNo, pDateOpened, now()) AS tmp
                              
/*        
SELECT * FROM (SELECT pFileNo, pKey, pTitle, pFirstName, pInitials, 
                              pSurname, pIDNo, pDOB, pGender, pLanguage, 
                              pOccupation, pDependantCode, pRelationship, 
                              pMemNo, now()) AS tmp, pDateOpened
*/
        
        WHERE NOT EXISTS (
			SELECT patTitle, patFirstName, patInitials, patSurname, patDOB, patMemNo
			FROM   patDetails 
			WHERE  patTitle = pTitle
			AND    patFirstName = pFirstName
			AND    patInitials = pInitials
			AND    patSurname = pSurname
            AND    patDOB = pDOB
            AND    patMemNo = pMemNo
        ) 
        LIMIT 1;    

 /*               
        WHERE NOT EXISTS (
			SELECT patFileNo, patKey, patTitle, patFirstName, patInitials, patSurname
			FROM   patDetails 
			WHERE  patFileNo = pFileNo
			AND    patKey = pKey
			AND    patTitle = pTitle
			AND    patFirstName = pFirstName
			AND    patInitials = pInitials
			AND    patSurname = pSurname
        ) 
        LIMIT 1;  
*/

COMMIT;


SELECT LAST_INSERT_ID()

COMMIT;
END

Currently I am controlling the validation process inside my VB.net Forms application by checking to see if I get a new database ID returned. If a new ID is returned I know that validation passed, the patient details were written to the DB and I get a new DB ID which I can then use as my foreign key during the rest of the process. (I am happy to edit the post to remove all the remarked code lines, things I tried, if it posses any reading difficulty)

I want to believe that there is a much more logical method of doing this within MySQL rather than doing it external, but I don't seem to find the information I need to achieve this...maybe it is meant to be done from an application rather from inside MySQL (which I want to be wrong about).

Regards,
T

What I have tried:

In the Stored Procedure there are commented lines of code of things I've tried but this is still a process which it controlled outside of MySQL.
-edit: grammar

Posted 10-Nov-18 21:40pm

Tino Fourie

Updated 11-Nov-18 7:12am

v3

Add a Solution

3 solutions

Solution 1

For starters, you are storing the same information repeatedly: Language, Occupation, Gender, and so forth - those should be separate tables with a Foreign Key link. THis applies to all places where information is not limited to a single user, but "shared" by many.

Second, you are storing everything as strings, which is not a good idea at all. Look at it: pIDNo - is that a string? No, it's a number. Store it in a numeric field! Date of birth? DATE. Date Opened? DATE. And so forth.
If you store data in strings all the time it's easy for you as a developer to get started - but it causes huge problems later because you can't sort it, you don;t know if a date is in MM/dd/yy or dd/MM/yy (or even yy/MM/dd) - heck, you don't even know it's a valid date at all! So your code later fails because your data is rubbish. And you have thrown away any context which lets you decide which format the user might have used when he entered it ...

That looks like a student had to design a system, and wanted to go to the pub instead! :laugh:

Posted 10-Nov-18 22:08pm

OriginalGriff

Comments

Tino Fourie 11-Nov-18 5:37am

Hi Griff and thanks for your reply.

For your starter, I completely agree with you. That can certainly be addressed to prevent inflating a DB.

However, for the rest I don't really have the luxury of using the appropriate data types as you pointed out in par. 2. I collect data from various 3rd party DB's, therefore I don't have to validate for correctness. Data correctness validation is done through my application when users are required to update the information on the DB.

When I do validation from 3rd party DB, up to 30% of the entries fail initial validation. I know you were not aware of this, and therefore I can assure you that I did not throw any context away nor decided how a user will input data.

As it stands, it is poor data validation from 3rd party vendors and I need to fix that.

Why I use strings, instead of the more appropriate data type, is that when data is imported into my DB, I have no idea how a date was entered, which data format (dd/MM/yy, dd/MM/yyyy) were used - you could say that I am completely blind when importing data from 3rd party DB's.

PS: I don't do the pub scene and also not a student like in attending classes - but I like to learn even at 50. So I guess you could call me a student ;)

OriginalGriff 11-Nov-18 5:43am

:laugh: Perhaps we can!

To be honest, if I'd paid for data and 30% of it failed validation, I'd be looking for my money back, not perpetuating the problems into my DB's and later software which may - or may not - know that the data is unreliable.
I'd be looking at the very least to validate it from the 3rd party into a sensible database format and discarding (or at least offlining) the rubbish. That's a very high bad data rate and I wouldn't be a happy bunny, or particularly inclined to trust the rest of the 3rd party offering.

Tino Fourie 11-Nov-18 5:55am

Again, that I agree to and sadly it is a combination of poor data validation and lack of user responsibility.

Now that is where my application comes in. This project is self-funded and being an entrepreneur I saw an opportunity in the market.

How do I know users will enter patient demographics correctly when using my application - well they don't have to enter (type) anything because I use data injection via WinAPI into the front-end of the 3rd party application. I refuse to write directly to a 3rd party database, I ain't stupid :laugh

Tino Fourie 12-Nov-18 10:23am

I gave you a ^5, a pint will be just fine :laugh

Solution 3

I would have prefer to post this as a comment, but comments seems to have a Problem at the Moment.

Have a look to this, could be interesting for you:
MySQL :: MySQL 5.7 Reference Manual :: 13.2.5.2 INSERT ... ON DUPLICATE KEY UPDATE Syntax[^]

It is a mySQL Substitute for MSSQL merge.

Posted 11-Nov-18 7:12am

User 11060979

Comments

Tino Fourie 12-Nov-18 10:21am

Hi 0x01AA and thank you for replying.
I did look into that originally when I started out. The problem is, I have no way of telling which one of the duplicate entries will be the most correct "unless" I run validation within my application.

Keep in mind, some entries will have a patID, other don't and then you have incomplete patId's which could either be less than 13 digits OR more than 13 digits.

This is just one example, with a few variations, that I am facing.

Currently, I am focusing on only a few data fields to determine if a duplication exist. The rest I leave up to the user to either update incomplete data OR in the very least, delete the worst of the duplicate entries and correctly update whatever remains. Somewhere along the line I hope to have a few "perfectionist" users out there who has an uncontrollable and impulsive obsession to correct data :P

Add a Solution

Add your solution here

Treat my content as plain text, not as HTML

Preview 0

…

Existing Members

Sign in to your account

...or Join us

Download, Vote, Comment, Publish.

Your Email
Password
Forgot your password?

Your Email
This email is in use. Do you need your password?
Optional Password

I have read and agree to the Terms of Service and Privacy Policy
Please subscribe me to the CodeProject newsletters

When answering a question please:

Read the question carefully.
Understand that English isn't everyone's first language so be lenient of bad spelling and grammar.
If a question is poorly phrased then either ask for clarification, ignore it, or edit the question and fix the problem. Insults are not welcome.
Don't tell someone to read the manual. Chances are they have and don't get it. Provide an answer or move on to the next question.

Let's work to help developers, not make them feel stupid.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Wendelius · Accepted Answer · 2018-11-10T22:31:00

In most of the cases I consider the best option to use declarative constraints. For example to prevent duplicate rows for natural keys one should use MySQL UNIQUE Constraint[^] .

Sometimes the logic is more complex than what can be expressed in a constraint but in such cases I would prefer triggers over procedures whenever possible. This ensures that the check is always enforced. See MySQL :: MySQL 8.0 Reference Manual :: 24.3.1 Trigger Syntax and Examples[^]

Also if more complex logic is needed it's often a good idea to stop for a while and consider if the relational model is correct. Just to ensure that the database is properly utilized. If the current design still feels feasible then continue with procedural approach.