Hi,
I've been working on an application to help manage the departments of a Health Management Organization.
It's my first commercial piece of software, and I'm excited about wrapping it up this week if I can get this issue sorted out.
Here's the problem...
The MIS department receives a PDF 4 times a year. This PDF contains 2 pieces of information.

a) A list of all hospitals registered under the organisation.

b) A list of enrollees registered under each hospital.

I was tasked with writing a program that retrieves all hospitals in the PDF and registers them in the application's database, then retrieves all enrollees and registers them in the database under their respective hospitals. (This is managed using a foreign key relationship in the database.)

I used Regex to write a solution that registers all the hospitals and, save for some delay when parsing the PDF (which is 4000 pages long), it works perfectly.

The problem is that my solution for registering the enrollees is not as reliable as it should be: about 2 out of 10 enrollees don't get registered due to shortcomings in my code.

And when I transfer the already partially working solution to the client's server where it will finally reside, I get an error which says "Source Code Could Not Be Found". But when I run it in debug mode to check what the problem might be, it extracts the enrollee details as expected. So I'm very confused about that.

If I can get help with a) the "Source Code Could Not Be Found" error or b) why my code works on my development machine but not on the server, I would be very grateful.

I'll include my code. I would also include a snapshot of the PDF, but I doubt the site allows attachments with questions.

Thanks.

private void extractEnrolleesFromPDF(string enrolleeExtraction, string hospital)
        {
            int start;
            int end;
            string substring;
            
            try
            {
                MatchCollection policyNumbers = Regex.Matches(enrolleeExtraction, @"(\*)(\d{8})(\*)");

                foreach (var policyNumber in policyNumbers)
                {
                    Match match = Regex.Match(enrolleeExtraction, "\\" + policyNumber.ToString());
                    if (match.Success)
                    {
                        //Store the first occurrence of the enrollee's policy number
                        start = match.Index;

                        Match match2 = Regex.Match(enrolleeExtraction.Substring(start + 10), @"(\*)");
                        if (match2.Success)
                        {
                            end = match2.Index + 9;

                            substring = enrolleeExtraction.Substring(start, end);

                            enrolleePolicyNumber.Add(substring);
                        }                        
                    }
                }
                //Extract enrollee data and insert into the database

                ArrayList individualEnrolees = new ArrayList();

                int numberOfEnrollees = enrolleePolicyNumber.Count;
                bool principal = false;
                string fName;
                string lName;
                DateTime dob;
                string sex;
                string hospitalCode = hospital.Substring(1, 7);
                for (int i = 0; i < numberOfEnrollees; i++)
                {
                    string enrolleePolNumber;
                    Match policyNumber = Regex.Match(enrolleePolicyNumber[i].ToString(), @"((\*)(\d{8})(\*))");
                    if (policyNumber.Success)
                    {
                        enrolleePolNumber = policyNumber.Value;
                    }
                    MatchCollection enrolleeRecords = Regex.Matches(enrolleePolicyNumber[i].ToString(), @"(\d{1})(\s)(\D*)(\d{2})/(\d{2})/(\d{4})");

                    //Empty the array list each time to avoid going over the same records over and over again
                    individualEnrolees.Clear();

                    foreach (var record in enrolleeRecords)
                    {
                        individualEnrolees.Add(record);
                    }

                    //The way our search works at the moment is that it uses the pattern *-------* at the beginning and end to
                    //mark where an enrollee's records begin and end. The problem now is that the last record does not have
                    //that pattern at the end. So we need to find a way to retrieve the last record and add it to the collection we parse
                    //for the enrollee data.
                    try
                    {
                        Match lastPolicyNumberInHospital = Regex.Match(enrolleeExtraction, @"(\*)(\d{8})(\*)", RegexOptions.RightToLeft);

                        string lastRecord = enrolleeExtraction.Substring(lastPolicyNumberInHospital.Index);

                        enrolleePolicyNumber.Add(lastRecord);
                    }
                    catch (Exception ex)
                    {
                        MessageBox.Show("Failed to extract last record: " + ex.Message);
                    }

                    foreach (var record in individualEnrolees)
                    {
                        string princ;

                        string[] splitEnrolleeData = record.ToString().Split(' ');

                        //int splitSectionCount counts how many sections our split enrollee data has
                        int splitSectionCount = splitEnrolleeData.Count();

                        //if we have five sections then we expect the Principal or Spouse record to be
                        //on index 1
                        if (splitSectionCount == 5)
                        {
                            princ = splitEnrolleeData[1].ToString();
                            if (princ == "Principal")
                            {
                                principal = true;
                            }
                            else
                            {
                                principal = false;
                            }
                        }
                        //if we have four sections then we expect the Principal or Spouse record to be
                        //on index 0.
                        //i.e. Merged with the serial number so we check to see if it contains
                        //the string "Principal" or "Spouse"
                        else if (splitSectionCount == 4)
                        {
                            if (splitEnrolleeData[0].ToString().Contains("0"))
                            {
                                principal = true;
                            }
                            else if (!splitEnrolleeData[0].ToString().Contains("0"))
                            {
                                principal = false;
                            }
                        }
                        //TO-DO: Eliminate this comment block if else-if above works properly
                        //princ = splitEnrolleeData[1].ToString();
                        //if (princ == "Principal")
                        //{
                        //    principal = true;
                        //}
                        //else
                        //{
                        //    principal = false;
                        //}

                        enrolleePolNumber = policyNumber.Value.Substring(1, policyNumber.Value.Length - 2);

                        //if we have 5 sections as expected carry on and register the enrollee as usual
                        //if not, if we have 4 do something else
                        //this is because some enrollees in the NHIS PDF aren't split properly, returning
                        //4 items instead of 5
                        if (splitSectionCount == 5)
                        {
                            lName = splitEnrolleeData[2].ToString();
                            fName = splitEnrolleeData[3].ToString();
                            dob = Convert.ToDateTime(splitEnrolleeData[4].ToString());
                            hosp = getHospitalID(hospitalCode);
                            if (principal == true)
                            {
                                if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                                {
                                    registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                            else if (principal == false)
                            {
                                if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                                {
                                    registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                        }
                        else if (splitSectionCount == 4)
                        {
                            lName = splitEnrolleeData[1].ToString();
                            fName = splitEnrolleeData[2].ToString();
                            dob = Convert.ToDateTime(splitEnrolleeData[3].ToString());
                            hosp = getHospitalID(hospitalCode);
                            if (principal == true)
                            {
                                if (checkIfPolicyNumberExists(enrolleePolNumber) == false)
                                {
                                    registerEnrollee(enrolleePolNumber, fName, lName, dob, hosp.ToString());
                                }
                            }
                            else if (principal == false)
                            {
                                if (checkIfDependantPolicyNumberExists(enrolleePolNumber, fName) == false)
                                {
                                    
                                        registerDependant(enrolleePolNumber, fName, lName, dob, hosp.ToString());

                                    //else if (!parentExists(enrolleePolNumber))
                                    //{
                                        
                                    //}
                                }
                            }
                        }
                    }
                }
            }
            catch (Exception ex)
            {
                MetroFramework.MetroMessageBox.Show(this, "Error retrieving substring: " + ex.Message);
            }

        }
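
The comment in the code above notes that the last enrollee block has no closing *--------* marker. A minimal sketch of one way to capture every block, including the last, in a single pass is shown here. It assumes the text for one hospital has already been extracted into a string; the class and method names (EnrolleeBlockSplitter, Split) are hypothetical and not part of the original program.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: not part of the original application.
public static class EnrolleeBlockSplitter
{
    // Matches a policy marker such as *12345678* followed, lazily, by everything
    // up to (but not including) the next marker or the end of the text.
    // The end-of-text alternative (\z) is what picks up the final block.
    private static readonly Regex BlockPattern = new Regex(
        @"\*(?<policy>\d{8})\*(?<body>.*?)(?=\*\d{8}\*|\z)",
        RegexOptions.Singleline);

    // Returns (policy number, block text) pairs in document order.
    public static List<KeyValuePair<string, string>> Split(string hospitalText)
    {
        var blocks = new List<KeyValuePair<string, string>>();
        foreach (Match m in BlockPattern.Matches(hospitalText))
        {
            blocks.Add(new KeyValuePair<string, string>(
                m.Groups["policy"].Value,
                m.Groups["body"].Value.Trim()));
        }
        return blocks;
    }
}
```

Each pair holds the eight-digit policy number without the asterisks and the text that follows it, so the per-record parsing can then run on the block text alone.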
Comments
Sergey Alexandrovich Kryukov 17-Mar-15 21:57pm    
Unfortunately, your whole assignment, the whole idea of extracting data from a PDF, amounts to nothing but a huge abuse of technology in general, the result of someone's big mistakes. PDFs are designed as a one-way data sink: people use data to generate a PDF and are not supposed to reverse the process. This is not what PDF is designed for.

This is just my note. This is not work any reasonable person would ever choose. I'm really sorry for you.

—SA

1 solution

It's impressive that you can get 80% of those records. Unfortunately, even though it is in PDF format, the data itself is not well formed when it goes in. If you are a consultant, then there isn't much you can do, but if you are a regular employee you can throw out some suggestions.

On the programming side, I would track all of the failed records and look for a correlation in why they failed, such as the hospital they came from, the data type, or the length. This would help narrow it down further. I would also beg to have them submit the document in CSV, Excel, or XML format, because I am guessing it is coming from some sort of Office application and someone thinks it looks nice as a PDF and it is smaller, so it can be sent through email.
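
A minimal sketch of that kind of failure tracking is given here. It assumes a record is a space-separated line with 4 or 5 fields ending in a date, as the question's code expects, and it assumes the date is laid out as dd/MM/yyyy, which the source does not confirm. The FailedRecord and FailureLog names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

// Hypothetical types for illustration; not part of the original application.
public sealed class FailedRecord
{
    public string HospitalCode { get; set; }
    public string RawText { get; set; }
    public string Reason { get; set; }
}

public static class FailureLog
{
    // Returns null when the record looks parseable; otherwise a description of the failure.
    public static FailedRecord Check(string hospitalCode, string rawRecord)
    {
        string[] parts = rawRecord.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        string reason = null;

        if (parts.Length != 4 && parts.Length != 5)
        {
            reason = "unexpected field count: " + parts.Length;
        }
        else
        {
            DateTime dob;
            // Assumed date layout. An explicit format and culture means the result
            // does not depend on the regional settings of the machine running the code.
            if (!DateTime.TryParseExact(parts[parts.Length - 1], "dd/MM/yyyy",
                    CultureInfo.InvariantCulture, DateTimeStyles.None, out dob))
            {
                reason = "unparseable date";
            }
        }

        if (reason == null)
        {
            return null;
        }
        return new FailedRecord { HospitalCode = hospitalCode, RawText = rawRecord, Reason = reason };
    }

    // Summarises failures so that patterns (for example, one dominant reason) stand out.
    public static Dictionary<string, int> CountByReason(IEnumerable<FailedRecord> failures)
    {
        return failures.GroupBy(f => f.Reason).ToDictionary(g => g.Key, g => g.Count());
    }
}
```

Collecting the non-null results of Check for every record and passing them to CountByReason gives a quick tally per reason; grouping by HospitalCode instead would show whether the failures cluster in particular hospitals.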

Lastly, if you can show which records are failing and why, I'd find whoever compiled the document and ask them to Poka-yoke the input forms. That is a Japanese term used in Lean Manufacturing for making things foolproof. Typically office forms, such as those from the doctor's office, are the worst, as they are hand-entered and difficult to understand; thus you end up with garbage data.
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)
