Click here to Skip to main content
16,003,345 members
Please Sign up or sign in to vote.
1.00/5 (2 votes)
I'm having a .net core console application that reads parts of the JSON files stored in a folder the part of the JSON that I'm reading contains greater than 500,000 records in it, I'm able to read all the data but the only challenge I'm facing is that it is very slow it takes approximately 10 minutes to fetch and insert all the data. how can I modify my code below so I can improve performance and increase the speed to minimize the time it takes to process and insert all the data.

What I have tried:

using (var context = new ApplicationDbContext())
{
var dispenseToDelete = context.tblDispense.Where(p => p.HfrCode == facilityName);

if (dispenseToDelete.Any())
{
context.tblDispense.RemoveRange(dispenseToDelete);
context.SaveChanges();
}

}

using (StreamReader sr = new StreamReader(filePath))
using (JsonTextReader reader = new JsonTextReader(sr))
{
    while (reader.Read())
    {
        if (reader.TokenType == JsonToken.StartArray && reader.Path == "tblDispense")
        {
            using (var context = new ApplicationDbContext())
            {
                List<TblDispense> tblDispenseBatch = new List<TblDispense>();
                int batchSize = 1000;

                while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                {
                    if (reader.TokenType == JsonToken.StartObject)
                    {
                        JObject item = JObject.Load(reader);
                        TblDispense tblDispense = item.ToObject<TblDispense>();
                        tblDispense.UUID = Guid.NewGuid().ToString() + "." + facilityName;
                        tblDispense.MSDCode = tblDispense.MSDCode + '.' + facilityName;
                        tblDispense.DispenseID = tblDispense.DispenseID + "." + facilityName;
                        tblDispense.DispenseRecordId = tblDispense.DispenseRecordId + '.' + facilityName;
                        tblDispense.HfrCode = facilityName;
                        tblDispenseBatch.Add(tblDispense);

                        // Batch insert to avoid memory overload
                        if (tblDispenseBatch.Count >= batchSize)
                        {
                        try
                        {
                        context.tblDispense.AddRange(tblDispenseBatch);
                        context.SaveChanges();
                        tblDispenseBatch.Clear();
                        }
                        catch (DbUpdateException ex)
                        {
                        Console.WriteLine($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                        logger.LogError($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                        }
                        }
                    }
                }

                // Insert remaining items in the batch
                if (tblDispenseBatch.Count > 0)
                {
                try
                {
                context.tblDispense.AddRange(tblDispenseBatch);
                context.SaveChanges();
                tblDispenseBatch.Clear();
                }
                catch (DbUpdateException ex)
                {
                Console.WriteLine($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                logger.LogError($"[{facilityName}] an error occurred while saving the batch: {ex.InnerException?.Message}");
                }
                }
            }
            break;
        }
    }
}


any suggestions will be much appreciated.
Posted
Comments
Richard Deeming 8-Jul-24 3:46am    
When you profiled your code, which part did you identify as the bottleneck?

And if you haven't profiled it, then why not? Without profiling, all you can do is try to guess what the problem might be. You'll spend days optimizing parts of the code that are not relevant to the problem you're trying to solve.
Office Systems 8-Jul-24 4:02am    
Thanks @Richard Deeming for your comment however after profiling my code I found that the part that takes a lot of time is this part here since there are many parts of the of the json so it takes time to reach tblDispense and also there are many records that it reads over 500,000 records bottleneck while (reader.Read())
{
if (reader.TokenType == JsonToken.StartArray && reader.Path == "tblDispense")
{
Dave Kreskowiak 8-Jul-24 9:49am    
That doesn't seem to make sense. Just because some line of code executes 500,000 times doesn't make it the bottleneck. It's the TIME it takes to execute each line of code, not the number of executions.
PIEBALDconsult 8-Jul-24 8:41am    
If multiple files, read them in parallel.
If SQL Server, use an SqlBulkCopy object.

There really isn't a major way to improve the performance here as json isn't a "random access" data storage format - it's designed as a data transfer format and that means it's meant to be read in a linear manner - there isn't a way to predict how long a data item in a json file is in the storage medium other than to actually read it, so you can't split the data into multiple segments easily and process each segment in a separate thread - which is the only way you will get significant performance advances with your code.

The problem is the storage medium - you should really consider switching to a data storage format such as a database (SQL Server, MySql or even a single user DB like Access is much more appropriate.) The storage engines for these are optimised for queries and will be significantly faster than any code you want to write!

[edit]
I was just going to start writing an article explaining why JSON is a bad choice for data storage when I realized that I wrote one a couple of years ago ... :O
That's not a database, dude![^]
[/edit]
 
Share this answer
 
v2
There are so many parts to this that represent potential bottlenecks. You probably want to rethink your architecture here, and take care of a couple of basics while you're at it. What you are going to have to do is a lot of investigative work, identifying slow down areas sections by section. Some things I would consider - you have a List created without a capacity. You know that you're going to be allocating 1000 records to it at a time, so create the list sized to 1000 entries. Something else I would look at is not reusing the same list in a batch. What you could do is create, effectively, a List of Lists and create all of your batches before you actually update the database. Something like this:
C#
List<List<TblDispense>> batches = new List<List<TblDispense>>(1000); // Create a 1000 batch entries for giggles
List<TblDispense> tblDispense = new List<TblDispense>(1000);
batches.Add(tblDispense);
.... Do work adding data into tblDispense but don't clear it down at the end.
At the end, just iterate over your batches and add them.

However, playing around with this isn't going to get around the fact that your biggest bottleneck is going to be physically loading and reading the file. You might want to investigate faster file capabilities in the framework, and create something like a ReadOnlySpan<t> to hold the contents of the file (if you're using a late enough version of .NET).

Note: When I am investigating performance, I normally use dotTrace Profiler: .NET Profiler[^]
 
Share this answer
 
Comments
Office Systems 8-Jul-24 11:17am    
Thanks @Pete O'Hanlon I have tried the above scenario but it seems to overload the memory
Pete O'Hanlon 8-Jul-24 11:56am    
Reduce the initial batch size for batches to 100 and try that.
Office Systems 9-Jul-24 6:18am    
I have tried all the suggestions but still it takes a lot of time to fetch all 500,000 records into database
Pete O'Hanlon 9-Jul-24 6:39am    
But have you tried profiling the application? What is the hotspot? Ultimately, you are going to reach a point where you aren't going to make the code any faster, but you have to look at the profiling first.
Office Systems 9-Jul-24 8:30am    
Thanks @Pete O'Hanlon let me try to profiling the application again

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900