Click here to Skip to main content
15,888,026 members
Please Sign up or sign in to vote.
0.00/5 (No votes)
See more:
I'm making a program to generate a class with properties based on a dataset, (nothing fancy, just a tool to simplify life).

Right now I'm testing CSV-data of stars, as it a lot of columns, and it's somewhat messy data.

What I'm doing right now is that I take the dataset, extracts the headers, then generate a class, (removed about 30 lines for brevity):
C#
class 
{
	public int Id { get; set; }
	public string Hip { get; set; }
	public string Hd { get; set; }
	public string Hr { get; set; }
	public decimal Ci { get; set; }
	public decimal X { get; set; }
	public decimal Y { get; set; }
	public decimal Z { get; set; }
}

This is done by looping through the first row of data, evaluating the cells with this:
C#
public static string Evaluate(string input)
{
    string output = "string";
    
    if (decimal.TryParse(input, out var result)) { output = "decimal"; }
    if (DateTime.TryParse(input, out var result1)) { output = "DateTime"; }
    if (int.TryParse(input, out var result2)) { output = "int"; }

    if (input == "") { output = "string"; }

    return output;
}

It's a bit clunky, but this isn't a beauty-contest

What I need to to is to evaluate the first 5 rows of data, then compare which datatype is the most likely. It turns out that just sampling the first line is a bad idea, so I'm looking for which type is correct for three out of five lines, (if no type has this it's a string).

The data:
id	hip	hd	ra	dec	dist	pmra	pmdec
0			0	0	0	0	0
1	1	224700	0.00006	1.089009	219.7802	-5.2	-1.88
2	2	224690	0.000283	-19.49884	47.9616	181.21	-0.93
3	3	224699	0.000335	38.859279	442.4779	5.24	-2.91
4	4	224707	0.000569	-51.893546	134.2282	62.85	0.16

Note: The first dataline is Sol, ging it 0 as value in a column filled with decimals, so just checking the first, or one line would be folly, as it wouldn't handle a whole number not shown as 0.0.

The GitHub Project: https://github.com/frankhaugen/class-from-dataset

What I have tried:

Yes, I've googled, but I to either lack the words, or there isn't any information on this

And I've looked into using ML.NET for the task, but it seems somewhat overkill, (but it might be fun)
Posted
Updated 16-Jul-18 5:04am
v2
Comments
Eric Lynch 15-Jul-18 21:16pm    
As always, there are many ways to get the same solution. If you're limiting yourself to common C# data types, you may want to look at the System.TypeCode enumeration. This might be better for representing the possible types.

One way of accomplishing this task, might be to create a List<Dictionary<TypeCode, int>>, where each item in the list represents a column of data. Each entry in the dictionary then keeps track of the number of times the corresponding data type (TypeCode) occurs.


Sampling basically requires you to add or increment the corresponding dictionary item. When done sampling, you can simply iterate though the dictionary, for each column, and choose the one with the highest count.

Though, you should consider what to do in the case of a tie. For example, in most cases, text for TypeCode.Int32 (int) is also valid for TypeCode.Decimal (decimal).

Again, just one possibility. There are many other approaches that would also work.
Richard MacCutchan 16-Jul-18 2:44am    
Looking at your sample every number looks like a valid decimal. Maybe you should re-order your tests so the most likely type is tested for last
Frank R. Haugen 16-Jul-18 5:48am    
Yes, discovered that after this was posted, as a decimal value can represent a datetime. But that's trivial
Richard MacCutchan 16-Jul-18 8:05am    
Well the problem is that a decimal number could represent a DateTime, but there is no guarantee that that is what it does represent. Without context, such a program is likely to be largely guesswork.
Frank R. Haugen 16-Jul-18 9:25am    
Guesswork will be fine, as it's supposed to be a helping hand and not a perfect solution :-)

Try this article: CSV/Excel File Parser - A Revisit[^]

It creates a strongly-typed DataTable either by determining the types itself, or by programmer-specified types.
 
Share this answer
 
v2
I looked at the comments, and they started me down the right path.

I changed over to DataTable from Dictionaries and Lists, (it's already set up for 2-dimensional data). My code need a lot of cleaning and commenting, but now it does what I want, (mostly)

My code for creating the datatable:
C#
static void GenerateDataTable()
{
    dt.TableName = "Stars";
    string[] heads = inputData[0].Replace(" ", "").Split(',');

    foreach (string head in heads)
    {
        dt.Columns.Add(new DataColumn() { ColumnName = head });
    }

    for (int i = 1; i < inputData.Count; i++)
    {
        dt.Rows.Add(inputData[i].Replace(" ", "").Split(','));
    }

    Dictionary<int, string> testDict = new Dictionary<int, string>();

    for (int i = 0; i < dt.Columns.Count; i++)
    {
        List<string> testList = new List<string>();
        for (int j = 0; j < 500; j++)
        {
            testList.Add(EvaluateVariableType.Evaluate(dt.Rows[j].ItemArray[i].ToString()));
            
        }
        
        testDict.Add(i, MostOccurences(testList));
    }

    for (int i = 0; i < testDict.Count; i++)
    {
        HeadersAndTypes.Add(heads[i], testDict[i]);

        Console.WriteLine(testDict[i]);
    }
}

For the counting of the occurances i did this, (stolen from stackexchange):
C#
static string MostOccurences(List<string> input)
{
    string output;

    try
    {
        var groupsWithCounts = from s in input
                               group s by s into g
                               select new
                               {
                                   Item = g.Key,
                                   Count = g.Count()
                               };

        var groupsSorted = groupsWithCounts.OrderByDescending(g => g.Count);
        string mostFrequest = groupsSorted.First().Item;

        output = mostFrequest;
    }
    catch (Exception e)
    {
        output = e.Message;
    }

    return output;
}
 
Share this answer
 

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)



CodeProject, 20 Bay Street, 11th Floor Toronto, Ontario, Canada M5J 2N8 +1 (416) 849-8900