Posted 16 Jun 2018
Licenced CPOL

DARL and Whitebox Machine Learning

AndyEdmonds, 16 Jun 2018
Use the online free system to create machine learning models you can understand

Introduction

This is a continuation of a previous article that described the DARL language and its basis in fuzzy logic. In that article, I discussed the problem of "blackbox" models in machine learning, and explained that DARL was originally developed as a form of model that can be machine learned but that you can also read and understand. Such algorithms are sometimes labelled "whitebox" for obvious reasons.

The DARL.AI website now gives you API access to machine learning using a supervised learning, fuzzy logic rule induction algorithm. This interface is free. The REST interface is described here.

Machine learning can be processor intensive, depending on the amount of training data. To spread out the load, the processing is performed by an Azure function responding to a queue and returning results via email.

I've constructed a very simple project that will access this web service and fire off three benchmark machine learning examples.

Background

Machine learning falls into several types. The most common is supervised learning. Here you have collected multiple examples of the inputs and outputs of some process you want to learn, recorded in a database or in a structured text format such as XML or JSON, and the machine learning algorithm tries to create a model that reproduces the outputs when presented with the inputs.

If you remember from the previous article, DARL inputs and outputs can be textual, categorical, numeric or temporal. Machine learning here is limited to categorical and numeric inputs and outputs, and to a single output at a time. If that output is categorical, the task is classification; if it is numeric, the task is prediction.

The data used to train the model is the training set, and some of the data may be put aside to form a test set. With this machine learning algorithm, you specify the percentage of the data to train on and the system randomly splits the data into two groups.
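
The random split can be pictured with a minimal C# sketch. The service's own splitting code is not shown in this article, so the SplitSketch.Split helper, its seed parameter and its rounding rule are assumptions made purely for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SplitSketch
{
    // Hypothetical helper (not part of the DARL API): shuffles the patterns and
    // keeps percentTrain percent of them for training; the rest form the test set.
    public static (List<T> Train, List<T> Test) Split<T>(
        IEnumerable<T> patterns, int percentTrain, int seed)
    {
        if (percentTrain < 1 || percentTrain > 100)
            throw new ArgumentOutOfRangeException(nameof(percentTrain));

        var rng = new Random(seed);
        var shuffled = patterns.ToList();

        // Fisher-Yates shuffle so every ordering is equally likely.
        for (int i = shuffled.Count - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        // Assumed rounding rule: nearest whole number of patterns.
        int trainCount = (int)Math.Round(shuffled.Count * percentTrain / 100.0);
        return (shuffled.Take(trainCount).ToList(),
                shuffled.Skip(trainCount).ToList());
    }
}
```

Under these assumptions, splitting 150 patterns with percentTrain set to 80 gives 120 training patterns and 30 test patterns.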

Although problems that have an existing analytic solution are sometimes used to test ML algorithms, for instance getting a model to copy some logical relationship, in the real world no one in their right mind would use a machine learning algorithm to learn something for which an analytic model, like an equation, exists. Machine learning algorithms are used when nothing else will work. This is typically when the problem to be solved is noisy, poorly specified or ephemeral.

Machine learning is seldom absolutely correct. In all real world situations, you will have to deal with some inaccuracy. This might be misclassification or some prediction error. It is also entirely possible that your inputs are not related in any discernible way to your outputs, in which case the model performance will be poor.

To use the DARL Machine learning service, you need several things:

  • A source of data with one or more input values per pattern and one output value to be classified or predicted. The number of patterns required is problem-dependent, but is normally much greater than 50.
  • A ruleset skeleton created in DARL that specifies the inputs and output and how to find them in the data. For XML (XPath) and JSON (JSONPath), this consists of an expression to find the patterns, and expressions relative to the pattern that find each data item.
  • A choice of fuzzy set count from the set 3, 5, 7, 9, where larger numbers result in more complex models if numeric inputs or outputs are present.
  • A choice in the range 1-100 of the percentage of the data to train on. If a number less than 100 is chosen, then that percentage of the data is randomly chosen as a training set, and the rest becomes the test set. In this case, the performance on both sets is reported in the results.
  • An email address that the results should be sent to. Short term and spam email addresses are filtered out.

Using the Code

The project here on GitHub contains the example code for accessing the web service.

There are three examples provided, each consisting of a data file in XML and a DARL skeleton, both embedded in the executable as resources.

using System.IO;
using System.Net.Http;
using System.Reflection;
using System.Text;
using System.Threading.Tasks;
using Newtonsoft.Json;

class Program
{
    static string destEmail = "support@darl.ai"; //put your email address here!

    static void Main(string[] args)
    {
        // Submit the three benchmark jobs one after another.
        DarlML("yingyang").Wait();
        DarlML("iris").Wait();
        DarlML("cleveheart").Wait();
    }

    // Reads an embedded resource (DARL skeleton or data file) as text.
    static string ReadResource(string name)
    {
        using (var reader = new StreamReader(Assembly.GetExecutingAssembly().
            GetManifestResourceStream(name)))
        {
            return reader.ReadToEnd();
        }
    }

    static async Task DarlML(string examplename)
    {
        var source = ReadResource($"DarlMLRestExample.{examplename}.darl");
        var data = ReadResource($"DarlMLRestExample.{examplename}.xml");
        //use your own choice of training percent (1-100) and sets (3,5,7,9)
        var spec = new DarlMLData { code = source, data = data,
                                    email = destEmail, percentTrain = 100,
                                    sets = 5, jobName = examplename };
        var valueString = JsonConvert.SerializeObject(spec);
        using (var client = new HttpClient())
        {
            var response = await client.PostAsync("https://darl.ai/api/Linter/DarlML",
                new StringContent(valueString, Encoding.UTF8, "application/json"));
            // Throws HttpRequestException if the service returns a non-success status.
            response.EnsureSuccessStatusCode();
        }
    }
}

Don't forget to replace the email address with your own!

The class encapsulating the ML specification looks like this:

public class DarlMLData
{
    /// <summary>
    /// Your DARL code
    /// </summary>
    /// <remarks>Should contain a single ruleset decorated with "supervised"
    /// containing only I/O. Outside of the ruleset the "pattern" parameter should be specified,
    /// along with mapinputs,mapoutputs and wires. MAP I/O should have paths.</remarks>
    public string code { get; set; }

    /// <summary>
    /// The training data
    /// </summary>
    /// <remarks>this can be in XML or Json. If the former XPath should be used
    /// to specify paths in the ruleset. If the latter, JsonPath</remarks>
    public string data { get; set; }

    /// <summary>
    /// Number of sets to use for numeric variables. Only values 3,5,7 and 9 can be specified.
    /// </summary>
    [Range(3,9)]
    public int sets { get; set; }

    /// <summary>
    /// The percent to train on
    /// </summary>
    /// <remarks>Must be between 1 and 100</remarks>
    [Range(1, 100)]
    public int percentTrain { get; set; }

    /// <summary>
    /// email to send results
    /// </summary>
    /// <remarks>Because Machine learning can be CPU intensive training is performed
    /// via a queue in a secondary process.
    /// Results and the mined DARL will be emailed to this address.
    /// </remarks>
    [DataType(DataType.EmailAddress)]
    public string email { get; set; }

    /// <summary>
    /// A name to identify the job in the returned email
    /// </summary>
    public string jobName { get; set; }
}
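
Note that [Range(3,9)] on its own would also accept 4, 6 and 8, while the documentation comment allows only 3, 5, 7 and 9. How the service reacts to an even value is not stated here, so a client-side check before posting may be worthwhile. The following is a minimal sketch of such a check; the DarlMLChecks class and its messages are hypothetical and not part of the example project, and the email test is deliberately crude.

```csharp
using System;
using System.Collections.Generic;

public static class DarlMLChecks
{
    // Set counts the article says are permitted.
    static readonly int[] AllowedSets = { 3, 5, 7, 9 };

    // Hypothetical helper: returns a list of problems found in the job
    // parameters. An empty list means the values look acceptable.
    public static List<string> Validate(int sets, int percentTrain, string email)
    {
        var problems = new List<string>();
        if (Array.IndexOf(AllowedSets, sets) < 0)
            problems.Add("sets must be one of 3, 5, 7 or 9");
        if (percentTrain < 1 || percentTrain > 100)
            problems.Add("percentTrain must be between 1 and 100");
        // Crude check only; the service applies its own email filtering.
        if (string.IsNullOrWhiteSpace(email) || !email.Contains("@"))
            problems.Add("email looks missing or malformed");
        return problems;
    }
}
```

Calling DarlMLChecks.Validate(spec.sets, spec.percentTrain, spec.email) before serializing the specification would catch these mistakes locally rather than waiting for a response from the queue.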

The Example Data Sets

The three data sets provided demonstrate classification of different kinds of data. The system can handle numeric outputs too.

  • Iris is Fisher's Iris data set, a real world data set used frequently in machine learning. It contains the measurements of 3 kinds of Iris flower, with 50 examples each. The task is to learn the kind (species) from the measurements.
  • CleveHeart is the Cleveland heart database containing measurements of patients who arrived at an A&E unit in a hospital with a heart attack. The task is to predict the outcome (survival) based on the various measurements.
  • YingYang is a synthetic data set containing two categories formed from the entwined halves of the yin-yang symbol. The coordinates of points within the shapes are provided along with the category, and the system has to learn to separate them. Use 7 or 9 fuzzy sets.

To illustrate how to construct a DARL skeleton, we'll look at the data of the Iris example.

<?xml version = "1.0"?>
<irisdata>
 <Iris>
  <sepal_length>5.10</sepal_length>
  <sepal_width>3.50</sepal_width>
  <petal_length>1.40</petal_length>
  <petal_width>0.20</petal_width>
  <class>Iris-setosa</class>
 </Iris>

This is one pattern out of 150.

The Iris DARL skeleton looks like this:

pattern "//Iris";

ruleset iris supervised
{
 input numeric petal_length;
 input numeric sepal_length;
 input numeric petal_width;
 input numeric sepal_width;

 output categorical class;
}
 
mapinput petal_length "petal_length";
mapinput petal_width "petal_width"; 
mapinput sepal_length "sepal_length";
mapinput sepal_width "sepal_width";

mapoutput class "class";

wire petal_length iris.petal_length;
wire petal_width iris.petal_width;
wire sepal_length iris.sepal_length;
wire sepal_width iris.sepal_width;
wire iris.class class;

The section pattern "//Iris"; defines the XPath (since the data is XML) to find all the patterns in the data.

Within the ruleset, which is annotated with supervised, the inputs and output are specified.

Finally, the mapinput and mapoutput definitions each give an XPath, relative to the pattern, that finds the data item. In this case, it is simply the name of the element in the XML.

The wire elements link the mapinput and mapoutput elements to the ruleset.
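
To see how a pattern path and a relative item path work together, here is a minimal C# sketch using the standard System.Xml classes. It only mimics the lookup for illustration; the PatternSketch class is hypothetical and says nothing about how the DARL service reads the data internally.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

public static class PatternSketch
{
    // Hypothetical helper: selects every pattern node with patternPath, then
    // evaluates itemPath relative to each pattern to read one data item.
    public static List<string> ReadValues(string xml, string patternPath, string itemPath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);

        var values = new List<string>();
        foreach (XmlNode pattern in doc.SelectNodes(patternPath))
        {
            XmlNode item = pattern.SelectSingleNode(itemPath);
            values.Add(item?.InnerText); // null when a pattern lacks the item
        }
        return values;
    }
}
```

With the Iris data, PatternSketch.ReadValues(data, "//Iris", "class") would return one class label per pattern, and replacing "class" with "petal_length" would return the corresponding measurements as text.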

When you run the example, you will receive 3 emails back. The Iris email will look like this, assuming you've kept the same parameters:

Darl Machine Learning results at 6/15/2018 10:31:32 AM for iris

Job id: 7f29976a-b094-426d-b8f0-763f4abf1bb5
Training on 100%
Train performance 96.0526315789474(%/RMS Error)
Unknown responses 0%
Thanks for using DARL Machine Learning. DARL.AI Support.
If you would like to unsubscribe and stop receiving these emails, click here

and the included DARL code will look like this:

pattern "//Iris";

ruleset iris supervised
{
 // Generated by DARL rule induction on  6/15/2018 10:31:34 AM.
// Train correct:  96.05% on 152 patterns.
// Percentage of unknown responses over all patterns: 0.00
input numeric petal_length { {very_small, -Infinity,1,1.6},{small, 1,1.6,4.4},
{medium, 1.6,4.4,5.1},{large, 4.4,5.1,6.9},{very_large, 5.1,6.9,Infinity}};
input numeric petal_width { {very_small, -Infinity,0.1,0.3},{small, 0.1,0.3,1.3},
{medium, 0.3,1.3,1.8},{large, 1.3,1.8,2.5},{very_large, 1.8,2.5,Infinity}};
input numeric sepal_length { {very_small, -Infinity,4.3,5.1},{small, 4.3,5.1,5.8},
{medium, 5.1,5.8,6.4},{large, 5.8,6.4,7.9},{very_large, 6.4,7.9,Infinity}};
input numeric sepal_width { {very_small, -Infinity,2,2.8},{small, 2,2.8,3},
{medium, 2.8,3,3.3},{large, 3,3.3,4.4},{very_large, 3.3,4.4,Infinity}};

output categorical class {"Iris-setosa","Iris-versicolor","Iris-virginica"};

if petal_length is very_small  then class will be "Iris-setosa" confidence 1; // examples: 4
if petal_length is small  then class will be "Iris-setosa" confidence 1; // examples: 39
if petal_length is medium  then class will be "Iris-versicolor" 
                     confidence 0.977272727272727; // examples: 44
if petal_length is large  and petal_width is medium  and sepal_length is medium 
                     then class will be "Iris-virginica" confidence 1; // examples: 1
if petal_length is large  and petal_width is medium  and sepal_length is large  
                     then class will be "Iris-versicolor" confidence 0.75; // examples: 4
if petal_length is large  and petal_width is large  then class will be "Iris-virginica" 
                     confidence 0.888888888888889; // examples: 27
if petal_length is large  and petal_width is very_large  then class will be 
                     "Iris-virginica" confidence 1; // examples: 13
if petal_length is very_large  then class will be "Iris-virginica" confidence 1; // examples: 9
}
 
mapinput petal_length "petal_length";
mapinput petal_width "petal_width"; 
mapinput sepal_length "sepal_length";
mapinput sepal_width "sepal_width";

mapoutput class "class";

wire petal_length iris.petal_length;
wire petal_width iris.petal_width;
wire sepal_length iris.sepal_length;
wire sepal_width iris.sepal_width;
wire iris.class class;

Note that the inputs and outputs are now annotated with fuzzy sets and categories. These are discovered in the data and inserted automatically.

The ruleset now contains a set of DARL rules that categorize Irises.

Extra information about the rule inference process and the degree of support for each rule are included as comments.
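
To get a feel for what a set definition such as {small, 1,1.6,4.4} means, the sketch below evaluates a degree of membership. It assumes the three numbers are the left foot, peak and right foot of a triangular set, and that an infinite end point produces a flat shoulder. That reading is consistent with the generated sets above, but it is my assumption for illustration; the FuzzySketch class is hypothetical and is not the DARL inference engine.

```csharp
using System;

public static class FuzzySketch
{
    // Assumed reading of {name, a, b, c}: membership is 0 at a, rises linearly
    // to 1 at b, and falls linearly back to 0 at c. An infinite a or c gives a
    // flat shoulder (membership 1) on that side.
    public static double Membership(double x, double a, double b, double c)
    {
        if (x == b) return 1.0;
        if (x < b)
        {
            if (double.IsNegativeInfinity(a)) return 1.0;
            return x <= a ? 0.0 : (x - a) / (b - a);
        }
        if (double.IsPositiveInfinity(c)) return 1.0;
        return x >= c ? 0.0 : (c - x) / (c - b);
    }
}
```

Under this reading, a petal_length of 3.0 belongs to small (1, 1.6, 4.4) and to medium (1.6, 4.4, 5.1) to a degree of about 0.5 each, which is why neighbouring rules can both contribute to a result.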

The DARL rulesets you get back can be used with the online inference REST API described in the previous article, which is how you reuse learned rulesets.

Machine learning is a very big subject. I can't hope to tell you everything here. Please look at the DARL Machine learning help for more advice.

Please report any bugs, especially exceptions, through the DARL support button on the DARL.AI pages.

History

  • 06/15/2018: First version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

AndyEdmonds
United Kingdom

Last Updated 16 Jun 2018
Article Copyright 2018 by AndyEdmonds