Modeling Supertypes and Subtypes: Part 1

Scott Burkow

4.97/5 (8 votes)

May 15, 2013

CPOL

19 min read

48555

Background Theory and Process Modeling (BPMN).

Introduction
Generalization and Specialization
- Inheritance
  - A Tree Representation
  - Multiple Inheritance
- Completeness and Exclusivity
The Business Rules
- A Process Model in BPMN
- Other Business Rules
Finally: An Historical Perspective

Introduction

The supertype-subtype hierarchy is a central construct in the conceptual design of data--but one not without its challenges. In this article series I'll pose a problem to solve, and take it through its conceptual, logical, and physical design phases. At each point we'll have alternatives and decisions to make.

In this first of three articles, we examine supertypes and subtypes as forming generalization hierarchies. The hierarchy implies inheritance, and we'll examine that, but within a mathematical context: what kind of graph is it? Is it a tree? The answer depends on whether we allow multiple inheritance or limit to single inheritance; we'll cover both.

The supertype-subtype hierarchy has two other important properties--completeness and exclusivity--which we'll examine next. I'll continue to stay a little--okay maybe a lot--on the math side, as we study them in the context of set membership. Admittedly the analyses so far are a tad formal; you can skim them and still understand the rest of the series, but their concepts at times will be referenced.

With background complete, I introduce the problem for which we'll be designing. We'll deduce the supertype and subtypes and their properties from diagrams done in a process modeling graphical language. I've chosen Business Process Model and Notation (BPMN version 2.0) for the task; I find it to be powerful and expressive.

In Part II [ ^ ] , we'll fashion the problem's business rules into the conceptual design, and then translate the conceptual into the logical. Because my CASE tools--I'm using ER/Studio--allow me to express subtyping in both models, we'll see it in two different representations.

In Part III [ ^ ] , it's all about the database. There we'll have choices about how we translate entities into tables. One thing we won't have a choice about, though, is enforcing data integrity; we'll devise check constraints and more complex triggering subsystems for all our physical designs.

Generalization and Specialization

From the list building, commercial, residential, we make the association that commercial is a kind of building, and so is residential. The is a (isa) relationship between building and commercial and building and residential means that commercial and residential are specializations of building, and building, their generalization. It is also natural to envision the terms in a lineal hierarchy with building at the root and the others beneath it.

In our mind's graph, we can further divide commercial into the types retail, office, and warehouse, and residential, into house and apartment, apartment into leased and owned, and so on. Any type can therefore specialize and/or generalize other types in a multilevel isa hierarchy.

Inheritance

Such types are entities in conceptual and logical modeling, and their relationships form supertype-subtype hierarchies. At the top of the hierarchy is the single entity that isn't generalized further; this is the supertype. All other elements are subtypes. The typing (isa) relationship implies inheritance; what are inherited by a subtype are the attributes and relationships of its parent entity, established by the incoming relationship. Each subtype must extend its parent with its own attributes and/or relationships.

As we'll see in later sections, we deduce the supertype-subtype hierarchies from the business rules in the conceptual modeling phase, and transform them in the logical or physical phase. We continue here with a more detailed view of the inheritance relationship.

A Tree Representation

A tree is an undirected graph T = (V, E) of vertices and edges having the following properties:

For every vertex pair there is a path between them (T is connected)
|E| = |V| - 1 (the '||' means the size of the set--its cardinality.)

The properties imply that a tree is acyclic--i.e. if you start at a vertex v and follow any path, you can't get back to v. If you remove an edge, the graph is not connected; add one and the graph gains a cycle.

So far we have a start to making an inheritance tree, and in fact supertypes and subtypes, but we'll need a more specific type of graph.

In this model, the edges are now directed, so if we give the edges a specific meaning, that meaning is meant in one direction. Because we're interested in inheritance, we assign the edges the isa relationship such that vertices at the arrow end of the directed edges are the inheritors.

With or without direction to the edges, the graph is connected and acyclic, making it a directed acyclic graph (DAG) known as a polytree or oriented tree. We see that a polytree can allow for multiple inheritance: vertex e is in the isa relationship with c and d. For example, a tractor can be at the intersection of vehicle and farm equipment hierarchies. While this kind of structuring occurs in data modeling, we won't consider it in this series. What is of interest to us now is adding one more restriction to our inheritance tree:

There is one source vertex having a directed path to all other vertices--the root.

We've now arrived at a type of tree--loosely a polytree known as a rooted tree--that can now be an abstract representation of supertypes and subtypes: the single root is the supertype, and all the labeled edges and other vertices form the inheritance paths and subtypes. Tree inheritance can be defined recursively: a vertex v directly connected to the root inherits the root's attributes and relationships, and the immediate descendants of v inherit v's attributes and relationships, and so on.

This paradigm for supertypes and subtypes is clean and elegant. But is it what we want?

Multiple Inheritance

The rooted tree metaphor for the supertype hierarchy limits us to single inheritance because each subtype vertex has exactly one path connecting it to the root. Simply put: every subtype vertex has exactly one inheritance edge. But consider this:

In this simple UML (data) diagram, the open arrows point to generalization classes; therefore Storefront Apartment derives attributes and relationships from two ancestor paths, both leading to the root Building class. These types of buildings exist, and if their entity extends its ancestors, the class must appear in our modeling. However, if it doesn't extend its ancestors, the stakeholders may want to see it, but it shouldn't move further in the design. We assume it does, and so we'll have to abandon the clean tree abstraction: if we remove the direction on the edges, we have a cycle--every vertex can reach itself on the same path. Whereas we won't in this series address multiple inheritance involving multiple roots as shown in the previous section (the polytree), we will address this diamond pattern form.

The problem of indirectly inheriting a base class multiple times is well known to object-oriented programming (OOP) languages, and each handles it differently. In C++ for example, this is allowed, and the language has constructs for disambiguating or overriding redundant resource references. But in C#, this is disallowed, although we can simulate the paradigm with interface inheritance and by following well-established patterns. In databases, diamond pattern multiple inheritance is also problematic, in different ways; I defer discussion to the design phases in the coming articles.

For now, keep in mind that it is always best to find a workaround for multiple inheritance whenever possible. Here are some strategies that are used in different scenarios:

Make multiply derived entities singly derived by showing only one inheritance edge connecting it directly to the supertype.
Split the supertype into two (or more) in a one-one relationship.
Refactor the design into different entity classes that don't result in multiple inheritance.
Refactor the entities into a role playing model.
Represent multiply derived entities as association classes (intersection entities).

The first is never valid--I put it in to make sure you're never tempted to do it. The sharing of attributes and relationships that the entity class does with other subtype entities is no longer clear, and this can result in data integrity issues in the physical design. Later in the series I'll expand the discussion.

I won't elaborate on the next three other than to say that they are not applicable to our building example. Here is a graphic depicting the Storefront Apartment as an association class; the rest of the diagram is unchanged:

The store and apartment are in a one-one relationship, the diamond represents the n-ary association between the three, and the association class extends the association with specialized attributes and relationships. But because the meaning is really the same as diamond-pattern multiple inheritance, we've only obscured the fact; to my mind something is lost when we show a subtype as not being a full part of the supertype hierarchy. In the next article, we'll see that the CASE tool functionality that transforms the conceptual to logical design will have problems with this style as well; it will correctly write in the shared key for all entity classes, but will not make it the primary key for the association class.

Our example shows that there are true cases of multiple inheritance, and trying to design around them may only gloss over issues that won't be well-handled in the physical design. Our inheritance tree has devolved into a directed graph of one root. So be it.

Completeness and Exclusivity

While we chose graphs for our discussion on inheritance, the next two properties are better described using sets. To understand them, we switch focus, terminology, and diagrammatic convention.

Set of Sets Representation

The diagram above was generated from an ER/Studio tool used for conceptual modeling, and the box-in-box (loosely) style is a compact representation of the supertype-subtype hierarchy. If we think of each box as being a set, then we are seeing a proper subset relationship but without the possibility of overlap. One implication of this is that as we generalize, the population goes up, and down as we specialize; in fact each member of an inner subtype box must be present in all containing boxes including the supertype. Because boxes cannot overlap, the tool seemingly precludes multiple inheritance, but when we get to conceptual modeling in the second article, I'll show you how we'll put it in.

The sets, however, do not contain sets of attributes and relationships; if they did, then the outermost set would correspond to a tree leaf, the innermost, the root, the intermediate, those on the ancestor path from leaf to root, and we would therefore need one diagram for each leaf node. Not a good representation. No; our focus now is on entity membership.

In the conceptual model, the main objects of interest are the (business) entities themselves. For this discussion, let's refer to them as entity sets, as they are sometimes known, and each of their members as the entities. In our running example, supertype Building would be the outermost entity set, Commercial and Residential, being proper subsets (or subtypes), would be rounded rectangles nested adjacently inside, and so on. The supertype must always be the outermost box. In the topics next, I've written in sets in place of entity set names.

Completeness

For these discussions, an innermost entity set is one that doesn't properly contain another; alternatively, a subtype from which no other subtype is derived.

This is the condition for completeness:

{ s | s ∈ supertype ∧ ( ∃subtype_i ) s ∈ subtype_i ∧ ( ( ¬∃subtype_x ) s ∈ subtype_x∧ subtype_x ⊂ subtype_i ) } = supertype

Where the proper subset restriction using subtype_x identifies subtype_i as an innermost entity set.

In other words, every entity in the supertype entity set must be found in an innermost entity set. In the diagram, the first set of sets is complete or exhaustive because all four entities in the superset are also in the innermost sets. The second two, though, are incomplete: in the middle set of sets, entity 'c' is in the supertype only; in the far right, 'c' stops in an internal subtype.

Besides stopping short of communicating the full design, incompleteness has implications for the physical design itself. In that modeling stage we would like full control over which entity sets become tables and which are merged into others. But any entity set--supertype or subtype--whose descendants cannot represent all of its entities must be realized as a table, or one of its containing sets must be. In the middle sample, the supertype must be; in the rhs sample, either the supertype or the subtype represented by subset {a,b,c} must be. We'll see this limitation when we examine the physical design in Part III.

The moral of the story: always strive for complete designs.

Exclusivity

This is the condition for exclusivity.

{ s | ( ∃subtype_x ) s ∈ subtype_x ∧ ( ∃subtype_y ) s ∈ subtype_y ∧ ¬( subtype_x ⊃ subtype_y ∨ subtype_y ⊃ subtype_x ) } = ∅

Unlike completeness, exclusivity involves only the subtypes. In the equation the set builder is looking for entities that are in two different subtype entity sets that are not in the proper containment relationship, and if the resulting set is empty, then all subtypes are exclusive--i.e., no two subtypes have overlapping membership. We rule out proper containment so as not to consider entity sets in the same specialization path--if we didn't the condition would always be met.

For our building hierarchy example, the building at 1 Main Street is a storefront apartment. By the formula, we can't compare memberships in pairings of the Commercial, Retail Store, and Storefront Apartment entity sets because of the containment restriction, and the same applies on the residential side. But we will find the building at 1 Main in the pairing of the commercial entity set and either the residential or apartment sets, and in other valid combinations, and so the generalization hierarchy is not disjoint. I use forms of the terms (mutually) exclusive, overlapping, and disjoint to describe the same concept.

It is important to note that Storefront Apartment is ruled out of the test for overlap. In this analysis, entity set containment is really another way of expressing specialization paths leading from the root supertype. So whether or not we have a multiply derived subtype extending the attributes or relationships of its ancestors, the possibility of finding an entity that exists in two different paths from the common supertype is enough to show that the design is not exclusive. However, of course, the existence of a multiply derived subtype(s) also implies that the hierarchy is not exclusive--but it is only a sufficient condition, not a necessary one (the logical converse doesn't hold).

In the diagram for this section, in the lhs pane the design is exclusive, but the middle and rhs pane designs are not. Also, the designs are incomplete for all panes.

The formulaic tests above for exclusivity and completeness are independent of each other, and so the properties/violations thereof exist in any combination for any design. A design in which, for every non-innermost entity set, its directly derived entity sets form a partition is both complete and exclusive. (Subsets in a partition of set X are non-empty, non-overlapping, and their union of members equals X.)

By the way, do you see the trick I'm playing with the rhs pane? The population passes the test for exclusivity because of proper containment--which it should, but apparently shouldn't. But consider: if the entity set holding only 'c' can never hold 'd,' then it should be shown nested in the set {a,b,c}; otherwise the set should be shown as {c,d}. In my methodology, correct analysis rests upon showing true representative populations. In other words, the diagram is invalid.

As with incompleteness, overlap will limit merging choices when we turn entities into tables. For example, a physical design in which two tables can hold rows for the same entity and in which the tables share common rolled-down attributes or foreign keys, will cause problems for data integrity enforcement.

In the favorable case, every entity is represented in exactly one innermost subtype (equivalently: at the end of one specialization path). The problems we've discussed multiply when both completeness and exclusivity are violated, so always strive for both. The logical or physical data modeler (you) will thank you. And when you can't you can't--and you make note and handle the problems in the physical design. Thank yourself anyway.

The Business Rules

We deduce supertype and subtype hierarchies from interviews with the business people, analysis of existing systems if any, and the documents produced during the analysis phase. But for this exercise let's assume our only point of contact is the business process modeler, and he has devised for us a diagram in BPMN.

The base problem is simple: track some of the company's employees, of which there are two types, manager and engineer. The business process diagram is an outline of adding new employees through an app interface and placing the data in permanent storage.

A Process Model in BPMN

The diagram shows the basic actions taken as the application user software collaborates with the middleware and back end to record a new employee into the system. Here, the front end is emphasized by having more detail, and by being in its own pool; in contrast, database and web service software are in their own lanes but must share the Internal System pool.

In BPMN, circles are events, and the start event occurs when the app user is signaled of a new hire. The sequence flow shows a series of tasks common for all new employees ending at a decision point, known as a gateway. The 'X' or lack of any symbol in the gateway makes it exclusive: exactly one path of several must be taken. If we imagine a token traversing the flow from the start to the end events, then by the time it reaches the Update System task, we have our clue that we'll be designing in a supertype to hold shared employee attributes, and two subtypes for the respective manager and engineer attributes in the conceptual model.

To finish the process model, we see the app software sending an intermediate message event to the web service, which will defer the actual transaction to an interface database procedure--the embedded sub-process (collapsed) in the database lane. The double-circled N is an error intermediate event that fires if the app user fails permission validation (which is a double-check measure--this should have been prevented in the front end).

Because the token must take either the manager or engineer path at the gateway, we conclude that the generalization hierarchy is both complete and exclusive. But we can tweak the gateway to get the other three possibilities.

Variant One: Exclusive Incomplete

The default flow is marked with the short slash line. Referring back to the middle case in the diagram on the section on completeness, the default flow leaves entity 'c' in the supertype (employee) only with no subtype representation.

Variant Two: Non-Exclusive Complete

The 'O' in the gateway means that it is inclusive; an entity can be a manager or engineer or both. Reference the middle pane in the exclusivity diagram (and remove set {c}).

Variant Three: Non-Exclusive Incomplete

Now we're in the wild wild west. The imaginary token can take the exclusive gateway's default path and strand an entity at the supertype, allowing for incompleteness. Or the token can for a given employee move onto the inclusive gateway and split itself into two, allowing for subtype overlap.

Variant: Non-Exclusive Incomplete Mangineer!

If an employee is both a manager and an engineer, then the app user will have more information to input. In our design phases to come we will need a new entity--the team has dubbed it the mangineer--and judging by the appearance of this part of the process model, the entity will participate as a subtype via multiple inheritance. We add this extension to the business rules for the work ahead, knowing it will complicate our design. That's what I'm here for.

Other Business Rules

While we have the foundation for the generalization hierarchy, we need more business rules to proceed to the data design stages. Here they are--in mere outline form.

An employee is mostly associated with a department.
An employee can make or receive any number of referrals. (A referral is an introduction of a company client from one employee to another to generate more business.)
A manager mostly has a budget.
An engineer has one or more skills.
An engineer must be associated with exactly one engineering discipline.
A mangineer gets bonus pay and a perk plan.

Finally: An Historical Perspective

A seminal work on generalization hierarchies dates to the time when researchers were expanding Codd's relational theory--the theory that would become the basis for the relational database management systems to come. The article was published in 1977 under the name Database Abstractions: Aggregation and Generalization. Here is a sample, directly relevant to this article:

Authors Smith and Smith note the following about the diagram:

"...The generic hierarchy shown in Figure 1 has two characteristics that do not belong to all generic hierarchies. The first characteristic is that it is a tree (i.e. no generic object is the immediate descendant of two or more generic objects). The second characteristic is that the immediate descendants of any node have classes that are mutually exclusive...."

That should sound familiar, along with these exceptions:

The authors suggest grouping objects known as "clusters" as a partial remedy for overlapping class membership. The cluster holds for its immediate parent class only those child classes that can be referenced under one generalizing name, and which are mutually exclusive; thus the cluster for the first three children of vehicle is "propulsion system," and the next three, "transit medium." While this additional level categorization makes things clearer, a helicopter is still a motorized air vehicle (multiple inheritance is still possible).

Other parts of the Smith and Smith paper discuss "domain inheritance"--inheritance--and define a set of invariants that in part imply that incompleteness is possible.

The paper describes "image domain," which we know as the discriminator column or columns, and the correct hierarchical level for relationships vis-à-vis business rule enforcement--a couple of the topics we'll be visiting in the coming articles.