Carmot ODL (5) – echo fields
This post is number 5 in a sequence of 7.
The ‘><’, ‘><><’, and ‘?><’ symbols – echo fields
The echo field symbols were added to the Carmot language in order to allow the ontology definer to specify any reverse linkage that should be created whenever a collection (see here) or persistent reference (see here) field is assigned. Once a standard language is extended to include the description of persistent links between data items, one is forced to deal with issues that are normally the domain of the database administrator. In a relational database, such issues relate to the concept of “referential integrity” and are frequently enforced using “constraints”. More often however, it is left to application code to enforce the more complex inter-record consistency protocols.
This issue is handled in Carmot primarily via the echo field symbol ‘><‘, and secondarily through the use of scripts and annotations. In the earlier post on the collection reference ‘##’ (see here), we have already seen some of the implications of Carmot’s echo fields. To illustrate the use of echo fields more clearly, we will consider a series of different examples gleaned from the standard ontology.
In many situations, it is undesirable for a reference from one persistent data item to another, through a field of the referencing type, to cause any kind of reverse reference in the opposite direction. Two good examples of this, given here, are the ‘source’ and ‘language’ fields of the type Datum (the root of all persistent types). In the referenced post, we see that the echo field symbol is not applied to either of these fields and thus the reference is uni-directional. To see why, consider what would happen if we added a field to the type ‘Language’ of the form “Datum ##languageItems >< language;” and we declared the ‘language’ field in Datum as “Language #language >< languageItems;”.
Such a specification would mean that every time any persistent item is created in English, the corresponding Datum (or descendant) would be added as an ET_Hit to the ‘##languageItems’ collection reference in the ‘Language’ record for English. Simply by fetching the ‘Language’ record for English, we would then have direct links to every other record in the system, regardless of type, whose language was English. However, we have to ask ourselves whether this is worth the cost in processing overhead in realistic situations. We can already find all records for which the language is English simply by issuing a MitoQuest™ query for all records where the ‘language’ field contains the text “English”, so adding this echo field does not add to our functional capabilities; it does, however, seriously degrade performance. Consider a system holding, say, 100 million records of various types, all of which are in English. Whenever a new record is added, the persistent data record for English must be updated by appending a new ET_Hit to the ‘##languageItems’ collection reference field, which already contains 100 million entries. Just the search time to determine whether the collection reference already contains the unique ID being added will begin to become significant. Moreover, as we saw previously, whenever a record is fetched to local memory from persistent storage, all embedded collections referenced directly by that record are also fetched. Any fetch of the record for English is therefore potentially also going to fetch an embedded collection of 100 million hits, with a major impact not only on CPU time but also on network bandwidth.
Here, then, the reverse/echo field has the potential to explode even though the forward reference is a singular persistent reference, and it adds no useful functionality in practical operation. These two facts together lead us to omit the echo specification. Examples of making this tradeoff can be found extensively in Mitopia’s default ontology. With the echo field omitted like this, Carmot essentially performs in a similar manner to simpler systems such as relational databases. Within Mitopia®, even in the absence of an echo field, if the target of a reference is deleted, the system automatically deletes all references to it from any persistent data anywhere in the system’s servers. In this trivial case, where there is no echo field specification, Mitopia® guarantees referential integrity in a manner similar to a relational database. Unlike the relational concept of referential integrity, however, that of the Carmot language goes much further when the echo field is specified.
Consider next a simple ‘Stick It’ note relationship as mediated by the type ‘NoteRegarding’ in the base ontology:
Here we see that the “#regarding >< notes.notes” field specifies a persistent reference to any persistent item derived from the type Datum with an automatic echo field (field path “notes.notes”) in the referenced record itself. Examining the definition of the ‘notes‘ sub-structure within Datum:
we see the opposite side of this echo relationship, that is, the field ‘notes.notes’ is defined to be a collection reference to items of the type ‘NoteRegarding’ with an echo field ‘regarding’. This is therefore a fully symmetric echo field specification. Since all access to data within Mitopia® goes through the abstraction layer APIs, it is virtually impossible to create or modify one side of this link without automatically creating or modifying the other side, both in memory and thereafter in persistent storage. This takes referential integrity to a new level that operates across different fields and different types. The reasons why the ontology definer opted to specify an echo field in this case are as follows:
- The fact that a ‘Stick It’ note is attached to a given item is presumably functionally and semantically significant and is frequently the result of explicit analyst action. This link is therefore significant in a way that the ‘language’ link is not.
- There is no potential for an explosion in the echo field since notes are attached on a per item basis and not as an unintended side effect of all record creation. The collection of notes associated with any given item will be relatively small compared to our previous example.
- If a link is significant in one direction, it is probably significant in the other direction and since much of what Mitopia® does is explore links, we must instantiate and preserve the opposite link so that when for example an analyst comes across a record, he is made aware in the GUI of any Stick It notes attached to that record by other analysts. The link will also be discovered by such tools as the link analyzer. One such ‘Stick It’ may contain information vital to his analysis.
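The symmetric maintenance described above can be illustrated with a minimal Python sketch. It is not Mitopia® code: the `Record` class, the `ECHO` and `COLLECTIONS` tables, and the `assign` function are all hypothetical stand-ins, assuming only that every assignment passes through a single function that knows the declared echo field for each field path.

```python
class Record:
    """Toy stand-in for a persistent record; fields are held in a dict."""

    def __init__(self, name):
        self.name = name
        self.fields = {}


# Hypothetical echo table mirroring the two declarations discussed above:
#   NoteRegarding:  #regarding >< notes.notes
#   Datum:          ##notes.notes >< regarding
ECHO = {
    "regarding": "notes.notes",
    "notes.notes": "regarding",
}

# Field paths that hold a collection rather than a single reference.
COLLECTIONS = {"notes.notes"}


def _store(record, field, target):
    """Write one side of a link, appending for collection fields."""
    if field in COLLECTIONS:
        items = record.fields.setdefault(field, [])
        if target not in items:
            items.append(target)
    else:
        record.fields[field] = target


def assign(record, field, target):
    """Assign a reference and, if an echo is declared, the reverse link too."""
    _store(record, field, target)
    echo = ECHO.get(field)
    if echo is not None:
        _store(target, echo, record)


note = Record("Stick It note")
person = Record("Person A")
assign(note, "regarding", person)           # forward link set by client code
print(person.fields["notes.notes"][0].name)  # reverse link was added for us
```

Because client code only ever calls `assign`, it never has to remember to update the other side of the link, whichever side it starts from.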
We can see the effect of this kind of echo field quite clearly in the figures of the earlier post (here) on collection references. In the first figure, the links shown between the “Person A” record and the “GroupRelation X” record are the result of the echo field for the ‘notes.relatedFromGroup’ field, and thus are maintained automatically whenever either link is established. The degree to which this simplifies the task of maintaining consistency in data references is difficult to over-stress since it allows the ‘echo’ field maintenance to be effectively ignored by client code. In the second figure of the ‘##‘ post, all the green line connections shown in the diagram (dotted and solid) are essentially created and maintained automatically by the substrate as a result of these simple echo field specifications. Again, in these examples, the links pass through the type ‘GroupRelation’ which is a note type just like ‘NoteRegarding’ except that it creates a note linking one to many items; the arguments for the use of echo fields are similar to those of NoteRegarding.
Note that in the first (unresolved) collection reference figure, there is no echo field associated with the ‘#relationType’ connection between “GroupRelation X” and “RelationType – Evidence”. Here again, it is not clear what the utility of tracking all occurrences of “Evidence” might be, since once again this can be discovered by query. Moreover, the relation is not specific to a given instance but is instead associated with a ‘kind’ of record (that is, anything that is evidence). This might give rise to explosion in a system targeted at a law enforcement agency such as the FBI, where whole buildings are allocated to holding ‘evidence’.
In all of the echo field examples thus far, the specification has been for a bidirectional echo link where one side of the link is a persistent reference and the other side is a collection reference. Of course, other examples also exist for the situations “persistentRef<->persistentRef” and “collectionRef<->collectionRef”. There is also the possibility that one side of a relationship may specify an echo field whereas the other does not; this is known as a unidirectional echo link (as opposed to a bidirectional echo link):
A possible definition for a person’s family references
The listing above shows a potential definition for the ‘family’ section of the type ‘Person’ which clearly illustrates a number of the other possibilities and issues for echo fields, in this case entirely between records of the same type ‘Person’. Starting with the field ’##spouse >< family.spouse’, we see that this is a collection reference field that is self referential to the spouse field in some other ‘Person’ record. The fact that this field is plural rather than singular (collection vs. persistent reference) is reflective of cultures and religions other than the standard Western ones where it is quite possible to have more than one spouse at a time. According to a limited Western viewpoint, this field should be declared ‘#spouse >< family.spouse’, that is with a singular persistent reference at either end of the link. This one-to-one persistent link is a trivial subset of the connections expressible by the collection reference link and can be expressed entirely by it.
In operation, this means that any time a data record is acquired in which the person is declared to have a spouse of a given name, the corresponding spouse record is either created with the opposite spousal link or, if the data for the spouse already exists but there was no previous knowledge that that person was married, the echo spousal link is added to that spouse’s record. One might ask why the ontology designer did not choose to declare separate ‘husband’ and ‘wife’ fields for clarity. There are two answers to this. Firstly, many data sources may indicate that a spouse exists, and may give the spouse’s name, but may not indicate the spouse’s gender; indeed, the gender of the referencing record may also be uncertain in any given source. Secondly, when one considers new statutes appearing in the U.S. and elsewhere allowing same-sex marriages, it is clear that we will get into all kinds of problems if we apply our conventional rules of society to the definition of the fields in our ontology. The ontology definer must distinguish what is societal convention from what is absolute.
What is absolutely certain is that there is just one referring record and there may be one or more spouses (of either gender); this leads to the declaration for ‘spouse’ given above. Remember that if we wish to restrict the choices that a user has during forms entry, say (perhaps our state does not allow same-sex marriage or polygamy), then that is a matter for ‘perspectives’, not cause to limit the expressive power of the underlying ontology. Perspectives will be discussed in later posts.
The two fields ‘#father >< family.children’ and ‘#mother >< family.children’ both specify ‘family.children’ as the echo field; however, the ‘##children’ field itself does not specify an opposite echo field. These fields are examples of unidirectional echo links. The reason is that, with two different fields on one side of the link, it would be impossible to specify the appropriate echo field unambiguously: we may not know the gender of a given person from a particular source at the time we discover and assign to that person’s children field. This means that we do not know whether the referencing person is the child’s mother or father without additional information, and so we cannot allow the architecture to enforce this connection automatically. In contrast, if we have a source that specifies the father and/or mother of a given individual, it is quite acceptable to assign to the ‘father’ or ‘mother’ field of that individual, which, since it has an echo field specified, will cause the ‘##children’ field of the father/mother record to be modified to add the child.
A more correct definition of a person’s family references
Because of issues related to this particular example of a unidirectional link, the base ontology for a person’s family is actually as shown in the listing above. The use of a ‘parents’ field rather than a distinct ‘father’ and ‘mother’ field eliminates all kinds of potential ambiguity.
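To make the asymmetry of the earlier ‘father’/‘mother’ declarations concrete, here is a small Python sketch of a unidirectional echo link. The `Person` class, the `ECHO` table, and `assign` are hypothetical illustrations rather than anything taken from Mitopia®; the only behaviour assumed is that a field with a declared echo updates the other record, while a field without one does not.

```python
class Person:
    """Toy record with the three family fields from the first family listing."""

    def __init__(self, name):
        self.name = name
        self.father = None
        self.mother = None
        self.children = []


# Hypothetical echo table: only 'father' and 'mother' declare an echo field.
# 'children' has no entry because the correct reverse field is ambiguous.
ECHO = {"father": "children", "mother": "children"}


def assign(person, field, target):
    """Assign a reference; follow the echo only if one is declared."""
    current = getattr(person, field)
    if isinstance(current, list):
        if target not in current:
            current.append(target)
    else:
        setattr(person, field, target)

    echo = ECHO.get(field)
    if echo is not None:
        reverse = getattr(target, echo)
        if person not in reverse:
            reverse.append(person)


child = Person("Child")
dad = Person("Dad")
unknown_parent = Person("Parent of unknown gender")

assign(child, "father", dad)               # echo adds child to dad.children
assign(unknown_parent, "children", child)  # no echo: father/mother untouched
```

After the second call, `child.mother` is still unset, which is the conservative outcome the text argues for when the parent’s gender is unknown.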
Another subtle form of asymmetry can occur even in bidirectional echo links when the echo field does not actually occur in the type referenced by a field, but does occur in one or more descendant types. For example, consider the type ‘StockHolding’ in the base ontology:
In this type, the field “Entity #holder >< financial.stockHeld;” specifies that any entity can potentially hold shares in a commercial organization in which case in the entity’s ‘financial’ section we will automatically update the ‘stockHeld’ echo field to reflect this fact. However if we look at the definition of the type Entity, we see that there is no sub-section of Entity called ‘financial’ and thus the echo field called out does not actually exist. However, if we look at the definitions for the type ‘Person’ and ‘Organization’ we will see that both have a ‘financial’ sub-section in their type definitions and that both include the appropriate echo specification for the bidirectional link.
StockHolding ##stockHeld >< holder;
By this device we are stating that only certain types of entities are legally allowed to hold stocks. For example, an animal or a software program, both of which are considered to be individual entities in the base ontology, are not allowed to hold stock since they have no ‘##financial.stockHeld’ field. Mitopia® only attempts to access the echo field at the time a link is created. Provided the type of the referenced item (which must be a descendant of the referenced type) contains the specified field, the reverse link will be created as expected. If, however, the type of the referenced item (e.g., Animal) does not contain the field, a warning will be generated at run-time and no reverse link will be created. The same technique is used in ‘Person’ and ‘Organization’ to reference all fields within the ‘contacts’ portion of the type from other types (e.g., Address) using the type ‘Entity’. When using this strategy, it is important that the ontology designer ensure that all appropriate descendant types either explicitly declare, or inherit, the echo field referenced.
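The run-time lookup of the echo field on the actual (descendant) type can be sketched in Python as follows. The class names follow the text, but the attribute `stock_held`, the function `set_holder`, and the use of Python’s `warnings` module are assumptions made for illustration. The sketch also assumes the forward link is still recorded when the echo field is missing, which the text does not state explicitly.

```python
import warnings


class Entity:
    """Base type: declares no 'financial' section, hence no echo field."""

    def __init__(self, name):
        self.name = name


class Person(Entity):
    def __init__(self, name):
        super().__init__(name)
        self.stock_held = []  # stands in for '##financial.stockHeld >< holder'


class Animal(Entity):
    """An entity type that cannot hold stock: no stock_held attribute."""


class StockHolding:
    def __init__(self):
        self.holder = None  # stands in for 'Entity #holder >< financial.stockHeld'


def set_holder(holding, entity):
    """Set the forward link; create the echo only if the actual type has the field.

    Returns True if the reverse link was created, False otherwise.
    """
    holding.holder = entity  # assumption: forward link is kept either way
    echo = getattr(entity, "stock_held", None)
    if echo is None:
        warnings.warn(
            f"{type(entity).__name__} has no stockHeld echo field; "
            "no reverse link created"
        )
        return False
    if holding not in echo:
        echo.append(holding)
    return True
```

The echo field is resolved against the concrete type of the referenced record at the moment the link is made, so declaring the field on ‘Entity’ itself is unnecessary.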
From the examples above, it should be clear that the echo field specification can be used to solve many of the referential problems that arise once one starts to deal with ontological data, however, not all reference issues can be dealt with quite so easily. The problem comes when one tries to enforce additional constraints on references that can only be checked by examining other fields of the current record or by chaining through links to confirm or deny an association. For example, consider enforcing the constraint that “no person can be a descendant of himself” which is a fairly self-evident statement and which it might be desirable to ensure we did not allow to be violated in our data set.
Clearly, to confirm or deny this statement, we must recursively chain through all children of the person and confirm that at no level in the tree does any child claim the original person as a child of their own. This is just an extreme case of a class of problems known as field constraints. Other obvious examples are “a person’s date of birth must be before his date of death” or “a person’s child cannot be older than him/her”. The enforcement of these more sophisticated types of constraints is left by the Carmot language to field and type scripts and annotations registered against types in the ontology. Scripts in particular can be set up to invoke registered code functions through the API function TS_RegisterScriptFn(), and this technique is used to enforce all the more complex constraints, thereby allowing such logic to be customized on a per-system basis.
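As an illustration of the “no person can be a descendant of himself” check, the following Python sketch walks the child links looking for a path back to the starting person. It is not a Mitopia® script and does not use TS_RegisterScriptFn(); the function name and the dictionary representation of child links are assumptions, and an explicit stack is used in place of recursion so that very deep family trees do not exhaust the call stack.

```python
def is_own_descendant(person, children_of):
    """Return True if `person` is reachable from itself via child links.

    `children_of` maps each person ID to a list of that person's child IDs.
    """
    seen = set()
    stack = list(children_of.get(person, ()))
    while stack:
        current = stack.pop()
        if current == person:
            return True          # found a chain leading back to the start
        if current in seen:
            continue             # already explored this branch
        seen.add(current)
        stack.extend(children_of.get(current, ()))
    return False


family = {"A": ["B"], "B": ["C"], "C": []}
print(is_own_descendant("A", family))   # False: A -> B -> C ends cleanly

family["C"].append("A")                  # corrupt the data: C claims A as child
print(is_own_descendant("A", family))   # True: A -> B -> C -> A
```

A check like this would run before a new child link is committed, rejecting the assignment if it returns True.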
Thus far in our discussions regarding echo fields, we have ignored ‘@@’ fields (see here), and talked about direct echo field specifications via the ‘><‘ operator. Obviously, it is highly desirable that the elements of an ‘@@’ sub-collection be able to reference other items via ‘#’ and ‘##’ references, since this makes the ‘@@’ collection a very useful tool for organizing and annotating external references in a way not possible for other reference types. The problem is that since the ‘@@’ elements are not derived from Datum, they cannot directly be referenced by unique ID from an external record, either directly or as part of an echo specification. To overcome this limitation, Carmot introduces the indirect echo field operator ‘><><‘. Consider the following example taken from the base ontology:
Within a person or organization ‘contacts’ field, we wish to keep track of the address history, which implies we must annotate the persistent type Address with ‘from’ and ‘to’ date fields describing when the entity involved occupied the address concerned. Without the ‘@@’ collection, we would be forced to declare Occupancy a persistent type with two-way references to the ‘contacts’ and also the Address. This is somewhat cumbersome in this case, and so we use an ‘@@’ reference field for ‘addressHistory’ thereby eliminating the need for a persistent two-way reference type while still allowing the required annotation fields. However, we still need an echo field relationship between Address and an Entity’s ‘addressHistory’ which is not possible using the ‘><‘ operator alone, since we must also specify the field within Occupancy (i.e., ‘address’) that mediates the reference within the ‘addressHistory’ collection elements. As can be seen from the example, we solve this problem by using the indirect echo specifier ‘><><‘ in the declaration of the ‘addressHistory’ field. This syntax can be interpreted as meaning “the echo field and echo record type for the ‘@@ addressHistory’ field are specified in the ‘address’ field of Occupancy”.
When we look at the ‘address’ field of Occupancy, we see that it specifies the ‘contacts’ field of Address as its direct echo field using the ‘><’ operator. Thus the persistent echo field of ‘addressHistory’ is taken by Carmot to be the ‘contacts’ field of Address. Notice that the opposite side of the echo specification within the Address type, that is, the ‘## contacts’ field, is a persistent collection reference, and directly references the ‘@@’ field ‘contacts.addressHistory’ as its echo field, not the type Occupancy. This arrangement allows the Mitopia® substrate to determine which fields are involved in the echo relationship, regardless of the direction in which the relationship is first established, by examining any ‘><><’ operators involved to obtain the additional field path required within the ‘@@’ element. By use of this knowledge, the substrate is able to ensure that echo relationships, even those mediated through relative collection references, can be maintained automatically. It is particularly important when echo fields are involved that the correct $UniqueBy annotation be provided (or that the ‘@@’ elements have unique ‘name’ fields) for the ‘@@’ element type; otherwise there is the potential for echo relationships to become intermingled, with confusing consequences. Visually in the GUI, the two sides of this Address relationship appear as shown below.
The choice of using an ‘@@’ field rather than an intermediate persistent type and an ‘##’ field is largely driven by consideration of whether it makes sense for the additional annotation fields in the intermediate type to be exported for visibility to outside types, or whether those fields more properly belong privately to the referencing type. In the case of an Address, the ‘from’ and ‘to’ for a particular Entity’s stay apply only to that Entity. The Address continues before and after the Entity’s residence (hopefully) for a period that may potentially be determined by the ‘startTime’ and ‘duration’ elements of the Address’s “Location location;” field.
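The indirect echo through ‘addressHistory’ can be pictured with the following Python sketch. It is only an approximation under stated assumptions: the dataclasses, the attribute names `address_history`, `start`, and `end`, and the two helper functions are hypothetical, and the uniqueness rule (one Occupancy per Address when the link is created from the Address side) merely stands in for whatever the $UniqueBy annotation would specify.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Address:
    label: str
    # Stands in for '## contacts >< contacts.addressHistory'.
    contacts: list = field(default_factory=list)


@dataclass(eq=False)
class Occupancy:
    """'@@' element: has no unique ID of its own, only annotation fields."""
    address: Address      # stands in for '#address >< contacts'
    start: str = ""
    end: str = ""


@dataclass(eq=False)
class Entity:
    name: str
    # Stands in for '@@addressHistory ><>< address'.
    address_history: list = field(default_factory=list)


def add_occupancy(entity, address, start="", end=""):
    """Forward direction: the echo is found via the element's 'address' field."""
    entity.address_history.append(Occupancy(address, start, end))
    if entity not in address.contacts:
        address.contacts.append(entity)


def add_contact(address, entity):
    """Reverse direction: create an Occupancy element if none points here yet."""
    if entity not in address.contacts:
        address.contacts.append(entity)
    if not any(o.address is address for o in entity.address_history):
        entity.address_history.append(Occupancy(address))
```

Note that the Address only ever refers to the Entity, never to an Occupancy element, which matches the point that ‘@@’ elements cannot be referenced by unique ID from an external record.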
Last, but not least, we come to the ‘?><’ echo field symbol. As we have mentioned before, for certain ‘##’ fields there is a danger of the field content in persistent storage building up to an unmanageable number of entries. In the discussion above we pointed out that such ‘##’ fields (e.g., all the licenses granted by a government department) could simply be omitted and replaced by an explicit query by the user. Sometimes, however, there is a good logical reason why the field should be kept explicitly in the ontology, even though it might become unmanageably large. In these cases (for ‘##’ fields only) the ‘?><’ echo field specification can be used in the ‘##’ field concerned. The leading ‘?’ implies the need to issue a query to discover the contents of the ‘##’ field containing the echo specification. During the mining process, a ‘?><’ echo specification is treated identically to a ‘><’ specification; that is, the ‘##’ field contents are built up as normal and can be reviewed in the mined data before it is persisted, just as if a ‘><’ echo field specification were present. This is very handy for confirming the logical correctness of any mining script.
However, as the mined data is persisted into the servers, the presence of the ‘?><‘ echo field specification causes the server to discard any ‘##’ contents it receives from the client collections, and instead to simply indicate that the ‘##’ field is non-empty (by placing a root record within it). What this means is that the server does not waste any time trying to merge new ‘##’ field contents. It is this merging in the server of new ‘##’ contents from records being persisted with the existing ‘##’ contents, that leads to the large performance slowdown during persist in the server and the excessively large record size implied by such huge ‘##’ sub-collections. The ‘?><‘ echo specification prevents this, while logically preserving the bi-directional echo links in the ontology.
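The difference between the two persist behaviours can be sketched in Python. This is a toy model, not the server’s implementation: `ROOT_MARKER`, `persist_field`, and the list representation of a ‘##’ field are all assumptions chosen to show why the ‘?><’ path avoids the merge cost.

```python
# Hypothetical placeholder for the root record that marks a field as non-empty.
ROOT_MARKER = object()


def persist_field(stored, incoming, query_echo):
    """Return the new server-side contents of a '##' field.

    stored     -- contents already held by the server
    incoming   -- contents arriving from the client collection
    query_echo -- True if the field is declared with '?><'
    """
    if query_echo:
        # '?><': discard the incoming hits and record only "non-empty".
        return [ROOT_MARKER] if (incoming or stored) else []

    # '><': merge, checking each incoming hit against what is already stored.
    merged = list(stored)
    for hit in incoming:
        if hit not in merged:   # this membership test is what gets expensive
            merged.append(hit)
    return merged
```

With `query_echo=True` the work done is constant regardless of how many hits arrive, whereas the merge path grows with the size of both the stored and incoming collections.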
When a client subsequently fetches a record containing such a ‘?><’ echo specification, if there is any content to the field, all the client code receives is the empty root record (thus indicating that the field is not empty). The client GUI layout logic recognizes that a ‘?><’ echo specifier is present and displays the ‘##’ collection differently from the normal approach, viz:
The content of the list control is shown as empty, however, the presence of the ‘?’ icon above the list control indicates that this is in fact a non-empty ‘?><‘ field for which the field content can be explored by clicking on the ‘?’ icon. If the field is actually empty, the ‘?’ will not be displayed, nor will the ‘?’ be displayed if the unique ID for the referencing record is a temporary ID (because a permanent ID is required to construct the query to launch when the ‘?’ is clicked). Note that during manual data entry or following mining (before persist), there may in fact be entries in the list, indeed in manual data entry you can add items to the list as normal, however, once the data has been persisted and is fetched back from the server, the situation reverts to that shown above. The result of clicking the ‘?’ button is to launch the Query Builder interface to fill in all (or part) of the list by querying the server(s).