Asking hard questions

In a number of earlier posts we have talked about the shortcomings of current database technologies and query languages, and how they prevent us from building systems that are capable of asking and answering complex ‘knowledge level’ queries and above. Ultimately of course it is exactly these kinds of complex questions that we seek to answer. In this post I want to examine how one actually asks such questions in a Mitopia® environment, and show how radically the approach differs from a much simpler traditional (e.g., SQL query) scenario. Future posts will look at how Mitopia® executes these queries; for now we will focus just on the process of ‘asking’ the question through a GUI interface.

NOTE: You will find demos involving link analysis based queries in the “Mitopia.demo.m4v” (go to offset 22:40) and “MitopiaBigData.mov” (go to offset 11:20) videos on the “Videos, demos & papers” page. For a detailed look at other complex query types, see the “MitoipaNoSQL.m4v” video on the same page.

Before we proceed, we must first compare and contrast the kinds of questions that one can ask of a ‘database’ depending on the underlying organizational data model. The knowledge pyramid (shown to the right) serves as a common classification for the levels of data organization and also for the types of questions that can potentially be asked and answered by data organized at that level.

The ‘data’ level can be characterized in this context as being based on a ‘document model’: queries are limited to something like ‘find me all documents containing the following words’, and there is no detailed structure that can be queried through the query language. The world wide web, queried through Google, is the classic example of this very primitive level of query. The results come back as millions of document hits, and the user must read some of the documents and figure out for themselves whether the answer to their original question (which of course they had to pose as a series of key words) is anywhere within the set.

The ‘information’ level corresponds to organizing data into a set of tables which can be queried as to the content of table fields (i.e., a taxonomy) – in other words, this is the domain of relational databases and the standard SQL query language. Because the underlying data model is more detailed, one can ask somewhat more complex questions of the general class “find me all records where field ‘A’ has value ‘a’ and/or field ‘B’ has value ‘b’”. The SQL language allows Boolean conditions and text/numeric operators to be applied to appropriate fields. Information-level queries in general allow one to ask about the content of fields within a table; however, it is difficult (i.e., requires a ‘join’) or impossible to ask questions about the ‘connections’ between things in a data set, particularly if the schema was not designed up front to answer that particular question. This makes relational databases great for answering questions about record content and transactions in tightly constrained domains (e.g., a bank account), but pretty much useless for asking questions about the real world and the things going on within it (a very unconstrained, connection-based domain).

In this particular post we will ignore ‘wisdom’ level queries and focus on ‘knowledge’ level queries, i.e., queries based not just on the content of records, but also on the many (indirect and changing) relationships between records of a wide variety of types. There is no existing standard query language that handles this level of query, since these questions require a contiguous ontology (of everything) as the underpinnings for the database. We have discussed these underpinnings in previous posts; now we will explain how a user is able to easily ask these questions. Such questions might be characterized as belonging to the class “find me all records having a direct (or highly indirect but) ‘significant’ relationship to another entity/concept”. For example, one might ask “Find me all people who might potentially be running or setting up a methamphetamine lab”, “How does this disaster impact my stock portfolio?”, or perhaps “Find me all passengers on a plane that may have links to terrorism”. These questions and others like them are exactly those that people really have in mind when they pose queries, yet it is clearly impossible to ask, let alone answer, such questions with an ‘information’ or ‘data’ level substrate. In Mitopia® such queries can be posed directly and easily.

It is assumed below that the reader is already familiar with Mitopia’s Carmot ontology, with the basics of the MitoPlex™/MitoQuest™ federation, and with Mitopia’s approach to auto-generating the GUI.

Mitopia’s Link Analysis tool

First let us examine Mitopia’s link analysis tool and how it is used to explore the links between persistent data items of all types, as mediated by the underlying Carmot ontology. Link analysis is conventionally a process of visualizing data and the connections within it so that the analyst can grasp a larger picture; it has nothing to do with query or predictive detection, but is instead a means to represent past data in a way the human analyst can more easily interpret. A conventional example (from Centrifuge Systems) is shown to the left. As we will see below, Mitopia®, through its contiguous ontology approach, takes this process to an entirely different level that includes both query and continuous/predictive monitoring. The screen shot below shows Mitopia’s ‘shortest path’ link analysis tool in use:

Mitopia's shortest path based link analysis tool

The scenario above is that of a police department looking for anybody who may be operating a methamphetamine lab. One of the essential precursors for meth is pseudoephedrine (PSE), which is commonly used in decongestants and which as a result must now be purchased at the pharmacy counter, requiring the purchaser to identify themselves. In this scenario, we have chosen to begin exploring this question by looking for connections between people purchasing PSE (assuming the police department receives a ‘feed’ of such purchases from pharmacies) and anyone that the department has arrested in the past for narcotics crimes (a data set that the police department obviously possesses). The endpoints on the left of the central ‘path’ area are four different reported PSE purchases that we will use to train the link analyzer; the single endpoint on the right is a Mitopia® query for all persons arrested in the past for narcotics crimes. These endpoints were simply dragged and dropped into the link analyzer, and could be any type (or combination of types) described by the ontology. In the state depicted above, the user has expanded the ‘cloud’ of potential connections from the training data (on the left) to the target concept (on the right) by clicking the ‘Expand Cloud’ button (twice) and then the ‘Find Paths’ button. As can be seen, the system has found a connection between all four purchases and the individual ‘Jack Smith’, who was earlier arrested for narcotics violations. In this case the user has selected the fourth purchase, and the system is highlighting the total ‘path cost’ to get from that source to the endpoint on the right hand side (in this case 194). The shortest path visualizer works by training the system which kinds of links are important (i.e., have low cost) and which are not (i.e., have high cost), so that ultimately the shortest paths shown will be those that are of particular significance to the problem at hand.
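To make the cost mechanism concrete, the sketch below (a minimal illustration in Python, not Mitopia’s implementation) treats the backing cloud as a graph whose edges are labelled with the ontology field that mediates them, maps each field to a tunable cost, and uses Dijkstra’s algorithm to surface the cheapest path from a purchase to the target. The specific costs, the tiny graph, and the ‘#contacts.address’ field are illustrative assumptions only.

```python
import heapq

# Illustrative per-field link costs (tunable by the analyst);
# float("inf") effectively disables a link type for this problem.
link_costs = {
    "##entity.actions": 2,
    "#contacts.address": 10,
    "#contacts.townOrCity": 75,
    "#contacts.stateOrArea": float("inf"),
}

# Backing 'cloud': node -> list of (neighbour, mediating ontology field)
cloud = {
    "PSE purchase #4": [("Kyle Ray", "##entity.actions")],
    "Kyle Ray": [("234 3rd St, Pomona", "#contacts.address"),
                 ("California", "#contacts.stateOrArea")],
    "234 3rd St, Pomona": [("Jack Smith", "#contacts.address")],
    "California": [("Jack Smith", "#contacts.stateOrArea")],
    "Jack Smith": [("Narcotics arrest", "##entity.actions")],
}

def shortest_path(source, target):
    """Return (total cost, path) for the cheapest ontology-mediated path."""
    queue, best = [(0, source, [source])], {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if best.get(node, float("inf")) <= cost:
            continue
        best[node] = cost
        for neighbour, field in cloud.get(node, []):
            step = link_costs.get(field, 100)      # default cost for untuned fields
            if step != float("inf"):
                heapq.heappush(queue, (cost + step, neighbour, path + [neighbour]))
    return float("inf"), []

print(shortest_path("PSE purchase #4", "Narcotics arrest"))
```

Raising the cost of a field (or setting it to infinity) is all that is needed to stop paths through that field from surfacing as ‘shortest’, which is exactly the training loop described in what follows.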

Starting with the second purchase from the top, we can see that this corresponds to ‘Jack Smith’ himself purchasing PSE (total path cost = 4). Either Jack has a snotty nose, or he is not the sharpest tool in the shed. Regardless, this is clearly a significant, if very simple, connection that we would want the system to detect. No need to tweak anything for this connection then.

However, if we look at the highlighted path, we see that it passes through a total of six links to make the connection, starting with another individual, ‘Kyle Ray’ (the purchaser), and then going through the State of California! This looks like a link that should not be considered ‘significant’. To investigate the reasons for this connection, we double click on ‘Kyle Ray’ in the diagram, which then displays the ‘Links’ tab (as opposed to the initial ‘Relations’ tab) as follows:

Examining the links and link costs for any given node in the connection cloud

Here we see a small snapshot of the connection ‘cloud’ that is building up behind the paths area; in this case, just the links passing through the underlying ontology from the ‘Person’ node ‘Kyle Ray’ between the source and target node. We can see that we currently have upstream links mediated by ‘##entity.actions’ (the purchase action), ‘##identification.licenses’ (the driving license he showed to identify himself – which leads to DMV records, another source that has been mined and integrated), and a couple of others. On the downstream side, we see 7 possible paths to the target, each with a different individual link cost, as well as a total cost to reach the target obtained by adding all intermediate link costs. All these links are simply manifested as part of Mitopia’s underlying Carmot ontology when it is used to integrate data across multiple different sources using MitoMine™. Unlike conventional link visualizers, the analyst does not have to ‘create’ these links manually in order to see/visualize the connection – the contiguous ontology does it all automatically.

The link to California has been selected above and goes through the ‘#contacts.stateOrArea‘ field of the ontological type ‘Person‘ – that is Kyle Ray lives in California.  If we double click the ‘Field’ cell, we are taken immediately to the ‘Cloud’ tab showing the full data for ‘Kyle Ray’ in the invisible cloud behind the visualizer:

The full data for ‘Kyle Ray’ in the invisible backing cloud

As can be seen, ‘Kyle Ray’ lives at ‘13345 3rd St’ in the City of ‘Santa Monica’ and the State of California. The display of this data is being laid out automatically based on the underlying ontology as described here. Note that the ontology is also driving the layout of the list on the left, and in addition the nodes are color coded to indicate how many ‘expansion’ steps they are away from an endpoint (starting with ‘red’ for endpoints, ‘orange’ for the first expansion, and so on). This is because every time the user clicks ‘Expand Cloud’, the system takes all the new ontological nodes within the cloud returned from the previous expansion (or initially the endpoints) and follows every link field described by the ontology for that type to retrieve from the servers a new ‘silver lining’ of nodes not yet explored. This data is then added to the invisible backing cloud for the visualizer and hooked up automatically by the ontology. It is these ontological links that the system then explores looking for connections. We could have navigated directly to the ‘Costs’ tab, which shows the specific links present for the node ‘Kyle Ray’ in the directions of the selected source and target, by double clicking the ‘Link Cost’ column from our earlier ‘Links’ tab display; the result would be as follows:

adjusting link costs within the link analysis tool based on fields defined in the ontology

In this (also auto-generated) display we see, for each field within the ‘contacts’ area of the Kyle Ray node’s type (i.e., Person), a set of sliders indicating the individual path cost involved in following that link. The links shown checked on the left have costs inherited from the type shown in the center (that is, the cost to follow the ‘#townOrCity’ link is inherited from the type ‘PopulatedPlace’, which is 75). We can alter the checkboxes or move the sliders for any field in any type within the ontology (including of course ‘inherited’ costs) in order to specify, for this particular problem, what types of links are important. The links that are important for one type of question are very likely quite different for another type of question, so it is important that we can tune all links specifically for the problem at hand. With more than 100 different types, any number of adjustments possible for every single type, and the ability to combine these with inheritance, there is an effectively unlimited number of ways to tune the link costs for a given problem. In the example below, I have set the cost for the ‘#contacts.stateOrArea’ field that was causing our unwanted shortest path to ‘infinite’ by checking the checkbox to the right; I have done the same for the ‘#country’ field, which is clearly also not important in this case. Finally, I tweaked the slider for ‘##contacts.serviceProviders’ (i.e., companies that the person relies on for services, e.g., water, electricity, phone, etc.) mainly just to show how this looks in the UI.

The adjusted link cost settings for the ‘Kyle Ray’ node
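The inheritance of costs can be sketched in the same illustrative style (again Python, and again not Mitopia’s API): an explicit cost set for a field on a given type wins; otherwise the cost recorded for the type the field references is used, which is what the checked boxes on the left of the display indicate. The type names, field-to-type mapping, and numbers below are assumptions for illustration.

```python
# Cost recorded per referenced type (the type shown in the centre column).
type_costs = {"PopulatedPlace": 75, "Country": 100, "Organization": 40}

# Which type each (illustrative) field refers to.
field_targets = {
    "#contacts.townOrCity": "PopulatedPlace",
    "#contacts.stateOrArea": "PopulatedPlace",
    "#country": "Country",
    "##contacts.serviceProviders": "Organization",
}

# Explicit overrides made by the analyst for this particular problem.
explicit_costs = {
    ("Person", "#contacts.stateOrArea"): float("inf"),  # right-hand checkbox: never follow
    ("Person", "#country"): float("inf"),
    ("Person", "##contacts.serviceProviders"): 25,      # slider tweak
}

def link_cost(holder_type, field, default=100):
    """An explicit per-type/field setting wins; otherwise inherit from the referenced type."""
    if (holder_type, field) in explicit_costs:
        return explicit_costs[(holder_type, field)]
    return type_costs.get(field_targets.get(field), default)

print(link_cost("Person", "#contacts.townOrCity"))   # 75  - inherited from PopulatedPlace
print(link_cost("Person", "#contacts.stateOrArea"))  # inf - overridden for this problem
```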

If we then go back to the ‘Relations’ tab, click the ‘Reset’ button, and then expand afresh to the same state, we can see in the screen shot below that the shortest path through ‘#contacts.stateOrArea’ has been eliminated, and the fourth purchase is no longer considered to be significantly connected to a narcotics offender. This cost adjustment is thus our first step in training the link visualizer as to what kind of connections we care about in this case.

The shortest paths after the link cost adjustments

Of course with so many potential connections, we probably have many more cycles of this process to go through before we get exactly the significant purchases we want and no others. For example, let’s examine the path that passes through ‘234 3rd St, Pomona’. Ontologically speaking, this is an ‘Address’, which among other things is a kind of ‘Place’ at which you can likely find one or more entities of interest. We can see from the screen shot to the right that both ‘Jack Smith’ and ‘Gene Mathews’ live at this address; that is, they are roommates (according to the DMV records). Perhaps Jack has sent Gene to buy his PSE for him, and so this is another kind of link that we want to consider important in this case.

In contrast however, if we look at the link through the City of Pomona that connects ‘Tom Jones’ to ‘Jack Smith’, we see that it passes through the ‘#contacts.townOrCity’ field with a cost of 75. Again, the fact that these two live in the same City is not significant in this particular case, so we should adjust this cost to inhibit finding this link as a lowest cost path.

This cycle will repeat through many potential linking fields.  For example the DMV is the same government organization that granted all the driving licenses and eventually this connection may float to the top as a shortest possible path.  Same goes for a host of other link types.

After cycling through adjusting various costs and expanding the cloud a number of additional steps, we will hopefully reach a point where the cloud does not expand any further, and the only paths shown are the ones that we wish to be considered significant; all others are considered irrelevant. This situation is depicted for this simple example data set below:

The final set of significant paths for the example data set

Here we have expanded four times (i.e., there may be up to 8 degrees of separation between a source and a target); in real data sets, one might expand up to 10 times (i.e., 20 degrees of separation) and have tens of thousands of nodes in the backing cloud containing millions of potential connections. You can see that one additional significant link has been found in this data set, caused by the fact that the DMV record for Kyle Ray (the purchaser) shows that he owns a 1999 Ford Focus (registration ASDF432) in which Joe Cooper (another individual arrested for narcotics) was himself arrested earlier. Clearly then, Kyle and Joe probably know each other quite well, even though with this data set we may not know how. This connection is thus significant and may warrant further investigation.

For the sake of completeness, the content of the ‘Expansion’ tab is shown to the left. In reality, what is going on during link analysis is a constant competition between one or more ‘expanders’ and one or more ‘pruners’ that can be registered with the visualizer dynamically, as necessary for the problem at hand. In the example, we used only the most basic of each: the ‘expander’ simply follows ontology-mediated connections, and the ‘pruner’ simply stops exploring any path once its total cost from an endpoint exceeds an adjustable limit (in this case 150). Clearly, without some kind of active pruning, and with a universal ontology like Carmot, this kind of link analysis will fall prey to ‘six degrees of separation’, the theory which holds that everything and everyone is connected to everything else by no more than six steps. In other words, without active pruning, after a few expansions we could well pull the entire world (or at least the entire data set) into the invisible backing cloud. There are a number of other pruning tools provided, and Mitopia® allows custom pruning algorithms to be developed and registered for specific problem domains.

On the expansion side, once again different problems may require specialized expander algorithms to be invoked, and these too can be registered with the generic link analyzer.  For example, one might want to register an ‘expander’ that did image recognition on DMV and other pictures looking for an individual whose image was captured in a security camera.  If registered, such expanders are invoked automatically (if appropriate) by Mitopia® during the expansion process, and thus the range of problems that can be tackled through this tool is essentially unbounded.
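As a rough mental model of the expand/prune competition (a Python sketch under the same illustrative assumptions as before, not the registered expander/pruner interfaces themselves), one ‘Expand Cloud’ step follows every ontology link out of the current ‘silver lining’ and immediately discards any node whose accumulated path cost exceeds the pruning limit:

```python
def expand_cloud(cloud_links, frontier, costs, link_cost, max_cost=150):
    """One 'Expand Cloud' step.

    cloud_links: node -> [(neighbour, mediating field), ...]
    frontier:    the current 'silver lining' of nodes not yet explored
    costs:       accumulated cheapest cost from an endpoint to each known node
    link_cost:   function giving the cost of following a given field
    """
    silver_lining = set()
    for node in frontier:
        for neighbour, field in cloud_links.get(node, []):
            new_cost = costs[node] + link_cost(field)
            if new_cost > max_cost:          # basic pruner: too expensive, stop exploring
                continue
            if new_cost < costs.get(neighbour, float("inf")):
                costs[neighbour] = new_cost  # basic expander: keep the cheaper route
                silver_lining.add(neighbour)
    return silver_lining                     # becomes the frontier for the next expansion

# Usage sketch: start from the endpoints and expand a fixed number of steps.
# costs = {endpoint: 0 for endpoint in endpoints}
# frontier = set(endpoints)
# for _ in range(4):
#     frontier = expand_cloud(cloud_links, frontier, costs, link_cost)
```

A custom pruner would simply replace the cost test, and a custom expander (image recognition, say) would contribute additional neighbours; both slot into the same loop.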

All this is very well, you might say, but what has it got to do with asking knowledge-level queries? The answer is that, unlike other link analysis systems, what we have been doing above is not just visualization – we have actually been creating a knowledge-level query as a side effect!

The leap from visualization to knowledge-level query

Saving a link analysis state as a connection based query or trigger

In the screen shot above, the user has dragged the expandable panel at the bottom of the visualizer upwards to reveal additional UI allowing them to save their work as a knowledge-level query and optionally also as an interest profile. A Mitopia® interest profile is a query that is run automatically and continuously whenever any new data of the specified type arrives at the servers. When any data matches the query, the user that created the interest profile is notified. In other words, this single step would allow the police department in our example above to continuously monitor all subsequent PSE purchases and be notified automatically every time a suspicious purchase occurred. When the user opens the resulting notification, the system automatically displays the link analysis diagram that led to the alert, allowing the notified individual to examine the diagram and determine whether any additional action is required (which may of course include further tuning of the query to eliminate newly revealed nuisance alerts).

The saved ConnectionQuery for the meth lab scenario

In the screen shot above, the user has chosen to create an interest profile which, like everything else in Mitopia®, is described by the ontology itself and thus is auto-displayed and stored/retrieved in exactly the same way as any other ontological type might be. You can see that the ConnectionQuery definition is referenced from the interest profile and contains the details of the ‘target’ node (in this case the ‘Narcotics Arrests’ query) to which the source type data (in this case a subtype of ‘Action’) should be linked. Note that in addition to allowing embedded queries of all kinds (far more powerful than SQL queries) within the expansion diagram, the ‘filter’ field of the query allows yet another query (of any type) to be used to filter the data to be processed by the connection query. Details of the number of source and target expansion steps, as well as the expansion and pruning chains, are also specified within the type ‘ConnectionQuery’. Note also that the specialized set of adjusted costs (type ‘LinkCosts’) for this query is saved and referenced via the ontology from the ‘#costSettings’ field of the query (shown to the right).

The interest profile created for the meth lab scenario

If we look at the interest profile node (shown above), we see that the user has the ability to specify how they want to be notified (and with what urgency), whether they want the item name to be spoken and with which voice, and a number of other settings including the actual action that happened to the server data (which defaults to item ‘add’, but could also be ‘read’, ‘write’, or ‘delete’ – useful for counter-intelligence purposes).

When the user has changed any settings necessary and clicked the ‘OK’ button, the system is now continuously and automatically monitoring all new PSE purchases against the criteria specified and notifies the appropriate users when a significant match occurs.  Although we said we wouldn’t go into the details of how the query/interest profile is actually executed at this time, perhaps we can paint at least a high level picture.

You will note from the screen shot of the ConnectionQuery above that the user has the ability to separately control the number of ‘source’ and ‘target’ expansion steps. For example, in a particular case we might choose to set ‘targetSteps’ to 5 and ‘sourceSteps’ to 3. When running the query against incoming data as an interest profile, the server performs the target expansion steps just once, caching the results. This in effect creates a fan-shaped set of ontological nodes (imagine it like the coral shown to the left) where the shape of the fan is finely tuned to detect a particular kind of ‘significance’ based on the settings made within the link analyzer. What we have is thus a finely tuned web, designed to react to a particular kind of stimulus. This tuned web is then inserted into the incoming data flow, much as a coral filters debris from ocean currents.

Now, as a new data item of the selected type is ingested, the server takes that item as a nucleus and expands out from it by the number of steps specified in ‘sourceSteps’. This creates a floating ‘ball’ of ontological connections clustered around the nucleus. The process of executing the query then becomes a matter of checking for connected paths from the floating nucleus to the stem of the coral fan, in much the same way we did in the link visualizer (but more optimized). If a path is found with a cost below that specified in ‘maxRelevance’, the query matches and the nucleus record is returned as a hit or triggers an interest profile notification.
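In the same illustrative Python style (a sketch of the idea, not the server code), the whole execution model fits in two small functions: build the target fan once per profile, then, for each newly ingested item, grow the source ball and test whether it reaches the fan cheaply enough. The ‘expand’ and ‘cheapest_cost’ helpers are assumed to be the expansion and shortest-path sketches shown earlier.

```python
def build_target_fan(expand, target_endpoint, target_steps):
    """Done once per interest profile: expand from the target and cache the 'coral fan'."""
    fan, frontier = {target_endpoint}, {target_endpoint}
    for _ in range(target_steps):
        frontier = expand(frontier) - fan
        fan |= frontier
    return fan

def on_new_item(expand, cheapest_cost, item, source_steps, fan, target, max_relevance):
    """Run for every newly ingested item of the profile's source type."""
    ball, frontier = {item}, {item}
    for _ in range(source_steps):          # grow the floating 'ball' around the nucleus
        frontier = expand(frontier) - ball
        ball |= frontier
    if ball & fan and cheapest_cost(item, target) <= max_relevance:
        return True                        # hit: raise the interest-profile notification
    return False
```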

In summary we can see that through the use of a pervasive and contiguous ontology, Mitopia® allows interactive creation and subsequent execution of knowledge level queries that make current generation database systems and queries look positively primitive.  The other half of this puzzle is ensuring that such complex queries execute rapidly.

NOTE: Obviously we have skipped a lot of detail in this initial introduction to the subject of knowledge-level queries. You will find demos involving link analysis based queries in the “Mitopia.demo.m4v” (the same example scenario – go to offset 22:40) and “MitopiaBigData.mov” (go to offset 11:20) videos on the “Videos, demos & papers” page. For a detailed look at other complex query types, see the “MitoipaNoSQL.m4v” video on the same page.

Dropping the other shoe

In the previous post we introduced Mitopia’s unique and patented heteromorphic language concept and showed the huge productivity and adaptability improvements possible over standard programming languages. In that discussion we deliberately limited ourselves to examples that in no way depend upon leveraging any ontology or other capabilities associated with the endomorphic language. But such examples are like fighting with one hand tied behind your back, so the eventual goal is to ‘drop the other shoe’ and explore the full potential of heteromorphic languages, including their unparalleled power for data mining and integration applications. First, however, we need to present a few more details regarding nested entangled parsers, and the Carmot-E endomorphic language suite in particular. A brief introduction to these issues is the subject of this post. Trust me, dropping the other shoe is coming…

Nested Entangled Parsers 

The discussion of heteromorphic languages given in the previous post makes it clear how the evolution of the outer ectomorphic parser state controls the program flow within the endomorphic suite. However, if this were all that was possible, we would have created what is essentially a one-way language, with no feedback in the other direction from the ‘rock’ to the left side of our diagram (see discussion here). Clearly, to handle any conceivable programming situation we cannot limit ourselves in this way; we must allow the ‘inner’ endomorphic language and state to influence and alter things in the ‘outer’ ectomorphic environment, including of course the flow of the ectomorphic parser itself. To accomplish this we must ‘entangle’ the nested parsers of the heteromorphic language.

Perhaps the simplest and most obvious form of entanglement between the ectomorphic parser state and the endomorphic language is the Carmot-E if/ifelse construct. In this construct, the occurrence of a “<@1:5 if (condition)> cond_production” sequence in the grammar has the effect of discarding the production cond_production off the outer ectomorphic parser stack if the condition specified within the endomorphic expression is not met. This means that any conditional statement based on the state of the endomorphic environment at the time has the ability to alter how the ectomorphic language parses the input file, without the ectomorphic language being aware that this decision was made for it. In earlier sections we referred to problems with handling context sensitive and non-LL(1) grammars and stated that these problems can be very hard to handle with a one-level grammar. However, since the endomorphic language and environment have access to whatever mechanisms the system might provide for determining and storing ‘context’, another way to look at the idea of ‘entangled’ parsers is that it provides a simple formalism and means to overcome the failings of parser theory as far as handling context sensitive grammars is concerned.

Because cond_production is itself a production, which may be of arbitrary complexity, this conditional statement is inherently ‘block structured’. By adding the “<@1:5 ifelse (condition)> cond_prod1 cond_prod2” construct to Carmot-E, we have all the building blocks we need to implement explicit looping behaviors and all kinds of other constructs (for, while, do until, switch, etc.) that are required of a generalized programming language, but which are context sensitive and cannot be handled by ordinary grammars alone.

Just to be clear, we have made it so that the ‘outer’ ectomorphic language statements are no longer in control of their own program flow, not even in a parsing sense, since the flow can now be altered unannounced by the endomorphic language. In other words, we have extended our language so that even the portion of it that was at least controlling flow within one production can no longer be sure that this is so. The two parsers are entangled so that each has the ability to control the program flow within the other when necessary, and both are under the control of the external source data. We now have a complete feedback loop between both sides of our generalized programming diagram above. The applicability to data mining and data integration problems should be readily apparent.
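As a toy model of this entanglement (a Python sketch of the idea only, not how Mitopia® implements it), imagine the outer parser has pushed cond_production onto its parse stack; the endomorphic condition is then evaluated against the endomorphic environment, and if it fails, the production is silently popped, so the outer grammar never knows the decision was made for it:

```python
def entangled_if(parse_stack, condition, env):
    """Toy '<@1:5 if (condition)> cond_production': the endomorphic condition
    decides whether the outer parser keeps or discards the pending production."""
    if not condition(env):
        parse_stack.pop()     # cond_production vanishes from the ectomorphic parse

# Usage sketch: keep an optional section only if register $a says it is present.
env = {"$a": "has_optional_section"}
stack = ["rest_of_file", "optional_section"]        # top of the stack is the last element
entangled_if(stack, lambda e: e["$a"] == "has_optional_section", env)
print(stack)   # ['rest_of_file', 'optional_section'] - kept because the condition held
```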

But of course entanglement is far more pervasive and subtle than just obvious examples like the if/ifelse construct. As we shall see later, mechanisms are provided for the registered ‘recognizer’ to alter the input token stream to the parser for various purposes (e.g., handling block comments). This is a means for reverse entanglement that we can use for even more exotic purposes, particularly since the registered ‘plugin’ that is executing the endomorphic language statements has access to a shared ‘context ID’, as does the ‘resolver’. Suppose that context ID were shared between both the ‘inner’ parser and the ‘outer’ parser; this opens up all kinds of possibilities. Such a shared ‘context ID’ between both parsers of a heteromorphic language is commonplace (though not required), so that the ‘inner’ parser can communicate ‘modes’ to the outer parser’s ‘resolver’ function and thus change how it perceives the ‘source’ content. This kind of capability is in fact used extensively, particularly in complex MitoMine™ situations, where the endomorphic language can directly control the ectomorphic parser (and the left hand side environment) in multiple ways, including (but not limited to) the following:

  • The $Ask() function can directly interact with a user through an endomorphic language specified dialog.
  • The $Exit() function allows the endomorphic language to force the ectomorphic parser to exit.
  • The $SetOptions() function can dynamically alter the options in effect for the ectomorphic parser.
  • The $ReplaceLine() function replaces the current input line being processed by the ectomorphic parser and forces that parser to re-parse the line as its new ‘source’ input.
  • The $GetSource() function can be used to obtain all or part of the endomorphic ‘source’ input which can be edited by the endomorphic language by using the function $PutAllSource() to replace all or part of the ectomorphic source.
  • The $SkipInput() function can be used to skip the ectomorphic parser over a required portion of the ‘source’ input.  The $ScanInput() function can be used to skip over ectomorphic input until a specified pattern is reached.

The ectomorphic language can dynamically create and invoke additional nested (to arbitrary levels) MitoMine™ heteromorphic variants which can share the original ‘input’ stream and thus can override the original ectomorphic parser.

Custom ‘plugin’ meta-symbols can be registered with both the ectomorphic and the endomorphic languages which can use any mechanism they wish to interact with or alter the left hand side environment.

As can be seen from the partial list for MitoMine™ above, the MitoMine™ data mining/integration heteromorphic language provides for an almost limitless variety of feedback mechanisms from the endomorphic language and environment to the left hand side of our generic program diagram.

Other Mitopia® heteromorphic language suites provide other feedback mechanisms. For example, the MitoPlex™ federated query capability is built entirely as a heteromorphic language (MitoPlex™ is the ectomorphic language) whose endomorphic components (referred to as ‘containers’, such as MitoQuest™) have multiple variants, some of which can issue deeply nested MitoPlex™ queries (expressed in the same heteromorphic language), thereby creating communications spanning the entire network of machines implementing a Mitopia® installation, including both servers and clients. All of this functionality (that is, all of Mitopia’s persistent storage and querying capabilities from the client to all nodes within the server clusters) is just one example of a heteromorphic language suite; in this case the connections between each entangled parser pass over the network and are referred to as ‘queries’ in one direction and ‘hit lists’ in the other.

Carmot-E

Now that we have described the abstract form of a heteromorphic language suite, we can examine the most prevalent endomorphic language used within Mitopia®, namely Carmot-E. As discussed before, the Carmot ontology language underlies most of what Mitopia® does, and it comes in two variants, Carmot-D and Carmot-E. Carmot-D is Mitopia’s Ontology Definition Language (ODL) and drives virtually all aspects of the system.

Carmot-D is a language of type definitions only, and is designed to describe the types that make up the ‘world model’ used by the system and its users. Carmot-D was deliberately designed so that it contains no means for creating executable programs within the language, no assignment operator, and no ability to declare local variables and/or types. Everything in Carmot-D is targeted at understanding, representing, and discovering the system ontology; it is not a language designed to ‘do’ anything – indeed, it is designed to discourage that.

The primary reason why Carmot-D has no concept of a run-time program flow is that if one were to add these run-time abilities to a language whose very purpose is to discover what the external ODL specification says, there would be the temptation to declare local types and bury logic and behaviors within the code.  It is precisely to avoid this inevitable temptation within standard programming languages that the decision was made to split the Carmot language into two variants.

Carmot-E is the run-time part of the pairing; it is designed to manipulate, at run time, data collections held in the ontology defined by Carmot-D. Once again, to shield programmers from the temptations of standard language techniques, Carmot-E lacks any ability whatsoever to define new types. It is not even possible to declare anything (e.g., a register) to be of a known type. Every bit of data manipulated by Carmot-E must be either of a very few fundamental types (e.g., integer, real, string), or it must be defined by the Carmot-D ontology for the system, which thus irrevocably defines its type and the type of any fields within it. To set the type of a ‘variable’ in Carmot-E, you must assign something of that type to it (i.e., it is a ‘dynamically typed’ language). The Carmot-E programmer is unable to rustle up local types and use them (and the assumptions and limitations they inevitably imply) to manipulate data.

This rigid separation allows the Carmot-D ontology to drive all persistent storage, analysis functions, user interface, etc. in an unambiguous way since it is impossible to define any other types of data except through Carmot-D.  This means all data within the system can be displayed, persisted, and manipulated through the ontology, no exceptions.  Conversely, the restrictions on Carmot-E ensure that run-time programmers cannot stray from the original philosophy that gave rise to Mitopia®, and thus cannot unintentionally render the system fragile and non-adaptive by burying things inside the code (as they would in a standard programming language) that should be explicitly associated with and discovered from the data.  It is the ability to unintentionally ‘ossify’ a system through these buried code assumptions that tends to render systems created using standard programming languages obsolete and non-adaptive.

In our generic diagram of “what programmers do” above, we said that the right hand side of any given programming undertaking can be thought of as the ‘system’ and contains things that are ‘fixed in stone’. In Mitopia’s case the Carmot ontology is that right hand side, though it is hardly fixed since it can be changed at any time. Carmot-E is the run-time part of the Carmot ontology language pair, and thus it is most frequently used as the endomorphic language for heteromorphic language suites that must be tied to Mitopia’s ontology. Because of its prevalence, and because so many other ‘suites’ are built upon it, we must describe Carmot-E in detail here before we can discuss other technologies in detail. Remember, however, that Carmot-E is just one example of an endomorphic language and environment.

Syntax

The listing below gives the lexical analyzer specification for the Carmot-E language:

The Carmot-E lexical analyzer specification

We can see from the specification above that Carmot-E supports an operator set that is functionally and lexically a subset of those provided by the C language. These operators are well known, so Carmot-E simply implements them by referencing this facility. Evaluation of commutative operators (e.g., ‘+’) is always left-to-right. This is particularly important since it allows string concatenation using the ‘+’ operator. The ‘oneCat’ lexical exceptions are the use of the ‘\’ character in front of all occurrences of the ‘>’ character within operators, and the addition of the ‘ifelse’, ‘[[‘, and ‘]]’ tokens. The ‘\’ character is required in front of ‘>’ characters because Carmot-E is an endomorphic language, which means that Carmot-E statements will occur within meta-symbols of the form <@n:m: Carmot-E statements>, and thus it is important that any ‘>’ characters within the Carmot-E statements are not interpreted by the outer parser’s recognizer as the ‘>’ that closes the meta-symbol. While this could be accomplished by adding ‘smarts’ to the outer parser’s recognizer specification, the decision was made instead to require the leading ‘\’ within the Carmot-E language so that ‘>’ characters are ‘escaped’ when recognizing plugin meta-symbols. This specialized behavior within the plugin recognizer specification is internal to the parser abstraction and thus allows the use of escaped ‘>’ characters in any endomorphic language.
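A minimal sketch of the escaping behaviour (Python, purely illustrative of the recognizer’s job, not the parser abstraction’s code): scan from just past the ‘<@n:m:’ prefix to the ‘>’ that closes the meta-symbol, treating ‘\>’ as an escaped ‘>’ belonging to the endomorphic statements.

```python
def endomorphic_body(text, start):
    """Return (body, next_pos): the endomorphic statements inside a meta-symbol
    and the position just past the closing '>', honouring '\\>' escapes."""
    out, i = [], start
    while i < len(text):
        if text[i] == "\\" and i + 1 < len(text) and text[i + 1] == ">":
            out.append(">")                   # un-escape for the inner Carmot-E parser
            i += 2
        elif text[i] == ">":
            return "".join(out), i + 1        # the '>' that closes the meta-symbol
        else:
            out.append(text[i])
            i += 1
    raise ValueError("unterminated meta-symbol")

body, _ = endomorphic_body(r"<@1:5:$a = ($b \> 10);>", len("<@1:5:"))
print(body)   # $a = ($b > 10);
```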

The ‘ifelse’, ‘[[‘, and ‘]]’ tokens will be discussed later. As far as the IDENT symbols defined by the Carmot-E language (and recognized by the ‘catRange’ table) are concerned, the table below defines them:

The IDENT tokens defined by the Carmot-E language

Given the lexical analyzer specification and the IDENT token set and explanations, the Carmot-E parser specification given below should be relatively easy to interpret. Virtually all the operators are implemented by the parser abstraction’s built-in plugin zero, so the area of the grammar between the production for ‘expression’ and that for ‘primary’ is essentially standard and will not be discussed further. Operator precedence rules are exactly the same as for C. All integers are manipulated as 64-bit, all reals as ‘double precision’. Booleans are represented as (and are interchangeable with) 64-bit integers where zero is considered FALSE and non-zero is TRUE. By convention, procedure names should start with an upper case letter to distinguish them from field names in the ontology, which always start with a lower case letter.

The Carmot-E parser (BNF) specification

Programming Environment

The Carmot-E environment is suited to data mining/integration and transformation to/from the underlying ontology

The drawing above depicts the programming ‘model’ for any heteromorphic language suite based on Carmot-E as the endomorphic language. The key features that Carmot-E provides are:

  •  Access to a large generalized register set ($aa..$zz) whose registers are dynamically typed (that is, each acquires the type of whatever was last assigned to it).
  •  A ‘current input’ register ‘$’ for passing IDENT tokens from the ectomorphic parser to the inner Carmot-E language.
  •  The ability to associate a Mitopia® collection of data held in the ontology and access and manipulate collection content via the Carmot ontology.  Access to both ontological fields and collection element tags (e.g., $tagName) is supported.
  • A complete suite of programming constructs including function calls, operators, etc.  Also provides control over ectomorphic parser flow.
  •  Access to all functions and procedures defined by MitoMine™ from other heteromorphic suites based on Carmot-E.  Ability to register and invoke additional custom functions and procedures.

This last point is critical: it means that any functionality built using a heteromorphic language suite that incorporates Carmot-E is able to use not only the capabilities provided directly by Carmot-E, but also most of the hundreds of built-in procedures and functions provided by MitoMine™. This means that out of the box, without any real effort by the programmer, a heteromorphic language based on Carmot-E already provides an extensive library of functionality, targeted primarily at text processing. Additional language-specific functionality can be added to extend this ‘library’ as required by the application.

Note also that because Carmot-E provides access to the collection element tags functionality provided by Mitopia’s type collections abstraction, the Carmot-E client language is free to organize data in any number of ways that are ‘outside’ that defined by the system ontology.  This freedom obviously also includes the ability to define new custom types using the routine CT_DefineNewType() which then become known to Mitopia’s type system and can thus be manipulated ontologically within the associated collection(s) without resorting to element tags.

Tying Carmot-E to a Heteromorphic Language Suite

The Carmot-E endomorphic parser environment requires a single parser context record shared with the outer ectomorphic parser that makes up the nested pair. Since the type and content of the Carmot-E context record is not public, this means that all custom (i.e., not internal Mitopia® code) heteromorphic suites with Carmot-E as the inner parser must declare the entire shared parser context record so that it starts with an anonymous memory block (initialized by calling CT_MakeCarmotEParser) declared as follows:

Declaring the shared parser context record

The constant ‘kCarmotInterpContextSize’ is guaranteed to be large enough to contain the required Carmot-E interpreter context so that the custom code can directly refer to any custom fields declared later within the custom context type ‘MyInterpContext’ while still being able to pass the same context ID reference to the Carmot-E API calls.  To associate the inner Carmot-E parser with the heteromorphic language environment, the generic parser initialization function for the custom API would look something like the following:

Initializing the shared context and associating the inner Carmot-E parser

Then within your outer parser custom ‘plugin’, where it is time to execute the Carmot-E parser (e.g., <@1:5> in the MitoMine™ example), you extract the plugin ‘hint’ and invoke Carmot-E as follows:

Invoking the Carmot-E parser from the outer parser’s ‘plugin’

Here ‘p’ is a pointer to the outer parser context (i.e., MyInterpContext in the example above). The final thing that needs to be done is to designate, using CT_SetPrimaryNode(), the ‘node’ within the collection associated with the heteromorphic language suite that is to be used whenever access to a field occurs. Where this happens within your custom API may vary depending on where the collection itself comes from. In the case of MitoMine™, for example, the node is created within the <@1:4> ‘plugin’ call. In other cases (e.g., the GUI interpreter), the ‘node’ may be designated explicitly by the invoking code (e.g., through the SetNode() GUI interpreter procedure call). See also the section below describing the current input register ‘$’.

In most cases (e.g., MitoMine™, the GUI interpreter, etc.) where you encounter Carmot-E as the endomorphic language, the API suite for the heteromorphic language has already done all the work necessary to tie in the Carmot-E functionality, as well as the connections to the left and right sides of our generalized programming diagram, so you do not need to make any Carmot-E API calls yourself; you just need to understand how to write Carmot-E program fragments within the outer ectomorphic parser grammar provided by the higher level abstraction.

General Purpose Registers

Because the Carmot-E endomorphic language provides all access to the internal register set, there are no API functions provided to allow custom code to access register values. This is a deliberate choice, because the memory handling associated with register accesses must be managed very carefully to avoid leaks or memory errors, so this code is all internal to Carmot-E. The 676 (26*26) general purpose registers that make up the Carmot-E register set are more than enough to satisfy the ‘local variable’ needs of any heteromorphic program built upon Carmot-E. Register values are initially all empty/undefined, and as they are assigned they take on the fundamental type of the value assigned to them. Carmot-E recognizes just five fundamental types that can be held in registers: Integer (64-bit), Boolean (held as a 64-bit integer), Real (held as a ‘double’), String (held as a heap allocated C string handle), and Reference (type ET_ViewRef). These fundamental types align closely with the fundamental types supported by Mitopia’s parser abstraction (Integer, Boolean, Real, Pointer, and Reference), except that in registers Carmot-E replaces the generalized Pointer type by a heap allocated handle to a C string. Carmot-E makes extensive use of the Reference type. It is not possible (except through private APIs) within Carmot-E to assign any other type to a register.

The use of handles to contain C strings is driven by the fact that string concatenation is a common operation within Carmot-E programs (the ‘+’ operator of built-in ‘plugin’ zero concatenates string operands). String concatenation within a heap allocated handle is trivial to implement, since handles, unlike pointers, can always be resized without invalidating any references that might exist. What this means is that the Carmot-E register set may contain multiple registers that reference external heap memory at any given time. Since Carmot-E based parsers may be shared across thread boundaries, this makes memory management a considerable headache to handle correctly. For this reason the functions CT_PossessCarmotEParser() and CT_DisPosessCarmotEParser() are provided, and the functions CT_KillCarmotEParser() and CT_ZapAllRegisters() each ensure that any dangling memory references are cleaned up.

The symbol ‘$aa’ is a register designator. There are 26*26 registers, ‘$aa’ to ‘$zz’, which may be used to hold the results of calculations necessary to compute field values and/or local variables. You may use a single character register designator instead; thus ‘$a’ is the same as ‘$aa’, ‘$b’ is the same as ‘$ba’, etc. Register names can optionally be followed by a ‘:’ and then an arbitrary alphanumeric text string (including ‘_’) which is completely ignored by the parser but serves as a convenient way of documenting the code to indicate what is in the register concerned. For example, the syntax “$i:LoopCounter” might be used to make it clear what the contents of register $i (that is, $ia written out in full) are being used for. All registers are initially set to empty when the parse begins; thereafter their value is entirely determined by what is assigned to them.

The syntax ‘@$a’ is effectively macro expanded to replace the occurrence of ‘@$a’ with the string value held in $a. This means that, supposing you have a field in the current record called ‘name’, the sequence $a = “name”; @$a = “Fred Bloggs”; will result in the ‘name’ field of the record being assigned “Fred Bloggs”. Similarly, $a = “$ab”; @$a = “Fred Bloggs” will result in the assignment of “Fred Bloggs” to register $ab. It is often handy to use this macro form inside what are effectively production ‘subroutines’ that take a number of indirect parameters. For example, take the MitoMine™ heteromorphic language ‘subroutine’ portion below:

The addto_commaList production ‘subroutine’

The addto_CommaList production above is basically a ‘subroutine’ that takes as its parameters the registers $ra:ListName and $rb:ItemName. Register $ra must be set up prior to invoking the production to be the name of some other storage location being used to accumulate a comma separated list of names (i.e., $ra = “$al” means that register $al contains the comma separated list so far). Register $rb:ItemName must be set up before the invocation to name the location holding the item to be added to the list (e.g., $rb = “$a”; $a = $TextBetween($og,”(“,”)”) sets up register $rb to specify that the actual item name string is held in register $a, and then sets up the content of $a by calling the $TextBetween function to extract whatever is between the parentheses in the current content of register $og). Now adding the production addto_commaList after these setup calls will make the appropriate addition to the list. The code of the ‘subroutine’ first checks if the list is empty by following @$ra:ListName to discover that the list is held in register $al, so effectively the ifelse condition is $IsEmpty($al). The rest of the logic can be followed simply by expanding the references implied by ‘@$ra’ and ‘@$rb’. The $TextContains condition within the production for checkin_commal ensures that if the list already contains the item name given, it is not added twice.

Note also the ifelse condition in the production do_comma_add, which extracts the first character of the content of $ra (not @$ra) to see whether the storage element is a field in the ontology (first character != ‘$’) or a register (first character == ‘$’). Depending on which is true, either the <@1:5:@$ra = @$ra + @$rb + “, “> or the <@1:5:@$ra = @$rb + “, “> production is executed. This logic is necessary because when assigning values to registers the new value overwrites the previous value completely, whereas when assigning values to string fields within the ontology the default behavior is to concatenate the new string onto the end of any string already in the field. Because the caller might have passed a field path for the comma delimited list location instead of a register name (for example $ra = “aliases”), it is important that our production ‘subroutine’ takes this into account. This ability to use productions containing embedded endomorphic language logic and statements is an incredibly powerful feature conferred by the heteromorphic language concept and facilitated by the Carmot-E ‘@$r’ register indirection construct. Without this construct, in the example above we would have to repeat all this logic everywhere in the grammar that we want to add something to any comma separated list, which could lead to the grammar exploding unnecessarily. Using ‘subroutines’ like this, one production set can handle multiple different comma separated lists held in different locations and used for different purposes in the grammar. Note that this ‘subroutine’ is comprised entirely of endomorphic program fragments and productions; that is, it is independent of the outer ectomorphic grammar, and all the ectomorphic productions within it have empty FIRST sets. This is very common in heteromorphic programs.
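A toy rendering of the indirection itself (Python, illustrative only, not Carmot-E): registers live in a map, and ‘@$ra’ means “use the string stored in $ra as the name of the place to read or write”, which is what lets one production set service many different lists.

```python
registers = {}

def deref(name):
    """'@$ra' - the storage location named by the string held in $ra."""
    return registers[name]

# $ra names the accumulator register, $rb names the register holding the item.
registers["$ra"] = "$al"
registers["$rb"] = "$a"
registers["$a"] = "Jack Smith"
registers.setdefault("$al", "")

target, item = deref("$ra"), registers[deref("$rb")]
current = registers.get(target, "")
if item not in current.split(", "):                      # don't add the same name twice
    registers[target] = (current + item + ", ") if current else (item + ", ")

print(registers["$al"])   # "Jack Smith, "
```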

As mentioned above, the general purpose registers are dynamically typed (within the 5 basic types supported) so that the assignment “$aa = 1” will set the type of register $aa to Integer.  Similarly “$ab = 1.0” sets the type of $ab to Real, and “$ac = “Hello World”” sets the type of $ac to String.  Following these three assignments, the statement “$ac = $aa” will dispose of the heap string allocation previously associated with $ac and set its type to Integer and its value to 1.  Note also that given the assignments earlier, the statement “$aa = $aa / 0.5” will set the type of $aa to Real and the value to 2.0.

When assigning from a field within the ontology to a register, the register acquires the fundamental type associated with the field, so that “$ba = anIntFieldName” sets the type of $ba to be Integer and the value to be whatever was in the integer type descendant field ‘anIntFieldName’.  Similarly for field types derived from the real number type.  If the field type in the ontology is derived from an array or reference to ‘char’ then the field is assumed to contain a C string which is assigned to the register.  Thus “$bb = name” sets the type of register $bb to String and the value to whatever is in the ‘name’ field of the current designated collection node.

This leaves only the question of what happens when you assign a persistent (#) or collection (## or @@) reference field to a register. In the case of a persistent reference field, if the field already has a unique ID value assigned, the register type becomes Integer and the value matches the 64-bit local unique ID within the persistent reference; otherwise, if the persistent reference has a ‘name’ value, the register type becomes String and the value matches the referenced item name.

If the field assigned to a register is a persistent collection reference (## field) or a relative collection reference (@@ field) then the value assigned is the content of the ‘stringH’ portion of the reference (if any) and the register type becomes String.  To understand why this is so, read later posts.

If you attempt to ‘read’ a value from a register that has not yet had a value assigned to it, you will get a ‘kUninitializedValue’ error and the parse will fail.  You can suppress these errors using the ‘kPermitEmptyEvaluation’ parser option.  With this option in effect, the assignment $a = $b where $b is uninitialized will silently cause $a to be uninitialized afterwards.

The Current Input Register – $

The primary means whereby the endomorphic Carmot-E program obtains chunks of the ectomorphic ‘source’ in order to operate on them is the ‘current input’ register, denoted by the single character token ‘$’ (token number 11). The content of the ‘$’ register is determined by the most recent IDENT token accepted by the ectomorphic parser; so, for example, in the heteromorphic grammar sequence production_one <3:DecInt> <@1:5:$a = $>, the value of register $a will be assigned the Integer value just accepted by the ectomorphic parser. Similarly, the sequence production_two <1:Identifier> <@1:5:$a = $> results in $a being assigned the string value accepted as an ectomorphic match for the <1:Identifier> token. In general, the value of ‘$’ at any given time can be thought of as equal to the value of the top element of the ectomorphic parser’s evaluation stack. However, the ectomorphic parser generally does not do anything with the IDENT tokens that it encounters, and so it must usually implement an aggressive evaluation stack (EStack) cleanup strategy to avoid stack overflows.

In the MitoMine™ strategy, the ectomorphic parser EStack is wiped completely after each MitoMine™ record acquisition completes, which ensures that we do not get an ectomorphic parser evaluation stack overflow. The code in the ectomorphic resolver causes the content of the Carmot-E ‘$’ register to be zapped immediately (using CT_ZapCurrentValReg) after each ectomorphic token is accepted, and records the EStack depth at the time.

The code within the MitoMine™ ectomorphic ‘plugin’ <@1:5>, which actually invokes the inner Carmot-E parser, checks whether any IDENT token has been pushed by the ectomorphic parser since the last ‘accept’ completed (which can be determined from the saved value of ‘zapped’); if so, the ‘plugin’ extracts the ‘evaluated’ value from the top of the ectomorphic EStack and assigns it appropriately to the Carmot-E ‘$’ register using CT_SetCurrentValReg(). This <@1:5> logic is a one-shot if the ‘$’ register already has a value (checked using CT_HasCurrentValReg), so that the value of ‘$’ remains available to subsequent <@1:5> productions but is wiped as soon as the next ectomorphic IDENT token is accepted. This strategy means that the ‘$’ register will be uninitialized most of the time, and will have a value of the appropriate type within any call to <@1:5> providing it is immediately preceded by an IDENT token. Other heteromorphic languages that use Carmot-E as the endomorphic component usually implement similar strategies. The utility of this approach is most obvious in a data mining or data integration application (like MitoMine™); however, the content of ‘$’ might just as easily be a user interface action, a query, or any number of other things.
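The lifecycle of ‘$’ can be modelled in a few lines (an illustrative Python toy of the strategy just described, not the MitoMine™ code): each accepted IDENT token loads ‘$’, any other accepted token wipes it, and reading an empty ‘$’ is an error.

```python
class CurrentInput:
    """Toy model of the '$' register: holds the most recently accepted IDENT
    value and is wiped when the ectomorphic parser accepts anything else."""
    def __init__(self):
        self._value = None

    def accept_ident(self, value):   # ectomorphic parser accepted an IDENT token
        self._value = value

    def accept_other(self):          # any other accepted token zaps '$'
        self._value = None

    def read(self):                  # '$' used inside a <@1:5:...> fragment
        if self._value is None:
            raise RuntimeError("kUninitializedValue: '$' has no value here")
        return self._value

cur = CurrentInput()
cur.accept_ident(42)    # e.g. <3:DecInt> matched '42'
print(cur.read())       # <@1:5:$a = $>  ->  42
cur.accept_other()      # the next accept wipes '$' again
```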

The Ectomorphic Expression Symbol – #

The ‘#’ symbol occurring by itself refers to the ‘evaluated’ (i.e., PS_EvalIdent(aParseDB,TOP)) value of whatever is on the top of the outer ectomorphic parser’s evaluation stack (EStack). Let’s say that again: when ‘#’ appears in the endomorphic Carmot-E source, it causes the OUTER ectomorphic parser to evaluate the top of its EStack and pass the result back to the Carmot-E program. This evaluation process may convert the symbol value in EStack[TOP] to either an integer, a real, or possibly a string. This construct is used when the outer ectomorphic language contains expressions, for example, and the inner endomorphic language wishes to evaluate the expression and obtain the (usually numeric) result. Note that these expressions are in the ‘outer’ language, not the ‘inner’ Carmot-E language, which may of course contain expressions of its own.

The key ability this confers to a heteromorphic language is that values within the source file can be expressed as arithmetic expressions, not just simple numeric constants.  It also means that the ectomorphic language itself can perform arithmetic computations (using built-in ‘plugin’ zero or otherwise), and can implement custom procedures and functions the results of which can actually be accessed as required from within the endomorphic Carmot-E grammar.  In other words both the ectomorphic and the endomorphic language grammars can be full-up languages containing custom function code which can be invoked from the other language of the pair.  This ‘#’ symbol is therefore yet another subtle form of entanglement between the parsers of heteromorphic languages built upon Carmot-E.  The potential uses of this capability to deal with complex interactive situations are extensive. The utility of the Carmot-E API function CT_GetParentParseDB() for implementing features like this within custom code should now be clear.   Heteromorphic languages are free to put ‘smarts’ wherever they are most appropriate, either in the endomorphic language, the ectomorphic language, or both.

As can be seen from the discussion above, Carmot-E, at least as far as registers are concerned, is a dynamically typed language.  We will explore additional Carmot-E functionality in future posts.