Moore’s law killed the robots!


Mitopia® to this day still contains a built-in and tightly integrated robotic mass storage subsystem known as MitoStore™, which was expressly designed to meet the needs of a diverse range of multimedia formats in a fully configurable and extensible manner.  Our initial attempts, starting in the early '90s, to use a commercial mass storage software package met with little success due to the inflexibility of such packages and the inability to customize package behavior based on the needs of each different media type.  All such mass storage suites appear to be predicated on the assumption that there is a central (vendor-supplied) authority in control of the robotic autoloader hardware, and this is of course incompatible with our distributed, scalable database requirements.  As a consequence of these experiences, Mitopia® was designed such that mass storage is handled explicitly within the standard server model through an abstraction called the "Logical Mass Storage Layer".  The logical MSS layer allows registration of standardized custom device drivers for new media types and robots (it is these custom drivers that form the bulk of the MitoStore™ subsystem), as well as complete control over all the various loading and routing strategies necessary to make optimal use of mass storage based on media type.  Within the logical MSS abstraction, storage is broken into four categories (sketched in code after the list below), viz:

  1. Disk caches and buffers for media content held in the client machines.
  2. Disk and RAM caches for media content, distributed throughout the drones of the server cluster that relate to the media type involved.  This includes RAID disks and sharing techniques such as Storage Area Networks (SANs) based, for example, on Fibre Channel.
  3. Robotic storage based on removable media such as CD-ROMs, DVDs, tapes or other devices.  The model supports redundant media copies if necessary.
  4. Off-line archival storage of the removable media volumes in a storage facility, requiring explicit human action to retrieve and re-load off-line volumes when needed.
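
Conceptually, the four categories form a lookup cascade: a request falls through each tier in turn until the content is found, and only reaching the bottom tier requires human intervention.  Here is a minimal sketch of that idea; the names are invented for this post and are not Mitopia's actual API:

    # Minimal sketch of the four-tier storage cascade described above.
    # All names here are hypothetical -- this is not Mitopia's actual API.
    from enum import Enum, auto

    class StorageTier(Enum):
        CLIENT_CACHE = auto()  # 1: disk caches/buffers on the client machines
        DRONE_CACHE = auto()   # 2: disk/RAM caches (RAID, SAN) on server drones
        NEAR_LINE = auto()     # 3: robotic storage on removable media
        OFF_LINE = auto()      # 4: archival storage requiring a human operator

    def locate(volume_id, tiers):
        """Walk the tiers in order; each tier maps volume id -> content.

        The off-line tier cannot satisfy a request directly: reaching it
        means an operator must be notified to re-load the volume.
        """
        for tier in (StorageTier.CLIENT_CACHE, StorageTier.DRONE_CACHE,
                     StorageTier.NEAR_LINE):
            content = tiers.get(tier, {}).get(volume_id)
            if content is not None:
                return tier, content
        return StorageTier.OFF_LINE, None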

Mitopia’s abstraction for a robotic storage device that can participate in the mass storage infrastructure consists of the following:

In this model, near-line storage is organized into a set of possibly heterogeneous robotic autoloaders, where each robot/autoloader comprises:

  • A set of storage slots, possibly organized into shelves (in some robots, the shelves themselves are removable as a block).
  • A set of drives connected to the drones of the server cluster via some kind of data connection (e.g., SCSI).
  • A controller, which can be commanded via a control connection of some sort.
  • A transport, which can be commanded to move media between slots and drives, and in addition to/from a mailbox slot (which allows media to be added to and removed from the robot).

While the details vary considerably from robot to robot, by registering suitable autoloader-specific plug-ins, all robotic storage systems can be represented in this model.  For any given server incorporating robotic storage, the <ConfigString> XML tag supplied during MitoPlex™ server initialization (discussed elsewhere) can be used to specify the symbolic name of the autoloader(s) involved and the connectivity of the data connections.  Once the necessary driver for a given robot has been registered, the logical MSS layer can fully automatically integrate the use of near-line storage farms into the basic operation of any multimedia MitoPlex™ server.
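
As a rough illustration only, the abstraction might be captured along the following lines; the names and structure here are invented for this post, not taken from the MitoStore™ driver interface:

    # A rough sketch of the robot abstraction just described.  The names
    # and structure are invented for illustration; this is not the
    # MitoStore(TM) driver interface.
    from dataclasses import dataclass

    @dataclass
    class Drive:
        drone: str                 # drone of the cluster the drive is cabled to
        loaded: str | None = None  # volume currently in the drive, if any

    @dataclass
    class Robot:
        name: str                     # symbolic name from the <ConfigString> tag
        slots: dict[int, str | None]  # slot number -> volume id (None = empty)
        drives: list[Drive]
        mailbox: str | None = None    # volume in the mailbox slot, if any

        def load(self, volume_id: str, drive_idx: int) -> None:
            """Command the transport to move a volume from its slot to a drive."""
            drive = self.drives[drive_idx]
            if drive.loaded is not None:
                raise RuntimeError("drive already contains a volume")
            try:
                slot = next(s for s, v in self.slots.items() if v == volume_id)
            except StopIteration:
                raise KeyError(f"volume {volume_id} is not in any slot")
            self.slots[slot] = None
            drive.loaded = volume_id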

The Mitopia® server code contains sophisticated strategies for routing requests based on where the required media actually resides and which drives can be used to access the media (and thus which drones are involved).  In all operations Mitopia® attempts to cache results in the hard disk storage of the various drones used to access the robots, so that an automatic most-recently-used caching scheme is in effect at all times.  Additionally, as new multimedia information arrives at the server, the logical MSS layer uses its knowledge of the media formats in the attached autoloaders to create archive 'chunks' in drone disk storage, which are automatically moved to robotic media when complete, thus ensuring a smooth migration of multimedia data from cache to near-line storage.
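
The chunking idea can be sketched as follows (again purely illustrative; the capacities and names are assumptions, not Mitopia® internals):

    # Illustrative sketch of cache-to-near-line migration: new content
    # accumulates in an archive 'chunk' on drone disk, sized for the target
    # media, and the full chunk is then burned/moved to robotic media.
    CHUNK_CAPACITY = {"CD-ROM": 650 * 10**6, "DVD": 4.7 * 10**9}  # bytes

    class ArchiveChunk:
        def __init__(self, media_type: str):
            self.media_type = media_type
            self.items: list[tuple[str, int]] = []  # (item id, size in bytes)
            self.used = 0

        def add(self, item_id: str, size: int) -> bool:
            """Add an item; True means the chunk is full and should now be
            written to a blank volume and filed in the robot."""
            self.items.append((item_id, size))
            self.used += size
            return self.used >= CHUNK_CAPACITY[self.media_type]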

Many media types (e.g., DVD) require specialized 'burning' activities in order to create them, and an entire abstraction for media burning is also in place.  Essentially all that is necessary is to fill the robot(s) with blank media and Mitopia® takes care of the rest.  This philosophy is extended to handle off-line storage, so that Mitopia® supports the concept of system 'operators' (for large installations) who can take media off-line (under software control via the mailbox) and who will be automatically notified by the system if an attempt is made to access the off-line media.  In this case the operator is prompted for which media to retrieve and which robot to put it in (the system always checks that they did it right!).  The entire storage abstraction, all the way from caches in the clients through to off-line warehouse storage, is completely handled by the Mitopia® architecture in a tightly integrated manner, which allows a far higher degree of control over system load balancing and performance than is possible with off-the-shelf mass storage solutions.  Mitopia® already contains drivers for many leading autoloader devices, and the specification of a driver for new autoloaders is a relatively simple matter.

[Images: TiltRac and Pioneer DRM-5004 autoloaders, among others]

But it is now the year 2014 and Moore’s law has continued its inevitable march as far as disk drive densities are concerned.


A one terabyte drive is now commonplace and can be had for around $50!  Back in the early '90s our robotic storage consisted of CD-ROMs (and later DVDs).  Because disk drives were tiny and outrageously expensive, it made sense to store large data sets in a robotic autoloader.  In different systems and configurations we supported a wide variety of autoloaders including: Cygnet Model D480, Pioneer DRM 7000, Plasmon D Series, InfiniDISC, Pioneer 1804 and 5004, TiltRac CD and DVD models, and others.  Indeed, when the capacities of these models proved too low, in some cases we also had specialized custom robotic autoloaders designed and built to our needs: one by CyberKinetics, and another, even larger, model taking up its own small room, using pick-and-place robots of the kind found on car assembly lines.  I fondly remember watching that bad boy pick up a video tape, shove it into a drive that already contained a tape, and, because we'd temporarily disabled the force sensor (so it didn't know there was a problem), continue shoving the tape, and the drive, through the back wall of the room.  This of course happened during a client demo!  Ah…those were the days!  Some of these venerable devices are shown above; others, regrettably, I can no longer find pictures of.

One can fit 1,500 CD-ROMs into a single $50 1TB drive; 200 DVDs fit in the same space.  The economics of these robotic monsters have long since ceased to make any sense; they have been killed off by Moore's law.  Of course those 1TB disk drives can still be picked-and-placed, so there are still a few dark corners where the robots can eke out a living; places like the NSA spring to mind.  But storage technology hasn't stopped, and even those systems are already relics.
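
The arithmetic is easy to check (assuming 650 MB per CD-ROM and 4.7 GB per single-layer DVD, in the decimal units drive vendors use):

    # Back-of-envelope check of the capacity claim above.
    TB = 10**12
    print(TB // (650 * 10**6))        # ~1538 CD-ROMs per 1TB drive
    print(round(TB / (4.7 * 10**9)))  # ~213 single-layer DVDs (the round 200)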

A new breed of robots has emerged that can walk, fly, and go places that humans can't.  The storage dinosaurs are gone, and the Mitopia® code that supports them is an echo of a bygone era.  Long live the new warm-blooded robots; extinction events create new opportunities.

In search of meaningful metrics

In an earlier post I briefly touched on the subject of ingest performance and some of the tricks that Mitopia® uses to improve ingest metrics in a distributed server context.  In today's post I want to take this discussion a little further: to discuss some of the problems with the metrics commonly used to measure performance, and to propose some new metrics (used within Mitopia®) that are better able to capture the true searchable utility of whatever was ingested, and thus actually measure something useful, a quality sorely lacking in current measures.

Currently there are essentially three metrics (and countless variants thereof) used to measure ingest performance for databases; they are:

  • Bytes/s – This measures the number of bytes per second of data of some kind that can be ingested into a database.  This is clearly a truly silly and meaningless metric, because we must ask what 'ingest' actually means here.  Does it mean that you can search on each and every word ingested?  How about sequences of words?  What is the relationship between words and bytes (non-English text may require from 1-6 bytes per character, and how many characters make up a word is all over the map)?  Is the search of words exact or stemmed?  Can you search across languages?  The list goes on.  Clearly this is by far the easiest measure to come up with, since one need only total the size of all the files fed in; however, as a real-world metric for comparison across systems, this measure is virtually useless.
  • Documents or Records/s – Again this is an easy measure to compute, but it suffers from all the uncertainty of the Bytes/s measure, plus a whole load more, such as: How big is each document (or record)?  Is the ingested item structured and searchable by meta-data, word content, or what?  For database records, we must ask a whole new set of questions about the resultant query operations possible on the fields: what operators (numeric, text, etc.) are supported, what Boolean logic between fields, what range constraints, etc.  All these additional capabilities are critical to measuring the real utility of the ingest, and yet none are touched by this metric.
  • Entities/s – This specialized metric measures the number of recognized entity names (e.g., people, organizations, products, etc.) identified in a block of text that is being ingested.  The concept of an 'entity', as opposed to records or documents, suggests a move to an ontological knowledge-level organization; however, this is rarely the case, and ultimately we are talking about specialized records in a relational database.  Once again we must ask: What is an 'entity'?  Exactly what kinds of information are recorded for an entity?  How extensive is our ability to search relationships between entities?  And so on.

The result of these shortcomings is that, while each metric measures something, none of them is of much use in characterizing what has actually been accomplished at the end of the ingest, or how useful the resultant searchable data set actually is.  This problem is aggravated manyfold when one moves to a knowledge-level (KL) system, where the focus is largely on the nature of the many diverse relationships (both direct and indirect) that may exist between extracted ontological content, and on how those relationships might influence the interpretation of meaning in textual documents that reference known entities (or otherwise).

In order to overcome these problems and come up with meaningful KL metrics, it became necessary to define two new units, the DRIP and the TAP (note the plumbing/data-flow metaphor!), for measuring, for example, the complexity of the information/knowledge extracted during a mining run.

The first unit is the DRIP (Data Relation Interconnect Performance), defined as (E) + (R), where E is the total number of ontologically distinct (as opposed to the RDBMS table metaphor) entities OR data items ingested, and R is the number of explicit directed relationships created between them.  Because relationships can be unidirectional or bidirectional (more useful), a unidirectional connection (e.g., a person's citizenship) counts as an R score of one, while a bidirectional connection (e.g., an employment record) counts as two, one in each direction.  The sum of the number of entities and the number of connections between them accurately measures the usefulness of what has been ingested for systems interested in relationships (presumably all KL systems).  Entity ingest rate is measured in DRIP/s, or DRIPS.  The DRIP measure, however, fails to take into account Mitopia's ability to dynamically hyperlink, on a per-client-machine basis and at any time, between all textual data occurring within any record and every other item ever ingested.  This 'lazy' creation of connections allows each system user far greater flexibility in analysis, and avoids freezing the data according to a particular focus or analytical technique during ingestion.  It is unlikely any other system can approach Mitopia's current DRIP rate, and they certainly can't provide this feature, so we ignore this vast potential contribution to the DRIP measurement for comparative purposes.
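
For concreteness, here is the DRIP calculation in a few lines of illustrative Python (the function is this post's invention, but it follows the definition exactly):

    def drip_score(num_entities: int, relationships: list[bool]) -> int:
        """DRIP = E + R; each relationship entry is True if bidirectional."""
        r = sum(2 if bidir else 1 for bidir in relationships)
        return num_entities + r

    # Example: 3 entities, one citizenship link (unidirectional) and one
    # employment record (bidirectional): DRIP = 3 + 1 + 2 = 6.
    assert drip_score(3, [False, True]) == 6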

The second unit is the TAP (Text Acquisition Performance), defined as (T), where T is the total number of ingested and indexed words (not characters, regardless of language).  By indexed we mean separately searchable on any and all text fields, such that the search time (on a single server machine) does not exceed 1 minute when searching for all occurrences of any single word in any textual field of a data set where T is at least 10 million.  This is an arbitrary cutoff criterion; however, due to the vastly different search strategies used by various systems, it is hard to think of a limit that would be deemed fair by all.  In theory, because Mitopia® allows three (two for English) distinct search protocols that easily meet this criterion for every indexed word (native-language un-stemmed, native stemmed, and root-mapped to English stemmed; see the earlier post for details), our ingest rates should count as at least 2-3T, but once again we ignore this.  Textual ingest rate is measured in TAP/s, or TAPS.
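
The corresponding TAP/TAPS arithmetic is equally simple (illustrative only; a naive whitespace split stands in here for real multilingual word segmentation, which is considerably harder):

    def tap(texts: list[str]) -> int:
        # T = total indexed words, regardless of language.  Naive whitespace
        # tokenization; real segmentation of non-spacing scripts is harder.
        return sum(len(t.split()) for t in texts)

    def taps(total_tap: int, elapsed_seconds: float) -> float:
        # TAPS = indexed words per second of ingest time.
        return total_tap / elapsed_seconds

    assert tap(["the quick brown fox"]) == 4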

By measuring the complexity of the output collection(s) from a mining run and dividing by the number of seconds taken for the mining run to complete, one can easily obtain meaningful ingest rates for various different sources.  Some sources are more text intensive and so exhibit higher TAPS ratings; others are more 'connection' intensive and exhibit higher DRIPS ratings.  As a relatively balanced example, on a 2006-era Mac Pro quad-core (3 GHz) machine, with hard drives only, doing end-to-end mining and persisting into servers (for subsequent search) of a standard data set containing around 25 different types of source, performance (using the "MitoPlex – Dump Storage DRIP/TAP measures" tool in the Admin window) was as follows:

  • Total core time for mining phase: 4,158 s (average of 1,040 s per core)
  • All server activity quiescent after: 30 minutes (1800 s)
  • Total textual data ingested (excluding image files): 278 MB
  • Mining Rate (4 cores): 267 KB/s
  • Total end-to-end (4 cores): 154 KB/s
  • Total Server Complexity: 791.5 KDRIP, 527 MTAP
  • End-to-end persist rate: 0.44 KDRIP/s, 293 KTAP/s

As stated above, nobody else quotes such metrics, and indeed without a KL underpinning (which only Mitopia® actually has), a DRIP measure for a relational database system, even by the most generous interpretation, would be paltry.  Relational databases, despite the name, are notoriously bad at representing massive and changing relationships between things.  However, we can attempt to compute a rough estimate for the very simple TAPS measure from published benchmarks available on the web.  For example, Oracle Corporation, in a 2010 white paper on the Oracle 11g product, presents the following benchmark for a full-text search ingestion application:

[Table: Oracle 11g full-text search ingest benchmark]

Taking the highest-performance figure above, 23 million 4K documents per day, and assuming that the documents were in English and that the average English word is 5 letters long (plus one space between words, i.e., 6 characters per word), the equivalent TAPS value would be: 23,328,000 × 4 / 24 / 60 / 60 / 6 ≈ 180 KTAP/s.  This is of the same order of magnitude as the Mitopia® result (293 KTAP/s), although the actual range of the text searches available is in truth orders of magnitude greater in Mitopia® when one includes things like phrase searches etc.  After all, just as important as ingest rate (if not more so) is query performance on the ingested data, and in this realm the gap between existing systems and Mitopia® only grows larger.
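
Reproducing that benchmark arithmetic explicitly:

    # 23,328,000 docs/day x 4K characters each, at ~6 characters per
    # English word (5 letters + 1 space).
    docs_per_day = 23_328_000
    kchars_per_doc = 4                 # the '4K documents' in the benchmark
    seconds_per_day = 24 * 60 * 60
    chars_per_word = 6
    ktaps = docs_per_day * kchars_per_doc / seconds_per_day / chars_per_word
    print(round(ktaps))                # -> 180 (KTAP/s)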

Of course these rates can both be scaled, and the scalability as a function of the CPU and other resources required is another critical issue.  This issue, as well as the far more powerful search mechanisms available within Mitopia® and their effect on true searchable utility (which even the DRIP and TAP cannot fully measure), is fully addressed in the 'MitopiaNoSQL' video, which compares Mitopia® real-world search utility, cost, and other metrics with those attainable through both relational and NoSQL systems.  The results show that no existing system can come close to Mitopia's performance in a scaled system.