Introduction
Information and documents, books and magazines, consist of ideas and images that are linked
together by ideas as a person goes through the media. These ideas are communicated in blocks, or
clusters made up of paragraphs, composed of sentences drawn from a broad vocabulary. As
language has evolved through the centuries, we have learned to use words in context with other
words in an ordered stream to communicate complex ideas to each other.
Inherent in the basic design mandate of our original search capabilities, from Day One, was
the necessity to find very specific yet broad sets of items, intersecting with other very specific
yet broad sets of items, within variable yet controllable units of text, such as sentences,
paragraphs, whole documents, individual lines, or even just a specified number of characters.
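As an illustration only, the idea of intersecting sets of search items within an adjustable unit of text can be sketched in Python. This is not Texis or Vortex code; the function names, the naive sentence splitter, and the sample term sets are all assumptions made for the example.

```python
import re

def split_units(text, unit="sentence"):
    """Split text into the chosen unit of proximity."""
    if unit == "paragraph":
        parts = re.split(r"\n\s*\n", text)
    elif unit == "line":
        parts = text.splitlines()
    else:  # "sentence": naive split after ., ! or ?
        parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]

def find_units(text, term_sets, unit="sentence"):
    """Return the units that contain at least one member of EVERY term set."""
    hits = []
    for chunk in split_units(text, unit):
        words = set(re.findall(r"[a-z0-9']+", chunk.lower()))
        if all(words & {t.lower() for t in terms} for terms in term_sets):
            hits.append(chunk)
    return hits

# Hypothetical sets: a "court" concept intersected with a "merger" concept.
sets = [{"court", "judge"}, {"merger", "acquisition"}]
text = "The judge spoke. The merger closed."
print(find_units(text, sets, unit="sentence"))   # no single sentence holds both sets
print(find_units(text, sets, unit="paragraph"))  # the paragraph as a whole does
```

Widening the unit from sentence to paragraph turns a miss into a hit, which is the kind of tuning the paragraph above describes.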
To this day, our ability to find things other systems cannot rests largely on the original search
engine requisites, which allowed for the finding of many complex types of search items or sets of
items, in very precise association with any number of other search items or sets of items, and within
a dynamically adjustable unit of text. The ability to fine-tune these elements dictates relevance.
And the ability to quickly obtain not just any, but specifically designated RELEVANT text and
images, greatly aids individual problem-solving capability and cognitive acceleration in one's own field.
A basic requirement for the ability to do this, and something which distinguishes this
technology, is an advanced package of very rapid pattern matching algorithms and a customizable
language/meaning matrix of over 250,000 word connections in a dense web of associations, the
result of decades of private Research and Development. By using our package of development
tools the operational time to build a completely customized Competitive Intelligence System, or an
online library or information catalog, or an integrated text and image database, is considerably
shorter than by using other approaches, as well as more operationally effective in its ability to
locate relevant responses which would be otherwise missed using less robust or mature retrieval
methodologies.
Over the years the technology we developed was fleshed out and stabilized in a
commercially available relational database management system called "Texis", with its own web
scripting language called "Vortex". With these tools, we write our own proprietary web
scripts that draw on all the basic features for which this software was originally designed,
including the ability to manipulate the complex "Equivalence Matrix", or "Thesaurus", which
generates information correlation capability. In this way we can link a database containing
our Intelligent Search Engine with the online forms support found in web browsers. This
combination allows us to rapidly create globally distributable information services via the Internet,
harnessing the power of the software in completely customized applications specific to each new
user group.
We maintain our own Unix server on which the base software resides, and from which we
host multiple, firewall secured, ID/Password protected applications with custom databases we
house and custom interfaces we write. Using best-of-class software and 24-7 network backup and
technical support from the company we founded in 1981, we are able to provide economical,
precise, and secure information mining applications for special case scenarios.
Our Information Mining Technology
At the core of all its applications, Mnemotrix Systems, Inc. uses the information mining
technology contained in Texis, an intelligent Relational Database Management System which is
embedded in a huge number of broad applications across the world. Herein we discuss some of the
features of this core search technology, as a foundation to the applications which Mnemotrix
creates and supports.
These are some of the features which comprise the flexibility of the set of tools we have
available, and which we use to create applications which are uniquely customized for each of our
clients' special needs.
Frequently Asked Questions
How is Mnemotrix's information mining technology different from other engines?
Our information mining technology uses the only search engine in the world with the
structure of a SQL relational database (RDBMS = Relational Database Management System). SQL
as used here means Structured Query Language - not Microsoft's product named with that term!
SQL is an industry standard defined by the American National Standards Institute (ANSI), and its
counterpart, the International Organization for Standardization (ISO). All major database vendors
use SQL as their query language.
SQL provides many advantages for addressing complicated search requirements. It also
provides you with the confidence of a reliable, well-defined path for implementing unanticipated
new search functionality in the future. SQL is a rich, mature, open standard used by hundreds of
thousands of database application developers around the world.
All other information mining engines provide a much narrower range of capabilities based
on proprietary interfaces. No other engine provides the versatility of using SQL as its application
development model.
How is our technology different from other relational databases?
Our technology is the only relational database that can store and search text documents of
unlimited size within standard database tables. All other solutions that purport to accomplish this
employ, either explicitly or "under the covers," a loosely coupled external text index, and store
documents in a binary large object (blob) field. That approach causes major bottlenecks.
What's so hard about integrating text-search with a relational database?
Text searching and relational database management, with its consequent information mining
applications, are radically different paradigms for organizing and retrieving information. They were
developed over decades as completely separate technologies and do not marry easily. More than 10
years were devoted to solving this problem; it is our "core competency," and, with many hundreds
of important client applications, it distinguishes Mnemotrix Systems Inc.
Which database (RDBMS = Relational Database Management System) does our technology use?
Our technology does not "use" another database; it is a complete database itself. However,
it can be used as an information mining engine for content residing in any other database.
What platforms do our applications run on?
As a general rule, we host our own secure and private applications across the Internet on a
password-protected basis. Individual users connect from wherever they normally connect to
the Internet and make use of their applications on our own Unix server, which is hosted in a
completely secure firewall network environment on a T3. So, generally speaking, platforms are not
really an issue. Nevertheless, in the event that an application needs to live in a client's own
environment for some special reason, platform is still not really an issue, since Texis and all the
major software components of our applications run on the major Unix systems and Windows
NT/2000/XP. Supported Unix flavors include Solaris 2.5+, Solaris x86, Linux, Compaq Tru64
(DEC Alpha), FreeBSD, Irix, BSDI, HP-UX, AIX, SCO and Unixware. The compatible platforms
are those that are resilient enough to support the requirements of a secure, sensible, and robust
hosting environment.
What language was our software written in?
All the code that comprises our information mining software has been written in ANSI-compliant
C. Programmer's API source code is available where needed for reference and modification, or
when collaborative applications are being supported or require embedded technology. It has
been compiled and tested on at least 22 different Unix (and other) platforms.
Are documents stored within our database, or as separate files?
Either! It depends on the circumstances. Web-searching is a typical example of indexing
external documents: we can extract information about the different web pages and build an index
(database) based on that; search results consist of links to those pages. On the other hand, in, say,
an auction application, the original information typically exists entirely within the database: users
input their listings directly into the database; and search results consist of links to records within the
database.
Can our technology handle BLOBs (binary large objects)?
Yes. Our technology has a blob-type field useful for storing graphics or other binary data. But
note that in our search technology, textual content of any size usually is put in a variable-size
(varchar) field. This provides superior text-indexing and searching functionality compared to
storing text in blobs. But if you have binary content, our search technology can manage the
storage of files much more efficiently than an Operating System file system! That is because our
technology keeps track of each record's location on disk, and can fetch it with a single disk
seek-and-read operation, whereas operating system file systems are un-indexed, so that fetching
files typically takes four or more seek-and-reads to search through the directory structure.
In plainer terms, we have the architecture to robustly store and manipulate images of great
size, especially where the database consists of text descriptions and images together, such as
complex medical research applications containing data, MRIs, X-rays, and photos; GPR and GPS
data and imagery; and any number of other needs in architectural design, city planning,
science, and defense, such as strategic studies, manuals, and other complex libraries of
data. At the same time, we can create a database of pointers to those images, rather than
having to store the images themselves, allowing for an extremely flexible database design where
images can be located in a variety of places, while the text database is very efficiently searched
and managed somewhere else.
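The single seek-and-read fetch mentioned above can be pictured with a small Python sketch. It is an illustration under assumptions, not the Texis storage layer: the RecordStore class, its in-memory index, and the sample keys are hypothetical.

```python
import io

class RecordStore:
    """Append-only record file plus an index of (offset, length) per key."""

    def __init__(self, fileobj):
        self.f = fileobj
        self.index = {}  # key -> (offset, length); a real system would persist this

    def put(self, key, data):
        self.f.seek(0, io.SEEK_END)      # always append at the end of the file
        offset = self.f.tell()
        self.f.write(data)
        self.index[key] = (offset, len(data))

    def get(self, key):
        offset, length = self.index[key]
        self.f.seek(offset)              # one seek ...
        return self.f.read(length)       # ... and one read

# An in-memory file stands in for a file on disk.
store = RecordStore(io.BytesIO())
store.put("xray-1", b"\x89PNG...")
store.put("mri-7", b"binary image data")
print(store.get("mri-7"))
```

Because the index already holds the record's location, the fetch does not need to walk any directory structure first.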
How well does our technology scale up? What are the benchmarks?
Our search technology is by far the highest-performance product in the marketplace
providing full-text search and data mining within a relational database framework. It powers some
of the largest search sites on the internet. When Texis was the search technology engine at eBay, it
served more than 20 million searches a day, and while eBay has attributed its various outages to
unrelated problems, they've never had a crash caused by our search engine! Our own application
servers can currently support hundreds of simultaneously running applications under a secure
rubric.
How many documents or records can we search?
There is no inherent limit. Our search technology is routinely used on the most heavily
trafficked web sites for searching databases of tens of millions of large records. It has been used
with hundreds of millions of records with no significant complications.
How quickly are our text indexes updated?
Instantly! Our search technology performs standard database record locking, unlocking,
and management of contention. It keeps the data consistent and available for all users while
records are being inserted, updated, or deleted. No other search engine performs these
database-type functions.
Does our technology do incremental indexing?
Yes. Items added to the database are searchable instantly. Our search technology takes
care of all index updating in the background.
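To make the idea of incremental indexing concrete, here is a toy Python word index that is updated as each record is inserted or deleted, so new records are searchable at once. It is a sketch only: the class and method names are hypothetical, it updates synchronously rather than in the background, and it does no locking.

```python
from collections import defaultdict

class InvertedIndex:
    """Toy word index kept in step with every insert and delete."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> ids of records containing it
        self.records = {}                 # id -> original text

    def insert(self, rec_id, text):
        self.records[rec_id] = text
        for word in set(text.lower().split()):
            self.postings[word].add(rec_id)

    def delete(self, rec_id):
        text = self.records.pop(rec_id)
        for word in set(text.lower().split()):
            self.postings[word].discard(rec_id)

    def search(self, word):
        return sorted(self.postings.get(word.lower(), set()))

index = InvertedIndex()
index.insert(1, "Vintage camera for sale")
index.insert(2, "Camera lens cap")
print(index.search("camera"))
```

No full rebuild is needed; only the postings for the words in the changed record are touched.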
Can we search data in languages other than English?
Yes, our technology is used in many Latin-based languages.
How does our technology handle the 8-bit "accented" characters of Spanish (or French or German or whatever)?
A simple configuration setting tells our search technology which character set you are
using. Accent characters and any other non-English characters will be preserved in the data and
become fully searchable, if desired.
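A rough Python sketch of the two steps involved: decoding 8-bit data with a declared character set, then optionally folding accents at match time. The function names and the choice of Latin-1 are assumptions for the example and do not reflect the actual configuration setting.

```python
import unicodedata

def decode(raw, charset="latin-1"):
    """Interpret 8-bit bytes using the configured character set."""
    return raw.decode(charset)

def fold(text):
    """Lower-case and strip accents so an unaccented query can still match."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def contains(query, text, ignore_accents=True):
    if ignore_accents:
        return fold(query) in fold(text)
    return query.lower() in text.lower()

stored = decode(b"Se\xf1or Lopez")     # byte 0xF1 is n-with-tilde in Latin-1
print(contains("senor", stored))        # accent-insensitive match
print(contains("senor", stored, ignore_accents=False))
```

The stored text keeps its accented characters either way; folding is applied only when comparing.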
Can our technology index languages using multi-byte characters (e.g., Chinese, Japanese, Hebrew, etc.)?
Yes, our technology has been used in these languages. However, the issues are somewhat
more complicated than for the single-byte alphabets. For example, a specific character in Chinese
may sometimes be a word on its own, and other times part of a different word. Chinese readers
discern the difference from the context, but there is no indication in the text as to which it is.
Any such language application would have to be discussed as a special case application.
Does our technology do "stemming"? How about in other languages?
Stemming refers to a process of stripping a word down to its root by removing suffixes or
prefixes (such as the "s" on the end of English plurals), and then searching for valid variations of
the root (known as morphemes). Our search technology provides very sophisticated morpheme
processing, with default rules that apply to English. Various aspects of morpheme processing may
be turned on or off, and the rules customized. A set of morpheme processing rules may be
specified for any language. A user organization typically will wish to customize these rules not
only for its language, but for a particular type of data or search style.
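A minimal Python sketch of suffix stripping with customizable rules follows. The suffix list and the minimum root length are illustrative assumptions, far simpler than real morpheme processing, and are not the product's default rules.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative English rules only

def stem(word, suffixes=SUFFIXES, min_root=3):
    """Strip the first matching suffix, keeping at least min_root letters."""
    w = word.lower()
    for suf in suffixes:
        if w.endswith(suf) and len(w) - len(suf) >= min_root:
            return w[: -len(suf)]
    return w

def same_root(query_word, text_word, **rules):
    """True when both words reduce to the same root under the given rules."""
    return stem(query_word, **rules) == stem(text_word, **rules)

print(stem("searching"), stem("searches"), stem("bus"))
print(same_root("repairs", "repairing"))
print(same_root("repairs", "repairing", suffixes=()))  # stemming turned off
```

Passing a different suffix tuple, or an empty one, stands in for customizing the rules or turning them off.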
Does our technology have a thesaurus capability?
Yes, a very extensive one. It was originally created as, and is still also referred to as,
the "Equivalence Matrix". Our Thesaurus was designed from the start to be fully editable and
customizable for any special subject or group, and this is one of the features most used by Mnemotrix
in customizing an application for some special group or purpose. Our main Thesaurus consists of
over 250,000 root English language words with all of their synonyms and concept correlatives, and
is automatically drawn upon, along with the add-on thesaurus customized for each new user group,
for any query where concept searching is enabled. This ability of the program to build complex
sets of synonyms, modeling alternative concept structures, provides the researcher with an
automatic means of locating correlated and conceptually linked data, thus significantly enhancing
the mining of information relevant to a query.
Much of the power of this basic feature has fallen by the wayside as concentration has been
on mass market applications. Mnemotrix is the only company in the world that has
mastered the enormous potential of this "User Thesaurus" facility, allowing us to build advanced
mining applications based on custom user profiles. These custom applications take information
mining into zones of capability simply not possible by other means, and account for a large
measure of the functionality of the custom systems provided to our clients.
We have learned by years of hard experience that this customization cannot and should not be
completely automated. This is where our personal expertise has been most necessary, and it has been
made use of on a consulting basis towards the creation of unique company and application profiles.
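How a main thesaurus and a user add-on thesaurus might combine during a concept search can be sketched in Python. The entries below are invented for the example, and the data structures are assumptions; the real Equivalence Matrix is far larger and richer.

```python
# Hypothetical entries; the real main thesaurus holds over 250,000 root words.
MAIN_THESAURUS = {"car": {"automobile", "vehicle"}, "doctor": {"physician"}}
USER_THESAURUS = {"doctor": {"clinician", "md"}}  # add-on for one user group

def expand(term, *thesauri):
    """Return the term plus every equivalent listed in any thesaurus."""
    term = term.lower()
    result = {term}
    for thesaurus in thesauri:
        result |= thesaurus.get(term, set())
    return result

def concept_search(query, docs, *thesauri):
    """Match docs holding, for each query word, the word or an equivalent."""
    term_sets = [expand(word, *thesauri) for word in query.split()]
    hits = []
    for doc in docs:
        words = set(doc.lower().split())
        if all(words & s for s in term_sets):
            hits.append(doc)
    return hits

docs = ["the physician bought an automobile", "the md walked home"]
print(concept_search("doctor car", docs, MAIN_THESAURUS, USER_THESAURUS))
```

Neither query word appears literally in the matching record; the match comes entirely from the equivalence sets.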
Can we mine data according to geographical locations, such as zip or country code?
Yes. Our search technology is unique in its ability to store text records containing
geographical locations and their associated image records, and efficiently perform a text search
restricted to some distance from a particular point ("swimming pool repair within 10 miles of
Columbus, Ohio"). This is accomplished by converting the locations into longitude and latitude.
Such applications can be set up with a visual geographic overlay for ease of user interface.
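The "within 10 miles" style of search can be sketched in Python by combining a text match with a great-circle distance test on latitude and longitude. The haversine formula is a standard choice used here for illustration; the record layout, coordinates, and function names are assumptions, not the product's internals.

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_MILES = 3958.8

def miles_between(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))

def search_near(records, phrase, center, radius_miles):
    """Names of records whose text holds the phrase and lies within the radius."""
    lat0, lon0 = center
    return [r["name"] for r in records
            if phrase in r["text"].lower()
            and miles_between(lat0, lon0, r["lat"], r["lon"]) <= radius_miles]

COLUMBUS = (39.96, -83.00)  # approximate coordinates of Columbus, Ohio
records = [
    {"name": "A", "text": "Swimming pool repair and cleaning", "lat": 40.00, "lon": -83.02},
    {"name": "B", "text": "Swimming pool repair", "lat": 41.50, "lon": -81.69},  # near Cleveland
    {"name": "C", "text": "Roof repair", "lat": 39.97, "lon": -83.01},
]
print(search_near(records, "swimming pool repair", COLUMBUS, 10))
```

Record B matches the text but is too far away, and record C is close but does not match the text, so only A is returned.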
Does our technology handle natural language queries?
Yes. Users may enter any natural language question. By default, matching records or
documents are presented in relevance rank order. There are many settings for "tuning" the
rankings.
Does our technology handle phrases? Wildcards?
Yes, both. A typical search form will consider text within quote marks as a phrase, and the
asterisk character as a wildcard. If desired, our search technology will accept wildcards within or at
the beginning of a word, as well as at the end. These features are under the control of the
application developer, who may turn them on or off, or change their behavior in various ways.
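One way a search form might interpret quoted phrases and asterisk wildcards is to translate each term into a regular expression, as in this Python sketch. The translation rules and function names are assumptions chosen for the example and are not the actual query parser.

```python
import re

def term_to_regex(term):
    """Turn one query term into a regex; * matches any run of word characters."""
    parts = [re.escape(p) for p in term.split("*")]
    return r"\b" + r"\w*".join(parts) + r"\b"

def parse_query(query):
    """Split a query into quoted phrases and single terms."""
    tokens = re.findall(r'"([^"]+)"|(\S+)', query)
    return [phrase or word for phrase, word in tokens]

def matches(query, text):
    """True when every phrase and term of the query is found in the text."""
    return all(re.search(term_to_regex(t), text, re.IGNORECASE)
               for t in parse_query(query))

text = "A relational database with indexes"
print(matches('"relational database" index*', text))  # phrase plus trailing wildcard
print(matches('"database relational"', text))          # wrong word order in phrase
print(matches("*base", text))                           # leading wildcard
```

Because the wildcard is handled by splitting on the asterisk, it works at the start, middle, or end of a word.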
Does our technology support Boolean logic?
Yes. Full Boolean logic is standard within the SQL language. Our search technology also
understands the + and - operators popularized by web search engines. And our search technology
understands set logic, which can be used to express a command of the style "Find records
containing n or more words of my query." Absent explicit operators, the default logic is specified
by the application developer.
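The + and - operators and the "n or more words of my query" style of set logic can be sketched together in Python. The parsing rules, the min_hits parameter, and whole-word matching on whitespace are simplifying assumptions for the example.

```python
def matches(query, text, min_hits=1):
    """+word is required, -word is excluded, and at least min_hits
    of the remaining plain words must be present in the text."""
    words = set(text.lower().split())
    required, excluded, optional = [], [], []
    for tok in query.lower().split():
        if tok.startswith("+") and len(tok) > 1:
            required.append(tok[1:])
        elif tok.startswith("-") and len(tok) > 1:
            excluded.append(tok[1:])
        else:
            optional.append(tok)
    if any(w not in words for w in required):
        return False
    if any(w in words for w in excluded):
        return False
    hits = sum(1 for w in optional if w in words)
    # Never demand more plain words than the query actually contains.
    return hits >= min(min_hits, len(optional))

print(matches("+pool repair cleaning -roof", "pool repair service", min_hits=1))
print(matches("+pool repair cleaning -roof", "pool repair service", min_hits=2))
```

Here min_hits plays the role of the default logic that an application developer would choose when the user gives no explicit operators.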
Does our technology contain fuzzy logic?
Yes. The facility we use to accomplish this is called approximate pattern matching. This
generates a similarity measure between any two words or patterns, expressed as a percentage of
closeness. The user or application developer may control the degree of closeness. This capability
most commonly is desired to accommodate spelling mistakes in either the queries or the data. It
can be useful in searching scanned documents, which tend to have errors resulting from the
imperfect OCR process. Developers should use this feature with caution, however. Fuzzy logic,
by its nature, brings back some records unrelated to either the user's query words or the intended
meaning. This tends to confuse and annoy users not expecting this style of response.
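A similarity measure expressed as a percentage, with a controllable threshold, can be sketched using Python's standard difflib module. This is a stand-in technique for illustration; it is not the approximate pattern matcher used in the product, and the threshold value is an assumption.

```python
from difflib import SequenceMatcher

def closeness(a, b):
    """Similarity of two words as a percentage from 0 to 100."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_find(word, text, threshold=80.0):
    """Words in the text that are at least `threshold` percent close to `word`."""
    return [w for w in text.split() if closeness(word, w) >= threshold]

text = "the databse and the databank"
print(fuzzy_find("database", text))                  # catches the misspelling
print(fuzzy_find("database", text, threshold=70.0))  # looser: also an unrelated word
```

Lowering the threshold also pulls in "databank", which shows why the caution above about unrelated results applies.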
Do we handle numeric quantities in any special way?
Yes, in fact this feature is exclusive to our text search engine. It allows you to find
quantities in textual information in any way they may be represented. For example, our query
language allows you to put in a query for, say, some numeric quantity greater than a million, and
you would be able to find a reference to "1.6 billion dollars" buried within the text.
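A simplified Python sketch of finding quantities in running text and comparing them numerically is given below. It recognizes only digits with optional scale words; the pattern, the scale table, and the function names are assumptions and cover far fewer representations than the feature described.

```python
import re

SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9, "trillion": 1e12}
NUMBER = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)(?:\s+(thousand|million|billion|trillion))?",
    re.IGNORECASE,
)

def quantities(text):
    """Yield (matched text, numeric value) for each quantity found."""
    for m in NUMBER.finditer(text):
        value = float(m.group(1).replace(",", ""))
        if m.group(2):
            value *= SCALES[m.group(2).lower()]
        yield m.group(0), value

def find_greater_than(text, limit):
    """Matched quantities whose value exceeds the limit."""
    return [raw for raw, value in quantities(text) if value > limit]

text = "Revenue reached 1.6 billion dollars, up from 900,000 in 1999."
print(find_greater_than(text, 1_000_000))
```

The query is numeric, yet the hit is a phrase in the text, because each quantity is converted to a value before the comparison.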
Can we mix and match these special search items dynamically in one query?
Definitely. We usually use our own experience to design effective, precise queries which
will continue to work reliably on a changing data stream, and store them in an easily
understandable pull-down menu which the user can simply click on, to profile information related
to their needs on an ongoing basis. The combination of specially designed queries with a
completely open query capability allows maximum freedom to the researcher. We have also
devised ways to help the user build complex queries easily, which can be passed out to multiple
databases.
Can we index documents stored on multiple servers?
Yes, elementary! Our information mining technology may create a searchable index of
documents anywhere on a network or on the Internet.
Can we sort results by date (or by price, or rating, or whatever)?
Yes. Our sorting power is one of the most popular features. You may sort the results of a
text search by any field in your data. For example, if your database contains an "author" field, you
can sort search results by author. This works efficiently even on large result sets, by taking
advantage of the powerful sorting capability inherent within relational database technology. Our
search technology can quickly sort tens of thousands of hits or more. Other search engines either
bog down sorting more than a few hundred items, or else their sorting capabilities are much more
limited. For example, one major search engine cannot perform relevance-ranking together with
sorting; another can sort by date only, not by other fields.
Can our technology find related results ("More like this")?
Yes, that is a standard feature. Our search technology can take any document or text
selection and turn it into a search for similar records. This is sometimes called "query by example."
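Query by example can be sketched in Python by turning the example text into a word-count vector and ranking records by cosine similarity. This is one common technique chosen for illustration; the source does not say which similarity measure the product uses, and all names here are hypothetical.

```python
import math
import re
from collections import Counter

def vector(text):
    """Word counts of a text, lower-cased."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity of two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def more_like_this(example, docs, top=2):
    """The `top` records most similar to the example text."""
    q = vector(example)
    return sorted(docs, key=lambda d: cosine(q, vector(d)), reverse=True)[:top]

docs = ["pool repair and pool cleaning", "roof repair", "tax law update"]
print(more_like_this("swimming pool repair", docs, top=1))
```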
Can our technology search document "zones" separately?
Elementary! What some people call zones are, in database lingo, fields. With our search
technology you may query any field separately or in combination with other fields. And queries
are not limited to text! If one field (zone) contains a postal code, for example, you could query that
with a numeric range such as 90011 through 97000.
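Combining a text condition on one field with a numeric range on another can be illustrated with standard SQL. The sketch below uses SQLite through Python's built-in sqlite3 module purely as a stand-in; it is not Texis, the table and column names are hypothetical, and LIKE is a much weaker text match than the search described in this document.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listing (title TEXT, body TEXT, zip INTEGER)")
conn.executemany(
    "INSERT INTO listing VALUES (?, ?, ?)",
    [("Pool repair", "Swimming pool repair and cleaning", 90210),
     ("Roof repair", "Roof and gutter repair", 94105),
     ("Pool supplies", "Pool chemicals and supplies", 10001)],
)

def search(conn, word, zip_low, zip_high):
    """Titles whose body mentions the word and whose zip is in the range."""
    rows = conn.execute(
        "SELECT title FROM listing "
        "WHERE body LIKE ? AND zip BETWEEN ? AND ? ORDER BY title",
        (f"%{word}%", zip_low, zip_high),
    )
    return [title for (title,) in rows]

print(search(conn, "pool", 90011, 97000))
```

The ORDER BY clause also shows how results can be sorted by any field in the same statement.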
Can we get past the complexity of a heavily fielded database?
We can redesign how data is stored and searched so that the user has much better access to
all of the information in proximity to other important information. We can write feedback into the
results process so that the user can more rapidly ascertain the specific relevance of the result list
and accept and reject data faster, getting to the crux of a research problem much more efficiently.
What is our relevance ranking algorithm? Is it tunable?
Our information mining technology contains a sophisticated automated ranking system that
may be tuned in various ways. Factors it uses include: closeness of query words to the beginning
of a document; order of occurrence of the query words; and proximity (closeness) of query words
to each other within a record. These factors may be weighted to change the ranking behavior. As
an example of how that might be useful: newspaper articles tend to have the most important
material close to the beginning, so in a newspaper search application, you might give that factor
more weight.
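The three factors named above, and the effect of weighting them, can be sketched as a toy scoring function in Python. The formulas for each factor and the weight parameters are assumptions invented for the example; they are not the product's ranking algorithm.

```python
def rank(query, doc, w_lead=1.0, w_order=1.0, w_prox=1.0):
    """Score from 0 to 1 built from three weighted factors."""
    words = doc.lower().split()
    terms = query.lower().split()
    if not terms or any(t not in words for t in terms):
        return 0.0
    pos = [words.index(t) for t in terms]          # first occurrence of each term
    lead = 1.0 - min(pos) / len(words)             # closer to the beginning is better
    order = 1.0 if pos == sorted(pos) else 0.0     # terms appear in query order
    span = max(pos) - min(pos) + 1
    prox = min(1.0, len(terms) / span)             # 1.0 when terms are adjacent
    total = w_lead + w_order + w_prox
    if total == 0:
        return 0.0
    return (w_lead * lead + w_order * order + w_prox * prox) / total

doc_a = "pool repair in columbus ohio today"
doc_b = "today in columbus ohio repair of the pool"
print(rank("pool repair", doc_a), rank("pool repair", doc_b))
print(rank("pool repair", doc_b, w_lead=3.0))  # position factor weighted more heavily
```

Raising w_lead mirrors the newspaper example: the position factor then counts for more of the final score.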
How can we make a large result listing more meaningful?
We have techniques which allow us to pull smart abstracts from the full text and show them in
the result listing, so that the user need not first read the article. We can also pull smart excerpts
which bring the matched search terms up into the result listing, so that it is faster and more
meaningful and requires fewer clicks to get to the heart of a research matter. All this is
done on an automated basis by the program, so that hand-written abstracts and excerpts are not
needed in most cases where the full text is available to the indexed database.
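A very small Python sketch of pulling an excerpt around the first matched term is shown below. The window width, the bracket highlighting, and the fallback to the start of the document are assumptions for the example; the real abstracts and excerpts are described as considerably smarter.

```python
import re

def excerpt(text, terms, width=40):
    """Text around the first matched term, with the match in [brackets]."""
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    m = pattern.search(text) if terms else None
    if not m:
        return text[: 2 * width].strip()   # fall back to the lead of the document
    start = max(0, m.start() - width)
    end = min(len(text), m.end() + width)
    snippet = text[start:m.start()] + "[" + m.group(0) + "]" + text[m.end():end]
    prefix = "..." if start > 0 else ""
    suffix = "..." if end < len(text) else ""
    return prefix + snippet.strip() + suffix

article = "The city approved the pool repair budget."
print(excerpt(article, ["pool"], width=10))
```

This simple version cuts on character counts and may split words at the edges; a fuller version would cut on word or sentence boundaries.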
What about our Query Language?
Our Query Language allows us to put our Intelligent Search right into the middle of a
completely efficient, robust, relational database management system, so that finally all the best of
the world of relational databases can be married to all the best of intelligent text searching. And
using our web script language, we can design a web friendly interface which is easy to use, and yet
harnesses the power of all this for any type of application. For more information on this aspect,
take a look at our Query Help and Tutorial for Intelligent Text Searching.