On 5 Apr 1994, Leonard Will wrote:
> I have been interested to read many of the postings on MUSEUM-L which
> give good advice on data organisation and design.
<stuff deleted>
> <...> Do other readers know of other models which are
> worth looking at? I know of the work done by CIDOC (the International
> Committee for Documentation of the International Council of Museums).
Several organizations (if not most these days) are developing ER models to
guide long-term system development and are fairly advanced in their
modeling, if not actual coding: e.g., Australian National Botanic Garden
(ANBG; contact Jim Croft); Missouri Botanical Garden (MBG; contact Christine
McMahon); IOPI; SMASCH; etc. I'm sure Jim Croft would
say there are "heaps" (literally), and he would be better than I at pointing
you to those that are on ftp/gopher/web servers. (You could begin by
looking at: huh.harvard.edu and life.anu.edu.au.) Also, the collections
management staff at the NMNH (USNM) and I developed a fairly comprehensive
(300+ pp.) information model for the transaction management portion (loans,
acquisitions, etc.) of a larger system being developed there. But I'm not
sure whether that model has been "approved" yet for public consumption.
(Despite my e-mail address, my affiliation there is now only unofficial.)
> <...> Do you know of any system which has really been
> brought to fruition based on a rigorous data model, or are most of the
> systems using this approach still at the planning stage (which seems to
> run and run ...)
Yes, some modelling exercises do tend to run on. The dilemma is between
getting the model 99.9% complete/accurate, and having to rework large
amounts of code for having gone off half-cocked. My recommendation would be
to strive for about 85% rather than 100%. Almost any exercise in logical
modelling will produce a model that requires more resources to build than
have been budgeted. The trick is to put the remaining detail on the back
burner and to move on to implementing the model once its basic structure is
firmly in place.
(It is VERY difficult to accurately estimate the effort required to build a
system.) It is more important to get the basic structure of the model right
-- the strategy for handling the data -- and to move on to finding out
whether that strategy is realistic (coding), than it is to get every last
data element documented and into the model. Major headaches are experienced
in changing the higher-order structure of the model, but only minor
headaches in adding a non-key field here or there.
Also, I think using contractors to build your system is false economy (read:
FOLLY). If you use contractors, use them first as a vehicle to transfer
technical skill to your own technical staff (regardless of what the contract
says). The code they develop for you should be viewed as an inconsequential
by-product of the technology transfer exercise. If you plan to have
contractors develop something that your organization will not be able to
maintain and enhance itself, you are courting unhappiness. If your
organization isn't committed to supporting and enhancing the system, you
would probably be better off buying the complete package (a turn-key system)
from a vendor of museum/collection-oriented software.
> 2. Relational databases. I fully agree with the recommendations that a
> relational (or object oriented) database is the likely system for the
> future,
I wouldn't even use the qualifier "likely". The relational model is
currently accepted as THE standard for medium to large database systems, and
will remain so for the next 5-10 years. Also note that the object-oriented
model, if it comes to replace the relational model, will require much more
(not less) planning and analysis as a precursor to development. (See:
"Practical Application of Object-Oriented Techniques to Relational
Databases" by Donald K. Burleson, 1994, John Wiley & Sons, ISBN 0 471-
61225-1)
> but I wonder about the point which Jim Croft picked up that such
> systems are not good at handling chunks of text. This is my main concern.
Why? If you're concerned about basic information on hundreds of thousands
of objects -- a good-sized collection -- you shouldn't be too worried about
large chunks of text for each object; if you try to write them, you may
never finish. But that comment is really an aside. More to the point...
Even PC-based systems, such as
Paradox and Access (for smaller collections), have reasonable capabilities
for handling text, not to mention images and other "goodies" that don't fit
the strict relational model of data (where data values are drawn from well-
defined domains). Moreover, Object Linking and Embedding (OLE), or
analogous features on other operating systems, should enable you to use the
front end of choice for each kind of data stored in the database. But I
refer here only to creating or editing the text once it has been retrieved.
Retrieval is another story.
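A rough sketch of the idea (in Python, using its bundled sqlite3 module
purely for illustration; the table and column names are invented for the
example): structured fields, a free-text "memo", and an image can sit side
by side in one table.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""CREATE TABLE object (
        accession_no TEXT PRIMARY KEY,  -- structured, controlled value
        object_name  TEXT NOT NULL,     -- structured, controlled value
        description  TEXT,              -- free-text "memo"
        image        BLOB               -- one of the "goodies"
    )""")

    conn.execute("INSERT INTO object VALUES (?, ?, ?, ?)",
                 ("1994.15.3",
                  "figurine, bronze",
                  "Acquired in 1912; surface heavily corroded on the left flank.",
                  b"\x89PNG"))   # stand-in bytes for a real image

    for row in conn.execute("SELECT accession_no, description FROM object"):
        print(row)

The point is only that the text and the image live in ordinary columns;
what the front end does with them (OLE or otherwise) is a separate matter.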
<stuff deleted>
> Most RDBMS packages seem to have a "memo" field, though perhaps only one
> such field per table, and this seems a potentially serious constraint on
> what curators can write. A "free-text" package may be "bolted on" to
> process such fields and make their contents searchable, but they do not
> seem to be fully integrated with the rest of the system. You are either
> searching "text" or "structured fields" and have to switch systems to
> change from one to the other rather than being able to mix them freely
> in a search query.
>
> Is there no system which gives the economy and convenience of splitting
> data into relational tables but without the restriction of fixed-length
> fields?
Searching free text is a problem; it is not part of the relational data
model. Most text management systems use an "inverted index", in which all
non-trivial words are extracted and built into a large word index, where each
word is associated with many pointers to its occurrences in the various text
documents. I am unaware of any RDBMS with a major market share that
includes this sort of text management capability. Many have the capability
to search free text, but not with an automatically-built inverted index, so
we are talking S L O W. I am unaware of a magic bullet for searching un-
indexed text; it must all be brought up off the disk, placed in memory, and
scanned for the matching substring. As far as I know, your only options are
to use one kind of system or the other, or to "bolt" the two together.
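For what it's worth, here is a toy sketch of the inverted-index idea in
Python (the record numbers and text are invented; a real text engine would
also handle stemming, phrase searches, and incremental updates):

    import re
    from collections import defaultdict

    STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "to"}

    def tokenize(text):
        """Lower-case the text and break it into words."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def build_index(records):
        """Map each non-trivial word to the set of record IDs containing it."""
        index = defaultdict(set)
        for rec_id, text in records.items():
            for word in tokenize(text):
                if word not in STOP_WORDS:
                    index[word].add(rec_id)
        return index

    def search(index, *words):
        """Return the record IDs that contain all of the given words."""
        hits = [index.get(w.lower(), set()) for w in words]
        return set.intersection(*hits) if hits else set()

    records = {
        1001: "Bronze figurine of a horse, heavily corroded.",
        1002: "Fragment of a bronze vessel rim.",
    }
    index = build_index(records)
    print(search(index, "bronze", "horse"))   # -> {1001}

The index is cheap to consult, but it has to be maintained every time a text
field changes, which is exactly the work the un-indexed RDBMS search skips.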
> Am I worrying about this unnecessarily? Can almost all object
> records be chopped into chunks of 256 characters or less, to fit into
> limited-length fields?
I wouldn't characterize it as unnecessary worry, but rather mental effort
spent in the wrong direction. The issue is not 256 characters or less; it
is the consistent expression of well-defined and mutually understood
concepts. All this searching, sorting, selecting, projecting, joining stuff
in relational databases derives from very tightly controlled data. People
who buy into the relational paradigm have accepted that predictable storage
and retrieval are best achieved through data content control. If you don't
want to lose a record among the kilo-, mega-, giga- bytes, then both the
creator of the record and the subsequent retriever of the record must have a
common world view (the information model!). And, as an afterthought, yes,
the names of most concepts are less than 256 characters in length (unless,
of course, you speak German).
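To make "data content control" concrete, here is a small sketch (Python with
sqlite3 again; the table and term names are invented): a lookup table defines
the domain of allowed terms, and the database rejects anything outside it.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")

    # The controlled vocabulary: the shared "world view" for this one concept.
    conn.execute("CREATE TABLE condition_term (term TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO condition_term VALUES (?)",
                     [("good",), ("fair",), ("poor",)])

    conn.execute("""CREATE TABLE object (
        id        INTEGER PRIMARY KEY,
        name      TEXT NOT NULL,
        condition TEXT REFERENCES condition_term(term))""")

    conn.execute("INSERT INTO object VALUES (1, 'figurine, bronze', 'fair')")
    try:
        conn.execute("INSERT INTO object VALUES (2, 'vessel rim', 'fairish')")
    except sqlite3.IntegrityError as err:
        print("rejected:", err)   # value outside the agreed domain

Retrieval then becomes a matter of matching agreed terms, not of guessing how
each cataloguer happened to phrase things.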
> Database systems such as Pick have true variable-length fields
> throughout, but though you can have "joins" they don't conform to the
> strict "relational" definition, and thus seem to be going out of favour.
> Are relational systems now efficient enough that we can have the elegance
> of using them fully and forgetting about "subfields" or "multi-valued"
> fields, or are these concepts still necessary for performance purposes?
Multi-valued fields are a trade-off; you get performance in some situations
and lose it in others. Subfields and multi-valued fields improve a
programmer's productivity first, and system performance only when the
multi-valued field (an un-normalized representation) is dragged along with a
record that is retrieved by another, indexed field. But if you want rapid
retrieval/update of the single values within multi-valued fields, it may be
better to use a normalized representation; i.e., put them in another table.
Also be aware that substring/multi-valued search performance may be good
initially in a Pick-based system, but may degrade significantly as your
database grows. (Can anyone support/refute this from direct experience?)
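To illustrate the trade-off (Python/sqlite3 once more; names invented): the
same keywords stored as one delimited multi-valued field versus one row per
value in a normalized child table.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Un-normalized: keywords packed into one delimited field.  Cheap to drag
    # along with the parent record, but finding one value means scanning strings.
    cur.execute("CREATE TABLE object_mv"
                " (id INTEGER PRIMARY KEY, name TEXT, keywords TEXT)")
    cur.execute("INSERT INTO object_mv"
                " VALUES (1, 'figurine, bronze', 'bronze;horse;corroded')")

    # Normalized: one row per keyword in a child table, which can be indexed
    # for fast retrieval/update of individual values.
    cur.execute("CREATE TABLE object (id INTEGER PRIMARY KEY, name TEXT)")
    cur.execute("CREATE TABLE object_keyword (object_id INTEGER, keyword TEXT)")
    cur.execute("CREATE INDEX keyword_idx ON object_keyword (keyword)")
    cur.execute("INSERT INTO object VALUES (1, 'figurine, bronze')")
    cur.executemany("INSERT INTO object_keyword VALUES (?, ?)",
                    [(1, "bronze"), (1, "horse"), (1, "corroded")])

    # Which objects are tagged 'horse'?  The normalized form can use the index;
    # the multi-valued form falls back to a LIKE scan of every row.
    print(cur.execute("SELECT object_id FROM object_keyword"
                      " WHERE keyword = 'horse'").fetchall())
    print(cur.execute("SELECT id FROM object_mv"
                      " WHERE keywords LIKE '%horse%'").fetchall())

The price of the normalized form is an extra join whenever you want the whole
record back in one piece, which is roughly the trade described above.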
Stan Blum
(an information modelling and relational zealot)