On 5 Apr 1994, Leonard Will wrote:

> I have been interested to read many of the postings on MUSEUM-L which
> give good advice on data organisation and design.
<stuff deleted>
> <...> Do other readers know of other models which are
> worth looking at? I know of the work done by CIDOC (the International
> Committee for Documentation of the International Council of Museums).

Several organizations (if not most, these days) are developing ER models to guide long-term system development, and many are fairly advanced in their modelling, if not actual coding: e.g., the Australian National Botanic Garden (ANBG; contact Jim Croft); the Missouri Botanical Garden (MBG; contact Christine McMahon <[log in to unmask]>); IOPI; SMASCH; etc. I'm sure Jim Croft would say there are "heaps" (literally), and he would be better than I at pointing you to those that are on ftp/gopher/web servers. (You could begin by looking at huh.harvard.edu and life.anu.edu.au.)

Also, the collections management staff at the NMNH (USNM) and I developed a fairly comprehensive (300+ pp.) information model for the transaction management portion (loans, acquisitions, etc.) of a larger system being developed there. But I'm not sure whether that model has been "approved" yet for public consumption. (Despite my e-mail address, my affiliation there is now only unofficial.)

> <...> Do you know of any system which has really been
> brought to fruition based on a rigorous data model, or are most of the
> systems using this approach still at the planning stage (which seems to
> run and run ...)

Yes, some modelling exercises do tend to run on. The dilemma is between getting the model 99.9% complete and accurate on the one hand, and having to rework large amounts of code for having gone off half-cocked on the other. My recommendation would be to strive for about 85% rather than 100%. Almost any exercise in logical modelling will produce a model that requires more resources to build than have been budgeted.
The trick is to put the unfinished parts of the model on the back burner and to move on to implementation once the basic structure is firmly in place. (It is VERY difficult to estimate accurately the effort required to build a system.) It is more important to get the basic structure of the model right -- the strategy for handling the data -- and to move on to finding out whether that strategy is realistic (coding), than it is to get every last data element documented and into the model. Changing the higher-order structure of the model causes major headaches; adding a non-key field here or there causes only minor ones.

Also, I think using contractors to build your system is false economy (read: FOLLY). If you use contractors, use them first as a vehicle for transferring technical skill to your own staff (regardless of what the contract says). The code they develop for you should be viewed as an inconsequential by-product of the technology transfer exercise. If you plan to have contractors develop something that your organization will not be able to maintain and enhance itself, you are courting unhappiness. If your organization isn't committed to supporting and enhancing the system, you would probably be better off buying a complete package (a turn-key system) from a vendor of museum/collection-oriented software.

> 2. Relational databases. I fully agree with the recommendations that a
> relational (or object oriented) database is the likely system for the
> future,

I wouldn't even use the qualifier "likely". The relational model is currently accepted as THE standard for medium to large database systems, and will remain so for the next 5-10 years. Also note that the object-oriented model, if it comes to replace the relational model, will require much more (not less) planning and analysis as a precursor to development. (See: "Practical Application of Object-Oriented Techniques to Relational Databases" by Donald K.
Burleson, 1994, John Wiley & Sons, ISBN 0-471-61225-1.)

> but I wonder about the point which Jim Croft picked up that such
> systems are not good at handling chunks of text. This is my main concern.

Why? If you're concerned about basic information on hundreds of thousands of objects -- a good-sized collection -- you shouldn't be too worried about storing large chunks of text for each object; if you set out to write them all, you may never finish. But that comment is really an aside. More to the point...

Even PC-based systems, such as Paradox and Access (for smaller collections), have reasonable capabilities for handling text, not to mention images and other "goodies" that don't fit the strict relational model of data (where data values are drawn from well-defined domains). Moreover, Object Linking and Embedding (OLE), or analogous features on other operating systems, should enable you to use the front end of your choice for each kind of data stored in the database. But I refer here only to creating or editing the text once it has been retrieved. Retrieval is another story.

<stuff deleted>

> Most RDBMS packages seem to have a "memo" field, though perhaps only one
> such field per table, and this seems a potentially serious constraint on
> what curators can write. A "free-text" package may be "bolted on" to
> process such fields and make their contents searchable, but they do not
> seem to be fully integrated with the rest of the system. You are either
> searching "text" or "structured fields" and have to switch systems to
> change from one to the other rather than being able to mix them freely
> in a search query.
>
> Is there no system which gives the economy and convenience of splitting
> data into relational tables but without the restriction of fixed-length
> fields?

Searching free text is a problem; it is not part of the relational data model.
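As an aside for readers with SQL tools at hand: you CAN mix a structured predicate and a free-text substring match in one query; what you give up is speed, not syntax. A minimal sketch (table name, columns, and data all invented for illustration; SQLite shown, but any SQL system with variable-length text columns behaves similarly):

```python
import sqlite3

# Hypothetical collection table: one structured field (taxon) plus a
# memo-style free-text field (notes).  SQLite TEXT columns are
# variable-length, so no 256-character limit applies.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE objects (id INTEGER PRIMARY KEY, taxon TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO objects (taxon, notes) VALUES (?, ?)",
    [
        ("Quercus alba", "Collected 1893; possible type specimen; leaf damage."),
        ("Quercus alba", "Routine accession, no remarks."),
        ("Acer rubrum", "Type specimen, annotated by the original collector."),
    ],
)

# One query mixes both kinds of condition.  The LIKE clause, however,
# forces a scan of every candidate row's text rather than an index lookup.
rows = conn.execute(
    "SELECT id FROM objects"
    " WHERE taxon = 'Quercus alba' AND notes LIKE '%type specimen%'"
).fetchall()
print(rows)  # -> [(1,)]
```

The structured half of the WHERE clause can use an index; the substring half cannot, which is where the performance problem discussed next comes from.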
Most text management systems use an "inverted index", in which all non-trivial words are extracted and built into a large word index, where each word is associated with pointers to its occurrences in the various text documents. I am unaware of any RDBMS with a major market share that includes this sort of text management capability. Many have the capability to search free text, but without an automatically built inverted index, so we are talking S L O W. I am unaware of a magic bullet for searching unindexed text; it must all be brought up off the disk, placed in memory, and scanned for the matching substring. As far as I know, your only options are to use one system or the other, or to "bolt" the two together.

> Am I worrying about this unnecessarily? Can almost all object
> records be chopped into chunks of 256 characters or less, to fit into
> limited-length fields?

I wouldn't characterize it as unnecessary worry, but rather as mental effort spent in the wrong direction. The issue is not 256 characters or less; it is the consistent expression of well-defined and mutually understood concepts. All this searching, sorting, selecting, projecting, and joining in relational databases derives from very tightly controlled data. People who buy into the relational paradigm have accepted that predictable storage and retrieval are best achieved through data content control. If you don't want to lose a record among the kilo-, mega-, or gigabytes, then both the creator of the record and the subsequent retriever of the record must share a common world view (the information model!). And, as an afterthought: yes, the names of most concepts are less than 256 characters in length (unless, of course, you speak German).

> Database systems such as Pick have true variable-length fields
> throughout, but though you can have "joins" they don't conform to the
> strict "relational" definition, and thus seem to be going out of favour.
> Are relational systems now efficient enough that we can have the elegance
> of using them fully and forgetting about "subfields" or "multi-valued"
> fields, or are these concepts still necessary for performance purposes?

Multi-valued fields are a trade-off; you gain performance in some situations and lose it in others. Subfields and multi-valued fields augment a programmer's productivity first, and system performance only when the particular multi-valued field (an un-normalized representation) is dragged along with a record being retrieved via some other, indexed field. But if you want rapid retrieval or update of the single values within multi-valued fields, it may be better to use a normalized representation; i.e., put them in another table. Also be aware that substring/multi-valued search performance may be good initially in a Pick-based system but may degrade significantly as your database grows. (Can anyone support or refute this from direct experience?)

Stan Blum (an information modelling and relational zealot)
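P.S. For anyone who wants to see the multi-valued vs. normalized trade-off concretely, here is a minimal sketch (schema, table names, and data all invented for illustration; SQLite shown, but the point is generic SQL):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Multi-valued representation: collectors packed into one delimited field.
# Cheap to drag along when the whole record is retrieved by its key.
conn.execute("CREATE TABLE specimen_mv (id INTEGER PRIMARY KEY, collectors TEXT)")
conn.execute("INSERT INTO specimen_mv VALUES (1, 'Croft;Blum'), (2, 'Blum')")

# Normalized representation: one row per value in a child table, which
# can carry its own index for fast single-value retrieval and update.
conn.executescript("""
    CREATE TABLE specimen (id INTEGER PRIMARY KEY);
    CREATE TABLE specimen_collector (
        specimen_id INTEGER REFERENCES specimen(id),
        collector   TEXT
    );
    CREATE INDEX idx_collector ON specimen_collector(collector);
    INSERT INTO specimen VALUES (1), (2);
    INSERT INTO specimen_collector VALUES (1, 'Croft'), (1, 'Blum'), (2, 'Blum');
""")

# Finding specimens by a single collector: the multi-valued field needs a
# substring scan of every row; the normalized table uses the index directly.
scan = conn.execute(
    "SELECT id FROM specimen_mv"
    " WHERE ';' || collectors || ';' LIKE '%;Croft;%'"
).fetchall()
indexed = conn.execute(
    "SELECT specimen_id FROM specimen_collector WHERE collector = 'Croft'"
).fetchall()
print(scan, indexed)  # -> [(1,)] [(1,)]
```

Same answer both ways; but the first query's cost grows with the whole table, while the second's grows only with the number of matches.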