Database Downloading

by Steven K. Roberts
Online Today — August, 1983

It was only, I suppose, a matter of time. The writing craft has always had to deal with plagiarism, the software business is grappling with piracy, and now the on-line industry has suddenly found itself face-to-face with database downloading.

Yes, downloading. Is there about to be a major crisis in the adolescent on-line business — a business populated by such giants as Lockheed, H & R Block, Dow Jones, Chemical Abstracts, and the U.S. government? Is a new and potent form of information theft now within reach of the average personal computer user?

On-line information retrieval has revolutionized the research process, and it is quickly being embraced as a critical adjunct to doing business. Already, you can search through virtually all of the world’s literature for information on almost any conceivable subject.

In a current book project, for example, I am dealing with the issue of privacy in electronic mail systems. Rather than wander down to the library to dig up background material, I simply logged on to DIALOG via Telenet and selected the ABI/INFORM database of general business bibliographic information. It only took about a minute to establish that among the 200,000-plus records in that particular database, there are 669 that mention “electronic mail,” 1,299 that mention “privacy,” and 15 that include both terms. Here’s one of them:

Electronic Mail – A Mixed Blessing
Goldfield, Randy J.
Computer Decisions
DOC TYPE: Journal Paper
LENGTH: 2 Pages

The virtues of electronic mail (speed, reliability, and potential economy) have often been pointed out. There may, however, be some disadvantages in these electronic mail systems. Four potential problems to be considered are:

Assured message receivership
Documented contact records
Intrusive work options
Expanded accessibility

Since message receivership is assured, the receiver will be deprived of the age-old excuse of not receiving the memo. Many people will feel that the recording of contacts is an invasion of privacy and will resent it. People will no longer be able to ignore pending messages, since if they sign on for even a minute, they will receive all of their logged messages. Finally, if individuals’ names are on a port label within a system, they are just as susceptible to receiving electronic junk mail as they are to getting paper junk mail. The office automater should give these human factors careful thought before implementation of an electronic mail system.

DESCRIPTORS: Office automation; Electronic mail systems; Problems
CLASSIFICATION CODES: 5250 (CN=Telecommunications systems); 5210 (CN=Office automation)

Finding information this way short-circuits many of the traditional methods of doing research, be it for book-writing or for keeping an eye on your company’s competitors. You can now readily access databases of all sorts of things, including:

  • All 4,800 yellow pages directories in the United States
  • U.S. patents since 1970
  • The complete text of Commerce Business Daily
  • Detailed financial information on public companies
  • Chemical structures
  • The Congressional Record
  • The National Library of Medicine

and more — not to mention bibliographic references and abstracts corresponding to roughly 95 percent of all literature published in the last 10 years.

Great. But there are problems. Complex information resources tend to make complex demands on those who use them, from having to learn alien command protocols to trying to find ways to hold down the high cost of system use. (At a dollar a minute and up, on-line databases don’t lend themselves well to casual browsing.) Besides, there are a lot of interesting things that people would rather do with the retrieved information than print it on terminal paper or watch it scroll by on the screen — although those have traditionally been the only options.

‘Intelligent’ terminals

There are thus a number of motivating factors that have encouraged the development of “intelligent” database terminals. It has become somewhat passé to sit down with a TI Silent 700 terminal and perform database work; the stylish searcher is now armed with a microcomputer and sophisticated communications software.

There is more rationale for this than mere high-tech panache. Consider the possibilities…

  • Some reasonably straightforward software can enable a personal computer user to capture an on-line session on disk for later printing — or for inclusion in a word-processed text file, such as this one (with its “electronic mail” abstract about 4,000 characters back).
  • Adding a few features to the above allows the system to play an active role in the on-line searching process, at the very least providing automatic log on and rapid transmission of a prepared sequence of commands. Since connect-time charges apply to the time spent banging on the keyboard, this can translate into significant cost savings.
  • With even more sophisticated software, the user can save a search on disk, and then process it to eliminate duplicate references from different databases, plot statistical data, load econometric information into a spreadsheet system, and so on.
  • Finally, a subset of the remote database can be downloaded and used to create a local database that can then be searched at no additional cost. This is surprisingly easy — and it raises a host of fascinating legal and copyright questions.
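
The duplicate-elimination step in the third item can be sketched in a few lines of Python. This is only an illustration: the record layout (dictionaries with `title` and `source` keys), the database names, the sample titles, and the normalization rule are all assumptions made for the example, not a description of any vendor's format.

```python
import re

def normalize(title):
    """Reduce a title to lowercase words so trivial formatting differences do not matter."""
    return " ".join(re.findall(r"[a-z0-9]+", title.lower()))

def merge_results(*result_sets):
    """Combine saved searches, keeping the first copy of each distinct title."""
    seen = set()
    merged = []
    for results in result_sets:
        for record in results:
            key = normalize(record["title"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Hypothetical records from two hypothetical databases, invented for this example.
database_a = [
    {"title": "Semantic Networks in Expert Systems", "source": "DB-A"},
    {"title": "Predicate Calculus for Planning", "source": "DB-A"},
]
database_b = [
    {"title": "Semantic networks in expert systems.", "source": "DB-B"},
    {"title": "Knowledge Engineering Tools", "source": "DB-B"},
]

unique = merge_results(database_a, database_b)
print(len(unique), "unique records")
```

Matching on a normalized title is deliberately crude. A real tool would probably compare authors and dates as well, since different articles can share a title.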

Consider a real application. Suppose you are engaged in the artificial intelligence business — one of the “cutting edge” technologies of the ’80s. Perhaps your products are expert systems geared to various professional markets.

Now, AI is a fast-moving, complex field that is just beginning to overflow academia and become a viable industry in its own right. As such, there is a lot of literature being generated, but there is really no place to go for an ongoing up-to-date overview of what’s going on. How would you keep up with the activities of your colleagues around the globe?

No problem. Nearly all significant published literature in the field can be found in the “sci-tech” bibliographic databases (INSPEC, COMPENDEX, NTIS, and SCISEARCH in particular). All you have to do is sign on every now and then and see what’s new.

Well, it’s not quite that easy. At this writing, those four databases contain a total of 4,741 “hits” on artificial intelligence or knowledge engineering, and while there is inevitably some overlap, that’s still a lot of data. More appears every month. There are searching tricks that let you get just the latest information, or reduce that 4,741 to a manageable number by “anding” the items with another term, but still it’s a lot of data.

But you have this personal computer, see, and this piece of database management software called, perhaps, dBase II. It occurs to you one day that all you have to do is perform one big, expensive search, download it, make your own database, and thereafter just update it now and then with the latest additions to the files. You could then search through those article abstracts to your heart’s content, finding “semantic networks” today and “predicate calculus” tomorrow — all at no extra charge.

Colorful. Illegal, of course, but colorful — and quite easy, as long as the numbers don’t get too big. Naturally, the database vendors aren’t particularly pleased about this, because by doing it, you rob them of revenue and violate an implied copyright agreement that you “sign” every time you access their files.

And therein lies the problem. On one hand, we have an information resource of unprecedented scope along with the tools to use it in interesting ways; on the other hand, we have copyright laws, along with vendors who get annoyed when people steal their information.

What’s to be done?

Planning for the inevitable

Database producers and vendors are grappling with the problem right now. At the 1983 National Online Meeting in New York City, there was a well-attended half-day session on the subject, with representatives of the major industry protagonists vigorously exchanging ideas. There was one matter on which everybody seemed to agree: Database downloading is on the rise. There are no real objections to the more “basic” forms of the practice — storing the results of a search for subsequent reformatting, for example — but there is considerable concern about the implied ability to re-search the data. At the individual user level, the problem is bad enough, but it is equally easy for you or me to download the bulk of an expensive database, massage it slightly to camouflage its source, and then market it as our own product (at a lower price, of course). This could put the original database out of business.

Since there is no way for an on-line vendor to know exactly what is happening to the information that is being sent to a user’s terminal, this practice is exceedingly difficult to detect. One might be suspicious of large, general searches, but a clever downloader could easily steal an entire database by breaking it up into subcategories.

This inevitably raises the issue of database pricing. Since there is no practical defense against downloading, it follows that the vendors will have to price their products in a way that covers them against their anticipated losses.

The costs of on-line services are already in a state of flux for other reasons. The widespread use of both 300 and 1200 baud terminals renders a straight connect-time charge (still most common) quite unfair to owners of the slower machines. No scheme that includes a 1200 baud surcharge has been very well accepted by users, so the result is a gradual trend toward the sale of individual data items, rather than on-line time. This, presumably, ends up being added to a flat connect-time charge to prevent people from tying up the system’s network ports all day just to eliminate the hassle of signing on.
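
The unfairness is easy to see with some rough arithmetic. The Python sketch below treats baud as bits per second and assumes 10 bits per transmitted character, a 2,000-character record, and the dollar-a-minute rate mentioned earlier. All of these are illustrative assumptions rather than any vendor's actual tariff.

```python
def record_cost(chars, baud, dollars_per_minute=1.00, bits_per_char=10):
    """Connect-time cost of receiving `chars` characters at a given line speed."""
    seconds = chars * bits_per_char / baud
    return seconds / 60 * dollars_per_minute

for baud in (300, 1200):
    cost = record_cost(2000, baud)
    print(f"{baud:>4} baud: ${cost:.2f} per 2,000-character record")
```

Under these assumptions the 300 baud user pays about $1.11 to receive a record that costs the 1200 baud user about $0.28, a fourfold difference for identical information.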

This is a start, but it still doesn’t protect vendors against the downloading of databases for subsequent re-use. Some have considered embedding “garbage” characters into records, presumably to confuse microsystems, but this is silly and easily bypassed.

Others have simply decided to raise all the prices and not worry about it, and still others are planning the creation of special “downloadable” formats that cost more, but lend themselves so well to subsequent manipulation that would-be downloaders will be willing to pay the difference. The problem here, of course, is that the standard formats are not particularly difficult to process. (One excellent commercial machine, the Cuadra STAR, inhales records for a service such as SDC’s ORBIT and converts them into its own database format within seconds. Their booth at the National Online Meeting was one of the best-attended of the show.)

The on-line industry is not alone in its struggle with this problem. Wherever the commodity is information, theft can easily take place without detection. Witness the battles being fought over software piracy and videotape copying. Even the set of recumbent bicycle plans I just purchased bore a futile warning on the cover that nobody but the purchaser could use them without mailing in a $10 royalty fee.

One thing that database producers can do to minimize the severity of the problem is to deliberately make their information available for use on local systems. BIOSIS, for example, is the premier life sciences database, offering over four million records covering some 9,000 journals and other information sources since 1969. Having a large research clientele, they have begun marketing subsets of their database on floppy disks, calling them BITS (which, of course, was immediately labeled “BIOSIS in Tiny Segments” by the users). Their approach is to make BITS so convenient and efficient that there will be very little motivation for people to copy the data illegally on line.

This seems to be the key. Eliminating the illegal flavor of database downloading through one means or another can have the effect of actually adding to a producer’s revenues. It will be interesting, over the next few years, to see how the industry deals with the continuing increase in the information-processing capability of its users.

Doing it yourself

By now, you are probably straining at the bit, so to speak, anxious to try out this new form of information manipulation, especially now that the major on-line vendors have made low-cost databases available in the evenings (DIALOG’s “Knowledge Index” and Bibliographic Retrieval Services’ “BRS After Dark”).

The first point we should make is that you can download anything, but subsequent use of it takes some doing. If the application is simple reformatting or word-processing, you can accomplish it quite simply with a standard micro and any of the 50 or so communications software packages that are now available. You can even do it with no special software at all, if you are willing to operate a bit clumsily — CP/M’s PIP utility, for example, can do the job quite neatly as long as the files don’t get too large.

But beyond that, you are looking at database software — or perhaps a special-purpose program to manipulate downloaded data from a service such as CompuServe’s MicroQuote.

The operation of one commercial machine will serve to illustrate the downloading procedure. The Cuadra Associates’ STAR system is marketed primarily for end-user database creation and management, and as such, offers a number of interesting capabilities for data entry and retrieval.

It is also quite adept at downloading, though that is not mentioned in the company’s eight-page brochure (there is only a coy reference to “data obtained from external sources”). A company spokesperson, in fact, stated that downloading capability is only sold to users who have permission from on-line vendors to take advantage of it, such as those with private files or contractual arrangements with a database producer. Of course, there is no way to monitor their subsequent activities . . .

Once you acquire downloading capability on the STAR, the procedure is quite straightforward: First, you go through a quick “database definition” session, wherein you establish the correlation between data fields in the incoming material and those that will later be used for local searching. For example, the SDC ORBIT database system uniformly tags all article titles with the “TI” prefix. When setting the STAR up for downloading, you can identify this as the “Title” field and define its various characteristics for subsequent processing.
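
The field-mapping idea can be illustrated with a short Python sketch. Only the “TI” tag comes from the description above. The “AU” and “SO” tags, the `TAG - value` line layout, and the continuation-line rule are assumptions made for the example, and none of this is the STAR's actual definition syntax. The sample record borrows the citation quoted earlier in this article.

```python
# Mapping from incoming tags to local field names.
# Only "TI" is taken from the text; the other tags are hypothetical.
FIELD_MAP = {"TI": "Title", "AU": "Author", "SO": "Source"}

def parse_record(text, field_map=FIELD_MAP):
    """Turn lines of the assumed form 'TAG - value' into a dict of local fields."""
    record = {}
    current = None
    for line in text.splitlines():
        tag, sep, value = line.partition(" - ")
        if sep and tag.strip() in field_map:
            current = field_map[tag.strip()]
            record[current] = value.strip()
        elif current and line.strip():
            # A line without a known tag continues the previous field.
            record[current] += " " + line.strip()
    return record

# Assumed layout; the wrapped last line tests the continuation rule.
sample = """TI - Electronic Mail – A Mixed Blessing
AU - Goldfield, Randy J.
SO - Computer
     Decisions"""

print(parse_record(sample))
```

Keeping the tag-to-field table separate from the parsing loop mirrors the one-time “database definition” step: supporting another service would mean supplying a different table, not rewriting the parser.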

Once all that is done, you call the commercial database service just like you would any other time (dial the number, enter your password, etc.), and then issue a “TALKTO” command that makes the STAR system act like a dumb terminal — as far as the other system is concerned, anyway. At that point, you perform the on-line search as usual (all articles mentioning “artificial intelligence,” for example), hitting a CTRL-P to begin recording as soon as the information starts to flow.

Now you can wander out for a cup of coffee. The STAR is downloading the data and storing it on disk. When all the applicable records have arrived, you stop the local recording with another CTRL-P and sign off the expensive remote system.
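
The recording toggle can be pictured as a small filter sitting between the modem and the disk. The Python sketch below is a hypothetical illustration, not the STAR's implementation: it treats CTRL-P (ASCII 16) typed at the keyboard as an on/off switch and keeps incoming text only while recording is on. The remote system's messages are invented.

```python
CTRL_P = "\x10"  # ASCII 16, the control character named in the text

class SessionCapture:
    """Keeps incoming text only while recording is switched on."""

    def __init__(self):
        self.recording = False
        self.captured = []

    def key_pressed(self, key):
        """Handle a local keystroke; CTRL-P flips the recording state."""
        if key == CTRL_P:
            self.recording = not self.recording

    def data_received(self, text):
        """Handle text arriving from the remote system."""
        if self.recording:
            self.captured.append(text)

    def saved_text(self):
        """Return everything captured so far as one string."""
        return "".join(self.captured)

# Simulated session; the remote messages below are made up for the example.
session = SessionCapture()
session.data_received("ENTER SEARCH TERMS\n")   # not recorded
session.key_pressed(CTRL_P)                     # start recording
session.data_received("TI - First record\n")
session.data_received("TI - Second record\n")
session.key_pressed(CTRL_P)                     # stop recording
session.data_received("SESSION ENDED\n")        # not recorded
print(session.saved_text())
```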

At this point, you possess the data in re-usable form. You can examine it with a .VIEW command, perform a FORMAT to clean it up and make it compatible with your local software, and execute a function called STAR-LOAD that integrates it into the local database. Finally, you fire up the STAR database software itself and perform an INDEX operation that produces the “inverted index” — the key to subsequent searching.

You now have a fully-searchable in-house replica of the database subset that you defined — and paid for once — on the dollar-a-minute system. The material can be accessed as flexibly as you like: “Find all articles mentioning both fifth generation and Japan that were written in English during or after 1982.” No problem. The system supports all those great features that information junkies have grown to love — truncation, set manipulation, sorting, flexible formatting — in short, a STAR user has a true in-house on-line information retrieval system.
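
For readers curious about what an inverted index buys, here is a minimal Python sketch. It illustrates the general technique only, using invented records and an assumed “?” truncation symbol, and makes no claim about how the STAR stores its index. Each word maps to the set of record numbers that contain it, so a two-term search becomes a set intersection and truncation becomes a prefix match over the indexed words.

```python
import re
from collections import defaultdict

def build_index(records):
    """Map each lowercase word to the set of record ids that contain it."""
    index = defaultdict(set)
    for rec_id, text in records.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(rec_id)
    return index

def find(index, term):
    """Look up one term; a trailing '?' (an assumed symbol) means truncation."""
    term = term.lower()
    if term.endswith("?"):
        prefix = term[:-1]
        hits = set()
        for word, ids in index.items():
            if word.startswith(prefix):
                hits |= ids
        return hits
    return set(index.get(term, set()))

# Invented abstracts, for illustration only.
records = {
    1: "Fifth generation computing projects in Japan",
    2: "Semantic networks for expert systems",
    3: "Japan and the semiconductor market",
    4: "Predicate calculus and semantics",
}

index = build_index(records)
print(sorted(find(index, "generation") & find(index, "japan")))  # records with both terms
print(sorted(find(index, "semantic?")))                          # truncated search
```

The first search prints `[1]` and the second prints `[2, 4]`. Because the sets are built once at indexing time, each later query touches only the words it names, not every record.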

Of course, this capability doesn’t come cheap. The STAR is based on a hard-disk Alpha Micro System, and in the minimum 8.5 megabyte configuration costs about $30,000. The price tag goes up to $50,000 for the 120 megabyte version. Similar functions can be performed with small machines, however, provided you have the software available to accomplish the basic steps described above.

But the economics of system ownership, the cost of on-line time, and the ongoing expense of keeping the local database updated give rise to a break-even point that might be a bit high for casual use. One participant in the National Online Meeting session on downloading commented that a $30,000 system costs about $7,000 to $10,000 a year to own and operate, equivalent to about 10 hours a month of on-line time. Even though the local system can be available 240 to 360 hours a month as a custom-tailored electronic library, it would have to be used quite heavily to offset the operating costs.
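
The participant's comparison can be checked with a few lines of Python. The only inputs are the annual cost range quoted above and the dollar-a-minute rate cited earlier. Actual rates varied by database, so treat the result as a rough check, not a pricing model.

```python
def equivalent_online_hours(annual_cost, dollars_per_minute=1.00):
    """Monthly hours of on-line time that the same annual budget would buy."""
    monthly_budget = annual_cost / 12
    return monthly_budget / (dollars_per_minute * 60)

for annual_cost in (7_000, 10_000):
    hours = equivalent_online_hours(annual_cost)
    print(f"${annual_cost:,} a year buys about {hours:.1f} on-line hours a month")
```

At a dollar a minute, the range works out to roughly 10 to 14 hours a month, which agrees with the figure quoted at the low end.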

Of course, there are a lot of people who would do so quite happily, including those with suspicious commercial motives.

Cuadra Associates is very careful to avoid mentioning downloading in its advertising, being well aware that there are forces out there trying to hold Sony liable for manufacturing machines capable of copying video programs. The company doesn’t want to be caught in the same legal quagmire, although it doesn’t take much technical sophistication to realize that any microcomputer with a modem and some communications software can do a perfectly adequate job of downloading an expensive database. Such micros may be a bit clumsier than the STAR, but many of the database packages marketed for them will perform the subsequent manipulation quite adequately. They all have one limitation or another (dBase II, for example, is limited to 1,024-character records even though many on-line records are twice that). But it can be done.

And it is being done — on an increasing scale. The challenge is now before the database industry: “What are you going to do about it?”

Steven K. Roberts is the president of Words’Worth, Inc., a high-technology business communications firm in Columbus. He is currently working on his fourth book.