
Chemistry & the Internet 1998


CHEMINT98: CHEMISTRY AND THE INTERNET, SEPTEMBER 1998, IRVINE, CALIFORNIA

A conference report by Dr. Wendy A. Warr, Wendy Warr & Associates

Contents
Virtual Communities in Chemistry
Virtual Conferencing: Chemists Discussing Chemistry
After the Honeymoon: Delivering Chemistry on the I-Net
ChIN's Web Page: Selected Chemical Resources on Internet
Modularity - A New Paradigm for Electronic Scientific Interchange
Trends in Electronic Publishing: The HighWire Press Perspective
Producing Internet Editions of Scientific Journals
ISI Chemistry Server
Web Application Development in Chemistry: So Many Tools, So Little Time
Integrating Chemistry with Neighboring Information Universes
Molecular Modeling through the World Wide Web
Chemical Dataflow Programming in a WWW Environment
Chemical Needles in Haystacks: Meta-data, MIME, Markup and Models
The TelePresence Microscopy Collaboratory and DOE 2000
TeleSpec - Telecooperation in Spectroscopy / Infrared Spectrum Prediction and Interpretation via the Internet
ACD/ILAB: Connecting Distributed Chemical Information Resources to a Unified Web Front End
Virtual Education
Making Chemistry Available via the WWW for Education
VChemLab: A Virtual Chemistry Laboratory
The CSIR Web Service -- Making it Easier to Find the Chemistry Software and Information You Need
Closing Remarks

Most of the slides used in the presentations are available at the conference Web site. This report is designed to be complementary: it is detailed but purely textual.

Virtual Communities in Chemistry

Wendy Warr, Wendy Warr & Associates

This talk is available in the ChemWeb/VEI library. It has been said that the total of all printed knowledge doubles every five years (Reuters). The volume of "unpublished" information, including that shared at conferences, over the telephone etc., is also increasing apace. Later speakers talked about lists, electronic journals, and virtual conferences, so I concentrated on the concept of virtual community (VC). My paper on this subject will appear in the Journal of Chemical Information and Computer Sciences 1998, 38(6).

Statistics abound about the exponential growth of the Internet. The US government has reported that it took 38 years for radio to win 50 million users; 13 years for TV to get 50 million viewers; and just 4 years for the Internet to gain 50 million users. These figures must be regarded with some skepticism, since the world population has increased considerably over the last 40 years, and the Internet is older than many people think. It has been said, however (Koenig and Sione), that the Internet is only at Stage 2 of a cycle through which all enabling technologies go. This is the stage at which there is concern about issues of equity and access, and regulation starts to take place. Only at Stage 3 does the technology become efficient and effective, with a move towards competition and deregulation.

Mark McDonough has defined a VC as "a group of people who share something important in common but lack sufficient opportunities to interact face to face". Howard Rheingold, the father of VCs, defines them as "Incontrovertible social spaces in which people meet face to face but under new definitions of both 'meet' and 'face'."

With increasing emphasis on teamwork, "teleworking", and the "virtual company", terms such as Computer Mediated Communication (CMC) and Computer Supported Cooperative Work (CSCW) are appearing in the literature. Ray Dessy uses the term "collaboratories" for systems in which chemists share information electronically. Electronic laboratory notebooks and virtual conferencing may be part of a collaboratory.

Community users deprived of "social cues", non verbal cues, and the subtleties of face-to-face interaction develop "alternative" methods of verbal communication. Discourse may include the use of abbreviations (e.g., IMHO) and emoticons. Since members are unable to see, hear or feel one another, they have a degree of anonymity. This can have the advantage of egalitarianism but it may impede decision-making and lead to undesirable social inhibition (e.g., "flaming"). Successful communities have a loyal core of members, who have a group purpose and a feeling of belonging and who establish a code of behavior for the community.

Some observers believe that virtual communities are antithetical to commerce but others, such as Hagel and Armstrong, authors of Net Gain, are convinced that a strong commercial element can actually enhance trust and commitment among the members. Virtual communities will create "reverse markets" where customers hold the power base and seek out vendors. The business model is no longer one of vendors "pushing" products but one of the vendor as an agent. A commercially successful community must build up a critical mass of content and members.

A common design practice to create a sense of community in networked applications has been to imitate urban planning, creating town squares, segregated spaces, and public and private discussion spaces. An example is seen in some of the icons in Engineering Village. We shall later see a chemical community with its "shopping mall" and "library". Engineering Village is successful in two respects: it was profitable within 18 months of its inauguration, and it is a resource which engineers find indispensable. Those two facts are linked.

Three chemical communities are ChemCenter from the American Chemical Society, chemsoc from the Royal Society of Chemistry, and ChemWeb.com, organized by ChemWeb Inc. Chemsoc is the newest site for chemists and the "home of the international chemistry societies' electronic network". ChemCenter provides information about professional services, conferences, publications, databases, education, shopping and resources. It links with the very useful ACS Publications, ACS Web, and CAS sites but it has very little content of its own. ChemWeb.com is the most developed of these three communities. Services offered include library, databases, shopping mall, meetings, The Alchemist (a "Webzine"), Your Room (the member's own room), and Site Search. ChemWeb.com has about 35,000 members to date and expects 50,000 well before the end of 1998.

One problem with all three chemical communities is the length of time it takes to load the home page and navigate through the site. The figures in the table below were obtained using the independent "tune-up" available at Websitegarage.

Site          Load Time Score   Load Time (14.4K modem)
Warr          Excellent          9.70 seconds
Derwent       Good              27.85 seconds
Chemsoc       Fair              43.56 seconds
ChemWeb.com   Fair              50.41 seconds
ChemCenter    Poor              65.31 seconds

Probably few of you are using 14.4K modems, but I have stripped out my ISDN and T1 figures for simplicity and used the times that give the most differentiation. It is the relative timings that matter here. In fairness, I should point out that ChemWeb.com scores "excellent" in some other Websitegarage categories (HTML design, unbroken links, etc.). Also, downtime is negligible for both ChemCenter and ChemWeb.com. (Note that these tests cannot be repeated today on the new version of ChemWeb.com, since registration is needed even to open the home page.)
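Websitegarage's exact methodology is not public, but the 14.4K figures can be sanity-checked with simple arithmetic: a 14.4K modem moves at most 14,400 bits (1,800 bytes) per second, so a best-case load time is roughly page size in bytes times eight, divided by the line speed. A minimal sketch (my own, purely illustrative):

```python
# Rough back-of-envelope check (not Websitegarage's actual method):
# best-case download time for a page of a given size over a 14.4K modem.

MODEM_BITS_PER_SEC = 14_400  # 14.4K modem line speed

def modem_load_time(page_bytes: int, bits_per_sec: int = MODEM_BITS_PER_SEC) -> float:
    """Best-case download time in seconds, ignoring latency and overhead."""
    return page_bytes * 8 / bits_per_sec

if __name__ == "__main__":
    # A 9.70-second load at 14.4K corresponds to roughly 17 KB of page data.
    print(round(modem_load_time(17_460), 2))  # 9.7
```

Real timings will be longer than this lower bound, since connection setup and server response time are ignored.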

The ChemWeb.com Job Exchange currently has about 400 records, but fewer than 100 are vacancies; the remainder are from members seeking jobs. (Remember that one of ChemWeb.com's aims is to provide services for its members.) The vacancies are almost all in Europe and North America. The Job Exchange is one of about 16 searchable databases on ChemWeb.com. Three of them, the ACD, NCI-3D, and the Investigational Drugs database, are substructure-searchable. ChemWeb.com is the only one of the three chemical communities to offer substructure searching (using MDL's Chemscape/Chime).

The Alchemist is ChemWeb.com's Web magazine which includes features such as: chemistry news headlines; ChemWebpicks (a selection from the World Wide Web); Catalyst (topical issues written by David Bradley); Research News; Warr Zone (my own pick of the news); Filternet (a digest from discussion lists and newsgroups); scientists' profiles and interviews; Spotlight (a closer look at major chemistry topics); Conference Diary; book reviews; and software reviews.

Text searching in ChemWeb.com is performed by BiblioteK software which was purpose-built for ChemWeb.com's sister club BioMedNet, but has been further developed for ChemWeb.com. Library, databases, jobs, shops, members, The Alchemist, Web sites (i.e., ChemDex Plus) and conferences in any combination can be searched in Site Search and individual sections (e.g., the Job Exchange) can be searched within that section.

In August 1998 there were 218 books and 9 software products for sale in the ChemWeb.com shopping mall. More have been added since. There is a secure server for ordering and carrying out financial transactions. Discussion groups are a recent addition to ChemWeb.com. Discussions cover a variety of topics. ChemWeb.com is particularly keen to encourage communication amongst its members and a sense of belonging and ownership. It is also the only one of the three sites where member profiles are searchable.

Real time communication in the ChemWeb.com virtual conferences is another unique feature. The "conference center" used is that provided by Virtual Environments International. Text-based conferencing in real-time has been taking place in fields other than chemistry for some time. Extending such an environment to chemistry, and including graphics, poses profound technical problems but ChemWeb Inc. and VEI have had some success with eight experimental conferences to date. Sound, video, rotatable 3D chemical structures and other features have all been included in the lectures.

I myself have logged on for all the lectures to date. Handled with patience, in the spirit of an experiment, they are an exciting development. The software is evidently immature and resource intensive, and some users report problems getting through their firewalls. Cynics have commented that it is easier and quicker just to view the conference transcript and slides after the event. I disagree: there is no substitute for the stimulus of joining in the discussions as they are actually taking place. Simply reading a history afterwards defeats the whole object: it gives no opportunity for participation. (There is a facility for post-lecture discussions, but, in practice, it is little used.) "Real" conferences and video-conferences are quite different in concept and in experience.

I carried out a "straw poll" among 61 of the attendees at the first virtual conference. I had 22 replies. Nearly all of the respondents have made significantly increased usage of the Internet over the last two years (mainly for email) and, interestingly, about two-thirds of them have some feeling of "virtual community".

Finally, a few conclusions about ChemWeb.com as a community. On the plus side, content and membership are on the increase. But I would hate to have to guess what the critical mass is, or when it will be achieved. Some of the challenges are as follows. Successful communities have a very focused membership, that is, the members have a very specific interest. In theory, ChemWeb.com serves the broad community of "chemists" in general. In practice, I suspect that many members are in the pharmaceutical industry, and/or in the medicinal chemistry, chemical information and computational chemistry communities. Member-generated content is low although members do contribute to the Job Exchange, the shopping mall and discussion forums. The community is certainly in the "nice-to-have" category at the moment. It needs to become "essential" if it is to succeed. Speed of access, and efficiency in moving around the site, definitely need attention, but the problem will to some extent correct itself as bandwidth on the Internet increases.


Virtual Conferencing: Chemists Discussing Chemistry

Barry Hardy, Virtual Environments International

This talk is available in the ChemWeb/VEI library. Hardy gave his talk as it would appear in the VEI virtual conference auditorium. He started with some background before discussing the VEI virtual auditorium and the ChemWeb library, and demonstrating the chemistry whiteboard. Email is probably the most commonly used form of correspondence between two people or small groups, but it has various limitations. For example, how do you communicate with larger groups? Email tends to be temporary: how do you archive the information? Nor is it a very easy way to share chemical information, although some recent enhancements, such as MIME types, are improving that situation. Email is also used in mailing lists and bulletin boards, which allow a large group of people to communicate asynchronously. There is a tremendous increase in Web-based publishing. Virtual conferencing can be asynchronous, synchronous, or quasi-synchronous.

Hardy showed a list of electronic conferences, including ECCC, ECTOC, etc., and he picked EGC-1, held in 1995 in conjunction with The Glycoscience Network, as an example. He showed a navigation map of the conference center with a speaker in the South Room, a trade center, coffee shop, hotel foyer, etc. The conference was asynchronous and you could read the transcript later. Hardy showed the VEI welcome and auditorium. Structuring and scheduling are very important: scheduling is even more important in a virtual meeting than in a real meeting. Documents and materials were available in the library area. There was a messaging system so that members of the conference could exchange messages. There was also a virtual exhibition. A list of attendees, with their profiles and photographs, was available. What is said in the virtual bar is not recorded.

Next, Hardy discussed the VEI virtual auditorium. HTML/Java interfaces were used at first, but it was hard to support all users and browsers, so they dropped Java. In the future, Java will be an option. In a VEI real-time conference, there is a virtual slide projector, and the slides change within a window at the right of the screen. The discussion (which is moderated) takes place within a transcript window at the left. User profiles are possible. Photograph(s) of the presenter(s) appear at the top in another scrollable window. The "Who" function allows participants to view a list of all the people who have logged on. User privileges can be set. Private chat and whispering are possible for participants in the main room. The transcript is recorded. Various user options, such as a "thread" letter at the left, can be set. Hardy demonstrated the audio feature. Rotation of molecules and other graphic effects are possible in the slide show.

Henry Rzepa and a panel of speakers were involved in the first ChemWeb/VEI event. Since then there have been other conferences at monthly intervals. Peter Murray-Rust has spoken about CML. Karl Harrison's talk included animated GIFs and a QuickTime movie. Chemical animation (e.g., hydrogenation of alkenes) was done with Shockwave for Director. Hardy also showed Harrison's dynamic HTML animation examples, e.g., a calculation on a magnetic susceptibility balance and an 18-electron counting quiz. He played a recording of Johnny Gasteiger talking in another lecture.

Finally Hardy did a live demonstration of the collaborative 2D and 3D Whiteboard. This is a VEI-Cherwell Scientific collaboration. Hardy demonstrated "releasing the pen" and passing it to Paul in Oxford so that Hardy and his colleague thousands of miles away could operate in real time on the same chemical diagram.

In the discussion following Hardy's talk, Henry Rzepa asked whether these virtual conference techniques will be expensive and thus elitist tools for very rich companies. Hardy said that costs will come down. There was a question about different time zones. Hardy said that you can have three sessions in the day, but time zones are a genuine problem with real-time communication.

Two take-away messages in conclusion. People talk of the "one-stop shop" for all their chemical information needs. This has been expressed in nonsensical marketing statements such as "you can do it all on ChemCenter". Even if ACS Publications and Elsevier Science were to applaud each other in public and put all their resources onto one site, would this be desirable? Would you really want a "Microsoft of electronic publishing"? However, if I had to decide which one of the three communities has the best chance of success in approaching "one-stop-shop" or "essential" status, my money would be on ChemWeb.com at the moment.


After the Honeymoon: Delivering Chemistry on the I-Net

Bill Langton, Tripos

Why do chemistry on the I-net? It has advantages for communication and collaboration: sharing results and accessing information. A search on "chemistry" and "net" found 200,000+ hits. There were 12,000 hits on "chemistry" + "sex" + "net".

In education, graduate research, and learning, the problem is lack of money. In the industrial domain there may be more money, but there is a shortage of time and people for R&D and interdisciplinary exchange. Langton talked of three phases: introduction, honeymoon, and reality.

In the introductory phase, participation was mainly by academics, for presentation of content, and Mosaic and Netscape made the Web accessible. Chemistry on the Web goes back to about 1995, with Bachrach and Rzepa being recognized innovators. In the honeymoon phase, features were a promise of quick results and the "cool factor" effect. Companies "stampeded" and there was technological hype from companies such as Sun that had something to sell. The reality was that early promises were not met, things took longer than expected, there were divisions over strategy (e.g., whether or not to use Java), and we are still awaiting killer applications.

Computational chemists had no incentive to move over to Web-based tools; bench chemists are using the Web for non-business reasons; but in bioinformatics there is a demand for Web-based applications. So, what happened to perception, expectations, technology, and implementation? It was supposed to be easy, quick, and cheap. Langton showed some time bars for HTML/CGI (pre-1995 to 1998), helper applications and plug-ins (mid-1995 to 1998), Java and CORBA (1996 to 1998), and the browser wars (1997-1998).

Technologically, browsers failed to keep pace: there were issues around performance, reliability and robustness. Technology companies have not delivered. In implementation, early applications missed the mark; existing, traditional applications were left out; and applications did not appeal to a broad audience.

Future success will depend on integration of many technologies, utilization of software development practices, and realization that the Web is not the answer to every problem (e.g., real-time molecular modeling is not possible). We have learned to stop focusing on the technologies, to pay attention to users and their needs, and to use technology to design appropriate solutions. Web tools cannot be isolated from traditional applications. Communication is a two-way street.


ChIN's Web Page: Selected Chemical Resources on Internet

Xiaoxia Li, Fang Xu, Xinjian Yan, Zhang Yuanyang, Zhihong Xu, Laboratory of Computer Chemistry, Chinese Academy of Sciences

The ChIN project (International Chemical Information Network) was started in 1993 by the Federation of Asian Chemical Societies (FACS), which recognized the significance of the Internet as an important medium for chemical information and for progress in chemical resource discovery. The construction of a Web page for ChIN, supported by UNESCO, started in 1996. The page is now the only chemical site in China with comprehensive directories of chemical resources on the Internet. Besides the archives of ChIN's activities, more than 500 selected chemical sites on the Internet are indexed on ChIN's Web page, thus far under 18 categories.

Although the goal of ChIN's page, like that of many other comprehensive chemistry index sites, is to help chemists make use of the Internet's chemical resources, ChIN's page does have some special features. First, it aims at selected resources: both the category subjects and the indexed items in each category are carefully chosen. As chemical databases and chemical software are the basic tools for chemists, they are chosen as the primary lists. New applications such as e-journals and e-conferences on the Internet are emphasized. Lists of chemical meetings, organizations, mailing lists, important news, selected publications and books, as well as patent information, are available. A special category covers chemical information services in China. Gateways to other well-known comprehensive lists and search engines are also provided.

Second, the approach is information-based. Instead of providing only a hyperlink or a link with a short description of the resource indexed, a summary page is created for most of the chemical resources included on ChIN's page. The summary page describes the indexed resources in more detail. The whole set of the summary pages forms the information base, where full-text searching can be performed by Internet users for more precise and fast location of information on ChIN's page.

Third, there is knowledge accumulation. Some categories on ChIN's page accumulate expert postings from mailing lists and newsgroups, which often provide useful clues to chemical information. For example, the category "How to Find Property" is an effort to accumulate knowledge on physical and chemical property data. Combining it with the "Chemical Databases" category on ChIN's page may provide further pointers to physical property data on the Internet and elsewhere.


Modularity - A New Paradigm for Electronic Scientific Interchange

Joost Kircz, Elsevier and University of Amsterdam

The Communication in Physics project involves "translation" of knowledge into text, applications, etc. One of Kircz's co-workers is a linguist; another is a physicist. Spoken language is speech in context; Kircz referred to Aristotle's unities of time, place, and action. Written language is codified speech, ready for ageless storage and transport. It is not speech-on-paper.

In electronic language, recontextualization is possible. Electronic language is not written language, or just another memory medium besides paper; nor is it canned speech in context. After the revolution of the printing press, in religion the truth came to reside in the writing (contrary to Plato, who distrusted the written word). In law there was a unification of law systems. In science there began a 400-year tradition of scientific publishing, starting in 1665 with Le Journal des Sçavans. Illustrating "In the beginning was the Word" (the truth is in the writing), Kircz showed a hypertext example of the Bible in four languages. He talked about reusability of works or parts thereof, indexing, dictionaries, bibliographies, and yellow pages. (The Web-crawler was an early invention!)

Kircz talked about standardization of the presentation and judgment of scientific works on paper and the societal role publishers and libraries play. The eight pillars of canned wisdom are:

authentication (intellectual ownership and fame)

validation (peer review)

certification (quality stamp, journal name, imprint)

indexing (meta-data)

storage (print on paper)

dissemination (submit to journal)

retrieval (catalogs)

disclosure (interlibrary loan and photocopying).

This is why we love libraries and publishers, but who will pick up the bill?

Kircz's project is currently in the phase of translation from old practice to new forms. Present day electronic journals are mostly just searchable paper journals plus appendices (video, sound, data-set etc.) In reality we will see different roles for textual and non-text information. Text will act as an explanation of the picture. Virtual reality offers tactile, visual, and audio sensations (and eventually scent?).

Print on paper induces a linear form of discourse, as if it were a novel or detective story. However, reading of regular scientific articles is most often a search for some (yet unknown) needle in a multitude of haystacks: browsing, flipping pages, etc. Kircz's research group takes the intrinsic characteristics of the technology seriously and does not try to mimic the paper world in the electronic world. Hence the breaking up of knowledge representation into modules. Different concepts deserve different presentation and different handling. An article is a set of modules (which may be sound or video, for example) and the modules are linked in some way. This also reduces a lot of repetition in articles.

Existing journals are analyzed as raw material for a different presentation in four representation spaces: conceptual, range, domain, and bibliographic. Modules, according to the conceptual ordering, are elementary ("atom") modules or complex ("molecular") modules. A molecular module is a set of modules with a linking system (compare atoms, molecules, and bond types). Some modules may be empty; some are just pointers.

Next to the conceptual modules we need a special module as switchboard or linchpin, called the meta-information module. This comprises maps of the domain, specific terms, and bibliographical terms. Part of the meta-information is the abstract. In a modular (distributed) information environment, the abstract becomes crucial. In the old linear form the abstract acted as a kind of mini-introduction. In Kircz's model, where the reader can enter the "article" from any module, the abstract takes on a centralizing role in the organization of the various kinds of information. The structure of the abstract in a modular environment is a special study of the group's linguist. Kircz's full list was:

  1. Meta-information
  2. Positioning - situation and a central problem (compulsory)
  3. Methods module (theoretical, experimental, numerical)
  4. Results module (raw data, treated results)
  5. Interpretation (remains argumentative text)
  6. Outcome. Findings. New problems.

The range of information may be microscopic, mesoscopic, or macroscopic. Kircz also summarized the organizing relations (sequential, proximity, hierarchical, range-based, representational, and administrative) in the link taxonomy, and the discourse relations: problem solving, transfer, zooming, communication center function, explanation, and comparison.
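To make the atom/molecule analogy concrete, the modular model above might be represented as follows. This is my own illustration in Python, not Kircz's implementation, and all names are invented:

```python
# Illustrative sketch of an "article" as a set of linked modules,
# following the talk's atom/molecule analogy. Relation names come from
# the link taxonomy summarized above; everything else is hypothetical.
from dataclasses import dataclass, field

LINK_RELATIONS = {"sequential", "proximity", "hierarchical",
                  "range-based", "representational", "administrative"}

@dataclass
class Module:
    name: str                  # e.g. "positioning", "methods", "results"
    content: str = ""          # text, or a pointer to sound/video/data
    links: list = field(default_factory=list)  # (relation, target Module)

    def link(self, relation: str, target: "Module") -> None:
        """Attach a typed link to another module; the taxonomy constrains it."""
        if relation not in LINK_RELATIONS:
            raise ValueError(f"unknown relation: {relation}")
        self.links.append((relation, target))

# A "molecular" module is then just a set of atoms bound by typed links:
meta = Module("meta-information", "abstract and domain map")
methods = Module("methods", "experimental setup")
results = Module("results", "raw data")
meta.link("hierarchical", methods)
methods.link("sequential", results)
```

The point of the typed links is that a reader entering at any module (say, results) can follow the relations back to the abstract, which is exactly the centralizing role Kircz assigns to the meta-information module.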


Trends in Electronic Publishing: The HighWire Press Perspective

Michael Newman, Stanford University

This talk is available in the ChemWeb/VEI library. The HighWire Press is a unit within Stanford University Libraries. About four years ago, the American Society for Biochemistry and Molecular Biology (ASBMB) was searching for alternatives for delivery of the Journal of Biological Chemistry (JBC). Stanford Libraries agreed to take on the development of the Web version of JBC. The HighWire Press team was appointed in early 1995 and released the Web version by May 1995. Since then, HighWire has developed electronic versions of SCIENCE, PNAS, and more than 60 other journals. It has developed a variety of new features and implemented access control systems for subscriber-only access to many of the journals. It has also brought together a group of scientific societies and other publishers of high-impact scientific journals.

The mission of the HighWire Press is to form partnerships with publishers of scientific information and to apply technological expertise to promote scientific communication. Goals are to ensure that professional organizations and scientific societies maintain their market share and to promote co-operation among these organizations as publishers to enhance the delivery of information to readers.

Electronic versions diverge from print in several ways. First the electronic version can be published much faster than a printed publication. Second, some journals are taking advantage of the Web by publishing more articles in the electronic version than in print. Third, it is easy to include supplementary data linked to the article. Fourth, an electronic journal can include video, sound, and other media. Fifth, electronic journals offer new possibilities for bi-directional communication between authors and readers. Sixth, content awareness facilities are possible: HighWire has recently developed the CiteTrack system, which allows readers to receive alerts when new content is published. Seventh, links can be made to related content. Newman considered each of these seven points in turn.

JBC and other HighWire journals offer titles, and in some cases abstracts, of articles to be published in future issues. Within the next year some articles will be routinely published electronically ahead of print. A journal is no longer a series of issues: individual papers can be published on acceptance. JBC is still published both electronically and in print, but some HighWire journals, for example Pediatrics, incorporate electronic-only articles.

SCIENCE is an example of a journal that offers online-only supplementary tables, figures, and other materials. In offering supplementary materials, the publisher has to decide the level of peer review these materials receive. Another important issue is where the supplementary material will be located. In the case of SCIENCE, the supplementary material is maintained at the SCIENCE Web site, but other journals offer links to the author's Web site. An example of the use of new media is Molecular Biology of the Cell which recently started to include video essays (e.g., of chromosome separation) in each issue. Articles about animal communication could include sound. The electronic medium can be much more interactive than print. The British Medical Journal (BMJ) has always had an active letters to the editor section: BMJ has turned the section into a moderated email discussion group. The electronic version offers more correspondence than print and it is also faster and more interactive.

The Web also gives new opportunities for offering content awareness in journals. Recently, HighWire introduced CiteTrack in some journals. This alerts readers by email whenever new content is published that matches topics, authors and articles of interest. The editors of BMJ assign every article to at least one of 120 subject categories which act as preformulated searches.
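The talk does not describe CiteTrack's internals, but the matching idea, stored reader interests checked against each newly published article, can be sketched as follows; the field names and data here are hypothetical:

```python
# Hypothetical sketch of CiteTrack-style alerting (not HighWire's actual
# code): each reader profile lists topics and authors of interest, and
# every new article is matched against the profile when it is published.

def matches(profile: dict, article: dict) -> bool:
    """True if the article touches any topic or author the reader follows."""
    shared_topics = set(profile.get("topics", ())) & set(article.get("topics", ()))
    shared_authors = set(profile.get("authors", ())) & set(article.get("authors", ()))
    return bool(shared_topics or shared_authors)

# A reader following Game Theory would be alerted to the altruism article:
profile = {"topics": {"game theory"}, "authors": {"Axelrod"}}
article = {"topics": {"altruism", "game theory"}, "authors": {"Nowak"}}
```

In the BMJ arrangement described above, the "topics" would be the 120 preformulated subject categories assigned by the editors.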

One of the most powerful features of the Web is the ability to link text to related information. HighWire links cited references to abstracts in MEDLINE, links articles to molecular sequence data, and provides links between articles and the papers they cite. For cited references it offers links in both directions (not possible in print). It also offers cross-journal links which in many cases (e.g., JBC and PNAS) are free. Each weekly issue of SCIENCE includes two or three "Perspectives": short articles, usually highlighting a particularly significant research article also published in the same issue. Each week, the editors at SCIENCE select one commentary for enhancing. Enhancements consist of links to outside related Web sites. For example, a recent commentary discussed research explaining altruistic behavior in terms of Game Theory and the Prisoner's Dilemma. There are lots of Web sites related to Game Theory, so the enhancements for this article consist of links to such sites. Thus the enhanced Perspective becomes more than just a mini-review: it is also a guide to evaluated Web sites on a topic.

Next, Newman turned to the important question of archiving. If the electronic journal were just an exact electronic copy of the printed journal, we could simply treat the printed journal as the archival format. However, with electronic-only features, new archival solutions need to be found. Functionality as content (e.g., links to MEDLINE, or videos in Molecular Biology of the Cell) is also an issue. Evolving technology is possibly the most challenging aspect of developing an archival medium: even reading floppy disks written just three years ago can be a problem. Related to this is the changing interface for electronic journals. Finally, standards are needed for archiving electronic journals. When the archiving problem is solved, JBC and other journals may possibly be moved to an electronic-only format.

Lastly, Newman discussed pricing. In the most common model for electronic journals, the institution pays for access, not ownership. If the library cancels an electronic title after five years, the institution no longer has access to that title. Some HighWire titles are taking a new approach to this problem: their publishers have concluded that making older content free will have little effect on current subscriptions. A second issue is non-subscriber access. For most HighWire journals, non-subscribers can view tables of contents, titles and abstracts, and they can search for articles. If you subscribe to JBC and you find an article in PNAS cited in a JBC article, you can view the PNAS article even if you do not have a PNAS subscription. However, a better "pay-per-view" model is still needed. The final development in pricing is the HighWire Marketing Group (HWMG), a co-operative effort by HighWire publishers to offer their journals as a package.

Back to the top

Producing Internet Editions of Scientific Journals

David P. Martinsen, Lorrin R. Garson, Jeffrey D. Spring
Advanced Technology Department, Publications Division, ACS

In 1975, ACS journal production moved in-house using a content tagging model. In 1980, one thousand articles from the Journal of Medicinal Chemistry were loaded on BRS as an experimental prototype. From 1982 to 1985, the full text of 16 ACS journals on BRS was a commercial product. In 1985, experiments with CD-ROM began, in collaboration with OCLC. In 1993, supplementary information for J. Am. Chem. Soc. was first delivered by gopher. In 1996 the first two Web versions of ACS journals were introduced. In September 1997, ACS completed the process of creating Web editions for all 26 journals. The process for producing the Web editions is based on the electronic files used for the print editions. In January 1998, the ASAP (As Soon As Publishable) program was started, providing for publication of material on the Internet 2-11 weeks prior to publication in print. The Web articles are usually posted within two days after final corrections from authors have been made to the peer-reviewed, accepted manuscript. The Web release date is the date of public release and this date is printed in the journals.

Martinsen gave a diagram of the previous production process. The author's manuscript (usually with a diskette) is input to the Xyvision system from which the print journals are produced. SGML is also output from Xyvision and this is used to make CD-ROMs and WWW editions. Graphics are digitized and input to the Xyvision system and into CD-ROM and WWW editions. One objective in making the whole process more efficient was to disturb the existing process as little as possible. For example, SGML could be made the master version.

Digital Object Identifiers (DOIs) are fundamental to the new system. An ACS example of a DOI is //dx.doi.org/10.1021/jo980301f. Persistence is an important feature of DOIs. The 50,000 articles on the Web are registered with the DOI Foundation.
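
The persistence argument can be sketched as follows: a DOI names the article, and a resolver maps the name to wherever the article currently lives, so citations survive site reorganizations. The helper names below are invented; the resolver address is the one quoted above.

```python
DOI_RESOLVER = "http://dx.doi.org"  # the resolver in use at the time

def doi_to_url(doi):
    """Turn a bare DOI such as '10.1021/jo980301f' into a resolvable link."""
    return "%s/%s" % (DOI_RESOLVER, doi)

def split_doi(doi):
    """A DOI is '<prefix>/<suffix>'; the prefix identifies the publisher."""
    prefix, suffix = doi.split("/", 1)
    return prefix, suffix

print(doi_to_url("10.1021/jo980301f"))
print(split_doi("10.1021/jo980301f"))
```

If the publisher moves its server, only the resolver's mapping changes; every published citation of the DOI keeps working.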

DOIs are used rather than URLs. The PubMed links to ACS articles now use DOIs.

By August 18, 1998, there were 51,210 articles from 26 journals on the Web. Average file sizes are 54K HTML, 18K Figures, 1K equations, and 259K PDF. In 1998 (to September 13) the number of loaded articles was 13,442 in journal issues, 13,862 ASAP and 1748 in the ASAP pipeline. The service level objective is 98% uptime; the average to date in 1998 has been 98.07%. There is a high availability system with redundant connections to the Net and there is a formal disaster recovery plan. ACS is committed to archiving. SGML is seen as one archiving format but ASCII composition files, PostScript and PDF are all being held.

In the future, the idea of a journal issue may be eroded. There would then be no pagination, so the DOI may be the solution for citing articles. DOIs will appear in ACS articles about a month from now. Capabilities unique to the Web will be added. XML is more powerful than HTML, simpler than SGML, and has Unicode support. MathML and CML are XML applications. XML is already being used in electronic commerce applications, so there are economic driving forces.

Back to the top

ISI Chemistry Server

Nikolai Kopelev, Matthew Clark, ISI

Traditionally, chemical information has had a narrow focus and weak links outside chemistry. Current trends are toward further diversification and increasing scattering. ISI's objective was efficient integration of information using the Internet as a tool. The first step was to put chemical data on the Web: initially Current Chemical Reactions (CCR) and next Index Chemicus (IC).

For the Web of Science every document from over 8200 journals is used and every cited reference is captured. Over 7000 books per year are also used. Many disciplines, including humanities, are covered. Chemistry data includes over 400,000 reactions, growing by 42,000 annually. The reactions were in REACCS then ISIS. ISI has structural information and biological activities for over 1,500,000 compounds, growing by over 200,000 compounds annually. The chemical database is a part of ISI's comprehensive database. There are bi-directional links between the Web of Science and the chemistry server.

The integrated system has these key benefits:

  • Web access to reaction and structure data
  • navigation by citations (efficient search)
  • another level of detail (reaction and compound data)
  • multiple links
  • link to library holdings
  • link to publisher's full text and ISI document delivery
  • potential for integration with other data
  • personal management of results.

Kopelev gave an example: a search on "stereoselective" and "synthesis" (chemistry) and on HIV-1 protease inhibiting activity (biology). A result was Palinavir. Building blocks for synthesis were shown. Kopelev then ran a reaction query. He showed an article view (all the reactions in the article displayed) and a reaction view. "Advantages", as specified by the author, can be displayed, e.g., simple reaction, good yield etc. Next, Kopelev connected to the Web of Science. He pointed out "link to Web of Science" and "link to chemistry server" on the displays. By jumping across links he found synthesis information for Indinavir, Ritonavir, Nelfinavir and Saquinavir.

The ISI chemical databases will be available in November or December. Publishing full text is a complicated issue: ISI simply points the user to the publisher. A collaboration with Derwent Patent Explorer is ongoing and other databases are being investigated.

Back to the top

Web Application Development in Chemistry: So Many Tools, So Little Time

Matt Hahn, MSI

Hahn said that his talk would be seen as controversial. He took a commercial viewpoint. First he considered the requirements for Web development. The software developer wants to create, reuse and integrate software components easily, in order to build Web applications quickly. End users want an engaging, exciting experience.

In theory, tools on the client include HTML documents, Java applets, plug-ins and JavaScript. Server side tools include the HTTP server, database and compute engines. In reality the situation is more complex. Browsers include various versions of Netscape and Internet Explorer (IE). There are multiple HTTP servers: from Netscape, Sun (Java Server), Microsoft etc. Content may be in HTML, XML (CML) or ActiveX documents. Dynamic content uses Java applets, Netscape plug-ins, ActiveX controls and DHTML. There are many scripting languages, including JavaScript, JScript and VBScript amongst others. Content delivery may be by CGI, Java servlets, Active Server Pages, JDBC, ODBC, or OLE DB. For distributed communication there is CORBA, COM (DCOM) or Java RMI. Molecular visualization may be done with JPEG, GIF, MPEG, AVI, VRML (1 and 2), Chime, WebLab Viewer etc. This list is not exhaustive. There are too many tools: there are 10,000 different possible ways of building a Web based application from the tools listed here.

As a reality check, ask yourself: has anyone been successful in deploying cross-platform applications that rely upon anything but simple HTML? No. Why? There are just too many components and fundamental incompatibilities. We are trying to give the user an engaging, exciting experience. He is sitting at a client: any machine that supports a browser. Most cross-platform issues arise on the client. It is nearly impossible for the developer to achieve cross-platform capability, so he should maximize both availability and robustness. In practice, Microsoft dominates the client base and Microsoft provides a robust set of tools.

MSI's WebLab is a family of Web based products including Gene Explorer, MedChem Explorer, Diversity Explorer and WebLab Viewer. It was originally designed to be cross-platform. It was heavily reliant on Java, and had a great look and feel, but poor cross-platform reliability.

In a Microsoft-centric solution, Windows integrates with the Web at the operating system level. The browser is not central to the Web experience, Windows itself is. Microsoft offers high quality Web development tools, superior to anything else on the market. Microsoft technology on the client side offers the IE browser; HTML and ActiveX documents for content; ActiveX controls for dynamic content (plug-ins are a dead end because they are very specific); VB and JScript scripting languages; and DCOM for distributed communication. You use COM objects not Java applets. ActiveX documents let users view and edit non-HTML documents through the browser, integrate existing documents into the browser or any other application, and merge menus and toolbars within the client application.

Hahn gave some examples. He showed WebLab Viewer standalone, then brought up IE and, while still within it, used WebLab Viewer. He manipulated a molecule within IE (this is part of ActiveX document technology), took a Word document, read it in IE and clicked on a molecule. He manipulated it within the Word document. You can also use a PowerPoint file in IE.

He also demonstrated ActiveX documents with frames. An application (WebLab, for example) can be used in a frame. A molecule can be manipulated within Word within a frame. The viewer can be in one frame and the Word document in a second frame, all active and live. Another example is conformational analysis. Hahn had a live graph on the left and the Viewer on the right. He clicked on peaks and troughs on the left and viewed related conformations on the right. He also did interactive playback using a scripting language. The same application (WebLab Viewer) was used in all the different contexts. You can link from PowerPoint to IE to Word to WebLab Viewer. Windows is the client of choice. Microsoft has a significant stake in Web development tools and ActiveX document technology integrates any content type.

Back to the top

Integrating Chemistry with Neighboring Information Universes

Dave Weininger, Daylight Chemical Information

Historically, information technology has supported the creation, development, maintenance and dissemination of data stores as large field-specific databases. Such databases are generally organized to allow queries for a particular specialty which have little or no meaning when applied to other types of databases. Although these systems may be very effective for specialists in a particular field, they are poorly suited for integration of informatics between different fields. This is unfortunate, because interdisciplinary research is where the action is today.

Chemical information systems are characterized by an information model which is based on associating data with molecular structure. This very powerful model has essentially transformed chemistry into a molecular science by providing a rigorous basis for the storage and retrieval of chemical information. It has also resulted in nearly intransigent data systems which are of little use to non-chemists. The shameful fact is that, with respect to interdisciplinary integration, chemists are worse than most. Integration with neighboring universes such as bioinformatics and genomics is essential.

In the chemical universe, sometimes the structure-data connection is overt, e.g., in a Thor database, but usually the connection between structure and data is indirect; e.g., in CAS and MDL systems, structural information is indexed with a registration number. As an example, Weininger asked "Are there any Japanese patents for Best?". He looked up the name "Best" in Derwent's World Drug Index (WDI) and found that "Best" is a name for diazepam in Argentina. He looked up the structure in SPRESI and found Japanese patent office numbers. Structure is a universal chemical language. No-one needed to agree on a standard name or number for "Best": the structure comes for free.
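
The "structure as a universal key" point can be sketched as a join on a canonical structure string; the SMILES and the patent entry below are illustrative placeholders, not real data:

```python
# Two hypothetical "universes", each keyed by the same canonical SMILES
# (an illustrative string standing in for diazepam). No shared name or
# registry number is needed: the structure itself is the key.
DIAZEPAM = "CN1C(=O)CN=C(c2ccccc2)c2cc(Cl)ccc21"   # illustrative SMILES

trade_names = {DIAZEPAM: ["Valium", "Best"]}
patents     = {DIAZEPAM: ["JP-0000000 (illustrative)"]}

def cross_lookup(smiles):
    """Join two databases on structure alone."""
    return {
        "names":   trade_names.get(smiles, []),
        "patents": patents.get(smiles, []),
    }

print(cross_lookup(DIAZEPAM)["names"])
```

The catch, of course, is that both sides must canonicalize structures the same way; that is exactly what systems such as Daylight's canonical SMILES provide.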

Other fields use other information indexing methods, often for historical reasons. We would like to search these databases by content rather than by predetermined index. Thus the Available Chemicals Directory (ACD) and PubMed/Entrez are in different information universes. The Web could be "indexed" on the chemical information model.

Neighboring universes are:

  • Chemical databases - our core, catalogs, registries, properties
  • Chemical literature - document-oriented informatics
  • Reaction databases - transformations
  • Process chemistry - large databases for few reactions
  • Combinatorial libraries - super-specialized representations
  • Natural products - both molecular and goop
  • Hedonics data
  • Chemical patents - a DMZ between legal and chemical universes
  • Bioinformatics data - primarily sequence-oriented data
  • Genomics data - special tools for special sequences
  • Polymer data - important information which is poorly organized
  • Materials data - typically lots of data for few substances
  • Crystallographic data
  • Modeling data - not currently database oriented, but could be
  • Computational results - a cache for computational services
  • Spectral data - more generally, LIMS-oriented systems
  • Clinical trials - large, specialized, underutilized datasets
  • Legal documentation - e.g., MSDS

We want to search chemical information by content: it needs re-indexing or reorganizing by content. Chemical databases use the core information model of chemistry and are relatively easy to integrate. The easiest way to integrate chemical and reaction databases is to cross-index reaction components, as with Current Chemical Reactions and Index Chemicus, but often people license one and not the other. Polymer data tends to be based on the primary reactant; thus polypropylene is seen as propylene. Process chemistry is hard to integrate. There is a large amount of data about relatively few reactions and it is rarely integrated with conventional chemical databases, so the potential benefits in combinatorial library design are not reaped. For materials there is lots of data about a few molecules. It could be integrated easily with chemical databases, but this is not done in practice, perhaps through lack of motivation.

LIMS data is often about sample numbers not structure. LIMS manufacturers are market driven: they produce turnkey solutions which do not interoperate. Spectral databases are often tied to LIMS systems and often published for use with proprietary software. The NIST Mass Spectral database is distributed as a flat file. [Steve Heller pointed out that it is partly on the Web and will soon be substructure searchable with Daylight software. NIST is looking at putting PDB on the Web too.] Special methods are needed for huge databases of combinatorial libraries. Experimental design and robotic control need linking to combinatorial libraries. Reagent acquisition and inventory control systems are in the standard chemical information universe and they need linking to combinatorial library databases. Hedonics information has aspects of both LIMS and combinatorial library data systems.

Markush structures from patents are in two incompatible systems from CAS and Derwent. There are no WWW services for generic structures yet. Patents are legal documents, used to obscure data. Crystallographic data for large molecules is in PDB; for small ones, in CSD. Conditions for use of CSD are restrictive but this may change. Bioinformatics and genomics data are primarily sequence-based. Entrez is linked to Genbank and Medline but chemical information is not well integrated with bioinformatics.

The emergence of Web technology provides both the motivation and method for cross-field data integration. It is not the technology that most chemical informatics people would have chosen, but it is here in a big way and it is here to stay. Integration of the universes could be a productive field of research.

Back to the top

Molecular Modeling through the World Wide Web

Peter Ertl, Novartis Crop Protection AG

Ertl also gave a presentation at the ECCC3 conference.

The classical approach has been to move from chemistry to modeling to chemistry. Novartis is trying to optimize the process as "chemistry & modeling" combined and to move modeling to the desktop. Unix systems are hard for bench chemists to use; Web based tools offer platform independence, ease of use, and high interactivity, with CGI, JavaScript, Java etc. An in-house Web-based molecular modeling and chemical information system has been in use (currently by more than 170 users) in Novartis, Basel since 1995. The system is aimed mainly at the bench chemists, enabling them to perform basic tasks including:

  • easy retrieval of molecules and related data from the company databases
  • creation and editing of molecules by using a structural editor written in Java
  • calculation of important molecular physicochemical properties
  • sophisticated molecular and 3D property visualization
  • interface to quantum chemical calculations
  • molecular and substituent similarity searches
  • interactive QSAR analysis.

Ertl demonstrated the molecular engine screen. Reference number, structure or SMILES can be input. CORINA, MOPAC and in-house programs for logP and other physical chemical parameters are available. Ertl showed the Java molecular editor with various molecular images, spacefill, ball and stick etc. Chemists can generate the GIF images. Ertl showed more screen shots of electrostatic potential and surface properties. Chemists can run quantum chemical calculations: an AM1 calculation was shown and 2D and 3D structures, MLP, MEP, HOMO and LUMO appeared in 6 windows on the screen.

Chemists can run similarity searches. There are more than 500,000 molecules in the database. Similarity in physical chemical properties is used. Tables can be made and statistical analyses (QSAR) performed. Then chemists can predict activities for new molecules.
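
As an illustration of the QSAR step (not Novartis's actual method), the simplest possible version fits measured activity against one computed property by least squares and then predicts activity for a new molecule; all numbers below are invented:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope*x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

logp     = [1.0, 2.0, 3.0, 4.0]   # computed property per molecule (invented)
activity = [2.1, 3.9, 6.1, 7.9]   # measured activity per molecule (invented)

slope, intercept = fit_line(logp, activity)
predict = lambda x: slope * x + intercept
print(round(predict(5.0), 2))     # predicted activity for a new molecule
```

Real QSAR uses several descriptors and, as discussed after the talk, cross-validation to guard against overtrained equations.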

Ertl also discussed the similarity of functional groups and the concept of bioisosterism. The chemist draws a molecule and an R group, indicating the point of attachment. He chooses search criteria, e.g., hydrophobic and electronic properties, size, and H-bond capabilities, and a search of 80,000 substituents is started. A list of bioisosteric substituents is output with similarity scores.
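
A property-based substituent search of this kind might be sketched as nearest-neighbour ranking in property space; the substituents and all property values below are invented for illustration:

```python
import math

# Each substituent is a vector of (hydrophobicity, size, H-bond donors,
# H-bond acceptors). Similar substituents are near neighbours. Invented data.
substituents = {
    "-OH":   (-0.7, 1.0, 1, 1),
    "-NH2":  (-1.0, 1.1, 2, 1),
    "-CH3":  ( 0.6, 1.2, 0, 0),
    "-Cl":   ( 0.7, 1.1, 0, 0),
    "-COOH": (-0.3, 1.8, 1, 2),
}

def bioisosteres(query_name, pool, top=3):
    """Rank substituents by Euclidean distance to the query's properties."""
    q = pool[query_name]
    def dist(name):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(pool[name], q)))
    others = [n for n in pool if n != query_name]
    return sorted(others, key=dist)[:top]

print(bioisosteres("-CH3", substituents))
```

With sensible property scales, classic bioisosteric pairs (such as -CH3 and -Cl here) come out as nearest neighbours.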

Ertl gave a diagram of the anatomy of a Web application. The Web server and Perl scripts are at the center, with double-headed arrows going out to the following components:

  • SMILES database
  • CORINA
  • MOPAC
  • MGP
  • database search (linked to a database of functional groups)
  • conversion of SMILES to 2D structures
  • generation of GIF images
  • display of answers
  • query handling - ME applet and Java-JS communication

The Java and C components were written in-house. CORINA and MOPAC are the only commercial components. Ertl listed these advantages of intranet Web-based molecular modeling tools:

  • very good acceptance by end-users
  • no introductory training necessary
  • easy maintenance
  • constant upgradability
  • zero license costs
  • no limit on the number of users.

After the paper, someone asked whether chemists can trust the answers they get. Obviously, care is needed in providing such tools for chemists. QSAR is a particular problem, but the number of parameters is limited and chemists use cross-validation, so they cannot use an overtrained equation. Someone suggested that predictions should be made for similar molecules only.

Back to the top

Chemical Dataflow Programming in a WWW Environment

Wolf-Dietrich Ihlenfeldt, Computer Chemistry Center, University of Erlangen-Nürnberg

This paper is available on the Web. Dataflow programming is a powerful paradigm not only for visualization but also for chemical information processing, provided that the mechanisms of data generation and transport are adapted to the peculiarities of chemical data objects and the operations to which they are subjected. In a standalone environment, implemented as a classical control program organizing the data flow and interaction, this approach promises a solution to the interoperability problem chemistry faces with the various incompatible, monolithic packages which were until recently the norm for chemical software. An important trend is to provide Web-based interfaces to established programs in the short term, and to develop more modular, encapsulated and Web-enabled application modules in new projects in the long term. However, combining such modules into a working environment still raises interoperability issues.

Ihlenfeldt introduced a general-purpose framework for the ad hoc combination of processing steps, I/O modules and other tools, using a general chemical information processing plug-in based on the CACTVS core library. He demonstrated a chemical information processing environment in which small modules, taken from various sources including remote Web sites acting as tool repositories, can be assembled graphically into complex processing pipelines and other structures. These storable setups can be supplied with structures from files, editors, databases or other sources; the structures are then passed along and processed according to the assembled sequence of processing steps. These steps can include calls to external programs and remote computational services for information generation, filtering and manipulation of compounds or their attached data, generation of new compounds and other data objects, export to display tools, or output of result data into local files for further analysis.

There are many opportunities for problem solving with computational or chemical information methods, but there are too many isolated, monolithic programs. What is lacking is the sharing of modules, the ad hoc combination of methods, and network/Internet integration for method sharing. A framework for open method integration and sharing is needed, with an extremely modular structure, which could handle all kinds of data and would have an external chemical property/data ontology. It should be run-time extensible in every aspect (properties, datatypes, chemical objects, I/O modules), with support for distributed, networked operation, and it should be usable at all experience levels.

The basic concept of a solution is to identify a small set of classes of operations, to provide visual objects representing operations, to have molecules interact with waystations while they travel and to have exact capabilities of active objects specified by scripts, modules etc. Workbench objects are input and output files, tools, visualizers and editors, database portals, molecular ensembles, computation requesters, archives and tables. Coupling of objects uses a dataflow paradigm and visual programming. Objects are stored in and retrieved from Internet repositories or databases.

Ihlenfeldt discussed an example setup, starting with connecting tools. A set of tools is selected and assembled into a pipeline. The pipeline sequence is initialized for a stream of molecule objects. The next steps are modify, compute and filter, before temporary collection. Finally there is visualization and output. In the input stream, a trusted I/O object accesses local files and acts as a generator of molecule objects (e.g., reactions and tables). I/O modules are extensible. There is a standard set of postprocessing directives, or one can add more tools. Property computation allows arbitrary data to be computed. Possible classes of operation are: script or wrapper, local; shared library, local; client/server, external, synchronous; and client/server, external, asynchronous. Trust/access levels are adjustable. Parameterized tools may present panels. The parameters, which can be saved, are used for action. They may change the direction of flow, delete, or create. The visualizer has a bi-directional interface to other CACTVS programs. Tools can act as wrappers for standard molecular graphics programs. There are passive visualizers and active editors.
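
The dataflow idea can be caricatured with Python generators; the tool and property names below are invented, and this is only a sketch of the paradigm, not of CACTVS itself:

```python
# Molecule objects stream through a pipeline of small tools, each of
# which may compute new properties, modify objects, or filter them out.

def source(records):
    for r in records:
        yield dict(r)                        # emit fresh molecule objects

def compute(stream, prop, fn):
    for mol in stream:
        mol[prop] = fn(mol)                  # attach a computed property
        yield mol

def filter_by(stream, pred):
    for mol in stream:
        if pred(mol):                        # pass only matching objects
            yield mol

mols = [{"name": "A", "atoms": 9}, {"name": "B", "atoms": 30}]
pipe = filter_by(
    compute(source(mols), "heavy", lambda m: m["atoms"] > 20),
    lambda m: m["heavy"],
)
print([m["name"] for m in pipe])
```

Each stage knows nothing about its neighbours, which is the point: tools can be recombined ad hoc, and a stage could just as well wrap an external program or a remote computational service.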

"Initialization", "Sequence Init", "Action", "Sequence End", and "Destruction" are standard steps in event scheduling. Processing objects bind functions to events: asynchronous events are error, timeout, and computation completion. Event sequence and operation modes are controlled by a built-in scheduler.

Next, Ihlenfeldt considered basic tool operation procedures. Every tool object operates in an isolated Tcl interpreter whose capabilities are adapted to the trust level. A tool receives lists of incoming objects. It can examine current data, modify, filter and add. It can create new objects and delete them, and it schedules objects for transport. A tool can post additional events ("seqinit", "error"). It does not know about the environment: the only global communication links are object handles.

Application levels are related to user experience:

Level A: Turnkey operation setup, fixed parameters

Level B: Turnkey setup, adjustable parameters

Level C: Assembly of prebuilt tools

Level D: VHLL script/tool/wrapper developer

Level E: Shared library/client-server developer

At the core of the CACTVS toolkit are scriptable interpreters with high-level chemical functionality. Applications are short scripts, with or without GUI. The application domain may be batch processing, rapid prototyping, CGI, data integration, or custom applications. The toolkit implements an open data model with property ontology, method lookup and consistency management: there are no limits to the kind of information processed. The toolkit offers portable scripting for all environments: it runs on 10 UNIX variants plus WinNT 4.0, standalone or in a WWW browser plug-in.

To use the CACTVS Workbench on the WWW, you install the basic CACTVS plug-in, which loads the latest version of the workbench (or a preconfigured set-up). You browse for tools and data, import tools, datasets, properties, visualizers etc., initiate operation sequences for results and save the set-up and parameters for reuse as copy or references.

Back to the top

Chemical Needles in Haystacks: Meta-data, MIME, Markup and Models

Henry Rzepa, Imperial College

Full details of this talk are available on the Web. In 1863, Samuel Butler wrote "... the general development of the human race…be well and effectually completed when all men, in all places, without any loss of time, at a low rate of charge, are cognizant through their senses, of all that they desire to be cognizant of in all other places.". In 1998, Tim Berners-Lee said "Once the Web has been sufficiently 'populated' with rich meta-data, what can we expect? First, searching on the Web will become easier as search engines have more information available, and thus searching can be more focused. Doors will also be opened for automated software agents to roam the Web, looking for information for us or transacting business on our behalf. The Web of today, the vast unstructured mass of information, may in the future be transformed into something more manageable - and thus something far more useful".

The Internet is a globally available resource, with standards overseen by both the W3C and subject bodies such as IUPAC. Intranets are closed systems implementing components of global standards adapted to local requirements, with access via firewalls. Extranets are global systems with access only via authentication, implementing global standards augmented by bespoke solutions. This talk is concerned primarily with the Internet plus secure access to intranets and extranets.

The first Internet search engine, Lycos, was invented in 1994 by Michael Mauldin. This design, based on the title field in structured HTML documents, allowed indexing of 30,000,000 documents in 4 weeks. In a search Rzepa ran in May 1998 for "Chemical", AltaVista found 3,886,160 "matches". The top hits:

1. American Institute of Chemical Engineers Welcome Page
2. Institute of Chemical Technology of Food
3. Chemical Online Home Page - Chemical Online: Community and marketplace
4. Chemical Engineering WWW Server
5. Ashland Chemical Company

These were not necessarily what everyone would consider the top chemistry sites. Unfortunately, the title field is often blank or irrelevant (an inherited MS Word title), and you cannot "force" document association with chemistry. Much content is hidden in extranet or legacy "databases", and the "impact factor" is of low value.

Can signposts be added? Around 1997, matters started getting a little better. For example, a page could be added to AltaVista. The help page mentions meta-data tags e.g.,
<META name="description" content="We specialize in grooming pink poodles.">
<META name="keywords" content="pet grooming, Palo Alto, dog">

AltaVista will then do two things. It will index both fields as words, so a search on either "poodles" or "dog" will match, and it will return the description with the URL: "Pink Poodles Inc. We specialize in grooming pink poodles". AltaVista will index the description and keywords up to a limit of 1,024 characters.
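
A meta-data-aware indexer of this sort can be sketched in a few lines; the parser below is illustrative (not AltaVista's actual code) and pulls out the description and keywords fields from the example above:

```python
from html.parser import HTMLParser

class MetaReader(HTMLParser):
    """Collect <META name=... content=...> pairs from a page."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":                    # tag names arrive lowercased
            a = dict(attrs)
            if "name" in a and "content" in a:
                self.meta[a["name"].lower()] = a["content"]

page = '''<html><head>
<META name="description" content="We specialize in grooming pink poodles.">
<META name="keywords" content="pet grooming, Palo Alto, dog">
</head></html>'''

reader = MetaReader()
reader.feed(page)
print(reader.meta["keywords"])
```

An engine would then index the extracted words alongside the page body and return the description with the URL in its hit list.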

Few authors of chemical pages bother to use the feature whereas some (usually commercial) sites abuse it. The Netscape Portal appears to disregard meta-declarations; hit ranking is related to word frequency and it seems likely that presence on the list can be purchased. Manual portals (such as CAS) do a human review of sites but this is expensive, out of date, and arbitrary. How good are the portals at finding needles in haystacks? Rzepa tried the URL http://www.ch.ic.ac.uk/motm/viagra.html in the Netscape Portal. The URL was declared to be related to

2.Molecule of the Month

3.Oxford

4.Molecule of the Month

5.Chemscape Chime: Download

6.The Nobel Prize in Chemistry 1996

7.Press Release: The Nobel Prize in Chemistry 1997

8.Chemistry on the Internet

9.Ask AltaVista for backlinks to this page.

10.Australian Chemistry Network (OzChemNet)

11.School of Chemistry, University College (UNSW) 1997 Handbook Entry

The AltaVista advanced portal also produced an odd mixture, admittedly including two molecules of the month. Rzepa showed the Viagra molecule of the month meta-data, which has arbitrary chemical semantics: HTML has no formal way of expressing chemistry.

What are the essentials for a chemical portal? Ninety-nine per cent of molecules on the Internet are currently expressed as GIF illustrations. Use of the "alt" field would only be a very partial solution. Should someone produce a search engine indexing chemical ALT fields?

The Dublin Core (DC) Model for meta-data is defined in Internet RFC 2413. Meta-data describes an information resource: it is data about other data. Rzepa gave an example. The Resource Description Framework (RDF) is the basic language for writing meta-data, a foundation which provides a robust, flexible architecture for processing meta-data on the Internet. RDF will retain the capability to exchange meta-data between application communities, while allowing each community to define and use the meta-data that best serves its needs.

XML (the eXtensible Markup Language) is the encoding syntax for RDF. Rzepa listed some search engines supporting meta-data:

Berkeley SWISH plus Crawler

Australian MetaWeb Project including Dublin-Core Harvester

PrismEd a configurable meta-data editor which will cope with Dublin Core plus RDF

Commercial ones including Verity Search97, Infoseek's Ultraseek, AltaVista Search97, and Microsoft Site Server 3.0

Search Engine Watch: http://www.searchenginewatch.com/ and

Internet Detective: A QA Tutorial.

See also Wes Sonnenreich: Guide to Search Engines, Wiley Computer Publishing, 1998, price: US$ 34.99

The DC-CHEM Model might be a model for chemical implementation:

<META NAME="DC-CHEM.coordinates" CONTENT="mdl-molfile 2D">

<META NAME="DC-CHEM.substance" CONTENT="C22H30N6O4S">

<META NAME="DC-CHEM.computation-simulation" CONTENT="PM3-Quantum">

<META NAME="DC-CHEM.biological-activity" CONTENT="">

<META NAME="DC-CHEM.safety" CONTENT="-">

<META NAME="DC-CHEM.synthesis" CONTENT="">

<META NAME="DC-CHEM.characterisation" CONTENT="MP, HPLC, IR, 1H NMR">

<META NAME="DC-CHEM.instrumentation" CONTENT="-">

<META NAME="DC-CHEM.physicochemical-data" CONTENT="-">

<META NAME="DC-CHEM.reaction-data" CONTENT="-">

<META NAME="DC-CHEM.crystallography" CONTENT="">
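A harvesting robot could pick such DC-CHEM fields out of a page with very little code. The following Python sketch illustrates the idea; it is not the behavior of any tool named in this report, and the sample page content is invented.

```python
from html.parser import HTMLParser

class DCChemHarvester(HTMLParser):
    """Collect DC-CHEM meta-data fields from an HTML page (illustrative sketch)."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs; names are lowercased
        if tag == "meta":
            a = dict(attrs)
            name = a.get("name", "")
            if name.startswith("DC-CHEM."):
                self.fields[name] = a.get("content", "")

page = ('<meta name="DC-CHEM.substance" content="C22H30N6O4S">'
        '<meta name="DC-CHEM.coordinates" content="mdl-molfile 2D">')
h = DCChemHarvester()
h.feed(page)
print(h.fields)
```

A crawler built on this core could index pages by molecular formula or coordinate type, exactly the kind of chemical retrieval that word-frequency portals cannot offer.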

MIME-based document components might replace illustrations with models. Papers on Chemical MIME should soon appear in J. Chem. Inf. Comput. Sci. and Pure and Applied Chemistry. Rzepa showed a display from an MDL Molfile plus JCAMP Analytical data (plus Gaussian cube files) plus XYZ animation.

Both Ihlenfeldt and Brecher have highlighted a problem in indexing and searching for data from models. A chembot.txt file (perhaps based on chemical MIME?) might help robots detect such data. It is also a good idea to put "alt" fields in model data. Other problems remain: there is no easy way of indexing analytical data, no standard linking mechanism between the two models (.ano file versus Chime method?), and no way of indexing and searching for the links. What about embed versus applet elements? HTML 4.0 goes some way toward resolving this. A more structured approach is needed.

There is a need for multi-component chemical models, for example, the VRML approach (see Casher, Leach, Page and Rzepa, Chemistry in Britain, 1998, 34(9), 26). Rzepa showed the solvation potential energy surface of 1-(4-methoxyphenyl) cyclohexane-3,5-diol, containing components of molecular structure, geometry, energies, Hamiltonians (and insight!). A VRML model is more structured than a GIF image, for example. There has been some progress in indexing such multi-component models. Links to searches via extranet-type databases are possible. A VRML model can easily be disassembled to component parts and it can have "methods" associated via script nodes. Interdisciplinary links to models in mathematics, bioinformatics and physics are possible.

Rzepa turned to chemical objects and methods, from the CGI web server-client model in 1994 to the distributed computing/application server model in 1998. The current industry focus is on object relational datablades and cartridges etc. for legacy systems (in Oracle) as an intranet and extranet solution. The life sciences working group of the OMG is strong in bioinformatics, but currently weak in chemistry. There are distributed computing models with Java, CORBA and RMI (remote method invocation). Rzepa's own team is working on a Chemical Object Store (COS) using remote method invocation, an Object Request Broker (ORB) for Java; on serialization of molecule object information into an ObjectStore; on authentication and signing of Chemical Objects via the Belsign Agency; and on retrieval of molecule object information and reconstitution as PDB for Chime display and local use. Other projects are jSPEC, a signed Java object for analytical data; Molda, a signed Java object for molecular visualization; and Quest from Jobjects, a Java object for CD-ROM based indexing and searching. Cherwell Scientific has ChemSymphony Beans. Ideally, all these various tools should be converted to molecular components for re-usability but who can or will coordinate this?

One way of describing molecular components and objects is with XML and Chemical Markup Language: Rzepa showed a Web page of Peter Murray-Rust, at the Virtual School of Molecular Sciences, University of Nottingham. Self-describing (chemical) information is needed if the Web is to be a seamless database. There needs to be more effective use of tools for simple meta-data and DC. Rzepa has prepared an IUPAC discussion paper on a global consensus on DC-CHEM. Is there a need for an indexing and searching resource in chemistry akin to AltaVista or Yahoo? This could be an alternative or a complement to CAS and Beilstein. Where will the chemistry go: on Internet or extranet?

Rzepa emphasized the need for authors to capture data for expression as a model or structured component document. There is increasing use of chemical methods expressed as modular objects but authentication procedures such as document and object signing (not encryption) are needed. More finely grained (chemical) content can be handled with XML and Chemical Markup Language. This will lead to inter-operability within chemistry and with other disciplines such as mathematics, for which there is MathML.

Back to the top

The TelePresence Microscopy Collaboratory and DOE 2000

Thomas Pierce, Argonne National Laboratory

Zaluzec's 89 slides are available in the ChemWeb/VEI library. The key to all experiments is the interaction of investigators with instrumentation, data and collaborators. In working with instruments the experimentalist needs to monitor the progress of the experiment and to have real-time control. Investigators working with data need real time access to current results, analysis tools and a high performance data engine. Working with collaborators, investigators need to discuss the experimental progress, view data while it is being acquired, sketch out trends and access supporting documentation.

Zaluzec described the challenges. There are all sorts of controls: instrumentation, data, standards, legacy platforms, human factors, time and budget. Mechanisms include networks, servers, browsers and tools. There is input from users and enablers, and a new paradigm for interactive R&D and education leads to the output. Functional requirements for collaboration are persistence, sharable entities, sharing techniques, methods and protocols, session and access control, discovery mechanisms, transport mechanisms, resource management and real-world interfaces. Zaluzec listed some key issues in more detail, under the headings collaboration functions, persistence functions, device interactions, security and enabling functions.

He gave reasons why a WWW site is an ideal prototype model of a persistent electronic space. A persistent electronic laboratory is always there, with a collaboratory interaction zone. There are active and passive access controls, and the space is scalable. Sharable entities in the collaboratory include people and expertise, data, instrumentation, application programs, and sessions. Fixed data and time-sensitive data are shared in a laboratory/office environment using tele- or video-conferencing, and instrumentation. Session control must not get in the way of doing work, but access levels, passwords, user certificates and encrypted keys are needed. Discovery mechanisms include search for the various sharable entities (state sensitive) and directory services. If security gets in the way it will not be used. If security is too weak, valuable assets are at risk. The Web-based prototype uses client and server certificates for authentication, SSL encryption, and host data protection via directory access. The Entrust security context is used and a security services engine is created. Zaluzec listed the hardware (e.g., network protocol) and software (e.g., data protocol) infrastructure requirements for transport mechanisms. Resource management involves availability of people and instruments, access control and floor control. Zaluzec listed the features of real-world interfaces: hardware, software and computational.

Microscopy is one of the few methodologies applied to nearly every field of science and technology. Microscopes vary in complexity, right up to multimillion dollar research tools. Electron microscopy (EM) and microanalysis are experimental methodologies which employ electron-optical instrumentation to characterize matter spatially on scales which range from tenths of a millimeter to tenths of a nanometer. The principal modalities used are imaging, diffraction and spectroscopy. Zaluzec showed photographs of the components of the Advanced Analytical Electron Microscope at Argonne National Laboratory (ANL), and a QuickTime movie.

The TelePresence Microscopy Collaboratory, TPM, is a persistent virtual location around centers of scientific interest. It integrates operation and control of scientific experiments and provides opportunities for distance learning and remote collaboration. It also provides a set of requirements which taxes the limits of the Internet. Benefits include access to unique research tools, persistent electronic laboratories, and education, teaching and training to and from remote sites. Zaluzec showed a graphic of the architecture and gave tables of demographics of usage: 51% educational and 67% in the USA.

The collaboratory is platform independent, with an intuitive GUI. The system is responsive to the user and adaptable to a wide range of instruments, and it provides what the user needs to do the experiment. Test bed sites are the ANL analytical electron microscopy instrumentation project and the DOE 2000 Materials Microcharacterization Collaboratory, MMC.

TPM provides access to an instrument room, instrument status, experimental data, online control, video conferencing and electronic notebooks. Zaluzec gave a diagram of the current architecture for instrument access and control. The ANL WWW TPM server provides platform independent access. Zaluzec also illustrated the hardware architecture and the "next generation" software architecture. Remote operations are conventional imaging, diffraction, high resolution imaging and spectroscopy. Persistent electronic space tools are data archiving, session archiving, data mining and electronic notebooks.

An electronic notebook is a repository for objects that document scientific research: text, numerical data, images and drawings. Data can be input, retrieved and queried. Why use an electronic notebook? Virtual laboratories encourage shared remote access to expensive, state-of-the-art resources. Remote control of scientific instruments logically requires online documentation of capabilities and data. Collaboration of distributed researchers is enhanced by a common record-keeping device. Zaluzec gave a diagram of the notebook architecture and a table to show the challenging bandwidths issues involved in imaging over the Net.
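The notebook described is, at heart, a store of typed entries that can be added, retrieved, and queried. The following minimal sketch illustrates that idea; the class and method names are invented for illustration and are not the ANL notebook's actual API.

```python
import datetime

class Notebook:
    """Illustrative electronic-notebook store: typed entries with simple querying."""
    def __init__(self):
        self.entries = []

    def add(self, kind, content, author):
        # kind might be "text", "image", "data", or "drawing"
        self.entries.append({
            "kind": kind,
            "content": content,
            "author": author,
            "when": datetime.datetime.now(),
        })

    def query(self, kind=None, author=None):
        # return every entry matching all supplied criteria
        return [e for e in self.entries
                if (kind is None or e["kind"] == kind)
                and (author is None or e["author"] == author)]

nb = Notebook()
nb.add("text", "Beam aligned at 300 kV", "zaluzec")
nb.add("data", [0.12, 0.15, 0.11], "zaluzec")
print(len(nb.query(kind="data")))
```

A shared instance of such a store, reachable by all collaborators, is what makes the common record-keeping described above possible.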

The goals of DOE 2000 are improved ability to solve DOE's complex scientific problems, increased R&D productivity and efficiency, and enhanced access to DOE resources by R&D partners. Strategies are to build national collaboratories, build an Advanced Computational Testing and Simulation (ACTS) toolkit, provide an authentication and security infrastructure, foster partnerships, use off-the-shelf solutions whenever possible, and conduct R&D when necessary to meet objectives.

National collaboratories put unique or expensive DOE research facilities on the Internet for remote collaboration, experimentation, production or measurement. They provide collaborative tools such as video conferencing, shared electronic notebooks, shared whiteboards, shared document creation, and shared data viewing and analysis. DOE 2000 technology R&D projects address the collaborative integration framework, electronic notebooks, collaborative session management, shared virtual spaces, scalable security architecture, ESnet quality of service, and floor control. ANL, LBNL, NIST, ORNL, and the University of Illinois collaborate in MMC. Zaluzec showed various screens illustrating multisite collaboration and a large table of the MMC instrumentation resources available at the various sites. Network-based video-conferencing is possible but immature.

Zaluzec showed a screen for some Java-based tools in microscopy and microanalysis. The MMC/TPM collaboratory revolves around a common theme of microscopy and microanalysis applied to both education and research. By placing creative scientists, having varying complementary expertise, together in a new environment which allows convenient, rapid and dynamic interactions to flow unencumbered by the limits of time and distance, TPM expects not only to foster, but to enhance, the ability of those individuals to conceive and execute scientific research.

Back to the top

TeleSpec - Telecooperation in Spectroscopy / Infrared Spectrum Prediction and Interpretation via the Internet

Paul Selzer, Jan Schuur, Markus Hemmer, Valentin Steinhauer, and Johann Gasteiger, Computer-Chemie-Centrum, University Erlangen-Nürnberg

Slides for this talk are on the Web.

Substance identification by IR is usually performed by comparing an experimental spectrum with a reference spectrum from a spectrum library. This identification technique assumes that a reference spectrum for the query spectrum is available, but the largest infrared spectral database stores only about 100,000 spectra, whereas over 16,000,000 chemical compounds are known. The team at Erlangen has therefore developed a method, based on a combination of a neural network with a novel structure coding scheme, that allows rapid simulation of infrared spectra (Schuur, J. H.; Selzer, P.; Gasteiger, J. J. Chem. Inf. Comput. Sci. 1996, 36, 334-344). The method is available via the Internet as part of the TeleSpec project. A user can perform interactive spectra simulation experiments and interpret the results by online analysis of the neural network. The aim of the project is to establish an Internet-based spectrum collection, discussion and interpretation forum.

Neural networks can be used to study complicated relationships such as those between spectra and structures. They are capable of inductive learning; training and prediction are separate experiments; predictions are fast; and there are no limits on molecule size. In the method, a structure is converted to a fixed length code which is input to a counter-propagation (CPG) network. A fixed length code for an IR spectrum is output and the IR spectrum is thus displayed.

The structure code is known as a 3D MoRSE code (3D-Molecule Representation of Structures based on Electron Diffraction) details of which have been published. To generate the code, a 2D structure is drawn and it is converted to a 3D structure with CORINA. Atomic properties for the 3D structure are calculated by PETRA and the code is made with a program called CODE3D. IR spectra have commonly been simulated by inputting substructures for a molecule to a statistical program, or pattern recognition, or neural networks. No matter how many substructures are used, the list will never be complete. MoRSE code does not have this disadvantage.
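The pair-sum underlying the 3D MoRSE transform, I(s) = Σ over atom pairs of p_i p_j sin(s·r_ij)/(s·r_ij), can be sketched in a few lines. The atom properties, coordinates, and s values below are invented for illustration; the real implementation (CODE3D, using PETRA-derived properties) is more elaborate.

```python
import math

def morse_code(atoms, s_values):
    """3D-MoRSE sketch: I(s) = sum_{i<j} p_i * p_j * sin(s*r_ij)/(s*r_ij).
    `atoms` is a list of (property, (x, y, z)) tuples."""
    code = []
    for s in s_values:
        total = 0.0
        for i in range(len(atoms)):
            p_i, xyz_i = atoms[i]
            for j in range(i + 1, len(atoms)):
                p_j, xyz_j = atoms[j]
                r = math.dist(xyz_i, xyz_j)          # interatomic distance
                sr = s * r
                # sin(sr)/sr -> 1 as sr -> 0
                total += p_i * p_j * (math.sin(sr) / sr if sr else 1.0)
        code.append(total)
    return code

# Two unit-property atoms 1 angstrom apart, sampled at a few s values:
atoms = [(1.0, (0.0, 0.0, 0.0)), (1.0, (1.0, 0.0, 0.0))]
print(morse_code(atoms, [1.0, 2.0, 3.0]))
```

Because the code length depends only on the number of s values sampled, molecules of any size map onto a fixed-length vector, which is what lets the code serve as neural-network input.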

To train the neural network, structure and spectra codes are input to the cube (x times y neurons and z weights) of the CPG network. The neuron which has weights most similar to the input vector is determined and the weights of the neurons are adjusted such that they become more similar to the input vector. The training set was made by searching the SpecInfo database with a query structure and selecting the most similar molecules.
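The winner-take-all update just described can be sketched as a single training step. This is a bare-bones illustration: the neighborhood adjustment of a real CPG network is omitted, and all numbers are invented.

```python
def train_step(neurons, structure_code, spectrum_code, rate=0.5):
    """One simplified counter-propagation step: find the neuron whose input
    weights best match the structure code, then pull both its input and
    output weights toward the presented (structure, spectrum) pair."""
    def dist2(w, x):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))

    winner = min(neurons, key=lambda n: dist2(n["in"], structure_code))
    winner["in"] = [w + rate * (x - w) for w, x in zip(winner["in"], structure_code)]
    winner["out"] = [w + rate * (y - w) for w, y in zip(winner["out"], spectrum_code)]
    return winner

neurons = [{"in": [0.0, 0.0], "out": [0.0]},
           {"in": [1.0, 1.0], "out": [0.0]}]
w = train_step(neurons, [0.9, 1.1], [0.8])
print(w["in"], w["out"])
```

After training, prediction amounts to presenting a structure code alone: the winning neuron's output weights are read off as the simulated spectrum.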

Selzer showed good correlation (r=0.899) between the simulated and experimental IR spectra for a triazine molecule, using a training set (unsupervised) of 50 molecules. He showed analysis of a CPG network: assignment of molecules from the training set in a 2D square grid where one triazine occupied a central square and similar triazines were in 7 of the 8 squares around it.
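The correlation coefficient r used to compare a simulated with an experimental spectrum can be computed as Pearson's r over the two spectrum vectors; a minimal sketch (the vectors here are invented):

```python
import math

def pearson_r(a, b):
    """Pearson correlation between two equal-length spectrum vectors."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

simulated = [0.1, 0.5, 0.9, 0.4, 0.2]
experimental = [0.15, 0.45, 0.85, 0.5, 0.25]
print(round(pearson_r(simulated, experimental), 3))
```

An r near 1 means the simulated band pattern tracks the experimental one closely, which is how values like 0.899 above should be read.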

In one approach to structure prediction, potential degradation products of an unknown natural product are generated using a reaction prediction program, these products are input to a neural network and the IR spectra are simulated. The natural product is submitted to IR spectroscopic analysis. The experimental and simulated IR spectra are then compared.

Selzer showed a tree of structures obtained from the reaction prediction program, from hydrolysis or reductive dealkylation of cyanazine. He showed a tree of the corresponding simulated IR spectra. Three experimental spectra had r=0.885, 0.987 and 0.719 in comparison with the simulated spectra. The proposed triazine structures in the first two cases were correct. The third proposed structure had two amino substituents and one hydroxy substituent on the ring carbon atoms, while the correct structure had two hydroxy groups and one amino.

The method can simulate IR spectra over the entire frequency range, it is rapid, and there are no limits on the size of the molecule. However, the quality of a result is dependent on the data available. Selzer concluded with some slides showing the sort of screens seen by users who simulate spectra in TeleSpec.

Back to the top

ACD/ILAB: Connecting Distributed Chemical Information Resources to a Unified Web Front End

Valeri Kulkov, ACD Labs

This talk is available in the ChemWeb-VEI library. The ACD/ILab project started at ACD in June, 1996, initially as an attempt to move existing desktop-based ACD software into the client-server environment. A Web interface seemed to be the most logical way to create a platform-independent client. By August 1996, an IUPAC name generating program, a logP predictor, boiling point and vapor pressure predictors, an H-NMR predictor and a C13-NMR predictor were ported to the ACD/ILab in the form of regular CGI scripts.

Those services were followed by a pKa predictor and pKa database search, a CAS name generating program, and the logD, aqueous solubility, adsorption coefficient, and bioconcentration factor predictors. ACD/ILab operated on a totally free basis for over a year. On October 1st, 1997, ACD/ILab became a commercial service. However, eight services remain free: surface tension, density, molar volume, molar refractivity, refractive index, dielectric constant, polarizability and parachor. The newest additions to ILab include ChemSketch as a Windows client, P-NMR and F-NMR predictions, C-13 and H-NMR database searches (including substructure search capability), and a structure elucidation service. The total number of services approaches 30.

Since April 1997, ACD/ILab has been visited by over 3100 users from over 100 countries who have performed approximately 80,000 predictions and database searches to date. The six most popular ILab services are:

H-NMR prediction (43.6%)

IUPAC Name generation (20.6%)

C-13 NMR prediction (13.3%)

pKa prediction (6.5%)

Boiling Point prediction (5.6%), and

LogP prediction (4.0%)

The development of Web applications remains a big challenge, even with the many visual development tools available. The main problems are: the lack of reusable tools for handling chemical objects; the lack of GUI components in HTML, which makes it nearly impossible to create a truly intuitive user interface; incompatibility between different browsers, and even between different versions of the same browser; bugs in browsers and in browser Java implementations; security restrictions imposed on Java applets; network bandwidth limitations; limitations of the HTTP protocol; and the steep learning curve for server-side development.

If a commercial service is planned for deployment on the Internet, concept design, planning, user database development and Web integration are also significant problems. From the user's point of view, learning new Web-based information services also presents a challenge, since there are currently no standard ways to handle chemical data objects and reports. Thus, users are likely to have to learn the user interface functions afresh each time they work with a new chemical information resource.

The ACD/ILab Open Server Interface is designed with the idea of reusing the same chemically-intelligent user interface to provide access to third-party information resources. ACD/ILab relies on Java applets in handling chemical structures and spectra. Thus, no client software installation is required to take full advantage of the structure drawing and spectrum visualization capabilities of the ILab. There is no need to perform software upgrades since the most recent Java classes are loaded from the ACD/ILab server automatically. ACD/ILab is designed with the idea of providing maximum ease of use. Future versions of the HTML interface will employ advanced features of Netscape Communicator and MS Internet Explorer. For compatibility with older browsers, a "generic" HTML interface will be maintained. A recent addition to the family of ILab clients includes ChemSketch Online, a free structure drawing package for Windows. This communicates directly with the ILab server, using HTTP and SSL protocols.

All client software providing access to the ILab resources is free. ACD reserves the trademark name and copyright on the client software as well as on the server-based information resources. Providers of information resources will establish their own licensing terms for the use of information derived from their sources. ACD/Labs is not going to charge non-commercial providers for hosting their information resources at the ILab. Commercial providers are expected to enter into a dealership agreement with ACD/Labs who will assume responsibility for accounting, billing and collecting payments from the ILab customers, thus avoiding multiple user accounts and multiple billings when customers use services from different providers.

ACD/ILab operates in three different modes. Pay-per-transaction mode implies that a user is billed a specific amount per transaction performed. Subscriptions allow a user to gain access to the ILab resources for a limited time without imposing any limitations on the number of transactions. Site licenses are established for institutions that purchase specific subscriptions for a number of clients. Clients are identified by their IP addresses and they do not have to enter their login names and passwords.

The ACD Structure Drawing Applet (ACD/SDA) is a complete structure drawing, editing and visualization tool written for JDK 1.0.2 and compatible with most Java-enabled browsers. The applet can be used for composing substructure queries to databases and visualizing results. It is platform independent. It is also chemically-intelligent: it understands valency and atom charges and sets them automatically as the user draws a structure. If a mistake is made, the applet shows where. Kulkov demonstrated many of the applet's features.
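The kind of valency bookkeeping described might look like the following sketch. This is a crude table lookup for illustration only; ACD's actual rules, covering charges, radicals and less common valence states, are certainly richer.

```python
# Common neutral valences for a handful of elements (illustrative subset)
COMMON_VALENCES = {"C": 4, "N": 3, "O": 2, "H": 1, "S": 6}

def check_atom(element, bond_order_sum, charge=0):
    """Flag an atom whose total bond order exceeds a common valence.
    Returns True (OK), False (flagged), or None (unknown element)."""
    allowed = COMMON_VALENCES.get(element)
    if allowed is None:
        return None  # unknown element: no opinion
    return bond_order_sum - charge <= allowed

print(check_atom("C", 4))             # tetravalent carbon: fine
print(check_atom("O", 3))             # trivalent neutral oxygen: flagged
print(check_atom("N", 4, charge=1))   # ammonium-like nitrogen: fine
```

Run after every bond the user draws, such a check is what lets an editor highlight the offending atom immediately rather than rejecting the finished structure.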

The Universal Spectrum Display Applet (ACD/USVA) reads spectroscopic files, in JCAMP, SPC or netCDF formats, and provides visual representation of a spectrum. The ACD Predicted NMR Spectrum Display Applet is used to show the results of NMR predictions at the ACD/ILab. The applet plots a spectroscopic curve along with the corresponding structure and tables of chemical shifts and coupling constants. As the user moves the mouse cursor over the spectrum display area, the peaks will become highlighted as well as the atoms they correspond to in the molecule, and vice versa. All these applets are compliant with JDK 1.0.2 specifications, compatible with most Java-enabled browsers, and platform independent.
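JCAMP files are built from labelled data records of the form ##LABEL=value. A minimal reader for those records is sketched below; real JCAMP-DX also defines compressed data tables (e.g., XYDATA forms) that this sketch ignores, and the sample text is invented.

```python
def read_jcamp_labels(text):
    """Collect the labelled data records (##LABEL=value) of a JCAMP-DX file."""
    records = {}
    for line in text.splitlines():
        if line.startswith("##"):
            label, _, value = line[2:].partition("=")
            records[label.strip()] = value.strip()
    return records

sample = "##TITLE=demo spectrum\n##XUNITS=1/CM\n##YUNITS=TRANSMITTANCE\n##END=\n"
print(read_jcamp_labels(sample)["TITLE"])
```

Once the header records are in hand, an applet knows the axis units and data form and can go on to decode and plot the spectrum itself.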

ACD/ChemSketch Online is a Windows client to the ACD/ILab. It has an intuitive chemical structure drawing interface, automatically launches helper applications to display chemical data objects such as spectra, and includes a free spectrum viewer. Network traffic is minimal since only data files are transferred. A report generation facility is included and there is no need of Java-capable browser software (a feature especially useful for Windows 3.1 users).

The ILab Transaction Server (ITS) is the central part of the ILab server. The ITS has evolved from the simple CGI scripts of the beginning of the ILab project into a fairly sophisticated piece of software. It has a typical three-tier architecture and includes a Connection Holder, Resource Locator, Resource Monitor, Payment Processor, Service Handlers, Transaction Processor and many other pieces working together. There is a client database that communicates with the ILab transactions and services. Those services will encompass CORBA resources, Java classes, Enterprise JavaBeans and various kinds of network resources that can be connected to the ILab Transaction Server.

Open Server API for the ACD/ILab was in its alpha state in September, 1998. Before the final release can be made, licensing terms and guidelines for resource providers must be finalized. The released version may also include wizards for easier building of Java code and HTML forms.

Back to the top

Virtual Education

Tim Brailsford

The nature of higher education is changing and is characterized by more students, a lower staff-to-student ratio, and lower per capita funding. So, more cost-effective teaching methods are needed and new markets are sought. Scientific knowledge is increasing and becoming more specialized. Science is changing rapidly, particularly the molecular sciences. Thus, there is a need for specialized postgraduate courses for mid-career students.

Mid-career courses have to be relevant to industry (perhaps involving collaboration), flexible (students need to be able to study in the workplace or in their spare time), and up to date. Options are short courses, part-time courses, distance-learning and virtual education. All have their snags.

Chemical education on the Internet (virtual education and molecular resources) is an innovation of the Virtual School of Molecular Sciences at Nottingham University. Short courses and a course on structure based drug design are run. The latter is accredited by the University of Nottingham. One third of students on the Masters course already have PhDs.

Virtual Education (VE) is not CAL, although CAL may be part of VE. The most important thing about virtual education is not technology, but content, community, etc. The benefits of VE are as follows. There are no geographical limitations - anyone with Internet access can participate. Students study at their own speed, in their own time and at the most convenient location. VE is highly flexible: different study programs can be provided for different students. There are no time-tabling problems.

Technology for VE should be universal (platform independent), robust, easy-to-use, and cheap, or free. Bleeding edge technology is to be avoided, so at Nottingham they use HTML content, viewable with common WWW browsers. Free molecular viewers such as RasMOL and Chime are used. There is some use of Java and VRML. Email and bulletin boards are used for group interaction. Course management tools such as WebCT are also in use.

The content of a Virtual Course (VC) has to be high quality, designed for the WWW, not lecture notes. The WWW is a new medium. Content includes open hypertext learning resources (e.g., a Daylight tutorial on SMILES), guided learning assignments, and guidance through existing resources not necessarily intended for teaching (e.g., a molecular database) but support must be provided.

Asynchronous communication is done using mailing lists and bulletin boards; real time communication is more difficult. Some, but not all, participants like the immediacy of virtual conferencing (MOOs). Audio and video conferencing may be useful in keeping in touch but is not (yet) suitable for complex information. It is not good enough or reliable enough for day to day teaching. Brailsford discussed Internet resources for molecular science. Software tools such as Chime and molecular databases (sequences etc.) are not intended for teaching and students will need support. Tutorials are designed for teaching.

The structure-based drug design (SBDD) certificate requires 60 credits; 120 credits lead to a diploma. There is a major skills gap in industry here. This is a specialist course for mid-career training aimed at medicinal and computational chemists using activity centered, assignment based learning. Students are looking for transferable skills; they already have PhDs so they are not looking for letters after their names.

Glaxo Wellcome, SmithKline Beecham, Pfizer, Astra, and Oxford Molecular sponsor the course and 18 of the 19 students are from sponsoring companies. Postgraduate industrial tutors provide original teaching material, form a steering committee and have local tutor or counselor roles. In the light of their increasing importance they have been appointed honorary members of the university.

SBDD Modules include Unix and networks, the WWW, molecular graphics and modeling, molecular databases, drug design, molecular interactions, protein structure and modeling, and a dissertation. At least one assignment per module is assessed. The assignments are submitted as HTML, informally by WWW, and formally by email.

Back to the top

Making Chemistry Available via the WWW for Education

Mark J. Winter, University of Sheffield

This talk is available at Winter's site on the Web. He demonstrated that "non-programmer" chemists can program computers using high-level languages and existing educational software can be adapted to provide chemical information and tools on the WWW.

If you need a computer program that is not obviously available, and you are prepared to write one, then Winter asks the following questions. Would you rather learn something simpler than C++? Are your needs relatively simple? Is execution speed not particularly critical? If you answer "yes" to all of the above then you should consider a scripting environment. A scripting language reads as more or less plain English but can still give full control over the computer.

HyperCard is a programming environment for the Macintosh. It is astonishingly useful but is sometimes sneered at by "real" programmers. It has database capabilities, uses a plain-English scripting language (HyperTalk), and can run AppleScript system scripting. Its graphics capabilities are somewhat limited. A number of other programs are based upon the HyperCard concept. SuperCard, for example, runs on MacOS or as a Windows/MacOS plug-in; it uses SuperTalk, a variant of HyperTalk, and has better graphics and media versatility than HyperCard, but is sometimes slower in execution. Other examples are ToolBook for PCs and MetaCard for Windows 3.1/95/98/NT, Unix/X11, and MacOS.

Winter next described Chemdex and Chemdexplus. Chemdex started life in 1993 as Winter's bookmark list which he then placed on the Web. At the time there were only half a dozen or so chemistry sites so it was easy. As the number of chemistry sites increased, it became more difficult to handle and so was moved to HyperCard, and later to SuperCard. The Chemdex files are written under script control from a SuperCard project. Chemdexplus is an enhanced version of Chemdex, under development in collaboration with ChemWeb.com. The key differences between Chemdex and Chemdexplus are that the latter is searchable and that the database software is more sophisticated.

WebElements is a periodic table on the World Wide Web. It is an online resource attempting to add value to the WWW. It is not a tutorial, a hypertext essay or a reaction database. The origin of WebElements is an unfinished 1989 HyperCard "stack" (MacElements) which addressed the periodic table. In essence there is a "card" for each element with some data on it. In the first version of WebElements, MacElements used HyperCard's database facilities to store data, a HyperTalk script extracted the data and inserted them into HTML templates and HyperTalk wrote the HTML block to text files.
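The extract-and-template workflow described can be sketched in a few lines, with Python standing in for HyperTalk. The template, field names and element data below are abbreviated and illustrative, not WebElements' actual files.

```python
# A cut-down HTML template with placeholders for element data
TEMPLATE = """<html><head><title>WebElements: {name}</title></head>
<body><h1>{name} ({symbol})</h1>
<p>Atomic number: {number}</p>
<p>Density: {density} g/cm3</p>
</body></html>"""

# A tiny stand-in for the element database
elements = [
    {"name": "Bromine", "symbol": "Br", "number": 35, "density": 3.12},
    {"name": "Iron", "symbol": "Fe", "number": 26, "density": 7.87},
]

# Extract each element's data, insert it into the template, one page per element
pages = {e["symbol"]: TEMPLATE.format(**e) for e in elements}
print(pages["Br"].splitlines()[1])
```

Regenerating the whole site is then a single script run, which is exactly why a database-plus-template approach scales from half a dozen pages to thousands.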

The second version of WebElements is a little more complex. SuperCard is now used for its color capabilities. It works as above and in addition there are about 25 files for each element; alternatives are parallel frames or frameless interfaces (with the same data in each version); and a number of data visualization styles are buried within the WebElements file structure. Each is drawn using SuperTalk to extract the data from the database and to plot the graphs. Winter demonstrated the pages for bromine, including Virtual Reality Modeling Language (VRML) for the crystal structure, and a plot of density versus atomic number, in which you jump to the appropriate element by clicking on the plot. Varying shades of red on the table can also be used.

Winter showed a WebElements site traffic graph based on more than 10 million element page downloads. The frameless interface is preferred. Popularity of an element declines with atomic weight. Readers like shiny metals and there are sawtooth regions in the graph. Interestingly, the graph matches elemental abundance in sea water.

Winter showed a 3D histogram of density using a QuickDraw 3D file. QuickDraw 3D is a cross-platform application program interface (API) for creating and rendering real-time, workstation-class 3D graphics. It consists of human interface guidelines and toolkit, a high-level modeling tool kit, a shading and rendering architecture, a cross-platform file format and a device and acceleration manager for plug and play hardware acceleration. QuickDraw 3D is now integrated into the QuickTime 3 release. QuickDraw 3D images can be embedded within a document and viewed using an appropriate browser plug-in or an external viewer. Winter reproduced a specification of a box in 3DMF text format: the 3DMF he listed displays a block with an attached URL and an attached description at a defined position in space. In principle, all sorts of special effects could be applied to the block, but these may distract attention.

VRML is a way of describing virtual reality images using plain text. Virtual reality images can be embedded within a document and viewed using an appropriate browser plug-in or an external viewer. Some sources of VRML plug-ins or external VRML viewers are Quick3D for Macs and PCs, and the QuickDraw 3D Viewer Application for Windows. Microsoft Internet Explorer displays 3DMF files without modification on MacOS and there are various 3D plug-ins for Netscape. Winter gave an example of markup to display a block with an attached URL and an attached description at a defined position in space. In principle, all sorts of special effects could be applied to the block, but these may distract attention.
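As a sketch of how such markup can be written under script control, the following Python function emits a minimal VRML 2.0 world: one box at a defined position, wrapped in an Anchor node carrying a URL and a description. The URL, description, and dimensions are invented for illustration, and Winter's talk may have used a different VRML dialect.

```python
# "Virtual reality in plain text": generate a VRML 2.0 world containing
# a single clickable block. The Anchor node attaches a URL and a
# description; the Transform places the Box at a given position.
# All concrete values here are invented for illustration.

VRML_BLOCK = """#VRML V2.0 utf8
Anchor {{
  url "{url}"
  description "{description}"
  children [
    Transform {{
      translation {x} {y} {z}
      children [ Shape {{ geometry Box {{ size 1 1 1 }} }} ]
    }}
  ]
}}
"""

def block_world(url, description, x=0, y=0, z=0):
    """Return VRML text for a clickable block at (x, y, z)."""
    return VRML_BLOCK.format(url=url, description=description, x=x, y=y, z=z)

world = block_world("http://www.example.org/element", "a labelled block", 1, 2, 3)
```

Because the output is plain text, exactly the same template-and-script approach used for the HTML pages serves for the 3D views.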

He gave examples of VRML to display chemistry: a 3D histogram of melting points (click on one bar and go to carbon, say), the crystal structure of sulfur, and flying through the more complex crystal structure of a metal. Some more sources of VRML plug-ins and external VRML viewers are WorldView, Cosmo Player from Silicon Graphics, Live3D (which came with Netscape 3 and was good but may be defunct) and various 3D plug-ins for Netscape.

The Sheffield Chemputer is a calculator for a few chemical properties accessed via the WWW, and a demonstration of HyperCard and SuperCard stacks as cgi applications. The Chemputer relies upon AppleEvents to pass parameters and results between programs (WebStar and WebElements.cgi). The first requirement is to parse a chemical formula, including brackets and pseudoelement symbols (essentially string manipulations). Some components are HyperTalk compiled into machine code using CompileIt! for speed. Available calculators include isotope patterns, element percentages, oxidation number, electron count, VSEPR (valence shell electron pair repulsion) shape calculation, and MLXZ category (a way of categorizing complexes according to ligand type which avoids oxidation state formalism).
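The formula-parsing step is indeed essentially string manipulation. A stack-based sketch in Python (not the Chemputer's compiled HyperTalk) that handles nested brackets:

```python
# Stack-based parser for chemical formulae with nested brackets,
# e.g. "Fe(C5H5)2" -> {"Fe": 1, "C": 10, "H": 10}. A sketch of the
# kind of string manipulation the Chemputer performs, not its code.
# Pseudoelement symbols parse like ordinary one- or two-letter symbols.

import re
from collections import Counter

TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)|(\()|(\))(\d*)")

def parse_formula(formula):
    stack = [Counter()]                      # one Counter per bracket level
    for symbol, count, open_br, close_br, mult in TOKEN.findall(formula):
        if symbol:                           # element symbol, optional count
            stack[-1][symbol] += int(count or 1)
        elif open_br:                        # "(" : start a new group
            stack.append(Counter())
        elif close_br:                       # ")" : fold group into parent
            group = stack.pop()
            n = int(mult or 1)
            for el, c in group.items():
                stack[-1][el] += c * n
    return dict(stack[-1])
```

The closing-bracket multiplier is applied when the group is folded back into its parent, so arbitrarily nested brackets fall out of the same loop.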

Winter demonstrated the Sheffield Chemputer isotope pattern calculator. He predicted the shape of SeCl2O2 using VSEPR rules. There is an attached tutorial. You can use the Chemputer to calculate an isotope pattern for an arbitrary chemical formula from within your own WWW pages. Under normal conditions it can handle hundreds to thousands of calculations per day without problems. Winter gave an example form you could use in your own WWW pages to perform the calculation.
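The isotope pattern itself comes from repeatedly convolving each atom's isotope distribution into a running pattern. A Python sketch, using chlorine only, with the standard natural abundances (35Cl 75.77%, 37Cl 24.23%); this illustrates the method, not the Chemputer's implementation:

```python
# An isotope pattern is built by convolving the isotope distribution
# of each atom into the pattern accumulated so far. Only chlorine is
# included here; abundances are the standard tabulated values. This
# illustrates the method, not the Chemputer's actual implementation.

ISOTOPES = {"Cl": [(35, 0.7577), (37, 0.2423)]}

def convolve(pattern, isotopes):
    """Fold one more atom's isotope distribution into the pattern."""
    out = {}
    for mass, p in pattern.items():
        for m, a in isotopes:
            out[mass + m] = out.get(mass + m, 0.0) + p * a
    return out

def isotope_pattern(element, n):
    """Return {nominal mass: relative intensity}, largest peak = 100."""
    pattern = {0: 1.0}
    for _ in range(n):
        pattern = convolve(pattern, ISOTOPES[element])
    top = max(pattern.values())
    return {m: 100.0 * p / top for m, p in sorted(pattern.items())}

cl2 = isotope_pattern("Cl", 2)   # the familiar three-peak Cl2 pattern
```

For Cl2 this gives peaks at nominal masses 70, 72 and 74 in roughly 100 : 64 : 10 ratio, the pattern any mass spectrometrist recognizes.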

In conclusion, HyperCard and SuperCard can maintain and export databases of various types to the WWW, can produce useful static and interactive graphics which convey chemical information in a number of formats, and can act as the engine in interactive cgi applications for chemistry.


VChemLab: A Virtual Chemistry Laboratory

A. P. Tonge, H. S. Rzepa Department of Chemistry, Imperial College

This talk is available on the Web. Tonge described the design and implementation of two interactive Internet-based applications for the storage, retrieval and display of molecular structures, spectra and physicochemical properties. The first, suitable for small-scale laboratory or teaching applications, uses a JavaScript-controlled structure to load data from discrete files located on a server or CD-ROM, into a local Web browser. The second, suitable for scaling to larger commercial-sized applications, uses a Java-based Distributed Object Computing architecture to allow a local client persistent access to a remote object database, in which information for different classes is stored. Both browser applications have a dynamically-created page structure and use readily available plug-ins and Java applets for the active display of 3D molecular structures and their experimental spectra.

The Imperial College Virtual Chemistry Laboratory is a World Wide Web-based project for storage and retrieval of chemical information, 2D and 3D molecular structures and their associated experimental and spectroscopic properties. It is also a standalone resource for undergraduate practical classes which do not require direct Internet connection (i.e., it can be loaded from CD-ROM). Information such as melting points and spectra would usually be found in spectral atlases or safety manuals but VChemLab makes it more accessible.

A paper about the project will appear in the November/December 1998 issue of J. Chem. Inf. Comput. Sci. The project aims to take advantage of the Internet, with platform-independent methods for the storage, retrieval and display of chemical information, using standard Internet tools. The World Wide Web has developed rapidly from simply allowing client browsers to view static HTML-based Internet pages. The designers of VChemLab wanted the system to be interactive. User-controlled active browser content is now possible with Java applets, plug-ins and JavaScript. Java applets are downloaded "bytecode" applications; they are browser independent, running on the Java Virtual Machine (JVM). Plug-ins are embedded applications which parse specific MIME types; they are browser dependent. JavaScript is an event-driven scripting language giving user control of HTML elements and interframe communication. Browsers have developed to become platform-independent front ends for extranets and corporate intranets.

Tonge showed a diagram of the client-server architecture. The file server handles structure and spectra files, and data and image files. HTML and JavaScript pages, and JavaScript arrays, are sent to the client, which uses dynamic JavaScript and does HTML page generation. Tonge displayed the VChemLab HTML page hierarchy and the construction of a JavaScript molecule data array object. He did a search for carcinogens, rotated benzene using Chime, and displayed a JCAMP spectrum.
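The JavaScript molecule data array behaves like a list of records searched entirely on the client. A Python stand-in for that idea (the field names and hazard flags here are invented for illustration, not VChemLab's actual data):

```python
# A stand-in for the JavaScript molecule data array: a list of records
# filtered locally, as VChemLab's browser-side search does. The field
# names and hazard flags are invented for illustration.

MOLECULES = [
    {"name": "benzene",    "formula": "C6H6",  "mp_c": 5.5,   "carcinogen": True},
    {"name": "toluene",    "formula": "C7H8",  "mp_c": -95.0, "carcinogen": False},
    {"name": "chloroform", "formula": "CHCl3", "mp_c": -63.5, "carcinogen": True},
]

def search(records, **criteria):
    """Return records whose fields match every keyword criterion."""
    return [r for r in records
            if all(r.get(k) == v for k, v in criteria.items())]

carcinogens = search(MOLECULES, carcinogen=True)
```

Because the whole array is already in the browser, such a search needs no round trip to the server, which is exactly why this design suits small laboratory or teaching data sets rather than commercial-sized databases.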

Then he discussed interrogation of large remote databases over a network. A Web browser plus Java acts as a universal, platform-independent client. Applets allow automatic software distribution over the Internet, which reduces maintenance. Objects use Java classes as a template and combine data and data-handling methods.

In Distributed Object Computing the client (browser) process has access over a network to methods and data handling procedures of remote objects running on different machines. Scalability, searchability and security need a design which allows users to submit searches (including substructure searches) to the full remote database before downloading a subset to the browser.

Tonge listed some tools which can be used in distributed computing on the Web. CGI is popular but has many disadvantages: there is no client-side processing, performance is poor, it is stateless, and it does not handle objects. CORBA is popular for bioinformatics; interoperable ORBs (IIOP, IDL) are useful for legacy systems, but there is a question over cost. RMI uses serialized objects; it is Java-only but free. Java sockets (with which you must decode everything yourself) and DCOM (from Microsoft) are also available. The team at Imperial chose to use RMI.

Tonge showed a diagram of the COS (Chemical Object Store) architecture and the ORB/RMI architecture. You register a remote object with the RMI Registry (on the server) and when the client applet connects to the server, the Registry returns a stub (proxy remote object) to the client. The client invokes a remote method on the stub and stub and skeleton communicate over the network. The skeleton invokes a method on the remote object and the remote object returns the requested data.
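That stub/skeleton round trip has the same shape in any distributed-object system. A minimal Python sketch using the standard library's XML-RPC as a stand-in for Java RMI (the MoleculeStore object, its method, and its data are invented for illustration):

```python
# The stub/skeleton round trip in miniature: a server exports a remote
# object, the client obtains a proxy (stub) for it, and a method call on
# the stub travels over the network and back. Python's XML-RPC stands in
# for Java RMI; the MoleculeStore class and its data are invented.

import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

class MoleculeStore:
    """The remote object; the server-side skeleton dispatches to its methods."""
    DATA = {"benzene": 5.5, "naphthalene": 80.3}

    def get_melting_point(self, name):
        return self.DATA[name]

def start_server():
    # Port 0 lets the OS pick a free port; the registry role is played
    # here by simply returning the port for the client to connect to.
    server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
    server.register_instance(MoleculeStore())
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server.server_address[1]

port = start_server()
stub = ServerProxy(f"http://127.0.0.1:{port}")   # the client-side proxy
mp = stub.get_melting_point("benzene")            # looks local, runs remotely
```

The essential point Tonge made survives the translation: the client code calls an ordinary-looking method, and the proxy machinery hides the network.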

Tonge demonstrated an RMI test of PSE molecule databases. Distributed computing means that you may want to operate outside the applet security sandbox, e.g., write files to the local disc. Both Netscape Navigator and Microsoft Internet Explorer allow you to use trusted applets, whereby the applet Java class files have been digitally signed with an Object Signing Certificate from an approved authority (e.g., BelSign, VeriSign, Thawte). The applet archive on the server is signed. Netscape Capabilities classes and Microsoft Authenticode are examples. Tonge concluded with a "higher risk" demonstration RMI test of PSE molecule databases, accessing a machine in London.


The CSIR Web Service -- Making it Easier to Find the Chemistry Software and Information You Need

David E. Bernholdt, Northeast Parallel Architectures Center, Syracuse University

This talk is available in the ChemWeb-VEI library. Chemistry Software and Information Resources, CSIR, pronounced "caesar", is a new WWW-based tool for the chemistry community, the goal of which is to facilitate the "discovery" and use of software resources in chemistry research and education. Finding things on the Web can be difficult, and the software which underlies a great deal of what we do is also in many cases very hard to find. This was the challenge for the CSIR project, which now has two primary components, the Chemistry Software Exchange and a mailing list archive.

Probably 75% of the queries on the Computational Chemistry list seem to be of the type "Does anybody know software to do x?" There are many chemistry-related mailing lists and a lot of them discuss software, so CSIR is a resource for learning about software, for learning about how to use software, and for help in interpreting results. Mailing lists are scattered all over the Internet, in many cases they are not archived, and many of the archives do not have very easy access.

The lofty goal of the Chemistry Software Exchange is to make it the place people look if they need software tools for research. What Bernholdt is trying to create is something called a "virtual software repository". It contains both commercial and non-commercial chemistry-related software. There are also some special categories for database services and Internet services. The Chemistry Software Exchange has basic cataloging information. It is designed to provide sufficient information for a user to locate some candidates and then investigate them in more detail. Included are names, a URL pointer to the software product and vendor, target platforms (if known), and information about cost or licensing arrangements. The information is organized by functionality and it is also fully searchable.

CSIR can either act as an actual repository for software, or point users to where the software actually lives, or point users to the salesperson for the vendor. Catalog entries follow an IEEE standard which was recently approved. This standard, the Basic Interoperability Data Model, provides an object-oriented framework for describing software, for describing libraries of software, and for describing review processes that are applied to software. It is also extensible, so that if you do not like the basic model, there is a standard way to extend it. In using this standard, you can now produce repositories of software that can talk to other repositories.

The project started with the National HPCC Software Exchange (NHSE), of which Syracuse was a part. The NHSE developed a tool called the "Repository in a Box", RIB, a set of Perl scripts and related tools that allow users to create a repository which follows a standard. (There is also commercial software with similar functionality.) The NHSE has developed a number of sites, e.g., the NetLib library of mathematical software. The NHSE and the Repository in a Box are successors to the NetLib library: a more formalized way of cataloging software. The NHSE has several repositories around parallel computing tools, linear algebra software, and so on, and the chemistry repository, CSIR, was the first non-mathematical library of software.

There are others: the Department of Defense Computational Chemistry & Materials program, for example, has a repository, and recently the crystallography project in the UK (CCP14) has installed an RIB archive to set up a repository of crystallography software. All these installations use RIB. CSIR is affiliated with the NHSE project and a number of other repositories are pointed to on the Web page.

Now Bernholdt turned to mailing lists. He has found more than 90, about two thirds of which are actually active. His list is not complete but it is probably the most comprehensive available. He created an archive of chemistry-related mailing lists which provides a single point of access to a large number of mailing lists. It is accessible with any standard Web browser and the lists can be searched in any combination. The archive contains about 130,000 messages over the last two years and it is possible to add back issues. CSIR can also host mailing lists. CSIR is designed to be a service to the chemistry community, so hosting something like charm-bbs (which it now does) is an obvious extension.

RIB uses CGI and Perl scripts to do everything. It enforces a data model and simplifies much of the management and operation of the repository. Bernholdt had started out with hand-generated HTML pages for each catalog entry, and there were significant problems. If certain rules and conventions are followed, these catalogs can actually interoperate. Using the RIB software, Bernholdt can produce a large catalog out of bits and pieces of interoperating catalogs.

In the RIB, asset information is stored in HTML files, but it is stored as meta-data, so that the user-visible content does not matter to the RIB. The RIB can import an asset from other repositories or from arbitrary Web pages. Since all the information that the RIB needs is in meta-data, software vendors can maintain the information themselves in the meta-data on their own Web pages; they can register the URL with CSIR, and Bernholdt can download it and incorporate it into the catalog without actually having to store the information. CSIR is not doing this yet, but that is the approach Bernholdt is aiming to implement.

Bernholdt showed what the meta-data looks like. BIDM is the Basic Interoperability Data Model. Then there is the name of the information field, and the various contents, and the organization line is actually a link. So, one object represents an organization which might have produced the software, and another object represents the software asset itself. The software assets can actually be combined into a library.
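Because the catalog information lives in meta-data, a harvester can extract the fields from an asset page without touching the visible content. A Python sketch using the standard-library HTML parser; the meta-tag names and page below are illustrative, not the actual BIDM vocabulary:

```python
# Harvest catalog fields from the meta-data of an HTML asset page,
# ignoring the user-visible content, as the RIB import does. The
# meta-tag names and the sample page are invented for illustration;
# the real BIDM vocabulary differs.

from html.parser import HTMLParser

class MetaHarvester(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.fields[d["name"]] = d["content"]

PAGE = """<html><head>
<meta name="Asset.Name" content="ExampleChem">
<meta name="Asset.Domain" content="computational chemistry">
<meta name="Organization.Name" content="Example Software Inc.">
</head><body>Marketing text the harvester ignores.</body></html>"""

harvester = MetaHarvester()
harvester.feed(PAGE)
```

One harvested object can describe the organization and another the software asset itself, matching the object structure Bernholdt showed.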

The mailing list archive actually started as an entirely separate project at the Northeast Parallel Architectures Center (NPAC), which was archiving a lot of the computational newsgroups from Usenet. Bernholdt thought that it would be nice to have one for chemistry too, so he started locating lists. NPAC now subscribes to 90 lists, connected to an ORACLE database system, used for its full-text search capabilities. The basic effort went into creating the front end to access the system and also the interface to process all the mail and news feeds as they come in. There are probably better tools that could be used now, but at the time the system was first created it was state of the art and it still serves a purpose. The ORACLE database server talks to the Web server, which in turn communicates with all the browsers. All the mailing lists and newsgroups feed into a series of processors which routinely incorporate information into the database.

CSIR is not meant to replace available facilities; it is meant to help people find them and use them better. So the mailing lists are only archived: the people who contribute to a mailing list themselves provide the content, and CSIR does not allow people to post from its archive directly into the list. On the software side, if you are a vendor, a listing in the CSIR repository can only enhance the chances that somebody will become interested in your software product. CSIR is not trying to compete with other repositories, like QCPE, and places that sell software. That said, Bernholdt is very interested in promoting the idea of interoperable software repositories, all indexed in one central place.

Looking ahead, Bernholdt really wants to focus on the catalog and on getting more catalog entries. There is a definite need for a better classification scheme; creating a taxonomy of software is not easy. Bernholdt uses the same scheme as the DoD repository. NIST has spent a great deal of effort classifying numerical algorithms, but that scheme is inapplicable to chemistry. Ultimately, Bernholdt would like to shift from the software meta-data being kept on the CSIR site to as much of it as possible being on the vendors' sites. This means lower maintenance for CSIR and more current information. It also means lower maintenance for the vendor. On the mailing list side, incorporating back issues would be useful. Bernholdt would like to start mining the archive for information about new mailing lists and new software announcements. Finally, the NHSE is investigating secure software distribution technologies: anything from electronic shrink-wrap to satisfying US export control regulations. An article about CSIR has appeared in Bernholdt, D. E.; Fox, G. C. Trends Anal. Chem. 1997, 16, 230.


Closing Remarks

Henry Rzepa, Imperial College, London

Rzepa's remarks are in the ChemWeb/VEI library.

The idea for this meeting came into being about 16 months earlier. All of us working on the Internet get to know an awful lot of people virtually and often have an urge to meet them physically. The committee felt there was a real need to bring the community together. Although the number of attendees fell short of the organizers' hopes, there was an interesting mixture of people from countries in Europe and North America and as far afield as Kuwait, Australia, and China.

Quite a few of the talks were new to Rzepa. He was delighted to discover the "information universes" that were explored. The standard of presentations and the facilities offered by the Beckman Center were remarkably high. Rzepa suggested that philosophical vision and reality checks could have been the two extremes of material presented, whereas in fact, the conference achieved a happy balance somewhere in between those extremes.

Looking to the future, the committee had issued a questionnaire. The earliest responses reiterated what Rzepa had just said, but suggested that the poster section needed to be expanded to allow for more "social dynamics". The feedback suggests that another ChemInt ought to be organized, possibly in another location. Steve Heller and Steve Bachrach are maintaining the ChemInt Web site where they hope to capture proceedings of the 1998 meeting and details of the next one.

This page updated on 11th March 1999