Prologue
In the beginning, before the creation of the Galaxy, there was the Wanderer... Spinning a lingering web in its wake were the robots, spiders, Worms (WWW), and Crawlers (Web)... Bringing order to the Chaos of the newly formed Galaxy were a bunch of Yahoos.
The wisdom and greatness of the elder Yahoos has gone unquestioned and unchallenged for an eternity...the Seekers of Information could not obtain all of the great one's secrets, nor could the shockwaves from the great Excitement rattle the sacred altar of Yahoo!.
But now, there is a new force in the universe...the Hotomi Bot(1) threatens to tear apart the very fabric of the web!
That was around April or May of 1993. About that same time, www.mit.edu went online as one of the first 100 web servers in the world. Naturally, this was not MIT's official homepage(2), because at that point nobody had homepages. It was actually a server set up by a bunch of students who collectively called themselves SIPB (the Student Information Processing Board). Their pages provided a central starting place for exploring MIT's web sites, offering helpful information for "surfers" who were still confused by the whole concept. (3)
Bow down and give thanks to Archie. The grandfather of all search engines was Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal. The author originally wanted to call the program "archives", but had to shorten it to comply with the Unix world standard of assigning programs and files short, cryptic names such as grep, cat, troff, sed, awk, perl, and so on. For more information on where Archie is today, see: http://www.bunyip.com/products/archie
In 1990, there was no World Wide Web yet. Around this time, Tim Berners-Lee probably had a bad dream in which a scary monster with "HTTP" etched into its hide slowly ate up all of the Earth's resources. Nonetheless, there was still an Internet, and many files were scattered all over the vast network. The primary method of storing and retrieving files was the File Transfer Protocol (FTP). This was (and still is) a system that specifies a common way for computers to exchange files over the Internet.
It works like this: Some administrator decides that he wants to make files available from his computer. He sets up a program on his computer, called an FTP server. When someone on the Internet wants to retrieve a file from this computer, he or she connects to it via another program called an FTP client. Any FTP client can connect to any FTP server, as long as both programs fully follow the specifications set forth in the FTP protocol.
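Because both sides speak the same protocol, a few lines of code are all it takes to play the client role today. Here is a minimal sketch using Python's standard ftplib; the host name, directory, and file name are placeholders, not a real archive.

```python
# Minimal sketch of an FTP client session using Python's standard ftplib.
# The host name, directory, and file name are hypothetical placeholders.
from ftplib import FTP

def fetch_readme(host: str = "ftp.example.org") -> bytes:
    """Connect anonymously, list a directory, and download one file."""
    chunks = []
    with FTP(host) as ftp:
        ftp.login()                      # anonymous login
        ftp.cwd("/pub")                  # change to the public directory
        print(ftp.nlst())                # list the file names on offer
        ftp.retrbinary("RETR README", chunks.append)  # stream the file down in chunks
    return b"".join(chunks)

if __name__ == "__main__":
    data = fetch_readme()
    print(f"Retrieved {len(data)} bytes")
```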
Initially, anyone who wanted to share a file had to set up an FTP server in order to make the file available to others. Later, "anonymous" FTP sites became repositories for files, allowing all users to post and retrieve them. Even with archive sites, many important files were still scattered on small FTP servers. Unfortunately, these files could be located only by the Internet equivalent of word of mouth: Somebody would post an e-mail to a message list or a discussion forum announcing the availability of a file.
Archie changed all that. It combined a script-based data gatherer, which fetched site listings of anonymous FTP files, with a regular expression matcher for retrieving file names matching a user query. (4) In other words, Archie's gatherer scoured FTP sites across the Internet and indexed all of the files it found. Its regular expression matcher provided users with access to its database.
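For the curious, the matching half of that job is easy to picture. The sketch below shows the general idea of matching a user's regular expression against an index of gathered file names; the index entries are just examples, and this is not Archie's actual code.

```python
# Rough sketch of Archie-style lookup: match a user's regular expression
# against an index of file names gathered from anonymous FTP sites.
# The entries below are examples only, not a real Archie database.
import re

INDEX = {
    "xmosaic-2.4.tar.gz": "ftp.ncsa.uiuc.edu:/Web/Mosaic/Unix",
    "pine3.91.tar.Z":     "ftp.cac.washington.edu:/pine",
    "gopher2.3.tar.gz":   "boombox.micro.umn.edu:/pub/gopher",
}

def archie_search(pattern: str) -> list[tuple[str, str]]:
    """Return (file name, location) pairs whose names match the pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(name, where) for name, where in INDEX.items() if regex.search(name)]

print(archie_search(r"mosaic"))      # simple substring-style query
print(archie_search(r"\.tar\.gz$"))  # anything packaged as a gzipped tarball
```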
Veronica and Jughead - but where is Betty?
Gopher is like FTP, but for documents instead of arbitrary files: Gopher servers hold plain-text documents (no images, no hypertext) that can be retrieved. Archie's popularity had grown to the point that in 1993, the University of Nevada System Computing Services group developed Veronica(5) (the grandmother of search engines), a searching device similar to Archie but for Gopher files. Another Gopher search service, called Jughead, appeared a little later, probably for the sole purpose of rounding out the comic-strip triumvirate. Jughead is an acronym for Jonzy's Universal Gopher Hierarchy Excavation and Display, although, like Veronica, it is probably safe to assume that the creator backed into the acronym. Jughead's functionality was pretty much identical to Veronica's, although it appears to be a little rougher around the edges.
The lone Wanderer
If Archie was the grandfather of search tools and Veronica the grandmother, their child, and thus the mother of all search engines, was Matthew Gray's World Wide Web Wanderer. The Wanderer was the first robot on the web and was designed to track the web's growth. Initially, the Wanderer counted only web servers, but shortly after its introduction, it started to capture URLs as it went along. The database of captured URLs became the Wandex, the first web database. Matthew Gray's Wanderer created quite a controversy at the time, partially because early versions of the software ran rampant through the Net and caused a noticeable netwide performance degradation. This degradation occurred because the Wanderer would access the same page hundreds of times a day. The Wanderer soon amended its ways, but the controversy over whether robots were good or bad for the Internet remained.
What's a Robot got to do with the Internet?
The term robot has special significance to programmers. Their version of the term is mostly unrelated to the metallic lumbering creatures of Asimov lore. A synonym for robot, "automaton," is actually more enlightening. Computer robots are programs that automatically perform a repetitive task at speeds that would be impossible for humans to match, just like the tasks today's robots perform in factories.
On the Internet, the term robot or bot has become a bit broader. For the most part, it refers to programs that explore the Internet for some sort of information. Web robots search the Internet for web pages, usually for the purpose of compiling a large, searchable database. This category of robot is often called a spider. The spider robot falls right into the standard definition of performing a repetitive task. Other types of robots on the Internet push the interpretation of the automated task definition.
The chatterbot variety is a perfect example. These robots are designed to communicate with humans about some topic in a human-like manner. Some of them are fairly convincing; others are obviously quickly written computer programs. Chatterbots are sometimes used as an intuitive way to communicate certain basic information to users. An example is the milk robot, which can answer lots of questions about milk. One could force this type of program into the definition above by saying that it performs the repetitive task of communicating with clueless people.
The ALIWEB Strikes Back!
In response to the Wanderer, Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB, in October 1993. As the name implies, ALIWEB was the HTTP equivalent of Archie, and because of this, it is still unique in many ways. ALIWEB does not have a web-searching robot. Instead, webmasters of participating sites post their own index information for each page they want listed. The advantage of this method is that users get to describe their own sites, and no robot runs about eating up Net bandwidth. Unfortunately, the disadvantages of ALIWEB are more of a problem today. The primary disadvantage is that a special indexing file must be submitted. Most users do not understand how to create such a file, and therefore they don't submit their pages. This leads to a relatively small database, which means that users are less likely to search ALIWEB than one of the large bot-based sites. This Catch-22 has been somewhat offset by incorporating other databases into the ALIWEB search, but ALIWEB still does not have the mass appeal of search engines such as Yahoo! or Lycos.
Invasion of the Spiders!
As the web grew, it became more and more difficult to sort through all of the new web pages added each day. Matthew Gray's Wanderer inspired a number of programmers to follow up on the idea of web robots, or spiders, as they are now called. These programs systematically scour the web for pages by exploring all of the links on a starter site, which is a page that contains many links to other pages. The idea is that, almost by definition, every page on the web is linked to from some other page. By fetching a large number of pages and following all of their links, a spider discovers new pages that have their own collections of links. The hope is that most of the web can be explored through continuous repetition of this process. This process caused a great deal of controversy because some poorly written spiders were creating huge loads on the network by repeatedly accessing the same series of pages. Most network administrators thought they were a bad thing, so naturally programmers created even more of them.
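Stripped of all politeness and error handling, the core of such a spider is a short loop. The sketch below uses only Python's standard library; the seed URL is a placeholder, and a real spider would also have to obey robots.txt and pace its requests so it doesn't repeat the Wanderer's mistakes.

```python
# Toy sketch of the spidering loop: start from a seed page, pull out its
# links, and keep following links that have not been seen yet.
# The seed URL is a placeholder; a real spider must also honor robots.txt
# and rate-limit itself so it does not hammer any one server.
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

LINK_RE = re.compile(r'href="([^"#]+)"', re.IGNORECASE)

def crawl(seed: str, max_pages: int = 50) -> set[str]:
    seen, queue = {seed}, deque([seed])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", "replace")
        except OSError:
            continue                      # unreachable page: skip it
        for href in LINK_RE.findall(html):
            link = urljoin(url, href)     # resolve relative links
            if link.startswith("http") and link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

if __name__ == "__main__":
    for page in sorted(crawl("http://www.example.org/")):
        print(page)
```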
By December 1993, the web had a case of the creepy crawlies. Three search engines powered by robots had made their debut: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. JumpStation's web bot gathered information about the title and header from Web pages and used a very simple search and retrieval system for its web interface. The system searched a database linearly, matching keywords as it went. Needless to say, as the web grew larger, JumpStation became slower and slower, finally grinding to a halt.
The WWW Worm indexed only the titles and URLs of the pages it visited and used regular expressions to search that index. Results from both JumpStation and the Worm came out in whatever order the search found them, meaning that the ordering of the results bore no relation to their relevance.
The RBSE spider was the first to improve on this process by implementing a ranking system based on relevance to the keyword string.
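How RBSE actually computed its rankings isn't spelled out here, so the sketch below is only a generic illustration of relevance ranking: count how often the query's keywords appear in each document and return the hits best-first. The documents are invented.

```python
# Generic illustration of ranked retrieval: score each document by how often
# the query's keywords appear in it, then return results best-first.
# This is not the RBSE spider's actual algorithm, just the general idea.
import re
from collections import Counter

DOCS = {
    "page1": "surf report for california surf beaches",
    "page2": "stock market report",
    "page3": "learn to surf: beaches, boards, and waves",
}

def tokens(text: str) -> list[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def rank(query: str) -> list[tuple[str, int]]:
    terms = tokens(query)
    scored = []
    for name, text in DOCS.items():
        counts = Counter(tokens(text))
        score = sum(counts[t] for t in terms)
        if score > 0:
            scored.append((name, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

print(rank("surf beaches"))   # page1 (score 3), then page3 (score 2)
```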
The Easily Excitable Spider
The popular public search engine Excite has roots that extend rather far back in the history of the web. The project was initially called Architext; it was started by six Stanford undergraduates in February 1993. Their idea was to use statistical analysis of word relationships to provide more efficient searches through the large amount of information on the Internet. Their project was fully funded by mid-1993. Once funding was secured, they released a version of their search software for webmasters to use on their own web sites. At the time, the software was called Architext, but it now goes by the name of Excite for Web Servers.
Billions and billions of categorized links...
Unfortunately, these spiders all lacked the intelligence to understand what they were indexing. Therefore, if you didn't know specifically what you were looking for, it was unlikely that you'd find it. This deficiency prompted the creation of EINet Galaxy, now known as the Tradewave Galaxy, which is the oldest browsable/searchable web directory.
Because it is a directory, Galaxy/EINet links are organized into hierarchical categories. For example, a top-level category might be called "Computers". Within the Computers category there might be subcategories for "IBM", "Sun Microsystems", "Digital Equipment Corporation", and so on. Within each of these subcategories would be further subcategories, and these would be more or less consistent across the various machine types.
As an example, all of the computer company categories might contain the subcategories "Hardware" and "Software". This method of organization allows users to explore the contents of the database more effectively by narrowing the field of interest. The Galaxy/EINet went online in January 1994. It offered Gopher and Telnet search features in addition to web searching. Interestingly enough, Gopher was vastly popular as a document-sharing tool when the web was born. The Gopher search capability was probably the primary reason for the creation of the EINet Galaxy. (There weren't really very many web pages to search through in January 1994!) The web page search capability was simply an additional feature. To this day, Tradewave (www.tradewave.com) still clings to its directory-based roots; it uses no bots or spiders to seek out new URLs.
Therefore, the Galaxy/EINet is a true directory in the sense that it lists only URLs that have been submitted to it, and all categorization and review of the submitted URLs is done by hand. This results in higher-quality pages and more relevant searches, but far fewer pages to search through.
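As a small illustration of the hierarchical organization described above, browsing such a directory amounts to walking a nested structure and narrowing the field one level at a time. The categories and links below are made up; this is not EINet's actual data model.

```python
# Small sketch of a hand-built directory: a nested dictionary of categories,
# with URLs at the leaves. The categories and links are made up for illustration.
DIRECTORY = {
    "Computers": {
        "IBM": {
            "Hardware": ["http://example.com/ibm-hw"],
            "Software": ["http://example.com/ibm-sw"],
        },
        "Sun Microsystems": {
            "Hardware": ["http://example.com/sun-hw"],
            "Software": ["http://example.com/sun-sw"],
        },
    },
}

def browse(path):
    """Walk down the category path, narrowing the field of interest."""
    node = DIRECTORY
    for category in path:
        node = node[category]
    return node

print(list(browse(["Computers", "IBM"])))         # subcategories: Hardware, Software
print(browse(["Computers", "IBM", "Software"]))   # the hand-reviewed links themselves
```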
Yahoo!
At this stage in the game, people were creating pages of links to their favorite documents. In April 1994, two Stanford University Ph.D. candidates, David Filo and Jerry Yang, created some pages that became rather popular. They called the collection of pages Yahoo! Their official explanation for the name choice was that they considered themselves to be a pair of yahoos.
As the number of links grew and their pages began to receive thousands of hits a day, the team created ways to better organize the data. In order to aid in data retrieval, Yahoo! (www.yahoo.com) became a searchable directory. The search feature was a simple database search engine. Because Yahoo! entries were entered and categorized manually, Yahoo! was not really classified as a search engine. Instead, it was generally considered to be a searchable directory. Yahoo! has since automated some aspects of the gathering and classification process, blurring the distinction between engine and directory.
The Wanderer captured only URLs, which made it difficult to find things that weren't explicitly described by their URL. Because URLs are rather cryptic to begin with, this didn't help the average user. Searching Yahoo! or the Galaxy/EINet was much more effective because they contained additional descriptive information about the indexed sites.
Brian's WebCrawler: Some Spider!
As bots got better and better, one rose above the pack with its unique ability to index the entire text of a web page. Other bots stored the title, the URL, and the first 100 or so words of a document, but WebCrawler was the first to let users search the full text of entire documents.
The history of WebCrawler is best told by those responsible: "In early 1994, students and faculty in the Department of Computer Science and Engineering [of the University of Washington] gathered in an informal seminar to discuss the early popularity of the Internet and the World-Wide Web. Students typically try out their ideas in small projects in these seminars, and several interesting projects were started. The WebCrawler was Brian Pinkerton's project, and began as a small single-user application to find information on the Web. Fellow students persuaded Pinkerton to build the Web interface to the WebCrawler that became widely usable. In that first release on April 20, 1994, the WebCrawler's database contained documents from just over 6000 different servers on the Web. The WebCrawler quickly became an Internet favorite, receiving an average of 15,000 queries per day in October, 1994 when Pinkerton delivered a paper describing the WebCrawler."
Eventually, the demand for WebCrawler devastated the network resources at the University of Washington. Although a number of companies invested in server equipment to ease the load on the WebCrawler servers, there was no solution to the bandwidth issue. At one point, the service became entirely unusable during the daytime hours. Finally, America Online (AOL) saved the day by purchasing the WebCrawler system and running it on its own network.
In 1997, Excite bought out WebCrawler, and now AOL is using an Excite derivative as the engine behind its own NetFind. The most important point about WebCrawler is that it was the first full-text search engine on the Internet. Until its debut, a user could search through only URLs or descriptions. The descriptions were sometimes created by the engines themselves or reviewers trying to rate the sites. A final word about WebCrawler from the company itself: "Several competitors emerged within a year of WebCrawler's debut: Lycos, Infoseek, and OpenText. They all improved on WebCrawler's basic functionality, though they did nothing revolutionary. WebCrawler's early success made their entry into the market easier, and legitimized businesses that today constitute a small industry in Web resource discovery." (www.webcrawler.com)
Mellon-Mania: The Birth of Lycos
Lycos was indeed the next big kid on the block, bursting out of the labs at Carnegie Mellon University in July 1994. The person responsible for unleashing this force onto the world is Michael Mauldin, currently on leave from CMU and acting as Chief Scientist at Lycos, Inc. In a paper describing design decisions made while programming Lycos, he gives a very nice history of the service:
"Work on the Lycos spider began in May 1994, using John Leavitt's LongLegs program as a starting point. (Lycos was named for the wolf spider, Lycosidae lycosa, which catches its prey by pursuit, rather than in a web.) In July 1994, I added the Pursuit retrieval engine to allow user searching of the Lycos catalog (although Pursuit was written from scratch for the Lycos project, it was based on experience gained from the ARPA Tipster Text Program in dealing with retrieval and text processing in very large text databases (9) ). On July 20, 1994, Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents -- more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word 'surf.'" (6)
Hide and Seek
Representatives of Infoseek, another major search engine, say that they founded their corporation in January 1994. Although this may be true, the search engine itself was not accessible until much later that year. Initially, Infoseek was just another search engine. It borrowed conceptually from Yahoo! and Lycos, not really innovating in any particular way. Yet the history of Infoseek and its current critical acclaim show that being the first or the most original isn't always that important. Infoseek's user-friendly interface and its numerous additional services (such as UPS tracking, news, and a directory) have garnered kudos, but it was Infoseek's strategic deal with Netscape in December 1995 that brought it to the forefront of the search engine pack.
Infoseek convinced Netscape (with the help of quite a bit of cash) to have its engine pop up as the default when people hit the Net Search button on the Netscape browser. Prior to this, Yahoo! was Netscape's default search service.
Return of the DEC
Digital Equipment Corporation's (DEC) AltaVista was a latecomer to the scene; it made its online debut in December 1995. Nonetheless, it had a number of innovative features that quickly catapulted it to the top. The least of these features was its speed: run on a bunch of DEC Alphas, it had the horsepower to handle millions of hits per day without slowing down in the slightest. The rest of its features, all available from its introduction, changed the face of search engines forever. AltaVista was the first to accept natural language queries, meaning a user could type in a sentence like "What is the weather like in Tokyo?" and not get a million pages containing the word "What".
It was also the first to implement advanced searching techniques such as Boolean operators (AND, OR, NOT, and so on). Furthermore, a user could search newsgroup articles and retrieve them via the web, and could specifically search for text in image names, titles, Java applets, and ActiveX objects. AltaVista also claims to be the first search engine to let users add and delete their own URLs, placing them in the index within 24 hours.
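Boolean retrieval is simple to picture: keep an inverted index from each word to the set of pages containing it, and the operators become set operations. The sketch below shows the general technique with a made-up index; it is not AltaVista's implementation.

```python
# Simplified sketch of Boolean retrieval over an inverted index: each word maps
# to the set of pages containing it, and AND/OR/NOT become set operations.
# This illustrates the general technique, not AltaVista's implementation.
INDEX = {
    "weather": {"page1", "page2", "page4"},
    "tokyo":   {"page2", "page3"},
    "surf":    {"page1", "page3", "page5"},
}
ALL_PAGES = set().union(*INDEX.values())

def lookup(word: str) -> set[str]:
    return INDEX.get(word.lower(), set())

print(sorted(lookup("weather") & lookup("tokyo")))   # weather AND tokyo -> ['page2']
print(sorted(lookup("weather") | lookup("surf")))    # weather OR surf -> pages 1, 2, 3, 4, 5
print(sorted(lookup("weather") - lookup("surf")))    # weather AND NOT surf -> pages 2 and 4
print(sorted(ALL_PAGES - lookup("tokyo")))           # NOT tokyo -> everything but pages 2 and 3
```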
One of the most interesting new features AltaVista provided was the ability to search for all of the sites that link to a particular URL. This was very useful for web designers trying to build up some popularity for their pages; they could check regularly to see how many other pages were referencing them. On the user interface end, AltaVista made a number of innovations. It put "tips" below the search field to help the user better formulate a search. These tips change constantly, so after using the search a few times, users see a number of interesting features that they might not otherwise have known about. This system was widely adopted by the other search engines.
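Answering a "who links to this page?" query is also easy to picture once a spider has recorded the links it followed: just invert the page-to-links map. Here is a toy sketch with placeholder URLs; it only illustrates the idea, not AltaVista's internals.

```python
# Sketch of answering a "who links to this URL?" query: invert the crawler's
# page -> outgoing-links map into a target -> linking-pages map.
# The URLs below are placeholders.
from collections import defaultdict

OUTLINKS = {
    "http://a.example/": ["http://c.example/", "http://b.example/"],
    "http://b.example/": ["http://c.example/"],
    "http://c.example/": [],
}

BACKLINKS = defaultdict(set)
for page, links in OUTLINKS.items():
    for target in links:
        BACKLINKS[target].add(page)

# Who links to c.example?
print(sorted(BACKLINKS["http://c.example/"]))   # a.example and b.example
```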
In 1997, AltaVista created LiveTopics, a graphical representation system to help users sort through the thousands of results that a typical AltaVista search generates. LiveTopics is interesting as a search tool, but conceptually it is more confusing than the standard search format. Although its innovative qualities are uncontested, its effectiveness remains to be seen (altavista.software.digital.com/search/showcase/two/index.htm).
A Spider Named "Slurp!": The Powerful HotBot
On May 20, 1996, Inktomi Corporation was formed, and HotBot was unleashed upon the world. This is the youngest of all of the major search services, but it has already caused quite a stir in the online community. According to the company: "Pronounced 'ink-to-me', the company name is derived from a mythological spider of the Plains Indians known for bringing culture to the people. Inktomi was founded in January 1996 by Eric Brewer, an assistant professor of computer science at the University of California at Berkeley, and Paul Gauthier, a graduate student in the computer science Ph.D. program, with a desire to commercialize the highly-effective technologies developed during their research." (www.inktomi.com/press/icf-pr.html)
The Inktomi search engine was quickly licensed to Wired magazine's web site, HotWired. This site's popularity accounted for much of the initial fervor over HotBot. Wired's reputation as the oracle of the Net made promoting the site fairly straightforward.
So what's the big deal? Just another search engine? Well, yes and no. HotBot is probably the most powerful of the search engines, with a spider that can supposedly index 10 million pages per day. According to the Wired web site, HotBot should soon be able to reindex its entire database on a daily basis. This would ensure that the pages returned from a search are not out of date, a problem that is now common with other search engines.
Additionally, HotBot makes extensive use of cookie technology to store personal search preference information. A cookie is a small file that a site can store on your computer; it can be read only by the site that generated it. A cookie can hold a small amount of text or binary information, which sites often use to store customization settings or user demographic data.
HotBot recently won the PC Computing Search Engine Challenge, a contest between the major search engines. Representatives from each company were asked questions that could be answered only by a web search. The engine that most effectively led its representative to the right answer won the question. Although this challenge proved little more than the searching abilities of the various representatives, it still garnered quite a bit of critical acclaim for HotBot, further increasing its popularity.
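To make the cookie mechanism described above concrete, here is a minimal sketch of the server side using Python's standard http.cookies module. The cookie name and value are hypothetical, not HotBot's actual preference settings.

```python
# Minimal sketch of the cookie mechanism: the server sends a Set-Cookie header
# holding the user's search preferences, and reads it back from the Cookie
# header on later requests. The name and value here are hypothetical.
from http.cookies import SimpleCookie

# Server side, first visit: remember the user's preferred result-page size.
outgoing = SimpleCookie()
outgoing["results_per_page"] = "25"
outgoing["results_per_page"]["path"] = "/"
print(outgoing.output())           # -> Set-Cookie: results_per_page=25; Path=/

# Server side, later visit: the browser echoes the cookie back.
incoming = SimpleCookie()
incoming.load("results_per_page=25")
print(incoming["results_per_page"].value)   # -> 25
```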
Information Overload: METAbolic Shutdown
What the PC Computing Challenge did show was that different engines pull up completely different sets of materials for similar searches. This makes it extremely frustrating to find what you want on the web, because a query that has little effect using one engine may turn up a gold mine of information on another.
Additionally, the little differences between the engines, especially regarding the support of Boolean operators, have a large impact on which query format works most effectively. The current solution to this problem is the META engine. META engines forward search queries to all of the major web engines at once. The first of these engines was MetaCrawler, which searches Lycos, AltaVista, Yahoo!, Excite, WebCrawler, and Infoseek simultaneously.
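Conceptually, a META engine is little more than a fan-out-and-merge loop. The sketch below uses stand-in functions in place of the real services (none of these are the engines' actual interfaces) and simply de-duplicates URLs across the merged results.

```python
# Schematic sketch of a META engine: send the same query to several underlying
# engines and merge what comes back, de-duplicating URLs. The "engines" here
# are stand-in functions, not the real services' interfaces.
from typing import Callable

def fake_lycos(query: str) -> list[str]:
    return ["http://site-a.example/", "http://site-b.example/"]

def fake_altavista(query: str) -> list[str]:
    return ["http://site-b.example/", "http://site-c.example/"]

ENGINES: list[Callable[[str], list[str]]] = [fake_lycos, fake_altavista]

def metasearch(query: str) -> list[str]:
    merged, seen = [], set()
    for engine in ENGINES:
        for url in engine(query):
            if url not in seen:           # keep the first appearance of each URL
                seen.add(url)
                merged.append(url)
    return merged

print(metasearch("wolf spiders"))   # three distinct URLs from two engines
```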
MetaCrawler was developed in 1995 by Erik Selberg, a master's student at the University of Washington (the same place where WebCrawler was developed a year earlier). Like WebCrawler, MetaCrawler soon grew too large for its university britches and had to be moved to another site.
Here, Erik tells the story of how MetaCrawler became the go2net search engine:
MetaCrawler was conceived in spring of 1995 by myself and my advisor, Oren Etzioni, as my master's degree project. It grew rapidly in popularity once we released it publicly, gaining many new users after Forbes mentioned us in a cover-page article. Use jumped after C|Net reviewed all the major search services, ranking us No. 1, with AltaVista No. 2 and Yahoo No. 3...
In May of 1996, I (along with most of the rest of the AI department at UW) created NETbot. ...When I left NETbot to return to research at UW... MetaCrawler was now under 24/7 monitoring service, the code was as reliable as ever, and we had made several performance improvements. ... There was a realization that Netbot was ill-equipped to handle negotiations with the search services for continued MetaCrawler use.
Thus, the decision was made to license MetaCrawler to go2net, who could provide the resources necessary to make MetaCrawler viable as well as negotiate with the search services toward mutually beneficial arrangements. (www.metacrawler.com/selberg-history.html)
MetaCrawler works by reformatting the output from the various engines it queries into one concise page. Throughout MetaCrawler's history, the search engine companies it worked with did not entirely approve of this procedure. The most common complaint was that the advertising banners the search engines displayed on their own sites did not appear when a user employed MetaCrawler. This meant that their ads were not reaching the intended audience, reducing their ad revenues. The move to go2net heralded MetaCrawler's concession to these concerns.
Now MetaCrawler displays the ads from each search site right above the results. MetaCrawler users were not thrilled by this change, because it increased the time it took for the result page to download. However, skillful design of the result pages now causes the text to load first, calming the restless users.
Are You Savvy Enough to Search with Me?
Colorado State University also has a tool, called SavvySearch, that searches up to 20 engines at once, including a number of topic-specific directories such as Four11 (e-mail addresses), FTPSearch95 (files on the Net), and DejaNews (the UseNet database). It is faster but less reliable than MetaCrawler.
SavvySearch's solution to the problem of differing search engine query formats is to ignore them all; users should not try to enter complex search strings into SavvySearch. MetaCrawler at least tries to tackle this problem by defining its own search syntax (using + to indicate AND and - to indicate AND NOT) and converting that syntax into the equivalent command for each engine. However, neither MetaCrawler nor SavvySearch lets you tap the full power of the advanced search syntaxes offered by most engines.
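The translation step can be pictured as a small rewriting function: pull the + and - terms out of the user's query and re-express them in another syntax, here a generic AND / AND NOT form. This is only an illustration of the idea, not MetaCrawler's actual code or any engine's documented grammar.

```python
# Sketch of MetaCrawler-style query translation: parse "+term" and "-term"
# from the user's query, then rewrite them in another engine's syntax.
# The output format is illustrative, not any engine's documented grammar.
def parse(query: str) -> tuple[list[str], list[str]]:
    required, excluded = [], []
    for token in query.split():
        if token.startswith("+"):
            required.append(token[1:])
        elif token.startswith("-"):
            excluded.append(token[1:])
        else:
            required.append(token)
    return required, excluded

def to_boolean(query: str) -> str:
    """Rewrite the +/- query into an AND / AND NOT expression."""
    required, excluded = parse(query)
    expr = " AND ".join(required)
    for term in excluded:
        expr += f" AND NOT {term}"
    return expr

print(to_boolean("+spider -comics web"))   # -> spider AND web AND NOT comics
```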
One Click, DoubleClick, Red Click, Blue Click
We've already briefly touched upon the relationship between advertisers and search engines, but the area is now of such importance that it deserves its own section. It wasn't long after the advent of search engines, especially once Yahoo! made its much-publicized move from the servers at Stanford to those at Netscape, that advertisers noticed search engine sites were receiving hits in numbers orders of magnitude greater than any other type of site on the web. Receiving daily hits in the millions, search engines seemed like advertising gold mines. This realization prompted the creation of many of the other current search engines.
References
1. Hotomi sounds better than HOinkTomi, don't you think?
2. This fact did not thrill MIT network administrators when the web became popular a year later. Although they made an attempt to wrestle the URL away from SIPB, the students prevailed, and to this day MIT's own homepage is located at http://web.mit.edu. There is an interesting allegory relating to this at the bottom of SIPB's main page at http://www.mit.edu for those that are curious.
3. Such as the document "Inessential Refrigerator Restocking", which is still available at: http://www.mit.edu:8001/sipb/documents/
4. Michael Mauldin, "Lycos: Design choices in an Internet search service," 1997.
5. The name Veronica officially expands to Very Easy Rodent-Oriented Netwide Index to Computerized Archives -- somehow I think they worked the expansion out afterwards, but you decide.
6. Michael Mauldin, "Lycos: Design choices in an Internet search service," 1997.