WinPlanet Software Downloads and Reviews for Small Businesses
The Inner Workings of Robots, Spiders, and Web Crawlers
Getting Up to Speed with Webbots
Lee Underwood

There are three basic types of search engines: crawler-based, human-powered, and a combination of the two.

The human-powered search engines – directories – don't really search. They rely on input from other humans. A Web site URL is submitted manually to the directory, sometimes with a short summary of the site. It's reviewed (though not always) by a human and then indexed. Directory-based search engines include Yahoo, Open Directory, and LookSmart.

The indexes of crawler-based search engines are fed data by computer programs called robots. They are also called spiders or Web crawlers because they 'crawl' over the Web. Google, AllTheWeb, and Teoma are all crawler-based search engines.


Search Engine Robots
Search Engine     User-Agent
AltaVista         Scooter
AOL               Slurp
Excite            ArchitextSpider
Google            Googlebot
Infoseek          Infoseek
Lycos             Lycos
Netscape          Googlebot
NorthernLight     Gulliver
Teoma             TeomaAgent
WebCrawler        ArchitextSpider
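
As a rough illustration of how the table above might be used, the Python sketch below maps a user-agent string to the search engines listed for it. The function name engines_for and the substring-matching approach are assumptions made for this example; real user-agent strings usually carry extra detail such as a version number, and they can be forged, so a match is only a hint.

```python
# User-agent tokens taken from the table above. Real user-agent strings
# usually carry extra detail (version, contact URL), so we match substrings.
ROBOT_TOKENS = {
    "Scooter": ["AltaVista"],
    "Slurp": ["AOL"],
    "ArchitextSpider": ["Excite", "WebCrawler"],
    "Googlebot": ["Google", "Netscape"],
    "Infoseek": ["Infoseek"],
    "Lycos": ["Lycos"],
    "Gulliver": ["NorthernLight"],
    "TeomaAgent": ["Teoma"],
}


def engines_for(user_agent):
    """Return the search engines whose robot token appears in user_agent."""
    ua = user_agent.lower()
    found = []
    for token, engines in ROBOT_TOKENS.items():
        if token.lower() in ua:
            found.extend(engines)
    return found
```

Calling engines_for with a browser's user-agent string returns an empty list, since none of the tokens appear in it.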

According to The Web Robots FAQ, "A robot is a program that automatically traverses the Web's hypertext structure by retrieving a document, and recursively retrieving all documents that are referenced." The robot doesn't physically visit the site the way a virus might. It accesses the site by requesting documents, much like a Web browser.
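
The definition quoted above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular search engine's robot is written: the names LinkParser, crawl, and SITE are made up for the example, the "Web" is a three-page dictionary held in memory so nothing is fetched over the network, and a queue stands in for literal recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, limit=100):
    """Visit start_url and every page reachable from it, breadth first.

    fetch(url) must return the HTML of a page, or None if it is unavailable.
    """
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue  # nothing to retrieve here, so move on
        visited.append(url)
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)  # resolve relative links
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical three-page site held in memory (URL -> HTML).
SITE = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/">home</a> <a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/missing.html">gone</a>',
}
print(crawl("http://example.com/", SITE.get))
```

The seen set keeps the robot from requesting the same document twice, which matters because pages routinely link back to each other.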

There are hundreds, if not thousands, of robots sweeping across the Internet 24/7/365. Their goal is the same: track down information and return it to the sender. Most robots are useful. They feed data to search engines like Google. Many of them are used to map the Internet. Some do a combination of mapping and indexing. A partial list of the robots (user agents) used by search engines is shown in the table above. A more complete list can be found at the Web Robots Pages.

There are also some robots that have a more sinister purpose. Robots such as EmailSiphon and Cherry Picker, for instance, are spambots that look to harvest e-mail addresses to add to their spam lists. Those robots scan Web pages looking for two things:

  1. E-mail addresses, which are then used as targets for spam.
  2. Hyperlinks, which the robots then follow, beginning the process again.

For the most part, robots are pretty straightforward when it comes to traversing a Web site. But robots don't think for themselves; they are programmed to operate in a precise manner. They don't understand JavaScript, frames, or Flash very well. They can't access password-protected areas. Often, robots can't properly index a site because there are too many bells and whistles. If they encounter any roadblocks, they move on. They don't try to figure out how to go around them.

When a robot visits a Web site it does one of two things:

  1. it looks for the robots.txt file and the robots meta tag to see the "rules" that have been set forth, or
  2. it begins to index the Web pages on your site.
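
The first option in the list above can be tried out with Python's standard urllib.robotparser module. The robots.txt contents and the helper name allowed are hypothetical, chosen only for this sketch; a real robot would download the file from the site root. Keep in mind that the rules are advisory, so a badly behaved robot can simply ignore them.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents; a real robot would download the file
# from the site root instead of using a hard-coded string.
ROBOTS_TXT = """\
User-agent: EmailSiphon
Disallow: /

User-agent: *
Disallow: /private/
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With these rules, allowed("Googlebot", "http://example.com/index.html") is True, while the same robot is refused anything under /private/ and EmailSiphon is refused everything.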

Most, but not all, search engine robots follow the rules in the robots.txt file (more on that later). The robot then scans the visible text on the page, the content of various HTML tags, and the hyperlinks listed on the page. It will then analyze the information and process it according to an algorithm devised by its owner. Depending on the search engine, the information is indexed and sent to the search engine's database. Google's process, for instance, takes two steps:

  • Googlebot is the actual robot that traverses the Web site. It fetches the pages and sends them to the indexer.
  • The indexer sorts through every word on the page and stores the words in a database.
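
To make the indexing step more concrete, here is a small Python sketch of an inverted index, a common way to record which pages contain which words. It is not a description of Google's actual indexer; the function build_index and the sample PAGES dictionary are assumptions for illustration.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


# Hypothetical fetched pages (URL -> visible text), standing in for what
# the robot hands to the indexer.
PAGES = {
    "http://example.com/": "Robots crawl the Web",
    "http://example.com/a.html": "Spiders and robots index pages",
}
index = build_index(PAGES)
print(sorted(index["robots"]))
```

Looking up a word then becomes a single dictionary access, which is why a search can be answered without rereading every page.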

The graphics below show how a robot typically works (courtesy of Kent State University Library).


[Figure: How robots work]


You can tell if your site has been visited by robots by viewing your access logs. For a more detailed discussion, see "SpiderSpotting: When a Search Engine, Robot or Crawler Visits." Another good source is "Building a Better Spider-Trap."
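
One simple way to spot robots in an access log is to look for requests for /robots.txt, since ordinary browsers rarely ask for that file. The Python sketch below assumes the log uses the common "combined" format written by many Web servers; the pattern LOG_LINE, the function robot_agents, and the sample entries are all made up for illustration and may need adjusting for your server.

```python
import re
from collections import Counter

# Assumed layout (combined log format):
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"$'
)


def robot_agents(lines):
    """Count requests for /robots.txt per user agent."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("path") == "/robots.txt":
            hits[match.group("agent")] += 1
    return hits


# Made-up sample entries for illustration only.
SAMPLE = [
    '192.0.2.1 - - [10/Oct/2008:13:55:36 -0700] "GET /robots.txt HTTP/1.0" 200 68 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '192.0.2.1 - - [10/Oct/2008:13:55:37 -0700] "GET /index.html HTTP/1.0" 200 2326 "-" "Googlebot/2.1 (+http://www.google.com/bot.html)"',
    '198.51.100.7 - - [10/Oct/2008:14:01:02 -0700] "GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"',
]
print(robot_agents(SAMPLE))
```

Robots that never request /robots.txt will not show up in this count, so it is a starting point rather than a complete census.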

If your Web site is online, robots will eventually come to visit. To expedite the process, most of the crawler-based search engines allow links to be submitted manually. That doesn't mean you will be listed immediately or even the next day; it can take several weeks (or even months). However, it can help speed up the process because the robots will not have to go looking for you.


Contents:
1. Getting Up to Speed with Webbots
2. Gaining Control Over Robots





