Monday 13 April 2015

How Search Engines Process Links

Ever wondered how search engines crawl, analyze, index, and rank pages? Columnist Jenny Halasz has created a helpful primer on the link graph to answer these questions.

Have you ever wondered why 404s, rel=canonical tags, noindex, nofollow, and robots.txt work the way they do? Or have you never been quite clear on how they all work? To help you understand, here is a very basic interpretation of how search engines crawl pages and add links to the link graph.

The Simple Crawl

The search engine crawler (let’s make it a spider for fun) visits a site. The first thing it collects is the robots.txt file.
Let's assume that file either doesn't exist or says it's okay to crawl the whole site. The crawler collects information about all of the site's pages and feeds it back into a database. Strictly speaking, that database is a crawl scheduling system, which de-duplicates pages and sorts them by priority to be indexed later.
While it’s there, it collects a list of all the pages each page links to. If they’re internal links, the crawler will probably follow them to other pages. If they’re external, they get put into a database for later.
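To make that flow concrete, here is a minimal sketch of such a crawl in Python: it checks robots.txt, fetches each allowed page, queues internal links to follow, and stores external links for later. The requests and BeautifulSoup libraries, the page limit, and the crawl helper are illustrative assumptions, not how a production search engine crawler is built.

import urllib.robotparser
from collections import deque
from urllib.parse import urljoin, urlparse

import requests                 # assumed third-party libraries; any HTTP
from bs4 import BeautifulSoup   # client and HTML parser would do

def crawl(start_url, max_pages=50):
    """Toy crawl: check robots.txt, fetch allowed pages, follow internal
    links, and save external links for later link-graph processing."""
    root = urlparse(start_url)
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{root.scheme}://{root.netloc}/robots.txt")
    robots.read()   # a missing robots.txt is treated as "crawl everything"

    queue = deque([start_url])
    seen = set()          # crude de-duplication of already-scheduled URLs
    external_links = []   # stored in the "database" for later processing

    while queue and len(seen) < max_pages:
        url = queue.popleft()
        if url in seen or not robots.can_fetch("*", url):
            continue
        seen.add(url)

        html = requests.get(url, timeout=10).text
        for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == root.netloc:
                queue.append(link)                   # internal: follow it
            else:
                external_links.append((url, link))   # external: save for later

    return seen, external_links

A real crawler would also handle fetch errors, politeness delays, canonical URLs, and noindex/nofollow directives, which is where the tags mentioned above come into play.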

Processing Links

Later on, when the link graph gets processed, the search engine pulls all those links out of the database and connects them, assigning relative values to them. The values may be positive, or they may be negative. Let’s imagine, for example, that one of the pages is spamming. If that page is linking to other pages, it may be passing some bad link value on to those pages. Let’s say S=Spammer, and G=Good:
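As a rough sketch of that idea, here is a simplified, PageRank-style value propagation in Python, in which a page flagged as spam passes a negative share of value to the pages it links to. The score_link_graph helper, the damping factor, and the spam penalty are illustrative assumptions only, not how any actual search engine computes link value.

def score_link_graph(links, spam_pages, iterations=20, damping=0.85):
    """Toy link-value propagation: each page splits its current value among
    the pages it links to, and a page flagged as spam passes on a negative
    share instead of a positive one."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    value = {p: 1.0 for p in pages}   # every page starts out equal

    for _ in range(iterations):
        incoming = {p: 0.0 for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            share = value[source] / len(targets)
            for target in targets:
                # a spam source hands its neighbours harm, not benefit
                incoming[target] += -share if source in spam_pages else share
        value = {p: (1 - damping) + damping * incoming[p] for p in pages}

    return value

# Using the labels above: S is the spammer, G1 and G2 are good pages.
graph = {"S": ["G1", "G2"], "G1": ["G2"], "G2": ["G1"]}
print(score_link_graph(graph, spam_pages={"S"}))

Run on that tiny graph, the good pages that receive links from the spammer end up with lower scores than they would have without it, which is the intuition behind bad link value being passed along.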

