'Link Popularity Made Simple (sort of...)'
by Steve Clason of TopDog Strategy
This article explains link
popularity (specifically Google's use of link popularity)
as it pertains to search engine rankings. The explanation
requires some mathematics, but you don't need a strong
mathematical background to understand it -- high school
algebra will get you through. You won't find specific
methods for improving your search engine rankings, but
reading this may help you select the more productive
options offered to you for improving rankings.
Ever since Yahoo! replaced Inktomi with Google, and its
link-oriented ranking algorithm, as its search engine,
the Web has been abuzz with the importance of
"link popularity". It is now "crucial" that we attend to
our links. We read, over and over, sentences like, "To
push your site towards the top of search results, it's
important to have many web sites point or link to your
site." Further, we're told that some links are better
than others, that (for instance) links from .gov or .edu
sites matter more than those from cousin Sally, that
free-for-all link sites don't matter at all (or that they
help immensely)... and on and on.
This buzz, and the information attending it, comes from
sources of varying credibility, but almost all offer a
specific product or service taking advantage of the
information. Oddly missing has been a popular explanation
of how ranking algorithms employ link analysis in their
calculations -- an explanation that would provide the
background information allowing us to judge the
credibility of the offers.
That lack is odd because of all the features of search
engine ranking algorithms, link popularity may be the
easiest to understand. Not because it is the least
complicated, but because it has been so thoroughly
described by its developers. Descriptions of specific
implementations of the technique are widely published in
academic circles, and are available on the Web. This
article stems from one of those publications.
Links Measure Importance
Before diving into the details, let's gloss the basics.
Link "topology" is really a better term than "popularity"
because the technique treats 'interconnectedness' as well
as count. Whatever you call it, the analysis measures a
page's importance. That's different from relevance.
While relevance means how well the content of a Web
page pertains to a specific query string, "importance"
refers to a page's worth, without regard to the content. A
link to a page makes a statement about the page's
perceived worth and so confers some value on it. The more
value conferred through links, the more important the page
is.
But we don't all have the same value to confer. Some of
our pages are more important than others, and a link from
an important page confers more value than a link from an
unimportant one.
So: An important page is one that has important
pages pointing to it. Circular reasoning?
Absolutely—but easy to grasp intuitively. For instance, a
link to your site from the National Institute of Standards
and Technology ought to confer more value than a link from
your cousin Sally. Not because Sally is partial, but
because NIST is more important, in some vague way that we
nevertheless seem to understand.
How Importance Is Measured
Although easy enough to grasp with two or three pages,
measuring the relative importance of billions of
inter-connected pages seems hopelessly complicated. And
complicated it is, but not hopelessly—it's really pretty
straightforward. It does require a lot of
calculations, but luckily we don't have to invent them. We
can just get them from the academic literature.
While graduate students at Stanford, Larry Page and
Sergey Brin (the inventors of Google the search engine and
the founders of Google the company) published
"The Anatomy of a Large-Scale Hypertextual Search Engine",
which you can download in PDF format.
Their paper describes PageRank, Google's technique for
defining a page's importance on the basis of the pages
that link to it. It is this method that I elaborate in
this article.
Here's the PageRank formula. It looks difficult, but
don't let it intimidate you. We'll take it a little at a
time; patience and high school algebra will get you
through it.
Assume a web page, called A, that has a number of
pages that link to it. Call those pages that link to it
T1, T2, T3, and so on to the last one, called Tn.
No math so far -- this just names things so we can talk
about them later. Think of A as your home page,
and T1 to Tn as other pages on the Web that
contain hypertext links to yours. T2 can even be
your cousin Sally's, if that helps.
The PageRank of A is calculated by this equation:
PR(A) = (1-d) + d [PR(T1)/C(T1) + PR(T2)/C(T2) + PR(T3)/C(T3) + ... + PR(Tn)/C(Tn)]
This looks really nasty. But ripped into its three
components, it's much easier.
PR(A) means the PageRank of A; that's what we
are trying to find out. This just states the problem --
the real work happens on the other side of the equals
sign.
(1-d)+d applies the damping factor, d. Never mind what it
does. Page and Brin usually set d to 0.85, so we'll
set it to that and then forget about it. Although it
probably matters a lot if you run a search engine, it
doesn't matter much for our purposes here. We're just
going to calculate that big mess between the brackets
(these things, [ and ]), multiply it by
0.85 and add that result to 0.15 to stay true to the
formula.
Now, to the big mess between the brackets. If we
reformat this section, like this:
PR(T1)/C(T1) +
PR(T2)/C(T2) +
PR(T3)/C(T3) + ... +
PR(Tn)/C(Tn)
it's easier to recognize T1, T2, and T3 as those pages
that link to A, and (I hope) easy to see that an
exceedingly simple calculation is being done to them. The
apparent complication comes from the quantity of the
calculations, not the difficulty.
PR still means PageRank, just like on the left
of the equals sign, and T1, T2, etc. just name the pages. The
only new thing is the C -- as in C(T2).
This is defined as the number of hypertext links on page
T2: the number of outgoing links. A link on this page that
points to Adventive's home page, for example, is an
outgoing link.
On the receiving end (at Adventive's home page), it is
an incoming link (or an in-link), but on this page, it is an
out-link. Both matter!
Putting back together the three components we earlier
ripped, we can now list a set of steps for applying the
formula to any page:
1. make a list of all the pages which point to the target
   page (we'll call it a link list);
2. for each page on the link list:
   - determine the PageRank;
   - count the number of outgoing links;
   - divide the page's PageRank by the number of its
     outgoing links;
3. sum the results of step 2 for the entire link list;
4. apply the damping factor to the resulting sum.
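The steps above can be sketched in a few lines of Python.
This is only an illustration, not code from Page and Brin's
paper: the function name page_rank_step, the (pagerank,
out_links) pair layout, and the two-page link list are all
assumptions made for the example.

```python
def page_rank_step(link_list, d=0.85):
    """Apply the PageRank formula once to a single target page.

    link_list holds one (pagerank, out_links) pair for each page that
    links to the target. The pair layout is an assumption of this sketch.
    """
    # Divide each linking page's PageRank by its number of outgoing
    # links, then sum the results over the whole link list.
    total = sum(pagerank / out_links for pagerank, out_links in link_list)
    # Apply the damping factor d to the sum.
    return (1 - d) + d * total


# Hypothetical link list: two pages point at the target, one with
# PageRank 1.0 and 4 out-links, the other with PageRank 2.0 and 10.
example_links = [(1.0, 4), (2.0, 10)]
example_rank = page_rank_step(example_links)
print(round(example_rank, 4))
```

The function needs the PageRank of every linking page as
input, which is exactly the chicken-and-egg problem the next
section takes up.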
Making The Calculations
Four steps. Simple enough, but where do we start? In
order to determine PageRank, this says we have to already
know the PageRank of every page on the link list -- have to
know, that is, exactly what we're trying to find out.
But the PageRank formula achieves its results by
repeating the calculation until the results converge on a
stable result. Meaning that we can just start someplace
arbitrary, and everything will work itself out.
This is a remarkably clever idea, which I'm going to
belabor. To demonstrate how it works, I've created a
little Web made up of 10 pages, and we'll employ PageRank
calculations to rank these pages according to their
importance.
The universe looks like this -- the circles are Web
pages, the lines between them are hyperlinks. The arrows
show the direction of the links:
[Diagram: link structure of the 10-page Web]
What a mess, eh? But things clear up fast once we
start the calculations. Stepping through the sequence,
first we make a link list. Here's A's:
A has 6 in-links, from B, E, G, H, I, and J
You can do the rest if you want to follow along
closely.
Next we find the PageRank for each of the pages on this
link list. Of course, at this point we don't know any
PageRanks, so we'll arbitrarily assign each page a
PageRank of 1 for the first iteration of the
algorithm.
Next we count the number of outgoing links for
each page on the link list and divide PageRank by the
result of the count. Using A's link list, we generate this
table:
| Page | PageRank | # out-links | PR/out-links |
| B | 1 | 6 | 0.1667 |
| E | 1 | 4 | 0.2500 |
| G | 1 | 3 | 0.3333 |
| H | 1 | 2 | 0.5000 |
| I | 1 | 4 | 0.2500 |
| J | 1 | 3 | 0.3333 |
| | | Total | 1.8333 |
As the last step, we apply the damping factor by
multiplying the sum (1.8333) by 0.85, getting 1.5583, then
adding (1-0.85), or 0.15, to get 1.7083. After the first
iteration, PR(A)=1.7083.
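If you'd rather let a computer do the arithmetic, the
first-iteration figures for A can be reproduced with a short
Python sketch. The out-link counts come straight from the
table; the variable names are just choices made for this
example.

```python
D = 0.85  # damping factor used throughout the article

# Out-link counts for the six pages on A's link list, from the table.
out_links = {"B": 6, "E": 4, "G": 3, "H": 2, "I": 4, "J": 3}

# First iteration: every page starts with an arbitrary PageRank of 1.
start_rank = {page: 1.0 for page in out_links}

# Each page's contribution is its PageRank divided by its out-link count.
contributions = {page: start_rank[page] / out_links[page]
                 for page in out_links}
total = sum(contributions.values())

# Multiply the sum by the damping factor and add (1 - d).
pr_a = (1 - D) + D * total

for page, share in contributions.items():
    print(f"{page}: {share:.4f}")
print(f"Total: {total:.4f}")
print(f"PR(A): {pr_a:.4f}")
```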
Repeating the steps for each of the ten pages in our
Web delivers the following results, listed in order of
rank (you can check my work by constructing a table for
each of the other 9 pages just like the table I did for A,
if you've a mind to):
PR(A)= 1.7083
PR(J)= 1.4250
PR(G)= 1.2833
PR(H)= 1.0708
PR(C)= 0.8583
PR(D)= 0.8583
PR(F)= 0.7875
PR(I)= 0.7167
PR(E)= 0.5042
PR(B)= 0.3625
OK, looking it over, the list makes a little sense. A
has the most in-links and is listed as the most important,
and B has the fewest and is the least important. But J in
second place seems a little odd, since G has more incoming
links (4 versus 3). So, let's run through the
calculations again.
Second Iteration
We use the same sequence of steps, but this time,
instead of using an arbitrary 1 for the PageRank
value in calculating each page's link list, we'll use the
values in the list above; that is, the results of
the first iteration. So, in calculating A's link list for
the second iteration, we generate this table:
| Page | PageRank | # out-links | PR/out-links |
| B | 0.3625 | 6 | 0.0604 |
| E | 0.5042 | 4 | 0.1261 |
| G | 1.2833 | 3 | 0.4278 |
| H | 1.0708 | 2 | 0.5354 |
| I | 0.7167 | 4 | 0.1792 |
| J | 1.4250 | 3 | 0.4750 |
| | | Total | 1.8039 |
Adjust that total by the damping factor, and PR(A)=1.6833
after the second iteration.
Look what happened this second time through. Taking B
as an example, notice that instead of contributing a value
of 0.1667 to A's accumulating PageRank, this time around B
added only 0.0604. In other words, after the first
go-round, B's importance diminished from the (arbitrary)
starting value of 1, and so its referrals diminished in
value. Once we get past the initial, arbitrary PageRank
of 1, every page starts to contribute according to its own
"importance".
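To see that shift for every page on A's link list at once,
here is a small Python sketch that sets the first-iteration
contributions beside the second-iteration ones. The numbers
are copied from the tables above; the names before and after
are just labels chosen for this example. Because the article
rounds each row before summing, the last digit of the result
can differ slightly from the figure in the text.

```python
D = 0.85  # damping factor

# Out-link counts and first-iteration PageRanks for A's link list,
# copied from the article's tables.
out_links = {"B": 6, "E": 4, "G": 3, "H": 2, "I": 4, "J": 3}
first_pass = {"B": 0.3625, "E": 0.5042, "G": 1.2833,
              "H": 1.0708, "I": 0.7167, "J": 1.4250}

# Contribution of each page to A: first with the arbitrary starting
# rank of 1, then with the rank it earned in the first iteration.
before = {p: 1.0 / out_links[p] for p in out_links}
after = {p: first_pass[p] / out_links[p] for p in out_links}

for p in out_links:
    print(f"{p}: {before[p]:.4f} -> {after[p]:.4f}")

pr_a_second = (1 - D) + D * sum(after.values())
print(f"PR(A) after the second iteration: {pr_a_second:.3f}")
```

Pages that did well in the first pass (G, H, and J) now pass
along more than before, while B, E, and I pass along less.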
I won't go through any calculations in detail (you'll
have to trust that I did them right); here are the
PageRank values (again in order) after the second
iteration of the algorithm:
PR(A)= 1.6833
PR(G)= 1.5442
PR(J)= 1.4870
PR(H)= 1.3335
PR(F)= 1.0502
PR(C)= 0.7731
PR(D)= 0.7173
PR(I)= 0.5361
PR(E)= 0.3537
PR(B)= 0.2572
There's a little more action this time around. G and J
swapped places and that seems intuitively better, as we
mentioned earlier. F went from 7th to 5th,
displacing C and D. Let's see if we can figure that one
out.
Look again at the link structure diagram. C, D, and F
all have 3 in-links, but notice that one of F's comes from
A, while A does not link to either C or D. A being the
most important (highest ranked) page in this universe, a
referral from it carries more weight than a referral from
any other page, so F gets a big boost that C and D lack.
Let's see if anything changes the third time through.
Third Iteration
I'll neglect all the detail, and just go to the
results. PageRank results after 3 iterations are:
PR(A)= 1.8020
PR(G)= 1.6515
PR(H)= 1.4019
PR(F)= 0.9920
PR(J)= 0.9496
PR(C)= 0.7774
PR(D)= 0.7389
PR(I)= 0.6328
PR(E)= 0.3004
PR(B)= 0.2260
Only one change: J drops from 3rd to 5th, moving H and F
up one place each. Why? Notice that G, H, and F all have
links from A, while J does not. Again, having a link from an
important page boosts importance. Compare F and H, both
with 3 in-links. F's links come from A, B, and C, while
H's are from A, C, and F. That single difference -- the
difference between the weight of the contribution of B (at
the bottom of the list) and F (near the top) -- ranks H
higher than F.
Although this universe offers no example of this, you
can see how a single in-link from A could count more than
links from all three pages at the bottom of the list. With a
larger universe, that spread would be even wider, and link
quality would play an even larger part than here, where it
serves mainly to decide the relative rankings of pages
with the same number of in-links.
Fourth Iteration: We're Done!
After 4 iterations, the values differ, but the rankings
remain the same:
PR(A)= 1.7132
PR(G)= 1.5575
PR(H)= 1.4126
PR(F)= 1.0230
PR(J)= 0.9764
PR(C)= 0.8162
PR(D)= 0.7844
PR(I)= 0.6036
PR(E)= 0.3165
PR(B)= 0.2138
The rankings stay this way through several more
iterations, and seem to be stable, so we'll stop
here. This last list contains what we'll call the
"official" PageRank values of our little 10-page
universe.
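The whole repeat-until-stable procedure fits in a short
Python sketch. Two things here are assumptions rather than
anything taken from the article: the four-page toy_web is
made up (it is not the ten-page universe above, whose full
link structure lives in the diagram), and the loop stops when
the values change by less than a small tolerance, which is a
stricter test than watching for the rankings to stop moving.

```python
def pagerank(links, d=0.85, tolerance=1e-6, max_rounds=1000):
    """Repeat the PageRank calculation until the values settle.

    links maps each page to the list of pages it links to. Every page
    is assumed to have at least one out-link, which keeps the division
    below safe.
    """
    rank = {page: 1.0 for page in links}  # arbitrary starting value
    for rounds in range(1, max_rounds + 1):
        new_rank = {}
        for page in links:
            # Sum PR(T)/C(T) over every page T that links to this page,
            # always using the values from the previous round.
            incoming = sum(rank[t] / len(links[t])
                           for t in links if page in links[t])
            new_rank[page] = (1 - d) + d * incoming
        change = max(abs(new_rank[p] - rank[p]) for p in links)
        rank = new_rank
        if change < tolerance:
            break
    return rank, rounds


# A made-up four-page Web, not the ten-page universe from the article.
toy_web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["A", "C"],
}

final_rank, rounds_used = pagerank(toy_web)
for page in sorted(final_rank, key=final_rank.get, reverse=True):
    print(page, round(final_rank[page], 4))
print("rounds:", rounds_used)
```

In this toy Web, D has no in-links at all, so it ends up with
the minimum value of 0.15 no matter how many rounds you run.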
Lessons Learned
Simple, right? Not particularly easy, but once you
understand a small example you can expand that
understanding to encompass the entire Web. You'll never
really get your mind around it because the complexity of
the interrelationships between a billion or so pages is
ungraspable -- we can, though, understand the
calculations, and appreciate how they determine what pages
are "important".
More importantly, for most of us anyway, we can use our
new grasp of the algorithm to make sense of some of the
buzz about link popularity.
Let's start with the quote that opened this article: "To
push your site towards the top of search results, it's
important to have many web sites point or link to your
site." A true statement, as far as it goes. In
general, the more in-links you have the higher you'll be
ranked in the search results. But as we've seen, links
confer different amounts of importance, and a high quality
link can easily outweigh several low quality links.
Now this one: "The quality of the link holds more
'weight' than the quantity of links. You will get better
results in the search engines if you have link popularity
from sites that have considerable traffic." The
statement starts off dead on, then misses the point
completely. Traffic has nothing to do with link
popularity. The "quality" of a site is nothing more than
its PageRank. All links (of whatever "quality") contribute
to the calculation; "quality" links just contribute more.
Traffic contributes nothing.
Or, consider this: "Free for all sites don't boost
your rankings." Mostly true. Although all incoming
links improve your importance, the FFAs, whose many
out-links dilute whatever importance they might have, will
contribute very little. Also, to the extent they are
considered spam, search engines may keep them out of their
index -- a link from a non-indexed site contributes
nothing.
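A quick calculation shows how strong that dilution is. In
this Python sketch both linking pages and their numbers are
hypothetical: the two pages are given the same PageRank, and
only the count of links on each page differs.

```python
D = 0.85  # damping factor


def contribution(pagerank, out_links, d=D):
    """Value a single in-link adds to the target page's PageRank."""
    return d * pagerank / out_links


# Both linking pages are hypothetical and share the same PageRank;
# only the number of links on the page differs.
ordinary_page = contribution(pagerank=1.0, out_links=10)
ffa_page = contribution(pagerank=1.0, out_links=1000)

print(f"ordinary page: {ordinary_page:.5f}")
print(f"FFA page:      {ffa_page:.5f}")
```

With a hundred times as many out-links, the FFA page passes
along a hundredth as much.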
And lastly, how about this one: "Links from .gov and
.edu sites are better than links from your cousin."
Maybe. The .gov or .edu by themselves confer no
additional authority, but these sites might be more likely
than your cousin's to have lots of in-links themselves,
and so have a high PageRank. That alone makes a link
"better".
The main lesson here seems to be the same old one. If
you want the high search rankings that result from a high
PageRank (or whatever other link popularity algorithm a
search engine uses), pay attention to your content.
Easy-to-navigate pages with high quality content will
attract links from Webmasters augmenting their sites by
linking to yours, and it is those incoming links, hopefully
from sites that are themselves important, that will boost
your page's importance.
Steve Clason is a freelance writer and the owner of TopDog
Strategy, a consulting firm. He lives in Boulder,
Colorado, and can be reached at steve@clason.org.