Search Score and Item Popularity

Mario Alemi
Jan 6, 2021

Back in the 1990s, Google’s business model was selling private search engines to companies for their intranets. They soon discovered it didn’t work. Google’s original algorithm measured how important a page was by looking at the centrality of that page in the network of the world wide web –roughly, how many links were pointing to that page, to the pages pointing to that page, and so on.

In an intranet, such information is useless: the network of web pages is small, and very few people create links to pages they find interesting and valid.

Why was Google computing the centrality? Because the centrality of a node in a network can be seen as the expected number of times the node is visited by a random surfer. And this information tells you how important the page is.

How was the centrality computed? Imagine you are on a page with 4 links. You produce a random number between 1 and 4, which tells you where to go next. On the new page you do the same, and so on. The more often this random walk visits a page, the higher that page’s centrality.
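Here is a minimal sketch of that random-surfer idea, assuming a toy link graph (the page names are hypothetical, and there is no damping or teleport factor as in full PageRank):

```python
import random

# Toy intranet: each page lists the pages it links to (hypothetical graph).
links = {
    "home":    ["news", "docs", "careers"],
    "news":    ["home"],
    "docs":    ["home", "careers"],
    "careers": ["home"],
}

def random_surfer(links, steps=100_000, seed=42):
    """Estimate each page's centrality as its share of random-surfer visits."""
    rng = random.Random(seed)
    page = rng.choice(list(links))
    visits = {p: 0 for p in links}
    for _ in range(steps):
        visits[page] += 1
        page = rng.choice(links[page])  # follow one outgoing link at random
    return {p: count / steps for p, count in visits.items()}

print(random_surfer(links))
# "home", being linked from every other page, collects the most visits.
```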

Because surfers in an intranet do not surf at random, but usually look for specific information, the centrality of a page tells us little about it. In addition, we don’t really need to estimate such information, because we already have it: for each page, we know how often, and for how long, our surfers visit it. We can therefore use the actual number of visits to the page instead of the number estimated through the centrality.

This can help us provide better results. If a user searches for “How to esign” and there are only pages about “How to resign” and “How to design”, both pages are equally good candidates. But if the first is visited only 20 times a year and the second 10,000 times, any good algorithm should propose the page about “design” as the best candidate.

Let’s visualise the probability space of a search:

  • The black circle is the space of probability of the pages.
  • Each section inside the circle represents a possible result: the bigger the section, the higher the probability that a random user (e.g. someone throwing darts) chooses that page.

With a query, the user will filter the search space down to a region, but for the moment let’s focus on the size of the sectors in the probability space.

We want to compute a probability density function (PDF) for the likelihood that page R might be the desired page when the user didn’t provide any search query. Something like “How likely is it that this page would interest someone who stumbles upon my website?”. And we want to use previous visits as a proxy to compute this likelihood. We call that:

P(r|{n, N}) = likelihood that page R elicits an interest of value r , computed from having observed in the past n visits to that page out of N total visits to the website.

We cannot use a simple proxy such as

P(r|{n, N}) = n/N

because this would mean that a page which has never been visited could never be found interesting.

To compute our likelihood, we start with the Bayes formula:

P(r|{n, N}) ∝ P({n, N}|r) •P(r)

It says that the likelihood of the page being of interest r is proportional to the probability of observing n visits out of N given that the interest in the page is r, times the a priori probability P(r) of that interest.

The value of P(r) is –if we do not make assumptions like “The company is on the verge of bankruptcy and people will look for How to resign even if the page was seldom visited in the past”– a uniform function between 0 and 1. The page, as far as we know, might be very interesting or of no interest at all –we don’t know, and that’s why we are going to look at previous visits.

P({n, N}|r) is more interesting. The probability of observing n successes in N trials is given by the binomial distribution, i.e.

P({n, N}|r) ∝ rⁿ•(1-r)⁽ᴺ⁻ⁿ⁾

Note that if n=0, i.e. the page is never visited, the PDF becomes (1-r)ᴺ: a curve that is maximal at r=0 and decays towards 0 faster and faster as N grows.

That means that the more visits the site receives without anyone ever landing on that page, the more confident we become that the page is not interesting.

In general, the more data we have, the higher our confidence. If the page is visited 8 times out of 10, or 80 times out of 100, we become increasingly confident that the page has an interest of about 0.8: the PDF peaks around the same value, but becomes narrower.
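To see this numerically, here is a small sketch that evaluates the unnormalised PDF rⁿ(1-r)⁽ᴺ⁻ⁿ⁾ on a grid for a few (n, N) pairs, including the never-visited case:

```python
# Summarise the unnormalised PDF r^n (1-r)^(N-n) on a grid of r values.
def interest_summary(n, N, grid=1001):
    rs = [i / (grid - 1) for i in range(grid)]
    weights = [r**n * (1 - r) ** (N - n) for r in rs]
    total = sum(weights)
    mean = sum(r * w for r, w in zip(rs, weights)) / total
    var = sum((r - mean) ** 2 * w for r, w in zip(rs, weights)) / total
    return mean, var**0.5

for n, N in [(0, 10), (0, 100), (8, 10), (80, 100)]:
    mean, std = interest_summary(n, N)
    print(f"n={n:3d}, N={N:3d}: mean≈{mean:.2f}, std≈{std:.2f}")
# The never-visited page's expected interest shrinks towards 0 as N grows,
# and the 80/100 curve is much narrower than the 8/10 one (std ≈ 0.04 vs ≈ 0.12).
```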

The PDF of our page eliciting an interest of value r, given the observed visits, is:

P(r|\{n, N\}) \propto P(\{n, N\}|r) \cdot P(r) = r^n \cdot (1-r)^{N-n} \cdot 1

Although we could use the PDF itself to compute the final score of the page, it is better to compute the expected value of the interest r given the visits:

E(r) = \frac{\int_0^1 r^{n+1} \cdot (1-r)^{N-n} dr}{\int_0^1 r^n \cdot (1-r)^{N-n} dr}

The denominator is just the normalisation of the PDF (which we have left unnormalised), and it will also be useful for simplifying the result. The indefinite integral of the numerator solves to:

\int r^{n+1} (1-r)^{N-n} dr = \frac{r^{n+2} \, {}_{2}F_1(n+2,\, n-N;\, n+3;\, r)}{n+2}

Where ₂F₁ is the hypergeometric function. But we are actually interested in the definite integral between 0 and 1, and luckily, WolframAlpha also helps us compute the numerator…

\int_0^1 r^{n+1} (1-r)^{N-n} dr = \frac{\Gamma(n+2)\, \Gamma(N-n+1)}{\Gamma(N+3)} = \frac{(n+1)!\, (N-n)!}{(N+2)!}

…and the denominator:

\int_0^1 r^n (1-r)^{N-n} dr = \frac{\Gamma(n+1)\, \Gamma(N-n+1)}{\Gamma(N+2)} = \frac{n!\, (N-n)!}{(N+1)!}
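Dividing the numerator by the denominator, the (N-n)! factors cancel and only a ratio of consecutive factorials survives:

E(r) = \frac{(n+1)!\, (N-n)!}{(N+2)!} \cdot \frac{(N+1)!}{n!\, (N-n)!} = \frac{(n+1)!}{n!} \cdot \frac{(N+1)!}{(N+2)!}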

Which finally gives us:

E(r) = \frac{n+1}{N+2}

Admittedly that’s a simpler result than one would expect after the initial integral, but a very sensible one: for good statistical samples the expected interest r grows like n/N, but as long as we only have very few counts, all pages are, in practice, equally important.

Note that one could say: “A single visit says little about the validity of the page. Unless we have 100s of visits for a page, I don’t want to favour this page against the others.” This would be an antidote against the Matthew principle (the rich get richer), which favours pages that, maybe by chance, started with more visits than others. In this case we can transform the value to:

E(r) = \frac{n+1000}{N+1000}

In this way, if I have 100 pages where 99 received no visits and one received 100 visits, the (not normalised) expected desire for the latter is 1,100/1,100=1, while for the former ones it is 1,000/1,100≈0.9. Alternatively, I could counteract the emerging power law in the number of visits by taking log(n) and log(N).
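As a sketch of how these variants behave on the example above (the visit counts are the hypothetical ones just described):

```python
import math

def expected_interest(n, N):
    """Bayesian estimate of a page's interest: (n + 1) / (N + 2)."""
    return (n + 1) / (N + 2)

def damped_interest(n, N, damping=1_000):
    """Variant that keeps rarely-visited pages competitive."""
    return (n + damping) / (N + damping)

N = 100                  # total visits: one page got them all, 99 pages got none
for n in (100, 0):
    print(n, round(expected_interest(n, N), 3),
             round(damped_interest(n, N), 3),
             round(math.log(n + 2), 3))     # the log alternative mentioned above
# Plain estimate: 0.99 vs 0.01 -- the unvisited pages are crushed.
# Damped estimate: 1.0 vs 0.909 -- they stay in the running.
```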

Using the User’s Query

Now we want to compute the likelihood that a page is the desired one given a user’s query.

For that, we will use the concept of distance between the query and the page. Because all pages are different, any two of them are at a distance d > 0 from each other.

While a distance in general has no upper bound, in our case it would be good to have something bounded, e.g. the cosine distance, which is built from the cosine similarity (a value in [-1, 1]). Once we have normalised the distance to [0, 1], we can define the similarity s as:

s = 1 - d

The similarity s goes ideally from 0 (nothing in common) to 1 (the query coincides with the page).
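As a minimal sketch of one such similarity, here is a cosine similarity over character-trigram counts (the trigram representation is just an assumption; any representation with a bounded distance would do). Since the counts are non-negative, the similarity already lies in [0, 1]:

```python
from collections import Counter
from math import sqrt

def trigrams(text):
    """Character 3-gram counts of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def similarity(query, page_text):
    """Cosine similarity between trigram vectors: 0 = nothing in common, 1 = identical."""
    a, b = trigrams(query), trigrams(page_text)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

print(similarity("how to esign", "How to design"))   # high (≈ 0.76)
print(similarity("how to esign", "How to resign"))   # high (≈ 0.76)
print(similarity("how to esign", "Holiday policy"))  # 0.0
```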

Finally, we can say that the probability that a page R is the desired one, given a similarity s between the query and the page and n visits to the page out of a total of N, is:

P(R|s, {n, N}) ∝ P(s|R, {n, N}) • P(R|{n, N})

P(s) depends on neither n nor N. In addition, for all we know, the probability that a user writes a query with similarity s to a page, if they want that page, is… s itself. If the user wants a page about resignation, they will write something as similar as possible to “resignation”.

Therefore, the last formula translates to:

P(R|s, {n, N}) ∝ s • (n+1)

Note that it would be nice if the pages formed a basis of the space in which we represent the query: in that case we could use the component of the normalised query along each basis vector (page) as the similarity.

In this case, if a page has similarity s=1 (i.e. distance = 0) with the query, it will be the only choice, because all the others will have similarity 0. Otherwise, the likelihood that a page is the right one is proportional to its similarity to the query times the smoothed visit count n+1.

Because pages are not orthogonal to each other, it’s important to exclude all pages which we believe are not the ones the user is asking for. For instance, if we use n-gram cosine similarity, we can keep only pages with a similarity greater than, say, 0.8, and only then multiply by n+1.
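Putting the pieces together, here is a sketch of that ranking step. The pages, visit counts and threshold are illustrative, and difflib’s ratio stands in for the similarity measure just to keep the block self-contained:

```python
from difflib import SequenceMatcher

def similarity(query, text):
    """Any similarity in [0, 1] works; difflib's ratio is used here for brevity."""
    return SequenceMatcher(None, query.lower(), text.lower()).ratio()

def rank(query, pages, threshold=0.8):
    """Keep pages with similarity >= threshold, then score them as s * (n + 1)."""
    scored = []
    for page_id, (text, visits) in pages.items():
        s = similarity(query, text)
        if s >= threshold:
            scored.append((s * (visits + 1), page_id))
    return sorted(scored, reverse=True)

# Hypothetical pages with their past visit counts.
pages = {
    "resign": ("How to resign", 20),
    "design": ("How to design", 10_000),
    "policy": ("Holiday policy", 500),
}
print(rank("how to esign", pages))
# Both "resign" and "design" pass the 0.8 filter (s = 0.96 for each),
# but "design" wins thanks to its far larger visit count.
```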

Using Logarithms

As briefly mentioned above, one could consider using log(n) instead of n.

The reasoning behind this is that the ranking of pages by popularity roughly follows a power law:

1ˢᵗ page visits = nₒ
2ⁿᵈ page visits = nₒ/2
3ʳᵈ page visits = nₒ/3

or:

n = nₒ / R  =>  log(n) = log(nₒ) - log(R)

That means that for a page to climb the popularity ranking, it must multiply its visits, not just add a few. What matters in page visits (or product sales) is not the actual number, but its logarithm. It’s like money: if your income is €10,000 a year, you would be happy with a €10,000 raise; if your income is €1M, €10,000 would mean close to nothing.

Therefore, although the probability of being chosen grows with n –which makes sense– the importance of a page does not grow with n, but with its logarithm, and we can rewrite our search score as:

search score = s • log(n + 2)
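As a final sketch, compare the linear and the logarithmic weighting for two equally similar pages (the similarity 0.9 and the visit counts are just illustrative):

```python
from math import log

def search_score(s, n):
    """Similarity times log-damped popularity."""
    return s * log(n + 2)

for n in (100, 10_000):
    print(n, round(0.9 * (n + 1), 1), round(search_score(0.9, n), 1))
# Linear weighting makes the popular page ~100x stronger;
# the log version keeps the gap at roughly a factor of two.
```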


Mario Alemi

Author of “The Amazing Journey of Reason, from DNA to Artificial Intelligence” (https://www.amazon.com/dp/B082D6BYT6 –free digital ed.)