Looking For

Using Supervised Learning to Improve Url Ordering in Autocomplete

Nisheeth Ranjan (nisheeth at cs dot stanford dot edu)
Andrew Ng (ang at cs dot stanford dot edu)
Computer Science Department, Stanford University
Stanford, CA 94305 USA

Abstract

The autocomplete feature in web browsers uses what the user has typed in the url bar as a prefix string to query the user's web traversal history and presents the matching urls in an autocomplete list. If one of the urls in the autocomplete list is where the user wants to go, then he/she can directly select it from the list instead of having to type out the entire url. Mozilla, the leading open source web browser, sorts its autocomplete list using a hardcoded algorithm that uses features of the url like frequency, whether it is a url with a trailing slash, etc. We try to improve on Mozilla's ordering by using a simple softmax like probabilistic model that computes the probability of selection of a url in an autocomplete list. We order the urls in the autocomplete list in descending order of their probability of selection. We design some scoring metrics to compare the performance of the Mozilla and Softmax orderings on a data set that we collected from the Mozilla community for this research. We try both offline and online learning approaches to fit the parameters of our model and find that online learning does marginally better. Our final result is that the Softmax ordering saves the user an average of approximately one keystroke per autocomplete event.

Introduction

We model the autocomplete problem as follows. Say an autocomplete list is composed of N urls. We assume that the probability that the i'th url gets selected by the user in that autocomplete list is given by the formula below:

      (1)

In the equation above, x is the n-dimensional input vector, and θ is the n-dimensional parameter vector. Assuming that there are m separate autocomplete events, the log likelihood of our parameters is given by:

      (2)

In the above equation, y(i) is the index of the selected url in the i'th autocomplete event. l(θ) turns out to be a concave function in θ and Newton's method can be used to find the value of θ that maximizes l(θ).

Once we've found the maximum likelihood estimate of θ on a training set, we can use (1) to find the probability of user selection for each url in the autocomplete list and order the urls in descending order of that probability.

Offline learning

In order to learn θ, the parameters in our model, we needed data on how urls in autocomplete lists are chosen by current browser users. To this end, we recruited approximately 250 Mozilla users to run an instrumented version of Mozilla that captured their url traversal and autocomplete usage information to a local data file. After a period of two weeks to a month, we asked them to submit their data files back to us.

We then wrote a perl script that took each data file as input and generated an output file containing an Nix1 matrix of feature vectors for the i'th autocomplete event in the data file, i ranging from 1 to m. For example, here's a snippet from the output file for a single autocomplete event:

slashdot.org/, 1045072282056876, 1, 1, 3.00, 3.00, 3.01, 3.03, 3.10, 3.26, 3.68, 4.94, 8.99, 22.00, 1.00, 1.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 3.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.00, 0.00, 18.00, 1.00, 2.00, 0.00, 14.00, 0.00, 1.00, 2.00, 1.00, 16.00, 12.00, 24.00, 8.00, 0.00, 0.00, 0.00, 0.00
slashdot.org/pollBooth.pl?qid=926&aid=2, 1045161387935626, 0, 0, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, 0.06, 0.26, 1.00, 1.00, 1.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 3.00, 1.00, 1.00, 2.00, 0.00, 0.00, 2.00, 4.00, 32.00, 1.00, 1.00, 0.00, 11.00, 0.00, 1.00, 1.00, 1.00, 12.00, 12.00, 46.00, 34.00, 0.00, 0.00, 0.00, 0.00
slashdot.org/pollBooth.pl?qid=927&aid=2, 1045251640566251, 0, 0, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.01, 0.07, 0.29, 1.00, 1.00, 1.00, 0.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 1.00, 0.00, 0.00, 0.00, 0.00, 3.00, 1.00, 1.00, 2.00, 0.00, 0.00, 2.00, 4.00, 32.00, 1.00, 1.00, 0.00, 11.00, 0.00, 1.00, 1.00, 1.00, 12.00, 12.00, 46.00, 34.00, 0.00, 0.00, 0.00, 0.00
No_url, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00


The above output file describes an autocomplete event where the autcomplete list contained three urls and the first url was selected by the user (the third comma separated field denotes whether the url was selected by the user). There are 54 comma separated fields per url line and the following table describes what those fields are:

Field number

Description of Field

1

Url

2

Url ID

3

User selected the url

4

The url contains a trailing slash

5

Generated frequency-recency metric (0.1)

6

Generated frequency-recency metric (0.2)

7

Generated frequency-recency metric (0.3)

8

Generated frequency-recency metric (0.4)

9

Generated frequency-recency metric (0.5)

10

Generated frequency-recency metric (0.6)

11

Generated frequency-recency metric (0.7)

12

Generated frequency-recency metric (0.8)

13

Generated frequency-recency metric (0.9)

14

Generated frequency-recency metric (1.0)

15

Mozilla captured frequency-recency metric (0.2)

16

Mozilla captured frequency-recency metric (0.8)

17

Url typed by user

18

Url ends in .htm or .html

19

Is it a .com url?

20

Is it a .edu url?

21

Is it a .org url?

22

Is it a .net url?

23

Is it a .gov url?

24

Url contains a ~

25

Url starts with http://

26

Url starts with ftp://

27

Url starts with file://

28

Url starts with https://

29

Url ends in two letter country code?

30

Number of /s in the url

31

Number of ?s in the url

32

Number of &s in the url

33

Number of =s in the url

34

Number of #s in the url

35

Number of +s in the url

36

Number of .s in the url

37

Number of numerical [0-9] chars in url

38

Number of alphabetical [a-zA-Z] chars in url

39

Number of non-alphanumeric, non-[/?&=#+.] chars in url

40

Number of .s in the hostname

41

Number of numerical [0-9] chars in hostname

42

Number of alphabetical [a-zA-Z] chars in hostname

43

Number of non-alphanumeric, non-[/?&=#+.] chars in hostname

44

Number of .s in hostname if we omit initial "www." or "ftp."

45

Number of .s in hostname if we omit ending ".XX" country code

46

Number of .s in hostname if we omit initial "www." and ending ".XX" country code

47

Number of characters in url

48

Number of characters in hostname

49

Number of characters in hostname excluding initial "www." or "ftp."

50

Number of characters in URL excluding hostname

51

Number of characters in web page title

52

Is this a google search url?

53

Is this a netscape search url?

54

Is this a yahoo search url?


To relate these fields back to (1), x(i) is the 51 dimensional vector containing the numeric values from field 4 through field 54 for the i'th url in the autocomplete event and θ is the 51 dimensional parameter vector that we want to learn.

We used Newton's method to find the parameter vector, θ, that maximized the log likelihood function, (2). We devised a simple scoring metric to compare the Mozilla ordering with the Softmax ordering. An ordering got a score of 1/i on an autocomplete event where i was the position (counting from the top of the list) of the user selected url.

Initial Results for Offline Learning

The Mozilla ordering scored 3609.23 out of 4343 (83.1%) and beat the Softmax ordering which scored 3460.85 out of 4343 (79.69%) using the autocomplete lists we got back in the data files.

Simulating Browser History

We reasoned that the autocomplete lists we had used for our training were captured after the user had typed enough characters in the url bar to make the url they wanted to select bubble up to the top in each list. We wondered whether we could show that the Softmax ordering does better than the Mozilla ordering while the first few characters are being typed. To find out, we extended our perl script to take the number of characters typed in the url bar, C, as an input parameter and generate output files where each autocomplete list was built by matching urls from the set of all urls visited before. Essentially, we simulated browser history by storing visited urls in a hash and generating the autocomplete list that the browser would have generated when C prefix characters of the user selected url were typed in the url bar.

We created four data sets for the 1 character typed, 2 characters typed, 3 characters typed, and 4 characters typed scenarios. We trained on approximately 70% of that data and tested on the remaining 30%. Figure 1 shows our results. The Softmax ordering consistently did better than the Mozilla ordering.



Figure 1

A Better Scoring Metric

It was hard for us to see the relationship between score percentage numbers and actual utility to the user. So, we decided on measuring user utility more directly. From a user's perspective, the important metric is the number of keys typed to get to the next url. We extended our perl script to
  1. simulate typing a url character by character,
  2. create a Mozilla and Softmax ordering of the autocomplete list after each character is typed, and
  3. keep track of the minimum number of keystrokes needed to access the user's url.
The number of keystrokes were calculated as follows: if the user's url ended up in position 5 (from the top) in the ordering after 2 characters are typed in the url bar, the number of keys needed to get to it are 7 (2 characters typed and 5 presses of the down arrow key).

Final Results for Offline Learning

On running our minimum keys needed simulation on the entire data set, we found that the Softmax ordering saved the user 13171 keystrokes over 14647 autocomplete events, an average savings of 0.90 keystrokes per event.

Online Learning

Till now, we'd kept our parameter vector, θ, fixed while scoring the Softmax orderings on the test set. We now decided to explore whether an online learning paradigm where the parameters are updated to adapt to an individual user's preferences would improve the ordering further. We changed our perl script yet again to update θ after each autocomplete event such that we take a stochastic gradient ascent step towards maximizing log(P(y = i; θ)), where P is given by (1).

We created a subset of the data set to test different learning rates and saw that a learning rate of 0.0001 yielded the lowest number of keys needed by the Softmax ordering to select the user's url in all the autocomplete events contained in the data subset. Figure 2 shows how the different learning rates performed on the data subset. We also computed a heuristic that would use a higher learning rate at autocomplete events where the user selected url was high on the Softmax ordered list. To implement this heuristic, we computed a multiplier as follows:

If the minimum chars needed by softmax (M)   < 10
      multiplier = 10 - M
else
      multiplier = 1/M


We then used this multiplier in the stochastic gradient ascent step. For autocomplete events where the user's url could be accessed in less than 10 keystrokes, the multiplier was a positive integer and magnified the update of the parameter vector. For events where the user's url was accessed in 10 or more keystrokes, the multiplier was a positive number between 0 and 1 and reduced the update of the parameter vector.



Figure 2

Final Results for Online Learning

On running the online learning simulation on the entire data set, we found that the Softmax ordering saved the user 13882 keystrokes over 14647 autocomplete events, an average savings of 0.95 keystrokes per event.

Conclusion

We were able to show that url ordering using an offline learning paradigm saves the user an average of 0.90 keystrokes per autocomplete event while an online learning paradigm does marginally better and saves the user an average of 0.95 keystrokes per autocomplete event as compared to the current Mozilla algorithm for url ordering.