Task Identification Using Search Engine Query Logs
|Li Quan Khoo|
|University College London|
|Department of Computer Science|
- 1 Abstract
- 2 Introduction
- 3 Literature review
- 4 Summary of state of the art
- 5 Challenges ahead
- 6 Conclusion
- 7 Glossary
- 8 Bibliography
Given a set of strings from a search engine query log, a human can deduce or guess (with varying accuracy) which strings are related and what motivated the user(s) to issue them. If a search engine could perform an equivalent operation, it might be able to provide information or services of more relevance to the user than merely the pages pertaining to the individual queries. This requires the search engine to group related queries together and to understand the relevant semantics of the query strings. This paper is a literature review of the current state of the art on this problem, which we formally introduce here as task identification using search engine query logs.
In publicly and commercially available web technology today, the quality of a search engine is judged by its ability to return the most relevant resources (usually webpages) for a user's query. However, users tend to issue a set of queries to solve a task they have in mind, whereas search engines return results based on individual query strings. For example, given the set of query strings:
- Flights to Wyoming
- Hotels in Wyoming
A human shown the same queries might guess that the user is going to Wyoming at some point and offer help with travel. Search engines are incapable of making this guess and consequently only answer the user's task in parts, resulting in unnecessary query reformulations of all kinds.
Of the roughly 2 billion searches made per day (2009 data), about 34% are reformulations, which indicates that search engines as they currently behave do not help users solve their tasks very well.
The solution would be a search engine that can make the same deduction as a human: guess the user's task(s) based on the issued query strings. A sketch of the solution involves two component problems. First, the engine needs to group query strings which are likely to be related into a set, and exclude those which are not. Second, it needs to infer the semantics within each of these sets to have any hope of understanding the user's task(s) behind the strings.
If the hypothetical search engine understands the various tasks corresponding to millions of query strings, it becomes a powerful tool for suggesting the most useful results based on keywords. Suppose the engine has seen the string points of interest in Java and has acquired a relation between points of interest and a location, because by semantics and most likely word sense, Java here means a location. When a user later issues the query string Hawaii, that user is, by our previous understanding, likely to be interested in points of interest related to Hawaii, and we can be more helpful by displaying those results.
Before going further, these are definitions which are going to be used throughout this paper. Minor ones are included in the glossary:
- Task: The goal a user wants to solve when issuing a set of related queries. For the example in the introduction, this would be "To travel to Wyoming". In Jones and Klinkner's paper, this is called a "search goal", which is defined as an atomic information need.
Here, we review the research outcomes which might be useful for implementing this hypothetical search engine which can identify user tasks, and see how certain approaches and resources differ in what they offer.
The session boundary
Before a computer can try to deduce a user's task, it needs to identify which queries are related and which are not. The boundary between two sets of related queries is called a session boundary.
There are three main ways in which researchers have tried to solve this problem: time-based heuristics, content-based heuristics, and mixed heuristics using both time and content.
Silverstein et al.'s widely cited work in 1999 first introduced the concept of a "search session". They defined two consecutive queries as being in the same session if they are no more than 5 minutes apart.
Radlinski and Joachims (2005) observed that users tend to perform a sequence of queries with a similar information need, and called these sequences of reformulated queries "query chains". They attempted to classify these queries automatically using a 30-minute threshold instead of the 5 minutes in Silverstein et al.'s work, but Jones and Klinkner later showed such thresholds to be ineffective.
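The time-threshold approach is simple to state in code. The sketch below segments a chronologically ordered query log at any gap longer than a fixed threshold; the log data and helper name are illustrative, not taken from either paper.

```python
from datetime import datetime, timedelta

def split_sessions(log, threshold=timedelta(minutes=5)):
    """Split a chronologically ordered query log into sessions.

    A session boundary is placed between two consecutive queries
    whenever the gap between their timestamps exceeds `threshold`.
    """
    sessions, current, last_time = [], [], None
    for timestamp, query in log:
        if last_time is not None and timestamp - last_time > threshold:
            sessions.append(current)  # gap too large: close the session
            current = []
        current.append(query)
        last_time = timestamp
    if current:
        sessions.append(current)
    return sessions

log = [
    (datetime(2009, 1, 1, 9, 0), "flights to wyoming"),
    (datetime(2009, 1, 1, 9, 2), "hotels in wyoming"),
    (datetime(2009, 1, 1, 14, 30), "python tutorial"),
]
# With a 5-minute threshold the first two queries form one session
# and the afternoon query forms another.
```

Changing `threshold` to `timedelta(minutes=30)` reproduces the Radlinski and Joachims setting; the segmentation logic is otherwise identical, which is precisely why duration alone makes little difference.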
Jones and Klinkner's work (2008) identifies several key difficulties in determining this boundary. Among them:
- Thresholds based purely on time, regardless of the duration chosen, are not effective at identifying session boundaries. In fact, such thresholds perform no better than randomly deciding whether queries are related. This finding is supported by Montgomery and Faloutsos's work.
- The query source may be shared; for instance, two users may take turns issuing queries with completely different goals in mind.
However, the authors found that by combining several metrics calculated from a set of query logs, they could determine whether two queries straddle a session boundary with an accuracy of 84%.
Jones and Klinkner's work uses purely mathematical methods, including string similarity scores (Levenshtein and Jaccard distances), Prisma (cosine vector similarity), and time thresholds, to determine the similarity between two queries, and hence whether they belong to different sessions.
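Two of these similarity features can be sketched as follows. These are the standard textbook definitions of character-level Levenshtein distance and word-level Jaccard distance, not the authors' exact feature implementations; the Prisma cosine similarity is omitted.

```python
def levenshtein(a, b):
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def jaccard_distance(a, b):
    """1 minus |shared words| / |all distinct words| of two query strings."""
    wa, wb = set(a.split()), set(b.split())
    return 1 - len(wa & wb) / len(wa | wb)
```

A small Levenshtein distance suggests a spelling correction or minor rewording, while a small Jaccard distance suggests the queries share topical vocabulary; combining such signals is what lifts accuracy above time thresholds alone.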
Wen et al.'s preceding work in 2001 presented several ways of computing a similarity index between queries using different sources of information. Their first approach is similar to Jones and Klinkner's, making use of the query strings themselves. They also suggested another metric: queries which lead users to select the same document are similar.
Finally, they proposed a similarity index between entities based on their co-existence within selected text corpora, such as articles within Encarta.
Lucchese et al. (2011) claimed that 75% of submitted queries involve some form of multitasking, meaning users tend to solve several tasks at once by interleaving their queries.
Their research also shows that approaches relying only on content features suffer from the vocabulary-mismatch problem: the existence of topically-related queries without any shared terms. Meanwhile, they observed that time-splitting techniques are unable to account for multi-tasking sessions.
Part of their analysis involves removing queries with too many search terms, as they consider these likely to be machine-generated queries (e.g. from robots).
Jansen et al. (2007) looked at content changes in combination with available metadata, namely IP addresses and session cookies. These statistics are available as and when the queries are issued, and do not require retrospective statistical analysis as heuristic techniques do.
One of their methods, which uses IP address, cookie, and query-content changes, achieves an accuracy of about 95.55% on 5 × 10^5 query strings, which puts it among the best results to date.
Boldi et al. (2008) attempted to group individual query strings together via what they introduced as a query flow graph. This is a labeled directed graph with query strings as nodes. An edge goes from node A to node B if B was queried immediately after A. The edge weights are determined by the relative frequency among all outgoing edges; that is, how often users go from one particular query to another.
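The graph construction can be sketched as follows. Normalizing each transition count by the total outgoing count of its source node is a simplified reading of the paper's weighting scheme, and the session data is invented for illustration.

```python
from collections import defaultdict

def build_query_flow_graph(sessions):
    """Build the edge weights of a query flow graph.

    Nodes are query strings; an edge A -> B exists when B was issued
    immediately after A in some session, weighted by the relative
    frequency of that transition among all transitions out of A.
    """
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for a, b in zip(session, session[1:]):  # consecutive query pairs
            counts[a][b] += 1
    graph = {}
    for a, outgoing in counts.items():
        total = sum(outgoing.values())
        graph[a] = {b: n / total for b, n in outgoing.items()}
    return graph

sessions = [
    ["flights to wyoming", "hotels in wyoming"],
    ["flights to wyoming", "hotels in wyoming"],
    ["flights to wyoming", "car rental wyoming"],
]
g = build_query_flow_graph(sessions)
# g["flights to wyoming"] gives weight 2/3 to "hotels in wyoming"
# and 1/3 to "car rental wyoming".
```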
"Because meaningful sentences are composed of meaningful words, any system that hopes to process natural languages as people do must have information about words and their meanings" - George A. Miller.
Suppose we can make use of the results of the papers above and solve the session boundary part of our problem with about 95% accuracy; we still need to deduce the relations between the different concepts contained within the strings. To do that, we need to draw on semantically-structured content.
Given the inherent ambiguity of human languages, a human necessarily relies on past experience and context to identify the most likely meaning of a sentence. A computer needs to do much the same: learn which word senses are most relevant in order to disambiguate, and learn facts about them, e.g. London would have a relation capitalOf to Great Britain.
Research on the semantic web has yielded multiple tools which attempt to enable this. WordNet, for instance, has all the data defined manually and is only available in English, whereas other efforts attempt to automatically derive relations by parsing semi-structured text corpora like Wikipedia. The many approaches have their own strengths and weaknesses.
WordNet is a lexical database (a dictionary) founded by psychologist George A. Miller at Princeton University in the mid-1980s. With its roots as a psychological project, WordNet's semantic relations are designed to be consistent with how human beings process language. It was not conceived as a dataset optimized for reading by machines, but nevertheless, it still managed to become very important in the semantic web for disambiguating word senses.
WordNet respects the different syntactic categories of the same word. For example, the word "well" can be used as a noun (place to draw water from the ground), as an adjective (being healthy), and so on.
It defines the vocabulary of a language as a set of tuples
- (f, s), where
- f is the form: a string over a finite alphabet,
- s is the sense: an element from a given set of meanings.
For each one of these tuples, WordNet provides a gloss or textual definition. Semantic relations within a language are defined as a set of pointers between members in this vocabulary.
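A toy model of this structure may make it concrete. The entries and sense labels below are invented for illustration and are not taken from WordNet itself; they show the (form, sense) vocabulary, glosses, and semantic pointers between vocabulary members.

```python
# Vocabulary: (form, sense) tuples mapped to their gloss.
# All entries are illustrative, not real WordNet data.
vocabulary = {
    ("well", "noun.structure"): "a hole dug to obtain water",
    ("well", "adj.health"): "in good health",
    ("maple", "noun.plant"): "a tree of the genus Acer",
    ("tree", "noun.plant"): "a tall perennial woody plant",
}

# Semantic relations as pointers between vocabulary members.
relations = {
    ("hyponymy", ("maple", "noun.plant"), ("tree", "noun.plant")),
}

def glosses(form):
    """All (sense, gloss) pairs recorded for a given word form."""
    return [(s, g) for (f, s), g in vocabulary.items() if f == form]
```

Looking up `glosses("well")` returns both the noun and the adjective sense, mirroring how WordNet keeps the syntactic categories of one word form distinct.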
The table below, from George Miller's paper, illustrates this; it omits the sense component of the tuples in the vocabulary.
|Semantic relation||Syntactic category||Examples|
|Synonymy (similar)||N, V, Aj, Av||(pipe, tube), (rise, ascend), (sad, unhappy), (rapidly, speedily)|
|Antonymy (opposite)||Aj, Av, (N, V)||(wet, dry), (powerful, powerless), (friendly, unfriendly), (rapidly, slowly)|
|Hyponymy (subordinate)||N||(sugar maple, maple), (maple, tree), (tree, plant)|
|Meronymy (part)||N||(brim, hat), (gin, martini), (ship, fleet)|
|Troponymy (manner)||V||(march, walk), (whisper, speak)|
|Entailment||V||(drive, ride), (divorce, marry)|
|N = Nouns, Aj = Adjectives, V = Verbs, Av = Adverbs|
Because all the entries in WordNet are manually defined, essentially without errors, the correctness of its relations is very high compared to lexical databases which extract relations automatically. The way its words are defined makes it particularly useful for disambiguating word senses given a context.
Despite its advantages, WordNet is only available in English, and since it is a manual endeavor, its data is fairly limited compared to other lexical databases - it does not have certain information like etymology, pronunciation, and does not cover domain language (language used in specialized fields, like jargon). Being able to understand domain language as well as languages other than English would be very important in order to identify a user's tasks from the query logs. There are other endeavors similar to WordNet that cover other languages, but suffer from the same weaknesses - expensive maintenance and poor coverage of domain language.
Cyc is similar to WordNet and was also started in the 1980s, but it was created as an AI project from the outset. It suffers from the same shortcomings of manual input, as well as the difficulty humans face in translating concepts into its data representation because of its complicated syntax. It is mentioned here only to provide context for the disciplines involved in semantics research during that period, and will not be discussed at length.
BabelNet is an attempt to address poor language coverage in manually curated systems like WordNet as well as take advantage of their quality by combining them with a very large set of multi-lingual semi-structured content (Wikipedia). In this respect, BabelNet is highly successful, but Navigli and Ponzetto, who presented BabelNet in their paper in 2010, acknowledged that the mapping produced is not as robust as WordNet's simply because it constructs maps automatically based on heuristics.
BabelNet encodes knowledge as a labeled directed graph:
- G = (V, E), where
- V is the set of vertices, each representing a concept. Vertices are entities called babel synsets,
- E is the set of edges, each labeled with a semantic relation R.
A babel synset is a set of all synonyms of one sense of a word in all languages BabelNet covers. In addition, it holds short segments of sentences extracted from Wikipedia and SemCor as examples of usages of the word. SemCor is a set of sense-tagged corpora created by the WordNet research team.
To populate this knowledge graph, BabelNet collects Wikipedia pages (as concepts) and hyperlinks (as relations) and attempts to map them to data from WordNet (word senses as concepts, semantic pointers as relations). To achieve its multi-lingual coverage in the absence of WordNet equivalents for the target languages, BabelNet uses the human-edited inter-language links between Wikipedia's different language editions, as well as a machine translation system.
To perform the mapping, BabelNet uses a concept called disambiguation context Ctx(e), which is a set of words obtained from performing an operation on an entity e.
For a Wikipedia page w, BabelNet identifies the disambiguation context Ctx(w) using sense labels (e.g. aircraft in the article titled balloon (aircraft)), outgoing links and related articles, and article categories.
For a WordNet sense s (this refers to the tuple of both the word and its sense), the disambiguation context Ctx(s) is derived from all the entries in the synset of s, called S (the set containing all synonyms of s), and the hyponyms (specialization) and hypernyms (generalization) of S. It also includes sister synsets; a synset is a sister to S if they have a common direct hypernym, i.e. they generalize to a common word. Finally, it also makes use of the gloss available to S.
The actual mapping itself is found by calculating the conditional probability P(s|w) of picking the WordNet sense s given a Wikipedia page w.
How this value is calculated is involved; it makes use of a mapping algorithm introduced by Ponzetto and Navigli, the same authors who presented BabelNet.
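A greatly simplified stand-in for that mapping can be sketched by approximating P(s|w) as the normalized word overlap between the page context Ctx(w) and each candidate sense context Ctx(s). This overlap heuristic and all data below are assumptions for illustration, not the actual algorithm.

```python
def map_page_to_sense(page_ctx, sense_ctxs):
    """Pick the sense whose disambiguation context overlaps most with
    the page's context; a crude stand-in for estimating P(s|w).
    """
    # Overlap score between Ctx(w) and each Ctx(s).
    scores = {s: len(page_ctx & ctx) for s, ctx in sense_ctxs.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # no evidence for any sense
    best = max(scores, key=scores.get)
    return best, scores[best] / total  # normalized score ~ P(s|w)

# Hypothetical contexts for the page "balloon (aircraft)" and two
# candidate senses of "balloon".
page_ctx = {"aircraft", "flight", "gas", "envelope"}
sense_ctxs = {
    "balloon#toy": {"party", "rubber", "child"},
    "balloon#aircraft": {"aircraft", "flight", "airship"},
}
best, prob = map_page_to_sense(page_ctx, sense_ctxs)
# best is "balloon#aircraft", since its context shares two words
# with the page context while the toy sense shares none.
```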
DBpedia is a project to extract structured content from Wikipedia and make it publicly available. It differs from similar projects in that it infers semantics only from Wikipedia, and does not rely on a specialized lexical resource such as WordNet. DBpedia is updated automatically every time Wikipedia releases its monthly SQL dump, and also live, with a few minutes' delay, as Wikipedia's API notifies DBpedia of changes.
DBpedia extracts the following from Wikipedia articles:
- Labels: The title of the article
- Abstracts: First paragraph of an article
- Interlanguage links: Links between different Wikipedia language editions of an article
- Redirects: These are used to define synonymous labels
- Disambiguation: These are used to distinguish between different senses of a word
- External links
- Pagelinks: Internal links to other Wikipedia pages
- Homepages: Links to homepages of entities such as companies and organizations
- Infoboxes: Box at the top of most significant Wikipedia articles containing the most important information
It uses a complicated extraction algorithm presented by S. Auer and J. Lehmann in 2007. This algorithm's output is a large set of hierarchical ontology classes, each with its own properties. Using the example from C. Bizer and J. Lehmann (2009):
|Ontology class||Example properties|
|Person||name, birthdate, birthplace|
|↳ Artist||activeyears, awards, occupation, genre|
|↳ Musician||genre, instrument, label, voiceType|
|↳ Actor||academyaward, activeyears|
These classes would then contain members, called instances, which have different values for each property of that class.
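A toy model of this hierarchy clarifies how instances inherit properties. The class and property names follow the example table above; the lookup helper and instance are hypothetical.

```python
# DBpedia-style ontology: each class lists its superclass and own properties.
ontology = {
    "Person":   {"super": None,     "props": {"name", "birthdate", "birthplace"}},
    "Artist":   {"super": "Person", "props": {"activeyears", "awards", "occupation", "genre"}},
    "Musician": {"super": "Artist", "props": {"genre", "instrument", "label", "voiceType"}},
}

def all_properties(cls):
    """Properties of a class, including those inherited from superclasses."""
    props = set()
    while cls is not None:
        props |= ontology[cls]["props"]
        cls = ontology[cls]["super"]  # walk up the hierarchy
    return props

# A hypothetical instance assigns values to (some of) its class's properties.
instance = {"class": "Musician", "name": "Elvis Presley", "instrument": "guitar"}
```

A Musician instance may therefore carry a `name` (inherited from Person) alongside an `instrument` (defined on Musician itself).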
YAGO is a series of ontologies. YAGO2 improves on YAGO's ability to represent the semantics of a given sense, while YAGO2s mainly adds modularization of YAGO2's extractor component, the ability for a user to modify these modules, and a GUI. For our purposes, we will review YAGO2.
YAGO2 uses modules called extractors specialized for different sources (e.g. Wikipedia, WordNet etc.) in order to derive semantic data. Extractors automatically output three main types of data - entities, classes, and relations.
A YAGO entity is a unique string representing that entity, for example, ElvisPresley. When extracting from Wikipedia, these come from the article titles.
A YAGO class is a group of similar entities. An entity is mapped to a class via a triple (entity, TYPE, class), and classes are arranged in hierarchical order as tuples (class, SUBCLASSOF, superclass).
YAGO relations are manually defined strings which express a concept, e.g. wasBornIn, locatedIn. A YAGO fact combines entities with a relation. Basic facts are expressed as SPO triples (subject, predicate, object), for example (ElvisPresley, bornInYear, 1935).
All the YAGO concepts discussed so far have been present since the original YAGO. YAGO2 augments entities, classes, and facts with temporal, location, and optionally, context representations, expressed as SPOTL(X) tuples (subject, predicate, object, time, location, context). This data is derived from existing attributes of the entity/class/fact and mainly eases querying. For instance, the time component for the entity ElvisPresley is a specialized yagoDate instance derived from the entity's bornOnDate and diedOnDate.
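These fact tuples can be modeled directly. The relation names follow the examples in the text; the fact store and query helper are hypothetical sketches, not YAGO's actual API.

```python
from collections import namedtuple

# Basic SPO facts and their SPOTL extension with time and location.
SPO = namedtuple("SPO", "subject predicate object")
SPOTL = namedtuple("SPOTL", "subject predicate object time location")

facts = [
    SPO("ElvisPresley", "bornInYear", "1935"),
    SPO("ElvisPresley", "wasBornIn", "Tupelo"),
]

def query(facts, subject=None, predicate=None):
    """Return facts matching the given subject and/or predicate."""
    return [f for f in facts
            if (subject is None or f.subject == subject)
            and (predicate is None or f.predicate == predicate)]
```

Note that querying a relation that was never defined, e.g. `query(facts, predicate="flowsThrough")`, simply returns an empty list: facts about undefined relations cannot be retrieved, and an empty result cannot distinguish "false" from "unknown".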
By manual evaluation, the presenters of YAGO2 determined it to have a factual accuracy of 95% ± 3.69%, i.e. the proportion of facts in YAGO2 which are correct. This is very high compared to other semantic networks, but since YAGO relations are manually defined, a query involving an undefined relation cannot be carried out (e.g. rivers flowing through a city). YAGO also cannot answer queries with a negated predicate (e.g. people who are not Austrian), because entities can leave that predicate undefined (e.g. people of unknown nationality).
Summary of state of the art
From the literature review, we found that, for identifying session boundaries, purely temporal thresholds do not account for multi-tasking and are ineffective on their own, while content-only analysis is over-reliant on shared terms. The best outcomes come from combining features; Jansen et al. used search metadata as well as content changes (string similarity scores).
As for semantic content, since the search engine would need to derive relations between arbitrary entities, resources like YAGO and DBpedia make good candidates, as they classify concepts within hierarchies and contain certain relations between them. Resources like WordNet and BabelNet are better suited to disambiguating word senses, and are less useful in the context of this problem.
Implementing a performant search engine that does all of the above is difficult indeed.
Based on the current state of the art, about 5% (1 in 20) of identified sessions would contain at least one query string unrelated to the rest of the set, which means the identified task may be wrong, especially if the query context is ambiguous. This is nowhere near the standard users expect from a commercial product, and there is no solution apart from developing a model that maps query strings to sessions with much higher accuracy.
Consider the query set:
- Albert Einstein
- Theory of General Relativity
- John von Neumann
- von Neumann probe
Depending on how specific the search is, the search engine could potentially segment it as one, two, or four individual tasks. Therein lies another difficulty - predicting the granularity of the result set that the user is expecting. Referring to the query set above, the user might be interested in major scientists and their works, or there might be two different users interested in different scientists and their works, etc. Segmenting the tasks too zealously might omit results which would interest certain users, whereas lumping together too many queries would give results which are irrelevant.
All this querying is expensive, especially for a robust implementation. Consider that YAGO only contains manually-defined relations; the search engine would be unable to identify certain tasks (those involving undefined relations) unless it tries an alternative such as DBpedia.
However, all of these knowledge bases have very different APIs, outputs, and data structures in which they define the relations, and hence the search engine needs to translate its common query representation to the appropriate format for each knowledge base it uses. Even after we manage to find the relations, we still have to compute the user's task. Even if the search engine returns extremely relevant results, it would be useless or unappealing to a user if it takes too long.
Task identification using search engine query logs is still a problem in its infancy, and there remains a multitude of problems to be solved before it can be used in a commercial search engine deployment. However, given the above literature review of the steady progress of the current state of the art, progress towards that goal looks promising.
|Jaccard distance||One minus the size of the intersection of two strings' word sets divided by the size of their union; a measure of word overlap between queries|
|Levenshtein distance||The number of single-character operations (insertion, deletion, substitution) to change from one string to another|
|RDF triple||RDF (Resource Description Framework) is a W3C specification of a metadata data model. For knowledge bases, the triple is usually a so-called SPO (subject, predicate, object) triple used to represent one fact about a subject|
|Reformulation||Refinements that a user makes to a previous query in order to try to get a better result.|
|Session boundary||The boundary between two sets of related searches|
|Sense / word sense||One use of a word|
|Synset||A set of words which are synonyms to each other|
|Task||The goal a user wants to solve when issuing a set of related queries. An atomic information need.|
- Claudio Lucchese, Salvatore Orlando, Raffaele Perego, Fabrizio Silvestri, Gabriele Tolomei (2011). Identifying Task-based Sessions in Search Engine Query Logs WSDM: 277-286 doi: 10.1145/1935826.1935875
- Jeff Huang, Efthimis N. Efthimiadis (2009). Analyzing and Evaluating Query Reformulation Strategies in Web Search Logs CIKM: 77-86
- Craig Silverstein, Hannes Marais, Monika Henzinger, Michael Moricz (1999). Analysis of a very large web search engine query log SIGIR Forum 33 (1): 6-12 doi: 10.1145/331403.331405
- Filip Radlinski, Thorsten Joachims (2005). Query chains: learning to rank from implicit feedback KDD: 239-248 doi: 10.1145/1081870.1081899
- Alan L. Montgomery, Christos Faloutsos (2001). Identifying web browsing trends and patterns IEEE Computer 34 (7): 94-95
- Ji-Rong Wen, Jian-Yun Nie, Hong-Jiang Zhang (2001). Clustering User Queries of a Search Engine WWW: 162-168 doi: 10.1145/371920.371974
- Bernard J. Jansen, Amanda Spink, Chris Blakely, Sherry Koshman (2007). Defining a session on Web search engines JASIST 58 (6): 862-871 doi: 10.1002/asi.20564
- Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis, Sebastiano Vigna (2008). The Query-flow Graph: Models and Applications CIKM: 609-618 doi: 10.1145/1458082.1458163
- George A. Miller (1995). WordNet: A Lexical Database for English CACM 38 (11): 39-41 doi: 10.1145/219717.219748
- Collins Allan M., Quillian M. Ross (1972). Experiments on semantic memory and language comprehension Cognition in learning and memory
- Douglas B. Lenat (1995). CYC: a large-scale investment in knowledge infrastructure CACM 38 (11): 33-38 doi: 10.1145/219717.219745
- S. Auer, J. Lehmann (2007). What Have Innsbruck and Leipzig in Common? Extracting Semantics from Wiki Content The Semantic Web: Research and Applications 4519: 503-517
- C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, S. Hellmann (2009). DBpedia - A crystallization point for the Web of Data Web Semantics 7 (3): 154-165
- Joanna Biega, Erdal Kuzey, Fabian M. Suchanek (2013). Inside YAGO2s: A transparent Information Extraction Architecture WWW Companion: 325-328
- Johannes Hoffart, Fabian M. Suchanek, Klaus Berberich, Gerhard Weikum (2013). YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia Artificial Intelligence 194: 28-61