Difference between revisions of "Main Page"
Revisions by Changtau2005 (sections edited: Resume; Task Identification using Search Engine Query Logs)
Figure (Task Identification using Search Engine Query Logs): Root node of tree. Only selected subclass nodes (blue / red) are displayed. Orange nodes are the entities of a class which are most often searched for. Green nodes are the most frequently searched strings of a class.
Revision as of 14:11, 4 October 2014
Page currently under reconstruction (4th October 2014). Expected to finish in several hours. Please check back later :)
Welcome!
I'm Li, a fourth-year student at University College London, currently working on an MEng in Computer Science. I'm most interested in applications of machine learning to large data sets. I haven't decided on a specific research area, primarily because I don't think I've seen enough of the field yet. However, my current interests lean towards applying machine learning to areas related to data mining, semantic computation, and natural language processing. The data I've worked with in the past is web-based (AOL search logs, Bing session data, mined Twitter data, YAGO2).
Why computer science? I decided to enter the field because I love building things. Stacks. Factories. Interfaces. Semaphores. Software systems are teeming cities running like clockwork on top of layers and layers of abstraction. I thought that I wanted to be a developer for sure, but then I began to see some really interesting problems, and approaches to solving them, in the field, so I focused my efforts on research too. Computer science (and AI / machine learning) sits very much in the middle of interdisciplinary research, and I think this is where the most exciting things are happening. Before this, I was a medical student at Imperial College London - I left after two years - but that's a story for another time ;)
For people unfamiliar with computer science, machine learning is, at its core, pattern recognition. If you can reduce a problem to a pattern recognition problem, then you can apply machine learning to solve it. It is a powerful technique that we can use to try to find features / trends / patterns hidden within huge amounts of data (DNA, stock ticks, the internet), or to classify that data into different categories (think of algorithms that recognize faces, road signs, or system intrusions based on anomalous behaviour patterns).
Resume
- UK version - 2 pages
- US version - 1 page - todo
Internships
Microsoft Research Cambridge
UniEntry
Other
Research
I was fortunate enough to be involved in several short-term research projects (2 to 6 months each) during my undergraduate years. Generally, internship opportunities for undergraduate students in the UK tend to be limited to development work.
Big Five Personality Classification of Twitter Profile by Machine Learning
This is the title of my Master's dissertation. At the time of writing, I've just begun to work on it, so everything is still highly tentative. Supervisor: Emine Yilmaz. Personal tutor: Dr. Kevin Bryson.
By mining the text corpus of individual Twitter profiles, we hope to classify each user along the five categories of the Big Five model. We plan to do so by identifying adjectives in the text that are labeled with a "weight" towards one end of each category. Such labels can be found in the seminal Allport-Odbert 1936 list and in similar works. We are scoping the project to consider only Twitter profiles in English.
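The weighted-adjective idea can be sketched roughly as follows. This is only an illustration under stated assumptions: the `LEXICON` entries and their weights are made-up placeholders (not values taken from the Allport-Odbert list), `score_profile` is a hypothetical helper name, and a plain dictionary lookup stands in for proper adjective identification (which would likely need part-of-speech tagging).

```python
import re
from collections import defaultdict

TRAITS = ("openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism")

# Hypothetical lexicon: adjective -> (trait, weight in [-1, 1]).
# Placeholder values for illustration only.
LEXICON = {
    "talkative": ("extraversion", 1.0),
    "quiet": ("extraversion", -1.0),
    "anxious": ("neuroticism", 1.0),
    "calm": ("neuroticism", -1.0),
    "curious": ("openness", 1.0),
    "organized": ("conscientiousness", 1.0),
    "kind": ("agreeableness", 1.0),
}


def score_profile(tweets):
    """Return the mean adjective weight per trait for one profile."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for tweet in tweets:
        # Naive tokenisation: lowercase alphabetic runs only.
        for word in re.findall(r"[a-z]+", tweet.lower()):
            if word in LEXICON:
                trait, weight = LEXICON[word]
                totals[trait] += weight
                counts[trait] += 1
    # Traits with no matching adjectives default to a neutral 0.0.
    return {t: (totals[t] / counts[t] if counts[t] else 0.0) for t in TRAITS}


scores = score_profile([
    "I am so anxious today",
    "Feeling calm and curious",
    "anxious again",
])
```

Averaging rather than summing keeps prolific and quiet accounts on the same scale, though a real classifier would probably feed these scores into a trained model rather than read them off directly.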
We hope that the findings will form a basis for further research into identifying individuals with potential signs of depression based on their Twitter activity. Depending on the speed of progress, we might have some time to consider this part of the problem.
SmartFence
- Please see #Microsoft Research Cambridge
Task Identification using Search Engine Query Logs
This was my undergraduate university-based research project. The goal was to find out what users are most interested in when they search for a certain class of things. For example, using Google's related searches, if you search for Hawaii, it gives you results along the lines of Hotels in Hawaii, or Flights to Hawaii. Based on our understanding of how Google's system works, these are the most common strings appearing with Hawaii in searches.
We wanted to go one step further. By using a knowledge base like YAGO and the set of AOL logs leaked in 2006, we do the same, but with semantics. For example, Hawaii would be determined to be a place. Using relations like these (Hawaii {hasClass} place), we build up a tree of classes - Root --> Organism --> Human --> Artist --> Musician --> Singer --> Michael Jackson, for example - and we aggregate the related search strings using this tree. Hence, by querying nodes of this tree, we can find out the most popular search strings for any class, such as a human being. I thought this was an extremely interesting problem to tackle.
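As a rough sketch of the aggregation step, assuming each entity has already been mapped to its class path: the `CLASS_PATHS` table, the `<entity>` placeholder, and the `aggregate` function below are illustrative inventions, not the project's actual code, and the substring match is a deliberately naive stand-in for real entity recognition.

```python
from collections import Counter, defaultdict

# Hypothetical class paths (root first) for two entities; in the project
# these would come from knowledge-base relations such as
# (Hawaii {hasClass} place).
CLASS_PATHS = {
    "michael jackson": ["Root", "Organism", "Human", "Artist",
                        "Musician", "Singer"],
    "hawaii": ["Root", "Place"],
}


def aggregate(queries):
    """Count entity-stripped query strings at every class on the path."""
    node_counts = defaultdict(Counter)
    for query in queries:
        q = query.lower()
        for entity, path in CLASS_PATHS.items():
            if entity in q:  # naive substring match, for illustration
                # Replace the entity so "hotels in hawaii" and
                # "hotels in bali" would share the same template.
                template = " ".join(q.replace(entity, "<entity>").split())
                for node in path:
                    node_counts[node][template] += 1
    return node_counts


counts = aggregate([
    "Hotels in Hawaii",
    "Flights to Hawaii",
    "hotels in hawaii",
    "Michael Jackson songs",
])
top_for_place = counts["Place"].most_common(1)
```

Querying a node is then just a lookup: `counts["Human"]` holds the strings people attach to any human being, while `counts["Root"]` holds everything.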
We ran into problems of ambiguity - for example, Java may mean the programming language, or the island in Indonesia, or a dozen other things. We disambiguate by comparing how many classes the terms in a search session have in common. For example, if a search session contains the terms Scala and Java, we can be fairly confident that Java means the programming language. We ended up discarding many sessions which did not give us enough data to disambiguate, and in the end we didn't have enough data to populate the tree beyond the first 3 to 4 layers. We were extremely time-constrained (3 months), so we couldn't refine our methods to improve the results, but for our efforts the project was awarded best research project in our year.
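The class-overlap idea might look something like the sketch below. The `SENSES` table and its class labels are hypothetical, and the scoring rule (sum of the best class overlap with each other term in the session, with ties and zero scores treated as "not enough data") is one plausible reading of the approach rather than the exact method we used.

```python
# Hypothetical candidate senses: term -> {sense label: set of classes}.
SENSES = {
    "java": {
        "Java (programming language)": {"programming language", "software"},
        "Java (island)": {"island", "place"},
    },
    "scala": {
        "Scala (programming language)": {"programming language", "software"},
    },
    "bali": {
        "Bali (island)": {"island", "place"},
    },
}


def disambiguate(term, session_terms):
    """Pick the sense of `term` sharing the most classes with the session.

    Returns None when the session gives no clear winner, in which case
    the session would be discarded.
    """
    scores = {}
    for sense, classes in SENSES.get(term, {}).items():
        score = 0
        for other in session_terms:
            if other == term:
                continue
            # Best overlap between this sense and any sense of the other term.
            overlaps = [len(classes & other_classes)
                        for other_classes in SENSES.get(other, {}).values()]
            score += max(overlaps, default=0)
        scores[sense] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if not ranked or ranked[0][1] == 0:
        return None  # no supporting evidence in the session
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between senses: still ambiguous
    return ranked[0][0]


with_scala = disambiguate("java", ["scala", "java"])
with_bali = disambiguate("java", ["java", "bali"])
alone = disambiguate("java", ["java"])
mixed = disambiguate("java", ["java", "scala", "bali"])
```

Returning `None` for ties and unsupported terms mirrors why so many sessions had to be thrown away: a single-term session, or one with conflicting context, simply does not carry enough signal.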
- Task Identification Using Search Engine Query Logs (Lit review coursework)
- Task Identification Using Search Engine Query Logs (Results report)
Projects
SynthJS
- Please see Dev:SynthJS
Pastimes
Currently this wiki is mostly used to construct and publish dynamic/modular documents, since wikitext/HTML is easier to work with than LaTeX in some cases. MediaWiki also works as a convenient CMS for the dev diary.