Difference between revisions of "Main Page"

From LQ's wiki
(Task Identification using Search Engine Query Logs)
 
I'm Li, a 4th year student at [http://www.ucl.ac.uk/ University College London] currently working on an MEng in Computer Science. I'm most interested in applications of machine learning to large data sets. I haven't decided on a specific research area, primarily because I don't think I've seen enough of the field yet. However, my current interests slant towards applying machine learning to areas related to data mining, semantic computation, and natural language processing. The data I've worked with in the past are web-based (AOL search logs, Bing session data, mined Twitter data, [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/ YAGO2]).
  
Why computer science? At first, I entered the field because I love building things. Stacks. Factories. Interfaces. Semaphores. Software systems are teeming cities running like clockwork on top of layers and layers of abstraction. I thought that I wanted to be a developer for sure, but then I began to see some really interesting problems, and approaches to solving them, in the field, so I focused my efforts on research too. Computer science (and AI / machine learning) is very much in the middle of interdisciplinary research, and I think this is where the most exciting things are happening. Before this, I was a medical student at [http://www3.imperial.ac.uk/ Imperial College London] - I left after two years - but that's a story for another time ;)
  
 
For people unfamiliar with computer science, machine learning really is just pattern recognition. If you can reduce a problem to a pattern recognition problem, then you can apply machine learning to solve it. It is a powerful technique that we can use to find features / trends / patterns hidden within huge amounts of data (DNA, stock ticks, the internet), or to classify that data into different categories (think of algorithms that recognize faces, road signs, or system intrusions based on anomalous behaviour patterns).
 
* US version - 1 page - todo
  
==Projects==

===SynthJS===
[[File:Synthjs-scrshot-03.png|thumb|right|350px]]
This is a project I came up with when I had about two weeks of free time during Christmas break back in 2013. I felt that my JavaScript was getting a bit rusty and I wanted to explore something related to HTML5, so after looking at the emerging technologies for a while, I settled on a project that explores music technology for the web.

Progress on the project is now frozen, partly because the scope was too big for two weeks (I couldn't gauge how much work it was going to be, as the technology was new to me), and partly because the DOM-based UI scales poorly. As it stands, it has a reasonable suite of instruments, speed control, equalizers for individual instruments, an undo/redo stack (which was non-trivial to implement for a program like this), file import/export, and most importantly, you can sequence simple music with it!

*For the developer's diary, see [[Dev:SynthJS]].
*To play with it, [http://lqkhoo.com/synthjs click here]. Have fun!
 
==Internships==
  
  
 
===Other===
 
My other experiences are related to my brief stint in medical school, rather than computer science.
* Work shadowing at the Van Andel Institute, Michigan, USA. I generally observed the activity within a biomedical research lab: automated sequencing, running DNA microarrays, etc.
* In Malaysia, I had a work placement in a hospital's critical care unit and department of anaesthesia, and later in the Department of Public Health of Penang.
  
 
==Research==
  
 
===Task Identification using Search Engine Query Logs===
[[File:Ex-task-identification.png|thumb|right|350px|Root node of tree. Only selected subclass nodes (blue / red) are displayed. Orange nodes are the entities most often searched for in a class. Green nodes are the most frequently searched-for strings. Visualized using D3.js]]
 
This was my undergraduate university-based research project. The goal is to find out what users are most interested in when they search for a certain class of things. For example, using Google's related searches, if you search for <tt>Hawaii</tt>, it gives you results along the lines of <tt>Hotels in Hawaii</tt>, or <tt>Flights to Hawaii</tt>. Based on our understanding of how Google's system works, these are the most common strings appearing with <tt>Hawaii</tt> in searches.
  
 
*[[Task Identification Using Search Engine Query Logs|Task Identification Using Search Engine Query Logs (Lit review coursework)]]
 
*[[:File:Task Identification Using Search Engine Query Logs - Report.pdf|Task Identification Using Search Engine Query Logs (Results report)]]
 
  
 
==Pastimes==
  
 
 
 
Currently, the wiki is mostly used to construct and publish dynamic/modular documents, since wikitext/HTML is easier to work with than LaTeX in some cases. MediaWiki also works as a convenient CMS for the dev diary.
 
  
 
[[Category:Root]]

Revision as of 14:36, 4 October 2014

Page currently under reconstruction (4th October 2014). Expected to finish in several hours. Please check back later :)


Welcome!

I'm Li, a 4th year student at University College London currently working on an MEng in Computer Science. I'm most interested in applications of machine learning to large data sets. I haven't decided on a specific research area, primarily because I don't think I've seen enough of the field yet. However, my current interests slant towards applying machine learning to areas related to data mining, semantic computation, and natural language processing. The data I've worked with in the past are web-based (AOL search logs, Bing session data, mined Twitter data, YAGO2).

Why computer science? At first, I entered the field because I love building things. Stacks. Factories. Interfaces. Semaphores. Software systems are teeming cities running like clockwork on top of layers and layers of abstraction. I thought that I wanted to be a developer for sure, but then I began to see some really interesting problems, and approaches to solving them, in the field, so I focused my efforts on research too. Computer science (and AI / machine learning) is very much in the middle of interdisciplinary research, and I think this is where the most exciting things are happening. Before this, I was a medical student at Imperial College London - I left after two years - but that's a story for another time ;)

For people unfamiliar with computer science, machine learning really is just pattern recognition. If you can reduce a problem to a pattern recognition problem, then you can apply machine learning to solve it. It is a powerful technique that we can use to find features / trends / patterns hidden within huge amounts of data (DNA, stock ticks, the internet), or to classify that data into different categories (think of algorithms that recognize faces, road signs, or system intrusions based on anomalous behaviour patterns).

Resume

Projects

SynthJS


This is a project I came up with when I had about two weeks of free time during Christmas break back in 2013. I felt that my JavaScript was getting a bit rusty and I wanted to explore something related to HTML5, so after looking at the emerging technologies for a while, I settled on a project that explores music technology for the web.

Progress on the project is now frozen, partly because the scope was too big for two weeks (I couldn't gauge how much work it was going to be, as the technology was new to me), and partly because the DOM-based UI scales poorly. As it stands, it has a reasonable suite of instruments, speed control, equalizers for individual instruments, an undo/redo stack (which was non-trivial to implement for a program like this), file import/export, and most importantly, you can sequence simple music with it!

Internships

Microsoft Research Cambridge

UniEntry

Other

My other experiences are related to my brief stint in medical school, rather than computer science.

  • Work shadowing at the Van Andel Institute, Michigan, USA. I generally observed the activity within a biomedical research lab: automated sequencing, running DNA microarrays, etc.
  • In Malaysia, I had a work placement in a hospital's critical care unit and department of anaesthesia, and later in the Department of Public Health of Penang.

Research

I was fortunate enough to have the opportunity to be involved in several short-term research projects (2 months - 6 months) during my undergraduate years. Generally, internship opportunities for undergraduate students in the UK tend to be limited to development work.

Big Five Personality Classification of Twitter Profile by Machine Learning

This is the title for my Masters dissertation. At the time of writing, I've just begun to work on it, so everything is still highly tentative. Supervisor: Emine Yilmaz. Personal tutor: Dr. Kevin Bryson

By mining the text corpus of individual Twitter profiles, we hope to classify each user along the five categories of the Big Five model. We plan to do so by identifying adjectives in the tweets that are labeled with a "weight" towards one end of each category. Such labels can be found in the seminal Allport-Odbert 1936 list and in similar works. We are scoping the project to only consider Twitter profiles in English.

We hope that the findings form a basis for further research into identifying individuals with potential signs of depression based on their Twitter activity. Depending on the speed of progress, we might have some time to consider this part of the problem.
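As a rough illustration of the weighted-adjective idea described above, here is a minimal sketch in Python. Since the project is still tentative, everything in it is an assumption: the tiny lexicon, its weights, the trait names used as keys, and the function name score_profile are all made up for the example and do not come from the Allport-Odbert list or from the actual project.

```python
import re
from collections import defaultdict

# Hypothetical lexicon: adjective -> {trait: weight}. A positive weight pulls
# towards the high end of the trait, a negative weight towards the low end.
LEXICON = {
    "talkative": {"extraversion": 1.0},
    "quiet": {"extraversion": -1.0},
    "anxious": {"neuroticism": 1.0},
    "calm": {"neuroticism": -1.0},
    "curious": {"openness": 1.0},
}

def score_profile(tweets, lexicon=LEXICON):
    """Sum the weights of every lexicon adjective found in a user's tweets."""
    totals = defaultdict(float)
    for tweet in tweets:
        # Crude tokenisation: lowercase runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", tweet.lower()):
            for trait, weight in lexicon.get(word, {}).items():
                totals[trait] += weight
    return dict(totals)

scores = score_profile([
    "Feeling calm and curious today",
    "So quiet in here, a bit anxious",
])
```

A real pipeline would presumably need part-of-speech tagging to pick out adjectives and some normalisation for profile length; this sketch only shows the summing step.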

SmartFence

Task Identification using Search Engine Query Logs

Root node of tree. Only selected subclass nodes (blue / red) are displayed. Orange nodes are the entities most often searched for in a class. Green nodes are the most frequently searched-for strings. Visualized using D3.js

This was my undergraduate university-based research project. The goal is to find out what users are most interested in when they search for a certain class of things. For example, using Google's related searches, if you search for Hawaii, it gives you results along the lines of Hotels in Hawaii, or Flights to Hawaii. Based on our understanding of how Google's system works, these are the most common strings appearing with Hawaii in searches.

We wanted to go one step further. By using a knowledge base like YAGO and the set of AOL search logs released in 2006, we do the same, but with semantics. For example, Hawaii would be determined to be a place. Using relations like these (Hawaii {hasClass} place), we build up a tree of classes - Root --> Organism --> Human --> Artist --> Musician --> Singer --> Michael Jackson, for example, and we aggregate the related search strings using this tree. Hence, by querying nodes of this tree, we can find out the most popular entries when searching for a human being, for instance. I thought this was an extremely interesting problem to tackle.
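The aggregation over the class tree can be sketched roughly as follows. This is only an illustration under stated assumptions: the PARENT map reuses the example chain from the text plus one invented entity, the observations are made-up data, and the helper names (ancestors, aggregate) are hypothetical rather than taken from the project's code.

```python
from collections import Counter, defaultdict

# Hypothetical child -> parent relations, following the example chain above.
PARENT = {
    "Michael Jackson": "Singer",
    "Elvis Presley": "Singer",  # invented second entity for the example
    "Singer": "Musician",
    "Musician": "Artist",
    "Artist": "Human",
    "Human": "Organism",
    "Organism": "Root",
}

def ancestors(node, parent=PARENT):
    """Yield the node itself and every class above it, up to the root."""
    while node is not None:
        yield node
        node = parent.get(node)

def aggregate(observations, parent=PARENT):
    """Count each related search string at the entity and at all its ancestors."""
    counts = defaultdict(Counter)
    for entity, related in observations:
        for node in ancestors(entity, parent):
            counts[node][related] += 1
    return counts

# Made-up (entity, related search string) pairs extracted from query logs.
observations = [
    ("Michael Jackson", "songs"),
    ("Michael Jackson", "albums"),
    ("Elvis Presley", "songs"),
]
counts = aggregate(observations)
top_for_human = counts["Human"].most_common(1)
```

Querying counts["Human"] then gives the strings most often searched alongside any human being in the data, which is the kind of lookup the paragraph above describes.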

We ran into problems of ambiguity - for example, Java may mean the programming language, or the place in Indonesia, or a dozen other things. We disambiguate by comparing how many classes the terms in a search session have in common. For example, if a search session contains the terms Scala and Java, we can be fairly sure that Java means the programming language. We ended up discarding many sessions which did not give us enough data to disambiguate, and we didn't have enough data in the end to populate the tree beyond the first 3 to 4 layers. We were extremely time-constrained (3 months) so we couldn't refine our methods to improve the results, but for our efforts, the project was awarded best research project in our year.
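The disambiguation step can be sketched along these lines. This is a simplified illustration, not the method as implemented in the project: the SENSES inventory is invented for the example (a real one would come from the knowledge base), and the scoring rule (count of classes shared with any sense of the other session terms) and the min_overlap threshold are assumptions.

```python
# Hypothetical sense inventory: term -> {sense: set of classes it belongs to}.
SENSES = {
    "Java": {
        "Java (programming language)": {"programming language", "software"},
        "Java (island)": {"island", "place"},
    },
    "Scala": {
        "Scala (programming language)": {"programming language", "software"},
        "La Scala (opera house)": {"building", "place"},
    },
}

def disambiguate(term, session_terms, senses=SENSES, min_overlap=1):
    """Pick the sense of `term` sharing the most classes with the rest of the session.

    Returns None when the session gives no evidence, mirroring the sessions
    that had to be discarded.
    """
    # Pool the classes of every sense of every other term in the session.
    context = set()
    for other in session_terms:
        if other != term:
            for classes in senses.get(other, {}).values():
                context |= classes

    best, best_score = None, 0
    for sense, classes in senses.get(term, {}).items():
        score = len(classes & context)
        if score > best_score:
            best, best_score = sense, score
    return best if best_score >= min_overlap else None

java_sense = disambiguate("Java", ["Scala", "Java"])
lonely_java = disambiguate("Java", ["Java"])
```

In the two-term session, the programming-language sense shares two classes with the context and the island sense only one, so the former wins; a session containing only Java yields None and would be dropped.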

Pastimes