Main Page

From LQ's wiki
Revision as of 19:07, 4 October 2014 by Changtau2005 (Talk | contribs)

Jump to: navigation, search
At Furano, Hokkaido, Japan

Hi! I'm Li, a 4th year student at University College London currently working on an MEng in Computer Science. I'm most interested in applications of machine learning to large data sets. I haven't decided on a specific research area, but my current interests slant towards applying machine learning to areas related to data mining, semantic computation, and natural language processing. The data I've worked with in the past are web-based (AOL search logs, Bing session data, mined Twitter data, YAGO2).

Why computer science? At first, I entered the field because I love building things. Stacks. Factories. Interfaces. Semaphores. Software are teeming cities running like clockwork on top of layers and layers of abstraction. I thought that I wanted to be a developer for sure, but then I began to see some really interesting problems and approaches to solving them in the field, so I focused my efforts on research too. Computer science (and AI / machine learning) is very much in the middle of interdisciplinary research, and I think this is where the most exciting things are happening. Before this, I was a medical student in Imperial College London - I left after two years - but that's a story for another time ;)

+ For people unfamiliar with computer science, machine learning really is just pattern recognition. If you can reduce a problem to a pattern recognition problem, then you can apply machine learning to solve it. It is a powerful technique that we can use to try and find features / trends / patterns hidden within huge amounts of data (DNA, stock ticks, the internet), or to classify that data into different categories (think algorithm that recognizes faces, road signs, or system intrusions based on anomalous behaviour patterns).

Resume

Selected projects

SynthJS

This logo was actually designed in CSS3, but I converted it to png for certain browsers' sakes

This is a project I came up with when I had about two weeks of free time during Christmas break back in 2013. I felt that my JavaScript was getting a bit rusty and I wanted to explore something related to HTML5, so after looking at the emerging technologies for a while, I settled on a project that explores music technology for the web.

Progress on the project is now frozen partly due to the scope of the project being too big for two weeks (I couldn't gauge how much work it was going to be as the technology was new to me), and partly due to poor scaling of a DOM-based UI. As it stands, it has a reasonable suite of musical instruments, speed control, equalizers for individual instruments, an undo/redo stack (which was non-trivial to implement for a program like this), file import/export, and most importantly, you can sequence simple music with it!

RoboHome

Control interface
Scripting interface

This was a university systems engineering project. The objective was to build a protocol-agnostic control system for home devices which is accessible and scriptable remotely via a web interface. The control system itself has to be low-cost, so we used a Raspberry Pi. Each RPi runs the drivers for devices for three platforms - Arduino, Belkin, and Microsoft Gadgeteer and a full web stack so we can connect to it via WiFi using a laptop to control it. We also had a separate server instance hosted on Azure that authenticates users and connect them to a remote RPi.

Robot race

The standard racing robot
The first track. The second track is a figure-of-eight. Some teams' robots actually went backwards on that one!

A university project. We programmed a standard robot in C to navigate a track with obstacles as quickly as possible. The robot has a frontal ultrasound sensor, two IR sensors at the frontal corners, two frontal bump buttons, and a light sensor at the frontal bottom edge that stops it at the finishing line. The wheels were attached to stepper motors which drive them forward in numbers of ticks per second. The idea was to map the course out and find the shortest path to the end, but because the robot has no point of reference relative to the tracks other than the ticks in both wheels, the tiniest bump (and momentum) throws everything off so nobody actually managed to perform the mapping!

The most straightforward solution in the end is simply a wall follower, and because we were allowed to change the code on racing day, the most successful teams were the ones which modified their algorithm last-minute specifically to target the course being run on (like maximum speed for 3 seconds after the 4th turn). That was a fun event, even if it didn't turn out like I expected!

Android app for field study

A first year university project. We programmed an Android app for the Restless Beings charity as an exercise in requirements gathering from the client. It was designed for quick and easy use by field workers to log health data on homeless children. The app was designed to work both online and offline, because the location of deployment is in rural Africa and only the centers of the largest towns have any sort of internet connection. These workers often work with the same children, so using the phone GPS, they track the movement of each individual over time to generate statistics like distance to the nearest water source.

Internships

Microsoft Research Cambridge

MSR Cambridge Bright Minds Interns 2014. I'm the one in the light blue shirt
MSR Cambridge Machine Learning and Perception group

I joined Microsoft for 8 weeks as Research Intern through the Bright Minds Internship Competition programme for undergraduates. I was supervised by Pushmeet Kohli and Yoram Bachrach, with limited help from Ulrich Paquet and Filip Radlinski.

Since the allocated time is short, I was told about the problem and given a direction to start off on - it was about parental / access control of the internet. There were several problems in this area we hoped to address:

  1. Traditional access control relies on black and whitelists of URLs. For instance, OpenDNS offers a service where they simply refuse to resolve blacklisted domains. However, domain ownership changes constantly and the lists have to be updated to reflect this.
  2. A site is either blocked or not, based on a fixed criterion, and there is no customization of the partitioning. For example, if a company wants to block Facebook at work, they couldn't use a service which has it in the whitelist.
  3. Users can only recognize a tiny fraction of sites in either list.

We wanted to try a different method. The project was called SmartFence. Unfortunately I cannot discuss it in detail (until the paper gets published at some point, hopefully), but I can give an idea of what we did. We make the assumption that users tend to visit websites which serve the same information need in a session. Hence, websites belonging within a search session in Bing are said to be correlated, and we transform this correlation data into multi-dimensional feature vectors. We then construct a giant similarity matrix of websites based on these vectors. For our investigation, we constructed it in-memory for the top 10k sites rather than for everything. When users block or allow the sites which they recognize, we label these -1 and 1 respectively. For any site which a user does not recognize, its score is calculated based on its similarity to all the other labeled sites.

We used the approach primarily because the we need to be able to compute the partitioning quickly when the user changes the labeled set of websites or weighting decay function and cutoff. I delivered a working prototype in 4 weeks for the internal company hackathon, and a refined version based on D3.js at the end of the internship.

UniEntry

Course browse page for UniEntry

UniEntry is a startup company which I worked for in summer 2013. It aims to better inform students to when picking UK universities by aggregating user ratings and information from the Higher Education Statistics Agency, and recommend a good spread of choices based on the student's grades. It is founded as a part time venture by two individuals, who hired myself and another developer to develop a pilot site over the summer.

The pilot site is supposed to be used as a proof of concept to get schools to be involved and to look for funding. We used the agile development model, so we had daily stand-ups, sprint planning, progress burndown charting, the works. The site was developed in ASP.NET, and I was the primary front-end developer and designer. After the initial three months, I maintained contact with the company and submitted bugfixes when requested.

Other

My other experiences are related to my brief stint in medical school rather than computer science, but I consider them to be valuable and one-of-a-kind.

  • Work shadowing in van Andel Institute, Michigan, USA. I generally observed the environment and working atmosphere within a biomedical research lab. I learned about what the researchers do on a regular basis, things like automatic sequencing, running DNA microarrays etc. I learned how they made knockout mice (mice with certain genes deactivated) to study its effects, and how they highlight sections of DNA by binding highly specific fluorescent molecules to them, using techniques (with really fancy names) like spectral karyotyping and fluorescent in-situ hybridization. Looking back at my experiences in the lab, it always reminds me of how different the nature of the work in the various fields of science can be.
  • In Malaysia, I had a work placement in a hospital's critical care unit and department of anaesthesia. I remember how meticulous everything was - the cleanliness precautions we had to take, the nurses charting the patient's statistics every few hours, etc. Later on, I was attached with the Department of Public Health of Penang, and we went off checking the safety of water supplies and fogging areas with reported cases of dengue fever.

Research

I was fortunate enough to have the opportunity to be involved in several short-term research projects (2 months - 6 months) during my undergraduate years. Generally, internship opportunities for undergraduate students in the UK tend to be limited to development work.

Big Five Personality Classification of Twitter Profile by Machine Learning

This is the title for my Masters dissertation. At the time of writing, I've just begun to work on it, so everything is still highly tentative. Supervisor: Emine Yilmaz. Personal tutor: Dr. Kevin Bryson

By mining the text corpus of individual Twitter profiles, we hope to classify the user in the five categories of the Big Five model. We plan to do so by identifying adjectives in them labeled with a "weight" towards one end of each category. Such labels can be found from the seminal Allport-Odbert 1936 list and in similar works. We are scoping the project to only consider Twitter profiles in English.

We hope that the findings form a basis for further research into identifying individuals with potential signs of depression based on their Twitter activity. Depending on the speed of progress, we might have some time to consider this part of the problem.

SmartFence

Task Identification using Search Engine Query Logs

Root node of tree. Only selected subclass nodes (blue / red) are displayed. Orange nodes are the entities most often searched for in a class. Green nodes are the most frequently searched-for strings in conjunction with an entity of that class. Visualized using D3.js

This was my undergraduate university-based research project. The goal is to find out what users are most interested about when they search for a certain class of things. For example, using Google's related searches, if you search for Hawaii, it gives you results along the lines of Hotels in Hawaii, or Flights to Hawaii. Based on our understanding of how Google's system works, these are the most common strings appearing with Hawaii in searches.

We wanted to go one step further. By using a knowledge base like YAGO and the set of AOL logs leaked in 1997, we do the same, but with semantics. For example, Hawaii would be determined to be a place. Using relations like these (Hawaii {hasClass} place), we build up a tree of classes - Root --> Person --> Artist --> Musician --> Singer --> Michael Jackson, for example, and we aggregate the related search strings using this tree. Hence, by querying nodes of this tree, we can find out the most popular entries when searching for a human being, for instance. I thought this was an extremely interesting problem to tackle.

We ran into problems of ambiguity - for example, Java may mean the programming language, or the place in Indonesia, or a dozen other things. We disambiguate by comparing the number of similar classes terms belong to. For example, if a search session contains the terms Scala and Java, we can be sure that Java means the programming language. We ended up discarding many sessions which did not give us enough data to disambiguate, and we didn't have enough data in the end to populate the tree beyond the first 3 to 4 layers. We were extremely time-constrained (3 months) so we couldn't refine our methods to improve the results, but for our efforts, the project was awarded best research project in our year.

Pastimes

I enjoy reading. In particular, material like National Geographic, The Economist, and New Scientist. I've also been studying Japanese on my own in my free time for almost a year now - now I can maybe read, speak, and listen at the level of a child in grade school. Naturally, I enjoy reading manga and watching anime too, and I study well-known works that I've read before like The Little Prince or Treasure Island in Japanese.

In my more carefree years, I used to play games quite a bit - BG, NWN, Torment, Panzer Dragoon, C&C, Starcraft, Warcraft, Total War, SupCom, Galciv, SotS, Civ, DotA, Diablo, Wing Commander, Freespace, Freelancer, X series, XCOM, Unreal, Mass Effect, Kirby, Zelda, Pokemon, all manner of JRPGs like Mana, Persona, Disgaea, Suikoden, Star Ocean, Etrian Odyssey, Dragon Quest, Tales, Grandia, Kingdom Hearts, FF... and a whole load more. Nowadays I no longer have the time for them, but some of these did wonders for my imagination - they are like novels or epics that stay with you forever. I play the piano and the erhu and gaohu (Chinese viola and violin), and I used to be the instructor and lead player for the erhu back in high school.