{{:Template:CSS:MainPage}}
<div style="font-size: 16px; margin: 10px 0; padding: 5px 10px; background-color: #efefef; outline: 1px solid #ccc;">[[:File:CV-General-Li-Quan-Khoo.pdf|Resume]] {{!}} [{{Template:LinkedInURL}} LinkedIn] {{!}} [{{Template:GithubURL}} Github] {{!}} [{{Template:LandingPageURL}} Landing page]</div>
Hi, I'm Li! I graduated in August 2015 from [http://www.ucl.ac.uk/ University College London] with an MEng in Computer Science with first class honours. Since then I've gravitated towards the research side of things, especially applied machine learning and artificial intelligence. At university, I collaborated with my advisor, Prof. Emine Yilmaz, on topics related to search and advertising, semantic computation, and NLP, working with resources such as <small>the [http://www.natcorp.ox.ac.uk/using/index.xml BNC], [https://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago/ YAGO2], [http://wordnet.princeton.edu/ WordNet], and [http://sentiwordnet.isti.cnr.it/ SentiWordNet]</small>. More recently, I've become interested in neural nets as solutions to a subset of problems I've encountered during my professional career, such as machine vision, translation, music synthesis, and the analysis of spatial or temporal data. I'm currently a registered SCPD student under Stanford University's professional education programme, while employed as a research SDE in Singapore.
  
 
Why computer science? At first, I entered the field because I love building things. Stacks. Factories. Interfaces. Semaphores. Software systems are teeming cities running like clockwork on top of layers and layers of abstraction. I thought that I wanted to be a developer for sure, but then I began to see some really interesting problems and approaches to solving them in the field, so I focused my efforts on research too. Computer science (and AI / machine learning) sits right in the middle of interdisciplinary research, and I think this is where the most exciting things are happening. Before this, I was a medical student at [http://www3.imperial.ac.uk/ Imperial College London] - I left after two years - but that's a story for another time.
For people unfamiliar with computer science, (supervised) machine learning really is just pattern recognition. If you can reduce a problem to a pattern recognition problem, then you can apply machine learning to solve it. It is a powerful technique that we can use to try and find features / trends / patterns hidden within huge amounts of data (DNA, stock ticks, the internet), or to classify that data into different categories (think of an algorithm that recognizes faces, road signs, or system intrusions based on anomalous behaviour patterns).
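For the programmers in the audience, here is about the smallest complete example of that idea I can write - a toy nearest-centroid classifier on made-up data (illustrative only, not from any of my projects):

<syntaxhighlight lang="python">
import numpy as np

# Toy "pattern recognition": label a point by the nearest class centroid.
# The two Gaussian blobs stand in for any featurized data
# (pixel statistics, word counts, behaviour logs).
np.random.seed(0)
class_a = np.random.normal(0.0, 1.0, size=(100, 2))  # training samples, label 0
class_b = np.random.normal(4.0, 1.0, size=(100, 2))  # training samples, label 1

centroids = np.stack([class_a.mean(axis=0), class_b.mean(axis=0)])

def predict(x):
    """Label = index of the nearest class centroid."""
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

print(predict(np.array([0.5, -0.2])))  # 0: looks like class A
print(predict(np.array([3.8, 4.1])))   # 1: looks like class B
</syntaxhighlight>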
  
==What I'm working on==
 
* [[:File:CV-General-Li-Quan-Khoo.pdf|Resume (PDF version)]]
 
 
I am currently preparing for further graduate study, primarily towards the modelling of financial time series data. There are well-known issues with applying simple autocorrelation and Fourier decomposition methods to such data, due to windowing, noise, aperiodicity, and non-stationarity. For purposes of evaluating risk, I want to learn about modelling with stochastic processes, which is how the quantitative finance industry generally tackles such problems.
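As a toy illustration of why the naive methods mislead here (my own sketch, with made-up data): the sample autocorrelation of a random walk - the textbook non-stationary series - stays near 1 at every short lag, while its stationary increments show none of that structure:

<syntaxhighlight lang="python">
import numpy as np

np.random.seed(0)
steps = np.random.randn(5000)   # stationary increments
walk = np.cumsum(steps)         # random walk: the classic non-stationary series

def autocorr(x, lag):
    """Sample autocorrelation at the given lag."""
    x = x - x.mean()
    return np.dot(x[:-lag], x[lag:]) / np.dot(x, x)

for lag in (1, 10, 100):
    print(lag, round(autocorr(walk, lag), 3), round(autocorr(steps, lag), 3))
# The walk's autocorrelation stays near 1 at every lag shown (an artefact of
# the trend), while the increments sit near 0 - naive correlograms mislead here.
</syntaxhighlight>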
 
 
My current research is towards the application of time-delay neural networks (TDNNs) for forecasting discretized time series data. To elaborate a bit more:
 
 
Structurally, time series are highly similar to images (serial dependence along axes), and TDNNs are essentially ConvNets, which have state-of-the-art performance in image pattern recognition tasks. For time-series forecasting, what we look for is an approximation of a multivariable function f that takes some previous time window (subsampled or raw) and predicts the next value in discrete time. As with any sort of fine-grained forecasting, only the short term is relevant; because the system uses its own output to predict further values in the future, errors compound with further extrapolation. The strength of stochastic processes, in comparison, is that they forgo this fine granularity but capture long-term statistical properties of the series, allowing for better long-term risk control. I believe the two are very good complements to one another, and depending on the time-scale of your strategies, one may be much more relevant than the other - for example in high-frequency trading (HFT), where market microstructure and short-term models are extremely important.
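Here is a minimal sketch of that windowed setup, with ordinary least squares standing in for the TDNN (every name and parameter here is illustrative):

<syntaxhighlight lang="python">
import numpy as np

np.random.seed(0)
series = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.1 * np.random.randn(2000)
w = 16  # input window length

# (window -> next value) training pairs: the same supervision a TDNN would see.
X = np.array([series[i:i + w] for i in range(len(series) - w)])
y = series[w:]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # linear stand-in for f

# Recursive multi-step forecast: each prediction is fed back in as input,
# which is exactly why errors compound with the forecast horizon.
window = list(series[-w:])
forecast = []
for _ in range(50):
    nxt = float(np.dot(coef, window[-w:]))
    forecast.append(nxt)
    window.append(nxt)
print(forecast[:5])
</syntaxhighlight>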
 
 
Recurrent nets (RNNs) are also fantastic in the domain of sequential pattern recognition, but I favor TDNNs for financial data because they are more "stable" without recurrent connections and hence are much more straightforward to train, plus the main problem that plagues TDNNs - the inability to handle variable-sized inputs - is completely absent in this context. RNNs are Turing-complete, which makes them more powerful, but for time series prediction, what we're really looking for is a simple function approximation, and feedforward architectures like TDNNs are entirely sufficient for that purpose.
 
 
Deep learning in general has been pushing the boundaries of what's possible at an astonishing pace - we went from shallow nets to pre-trained hidden layers, then we realised the vanishing gradient problem for what it is, and now everybody is using variants of ReLU activations with Dropout etc. for regularization. All this happened within just a handful of years, and we are already in the region of superhuman performance for certain tasks. This is certainly a very new and exciting area to be working in.
 
  
 
===Curriculum of study===
I've been following a curriculum of MOOC and continued education courses since August 2015, primarily to inform my research interests.
  
Current focus: ''Deep learning on sequential data''
 
 
{| class="wikitable mw-collapsible mw-collapsed" style="line-height: 1.2em; font-size: 0.9em; border-collapse: collapse;"
 
{| class="wikitable mw-collapsible mw-collapsed" style="line-height: 1.2em; font-size: 0.9em; border-collapse: collapse;"
 
!Detailed curriculum
 
!Detailed curriculum
Line 89: Line 76:
  
 
Probability triples, sigma-algebra, measure, Caratheodory's Extension Theorem, Borel sets, Lebesgue [0,1] measure, random variables, independence, limit events, tail fields, zero-one laws (Borel-Cantelli, Kolmogorov)
 
Probability triples, sigma-algebra, measure, Caratheodory's Extension Theorem, Borel sets, Lebesgue [0,1] measure, random variables, independence, limit events, tail fields, zero-one laws (Borel-Cantelli, Kolmogorov)
|-
|EE
|[http://ocw.mit.edu/courses/electrical-engineering-and-computer-science/6-003-signals-and-systems-fall-2011/ 6.003 Signals and Systems]
|MIT||OCW||Yes||Partial||N/A||Completed (Sep 2016)
|
*Signals and Systems (Oppenheim, Willsky)
|Signals, samples, DT, CT systems, feedback, poles, modes, Z-transform, bilateral Laplace transform, DT approximation of CT systems, convolution, LTI systems and eigenfunctions, phase-amplitude frequency response, Bode plots, CT, DT Fourier series, Fourier transform, sampling, impulse reconstruction, aliasing, quantization
|-
|Comp
|N/A
|Distributed ML, Apache Spark, pySpark, SparkSQL, distributed regression, distributed PCA, computational complexity, big n, big d considerations
|-
|Comp
|[https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/ Deep Learning]
|Oxford|| -||Yes||No||N/A||Completed (Dec 2016)
|
*Machine Learning: A Probabilistic Perspective (Kevin Murphy)
*Elements of Statistical Learning (Hastie, Tibshirani, Friedman)
*[http://yann.lecun.com/exdb/publis/pdf/lecun-06.pdf A Tutorial on Energy-Based Learning] (LeCun, 2006)
|Linear models, MLE estimation, entropy, regularization, the kernel trick, model complexity, MAP estimation, radial basis functions, optimization: Newton's, SGD, Polyak average, momentum methods, Adagrad, logistic regression, softmax, modular backpropagation, convnets, multichannel spatial, volumetric convolution, k-max pooling, Xavier initialization, embedding in continuous spaces, transfer learning, auxiliary tasks, energy-based methods, generalized margin loss, RNNs, LSTMs, ...
|-
|Comp
|[http://cs231n.stanford.edu/syllabus.html CS231n Convnets for Visual Recognition]
|Stanford|| -||Yes||No||N/A||Completed (Nov 2016)
|
*Course notes
*[http://www.youtube.com/watch?v=RJVL80Gg3lA Visualization with t-SNE]
|Batch-norm, parameter update rules: Nesterov, Adagrad, RMSProp, Adam, L-BFGS, residual nets, localization, visualization, DeepDream/NeuralStyle, dropout, RNNs, LSTMs, GRUs, deep learning libs, segmentation, soft attention, spatial transformers, spatio-temporal convolutions, variational autoencoders, adversarial encoder/decoder, model distillation
|-
|Econ
|}<!-- End outer table -->
  
==Research==
  
===Bounding Out-of-Sample Objects (2017)===
This was my project submission for Stanford University's CS231n Convnets for Visual Recognition.
* Report: [ [[:File:Cs231n-project-pdf.pdf|PDF]] | [[:File:Cs231n-project-bib.bib.txt|BIB]] | [[:File:Cs231n-project-latex.zip|ZIP]] ]
* Poster: [ [[:File:Cs231n-2017-poster.pdf|PDF]] | [[:File:Cs231n-2017-poster-pptx.pptx|PPTX]] | [[:File:Cs231n-2017-poster-vid.mp4|MP4]] ]
* Code: [ [https://github.com/lqkhoo/cs231n-spring-2017 Github repo] ]
  
<b>Abstract</b>
 
 
{| style="float: right;"
|[[File:Cs231n-2017-img02.png|x250px|thumb|Architecture]]
|[[File:Cs231n-2017-img01.png|x250px|thumb|Visualization]]
|}
 
  
ConvNets have spatial awareness built into them by design, specifically in their feature maps or activation volumes. With respect to any task such as image classification, they are at the same time implicitly performing some form of localization of salient features. We want to understand how good this localization information is at annotating bounding boxes compared to ground-truth, especially around out-of-sample objects which the convnet has never been trained on. This requires a model which is indifferent to the class label, and thus it must operate on spatial information that is as general as possible, across as many image classes as possible. Additionally, we want to investigate whether this method could be adapted to different image classification architectures.
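As a rough, class-agnostic illustration of the intuition (hypothetical shapes and thresholds - the actual project operates on real convnet activation volumes and evaluates against ground-truth boxes): collapse the channels of an activation volume into one saliency map, threshold it, and read a box off the surviving region:

<syntaxhighlight lang="python">
import numpy as np

def box_from_activations(activations, quantile=0.9):
    """activations: (C, H, W) volume from some conv layer.
    Collapse channels, keep the most active cells, and return a
    (y0, x0, y1, x1) box around them - no class label involved."""
    saliency = activations.mean(axis=0)                 # (H, W), class-agnostic
    mask = saliency >= np.quantile(saliency, quantile)  # keep the top cells
    ys, xs = np.nonzero(mask)
    return ys.min(), xs.min(), ys.max(), xs.max()

# Fake activation volume with one hot region standing in for an object.
np.random.seed(0)
act = np.random.rand(64, 14, 14) * 0.1
act[:, 4:9, 6:11] += 1.0
print(box_from_activations(act))  # approximately (4, 6, 8, 10); scale up to image coords
</syntaxhighlight>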
===Predicting Personality from Twitter (2015)===
* Final year MEng project: [ [[:File:Meng-report.pdf|PDF]] ]

<b>Abstract</b>
  
{| style="float: right;"
|[[File:Meng-project-chart3.png|350px|thumb|right|Receiver operating characteristic plot of conscientiousness model.]]
|[[File:Meng-project-chart4.png|350px|thumb|right|Precision-recall plot of conscientiousness model.]]
|-
|[[File:Meng-project-chart1.png|350px|thumb|right|Plot of predicted scores of agreeableness models trained by optimizing various metrics, against actual score.]]
|[[File:Meng-project-chart2.png|350px|thumb|right|Surface plot of model performance, evaluated by Pearson correlation score, vs training and data parameters.]]
|}
We present a 6-month-long, multi-objective, two-part study on the prediction of Big-Five personality scores from Twitter data, using lexicon-based machine learning methods. We investigate the usefulness of models trained with TweetNLP features, which have not been used in this domain before.

In the first part of the study, we cast personality prediction as a classification problem, and we investigate how prediction performance is affected by different methods of data normalization, such as whether we divide each feature by word count. In the second, main part of our study, we cast it as a regression problem, and we investigate the differences in performance when we use ranks of scores rather than actual scores, and how filtering only for users with over a certain tweet count affects prediction performance.

We report on the different methods used in existing literature, explain background information about the tools we used, and look at the common evaluation metrics used in classification and regression problems, addressing potential pitfalls when calculating or comparing them. We also suggest a solution for reconciling learning parameters for different models optimizing different metrics. Finally, we compare our best results with those in recent publications.

Our main findings are that term frequency-normalized features perform most consistently, that filtering for users (>200 tweets) improves prediction performance significantly in the regression problem, and that prediction performance using ranked data is comparable to using actual values. We found that models trained with TweetNLP features have comparable or superior performance to those trained with the LIWC and MRC features commonly used in the literature. Models trained with both have superior performance. Compared against 15 recent models (3 papers, 5 personality scores), our best models are better at prediction than 11 of them.
  
===IRDM project poster presentation (2015)===
* Poster: [ [[:File:IRDM poster.pdf|PDF]] ]

This was a 4th year Information Retrieval and Data Mining course project about automatic document-tagging methods. The idea is extremely simple - we use named entity tags (Barack Obama, United States), RAKE algorithm-extracted keywords (candidate, poll, election), and WordNet domains information (political), to give readers a precise idea of an article's topic(s). From the given example tags alone, we can already be pretty confident that the target article is discussing the last one or two US presidential elections. The intuition is that the entity tags (proper nouns) give us specificity, the RAKE keywords (general, frequent terms) give us coverage, and WordNet domains information gives us a hierarchy to cluster similar articles together by topic.
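For a flavour of the keyword half of the idea, here is a heavily simplified RAKE-style scorer (a toy re-implementation for illustration, not our project code, with a stopword list far shorter than any real one):

<syntaxhighlight lang="python">
import re
from collections import defaultdict

STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "at", "is", "for", "on"}

def rake_keywords(text, top_n=5):
    """Toy RAKE: candidate phrases are maximal stopword-free runs of words;
    each word is scored by degree/frequency and phrases by summed word scores."""
    words = re.findall(r"[a-z']+", text.lower())
    phrases, current = [], []
    for w in words:
        if w in STOPWORDS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(w)
    if current:
        phrases.append(current)

    freq, degree = defaultdict(int), defaultdict(int)
    for phrase in phrases:
        for w in phrase:
            freq[w] += 1
            degree[w] += len(phrase)  # co-occurrence degree, counting itself
    score = lambda ws: sum(degree[w] / freq[w] for w in ws)
    return sorted({" ".join(p) for p in phrases},
                  key=lambda s: -score(s.split()))[:top_n]

print(rake_keywords("The candidate led the poll in the presidential election "
                    "and the election poll closed at midnight"))
</syntaxhighlight>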
We were awarded second-best for poster presentation.

===SmartFence (2014)===
* Please see [[#Microsoft Research Cambridge]]

===Task Identification using Search Engine Query Logs (2014)===
* Report [ [[:File:Task Identification Using Search Engine Query Logs - Report.pdf|PDF]] ]
* Lit review [ [[Task Identification Using Search Engine Query Logs|PDF]] ]
[[File:Ex-task-identification.png|thumb|right|550px|Root node of tree. Only selected subclass nodes (blue / red) are displayed. Orange nodes are the entities most often searched for in a class. Green nodes are the most frequently searched-for strings in conjunction with an entity of that class. Visualized using D3.js. The search logs were from 1997, so the results reflect the zeitgeist at the end of the 20th century.]]
This was my undergraduate university-based research project. The goal is to find out what users are most interested in when they search for a certain class of things. For example, using Google's related searches, if you search for <tt>Hawaii</tt>, it gives you results along the lines of "Hotels in Hawaii" or "Flights to Hawaii". Based on our understanding of how Google's system works, these are the most common strings appearing with <tt>Hawaii</tt> in searches.

We wanted to go one step further. By using a knowledge base like YAGO and the set of AOL logs leaked in 1997, we do the same, but with semantics. The idea is to lump searches like the ones above together with other searches of the same class - for example, with "Colleges in Hong Kong" or "Banks in London". These searches are similar because they are queries related to a place, and by aggregating all these searches together, we can find out what users most want to know when they query for places. First, we build up a class tree using the YAGO ontology. An example of a branch would be: Root --> Person --> Artist --> Musician --> Singer. Then, we add entities to each of these classes based on <i>rdf triples</i> expressing it, like <tt>ElvisPresley hasClass Singer</tt>. So, when we find a search string referencing the entity <tt>ElvisPresley</tt>, we remove the entity from the string and map the remainder to the class the entity belongs to, as well as to all the ancestors of that class. For example, the search string <tt>Biography of Elvis Presley</tt> would map the string <tt>Biography of</tt> to the classes Root, Person, Artist, Musician, and Singer. Hence, by querying nodes of this tree, we can find out the most popular tasks users want to do when searching for any class of things. I thought this was an extremely interesting problem to tackle.
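A compressed sketch of that mapping step (a hypothetical five-class mini-ontology standing in for YAGO, which in reality holds millions of triples):

<syntaxhighlight lang="python">
from collections import defaultdict

# Tiny stand-in for the YAGO class tree: child -> parent.
parent = {"Singer": "Musician", "Musician": "Artist",
          "Artist": "Person", "Person": "Root"}
# From rdf triples like `ElvisPresley hasClass Singer`.
entity_class = {"elvis presley": "Singer"}

def ancestors(cls):
    """Walk the child -> parent links up to Root."""
    chain = [cls]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

tasks = defaultdict(list)  # class -> residual search strings ("tasks")

def index_query(query):
    q = query.lower()
    for entity, cls in entity_class.items():
        if entity in q:
            residual = q.replace(entity, "").strip()
            for c in ancestors(cls):  # credit the class and all its ancestors
                tasks[c].append(residual)

index_query("Biography of Elvis Presley")
print(tasks["Singer"], tasks["Person"])  # ['biography of'] ['biography of']
</syntaxhighlight>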
AOL logs come from the real world, and real data is messy. It was full of typographical errors, machine-issued queries, foreign languages, keywords unsafe for minors, etc. To give an idea of what we did to get useful data out of it: first, we ran all the searches through a simple regex sieve to filter out nonsensical queries (only symbols, URLs) and machine-issued queries, which tend to be very long. Then we grouped related searches together - this is called search session segmentation (we used Levenshtein distance, tri-grams, and a cutoff of T=26 minutes). Then we ran Porter-Stemmer. The Porter-Stemmer algorithm reduces English words to their root word, so we can match them against the entities in the YAGO knowledge base. We also had to think about greedy matching. Consider the string <tt>New York Times</tt>: we want to match the entire string rather than <tt>New York</tt> and <tt>times</tt> separately.
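To give a feel for the sieve-and-stem step (a simplified sketch: the cutoffs here are made up, and the real pipeline also performed session segmentation and greedy longest-match lookup):

<syntaxhighlight lang="python">
import re
from nltk.stem import PorterStemmer  # pip install nltk

stemmer = PorterStemmer()
MAX_QUERY_WORDS = 12  # hypothetical cutoff; machine-issued queries tend to be very long

def clean_query(raw):
    """Regex sieve plus stemming; returns None for queries we would discard."""
    q = raw.strip().lower()
    if not q or re.fullmatch(r"[\W_]+", q):        # symbols only
        return None
    if re.search(r"https?://|www\.|\.com", q):     # URL queries
        return None
    words = q.split()
    if len(words) > MAX_QUERY_WORDS:               # likely machine-issued
        return None
    return " ".join(stemmer.stem(w) for w in words)  # root words for entity matching

print(clean_query("biographies of elvis presley"))  # e.g. 'biographi of elvi presley'
print(clean_query("http://example.com"))            # None
</syntaxhighlight>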
We also ran into problems of ambiguity (this is called word-sense disambiguation in NLP) - for example, <tt>Java</tt> may mean the programming language, the place in Indonesia, or a dozen other things. We disambiguate by comparing the classes that co-occurring terms belong to. For example, if a search session contains the terms <tt>Scala</tt> and <tt>Java</tt>, we can be sure that <tt>Java</tt> means the programming language. The problem is that if only one entity matches in a session, we cannot disambiguate, because we don't have another entity to compare it against. We ended up discarding many such sessions, so the remaining data only managed to populate the tree up to around the third layer. That was as far as we got in three months, but after I delivered the presentation, the project was awarded [http://www.cs.ucl.ac.uk/computer_science_news/article/undergraduate-research-group-projects-prize-winners/ best undergraduate research group project of the year].
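The overlap heuristic itself fits in a few lines (a toy version with a hand-written sense inventory; the real one came from YAGO):

<syntaxhighlight lang="python">
# Candidate senses for ambiguous terms, each with the classes it belongs to.
SENSES = {
    "java":  {"Java (language)": {"ProgrammingLanguage"},
              "Java (island)":   {"Place", "Island"}},
    "scala": {"Scala (language)": {"ProgrammingLanguage"}},
}

def disambiguate(term, session_terms):
    """Pick the sense whose classes overlap most with the classes of
    the other entities in the same search session."""
    context = set()
    for other in session_terms:
        if other != term:
            for classes in SENSES.get(other, {}).values():
                context |= classes
    if not context:
        return None  # single-entity session: nothing to compare against, discard
    return max(SENSES[term].items(), key=lambda kv: len(kv[1] & context))[0]

print(disambiguate("java", ["scala", "java"]))  # 'Java (language)'
print(disambiguate("java", ["java"]))           # None -> session discarded
</syntaxhighlight>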
  
 
==Full-time positions==
 
While a non-disclosure agreement prevents me from talking in-depth about the results and methodology, I used the [http://labs.yahoo.com/news/yfcc100m/ Yahoo! Flickr Creative Commons 100 Million Images] (Yfcc100m) dataset and its pre-extracted feature descriptors from [http://www.yli-corpus.org/ YLI], and essentially applied the most straightforward methods known to me, as the project needed to finish on time.

What I found most interesting about the project was that the difficulties lay in unexpected areas. For example, because the field is new, there are inconsistencies in how to interpret the problem. Without going into detail, based on how images are tagged on Flickr, what we have is a <i>multi-label classification problem</i>. Some researchers (including authors of an AAAI 2015 paper) had simplified the problem to single-label classification when using Flickr's search API to compile their dataset, which I found unjustifiable based on label distributions in Yfcc100m. During the process, I also found and [https://github.com/nltk/nltk/issues/1062 documented] a behavioural anomaly in how the widely-used Natural Language Toolkit ([http://www.nltk.org/ NLTK]) interacts with the WordNet 3.0 and SentiWordNet corpora. On top of that, I had to tailor some high-level decisions on the fly towards addressing various potential applications in the context of the company's clients, which were not objectives in the original research proposal. Along with reading the literature on image representation and machine learning on images, which I was relatively unfamiliar with in general, it all added up to a very interesting learning experience indeed.
  
 
After the project ended, while my agreement with the company was full-time and I was welcome to take over one of several research projects, including drafting a new grant proposal based on results of my research, I had decided to pursue my own interests by then, and we parted ways amicably.
 
We initially performed k-means clustering to investigate what the clusters would look like for different values of <tt>k</tt>. We wanted to see if these partitions made sense. Then we had a brief foray into reducing the vectors to two-dimensional space (bearing in mind the information loss) so the user could draw out areas on the screen which they want to block or allow. I also briefly considered partitioning by Voronoi cells while working in 2D (because computing convex hulls rapidly becomes infeasible in higher dimensions). We finally abandoned hard cluster boundaries altogether and settled on a kernel method. This meant that initially, every site is related to every other (a fully connected graph); we cull the arcs by setting a minimum similarity threshold, and we control the partitioning by setting the value needed for a site to be blocked.
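A toy version of that final design (the feature vectors and thresholds here are made up for illustration):

<syntaxhighlight lang="python">
import numpy as np

def blocked_sites(features, seed_blocked, sim_threshold=0.6, block_level=0.5):
    """Soft partitioning sketch: start from a fully connected similarity
    graph over site feature vectors, cull arcs below sim_threshold, then
    block a site if its mean retained similarity to blocked seeds is high."""
    X = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = X @ X.T                      # cosine similarity between every pair of sites
    sim[sim < sim_threshold] = 0.0     # cull weak arcs
    score = sim[:, seed_blocked].mean(axis=1)
    return np.nonzero(score >= block_level)[0]

np.random.seed(0)
sites = np.random.rand(10, 5)          # 10 sites, 5-dimensional feature vectors
print(blocked_sites(sites, seed_blocked=[0, 1]))  # indices of sites to block
</syntaxhighlight>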
  
I delivered a working prototype in 4 weeks for the internal company hackathon, and a refined version with a web front-end (knockout + D3) at the end of the internship. I didn't have any exposure to machine learning before the internship, but thankfully, I managed to pick up the basics on the fly. The period before the hackathon was particularly intense, because we had to get a basic algorithm and interface going in just four weeks. My previous experience in web development meant that I could rapidly iterate through multiple versions of the GUI to get something intuitive to what our algorithm was trying to do. Filip took particular interest in the GUI implementation - he said it was more impressive than what one would usually see at a conference.
  
 
This was as good an internship as I could have ever hoped for. I got to see what it is like to work in industrial research, I began to understand how powerful machine learning is and what inventive things other researchers are doing with it, and most of all, I got the chance to work with some really fantastic people. I was trying very hard not to be the stupidest guy in the room - I was an undergraduate amongst PhDs, postdocs, and people with decades of industry experience, but the guidance and support I received was second to none. It was awesome!
* In Malaysia, I had a work placement in a hospital's critical care unit and department of anaesthesia. I remember how meticulous everything was - the cleanliness precautions we had to take, the nurses charting the patients' statistics every few hours, etc. Later on, I was attached to the Department of Public Health of Penang, and we went around checking the safety of water supplies and fogging areas with reported cases of dengue fever.
  
==Selected projects==

===SynthJS===
[[File:Synthjs-scrshot-03.png|thumb|right|350px|This logo was actually designed in CSS3, but I converted it to png for certain browsers' sakes]]
SynthJS is a music sequencer built using web technologies that runs entirely on the client machine. It is a project I came up with when I had about two weeks of free time during Christmas break back in 2013. I felt that my JavaScript was getting a bit rusty and I wanted to explore something related to HTML5, so after looking at emerging HTML5 technologies for a while, I settled on a project exploring music technology for the web.
Progress on the project is now frozen, partly due to the scope being too big for two weeks (I couldn't gauge how much work it was going to be, as the technology was new to me), and partly due to poor scaling of a DOM-based UI. As it stands, it has a reasonable suite of musical instruments, speed control, equalizers for individual instruments, an undo/redo stack (which was non-trivial to implement for a program like this), file import/export, and most importantly, you can sequence simple music with it!
  
* [https://github.com/lqkhoo/SynthJS Github repo]
* To play with it, [http://lqkhoo.com/synthjs click here]. Have fun!
  
===RoboHome===
{| style="float: right;"
|[[File:Robohome-scrshot2.png|thumb|right|350px|Control interface]]
|[[File:Robohome-scrshot3.png|thumb|right|350px|Configuration interface]]
|}
This was a university systems engineering project. The objective was to build a platform- and protocol-agnostic control system for home devices, accessible and scriptable remotely via a web interface. The control system itself had to be low-cost, so we used a Raspberry Pi. Each RPi runs the device drivers for three platforms (Arduino, Belkin, and Microsoft Gadgeteer) and a full web stack, so we can connect to it via WiFi using a laptop to control it. We also had a separate server instance hosted on Azure that authenticates users and connects them to a remote RPi.
  
* [http://comp2013group7.wikia.com/wiki/Comp2013group7_Wiki Team dev diary] | [https://github.com/michboon/comp2014 Github repo]

===Robot race===
{| style="float: right;"
|[[File:Robot-race-robot.jpg|thumb|right|350px|The standard racing robot]]
|[[File:Robot-race-track.jpg|thumb|right|350px|The first track. The second track is a figure-of-eight. Some teams' robots actually went backwards on that one!]]
|}
A first year university project. We programmed a standard robot in C to navigate a track with obstacles as quickly as possible. The robot has a frontal ultrasound sensor, two IR sensors at the frontal corners, two frontal bump buttons, and a light sensor at the bottom front edge that stops it at the finishing line. The wheels were attached to stepper motors which drive them forward in ticks per second. The original idea was to map the course out and find the shortest path to the end, but because the robot has no point of reference relative to the track other than the ticks in both wheels, the tiniest bump (and momentum) throws everything off, so nobody actually managed to perform the mapping!
  
The most straightforward solution in the end was simply a wall follower, and because we were allowed to change the code on racing day, the most successful teams were the ones which modified their algorithm last-minute specifically to target the course being run (like maximum speed for 3 seconds after the 4th turn). That was a fun event, even if it didn't turn out like I expected!

===Android app for field study===
A first year university project. We programmed an Android app for the [http://www.restlessbeings.org/ Restless Beings charity] as an exercise in requirements gathering from the client. It was designed for quick and easy use by field workers to log health data on homeless children. The app had to work both online and offline, because the location of deployment was in rural Africa, where only the centers of the largest towns have any sort of internet connection. These workers often work with the same children, so using the phones' GPS, they track the movement of each individual over time to generate statistics like distance to the nearest water source.
 
==Open source bugfixes==
{| class="wikitable"
|[https://github.com/nltk/nltk/issues/1062 WordNet satellite adjectives lookup failure in SentiWordnet]
|[http://www.nltk.org/ NLTK]
|Closed
|July 2015
|}
In my more carefree years, I used to play games quite a bit - <small>BG, NWN, Torment, Panzer Dragoon, C&C, Starcraft, Warcraft, Total War, SupCom, Galciv, SotS, Civ, DotA, Diablo, Wing Commander, Freespace, Freelancer, X series, XCOM, Unreal, Mass Effect, Kirby, Zelda, Pokemon, all manner of JRPGs like Mana, Persona, Disgaea, Suikoden, Star Ocean, Etrian Odyssey, Dragon Quest, Tales, Grandia, Kingdom Hearts, FF... and then some.</small> Nowadays I no longer have the time for them, but some of these did wonders for my imagination - they are just as memorable and impactful as any great novel or epic.
  
On the music side of things, I hold ABRSM grade 8 certification for piano and theory of music, and I play the piano, the erhu, and the gaohu (Chinese viola and violin); I used to be the student instructor and lead player for the erhu in our Chinese Orchestra society back in high school. I had the privilege of being part of the hand-picked retinue representing the society at the invitational performance for the 27th International Society for Music Education World Conference 2006, Kuala Lumpur, held in the Petronas Philharmonic Hall. Although not quite as well-known to the world, the venue is Malaysia's equivalent of the Royal Albert Hall or Carnegie Hall, usually reserved for premier professional orchestral or musical performances. Members still fondly regard that exhilarating performance as one of the society's crowning achievements, even to this day.
  
 
===Good books===
Whenever I find myself in someone's house for the first time, I try to catch a glimpse of their library. From just a small sample of books, and how well-worn they are, it's amazing how much you can glean about their owner. You know the old adage "don't judge a book by its cover"? Well, I like to spin it as "you can tell quite a bit about a person from what they read"!

Here are some of my favourites in the open domain:
 
{| class="wikitable" style="margin: 1em; font-size: 0.9em"
 
{| class="wikitable" style="margin: 1em; font-size: 0.9em"
 
!Author
 
!Author
!style="min-width: 200px;"|Title
+
!Title
 
!Themes
 
!Themes
 
!Year
 
!Year
 
!Form
 
!Form
!Excerpt
 
 
|-
| [http://en.wikipedia.org/wiki/Isaac_Asimov Isaac Asimov]
| [http://www.physics.princeton.edu/ph115/LQ.pdf The Last Question]
| Transhumanism
| 1956
| Short story
|-
| [http://en.wikipedia.org/wiki/G._H._Hardy G. H. Hardy]
| [http://www.math.ualberta.ca/~mss/misc/A%20Mathematician's%20Apology.pdf A Mathematician's Apology]
| Mathematics
| 1940
| Essay
|-
| [http://en.wikipedia.org/wiki/Sun_Tzu Sun Tzu]
| [http://ctext.org/art-of-war 孫子兵法] ''(The Art of War)'' <small>[http://ctext.org/art-of-war Translation]</small>
| Strategy
| 513 BC
| Treatise
|-
| [http://en.wikipedia.org/wiki/Kenji_Miyazawa Kenji Miyazawa]
| [http://www.cbl.or.jp/mailmag/info/080108.html 雨にも負けず] ''(Be not Defeated by Rain)'' <small>[http://www.kenji-world.net/english/download/works/Rain.html Translation]</small>
|
| 1931
| Poem
|-
| [http://en.wikipedia.org/wiki/Lewis_Carroll Lewis Carroll]
| [https://ebooks.adelaide.edu.au/c/carroll/lewis/alice/complete.html Alice's Adventures in Wonderland]
|
| 1865
| Novel
|}
Other pleasant / inspiring reads:
* Brian Greene's ''[http://books.google.co.uk/books/about/The_Elegant_Universe_Superstrings_Hidden.html?id=okv_O0Xhl9gC&redir_esc=y The Elegant Universe]'' (physics)
* Eric Drexler's ''[http://www1.appstate.edu/dept/physics/nanotech/EnginesofCreation2_8803267.pdf Engines of Creation]'' (nanotechnology)
* [http://en.wikipedia.org/wiki/Antoine_de_Saint-Exup%C3%A9ry Antoine de Saint-Exupery]'s ''[http://en.wikipedia.org/wiki/Wind,_Sand_and_Stars Wind, Sand and Stars]'' (memoir)
* Antoine de Saint-Exupery's ''[http://en.wikipedia.org/wiki/The_Little_Prince The Little Prince]''
* [http://en.wikipedia.org/wiki/Rabindranath_tagore Rabindranath Tagore's] [http://www.gutenberg.org/ebooks/author/942 works]
  
‘Not the same thing a bit!’ said the Hatter. ‘You might just as well say that “I see what I eat” is the same thing as “I eat what I see”!’
 
  
‘You might just as well say,’ added the March Hare, ‘that “I like what I get” is the same thing as “I get what I like”!’
+
<div style="text-align: center;">
 +
<div style="width: 100%; height: 2px; background-color: rgb(30,87,153);"></div>
  
‘You might just as well say,’ added the Dormouse, who seemed to be talking in his sleep, ‘that “I breathe when I sleep” is the same thing as “I sleep when I breathe”!’</i>
 
|-
 
|
 
Brian Greene
 
|
 
[http://books.google.co.uk/books/about/The_Elegant_Universe_Superstrings_Hidden.html?id=okv_O0Xhl9gC&redir_esc=y The Elegant Universe]
 
|Physics
 
|2010
 
|Factual
 
|
 
|-
 
|
 
[http://en.wikipedia.org/wiki/Antoine_de_Saint-Exup%C3%A9ry Antoine de Saint-Exupery]
 
|
 
[http://en.wikipedia.org/wiki/Wind,_Sand_and_Stars Wind, Sand and Stars]
 
|Humanism
 
|1939
 
|Memoir
 
|
 
<i>I sat down face to face with one couple. Between the man and the woman a child had hollowed himself out a place and fallen asleep. He turned in his slumber, and in the dim lamplight I saw his face. What an adorable face! A golden fruit had been born of these two peasants. [...] I said to myself: This is a musician's face. This is the child Mozart. This is a life full of beautiful promise. Little princes in legends are not different from this. Protected, sheltered, cultivated, what could not this child become? When by mutation a new rose is born in a garden, all gardeners rejoice. They isolate the rose, tend it, foster it. But there is no gardener for men. This little Mozart will be shaped like the rest by the common stamping machine. [...] This little Mozart is condemned.
 
  
Their fate causes these people no suffering. [...] Those who carry the wound do not feel it. It is the human race and not the individual that is wounded here, is outraged here. I do not believe in pity. What torments me tonight is the gardener's point of view. What torments me is not this poverty to which after all a man can accustom himself as easily as to sloth. [...] What torments me is not the humps nor hollows nor the ugliness. It is the sight, a little bit in all these men, of Mozart murdered.
 
 
Only the Spirit, if it breathe upon the clay, can create Man.</i>
 
|-
 
|
 
Antoine de Saint-Exupery
 
|
 
Le Petit Prince ([http://en.wikipedia.org/wiki/The_Little_Prince The Little Prince])
 
|Humanism
 
|1943
 
|Novel
 
|
 
''"What makes the desert beautiful," said the little prince, "is that somewhere it hides a well..."''
 
|-
 
|
 
[http://en.wikipedia.org/wiki/Isaac_Asimov Isaac Asimov]
 
|
 
[http://www.physics.princeton.edu/ph115/LQ.pdf The Last Question]
 
| Transhumanism
 
| 1956
 
|
 
Short story
 
|
 
|-
 
|
 
[http://en.wikipedia.org/wiki/G._H._Hardy G. H. Hardy]
 
|
 
[http://www.math.ualberta.ca/~mss/misc/A%20Mathematician's%20Apology.pdf A Mathematician's Apology]
 
| Mathematics
 
| 1940
 
| Essay
 
|
 
''"It is a melancholy experience for a professional mathematician to find himself writing about mathematics. The function of a mathematician is to do something, to prove new theorems, to add to mathematics, and not to talk about what he or other mathematicians have done..."''
 
|}
 
 
 
 
<div style="width: 100%; height: 2px; background-color: rgb(30,87,153);"></div>
 
<div style="text-align: center;">
 
 
To contact me, please use the information in my [http://lqkhoo.com/wiki/index.php/File:CV-General-Li-Quan-Khoo.pdf resume].
 
To contact me, please use the information in my [http://lqkhoo.com/wiki/index.php/File:CV-General-Li-Quan-Khoo.pdf resume].
  

Revision as of 08:59, 10 October 2017


For people unfamiliar with computer science, (supervised) machine learning really is just pattern recognition. If you can reduce a problem to a pattern recognition problem, then you can apply machine learning to solve it. It is a powerful technique for finding features, trends, or patterns hidden within huge amounts of data (DNA, stock ticks, the internet), or for classifying that data into different categories (think of an algorithm that recognizes faces, road signs, or system intrusions based on anomalous behaviour patterns).
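To make that concrete, here is a minimal sketch of "pattern recognition as supervised learning" in Python, using scikit-learn's bundled handwritten-digit dataset; the dataset and the choice of a logistic-regression classifier are illustrative stand-ins, not tied to anything described on this page:

<pre>
# Minimal supervised-learning sketch: learn the "pattern" mapping
# 8x8 pixel images to digit labels, then test on held-out data.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)           # 8x8 images -> 64 features each
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=5000)       # a simple linear classifier
clf.fit(X_train, y_train)                     # fit = "recognize the pattern"
print("held-out accuracy:", clf.score(X_test, y_test))
</pre>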


Curriculum of study

I've been following a curriculum of MOOC and continuing education courses since August 2015, primarily to inform my research interests.

Detailed curriculum
(Each entry lists: field - course (institution, platform); lectures / problem sets / certificate; status; reading list; topics.)

  • Math - 18.01 Single Variable Calculus (MIT OCW). Lectures: yes; problem sets: no; cert: N/A. Completed (Aug 2015).
      Reading: course notes.
      Topics: linear and quadratic approximations, fundamental theorem of calculus, Taylor series, etc.
  • Math - 18.02 Multivariable Calculus (MIT OCW). Lectures: yes; problem sets: yes; cert: N/A. Completed (Sep 2015).
      Reading: course notes.
      Topics: partial derivatives, Lagrange multipliers, polar coordinates, change of variables, vector fields, gradient fields, Green's theorem, Stokes' theorem, divergence.
  • Math - 18.03 Differential Equations (MIT OCW). Lectures: yes; problem sets: yes; cert: N/A. Completed (Oct 2016).
      Reading: course notes; Elementary Differential Equations (Edwards & Penney).
      Topics: first-order ODEs, particular solutions, second-order linear ODEs, Fourier series, Laplace transform, pole diagrams, convolution, linear systems of ODEs, repeated real eigenvalues, matrix exponentials, fundamental matrices, limit cycles.
  • Math - 18.06 Linear Algebra (MIT OCW). Lectures: yes; problem sets: yes; cert: N/A. Completed (Apr 2016).
      Reading: Introduction to Linear Algebra (Strang).
      Topics: four fundamental subspaces, A=LU, Gaussian elimination, orthogonality, vector spaces, projections, least squares, determinants (Leibniz/Laplace expansion), eigenvectors/eigenvalues, diagonalization, symmetric matrices, (semi-)positive definiteness, quadratic form, similar matrices, SVD, Cholesky factorization, pseudoinverse.
  • Math - 6.041 Probabilistic Systems Analysis and Applied Probability (MIT OCW). Lectures: yes; problem sets: yes; cert: N/A. Completed (Mar 2016).
      Reading: Introduction to Probability (Bertsekas & Tsitsiklis).
      Topics: Bayes' rule, joint PMFs, discrete/continuous RVs, derived distributions, iterated expectations, sum of a random number of RVs, Bernoulli process, Poisson process, Markov chains, Weak Law of Large Numbers, Central Limit Theorem, Bayesian inference, classical inference.
  • Math - Analysis and probability theory (self-study). As needed.
      Topics (analysis): ordered sets, bounds, supremum/infimum, lub/glb, field axioms, ordered fields, properties of the real and complex fields, Archimedean property, density, Schwarz inequality, Euclidean k-space, finite and (un)countable sets, set equivalence, sequences, infinitary union/intersection, metric spaces, convex/open/closed/bounded/compact/perfect/connected sets, Heine-Borel, Weierstrass, the Cantor set, convergence, subsequences, Bolzano-Weierstrass, Cauchy sequences, Cauchy criterion, complete metric spaces, monotonic sequences, limit superior/inferior, series, convergence tests (comparison, condensation, root, ratio), absolute convergence, the natural constant, power series, series addition, multiplication, and rearrangement.
      Topics (probability theory): probability triples, sigma-algebras, measure, Caratheodory's Extension Theorem, Borel sets, Lebesgue [0,1] measure, random variables, independence, limit events, tail fields, zero-one laws (Borel-Cantelli, Kolmogorov).
  • EE - 6.003 Signals and Systems (MIT OCW). Lectures: yes; problem sets: partial; cert: N/A. Completed (Sep 2016).
      Reading: Signals and Systems (Oppenheim & Willsky).
      Topics: signals, samples, DT/CT systems, feedback, poles, modes, Z-transform, bilateral Laplace transform, DT approximation of CT systems, convolution, LTI systems and eigenfunctions, phase-amplitude frequency response, Bode plots, CT/DT Fourier series, Fourier transform, sampling, impulse reconstruction, aliasing, quantization.
  • Comp - CS 120x Distributed Machine Learning with Apache Spark (Berkeley, edX). Lectures: yes; problem sets: yes; cert: yes (100%). Completed (Aug 2016).
      Topics: distributed ML, Apache Spark, pySpark, SparkSQL, distributed regression, distributed PCA, computational complexity, big-n and big-d considerations.
  • Comp - Deep Learning (Oxford). Lectures: yes; problem sets: no; cert: N/A. Completed (Dec 2016).
      Reading: Machine Learning: A Probabilistic Perspective (Murphy); The Elements of Statistical Learning (Hastie, Tibshirani & Friedman); A Tutorial on Energy-Based Learning (LeCun, 2006).
      Topics: linear models, MLE estimation, entropy, regularization, the kernel trick, model complexity, MAP estimation, radial basis functions, optimization (Newton's method, SGD, Polyak averaging, momentum methods, Adagrad), logistic regression, softmax, modular backpropagation, convnets, multichannel spatial and volumetric convolution, k-max pooling, Xavier initialization, embedding in continuous spaces, transfer learning, auxiliary tasks, energy-based methods, generalized margin loss, RNNs, LSTMs, ...
  • Comp - CS231n Convnets for Visual Recognition (Stanford). Lectures: yes; problem sets: no; cert: N/A. Completed (Nov 2016).
      Topics: batch norm, parameter update rules (Nesterov, Adagrad, RMSProp, Adam, L-BFGS), residual nets, localization, visualization, DeepDream/NeuralStyle, dropout, RNNs, LSTMs, GRUs, deep learning libraries, segmentation, soft attention, spatial transformers, spatio-temporal convolutions, variational autoencoders, adversarial encoder/decoder, model distillation.
  • Econ - 14.01SC Principles of Microeconomics (MIT OCW). Lectures: yes; problem sets: no; cert: N/A. Completed (Feb 2016).
      Reading: Microeconomics (Perloff); Principles of Microeconomics (Rittenberg & Tregarthen).
      Topics: supply and demand, elasticity, consumer/producer theory, cost, competition, welfare, monopoly, oligopoly, factor markets, trade, uncertainty, interest rates, social equity.
  • Fin - Financial Markets (Yale, Coursera). Lectures: yes; problem sets: no; cert: no. Completed (Sep 2015).
      Topics: diversification, efficient markets, debt, corporate stocks, real estate, forwards and futures, options, monetary policy, investment banks, brokers and dealers, public and non-profit finance.
  • Fin - 15.401 Finance Theory I (MIT OCW). Lectures: yes; problem sets: N/A; cert: N/A. Completed (May 2016).
      Reading: Principles of Corporate Finance (Brealey & Myers); Investments (Bodie, Kane & Marcus), chapters 5-13.
      Topics: present value, fixed-income securities, term structure of interest rates, Macaulay/modified duration, equities, dividend discount model, Gordon growth model, forwards and futures, options, risk and return, Portfolio Theory, single/multi-index models, CAPM, APT, efficient markets, adaptive market hypothesis.
  • Fin - 15.437 Options and Futures Markets (MIT). Ongoing.
      Reading: Options, Futures and Other Derivatives (Hull), chapters 7-13; The Mathematics of Financial Derivatives (Wilmott), chapter 4.
      Topics: binomial pricing model (Cox-Ross-Rubinstein) for European stock derivatives, Levy/Wiener/Ito processes, informal derivation of Ito's Lemma, geometric Brownian motion, the lognormal distribution, intro to partial differential equations, heat/diffusion equation, Black-Scholes-Merton differential equation, Black-Scholes pricing of European stock options and warrants, European options on dividend-paying stocks, currencies, and futures, ...
  • Comp - 9.520 Statistical Learning Theory and Applications (MIT OCW). Lectures: yes (CBMM); problem sets: no; cert: N/A. On hold.
      Topics: Hilbert spaces, well-posed and ill-posed problems, empirical risk minimization (ERM) algorithms, ...
  • Math - 18.305 Advanced Analytic Methods in Science and Engineering (MIT OCW). Tentative.
      Topics: solution of first/second-order PDEs and boundary layer problems.
  • Math - 18.303 Linear Partial Differential Equations: Analysis and Numerics (MIT OCW). Tentative.
      Topics: numerical solution of PDEs by finite-difference and finite-element methods.
  • Math - 18.443 Statistics for Applications (MIT OCW). Tentative.
      Topics: hypothesis testing.

Research

Bounding Out-of-Sample Objects (2017)

This was my project submission for Stanford University's CS231n Convnets for Visual Recognition.

Abstract

[Figures: model architecture; visualization of results.]

ConvNets have spatial awareness built into them by design, specifically in their feature maps or activation volumes. With respect to any task such as image classification, they implicitly perform some form of localization of salient features at the same time. We want to understand how good this localization information is at annotating bounding boxes compared to ground truth, especially around out-of-sample objects which the convnet has never been trained on. This requires a model which is indifferent to the class label, so it must operate on spatial information that is as general as possible, across as many image classes as possible. Additionally, we want to investigate whether this method could be adapted to different image classification architectures.
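As a rough, hypothetical illustration of the idea (not the project's actual architecture), one can threshold an aggregated activation map and take the bounding box of its strongest connected region; the names and thresholds below are made up for the sketch:

<pre>
# Toy class-agnostic localization: collapse a feature map across channels,
# threshold it, and box the connected region with the most activation.
import numpy as np
from scipy import ndimage

def box_from_activations(fmap):
    """fmap: (C, H, W) activation volume from any convnet layer."""
    saliency = fmap.sum(axis=0)                      # collapse channels
    mask = saliency > saliency.mean() + saliency.std()
    labels, n = ndimage.label(mask)                  # connected components
    if n == 0:
        return None
    # keep the component carrying the largest total activation
    best = max(range(1, n + 1), key=lambda i: saliency[labels == i].sum())
    ys, xs = np.where(labels == best)
    return xs.min(), ys.min(), xs.max(), ys.max()    # feature-map coordinates

fmap = np.random.rand(256, 14, 14)                   # stand-in for real activations
print(box_from_activations(fmap))                    # rescale by stride for image coords
</pre>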

Predicting Personality from Twitter (2015)

  • Final year MEng project: [ PDF ]

Abstract

[Figures: ROC and precision-recall plots for the conscientiousness model; predicted vs. actual scores for agreeableness models trained by optimizing various metrics; surface plot of model performance (Pearson correlation) against training and data parameters.]

We present a 6-month-long multi-objective two-part study on the prediction of Big-Five personality scores from Twitter data, using lexicon-based machine learning methods. We investigate the usefulness of models trained with TweetNLP features, which have not been used in this domain before.

In the first part of the study, we cast personality prediction as a classification problem, and we investigate how prediction performance is affected by different methods of data normalization, such as whether we divide each feature by word count. In the second, main part of our study, we cast it as a regression problem, and we investigate the differences in performance when we use ranks of scores rather than actual scores, and how filtering only for users with over a certain tweet count affects prediction performance.

We report on the different methods used in the existing literature, provide background on the tools we used, and review the evaluation metrics commonly used in classification and regression problems, addressing potential pitfalls when calculating or comparing them. We also suggest a way to reconcile learning parameters for different models optimizing different metrics. Finally, we compare our best results with those in recent publications.

Our main findings are that term frequency-normalized features perform most consistently, that filtering for users with more than 200 tweets improves prediction performance significantly in the regression problem, and that prediction performance using ranked data is comparable to using actual values. We found that models trained with TweetNLP features have comparable or superior performance to those trained with the LIWC and MRC features commonly used in the literature, and that models trained with both perform better still. Compared against 15 recent models (3 papers, 5 personality scores), our best models are better at prediction than 11 of them.
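For readers unfamiliar with lexicon-based prediction, here is a generic sketch of this kind of pipeline: count lexicon-category hits per user, normalize by word count ("term frequency" normalization), and regress the personality score. The toy lexicon, data, and the choice of ridge regression are illustrative stand-ins, not the models from the report:

<pre>
# Generic lexicon-feature regression sketch (toy data throughout).
import numpy as np
from sklearn.linear_model import Ridge

LEXICON = {"social": {"friend", "party", "we"},
           "work":   {"deadline", "meeting", "plan"}}   # toy categories

def features(tweets):
    words = " ".join(tweets).lower().split()
    counts = np.array([sum(w in cat for w in words) for cat in LEXICON.values()],
                      dtype=float)
    return counts / max(len(words), 1)                  # divide by word count

users = [["we had a party with a friend"], ["plan the meeting deadline plan"]]
X = np.stack([features(u) for u in users])
y = np.array([4.2, 2.9])                                # e.g. conscientiousness scores
model = Ridge(alpha=1.0).fit(X, y)
print(model.predict(X))
</pre>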

IRDM project poster presentation (2015)

  • Poster: [ PDF ]

This was a 4th year Information Retrieval and Data Mining course project about automatic document-tagging methods. The idea is extremely simple - we use named entity tags (Barack Obama, United States), RAKE algorithm-extracted keywords (candidate, poll, election), and WordNet domains information (political) to give readers a precise idea of an article's topic(s). From these example tags alone, we can already be pretty confident that the target article discusses the most recent one or two US presidential elections. The intuition is that the entity tags (proper nouns) give us specificity, the RAKE keywords (general, frequent terms) give us coverage, and the WordNet domains information gives us a hierarchy for clustering similar articles together by topic.

We were awarded second place for the poster presentation.
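For flavour, here is a toy version of the three-source tagger; real named-entity recognition, RAKE, and WordNet Domains lookups are replaced with trivial stand-ins so the sketch stays self-contained:

<pre>
# Toy three-source document tagger: entities (assumed precomputed),
# frequent-keyword extraction (stand-in for RAKE), and a tiny
# domain lexicon (stand-in for WordNet Domains).
from collections import Counter

STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}
DOMAINS = {"election": "politics", "poll": "politics", "candidate": "politics"}

def tag(article, entities):
    words = [w.strip(".,").lower() for w in article.split()]
    keywords = [w for w, _ in Counter(w for w in words
                if w not in STOPWORDS and w.isalpha()).most_common(3)]
    domains = sorted({DOMAINS[w] for w in words if w in DOMAINS})
    return {"entities": entities, "keywords": keywords, "domains": domains}

print(tag("The candidate led the poll before the election in the United States.",
          entities=["Barack Obama", "United States"]))
</pre>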

SmartFence (2014)

(This was my Microsoft Research internship project - see Microsoft Research Cambridge under Internships below.)

Task Identification using Search Engine Query Logs (2014)

  • Report [ PDF ]
  • Lit review [ PDF ]
[Figure: root node of the class tree, visualized using D3.js. Only selected subclass nodes (blue / red) are displayed; orange nodes are the entities most often searched for in a class, and green nodes are the strings most frequently searched for in conjunction with an entity of that class. The search logs were from 1997, so the results reflect the zeitgeist at the end of the 20th century.]

This was my undergraduate university-based research project. The goal is to find out what users are most interested in when they search for a certain class of things. For example, if you search for Hawaii, Google's related searches give you results along the lines of "Hotels in Hawaii" or "Flights to Hawaii". Based on our understanding of how Google's system works, these are the most common strings appearing alongside Hawaii in searches.

We wanted to go one step further. By using a knowledge base like YAGO and the set of AOL logs leaked in 1997, we do the same, but with semantics. The idea is to lump searches like the ones above together with other searches of the same class - for example, with "Colleges in Hong Kong" or "Banks in London". These searches are similar because they are all queries related to a place, and by aggregating them, we can find out what users most want to know about when they query for places.

First, we build up a class tree using the YAGO ontology. An example of a branch would be: Root --> Person --> Artist --> Musician --> Singer. Then, we add entities to each of these classes based on RDF triples expressing membership, like ElvisPresley hasClass Singer. When we find a search string referencing the entity ElvisPresley, we remove the entity from the string and map the remainder to the class the entity belongs to, as well as to all the ancestors of that class. For example, the search string Biography of Elvis Presley would map the string Biography of to the classes Root, Person, Artist, Musician, and Singer. Hence, by querying nodes of this tree, we can find out the most popular tasks users want to perform when searching for any class of things. I thought this was an extremely interesting problem to tackle.
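A minimal sketch of that aggregation step, with a toy ontology standing in for YAGO:

<pre>
# Credit the remainder of a query to a recognized entity's class
# and to all of that class's ancestors. The tiny ontology is illustrative.
from collections import Counter, defaultdict

PARENT = {"Singer": "Musician", "Musician": "Artist",
          "Artist": "Person", "Person": "Root"}
ENTITY_CLASS = {"elvis presley": "Singer"}

task_counts = defaultdict(Counter)          # class -> task-string frequencies

def ancestors(cls):
    while cls:
        yield cls
        cls = PARENT.get(cls)

def index_query(query):
    q = query.lower()
    for entity, cls in ENTITY_CLASS.items():
        if entity in q:
            task = q.replace(entity, "").strip()   # e.g. "biography of"
            for c in ancestors(cls):
                task_counts[c][task] += 1

index_query("Biography of Elvis Presley")
print(task_counts["Person"].most_common(1))        # [('biography of', 1)]
</pre>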

AOL logs come from real data, and real data is messy. The logs were full of typographical errors, machine-issued queries, foreign languages, keywords unsafe for minors, and so on. To give an idea of what we did to get useful data out of them: first, we ran all the searches through a simple regex sieve to filter out nonsensical queries (only symbols, URLs) and machine-issued queries, which tend to be very long. Then we grouped related searches together - this is called search session segmentation (we used Levenshtein distance, tri-grams, and a cutoff of T=26 minutes). Then we ran the Porter stemmer, which reduces English words to their root forms, so we could match them against the entities in the YAGO knowledge base. We also had to think about greedy matching: consider the string New York Times - we want to match the entire string rather than New York and times separately.
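Here is a small sketch of that segmentation step: a new session starts when more than T=26 minutes pass, or when the new query is dissimilar by both edit distance and shared tri-grams. The exact thresholds below are illustrative, not the ones we used:

<pre>
# Time-plus-similarity search session segmentation (toy thresholds).
def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[-1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def segment(queries, t_cutoff=26 * 60):
    sessions, last_t, last_q = [], None, None
    for t, q in queries:                       # (unix_time, query_string) pairs
        same_time = last_t is not None and t - last_t <= t_cutoff
        similar = last_q is not None and (
            levenshtein(q, last_q) <= max(len(q), len(last_q)) // 2
            or trigrams(q) & trigrams(last_q))
        if same_time and similar:
            sessions[-1].append(q)             # continue the current session
        else:
            sessions.append([q])               # start a new session
        last_t, last_q = t, q
    return sessions

print(segment([(0, "hawaii hotels"), (60, "hotels in hawaii"),
               (9000, "java tutorial")]))
</pre>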

We also ran into problems of ambiguity (word-sense disambiguation, in NLP terms) - for example, Java may mean the programming language, the place in Indonesia, or a dozen other things. We disambiguate by comparing the number of similar classes the terms in a session belong to. For example, if a search session contains both Scala and Java, we can be fairly sure that Java means the programming language. The problem is that when only one entity is matched in a session, we cannot disambiguate because there is no other entity to compare it against, so we ended up discarding many such sessions. As a result, the remaining data only managed to populate the tree up to around the third layer. That was as far as we got in three months, but after I delivered the presentation, the project was awarded best undergraduate research group project of the year.
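Going back to the disambiguation heuristic, it can be sketched like this, with toy senses and classes standing in for our actual YAGO lookups:

<pre>
# Pick the candidate sense whose classes overlap most with the classes
# of the other entities in the same session. Toy data only.
SENSES = {
    "java":  [{"ProgrammingLanguage"}, {"Island", "Place"}],
    "scala": [{"ProgrammingLanguage"}, {"OperaHouse", "Building"}],
}

def disambiguate(term, session_terms):
    others = [s for t in session_terms if t != term for s in SENSES.get(t, [])]
    return max(SENSES[term],
               key=lambda sense: sum(len(sense & o) for o in others))

print(disambiguate("java", ["java", "scala"]))   # {'ProgrammingLanguage'}
</pre>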

Full-time positions

R&D Scientist at Digital:MR

Digital:MR is a market research company, and I joined them for a short-term research project, which was funded by InnovateUK, the UK's technology strategy board. Chris, who I collaborated with on my final year project, introduced me to the company and the project, and he had already been with the company for several years.

The project was a feasibility study on identifying (context-free) sentiment in images. The project proposal, timeline, and grants had been handled by a previous employee who had since left the company; I took over the rest of the project, which ran for about 3 months, including the R&D, documentation, and writing the monthly and final progress reports for the project supervisor from the government. While I had limited discussions with other employees involved in R&D - Digital:MR is a small company (<10 people) - it was made clear from the outset that this was mostly going to be a solo research project in which I was responsible for the methodology, implementation, and results.

While a non-disclosure agreement prevents me from talking in-depth about the results and methodology, I used the Yahoo! Flickr Creative Commons 100 Million Images (Yfcc100m) dataset and its pre-extracted feature descriptors from YLI, and essentially applied the most straightforward methods known to me, as the project needed to finish on time.

What I found most interesting about the project was that the difficulties lay in unexpected areas. For example, because the field is new, there are inconsistencies in how to interpret the problem. Without going into detail: based on how images are tagged on Flickr, what we have is a multi-label classification problem. Some researchers (including the authors of an AAAI 2015 paper) had simplified the problem to single-label classification when using Flickr's search API to compile their dataset, which I found unjustifiable given the label distributions in Yfcc100m. During the process, I also found and documented a behavioural anomaly in how the widely-used Natural Language Toolkit (NLTK) interacts with the WordNet 3.0 and SentiWordNet corpora. On top of that, I had to tailor some high-level decisions on the fly to address various potential applications relevant to the company's clients, which were not objectives in the original research proposal. Along with reading the literature on image representation and machine learning on images, with which I was relatively unfamiliar, it all added up to a very interesting learning experience indeed.

After the project ended, although my agreement with the company was full-time and I was welcome to take over one of several research projects - including drafting a new grant proposal based on the results of my research - I had by then decided to pursue my own interests, and we parted ways amicably.

Internships

Microsoft Research Cambridge

[Photos: MSR Cambridge Bright Minds interns 2014 - I'm the one in the light blue shirt; the MSR Cambridge Machine Learning and Perception group.]

I joined Microsoft for 8 weeks as a Research Intern through the Bright Minds Internship Competition programme for undergraduates. I was supervised by Pushmeet Kohli and Yoram Bachrach, with occasional help from Ulrich Paquet and Filip Radlinski.

Since the allocated time was short, I was told about the problem and given a direction to start off in - it was about parental / access control of the internet. There were several problems in this area which we hoped to address:

  1. Traditional access control relies on black and whitelists of URLs. For instance, OpenDNS offers a service where they simply refuse to resolve blacklisted domains. However, domain ownership changes constantly and the lists have to be updated to reflect this.
  2. A site is either blocked or not, based on a fixed criterion, and there is no way to customize the partitioning. For example, if a company wants to block Facebook at work, it couldn't use a service which has Facebook in its whitelist. You could customize the list, but if you generalize the requirement to "I want to block all social media sites", that's not possible if no such list exists.
  3. Users can only recognize a tiny fraction of sites in either list.

We wanted to try a different method. The project was called SmartFence - users block or allow the sites they know about to train the system, and the system determines the suitability of the rest of the sites automatically. First, we make the assumption that users tend to visit websites which serve the same information need within a session. Hence, websites belonging to the same search session in Bing are taken to be correlated, and we transform this correlation data into 30-dimensional feature vectors. We then construct a giant similarity matrix of websites based on these vectors. For our investigation, we constructed it in-memory for the top ~10k sites (by frequency of visitation) rather than for everything. When users block or allow the sites they recognize, we label these -1 and 1 respectively (this becomes our training set). For any site which a user does not recognize, its score is calculated based on its similarity to all the labelled sites (this is the hypothesis output). We used a matrix primarily because we need to be able to recompute the partitioning quickly whenever the user changes the labelled set of websites, or the weighting decay function and cutoff.

We initially performed k-means clustering to investigate what the clusters would look like for different values of k - we wanted to see if these partitions made sense. Then we had a brief foray into reducing the vectors to two-dimensional space (bearing in mind the information loss) so the user could draw out areas on the screen which they want to block or allow. I also briefly considered partitioning by Voronoi cells while working in 2D (computing convex hulls rapidly becomes infeasible in higher dimensions). We finally abandoned hard cluster boundaries altogether and settled on a kernel method. This meant that initially every site is related to every other (a fully connected graph); we cull the arcs by setting a minimum similarity threshold, and we control the partitioning by setting the score needed for a site to be blocked.
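A minimal sketch of that final kernel-style scoring, with random vectors standing in for the real 30-dimensional site embeddings and made-up thresholds:

<pre>
# Each unlabelled site's score is a similarity-weighted vote over the
# sites the user labelled +1 (allow) or -1 (block); weak arcs are culled.
import numpy as np

rng = np.random.default_rng(0)
vecs = rng.normal(size=(1000, 30))                 # toy 30-d site embeddings
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
sim = vecs @ vecs.T                                # cosine similarity matrix

labels = {3: 1.0, 17: -1.0, 42: -1.0}              # user-labelled sites

def score(site, min_sim=0.1):
    s = 0.0
    for j, y in labels.items():
        w = sim[site, j]
        if w >= min_sim:                           # cull arcs below the threshold
            s += w * y
    return s                                       # block if below a chosen cutoff

print(score(7))
</pre>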

I delivered a working prototype in 4 weeks for the internal company hackathon, and a refined version with a web front-end (knockout + D3) at the end of the internship. I didn't have any exposure to machine learning before the internship, but thankfully I managed to pick up the basics on the fly. The period before the hackathon was particularly intense, because we had to get a basic algorithm and interface going in just four weeks. My previous experience in web development meant that I could rapidly iterate through multiple versions of the GUI to arrive at something that conveyed intuitively what our algorithm was trying to do. Filip took particular interest in the GUI implementation - he said it was more impressive than what one would usually see at a conference.

This was as good an internship as I could have ever hoped for. I got to see what it is like to work in industrial research, I began to understand how powerful machine learning is and what inventive things other researchers are doing with it, and most of all, I got the chance to work with some really fantastic people. I was trying very hard not to be the stupidest guy in the room - I was an undergraduate amongst PhDs, postdocs, and people with decades of industry experience - but the guidance and support I received was second to none. It was awesome!

UniEntry

[Screenshots: course browse page for UniEntry; course statistics (satisfaction, demographics, outcomes, etc.); course entry requirements.]

UniEntry is a startup company which I worked for in summer 2013. The niche problem it tries to solve is that in the UK, it is very difficult to make informed decisions when selecting universities. Rankings are not the only thing that matters - the atmosphere of the institution, the faculty, the focus and quality of research at individual institutions, and so on; much of this information is not available in the prospectuses. I remember the experience of selecting a UK university when I was still doing my A levels - it felt like a shot in the dark. I was fortunate enough to attend a college with excellent admissions tutors, so I had the benefit of counselling and advice on relatively obscure facts (for example, Gonville and Caius was thought to be the best Cambridge college for medical studies; Cambridge students belong to individual colleges, and while they all attend common lectures, supervisions are arranged on a college-by-college basis). Not every student is fortunate enough to have access to this kind of coaching, and even though I did, I felt the choice was still not as informed as I would have liked it to be.

Enter UniEntry. It was founded as a part-time venture by two individuals, who hired me and another developer to build a pilot site over the summer. UniEntry pulls information from the Higher Education Statistics Agency to bypass potential bias in university prospectuses and make the application process as transparent as possible.

While guest users can browse the courses, much of the platform is actually inaccessible to them. Registered teachers can monitor their students' applications and university choices, students of participating schools can register their grades and the platform will inform them about whether their choices are a good fit for their capabilities, and university undergraduates can register as coaches to guide and inform students and teachers.

The pilot site was meant to serve as a proof of concept, to get schools involved and to look for funding. We used the agile development model, so we had daily stand-ups, sprint planning, progress burndown charting, the works. The other developer (Wayne) and I had just been through the application process ourselves, so the two of us had great influence on what features a site like this should have to be useful. The site was developed in ASP.NET, and I was the primary front-end developer and designer. This was my first time working in a small start-up company - each of us had multiple roles: planning, front-end, back-end, database, office boy, manual work, etc. We had lunch together to discuss concepts or the project, and we helped each other out on a regular basis. We did pair programming, and we audited each other's code. It was a really close-knit team. Wayne originally came from an electrical engineering background, so I brought him up to speed on C#, polymorphism, generics, common design patterns, and MVC in general. I love teaching and explaining things, so I enjoyed it and learned from it as much as he did.

After the initial three months, I maintained contact with the company and submitted bugfixes when requested. Right now there's not much progress being made because the CEO (Will) is busy with his own research projects. However, I think it is a project with a worthy cause - even in its current state as a pilot site, it's already useful even to guest users. I think I would have benefited from it when I was applying to universities myself, so I certainly hope it gets the funding it needs to be completed at some point.

P.S. At the end of the internship, Will brought us all for a helicopter ride around London as a surprise treat for a job well done!

Other

My other experiences are related to my brief stint in medical school rather than computer science, but I consider them to be valuable and one-of-a-kind.

  • Work shadowing at the Van Andel Institute, Michigan, USA. I generally observed the environment and working atmosphere within a biomedical research lab. I learned about what the researchers do on a regular basis - things like automatic sequencing and running DNA microarrays. I learned how they make knockout mice (mice with certain genes deactivated) to study the genes' effects, and how they highlight sections of DNA by binding highly specific fluorescent molecules to them, using techniques (with really fancy names) like spectral karyotyping and fluorescent in-situ hybridization. Looking back at my experiences in the lab always reminds me of how different the nature of the work in the various fields of science can be.
  • In Malaysia, I had a work placement in a hospital's critical care unit and department of anaesthesia. I remember how meticulous everything was - the cleanliness precautions we had to take, the nurses charting the patient's statistics every few hours, etc. Later on, I was attached with the Department of Public Health of Penang, and we went off checking the safety of water supplies and fogging areas with reported cases of dengue fever.

Selected projects

SynthJS

[Logo: originally designed in CSS3, converted to PNG for certain browsers' sakes.]

SynthJS is a music sequencer built with web technologies that runs entirely on the client machine. It is a project I came up with when I had about two weeks of free time during the Christmas break back in 2013. I felt that my JavaScript was getting a bit rusty and I wanted to explore something related to HTML5, so after looking at emerging HTML5 technologies for a while, I settled on a project exploring music technology for the web.

Progress on the project is now frozen, partly because the scope turned out to be too big for two weeks (I couldn't gauge how much work it was going to be, as the technology was new to me), and partly because of the poor scaling of a DOM-based UI. As it stands, it has a reasonable suite of musical instruments, speed control, equalizers for individual instruments, an undo/redo stack (which was non-trivial to implement for a program like this), file import/export, and most importantly, you can sequence simple music with it!
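The undo/redo stack follows the usual command pattern - two stacks, with any fresh edit clearing the redo stack. Sketched here in Python for brevity (SynthJS itself is JavaScript), with a made-up AddNote command:

<pre>
# Command-pattern undo/redo sketch: each edit knows how to do and undo itself.
class History:
    def __init__(self):
        self.undo_stack, self.redo_stack = [], []

    def do(self, cmd):
        cmd.do()
        self.undo_stack.append(cmd)
        self.redo_stack.clear()              # a new edit invalidates redo

    def undo(self):
        if self.undo_stack:
            cmd = self.undo_stack.pop()
            cmd.undo()
            self.redo_stack.append(cmd)

    def redo(self):
        if self.redo_stack:
            cmd = self.redo_stack.pop()
            cmd.do()
            self.undo_stack.append(cmd)

class AddNote:
    def __init__(self, track, note): self.track, self.note = track, note
    def do(self):   self.track.append(self.note)
    def undo(self): self.track.remove(self.note)

track, h = [], History()
h.do(AddNote(track, "C4")); h.undo(); h.redo()
print(track)                                 # ['C4']
</pre>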

RoboHome

[Screenshots: control interface; configuration interface.]

This was a university systems engineering project. The objective was to build a platform- and protocol-agnostic control system for home devices which is accessible and scriptable remotely via a web interface. The control system itself had to be low-cost, so we used a Raspberry Pi. Each RPi runs device drivers for three platforms - Arduino, Belkin, and Microsoft Gadgeteer - plus a full web stack, so we can connect to it over WiFi from a laptop to control it. We also had a separate server instance hosted on Azure that authenticates users and connects them to a remote RPi.

Robot race

[Photos: the standard racing robot; the first track. The second track was a figure-of-eight - some teams' robots actually went backwards on that one!]

A first year university project. We programmed a standard robot in C to navigate a track with obstacles as quickly as possible. The robot has a frontal ultrasound sensor, two IR sensors at the frontal corners, two frontal bump buttons, and a light sensor at the frontal bottom edge that stops it at the finishing line. The wheels were attached to stepper motors which drive them forward at a set number of ticks per second. The original idea was to map the course out and find the shortest path to the end, but because the robot has no point of reference relative to the track other than the ticks of both wheels, the tiniest bump (and momentum) throws everything off - so nobody actually managed to perform the mapping!

The most straightforward solution in the end was simply a wall follower, and because we were allowed to change the code on racing day, the most successful teams were the ones which modified their algorithm last-minute to target the specific course being run (like maximum speed for 3 seconds after the 4th turn). That was a fun event, even if it didn't turn out like I expected!
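A toy wall-follower control step, sketched in Python rather than the robot's actual C, with made-up sensor readings and constants:

<pre>
# One step of a right-hand wall follower: pivot when a wall is ahead,
# otherwise steer proportionally to the distance error from the wall.
def wall_follow_step(left_ir, right_ir, front_ultrasound,
                     target=10.0, gain=0.5, base_ticks=20):
    """Return (left_ticks, right_ticks) for one control step,
    hugging the right-hand wall at `target` cm."""
    if front_ultrasound < 15.0:          # wall ahead: pivot left in place
        return -base_ticks, base_ticks
    error = right_ir - target            # > 0 means drifting away from the wall
    steer = int(gain * error)
    return base_ticks + steer, base_ticks - steer

print(wall_follow_step(left_ir=30.0, right_ir=14.0, front_ultrasound=80.0))
</pre>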

Android app for field study

A first year university project. We programmed an Android app for the Restless Beings charity as an exercise in requirements gathering from the client. It was designed for quick and easy use by field workers to log health data on homeless children. The app had to work both online and offline, because the location of deployment is in rural Africa and only the centers of the largest towns have any sort of internet connection. These workers often work with the same children, so using the phones' GPS, they track the movement of each individual over time to generate statistics like distance to the nearest water source.

Open source bugfixes

Thankfully, I haven't run into that many bugs in open source software, but I try to do my part when one of them finds its way to me.

  • Issue: WordNet satellite adjectives lookup failure in SentiWordNet
      Codebase: NLTK · Status: closed · Date raised: July 2015
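For context: WordNet tags "satellite" adjectives with POS 's' rather than 'a', which is what the SentiWordNet lookup tripped over. A quick way to inspect both via NLTK (assuming the wordnet and sentiwordnet corpora are downloaded; with the fix merged, the 's' entries resolve too):

<pre>
# Print SentiWordNet scores for a word's adjective senses, showing
# both head adjectives (pos 'a') and satellite adjectives (pos 's').
from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

for syn in wn.synsets("slow"):
    if syn.pos() in ("a", "s"):
        senti = swn.senti_synset(syn.name())
        print(syn.name(), syn.pos(), senti.pos_score(), senti.neg_score())
</pre>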

Personal interests

Spoken languages and other interests

[Photo: with friends and staff at Goryokaku, Hakodate, Japan. Goryokaku is a five-pointed star fort inspired by Monsieur Vauban's creations in Europe; star forts are highly geometric fortifications of the gunpowder era, designed to deflect cannonballs and redirect assailants into crossfire. The building in the background used to be the seat of the government, located in the middle of the fort.]

I've been studying Japanese on my own in my free time for a while now. The best place to learn a language is in its country of origin, so I went off to Hokkaido for a month-long home stay and an intensive course in a language school in September 2014. That was the best trip I've experienced yet.

I can speak, listen, read and write in four languages:

  • English
  • Mandarin (and certain dialects)
  • Malay
  • Japanese (~upper JLPT N3)

In my more carefree years, I used to play games quite a bit - BG, NWN, Torment, Panzer Dragoon, C&C, Starcraft, Warcraft, Total War, SupCom, Galciv, SotS, Civ, DotA, Diablo, Wing Commander, Freespace, Freelancer, X series, XCOM, Unreal, Mass Effect, Kirby, Zelda, Pokemon, all manner of JRPGs like Mana, Persona, Disgaea, Suikoden, Star Ocean, Etrian Odyssey, Dragon Quest, Tales, Grandia, Kingdom Hearts, FF... and then some. Nowadays I no longer have the time for them, but some of these did wonders for my imagination - they are just as memorable and impactful as any great novel or epic.

On the music side of things, I hold ABRSM grade 8 certification in piano and theory of music, and I play the piano as well as the erhu and gaohu (roughly the Chinese viola and violin), having been a student instructor and lead erhu player in our Chinese Orchestra society back in high school. I had the privilege of being part of the hand-picked retinue representing the society at the invitational performance for the 27th International Society for Music Education World Conference 2006 in Kuala Lumpur, held in the Petronas Philharmonic Hall. Although not quite as well-known to the world, the venue is Malaysia's equivalent of the Royal Albert Hall or Carnegie Hall, usually reserved for premier professional orchestral and musical performances. Members still fondly regard that exhilarating performance as one of the society's crowning achievements, even to this day.

Good books

Here are some of my favourites in the open domain:

(Author - Title - Themes - Year - Form)

  • Isaac Asimov - The Last Question - Transhumanism - 1956 - Short story
  • G. H. Hardy - A Mathematician's Apology - Mathematics - 1940 - Essay
  • Sun Tzu - 孫子兵法 (The Art of War; translation available) - Strategy - 513 BC - Treatise
  • Kenji Miyazawa - 雨にも負けず (Be not Defeated by Rain; translation available) - 1931 - Poem
  • Lewis Carroll - Alice's Adventures in Wonderland - 1865 - Novel

Other pleasant / inspiring reads:

  • Brian Greene's The Elegant Universe (physics)
  • Eric Drexler's Engines of Creation (nanotechnology)
  • Antoine de Saint-Exupery's Wind, Sand and Stars (memoir)
  • Antoine de Saint-Exupery's The Little Prince
  • Rabindranath Tagore's works

To contact me, please use the information in my resume.

~ Thanks for visiting! ~