Tuesday, March 26, 2013

Reading Assignment: A Few Useful Things to Know About Machine Learning

Reference Information
Title: A Few Useful Things to Know About Machine Learning
Author: Pedro Domingos
Citation: Pedro Domingos, "A Few Useful Things to Know About Machine Learning," Communications of the ACM, vol. 55, no. 10, pp. 78-87, 2012.

Summary
This paper provided a broad overview of machine learning. Machine learning uses learning algorithms and large amounts of training data to generalize beyond the examples seen, so that new, previously unseen data can be classified. The paper focused on classification algorithms, along with some key lessons from the field about building classifiers. Every learner combines three components: a representation for the classifier, an evaluation (objective) function, and an optimization procedure for searching among candidate classifiers. It should be noted that training and test data should always be kept separate; otherwise evaluation statistics come out misleadingly optimistic. When data is scarce, this can be addressed by a few different techniques, including cross-validation, which lets every example serve for both training and testing without ever mixing the two in a single run.
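To make the cross-validation idea concrete, here is a minimal sketch (my own illustration, not code from the paper) of how k-fold splitting keeps training and test examples separate: each fold is held out as the test set exactly once while the remaining folds are used for training.

```python
# Minimal k-fold cross-validation split (illustrative sketch):
# partition the example indices into k folds, and hold each fold
# out as the test set once while training on the rest, so no
# example is ever used for training and evaluation in the same round.

def k_fold_splits(n_examples, k):
    """Yield (train_indices, test_indices) pairs for k folds."""
    indices = list(range(n_examples))
    fold_size = n_examples // k
    for i in range(k):
        start, stop = i * fold_size, (i + 1) * fold_size
        test = indices[start:stop]            # the held-out fold
        train = indices[:start] + indices[stop:]  # everything else
        yield train, test

splits = list(k_fold_splits(10, 5))
# In every round, the test fold is disjoint from the training set.
assert all(not set(tr) & set(te) for tr, te in splits)
```

A real pipeline would shuffle the indices first and average the classifier's score across the k rounds; libraries such as scikit-learn provide hardened versions of this splitting logic.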

Machine learning is fundamentally about generalizing from the data that is given. A learner relies not only on training data but also on extra assumptions or knowledge about the domain; the "no free lunch" theorems show that no learner can beat random guessing without such assumptions. Obstacles to generalization include overfitting, underfitting, multiple testing, and the curse of dimensionality (generalizing becomes more difficult as more features are added to the input). The paper also cautioned that theoretical guarantees are not what they seem: bounds on the number of examples needed are usually too loose to be useful in practice, and even infinite data does not guarantee a correct classifier. The importance of choosing the correct features was reiterated: while relevant, independent features are the most useful, it is difficult to know which features these are from the raw input data alone, since that data tends to be observational rather than experimental. In addition, gathering large amounts of training data usually matters more than having a more complex learning algorithm, but scalability must be taken into account when using lots of data. As for choosing the "best" learning algorithm, it really depends on the particular domain and application for which it is being used.

Finally, combining learning algorithms was discussed: model ensembles run the data through multiple classifiers and combine their outputs to produce a more accurate learner than any single one. Common combination techniques include boosting, bagging, and stacking.
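As a toy illustration of why ensembles help (my own hypothetical example, not from the paper), here is a majority-vote combiner over three weak threshold rules: where the individual rules make different mistakes, the vote recovers the correct label.

```python
# Toy majority-vote ensemble (illustrative sketch): three weak
# threshold classifiers on hypothetical 1-D data. Two of the rules
# each misclassify one example, but the majority vote corrects
# every mistake, since the errors fall on different examples.

def majority_vote(classifiers, x):
    """Return the most common prediction among the classifiers."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)

# Hypothetical data: true label is 1 iff x >= 5.
data = [(1, 0), (4, 0), (6, 1), (9, 1)]

weak = [lambda x: int(x >= 2),   # threshold too low: wrong on x = 4
        lambda x: int(x >= 5),   # correct on this data
        lambda x: int(x >= 7)]   # threshold too high: wrong on x = 6

ensemble_acc = sum(majority_vote(weak, x) == y for x, y in data) / len(data)
```

Bagging and boosting are more sophisticated versions of the same idea: they train the member classifiers on resampled or reweighted data so that their errors decorrelate, which is exactly what makes the vote effective.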

Thoughts
I have previously taken a course in machine learning, so I already knew most of the information presented in this article. However, it was a good refresher for what I did know, and it covered some very good lessons and myths in machine learning that I had not previously encountered.

For example, it was very interesting to learn about the details of common machine learning pitfalls, such as overfitting, multiple testing, and the curse of dimensionality, and it was very useful that possible (and best) solutions were presented for each of these problems. Overall, the article was written in a very understandable style that made for an enjoyable and informative read.
