I recently searched the big G for “Hacker News” and “Machine Learning” to see which posts had attracted the most search attention. I thought it might be the recent announcement of TensorFlow from the aforementioned Google, or the even more recent announcement of the Distributed Machine Learning Toolkit (DMTK) from Microsoft. These two multinational corporations are not the only tech giants to have entered this arena. Amazon Machine Learning has been in this space since April, although Amazon has taken its traditional SaaS route: technically speaking it provides machine learning services, but it doesn’t have an open-source toolkit offering à la Google and Microsoft. Rather, TensorFlow and DMTK follow on the heels of the community offerings Torch and Theano.
Sometimes it’s hard to spot a trend that’s right under your nose. It will be interesting to see the worlds of humanities computing and machine learning collide.
Anyway, no single post caught my eye. What I did notice is that several companies have written about the classification of Hacker News (HN) posts. The three articles I noticed were this one about news categorizing by MonkeyLearn, this one about algorithmic tagging by Algorithmia, and this one about autotagging by Dato. There appear to be supervised and unsupervised versions of these algorithms. The supervised version is trained on labeled examples and matches posts against a predefined list of categories, whereas the unsupervised version needs no training data. Dato calls the unsupervised approach autotagging and the approach with a training dataset simply classification. Being new to the machine learning camp, I couldn’t say whether these terms are standard. All three articles are informative, and interesting for their different takes on things.
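To make the supervised/unsupervised distinction concrete for myself, here is a minimal sketch in Python. It is not how MonkeyLearn, Algorithmia, or Dato actually do it. The function names, the toy titles, the category labels, and the crude word-overlap scoring are all my own assumptions, chosen only to show that one approach needs labeled examples and the other does not.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption, not a standard one).
STOPWORDS = {"a", "an", "the", "of", "for", "in", "to", "and", "with", "on", "is"}


def tokens(text):
    """Lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def train(labeled_titles):
    """Supervised step: build one word-count profile per predefined category."""
    profiles = {}
    for title, category in labeled_titles:
        profiles.setdefault(category, Counter()).update(tokens(title))
    return profiles


def classify(profiles, title):
    """Pick the category whose profile overlaps most with the title's words."""
    words = tokens(title)
    return max(profiles, key=lambda c: sum(profiles[c][w] for w in words))


def autotag(title, k=2):
    """Unsupervised step: no training data; tags are the most frequent terms."""
    return [w for w, _ in Counter(tokens(title)).most_common(k)]


# Made-up training data: (title, category) pairs.
training = [
    ("Training neural networks with TensorFlow", "machine-learning"),
    ("A gentle introduction to neural networks", "machine-learning"),
    ("Type inference in TypeScript", "programming-languages"),
    ("Why Rust has no garbage collector", "programming-languages"),
]

profiles = train(training)
label = classify(profiles, "Neural networks in the browser")
tags = autotag("Topic modeling for Hacker News: topic inference at scale")
print(label, tags)
```

The supervised half can only ever answer with a category it was shown during training, while the unsupervised half invents its tags from the text in front of it. A real system would presumably use something far better than raw word counts on both sides.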
A more descriptive term than classification (which seems overly general) is topic analysis or topic modeling, and this is the term I have been using in my collaboration with the originators of Saffron (Bordea, 2014). Relatedly, I was watching the introductory video for TypeScript by Anders Hejlsberg today and was struck by how well the notion of type inference applies to topic analysis. I think we should call all these classification methods topic inference, and when those topics are related one to the other, we have topic modeling, or ontology inference of one stripe or another.
I honestly couldn’t say what I’m trying to get at with this short blog post. It merely amused me that a number of machine learning shops had hit upon the same task to demonstrate their tools and wares.