Archive for Machine Learning

Installing CUDA and Theano on Ubuntu 11.04 Natty Narwhal

Theano is a Python library, developed mainly for deep learning, that can run calculations on CUDA-capable NVIDIA GPUs. Setting up Theano to use the GPU can be a little tricky. However, Aaron Haviland has set up a CUDA 4.0 PPA, which makes the installation much simpler.

Install Theano
sudo apt-get install python-numpy libblas-dev liblapack-dev gfortran python-dev python-pip git
sudo pip install --upgrade git+git://github.com/Theano/Theano.git

This will put Theano in /usr/local/lib/python2.7/dist-packages/theano

Install CUDA (requires downgrading gcc to 4.4)
sudo add-apt-repository ppa:aaron-haviland/cuda-4.0
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-cuda-toolkit g++-4.4 gcc-4.4
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.5 40 --slave /usr/bin/g++ g++ /usr/bin/g++-4.5
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.4 60 --slave /usr/bin/g++ g++ /usr/bin/g++-4.4
sudo update-alternatives --config gcc

Test it out

Now run the sample program under “Putting it all Together” in the Theano tutorial. It will hopefully tell you that it used your GPU.

A good benchmark to test out the speed of your setup is to run /usr/local/lib/python2.7/dist-packages/theano/misc/check_blas.py

Credits

Thanks to James Bergstra for the Theano fix needed to make it work with the PPA, and to the rest of the Theano developers for providing this very cool library. Thanks also to Andrew Ng, Samy Bengio, and the other Googlers who have been taking their time to teach the rest of us more machine learning concepts.

Latent Dirichlet Allocation with Mallet

We recently had a PhD candidate from UCI come in and speak to the AI club at Google Irvine about her research on Latent Dirichlet Allocation (LDA). LDA is a topic model: it groups words into topics and treats each article as a mixture of those topics. I was interested to play around with this a bit, so I downloaded Mallet and wrote up some quick code to try making my own LDA model.

package com.benmccann.topicmodel;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.TokenSequenceRemoveStopwords;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;
import cc.mallet.types.InstanceList;

import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;

public class Lda {

  @Inject private com.benmccann.topicmodel.TextProvider textProvider;

  InstanceList createInstanceList(List<String> texts) throws IOException {
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());
    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
  }

  private ParallelTopicModel createNewModel() throws IOException {
    List<String> texts = textProvider.getTexts();
    InstanceList instanceList = createInstanceList(texts);
    // Rough guess: one topic per five documents, but never fewer than one.
    int numTopics = Math.max(1, instanceList.size() / 5);
    ParallelTopicModel model = new ParallelTopicModel(numTopics);
    model.addInstances(instanceList);
    model.estimate();
    return model;
  }

  ParallelTopicModel getOrCreateModel() throws Exception {
    return getOrCreateModel("model");
  }

  private ParallelTopicModel getOrCreateModel(String directoryPath)
      throws Exception {
    File directory = new File(directoryPath);
    if (!directory.exists()) {
      directory.mkdir();
    }
    File file = new File(directory, "mallet-lda.model");
    ParallelTopicModel model = null;
    if (!file.exists()) {
      model = createNewModel();
      model.write(file);
    } else {
      model = ParallelTopicModel.read(file);
    }
    return model;
  }

  public void printTopics() throws Exception {
    ParallelTopicModel model = getOrCreateModel();
    Alphabet alphabet = model.getAlphabet();
    for (TreeSet<IDSorter> set : model.getSortedWords()) {
      System.out.print("TOPIC: ");
      for (IDSorter s : set) {
        System.out.print(alphabet.lookupObject(s.getID()) + ", ");
      }
      System.out.println();
    }
  }

  public static void main(String[] args) throws Exception {
    Injector injector = Guice.createInjector();
    Lda lda = injector.getInstance(Lda.class);
    lda.printTopics();
  }

}

One of the things I found interesting was that you have to specify the number of topics up front. This is where the ‘art’ of machine learning comes in. With some held-out data this parameter could be tuned to perform better than my arbitrary guess of one topic per five documents.
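
A minimal sketch of what tuning the topic count could look like: try a handful of candidate values and keep the one with the best score. The class name TopicCountSearch, the pickBest method, and the scorer are all hypothetical and not part of Mallet. The scorer stands in for "train a model with k topics and measure how well it fits documents it has not seen", and the toy version below just pretends that 20 topics is ideal.

```java
import java.util.function.IntToDoubleFunction;

public class TopicCountSearch {

  /**
   * Returns the candidate topic count with the highest score.
   * The scorer is an assumption of this sketch: in practice it would
   * train a model with k topics and evaluate it on held-out documents.
   */
  public static int pickBest(int[] candidates, IntToDoubleFunction scorer) {
    if (candidates.length == 0) {
      throw new IllegalArgumentException("need at least one candidate");
    }
    int best = candidates[0];
    double bestScore = scorer.applyAsDouble(best);
    for (int i = 1; i < candidates.length; i++) {
      double score = scorer.applyAsDouble(candidates[i]);
      // Keep the first candidate that reaches the highest score seen so far.
      if (score > bestScore) {
        bestScore = score;
        best = candidates[i];
      }
    }
    return best;
  }

  public static void main(String[] args) {
    // Toy scorer that peaks at 20 topics; a real one would train and score a model.
    IntToDoubleFunction toyScorer = k -> -Math.abs(k - 20);
    int[] candidates = {5, 10, 20, 50, 100};
    System.out.println("best topic count: " + pickBest(candidates, toyScorer));
  }
}
```

Each call to a real scorer means training a full model, so the candidate list is worth keeping short. The result could then be passed to the ParallelTopicModel constructor in place of the fixed guess in createNewModel().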
