Archive for March, 2011

Sed Cookbook

The Linux sed command is a stream editor: it applies an editing command, most often a regex substitution, to each line of a file or piped stream.  I always have a bit of trouble remembering how to use it because its regex implementation differs from the ones I’m used to.  I’ll post more examples as I encounter them in my work.

Sed regex reminders:

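  • You need a backslash before the parens of a regex group: \( and \)
  • You refer to matched regex groups using \1, \2, etc.
  • A bare + does not work as a quantifier in sed’s default (basic) regex syntax.  GNU sed accepts \+ instead, or you can pass -E to switch to extended regex, where + works and parens need no backslash
  • Non-greedy quantifiers don’t work.  For example, .*? will not work
  • The output is printed to standard out by default.  You need the -i option if you want sed to edit a file in place.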
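
The reminders above can be checked with a few one-liners. This is a minimal sketch that assumes a sed supporting -E (GNU sed does); the sample strings and variable names are made up for illustration. The last pair shows a common workaround for the missing non-greedy quantifier, a negated character class.

```shell
# Basic regex (the default): groups are written \( \) and referenced as \1, \2.
swapped=$(echo 'hello world' | sed 's/\(hello\) \(world\)/\2 \1/')
echo "$swapped"      # world hello

# With -E (extended regex), + means "one or more".
collapsed=$(echo 'aaa bbb' | sed -E 's/a+/X/')
echo "$collapsed"    # X bbb

# Without -E, a bare + is a literal plus sign, so nothing matches here.
unchanged=$(echo 'aaa bbb' | sed 's/a+/X/')
echo "$unchanged"    # aaa bbb

# .* is greedy and swallows everything up to the last '>'.
greedy=$(echo '<a><b>' | sed 's/<.*>/X/')
echo "$greedy"       # X

# A negated class stops at the first '>' instead, standing in for .*?
first_tag=$(echo '<a><b>' | sed 's/<[^>]*>/X/')
echo "$first_tag"    # X<b>
```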

Remove all but the first column in a .tsv stream
sed 's/\([^\t]*\).*/\1/'

Edit a .tsv file (file.tsv here) by removing all but the first column
sed -i 's/\([^\t]*\).*/\1/' file.tsv
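
To try the first-column recipe on throwaway data, something like the following should work. It is a sketch under a few assumptions: the sample rows and variable names are invented, and the in-place step relies on GNU sed's form of -i. Writing \t inside a bracket expression is a GNU extension, so this version builds a literal tab with printf instead.

```shell
# Build a literal tab so the pattern does not depend on sed understanding \t.
tab=$(printf '\t')

# Stream variant: keep only the text before the first tab on each line.
first_col=$(printf 'id\tname\n1\talice\n2\tbob\n' | sed "s/\([^$tab]*\).*/\1/")
echo "$first_col"

# In-place variant on a temporary file (GNU sed -i syntax).
tsv=$(mktemp)
printf 'id\tname\n1\talice\n' > "$tsv"
sed -i "s/\([^$tab]*\).*/\1/" "$tsv"
edited=$(cat "$tsv")
echo "$edited"
rm -f "$tsv"
```

For this particular job, cut -f1 gives the same result, since cut splits on tabs by default.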

Remove the first line of a stream
sed '1d'

Strip trailing whitespace (spaces and tabs) from a file (file.txt here)
sed -i -e 's/[ \t]*$//' file.txt
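
A quick way to check the behavior is to run it on a temporary file. This sketch uses the POSIX class [[:space:]], which covers tabs as well as spaces; the file contents and variable names are made up, and -i is used in its GNU sed form.

```shell
tmp=$(mktemp)

# One line ending in spaces, one ending in a tab, one already clean.
printf 'alpha  \nbeta\t\ngamma\n' > "$tmp"

# Remove any run of whitespace at the end of each line, in place.
sed -i -e 's/[[:space:]]*$//' "$tmp"

stripped=$(cat "$tmp")
echo "$stripped"
rm -f "$tmp"
```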

Replace @inheritDoc with @override after marking for edit
grep @inheritDoc -l -r java/com/benmccann | xargs p4 edit
grep @inheritDoc -l -r java/com/benmccann | xargs sed -i 's/\(.*\)@inheritDoc/\1@override/'

Replace @inheritDoc with @override in JS files after marking for edit
find java/com/benmccann -name '*.js' -print0 | xargs -0 grep -l @inheritDoc | xargs p4 edit
find java/com/benmccann -name '*.js' -print0 | xargs -0 grep -l @inheritDoc | xargs sed -i 's/\(.*\)@inheritDoc/\1@override/'
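
The find | grep | sed pipeline can be rehearsed in a scratch directory before pointing it at real sources. The sketch below leaves out the p4 edit step and uses invented file names and contents; it assumes GNU sed for -i and a find/xargs pair that supports -print0 and -0.

```shell
dir=$(mktemp -d)

# Two .js files (one with the tag, one without) and a non-.js file with the tag.
printf '/** @inheritDoc */\nfoo();\n' > "$dir/a.js"
printf 'bar();\n' > "$dir/b.js"
printf '/** @inheritDoc */\n' > "$dir/c.txt"

# Only .js files that contain the tag are handed to sed.
find "$dir" -name '*.js' -print0 | xargs -0 grep -l @inheritDoc \
  | xargs sed -i 's/\(.*\)@inheritDoc/\1@override/'

after_js=$(cat "$dir/a.js")     # rewritten
after_txt=$(cat "$dir/c.txt")   # untouched, because it is not a .js file
echo "$after_js"
echo "$after_txt"
rm -rf "$dir"
```

Because .* is greedy, this pattern rewrites the last @inheritDoc on a line. If a line could contain several, s/@inheritDoc/@override/g would replace all of them.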


Using the Guice Struts 2 plugin

Guice 3.0 was released a few days ago!  One of the easiest ways to use it in your web server is to use Struts 2 with the Struts 2 plugin, which is available in the central Maven repository.

This tutorial assumes familiarity with Guice and Struts 2.

In order to use the plugin, your injector must be created with a Struts2GuicePluginModule:

Injector injector = Guice.createInjector(
    new com.google.inject.servlet.ServletModule(),
    new com.google.inject.struts2.Struts2GuicePluginModule(),
    new MyModule());

You must then define a GuiceServletContextListener to provide the injector to the Struts 2 plugin. I injected the Injector because I’m using embedded Jetty and construct the listener myself. However, if you’re using a standard servlet container, the container instantiates the listener through a no-argument constructor, so you’d probably just create the injector in the class itself.

package com.benmccann.example;

import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.servlet.GuiceServletContextListener;

/**
 * @author benmccann.com
 */
public class GuiceListener extends GuiceServletContextListener {

  private final Injector injector;

  @Inject
  public GuiceListener(Injector injector) {
    this.injector = injector;
  }

  @Override
  public Injector getInjector() {
    return injector;
  }

}

You must then wire it up in your web.xml:

  <listener>
    <listener-class>com.benmccann.example.GuiceListener</listener-class>
  </listener>  

  <filter>
    <filter-name>guice</filter-name>
    <filter-class>com.google.inject.servlet.GuiceFilter</filter-class>
  </filter>

  <filter-mapping>
    <filter-name>guice</filter-name>
    <url-pattern>/*</url-pattern>
  </filter-mapping>

There’s also an example in the Guice source code repository.

Enjoy!


Latent Dirichlet Allocation with Mallet

We recently had a PhD candidate from UCI come in to speak to the AI club at Google Irvine about her research on Latent Dirichlet Allocation (LDA). LDA is a topic model: it groups words into topics and treats each article as a mixture of topics. I wanted to play around with this a bit, so I downloaded Mallet and wrote up some quick code to try making my own LDA model.

package com.benmccann.topicmodel;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.TokenSequenceRemoveStopwords;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;
import cc.mallet.types.InstanceList;

import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;

public class Lda {

  @Inject private com.benmccann.topicmodel.TextProvider textProvider;

  InstanceList createInstanceList(List<String> texts) throws IOException {
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());
    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
  }

  private ParallelTopicModel createNewModel() throws IOException {
    List<String> texts = textProvider.getTexts();
    InstanceList instanceList = createInstanceList(texts);
    // Rough guess: one topic per five documents, but never fewer than one.
    int numTopics = Math.max(1, instanceList.size() / 5);
    ParallelTopicModel model = new ParallelTopicModel(numTopics);
    model.addInstances(instanceList);
    model.estimate();
    return model;
  }

  ParallelTopicModel getOrCreateModel() throws Exception {
    return getOrCreateModel("model");
  }

  private ParallelTopicModel getOrCreateModel(String directoryPath)
      throws Exception {
    File directory = new File(directoryPath);
    if (!directory.exists()) {
      directory.mkdir();
    }
    File file = new File(directory, "mallet-lda.model");
    ParallelTopicModel model = null;
    if (!file.exists()) {
      model = createNewModel();
      model.write(file);
    } else {
      model = ParallelTopicModel.read(file);
    }
    return model;
  }

  public void printTopics() throws Exception {
    ParallelTopicModel model = getOrCreateModel();
    Alphabet alphabet = model.getAlphabet();
    for (TreeSet<IDSorter> set : model.getSortedWords()) {
      System.out.print("TOPIC: ");
      for (IDSorter s : set) {
        System.out.print(alphabet.lookupObject(s.getID()) + ", ");
      }
      System.out.println();
    }
  }

  public static void main(String[] args) throws Exception {
    Injector injector = Guice.createInjector();
    Lda lda = injector.getInstance(Lda.class);
    lda.printTopics();
  }

}

One of the things I found interesting was that you have to specify the number of topics up front. This is where the ‘art’ of machine learning comes in. With some training data this parameter could be tuned to perform better than my rough guess of one topic per five documents.


Remote Java debugging in Eclipse

To use Eclipse to debug a Java program that is run from the command line, start the program in remote debugging mode:

java -Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y -jar myProgram.jar

The program will wait for you to attach the Eclipse debugger to it. Open Eclipse and choose:

Run > Debug Configurations... > Remote Java Application > New

Make sure to enter the same port that you chose on the command line. The default is port 8000. Now hit “Debug” and you’re off!
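
On current JVMs the same debug agent is usually loaded with -agentlib:jdwp, and -Xdebug -Xrunjdwp is the older spelling of the same thing. The sketch below only assembles and prints the command rather than launching anything. The port and jar name are the ones used above, and the variable names are made up for illustration.

```shell
# Values taken from the command above; adjust to your setup.
port=8000
jar=myProgram.jar

# Same options as before, passed through -agentlib:jdwp.
# On JDK 9 and later, address=8000 listens on localhost only;
# address=*:8000 is needed if the debugger connects from another machine.
jdwp_opts="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=$port"

# Print the command instead of running it.
cmd="java $jdwp_opts -jar $jar"
echo "$cmd"
```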
