December 11, 2011 at 11:58 pm
Percona Server is MySQL with a few extra options added by Percona. It’s backwards compatible and built on the same code base. If you’re not familiar with Percona, they are the world’s leading MySQL consultants. The main reason I switched is that Ubuntu ships an old version of MySQL; its packaging runs about a year behind, apparently because of copyright checks after Oracle took over MySQL. Switching to Percona Server seemed to be the easiest way to update. A few other reasons follow.
Everyone and their mom says XtraBackup is the way to go for MySQL backups. Even Facebook uses it. XtraBackup is an open source project made by Percona. mysqldump is fine for small projects, but it doesn’t scale well once you have any real amount of data. XtraBackup is available in the Percona apt repositories.
By default, older versions of MySQL use the MyISAM storage engine, which has fallen out of favor. The default in newer MySQL installs is InnoDB. Percona also makes a storage engine called XtraDB, which is backwards compatible with InnoDB and supposedly a bit more performant. MariaDB (a MySQL fork maintained by the MySQL creator) uses it as its default as well. It sounds like most people don’t notice a huge difference between XtraDB and InnoDB, but both are much favored over MyISAM, which caused lots of problems for people.
Finally, there’s also HandlerSocket, which is a plugin for MySQL. It allows you to do primary key lookups directly against the storage engine, bypassing MySQL’s SQL layer. It’s supposed to be 5-10x faster because it doesn’t have to parse the SQL and do table locking. It turns MySQL into a key/value store as good as any of the NoSQL solutions. It’s actually much better because you can still run SQL queries on your data, which you can’t do with most of the NoSQL solutions, and you get MySQL’s replication etc., which is all very well documented. As long as your DB can fit in RAM on a single machine it makes MySQL much faster. Perhaps even faster and easier to use than memcached.
To migrate, first create a backup:
mysqldump -uroot -p --all-databases > dump.sql
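Before upgrading, it can be worth checking that the dump actually finished. This is a minimal sketch, assuming mysqldump's default comment trailer (a last line beginning with "-- Dump completed", which is absent when comments are disabled). The function name dump_looks_complete is hypothetical:

```shell
# Hypothetical helper: succeeds only if the file is non-empty and its last
# line is mysqldump's default "-- Dump completed" trailer.
dump_looks_complete() {
  [ -s "$1" ] && tail -n 1 "$1" | grep -q '^-- Dump completed'
}

# Usage (dump.sql is the file produced by the mysqldump command above):
# dump_looks_complete dump.sql && echo "backup looks complete"
```

A truncated dump, for example from a full disk or a dropped connection, would be missing the trailer and fail the check.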
Then do the upgrade:
gpg --keyserver hkp://keys.gnupg.net --recv-keys 1C4CBDCDCD2EFD2A
gpg -a --export CD2EFD2A | sudo apt-key add -
sudo emacs /etc/apt/sources.list
Add:
## Percona repository
deb http://repo.percona.com/apt maverick main
deb-src http://repo.percona.com/apt maverick main
sudo apt-get update
sudo apt-get install percona-server-server-5.5
sudo apt-get autoremove
November 30, 2011 at 3:53 pm
I had to figure out a few things to get Ubuntu installed and working well on VirtualBox.
I had to enable virtualization technologies in my BIOS. I have a Lenovo T520 and did this by pressing F1 during startup and then going to Security > Virtualization. If I did not do this then I would receive the error “VT-x features locked or unavailable in MSR” when trying to run with more than 1 CPU or 3584 MB of RAM.
Also, I had to run “sudo apt-get install dkms” to get the VirtualBox Guest Additions to work well.
Finally, I remapped the host key. By default all kinds of weird things happen when you use the right Ctrl button. This can be fixed by going to File > Preferences… > Input and then setting Host Key to something you never use like Pause.
November 14, 2011 at 1:22 am
Install nginx if it’s not already installed:
sudo apt-get install nginx
You must have the SSL module installed. The nginx docs say this is not standard. However, it does come installed on Ubuntu. You can verify by running nginx -V and looking for --with-http_ssl_module.
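The check can be scripted. One thing to know is that nginx -V writes its output to stderr, so it has to be redirected before piping. A minimal sketch, where has_ssl_module is a hypothetical helper name:

```shell
# Hypothetical helper: reads `nginx -V` output on stdin and succeeds if the
# SSL module flag appears in the configure arguments.
has_ssl_module() {
  grep -q -e '--with-http_ssl_module'
}

# nginx -V prints to stderr, so merge it into stdout before piping:
# nginx -V 2>&1 | has_ssl_module && echo "SSL module is compiled in"
```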
Next up is generating the SSL certs. Follow the Slicehost docs for this step.
Now you’ll need to update your /etc/nginx/nginx.conf file:
server {
  server_name www.yourdomain.com yourdomain.com;
  rewrite ^(.*) https://www.yourdomain.com$1 permanent;
}

server {
  server_name local.yourdomain.com;
  rewrite ^(.*) https://local.yourdomain.com$1 permanent;
}

server {
  listen 443;
  ssl on;
  ssl_certificate /etc/ssl/certs/myssl.crt;
  ssl_certificate_key /etc/ssl/private/myssl.key;
  keepalive_timeout 70;
  server_name www.yourdomain.com local.yourdomain.com;

  location / {
    proxy_pass http://backend;
  }
}
Then reload the nginx configuration:
sudo nginx -s reload
Finally, in /etc/hosts put:
127.0.0.1 local.yourdomain.com
This will allow you to visit https://local.yourdomain.com/, which nginx will proxy to the server that you have running on port 8080. This assumes that backend is defined as an upstream pointing at that port.
August 28, 2011 at 2:08 am
Earlier in the year, I posted a quick writeup on how to run an embedded Jetty instance. Today, I’m posting basically the same code showing how to run an embedded Tomcat instance. The embedded Tomcat API is much nicer since it closely matches the web.xml syntax. However, the embedded Tomcat instance takes much longer to start up.
package com.benmccann.webtemplate.frontend.server;

import java.net.URL;

import org.apache.catalina.Context;
import org.apache.catalina.core.AprLifecycleListener;
import org.apache.catalina.core.StandardServer;
import org.apache.catalina.deploy.FilterDef;
import org.apache.catalina.deploy.FilterMap;
import org.apache.catalina.startup.Tomcat;
import org.apache.struts2.dispatcher.ng.filter.StrutsPrepareAndExecuteFilter;

import com.beust.jcommander.JCommander;
import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.servlet.GuiceFilter;

/**
 * @author Ben McCann (benmccann.com)
 */
public class WebServer {

  private final FrontendSettings webServerSettings;
  private final GuiceListener guiceListener;
  private final Tomcat tomcat;

  @Inject
  public WebServer(
      FrontendSettings webServerSettings,
      GuiceListener guiceListener) {
    this.webServerSettings = webServerSettings;
    this.guiceListener = guiceListener;
    this.tomcat = new Tomcat();
  }

  // Equivalent of a <filter> element in web.xml
  private FilterDef createFilterDef(String filterName, String filterClass) {
    FilterDef filterDef = new FilterDef();
    filterDef.setFilterName(filterName);
    filterDef.setFilterClass(filterClass);
    return filterDef;
  }

  // Equivalent of a <filter-mapping> element in web.xml
  private FilterMap createFilterMap(String filterName, String urlPattern) {
    FilterMap filterMap = new FilterMap();
    filterMap.setFilterName(filterName);
    filterMap.addURLPattern(urlPattern);
    return filterMap;
  }

  public void run() throws Exception {
    String appBase = ".";
    tomcat.setPort(webServerSettings.getPort());
    tomcat.setBaseDir("webapp");
    tomcat.getHost().setAppBase(appBase);

    String contextPath = "/";

    // Add AprLifecycleListener to give native speed boost
    // sudo apt-get install libtcnative-1
    StandardServer server = (StandardServer)tomcat.getServer();
    AprLifecycleListener listener = new AprLifecycleListener();
    server.addLifecycleListener(listener);

    Context context = tomcat.addWebapp(contextPath, appBase);
    context.addFilterDef(createFilterDef("guice", GuiceFilter.class.getName()));
    FilterDef struts2FilterDef = createFilterDef("struts2",
        StrutsPrepareAndExecuteFilter.class.getName());
    struts2FilterDef.addInitParameter("struts.devMode",
        Boolean.toString(webServerSettings.isDevModeEnabled()));
    context.addFilterDef(struts2FilterDef);
    context.addFilterMap(createFilterMap("guice", "/*"));
    context.addFilterMap(createFilterMap("struts2", "/*"));

    tomcat.start();
    tomcat.getServer().await();
  }

  public static void main(String[] args) throws Exception {
    FrontendSettings webServerSettings = new FrontendSettings();
    new JCommander(webServerSettings, args);
    // Build the injector from FrontendModule so that its bindings are used
    Injector injector = Guice.createInjector(new FrontendModule(webServerSettings));
    WebServer server = injector.getInstance(WebServer.class);
    server.run();
  }

}
July 9, 2011 at 8:00 pm
Theano is a very interesting Python library developed mainly for deep learning, which can run calculations on some NVIDIA GPUs by using the CUDA library. Setting up Theano to use the GPU can be a little tricky and take a bit of work. However, Aaron Haviland has set up a CUDA 4.0 PPA, which makes the installation much simpler.
Install Theano
sudo apt-get install python-numpy libblas-dev liblapack-dev gfortran python-dev python-pip mercurial
sudo pip install --upgrade git+git://github.com/Theano/Theano.git
This will put Theano in /usr/local/lib/python2.7/dist-packages/theano
Install CUDA (requires downgrading gcc to 4.4)
sudo add-apt-repository ppa:aaron-haviland/cuda-4.0
sudo apt-get update
sudo apt-get upgrade
sudo apt-get install nvidia-cuda-toolkit g++-4.4 gcc-4.4
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.5 40 --slave /usr/bin/g++ g++ /usr/bin/g++-4.5
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.4 60 --slave /usr/bin/g++ g++ /usr/bin/g++-4.4
sudo update-alternatives --config gcc
Test it out
Now run the sample program under “Putting it all Together” in the Theano tutorial. It will hopefully tell you that it used your GPU.
A good benchmark to test out the speed of your setup is to run /usr/local/lib/python2.7/dist-packages/theano/misc/check_blas.py
Credits
Thanks to James Bergstra for the necessary Theano fix to make it work with the PPA as well as the rest of the Theano developers for providing this very cool library. And also to Andrew Ng, Samy Bengio, and the other Googlers who have been taking their time to teach the rest of us more machine learning concepts.
April 11, 2011 at 1:03 am
I’ve recently started using Git, which I’ve found I much prefer to Subversion for two reasons. The first is that it’s really fast since almost all commands are run locally. The second reason is that Subversion litters your source code with .svn directories and should you accidentally delete or move one then you’re in for a world of hurt. Git also handles ignored files in a much easier manner.
There are two downsides with Git. The first is that there’s no central server to store the code base. GitHub or BitBucket can fulfill this role if you don’t mind someone else hosting your source code. If you want to set up a central server yourself it seems the best solution is gitolite. The documentation isn’t for beginners, but I found a decent tutorial on setting up gitolite.
The other downside with git is that the commands can be a bit bizarre.
git aliases
You can set aliases using git config --global. E.g. git config --global alias.dt "difftool --no-prompt" makes git dt act the same as git difftool --no-prompt. These aliases are saved in ~/.gitconfig. My ~/.gitconfig looks like:
[user]
  name = Ben McCann
  email = ben@benmccann.com
[alias]
  cam = commit -am
  dt = difftool --no-prompt
  dtm = !meld .
  pending = !clear && git status
  pullom = pull origin master
  pushom = push origin master
  rev = checkout --
  revall = reset --hard HEAD
Reverting to a previous version
$ git reset --hard YOUR_CHANGESET_HERE
$ git reset --soft @{1}
$ git commit -a
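These three commands do the following. The hard reset moves the branch and working tree back to the old changeset. The soft reset to @{1} moves the branch pointer forward to where it was before, leaving the old content staged. The commit then records that old content as a new commit on top of the existing history. A minimal sketch in a throwaway repository, assuming git is installed; revert_demo and the file names are hypothetical:

```shell
# Hypothetical demo: make two commits, then bring back the first version
# as a third commit without rewriting history.
revert_demo() {
  repo=$(mktemp -d)
  (
    cd "$repo" || exit 1
    git init -q
    git config user.name "Example"
    git config user.email "example@example.com"

    echo one > file.txt
    git add file.txt
    git commit -q -m "first"
    first=$(git rev-parse HEAD)

    echo two > file.txt
    git commit -q -am "second"

    git reset -q --hard "$first"     # branch and files back at "first"
    git reset -q --soft '@{1}'       # branch back at "second", files untouched
    git commit -q -a -m "Revert to first"

    cat file.txt                     # content of the first commit
    git rev-list --count HEAD        # number of commits in the history
  )
  rm -rf "$repo"
}

revert_demo
```

The history keeps all three commits, so the revert is safe to push to a shared branch.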
March 31, 2011 at 10:50 pm
The Linux sed command is a stream editor. What that means is basically that you can do a regex operation on each line of a file or a piped stream. I always have a bit of trouble remembering how to use it since its regex implementation is a bit different than the ones I’m used to. I’ll post more examples as I encounter them in my work.
Sed regex reminders:
- You need a backslash before parens in a regex grouping
- You refer to matched regex groups using \1, \2, etc.
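- The + regex operator does not work unescaped in the default (basic) regex syntax. In GNU sed you can write \+ instead, or pass -r (or -E) to switch to extended regex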
- Non-greedy quantifiers don’t work. For example, .*? will not work
- The output is printed to standard out by default. You need the -i option if you want to edit a file with sed.
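The reminders above can be tried on a throwaway stream. This is a minimal sketch, assuming a sed that accepts -E for extended regex (current GNU and BSD versions do); the function names are hypothetical. It uses a literal tab rather than \t, since \t inside a sed pattern is a GNU extension:

```shell
# Keep only the first tab-separated column. Basic regex: the group needs
# backslashed parens, and \1 refers to what the group matched.
first_column() {
  tab=$(printf '\t')
  sed "s/\([^${tab}]*\).*/\1/"
}

# Strip trailing spaces. With -E (extended regex) the + operator works
# as-is, and group parens would no longer need backslashes.
strip_trailing_spaces() {
  sed -E 's/ +$//'
}

printf 'alpha\tbeta\tgamma\n' | first_column
printf 'no trailing spaces   \n' | strip_trailing_spaces
```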
Remove all but the first column in a .tsv stream
sed 's/\([^\t]*\).*/\1/'
Edit a .tsv file by removing all but the first column
sed -i 's/\([^\t]*\).*/\1/' file.tsv
Remove the first line of a stream
sed '1d'
Strip trailing spaces from a file
sed -i -e 's/ *$//' file.txt
Replace @inheritDoc with @override after marking for edit
grep @inheritDoc -l -r java/com/benmccann | xargs p4 edit
grep @inheritDoc -l -r java/com/benmccann | xargs sed -i 's/\(.*\)@inheritDoc/\1@override/'
Replace @inheritDoc with @override in JS files after marking for edit
find java/com/benmccann -name '*.js' -print0 | xargs -0 grep -l @inheritDoc | xargs p4 edit
find java/com/benmccann -name '*.js' -print0 | xargs -0 grep -l @inheritDoc | xargs sed -i 's/\(.*\)@inheritDoc/\1@override/'
March 29, 2011 at 3:43 pm
Guice 3.0 was released a few days ago! One of the easiest ways to use it in your web server is to use Struts 2 with the Struts 2 plugin, which is available in the central Maven repository.
This tutorial assumes familiarity with Guice and Struts 2.
In order to use the plugin, your injector must be created with a Struts2GuicePluginModule:
Injector injector = Guice.createInjector(
    new com.google.inject.servlet.ServletModule(),
    new com.google.inject.struts2.Struts2GuicePluginModule(),
    new MyModule());
You must then define a GuiceServletContextListener to provide the injector to the Struts 2 plugin. I injected the Injector because I’m using embedded Jetty. However, if you’re using a standard servlet container, you’d probably just create the injector in the class itself.
package com.benmccann.example;

import com.google.inject.Inject;
import com.google.inject.Injector;
import com.google.inject.servlet.GuiceServletContextListener;

/**
 * @author benmccann.com
 */
public class GuiceListener extends GuiceServletContextListener {

  private final Injector injector;

  @Inject
  public GuiceListener(Injector injector) {
    this.injector = injector;
  }

  @Override
  public Injector getInjector() {
    return injector;
  }

}
You must then wire it up in your web.xml:
<listener>
  <listener-class>com.benmccann.example.GuiceListener</listener-class>
</listener>

<filter>
  <filter-name>guice</filter-name>
  <filter-class>com.google.inject.servlet.GuiceFilter</filter-class>
</filter>

<filter-mapping>
  <filter-name>guice</filter-name>
  <url-pattern>/*</url-pattern>
</filter-mapping>
There’s also an example in the Guice source code repository.
Enjoy!
March 10, 2011 at 8:40 pm
We recently had a PhD candidate from UCI come in and speak to the AI club at Google Irvine about her research on Latent Dirichlet Allocation (LDA). LDA is a topic model that groups words into topics, where each article is composed of a mixture of topics. I was interested in playing around with this a bit, so I downloaded Mallet and wrote up some quick code to try making my own LDA model.
package com.benmccann.topicmodel;

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

import cc.mallet.pipe.CharSequence2TokenSequence;
import cc.mallet.pipe.Pipe;
import cc.mallet.pipe.SerialPipes;
import cc.mallet.pipe.TokenSequence2FeatureSequence;
import cc.mallet.pipe.TokenSequenceLowercase;
import cc.mallet.pipe.TokenSequenceRemoveStopwords;
import cc.mallet.pipe.iterator.ArrayIterator;
import cc.mallet.topics.ParallelTopicModel;
import cc.mallet.types.Alphabet;
import cc.mallet.types.IDSorter;
import cc.mallet.types.InstanceList;

import com.google.inject.Guice;
import com.google.inject.Inject;
import com.google.inject.Injector;

public class Lda {

  @Inject private com.benmccann.topicmodel.TextProvider textProvider;

  // Tokenize, lowercase, drop stopwords, and convert each text to features
  InstanceList createInstanceList(List<String> texts) throws IOException {
    ArrayList<Pipe> pipes = new ArrayList<Pipe>();
    pipes.add(new CharSequence2TokenSequence());
    pipes.add(new TokenSequenceLowercase());
    pipes.add(new TokenSequenceRemoveStopwords());
    pipes.add(new TokenSequence2FeatureSequence());
    InstanceList instanceList = new InstanceList(new SerialPipes(pipes));
    instanceList.addThruPipe(new ArrayIterator(texts));
    return instanceList;
  }

  private ParallelTopicModel createNewModel() throws IOException {
    List<String> texts = textProvider.getTexts();
    InstanceList instanceList = createInstanceList(texts);
    // The number of topics is a guess: one topic per five documents
    int numTopics = instanceList.size() / 5;
    ParallelTopicModel model = new ParallelTopicModel(numTopics);
    model.addInstances(instanceList);
    model.estimate();
    return model;
  }

  ParallelTopicModel getOrCreateModel() throws Exception {
    return getOrCreateModel("model");
  }

  // Train the model once and cache it on disk for later runs
  private ParallelTopicModel getOrCreateModel(String directoryPath)
      throws Exception {
    File directory = new File(directoryPath);
    if (!directory.exists()) {
      directory.mkdir();
    }
    File file = new File(directory, "mallet-lda.model");
    ParallelTopicModel model = null;
    if (!file.exists()) {
      model = createNewModel();
      model.write(file);
    } else {
      model = ParallelTopicModel.read(file);
    }
    return model;
  }

  public void printTopics() throws Exception {
    ParallelTopicModel model = getOrCreateModel();
    Alphabet alphabet = model.getAlphabet();
    for (TreeSet<IDSorter> set : model.getSortedWords()) {
      System.out.print("TOPIC: ");
      for (IDSorter s : set) {
        System.out.print(alphabet.lookupObject(s.getID()) + ", ");
      }
      System.out.println();
    }
  }

  public static void main(String[] args) throws Exception {
    Injector injector = Guice.createInjector();
    Lda lda = injector.getInstance(Lda.class);
    lda.printTopics();
  }

}
One of the things I found interesting was that you have to specify a number of topics. This is where the ‘art’ of machine learning comes in. With some training data this parameter could be tuned to perform better than my random guesses.
March 8, 2011 at 11:20 pm
To debug a Java program being run on the command line from Eclipse you can start the Java program in remote debugging mode:
java -Xdebug -Xrunjdwp:transport=dt_socket,address=8000,server=y,suspend=y -jar myProgram.jar
The program will wait for you to attach the Eclipse debugger to it. Open Eclipse and choose:
Run > Debug Configurations... > Remote Java Application > New
Make sure to enter the same port that you chose on the command line. The default is port 8000. Now hit “Debug” and you’re off!
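On Java 5 and later, the same debug agent can be loaded with the single -agentlib:jdwp option, which replaces the -Xdebug -Xrunjdwp pair. A minimal sketch, where jdwp_flag is a hypothetical helper that builds the option for a given port:

```shell
# Hypothetical helper: print the JDWP agent option for a given port
# (defaults to 8000, matching the example above).
jdwp_flag() {
  port=${1:-8000}
  printf '%s\n' "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=${port}"
}

# Usage:
# java "$(jdwp_flag 8000)" -jar myProgram.jar
```

Note that on JDK 9 and later a bare port number listens on localhost only. To attach from another machine, the address would need to be written as *:8000.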