<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Michelle Fullwood</title>
    <description>Explorations in language technology, Python, and other technical diversions</description>
    <link>http://michelleful.github.io/code-blog/code-blog/</link>
    <atom:link href="http://michelleful.github.io/code-blog/code-blog/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Mon, 15 Jul 2019 22:28:51 +0000</pubDate>
    <lastBuildDate>Mon, 15 Jul 2019 22:28:51 +0000</lastBuildDate>
    <generator>Jekyll v3.8.5</generator>
    
      <item>
        <title>Natural language processing at PyGotham 2016</title>
        <description>&lt;p&gt;It’s been a week since I attended PyGotham 2016 in New York City. When I saw
the schedule, which was packed with natural language processing talks,
I knew I had to go. Plus, it was at the United Nations.
How cool is it to attend a conference at the UN??!!&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;Picture of me at PyGotham&quot; src=&quot;/code-blog/assets/images/201607/pygotham_michelleful.jpg&quot; style=&quot;display: block; margin-left: auto; margin-right: auto; width: 49%;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I had a great time attending those talks, which were uniformly excellent.
I also organised a Birds of a Feather (BoF) about NLP and got to meet a lot of
language-minded folk that way. Here’s a recap.&lt;/p&gt;

&lt;h3 id=&quot;teaching-and-doing-digital-humanities-with-jupyter-notebooks&quot;&gt;Teaching and Doing Digital Humanities with Jupyter Notebooks&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/mjlavin80&quot;&gt;Matt Lavin&lt;/a&gt; gave a really interesting talk combining a couple strands of his
work in the digital humanities: educating people about computational DH,
mainly via the medium of Jupyter notebooks, as well as his own research on
dating 19th and early 20th century horror novels to answer questions like:
did H. P. Lovecraft deliberately try to write in an older style than his
contemporaries to make his horror more…horrorful?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You may make questionable choices when processing your data and
  running your ML algorithms, and that’s okay so long as you document and justify
  your methods so other people (and future you) can follow your thought process.
  Jupyter notebooks, which interleave prose, code, and execution results,
  are great for this.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://mybinder.org/&quot;&gt;MyBinder&lt;/a&gt; enables you to make your Jupyter
  notebooks executable online, which is great for workshops,
  as it removes the need to get Jupyter notebook
  up and running on participants’ individual machines.&lt;/p&gt;

&lt;h3 id=&quot;summarizing-documents&quot;&gt;Summarizing documents&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://mike.place/talks/pygotham/#p1&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many people I talked to on Saturday evening cited this as their favourite talk
of the day. &lt;a href=&quot;https://twitter.com/mikepqr&quot;&gt;Mike Williams&lt;/a&gt; gave a masterly overview of how to do extractive
summarization, starting with the “dumb” but still effective Luhn method that anyone
can implement with a few lines of code. (If you’ve seen SummaryBot on Reddit,
that’s how it works.) Then we worked up to Latent Dirichlet Allocation and
recurrent neural networks. It was all stupendously clear and everyone felt like
they came out of the talk with their brains embiggened.&lt;/p&gt;
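
&lt;p&gt;To make “a few lines of code” concrete, here’s a rough sketch of the Luhn idea:
score each sentence by how many of the document’s most frequent content words it
contains, then keep the top scorers. The stopword list and parameters below are my
own illustration, not Mike’s implementation:&lt;/p&gt;

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "that", "with"}

def luhn_summary(text, num_sentences=2, num_keywords=5):
    """Luhn-style extractive summarization: rank sentences by how many
    of the document's most frequent content words they contain."""
    sentences = re.split(r"[.!?]\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    keywords = {w for w, _ in Counter(words).most_common(num_keywords)}
    def score(sentence):
        # count how many keyword tokens appear in this sentence
        return sum(1 for w in re.findall(r"[a-z']+", sentence.lower()) if w in keywords)
    return sorted(sentences, key=score, reverse=True)[:num_sentences]

text = ("Python is great. Python makes text processing easy. "
        "Cats sleep all day. Text processing with Python is fun.")
top = luhn_summary(text, num_sentences=1)
```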

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In future, when extracting bag-of-words features, try substituting
  &lt;a href=&quot;https://github.com/ryankiros/skip-thoughts&quot;&gt;skip-thought vectors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://keras.io&quot;&gt;Keras&lt;/a&gt; looks like a really neat way of implementing neural networks
  (it’s higher-level than Theano/TensorFlow; in fact, it builds on them).&lt;/p&gt;

&lt;h3 id=&quot;everything-you-always-wanted-to-know-about-nlp-but-were-afraid-to-ask&quot;&gt;Everything you always wanted to know about NLP but were afraid to ask&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.google.com/presentation/d/1rYZEd7-8sZGBzg75OOPvSkIfd1FHq_d4elptiZXzJj8&quot;&gt;Slides&lt;/a&gt; and &lt;a href=&quot;https://github.com/srbutler/pygotham16_NLP/blob/master/pyg16_NLPtalk.ipynb&quot;&gt;notebook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/staven_boulter&quot;&gt;Steven Butler&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/deathandmaxes&quot;&gt;Max Schwartz&lt;/a&gt; gave a solid introduction to NLP on Friday
morning, covering a lot of ground from morphology through to semantics
in under an hour.
I think that was the first time I’d ever seen Morfessor (a classic approach
to the problem of morphological segmentation) taught in an intro NLP talk!
I really liked their emphasis on how knowledge of linguistics could help
with NLP tasks, especially when it comes to other languages when a pre-built
NLP library might not be available. If you’re looking to get started with NLP,
I highly recommend this talk when the video is out!&lt;/p&gt;

&lt;h3 id=&quot;higher-level-natural-language-processing-with-textacy&quot;&gt;Higher-level natural language processing with Textacy&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/bdewilde/pygotham_2016/blob/master/pygotham_2016.pdf&quot;&gt;Slides&lt;/a&gt;
and &lt;a href=&quot;https://github.com/bdewilde/pygotham_2016/blob/master/pygotham_2016.ipynb&quot;&gt;notebook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Burton DeWilde, creator of the excellently-named library &lt;a href=&quot;https://github.com/chartbeat-labs/textacy&quot;&gt;textacy&lt;/a&gt;, gave
an overview of it. The library sits atop the also-excellent
&lt;a href=&quot;https://spacy.io/&quot;&gt;spaCy&lt;/a&gt;
and aims to provide a nice, performant API for higher-level NLP tasks such
as key term extraction and topic modelling, with many more features planned.&lt;/p&gt;

&lt;p&gt;A nice touch to the library is built-in data visualisations for seeing the
results of an analysis. For example, you can visualise the relationship between
top terms and topics after topic modelling in a termite plot with one line
of code:&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;Termite plot from Textacy&quot; src=&quot;/code-blog/assets/images/201607/textacy_chart.png&quot; style=&quot;display: block; margin-left: auto; margin-right: auto; width: 49%;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Burton also put out a call for contributors to &lt;code class=&quot;highlighter-rouge&quot;&gt;textacy&lt;/code&gt;. From meeting him this weekend,
I can say he’s a really nice guy, and &lt;code class=&quot;highlighter-rouge&quot;&gt;textacy&lt;/code&gt; has the makings of a great library,
so go contribute!&lt;/p&gt;

&lt;h3 id=&quot;others&quot;&gt;Others&lt;/h3&gt;

&lt;p&gt;In addition to the NLP-centric talks, there were loads of data science-themed talks.
Deep learning was a big theme. Slightly less mainstream machine learning
techniques like reinforcement learning and probabilistic graphical models
were also covered, albeit at a more introductory level.&lt;/p&gt;

&lt;p&gt;One non-NLP/ML talk I really enjoyed was &lt;a href=&quot;https://twitter.com/subyraman&quot;&gt;Suby Raman’s&lt;/a&gt;
“Making sense of 100 years of NYC opera with Python” (&lt;a href=&quot;https://pygothamsuby.herokuapp.com/#/?_k=cg5h8j&quot;&gt;slides&lt;/a&gt;), which was more dataviz-y
and gave good tips on scraping with &lt;code class=&quot;highlighter-rouge&quot;&gt;asyncio&lt;/code&gt;.
&lt;a href=&quot;http://subyraman.tumblr.com/post/101048131983/10-graphs-to-explain-the-metropolitan-opera&quot;&gt;His initial blogpost&lt;/a&gt; about
his project got a lot of attention in music social media and even made it to
the Washington Post. It was fun to hear about the aftermath of his post.
Something he emphasised that resonated with me was the need for
domain experts to learn to root around in data, since they know the really interesting
questions. Once they master the tools to answer those questions,
they’re unstoppable.&lt;/p&gt;

&lt;h3 id=&quot;pythons-of-a-feather-slither-togetherwait-what&quot;&gt;Pythons of a feather slither together…wait what?&lt;/h3&gt;

&lt;p&gt;On the first day of PyGotham, I ran into &lt;a href=&quot;https://twitter.com/weatherpattern&quot;&gt;Ray Cha&lt;/a&gt;, whom I had only met once before, at Maptime Boston, but who quickly turned into a friend over the weekend.
I told him I was thinking of doing an NLP BoF and he said he would totally
participate. With the risk of waiting alone and awkward in a room thus mitigated,
I registered a time on the BoF spreadsheet (it was really hard to find a timeslot
that didn’t clash with talks an NLP person would be into) and tweeted out an announcement.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet tw-align-center&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;NLP folks at &lt;a href=&quot;https://twitter.com/PyGotham&quot;&gt;@PyGotham&lt;/a&gt;: come stop by the natural language processing BoF 3-4pm Sunday! &lt;a href=&quot;https://t.co/RWpYYxIhne&quot;&gt;https://t.co/RWpYYxIhne&lt;/a&gt; &lt;a href=&quot;https://twitter.com/bjdewilde&quot;&gt;@bjdewilde&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/pygotham?src=hash&quot;&gt;#pygotham&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/nlproc?src=hash&quot;&gt;#nlproc&lt;/a&gt;&lt;/p&gt;&amp;mdash; Michelle Fullwood (@michelleful) &lt;a href=&quot;https://twitter.com/michelleful/status/754396297768136704&quot;&gt;July 16, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Just before the actual BoF I mentioned to Ray that I thought 6-10 was about
the right size for a BoF and we had 8 participants, so that was &lt;em&gt;juuuuuust&lt;/em&gt; right.
Many of them were speakers from the talks I mentioned above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/udibr&quot;&gt;Udi&lt;/a&gt; started our discussion off by
  sharing &lt;a href=&quot;https://github.com/udibr/headlines&quot;&gt;his implementation of a headline generator with RNNs in Keras&lt;/a&gt;.
  Not only were the results on Buzzfeed data super cool-looking, but the demo also
  reinforced that I should really take a good look at Keras.&lt;/p&gt;

&lt;p&gt;Matt took us through his novel-dating project once more and the
  whole group brainstormed other features to add to his machine learning model,
  sharing their own experiences. For example, Max has been doing some
  really neat authorship attribution stuff with blogs and Twitter and shared
  his findings from that.&lt;/p&gt;

&lt;p&gt;It turns out that Steven worked on non-concatenative morphological
  segmentation with Tagalog infixes, which is similar to my dissertation work on
  segmenting Arabic morphology! Small world!&lt;/p&gt;

&lt;p&gt;Burton and I discussed how to train a quote extractor from prose. &lt;code class=&quot;highlighter-rouge&quot;&gt;Textacy&lt;/code&gt;
  currently includes one, but it relies on the quotes being more or less correctly
  formatted, and my data is kind of messy. We were talking about
  using his extractor as a baseline and getting people to annotate while reading,
  then training a CRF on the resultant corpus.
  Adam Palay also suggested some resources we could look
  at that might already have annotated corpora.&lt;/p&gt;

&lt;p&gt;There was also general discussion of how to handle multilingual data,
  data ethics, and how to get started as a beginner.&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;As someone who’s generally shy about approaching strangers in the hallway,
I’ve found that giving talks is a great way to get people to come talk to me
instead. Of course, if you’re shy about public speaking, that can be just as
bad…so running a BoF was the happy medium for me. I got to meet some great
people and share ideas, which is basically the point of going to a conference.&lt;/p&gt;

&lt;p&gt;So thanks to the BoF participants, to PyGotham organizers, and to Ray for
making my weekend!&lt;/p&gt;
</description>
        <pubDate>Sat, 23 Jul 2016 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2016/07/23/nlp-at-pygotham-2016/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2016/07/23/nlp-at-pygotham-2016/</guid>
        
        <category>natural language processing</category>
        
        <category>python</category>
        
        <category>conferences</category>
        
        
      </item>
    
      <item>
        <title>Parsing Chinese text with Stanford NLP</title>
        <description>&lt;p&gt;I’m doing some natural language processing on (Mandarin) Chinese text right now,
using Stanford’s NLP tools, and I’m documenting the steps here.
I’m just calling the tools from the command line, in a Unix environment, so
if your use case is different from that, this probably won’t help you.&lt;/p&gt;

&lt;p&gt;The tools we’ll be using are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;a href=&quot;http://nlp.stanford.edu/software/segmenter.shtml&quot;&gt;Stanford Word Segmenter, version 3.5.2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;http://nlp.stanford.edu/software/lex-parser.shtml&quot;&gt;Stanford Parser, version 3.5.2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;step-1-segmenting-chinese-text&quot;&gt;Step 1: Segmenting Chinese text&lt;/h3&gt;

&lt;p&gt;Mandarin Chinese is written without spaces between words, for example:&lt;/p&gt;

&lt;p&gt;世界就是一个疯子的囚笼&lt;br /&gt;
“The world is a den of crazies.”&lt;/p&gt;

&lt;p&gt;That’s a sentence from the &lt;a href=&quot;http://tatoeba.org/eng/&quot;&gt;Tatoeba sentence corpus&lt;/a&gt;,
which is what I’m working on parsing, by the way.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, all natural language processing on Chinese text
starts with word segmentation – we won’t get far by trying to interpret
that whole string as a single element. There are lots
of segmenters out there, including &lt;code class=&quot;highlighter-rouge&quot;&gt;jieba&lt;/code&gt; in Python, which I like, but they
may have different conventions for how they split things up. So if we’re going
to use the output of the segmentation in another Stanford tool downstream, it’s
best to stick to the Stanford Word Segmenter, whose usage is simple enough
with the script provided:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;./segment.sh pku path/to/input.file UTF-8 0 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; path/to/segmented.file&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The first argument can be either &lt;code class=&quot;highlighter-rouge&quot;&gt;pku&lt;/code&gt; (for Beijing (Peking) University)
or &lt;code class=&quot;highlighter-rouge&quot;&gt;ctb&lt;/code&gt; (for Chinese Treebank). According to the docs, &lt;code class=&quot;highlighter-rouge&quot;&gt;pku&lt;/code&gt; results
in “smaller vocabulary sizes and OOV rates on test data than CTB models”,
so I went with that.
The “0” at the end indicates that we want the single best guess at the segmentation,
without printing its associated probability.&lt;/p&gt;

&lt;p&gt;If you’re curious, the output of the segmenter on the sentence above is:&lt;/p&gt;

&lt;p&gt;世界   就   是   一个   疯子   的   囚笼&lt;/p&gt;

&lt;p&gt;which is an eminently sensible segmentation.&lt;/p&gt;
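
&lt;p&gt;Since the segmenter’s output is plain whitespace-delimited text, downstream code
can recover the tokens with a simple split; for example, in Python (the file path
in this sketch is hypothetical):&lt;/p&gt;

```python
def read_segmented(path):
    """Read the segmenter's output (one sentence per line)
    back into lists of tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

# the same idea on a single segmented sentence:
tokens = "世界   就   是   一个   疯子   的   囚笼".split()
```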

&lt;p&gt;The load times on the segmenter are pretty horrible, so it’s worth stuffing
all your text into a single file and segmenting everything in one go.&lt;/p&gt;

&lt;h3 id=&quot;step-2-parsing&quot;&gt;Step 2: Parsing&lt;/h3&gt;

&lt;p&gt;The Stanford parser gives two different kinds of output: a constituency
parse, which shows the syntactic structure of the sentence:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;(ROOT
  (IP
    (NP (NN 世界))
    (VP
      (ADVP (AD 就))
      (VP (VC 是)
        (NP
          (DNP
            (NP
              (QP (CD 一)
                (CLP (M 个)))
              (NP (NN 疯子)))
            (DEG 的))
          (NP (NN 囚笼)))))))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
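
&lt;p&gt;If you want to work with these bracketed trees programmatically, libraries like
NLTK can read them for you, but the format is also simple enough to parse by hand.
Here’s a minimal sketch that turns a tree into nested Python lists (assuming
well-formed parser output):&lt;/p&gt;

```python
def parse_tree(s):
    """Parse a Penn-style bracketed tree into nested lists:
    '(NP (NN 世界))' becomes ['NP', ['NN', '世界']]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        if tokens[pos] == "(":
            node = []
            pos += 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1  # skip the closing paren
        return tokens[pos], pos + 1
    tree, _ = read(0)
    return tree

tree = parse_tree("(ROOT (IP (NP (NN 世界)) (VP (VC 是))))")
```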

&lt;p&gt;And a dependency parse, which shows, broadly speaking, the grammatical relations
the words have to each other:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nsubj(囚笼-8, 世界-1)
advmod(囚笼-8, 就-2)
cop(囚笼-8, 是-3)
nummod(个-5, 一-4)
clf(疯子-6, 个-5)
assmod(囚笼-8, 疯子-6)
case(疯子-6, 的-7)
root(ROOT-0, 囚笼-8)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are specialized dependency parsers out there, but the Stanford parser first
does a constituency parse and converts it to a dependency parse. This
approach &lt;a href=&quot;http://nlp.stanford.edu/pubs/lrecstanforddeps_final_final.pdf&quot;&gt;seems to work better in general&lt;/a&gt;.&lt;/p&gt;
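
&lt;p&gt;Each line of the typed-dependencies output has the same shape,
relation(head-index, dependent-index), so it’s easy to pull apart with a
regular expression. A quick sketch (note that it assumes tokens don’t themselves
contain hyphens):&lt;/p&gt;

```python
import re

DEP_LINE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dep(line):
    """Split one typedDependencies line, e.g. 'nsubj(囚笼-8, 世界-1)',
    into (relation, (head, head_index), (dependent, dependent_index))."""
    m = DEP_LINE.match(line)
    rel, head, head_idx, dep, dep_idx = m.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))

rel, head, dep = parse_dep("nsubj(囚笼-8, 世界-1)")
```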

&lt;p&gt;There are five Chinese parsing models supplied with the software, which
you can see by &lt;code class=&quot;highlighter-rouge&quot;&gt;less&lt;/code&gt;-ing the &lt;code class=&quot;highlighter-rouge&quot;&gt;stanford-parser-3.5.2-models.jar&lt;/code&gt; file.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/xinhuaPCFG.ser.gz&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/parser-faq.shtml#o&quot;&gt;The FAQ&lt;/a&gt;
says that the PCFG grammars are the fastest, but the factored grammars are the
most accurate. So choosing either &lt;code class=&quot;highlighter-rouge&quot;&gt;xinhuaFactored&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;chineseFactored&lt;/code&gt;
is the way to go. The &lt;code class=&quot;highlighter-rouge&quot;&gt;xinhua&lt;/code&gt; models are trained on newswire data, while
the &lt;code class=&quot;highlighter-rouge&quot;&gt;chinese&lt;/code&gt; models include more varied types of text including some from
other regions, so select the model that best fits your data.&lt;/p&gt;

&lt;p&gt;In addition, there is a &lt;code class=&quot;highlighter-rouge&quot;&gt;xinhuaFactoredSegmenting&lt;/code&gt; model. This works on
unsegmented text, allowing us to bypass the segmentation procedure in Step 1.
However, this isn’t recommended as it doesn’t perform as well as the standalone
Segmenter.&lt;/p&gt;

&lt;p&gt;Now that we’ve chosen our model, it’s time to actually do the parsing.
There is a &lt;code class=&quot;highlighter-rouge&quot;&gt;lexparser-lang.sh&lt;/code&gt; helper script, but it assumes you’re using
GB18030 encoding for your Chinese text. It’s simple to edit the script
to include an &lt;code class=&quot;highlighter-rouge&quot;&gt;-encoding utf-8&lt;/code&gt; flag, but it’s not that much more difficult
to just construct the Java call yourself.&lt;/p&gt;

&lt;p&gt;Here’s how to get the constituency parse:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java
-mx500m
-cp stanford-parser.jar:stanford-parser-3.5.2-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-encoding utf-8
edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz
path/to/segmented.file &amp;gt; path/to/constituency.parsed.file
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To get the dependency parse, just add an &lt;code class=&quot;highlighter-rouge&quot;&gt;outputFormat&lt;/code&gt; flag, and specify
&lt;code class=&quot;highlighter-rouge&quot;&gt;typedDependencies&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java
-mx500m
-cp stanford-parser.jar:stanford-parser-3.5.2-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-encoding utf-8
-outputFormat typedDependencies
edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz
path/to/segmented.file &amp;gt; path/to/dependency.parsed.file
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Incidentally, the parse that was chosen for this sentence is &lt;em&gt;not&lt;/em&gt;
the intended reading – it’s interpreting the sentence as
“The world is the den of a single (unspecified) crazy person”.
Which seems scarily close to truth.&lt;/p&gt;

&lt;p&gt;You might therefore want to consider multiple parses.
To get them, we need to use one of the PCFG parsers
(not the factored parsers) and
add the flag &lt;code class=&quot;highlighter-rouge&quot;&gt;-printPCFGkBest n&lt;/code&gt;, where &lt;code class=&quot;highlighter-rouge&quot;&gt;n&lt;/code&gt; is 2 or more.&lt;/p&gt;

&lt;h3 id=&quot;troubleshooting&quot;&gt;Troubleshooting&lt;/h3&gt;

&lt;p&gt;The two errors I got while trying to do the parsing step had to do with
getting the appropriate Java version running and supplying the correct
classpath.&lt;/p&gt;

&lt;p&gt;Version 3.5.2 requires Java 8. If you don’t have it, it will turn up the
error &lt;code class=&quot;highlighter-rouge&quot;&gt;Unsupported major.minor version 52.0&lt;/code&gt;. If you get this error,
make sure that (a) you have Java 8 installed, and that
(b) &lt;code class=&quot;highlighter-rouge&quot;&gt;java&lt;/code&gt; invokes Java 8. To do the latter, do&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;update-alternatives &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; java&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;and select Java 8.&lt;/p&gt;

&lt;p&gt;The second error you may come across if you follow the commands supplied in
the docs is &lt;code class=&quot;highlighter-rouge&quot;&gt;Unable to resolve
&quot;edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz&quot;
as either class path, filename or URL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you get this, check the classpath (&lt;code class=&quot;highlighter-rouge&quot;&gt;-cp&lt;/code&gt;) argument you’re passing to Java.
It should have two parts: the parser &lt;code class=&quot;highlighter-rouge&quot;&gt;.jar&lt;/code&gt; and the models &lt;code class=&quot;highlighter-rouge&quot;&gt;.jar&lt;/code&gt;, separated
by a colon (a semicolon on Windows).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nt&quot;&gt;-cp&lt;/span&gt; stanford-parser.jar:stanford-parser-3.5.2-models.jar&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;I’m really grateful that Stanford makes all this great software available,
and particularly for non-English languages. I hope this guide saves someone
some time in getting the Chinese parser working. If all goes well, I’ll be
sharing what I’ve been using it for soon.&lt;/p&gt;
</description>
        <pubDate>Thu, 10 Sep 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/09/10/parsing-chinese-with-stanford/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/09/10/parsing-chinese-with-stanford/</guid>
        
        <category>natural language processing</category>
        
        <category>java</category>
        
        <category>stanford-nlp</category>
        
        <category>parsing</category>
        
        <category>chinese</category>
        
        <category>mandarin</category>
        
        
      </item>
    
      <item>
        <title>Making maps in Python</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;4. Cleaning text data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;5. Building a street name classifier with scikit-learn&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/06/20/pipelines/&quot;&gt;6. Adding features with Pipelines and Feature Unions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;a-web-map-in-two-lines-of-python&quot;&gt;A web map in two lines of Python&lt;/h3&gt;

&lt;p&gt;Here’s how to make a map from a GeoPandas GeoDataFrame in one step:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;colormap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'accent'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/map_accent.png&quot; alt=&quot;Basic map&quot; /&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code class=&quot;highlighter-rouge&quot;&gt;classification&lt;/code&gt; was the name of the column with our new
Malay/Chinese/British/Indian/Generic/Other labels on each road (row).&lt;/p&gt;

&lt;p&gt;What if we want to make this nice and interactive, like a &lt;a href=&quot;http://leafletjs.com/&quot;&gt;Leaflet&lt;/a&gt; map?
So we can pan and zoom and actually see street names?
There’s a library called &lt;a href=&quot;https://github.com/jwass/mplleaflet&quot;&gt;mplleaflet&lt;/a&gt;,
by Jake Wasserman, that can do this for you:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;mplleaflet&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mplleaflet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;display&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cartodb_positron'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;iframe width=&quot;800&quot; height=&quot;400&quot; src=&quot;/code-blog/assets/images/201506/sgmap2.html&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;(If you don’t see colours on that map, just reload the page.)&lt;/p&gt;

&lt;p&gt;To export it to an HTML page, you can do this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;mplleaflet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cartodb_positron'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'sgmap.html'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We don’t have much control over colours here, but it would be nice to theme them,
associating Chinese with its traditional red, Malay with its traditional green,
etc. Here’s a hacky way to do it:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# [u'British', u'Chinese', u'Generic', u'Indian', u'Malay', u'Other']&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# this is the order in which colours from a colourmap will be applied&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# British -&amp;gt; blue, Chinese -&amp;gt; red, etc...&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;my_colors&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'blue'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'red'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'gray'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'yellow'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'green'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'purple'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# create a colour map with these colours&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib.colors&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSegmentedColormap&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmap&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSegmentedColormap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'my cmap'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_colors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# do the plot&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ax2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;colormap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cmap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mplleaflet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cartodb_positron'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'sgmap2.html'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
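
&lt;p&gt;The pairing above depends on the labels coming out in sorted order: roughly speaking, the colormap hands out its colours to the sorted labels in sequence. We can sanity-check that in plain Python (the classification values here are hard-coded for illustration):&lt;/p&gt;

```python
# Hypothetical classification values, standing in for df['classification']
classifications = ['Chinese', 'British', 'Malay', 'Indian',
                   'Generic', 'Other', 'Chinese', 'British']

# Deduplicate and sort, as done before building the colormap
labels = sorted(set(classifications))
# alphabetical: British, Chinese, Generic, Indian, Malay, Other

my_colors = ['blue', 'red', 'gray', 'yellow', 'green', 'purple']

# Approximate the label-to-colour assignment the plot will make
label_to_color = dict(zip(labels, my_colors))
# e.g. 'British' -> 'blue', 'Malay' -> 'green'
```

&lt;p&gt;This is only a sketch: the real colormap is continuous and the plot samples it, but for six evenly spaced categories the assignment works out the same.&lt;/p&gt;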

&lt;h3 id=&quot;alternatives&quot;&gt;Alternatives&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mplleaflet&lt;/code&gt; is awesome for exploratory data analysis, but you might want to have more control over how your map looks. For this, I recommend using one of the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;QGIS (C++ but has Python bindings)&lt;/li&gt;
  &lt;li&gt;Mapnik (C++ but has Python bindings)&lt;/li&gt;
  &lt;li&gt;TileMill (GUI built on top of Mapnik)&lt;/li&gt;
  &lt;li&gt;Folium (maybe, haven’t investigated fully)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A nice feature of TileMill is that it allows you to define your map styling using CartoCSS. For example, here’s how we would define the colours:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-css&quot; data-lang=&quot;css&quot;&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Malay'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;green&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'British'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;blue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Chinese'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;red&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Indian'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;yellow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Other'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;purple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Generic'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;gray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can also control the line width at various zoom levels:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-css&quot; data-lang=&quot;css&quot;&gt;&lt;span class=&quot;nt&quot;&gt;line-opacity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;.7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If these are too fiddly, some web mapping solutions also use CartoCSS.
I really like &lt;a href=&quot;https://cartodb.com&quot;&gt;CartoDB&lt;/a&gt;, which is how I made my main map:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;520&quot; frameborder=&quot;0&quot; src=&quot;http://michelleful.cartodb.com/viz/b722485c-dbf6-11e4-9a7e-0e0c41326911/embed_map&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;We can browse this map to look at clusters of street names, which are now conveniently colour-coded for our analysis!&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;It’s remarkably easy to make maps with GeoPandas and ancillary libraries like &lt;code class=&quot;highlighter-rouge&quot;&gt;mplleaflet&lt;/code&gt;, thanks to the developers of these libraries :)&lt;/p&gt;

&lt;p&gt;That’s all the technical stuff in this series. Next time, I’ll round everything off
and talk about what I learned about Singapore street names from doing this project.&lt;/p&gt;
</description>
        <pubDate>Wed, 15 Jul 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/07/15/making-maps/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/07/15/making-maps/</guid>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>mapping</category>
        
        <category>geopandas</category>
        
        <category>matplotlib</category>
        
        <category>mplleaflet</category>
        
        <category>cartodb</category>
        
        
      </item>
    
      <item>
        <title>Using Pipelines and FeatureUnions in scikit-learn</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;4. Cleaning text data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;5. Building a street name classifier with scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the &lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;last article&lt;/a&gt;, we built a baseline classifier for street names. The results were a bit disappointing at 55% accuracy. In this article, we’ll add more features, and streamline the code with &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;Pipeline&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; classes.&lt;/p&gt;

&lt;p&gt;I learned a lot about Pipelines and FeatureUnions from &lt;a href=&quot;http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html&quot;&gt;Zac Stewart’s article on the subject&lt;/a&gt;, which I recommend.&lt;/p&gt;

&lt;h3 id=&quot;adding-features&quot;&gt;Adding features&lt;/h3&gt;

&lt;p&gt;There’s a great paper called &lt;a href=&quot;http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf&quot;&gt;&lt;em&gt;A few useful things to know about machine learning&lt;/em&gt;&lt;/a&gt;
by Pedro Domingos, one of the most prominent researchers in the field, in which he says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used…This is typically where most of the effort in a machine learning project goes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So far I’d only used n-grams. But there were other sources of information I wasn’t using. Some ideas I had for more features were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Number of words in name
    &lt;ul&gt;
      &lt;li&gt;More words: likely to be Chinese (e.g. “Ang Mo Kio Avenue 1”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Average word length
    &lt;ul&gt;
      &lt;li&gt;Shorter: likely to be Chinese (e.g. “Ang Mo Kio”)&lt;/li&gt;
      &lt;li&gt;Longer: likely to be British or Indian (e.g. “Kadayanallur Street”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Are all the words in the dictionary?
    &lt;ul&gt;
      &lt;li&gt;Yes: likely to be Generic (e.g. “Cashew Road”). Funny exception: Boon Lay Way (Chinese)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Is the “road tag” Malay?
    &lt;ul&gt;
      &lt;li&gt;Yes: likely Malay (e.g. “Jalan Bukit Merah”, “Lorong Penchalak”, vs “Upper Thomson Road”, “Ang Mo Kio Avenue 1”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
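
&lt;p&gt;These ideas can be sketched as small feature functions. The Malay road-tag set below is my own illustrative guess, not the list used in the project:&lt;/p&gt;

```python
# Assumed (incomplete) set of Malay road tags, for illustration only
MALAY_ROAD_TAGS = {'jalan', 'lorong', 'taman', 'lengkok'}

def num_words(name):
    """Number of words in the street name."""
    return len(name.split())

def avg_word_length(name):
    """Average word length in the street name."""
    words = name.split()
    return sum(len(w) for w in words) / len(words)

def has_malay_road_tag(name):
    """Malay road tags typically come first, e.g. 'Jalan Bukit Merah'."""
    return name.split()[0].lower() in MALAY_ROAD_TAGS

num_words('Ang Mo Kio Avenue 1')         # 5
avg_word_length('Kadayanallur Street')   # 9.0
has_malay_road_tag('Jalan Bukit Merah')  # True
```

&lt;p&gt;Each function maps a name to one number (or boolean), which is exactly the shape a feature column needs.&lt;/p&gt;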

&lt;p&gt;How to incorporate these into the previous code? Let’s look at the code we needed to create the n-gram feature matrix:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# build the feature matrices&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ngram_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'char'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# train the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# test the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To add the new features, what we’re looking at is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Writing functions that produce a feature vector for each feature&lt;/li&gt;
  &lt;li&gt;Repeating the &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;fit&lt;/code&gt; lines for each feature&lt;/li&gt;
  &lt;li&gt;Adding two lines of code where we combine the resultant &lt;code class=&quot;highlighter-rouge&quot;&gt;numpy&lt;/code&gt; matrices into one giant training feature matrix and one testing feature matrix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This may not seem like a huge deal, but it is repetitive, and that repetition opens us up to errors such as calling &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt; on the testing data rather than just &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fortunately, &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; gives us a better way: Pipelines.&lt;/p&gt;

&lt;h3 id=&quot;pipelines&quot;&gt;Pipelines&lt;/h3&gt;

&lt;p&gt;Another way to think about the code above is to imagine a pipeline that takes in our input data, puts it through a transformer – the n-gram counter – and then through a final estimator – the SVC classifier – to produce a trained model, which we can then use for prediction.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/simple_pipeline.png&quot; alt=&quot;Simple machine learning pipeline&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is precisely what the &lt;code class=&quot;highlighter-rouge&quot;&gt;Pipeline&lt;/code&gt; class in &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; does:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# build the pipeline&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ppl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
              &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ngram'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ngram_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'char'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
              &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'clf'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# train the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ppl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# test the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Notice that this time, we’re operating on &lt;code class=&quot;highlighter-rouge&quot;&gt;data_train&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;data_test&lt;/code&gt;,
i.e. just the lists of road names. We didn’t have to manually create a separate
feature matrix for training and testing – the pipeline takes care of that.&lt;/p&gt;

&lt;h3 id=&quot;creating-a-new-transformer&quot;&gt;Creating a new transformer&lt;/h3&gt;

&lt;p&gt;Now we want to add a new feature – average word length. There’s no built-in
feature extractor like &lt;code class=&quot;highlighter-rouge&quot;&gt;CountVectorizer&lt;/code&gt; for this, so we’ll have to write our
own transformer. Here’s the code to do that. This time, instead of a list of
names, we’re going to start passing in a &lt;code class=&quot;highlighter-rouge&quot;&gt;Pandas&lt;/code&gt; dataframe, which has a column
for the street name and another column for the “road tag”
(Street, Avenue, Jalan, etc).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.base&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AverageWordLengthExtractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Takes in dataframe, extracts road name column, outputs average word length&quot;&quot;&quot;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;average_word_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Helper code to compute average word length of a name&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()])&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;The workhorse of this feature extractor&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'road_name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;average_word_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Returns `self` unless something different happens in train and test&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
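
&lt;p&gt;Outside a pipeline, &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt; is just an &lt;code class=&quot;highlighter-rouge&quot;&gt;apply&lt;/code&gt; over the &lt;code class=&quot;highlighter-rouge&quot;&gt;road_name&lt;/code&gt; column. Here’s a toy sketch of that core logic on a made-up two-row dataframe:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the real street-name data
df = pd.DataFrame({'road_name': ['Ang Mo Kio Avenue 1', 'Kadayanallur Street'],
                   'road_tag':  ['Avenue', 'Street']})

def average_word_length(name):
    # Same helper logic as in the transformer's average_word_length method
    return np.mean([len(word) for word in name.split()])

features = df['road_name'].apply(average_word_length)
# word lengths (3, 2, 3, 6, 1) -> 3.0 and (12, 6) -> 9.0
```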

&lt;p&gt;Unless you’re doing something more complicated where something different happens
in the training and testing phase (like when extracting n-grams),
this is the general pattern for a transformer:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.base&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SampleExtractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# e.g. pass in a column name to extract&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;do_something_to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# where the actual feature extraction happens&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# generally does nothing&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now that we’ve created our transformer, it’s time to add it into the pipeline.&lt;/p&gt;

&lt;h3 id=&quot;featureunions&quot;&gt;FeatureUnions&lt;/h3&gt;

&lt;p&gt;We have a slight problem: we only know how to add transformers in series, but
what we need to do is to add our average word length transformer in parallel
with the n-gram extractor. Like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/more_complex_pipeline.png&quot; alt=&quot;Parallel machine learning pipeline&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For this, there is &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; class.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.pipeline&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FeatureUnion&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'feats'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FeatureUnion&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ngram'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_count_pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# can pass in either a pipeline&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ave'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AverageWordLengthExtractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# or a transformer&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;])),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'clf'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# classifier&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Notice that the first item in the &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; is &lt;code class=&quot;highlighter-rouge&quot;&gt;ngram_count_pipeline&lt;/code&gt;.
This is just a &lt;code class=&quot;highlighter-rouge&quot;&gt;Pipeline&lt;/code&gt; built from a column-extracting transformer
and &lt;code class=&quot;highlighter-rouge&quot;&gt;CountVectorizer&lt;/code&gt;. (The column extractor is necessary
now that we’re operating on a &lt;code class=&quot;highlighter-rouge&quot;&gt;Pandas&lt;/code&gt; dataframe
rather than sending the list of road names directly through the pipeline.)&lt;/p&gt;

&lt;p&gt;That’s perfectly okay: a pipeline is itself just a giant transformer, and
is treated as such. That makes it easy to write complex pipelines by
building smaller pieces and then putting them together in the end.&lt;/p&gt;
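&lt;p&gt;For concreteness, here’s a minimal sketch of what &lt;code class=&quot;highlighter-rouge&quot;&gt;ngram_count_pipeline&lt;/code&gt; could look like. The &lt;code class=&quot;highlighter-rouge&quot;&gt;TextExtractor&lt;/code&gt; name and its details are illustrative, not the exact code from this project:&lt;/p&gt;

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

class TextExtractor(BaseEstimator, TransformerMixin):
    """Pull a single text column out of a pandas DataFrame."""
    def __init__(self, column):
        self.column = column

    def transform(self, X, y=None):
        return X[self.column]  # a Series of strings, ready for CountVectorizer

    def fit(self, X, y=None):
        return self  # nothing to learn

# a sub-pipeline: extract the column, then count character n-grams
ngram_count_pipeline = Pipeline([
    ('extract', TextExtractor('road_name')),
    ('count', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
])
```

&lt;p&gt;Because this sub-pipeline is itself a transformer, it can slot straight into the &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; above.&lt;/p&gt;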

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;So what happened after adding in all these new features? Accuracy went up
to 65%, a decent improvement. Note that using Pipelines and FeatureUnions
did not in itself improve performance. They’re just another way of
organising your code for readability, reusability and easier experimentation.&lt;/p&gt;

&lt;p&gt;If you’re looking to do hyperparameter tuning (which I won’t explain here),
pipelines make that easy, as below:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.model_selection&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GridSearchCV&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'clf__C'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GridSearchCV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param_grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;best_params_&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# {'clf__C': 0.1}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;best_score_&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# 0.702290076336&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Ultimately, after adding in more features, adding more data, and doing
hyperparameter tuning, I had about 75-80% accuracy, which was good enough for me.
I only had to hand-correct 20-25% of the roads, which didn’t seem too daunting.
I was ready to make my map. That’s what we’ll do in &lt;a href=&quot;/code-blog/2015/07/15/making-maps/&quot;&gt;the next article&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Sat, 20 Jun 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/06/20/pipelines/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/06/20/pipelines/</guid>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>scikit-learn</category>
        
        <category>machine-learning</category>
        
        
      </item>
    
      <item>
        <title>Building a street name classifier with scikit-learn</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;4. Cleaning text data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this fifth article, we’ll look at how to build a classifier, classifying street names by linguistic origin, using &lt;a href=&quot;http://scikit-learn.org/stable/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;step-1-pick-a-classification-schema&quot;&gt;Step 1: pick a classification schema&lt;/h3&gt;

&lt;p&gt;Often, when building a classifier, you have a pretty good idea of what you want to classify your items as: as spam or ham, as one of these six species of iris, etc. For me, it was a bit less clear. There’s the obvious “big four” ethnicities of Singapore: Chinese, Malay, Indian, and Other. But there are dialects (really, languages) of Chinese, ditto with Indian, and how does one split up “Other”?&lt;/p&gt;

&lt;p&gt;In the end, after some data exploration and some thought about what I wanted to see on the map, I went with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Chinese (all dialects including Cantonese, Hokkien, Mandarin, etc)&lt;/li&gt;
  &lt;li&gt;Malay&lt;/li&gt;
  &lt;li&gt;Indian (all languages of the subcontinent)&lt;/li&gt;
  &lt;li&gt;British&lt;/li&gt;
  &lt;li&gt;Generic (Race Course Road, Sunrise Place)&lt;/li&gt;
  &lt;li&gt;Other (generally other languages).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six seemed about right: reducing the number of categories would make for meaningless clusters; increasing the number of categories would result in an indecipherable map.&lt;/p&gt;

&lt;p&gt;So that’s Step 1 done.&lt;/p&gt;

&lt;h3 id=&quot;step-2-create-some-training-and-testing-data&quot;&gt;Step 2: create some training and testing data&lt;/h3&gt;

&lt;p&gt;To train the classifier, we need to give it some examples: MacPherson is a British name, Keng Lee is a Chinese name. So I went ahead and hand-coded about 10% of the dataset (200 street names).&lt;/p&gt;

&lt;p&gt;This was pretty tricky because even when you’ve picked a classification schema, it may not be obvious how to categorise individual items into those categories. For example, “Florence Road” is named after a Chinese woman, Florence Yeo. But the street name sounds pretty English, or perhaps it should be under Other since it’s derived from the Latin. So I came up with some guidelines for myself on how to categorise them. (“Florence Road” was classified Chinese, in the end – pretty much impossible for the classifier to get it right, but that’s how I wanted it in the map.)&lt;/p&gt;

&lt;p&gt;Once we have this data, we need to divide it into a train set and a test set. &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; gives us a function, &lt;code class=&quot;highlighter-rouge&quot;&gt;train_test_split&lt;/code&gt;, to do this easily:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.model_selection&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;train_test_split&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_true&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; \
    &lt;span class=&quot;n&quot;&gt;train_test_split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'road_name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;test_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Here, &lt;code class=&quot;highlighter-rouge&quot;&gt;data_train&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;data_test&lt;/code&gt; are the street names, while &lt;code class=&quot;highlighter-rouge&quot;&gt;y_train&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;y_true&lt;/code&gt; are the corresponding classifications into British, Chinese, Malay, etc. We did an 80-20 split, which is quite standard.&lt;/p&gt;

&lt;h3 id=&quot;step-3-choose-features&quot;&gt;Step 3: Choose features&lt;/h3&gt;

&lt;p&gt;Classifiers don’t really work on strings like street names. They work on numbers, either integers or reals. So we need to find a way to convert our street names to something numeric that the classifier can sink its teeth into.&lt;/p&gt;

&lt;p&gt;One really common text feature is n-gram counts. These are overlapping substrings of length n. To make this concrete, take the street name “(Jalan) Malu-Malu”, focusing just on the “Malu-Malu” part.&lt;/p&gt;

&lt;p&gt;There are five 1-grams, or unigrams: “m” (count: 2), “a” (2), “l” (2), “u” (2), and “-” (1).&lt;/p&gt;

&lt;p&gt;The 2-grams, or bigrams, are “ma” (count: 2), “al” (2, notice the overlap!), “lu” (2), and so on. In addition, we often put a special character at the beginning and end, let’s call it “#”, so there’s also “#m” (count: 1), “u#” (count: 1).&lt;/p&gt;

&lt;p&gt;The 3-grams, or trigrams, are “##m” (count: 1), “#ma” (1), “mal” (2), etc. You get the picture.&lt;/p&gt;
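&lt;p&gt;If you want to check these counts yourself, a throwaway sketch with Python’s &lt;code class=&quot;highlighter-rouge&quot;&gt;collections.Counter&lt;/code&gt; does the job (the “#” padding is added by hand here):&lt;/p&gt;

```python
from collections import Counter

def char_ngrams(text, n, pad='#'):
    """Count character n-grams, padding each end with n-1 boundary markers."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

bigrams = char_ngrams('malu-malu', 2)
# bigrams['ma'] == 2, bigrams['#m'] == 1, bigrams['u#'] == 1

trigrams = char_ngrams('malu-malu', 3)
# trigrams['##m'] == 1, trigrams['mal'] == 2
```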

&lt;p&gt;Why pick n-grams? Basically, we need features that are simple to compute and that discriminate between the various categories. Here are the counts of one bigram, “ck”, in each category, along with example street names that contain it:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;British&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Chinese&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;   Malay   &lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Indian&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;23&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;17&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Alnwick&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Boon Teck&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Berwick&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Hock Chye&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;  Brickson  &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Kheam Hock&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;…&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;…&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, when the classifier sees “ck” in a street name, it can say with confidence that it’s not Malay or Indian. Basically, n-grams are a quick and easy way to capture the orthotactic patterns of a language: what letter combinations are likely to occur?&lt;/p&gt;

&lt;p&gt;I promised that computing these would be easy. That’s because &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; has our back for computing these n-gram counts, in the form of the &lt;code class=&quot;highlighter-rouge&quot;&gt;CountVectorizer&lt;/code&gt; class. Here’s how to use it:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# compute n-grams of size 1 through 4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ngram_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'char'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This gives us &lt;code class=&quot;highlighter-rouge&quot;&gt;X_train&lt;/code&gt;, a sparse matrix with one row
per street name, one column per n-gram, and each cell holding the count of that
n-gram in that street name.&lt;/p&gt;

&lt;p&gt;Notice that we call a different method on the training data, &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt;,
than on the test data, where it’s just &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt;. The reason is that
we need the exact same features in testing as in training.
There’s no point counting a new n-gram that appears only in the test set, since
the classifier will not have any information about how well it correlates
with the various labels.&lt;/p&gt;
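&lt;p&gt;You can see this behaviour with a toy example: an n-gram that appears only in the test data is silently ignored at &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt; time.&lt;/p&gt;

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(2, 2), analyzer='char')
vec.fit_transform(['abc'])     # the learned vocabulary is {'ab', 'bc'}

row = vec.transform(['abxy'])  # 'bx' and 'xy' were never seen in training...
print(row.sum())               # ...so only 'ab' is counted: prints 1
```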

&lt;h3 id=&quot;step-4-select-a-classifier&quot;&gt;Step 4: Select a classifier&lt;/h3&gt;

&lt;p&gt;There are a bunch of classification algorithms included in &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;.
They all share the same API, so it’s really easy to swap them around. But
we need to know where to start. The &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; folks helpfully provide
this diagram to pick a classification tool.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://scikit-learn.org/stable/_static/ml_map.png&quot; alt=&quot;Scikit-learn classifier choice diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you follow the steps, you wind up at Linear SVC, so that’s what we’ll use.&lt;/p&gt;

&lt;h3 id=&quot;step-5-train-the-classifier&quot;&gt;Step 5: Train the classifier&lt;/h3&gt;

&lt;p&gt;First, the code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now, let’s get some intuition for what’s going on.&lt;/p&gt;

&lt;p&gt;We can think of each of our street names as a point in an n-dimensional feature space.
For the purposes of illustration, let’s pretend there are just 2 features, and that it
looks like this, with red crosses representing Chinese street names and blue dots
representing British street names.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm1_new.png&quot; alt=&quot;Plotting fake street names in 2-dimensional space.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What the Linear SVC classifier does is to draw a line in between the two sets of points
as best it can, with as large a margin as possible.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm2_new.png&quot; alt=&quot;How linear SVC works: draw a line between the two sets of points with as large a margin as possible&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This line is our model.&lt;/p&gt;

&lt;p&gt;Now suppose we have two new points that we don’t know the labels of.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm3_new.png&quot; alt=&quot;Introducing two new unknown points&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The classifier looks at where they fall with respect to the line, and tells us whether they’re Chinese or British.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm4_new.png&quot; alt=&quot;Classify the new points based on where they fall with respect to the line&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Obviously, I’ve simplified a lot of things.
In higher-dimensional space, the line becomes a hyperplane.
And of course, not all datasets fall so smoothly into separate camps.
But the basic intuition is still the same.&lt;/p&gt;
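&lt;p&gt;Here’s the same intuition in code, with two made-up 2-dimensional points per class (the numbers are purely illustrative):&lt;/p&gt;

```python
from sklearn.svm import LinearSVC

# pretend each street name boils down to just two numeric features
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
y = ['Chinese', 'Chinese', 'British', 'British']

model = LinearSVC().fit(X, y)

# new points are labelled according to the side of the line they fall on
print(model.predict([[0.1, 0.0], [0.9, 0.9]]))
```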

&lt;h3 id=&quot;step-6-test-the-classifier&quot;&gt;Step 6: Test the classifier&lt;/h3&gt;

&lt;p&gt;At the end of the last step, we had &lt;code class=&quot;highlighter-rouge&quot;&gt;model&lt;/code&gt;, a trained classifier object.
We can now use it to classify new data, as explained above,
and see how well it does by comparing its predictions
against the actual labels I hand-coded in Step 2.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.metrics&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accuracy_score&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;accuracy_score&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# 0.551818181818&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; has &lt;a href=&quot;http://scikit-learn.org/stable/modules/classes.html&quot;&gt;a bunch of metrics built in&lt;/a&gt;. Choose the one that best
reflects how you’ll use and assess the classifier. In my case, my workflow
was to use the classifier to predict the labels of streets I had never
hand-coded, and correct the ones that were incorrect, rather than doing
everything from scratch. I wanted to save time by having as few incorrect ones
as possible, so accuracy was the right metric. But if you have different priorities,
other metrics might make more sense.&lt;/p&gt;
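&lt;p&gt;For example, &lt;code class=&quot;highlighter-rouge&quot;&gt;classification_report&lt;/code&gt; breaks performance down into per-class precision and recall, which is useful when some categories matter more than others (toy labels below, not the real data):&lt;/p&gt;

```python
from sklearn.metrics import classification_report

truth     = ['Malay', 'Malay', 'Chinese', 'British', 'British']
predicted = ['Malay', 'Chinese', 'Chinese', 'British', 'British']

# per-class precision, recall and F1, rather than one overall number
print(classification_report(truth, predicted))
```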

&lt;h3 id=&quot;improving-the-classifier&quot;&gt;Improving the classifier&lt;/h3&gt;

&lt;p&gt;So we wound up with an accuracy of 55%. That sounds like chance, but it isn’t:
we had 6 categories, so chance is really 16.6%.&lt;/p&gt;

&lt;p&gt;There’s another super-dumb baseline: predict the most common class, Malay,
for everything. That would give us 35% accuracy, so we’re 20 percentage points
above that baseline.&lt;/p&gt;
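&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; can compute this sort of baseline for you via &lt;code class=&quot;highlighter-rouge&quot;&gt;DummyClassifier&lt;/code&gt;, a handy sanity check (toy data for illustration):&lt;/p&gt;

```python
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4]]                          # features are ignored
y = ['Malay', 'Malay', 'Malay', 'Chinese', 'British']  # 'Malay' is most common

# always predicts the most frequent training label
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print(baseline.score(X, y))  # 3 of 5 correct: 0.6
```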

&lt;p&gt;Our likely upper bound is around 90%, because of names like “Florence” where it’s
really unclear. We’re 35% away from that, so it should be possible to make things
a lot better.&lt;/p&gt;

&lt;p&gt;Here are some ideas for improving it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Use more data.&lt;/strong&gt; More training data is always better, but it’s more work.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Try other classifiers.&lt;/strong&gt; We could swap in another classifier for Linear SVC. It might help.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Add more features.&lt;/strong&gt; Yes! There’s a lot of information in the data that’s not reflected by n-grams. We could try that.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tune hyperparameters.&lt;/strong&gt; We invoked &lt;code class=&quot;highlighter-rouge&quot;&gt;LinearSVC&lt;/code&gt; with no arguments, but we can pass it hyperparameters that tweak how it works. This is pretty fiddly, so let’s see where the other strategies get us first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;a href=&quot;/code-blog/2015/06/20/pipelines/&quot;&gt;the next article&lt;/a&gt;, I’ll talk about how to easily add more features to our classifier. Till then.&lt;/p&gt;
</description>
        <pubDate>Thu, 18 Jun 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/06/18/classifying-roads/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/06/18/classifying-roads/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        <category>scikit-learn</category>
        
        
      </item>
    
      <item>
        <title>Cleaning text data with fuzzywuzzy</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this fourth article, we’ll look at how to clean text data with the &lt;a href=&quot;https://github.com/seatgeek/fuzzywuzzy&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; library&lt;/a&gt; from SeatGeek.&lt;/p&gt;

&lt;h3 id=&quot;use-case&quot;&gt;Use case&lt;/h3&gt;

&lt;p&gt;The road data I downloaded from OpenStreetMap had some obvious errors among the street names,
mostly misspellings. For example, there was “Aljuneid Avenue 1” when the correct spelling
is “Aljunied”. This was problematic since (1) misspellings make our ultimate goal of classification difficult, and (2) we can’t unify roads that share a name, like “Aljunied Avenue 2”, giving us more work to do. I could have gone through the list manually, but it would have been time-consuming.&lt;/p&gt;

&lt;p&gt;My solution was to get a better list from outside OpenStreetMap, and match the less correct road names to it using a library called &lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt;, for fuzzy string matching. Here’s how it works:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;fuzzywuzzy&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Aljunied Avenue 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Aljunied Avenue 2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Aljuneid Avenue 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Aljunied Avenue 1'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;94&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The first element of the return tuple indicates the closest match in the reference list,
and the second number is a score showing how close it is. An exact match is 100.&lt;/p&gt;

&lt;p&gt;Sometimes, when the correct road name wasn’t in the reference set either, the score
would be pretty low – which is as it should be!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Elgin Bridge'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Jalan Woodbridge'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Cantonment Close'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Jago Close'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;85&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I decided to set a threshold of 90: above it, I would automatically accept the match
&lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; came up with; below it, I would
manually review the road name and decide what it should be.&lt;/p&gt;

&lt;h3 id=&quot;using-fuzzywuzzy-in-pandas&quot;&gt;Using fuzzywuzzy in Pandas&lt;/h3&gt;

&lt;p&gt;So what we want is to apply &lt;code class=&quot;highlighter-rouge&quot;&gt;process.extractOne()&lt;/code&gt; to the roadname column
of our dataframe. This was my first attempt:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'corrected'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'score'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It took &lt;em&gt;forever&lt;/em&gt;! The reason is that &lt;code class=&quot;highlighter-rouge&quot;&gt;extractOne&lt;/code&gt; does a pairwise
comparison of every name in the dataframe against every name in
the canonical list. But when a name is already correct, which is the majority
of the time, none of those pairwise comparisons are needed.
So I added a preliminary exact-match check against the list of correct
names, which cut the running time down considerably.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# might want to make this a set for O(1) lookups&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'corrected'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'score'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can add other checks as well: for example, I only accepted a &amp;gt;90 match
if the number of words was the same. Use whatever makes sense for your particular
use case.&lt;/p&gt;
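&lt;p&gt;As a concrete sketch of such a check (using the standard library’s &lt;code class=&quot;highlighter-rouge&quot;&gt;difflib&lt;/code&gt; as a stand-in scorer so the snippet is self-contained; &lt;code class=&quot;highlighter-rouge&quot;&gt;process.extractOne&lt;/code&gt; would slot in the same way):&lt;/p&gt;

```python
import difflib

def similarity(a, b):
    # 0-100 similarity, roughly analogous to a fuzzywuzzy score
    return int(round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()))

def correct_road(roadname, candidates, threshold=90):
    # Stand-in for process.extractOne: best candidate plus its score
    new_name = max(candidates, key=lambda c: similarity(roadname, c))
    score = similarity(roadname, new_name)
    # Only auto-accept a match that clears the threshold AND has the
    # same number of words, to avoid "Elgin Bridge" -> "Jalan Woodbridge"
    if score >= threshold and len(new_name.split()) == len(roadname.split()):
        return new_name, score
    return roadname, score  # keep the original for manual review
```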

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;After getting the corrected dataframe,
I went into OpenStreetMap and edited most of the
incorrect road names, so hopefully Singapore street names are mostly
correctly spelled now. The &lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; library was a big help in cutting
down the number of roads I needed to manually review, so I recommend
adding it to your data cleaning arsenal.&lt;/p&gt;
</description>
        <pubDate>Wed, 20 May 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        
      </item>
    
      <item>
        <title>Geodata manipulation with GeoPandas</title>
        <description>&lt;p&gt;Previous articles in this series are: &lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt; and &lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this third article, we’ll look at how to manipulate geodata with GeoPandas and its related libraries.&lt;/p&gt;

&lt;h3 id=&quot;filtering-to-roads-within-singapore&quot;&gt;Filtering to roads within Singapore&lt;/h3&gt;

&lt;p&gt;Recall from last time that our first OSM data-gathering method, &lt;a href=&quot;https://mapzen.com/metro-extracts/&quot;&gt;Metro Extracts&lt;/a&gt;, gave us too many roads:
we got roads in Malaysia and Indonesia, and even some ferry lines.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-roads.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_toomanyroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - too many roads because of overly generous bounding box&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But it also gave us the administrative boundary of Singapore.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-admin.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Inspecting the file shows we want just the first row&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# In an IPython Notebook, this will plot the Polygon&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_admin_boundary.png&quot; alt=&quot;Singapore administrative boundary&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So now let’s filter to just the roads within these administrative boundaries. It’s as easy as one line:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;within&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Let’s plot that to make sure we got what we want:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_filteredroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - filtered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Yippee! And that’s just one of the functions made available by GeoPandas.
Take a look at &lt;a href=&quot;http://geopandas.readthedocs.org/en/latest/user.html&quot;&gt;this page&lt;/a&gt; to see what other kinds of manipulation you can do this way.&lt;/p&gt;

&lt;h3 id=&quot;clearing-up-a-pandas-misunderstanding&quot;&gt;Clearing up a Pandas misunderstanding&lt;/h3&gt;

&lt;p&gt;Let me take this opportunity to clear up a fundamental Pandas misunderstanding I had when trying to make this work, one that
other people might share. My first attempt at writing this code looked like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Here's the change. 'Singapura' is the Malay name for Singapore&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Singapura'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Let's check the type of this object&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;geopandas&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geoseries&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GeoSeries&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;within&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/lonely_orchard_road.png&quot; alt=&quot;Incorrect filtering yields a single road&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I would always get precisely one road – the first road of &lt;code class=&quot;highlighter-rouge&quot;&gt;df&lt;/code&gt; – back. Jake Wasserman explained to me why. (You’re going to see his name
a lot in this series, because he helped me a lot with questions and code – thanks, Jake!)
&lt;code class=&quot;highlighter-rouge&quot;&gt;sg_boundary&lt;/code&gt; is a GeoSeries at this point, not a single value. The two vectors are therefore compared pairwise:
the first item of the series &lt;code class=&quot;highlighter-rouge&quot;&gt;df.geometry&lt;/code&gt; is compared with the first item of &lt;code class=&quot;highlighter-rouge&quot;&gt;sg_boundary&lt;/code&gt;,
the second item with the second, and so on. In this case, of course, there &lt;em&gt;is&lt;/em&gt; no second
item in the &lt;code class=&quot;highlighter-rouge&quot;&gt;sg_boundary&lt;/code&gt; GeoSeries, so the comparison returns False for that row and for
all subsequent rows.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;within&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;0      True
1     False
2     False
3     False
4     False
5     False&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And thus we’re left with just the first row of the GeoDataFrame &lt;code class=&quot;highlighter-rouge&quot;&gt;df&lt;/code&gt;, since that’s the only one whose index value is True.&lt;/p&gt;

&lt;p&gt;Moral of the story: be clear on whether you’re filtering against a scalar or a vector.&lt;/p&gt;
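&lt;p&gt;The same index-alignment behaviour can be reproduced with plain Pandas (a toy example, not the original geodata):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'name': ['Orchard', 'Lentor', 'Orchard']})

# Vector: a one-element Series. .eq() aligns on the index, so only
# row 0 gets a real comparison; rows 1 and 2 come out False.
one_element = pd.Series(['Orchard'])
vector_mask = df['name'].eq(one_element)

# Scalar: extract the single value first, and every row is compared.
scalar_mask = df['name'].eq(one_element.iloc[0])
```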

&lt;h3 id=&quot;something-a-bit-more-complicated&quot;&gt;Something a bit more complicated&lt;/h3&gt;

&lt;p&gt;Singapore road names are diverse and awesome. But on occasion (quite a lot of occasions, it must be admitted),
the road planners ran out of imagination and did things like this:&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/lentor.png&quot; alt=&quot;Lentor neighbourhood, where all the roads save two are named Lentor something&quot; /&gt;
&lt;small class=&quot;center-block&quot;&gt;© Open Street Map contributors&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;So each “road name” like “Lentor” represents not just one road but a potential multitude of roads.
Suppose we want to give a geographic identity to each of these names - say, the centroid of all the roads with the same base name.
Pandas/GeoPandas and the Shapely library make that fairly straightforward.&lt;/p&gt;

&lt;p&gt;First, we process the full road names in the GeoDataFrame to remove “tags” like “Avenue”, “Street”, etc., and modifiers like numbers.
We call the resultant column &lt;code class=&quot;highlighter-rouge&quot;&gt;road_name&lt;/code&gt;. We do a &lt;code class=&quot;highlighter-rouge&quot;&gt;groupby&lt;/code&gt; on this column to gather together all the roads with the same name.
We then call an aggregate function on this &lt;code class=&quot;highlighter-rouge&quot;&gt;groupby&lt;/code&gt; to merge all the LineStrings in the &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; column together into a MultiLineString.
Then we obtain the centroids of these MultiLineStrings.&lt;/p&gt;
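&lt;p&gt;The tag-stripping step isn’t spelled out in this post; a minimal sketch might look like this (the tag list here is hypothetical and far from exhaustive):&lt;/p&gt;

```python
import re

# Hypothetical, non-exhaustive set of road "tags"; a real cleanup
# would need the full inventory of Singapore road-name tags.
TAGS = {'Avenue', 'Street', 'Road', 'Lane', 'Drive', 'Close', 'Crescent'}

def base_name(full_name):
    # Drop a trailing number ("Lentor Avenue 1" -> "Lentor Avenue") ...
    words = re.sub(r'\s*\d+$', '', full_name).split()
    # ... then drop trailing tag words ("Lentor Avenue" -> "Lentor").
    while words and words[-1] in TAGS:
        words.pop()
    return ' '.join(words)

# df['road_name'] = df['name'].apply(base_name)  # then groupby('road_name')
```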

&lt;p&gt;Here’s the code, written by Jake Wasserman (slightly modified):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;shapely.ops&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;centroids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupby&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'road_name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'geometry'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shapely&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linemerge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;centroid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;road_name
Abingdon          POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.9798720899801 1.36742402697363&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Abu Talib                    POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.92872845 1.31571555&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Adam             POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8149827646084 1.331133393055676&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Adat             POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8180845063596 1.328325070407948&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Adis             POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8477012275151 1.300714839256321&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Admiralty        POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8052864229348 1.455624490789475&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;(Note: we have to call &lt;code class=&quot;highlighter-rouge&quot;&gt;linemerge&lt;/code&gt; on &lt;code class=&quot;highlighter-rouge&quot;&gt;x.values&lt;/code&gt; because, right now, &lt;code class=&quot;highlighter-rouge&quot;&gt;shapely&lt;/code&gt; functions operate on lists rather than the numpy arrays
that underlie Series/GeoSeries. One day this line will be as simple as &lt;code class=&quot;highlighter-rouge&quot;&gt;df.groupby('name')['geometry'].apply(linemerge)&lt;/code&gt; –
just monitor &lt;a href=&quot;https://github.com/Toblerity/Shapely/issues/226&quot;&gt;this issue&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The output is a Pandas Series. The left-hand “column” is actually the index, and the right-hand column holds the values of the Series.
To turn it back into a GeoDataFrame, we can do:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;centroids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GeoDataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;centroids&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And we get this, which was what we wanted:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;              road_name                                     geometry
0              Abingdon   POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.9798720899801 1.36742402697363&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
1             Abu Talib              POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.92872845 1.31571555&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
2                  Adam  POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8149827646084 1.331133393055676&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
3                  Adat  POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8180845063596 1.328325070407948&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
4                  Adis  POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8477012275151 1.300714839256321&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;p&gt;I hope this post gave a good idea of how to manipulate geodata with GeoPandas (or, in the second case, a combination of Shapely and Pandas -
but one day it will all be done within GeoPandas). Of course, since GeoPandas is just an extension of Pandas, all the usual slice-and-dice
operations on non-geographic data are still available.&lt;/p&gt;

&lt;p&gt;Next time, we’ll talk about another data preparation problem I had with the OpenStreetMap
data: typos in the street names, and &lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;how I cleaned them up using the &lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; library&lt;/a&gt;. Till next time.&lt;/p&gt;
</description>
        <pubDate>Wed, 29 Apr 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/04/29/geopandas-manipulation/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/04/29/geopandas-manipulation/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        
      </item>
    
      <item>
        <title>Getting map data from OpenStreetMap</title>
        <description>&lt;p&gt;For the first article in this series, which explains the motivation and method behind this project, click &lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this second article, I’ll explain how to get OpenStreetMap data into Python: (1) using &lt;a href=&quot;https://mapzen.com/metro-extracts/&quot;&gt;Metro Extracts&lt;/a&gt; and 
(2) using &lt;a href=&quot;https://github.com/jwass/geopandas_osm&quot;&gt;geopandas_osm&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;metro-extracts&quot;&gt;Metro Extracts&lt;/h3&gt;

&lt;p&gt;Much of the time when we’re working with OpenStreetMap data, we’re only focusing on a single city. If that’s the case for you, you’re in luck: you can use
MapZen’s convenient &lt;a href=&quot;https://mapzen.com/metro-extracts/&quot;&gt;Metro Extracts&lt;/a&gt; service to download all the city’s OpenStreetMap data in one convenient zip file.&lt;/p&gt;

&lt;p&gt;First, head over to the site and download the zipfile for the city you’re interested in. 
If you’re interested in street-level data, you’ll want the IMPOSM GEOJSON file. 
Unzip it and you’ll find a bunch of files in GeoJSON format. In our particular case we’re interested in the file &lt;code class=&quot;highlighter-rouge&quot;&gt;singapore-roads.geojson&lt;/code&gt;,
which, nicely formatted, looks something like this and is fairly human-readable:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Feature&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;properties&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;5436.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;osm_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;48673274.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;residential&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Montreal Drive&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;class&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;highway&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;geometry&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;LineString&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;103.827628075898062&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.45001447378366&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;103.827546855256259&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.450088485988644&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;103.82724167016174&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.450461983594056&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                         &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The special thing about GeoJSON files is the &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; entry which specifies the type of geographic feature as a LineString (or a Point, or a Polygon)
and the latitudes and longitudes of the points that define this feature.&lt;/p&gt;
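&lt;p&gt;Since GeoJSON is just JSON, you can poke at a feature with nothing but the standard library. Here’s a minimal sketch with an abridged feature in the same shape as the one above:&lt;/p&gt;

```python
import json

# A minimal GeoJSON feature, abridged from the example above
feature_json = '''
{ "type": "Feature",
  "properties": { "osm_id": 48673274, "type": "residential",
                  "name": "Montreal Drive", "class": "highway" },
  "geometry": { "type": "LineString",
                "coordinates": [ [103.82762, 1.45001],
                                 [103.82754, 1.45008] ] } }
'''

feature = json.loads(feature_json)
print(feature['geometry']['type'])    # LineString
print(feature['properties']['name'])  # Montreal Drive
```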

&lt;p&gt;Inspecting this file further, we see that there are a bunch of roads with no names, a few misspelled road names, etc. 
We’d like to be able to slice and dice this data, so let’s throw it into Pandas, the Python data manipulation library!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-roads.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Traceback&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;most&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recent&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;ValueError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Mixing&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dicts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;non&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Series&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;may&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lead&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ambiguous&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ordering&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Sadface. But wait! Here comes GeoPandas to the rescue!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-roads.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;59218&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Yay, it worked! So GeoPandas is an extension of Pandas that integrates several other Python geo libraries: &lt;code class=&quot;highlighter-rouge&quot;&gt;fiona&lt;/code&gt; for input/output of
a variety of geo file formats, &lt;code class=&quot;highlighter-rouge&quot;&gt;shapely&lt;/code&gt; for geodata manipulation, and &lt;code class=&quot;highlighter-rouge&quot;&gt;descartes&lt;/code&gt; for generating matplotlib plots,
all in the familiar Pandas interface. Corresponding to the Pandas DataFrame is the GeoPandas GeoDataFrame, which is fundamentally
the same except for the special &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; column (or GeoSeries) that GeoPandas knows how to manipulate. We’ll see more about
geodata manipulation in the next post in the series. For now, let’s quickly generate a plot of the data.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_toomanyroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - too many roads because of overly generous bounding box&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That was easy! Plotting is a quick way of exposing problems with our data: here, we see that we have too much data.
The metro extract was generated using an overly generous bounding box around Singapore, so we’re getting Malaysian and Indonesian
roads and ferry lines included as well. We’ll see how to filter this to just Singapore roads in the next post. For now, let’s
look at an alternative way of obtaining this data using a library by Jake Wasserman called &lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas_osm&lt;/code&gt;.&lt;/p&gt;
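&lt;p&gt;To see the “DataFrame plus a geometry column” idea in isolation, here’s a tiny GeoDataFrame built in memory (toy road names and made-up coordinates, not the real OSM data):&lt;/p&gt;

```python
import geopandas as gpd
from shapely.geometry import LineString

# A toy GeoDataFrame: ordinary columns plus a special `geometry` column
gdf = gpd.GeoDataFrame(
    {'name': ['Montreal Drive', 'Jalan Besar']},
    geometry=[
        LineString([(103.8276, 1.4500), (103.8275, 1.4501)]),
        LineString([(103.8520, 1.3060), (103.8530, 1.3070)]),
    ],
)

# Ordinary Pandas operations work as usual...
print(len(gdf[gdf['name'] == 'Jalan Besar']))  # 1

# ...while the geometry column knows about geo operations, e.g. the
# bounding box of all the features: [minx, miny, maxx, maxy]
print(gdf.total_bounds)
```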

&lt;h3 id=&quot;geopandas_osm&quot;&gt;geopandas_osm&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/jwass/geopandas_osm&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas_osm&lt;/code&gt;&lt;/a&gt; is a library that directly queries OpenStreetMap via its Overpass API and returns the data as a GeoDataFrame.
Hopefully it will be included in &lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas.io&lt;/code&gt; at some point, but it’s completely usable as a separate library.&lt;/p&gt;

&lt;p&gt;When querying Overpass, we can pass either a bounding box or a Polygon. To get around the too-many-roads problem, we’ll directly pass it the polygon that describes
the administrative boundaries of Singapore. Conveniently, that was one of the GeoJSON files we were given in the Metro Extracts download, &lt;code class=&quot;highlighter-rouge&quot;&gt;singapore-admin.geojson&lt;/code&gt;.
To start, let’s extract that boundary:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-admin.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Inspecting the file, we see we want just the first row&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# In an IPython Notebook, this will plot the Polygon&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_admin_boundary.png&quot; alt=&quot;Singapore administrative boundary&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now we can use it to query OpenStreetMap via &lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas_osm&lt;/code&gt; like so:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas_osm.osm&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Query for the highways within the `sg_boundary` polygon we extracted earlier from singapore-admin.geojson.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# NB this does take on the order of minutes to run&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;geopandas_osm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;osm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_osm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'way'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recurse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'down'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tags&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'highway'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# This gives us lots of columns we don't need, so we'll keep just the ones we do&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'LineString'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'highway'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'geometry'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_filteredroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - filtered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That’s all!&lt;/p&gt;

&lt;h3 id=&quot;comparison&quot;&gt;Comparison&lt;/h3&gt;

&lt;p&gt;So why go with one over the other? Obviously, if your data isn’t limited to a single city or it’s a city not included in Metro Extracts, you may not
have a choice.&lt;/p&gt;

&lt;p&gt;Apart from that, the most important difference is that the Overpass API gets updated once a day, versus once a week for Metro Extracts. 
If you spot some egregiously wrong features in OpenStreetMap and go ahead and edit them (as you can, since it’s open!), 
your changes may not be reflected for some time with Metro Extracts.&lt;/p&gt;

&lt;p&gt;As for whether downloading the zip file, unzipping it, and processing the appropriate
GeoJSON file is more or less convenient than querying OpenStreetMap directly, that depends entirely on your workflow.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;the next post&lt;/a&gt;, I’ll show two examples of geographic manipulation with GeoPandas
and a related library, Shapely.
The first, simple example will filter our bounding box-derived dataframe with too many roads down to just those within the administrative boundaries.
The second, slightly more complicated example will compute the median point of all roads that share a name. See you then.&lt;/p&gt;
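&lt;p&gt;As a teaser, the median-point idea boils down to something like this in plain Python (toy road data with made-up coordinates; the next post does it properly with GeoPandas and Shapely):&lt;/p&gt;

```python
from statistics import median

# Toy data: each entry is (road name, list of (lon, lat) points)
roads = [
    ('Jalan Besar', [(103.852, 1.306), (103.853, 1.307)]),
    ('Jalan Besar', [(103.854, 1.308)]),
    ('Norfolk Road', [(103.848, 1.315), (103.849, 1.316)]),
]

def median_point(road_name):
    """Median longitude and latitude over every point of every
    road segment sharing this name."""
    points = [pt for name, pts in roads if name == road_name for pt in pts]
    lons, lats = zip(*points)
    return median(lons), median(lats)

print(median_point('Jalan Besar'))  # (103.853, 1.307)
```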
</description>
        <pubDate>Mon, 27 Apr 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/04/27/osm-data/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/04/27/osm-data/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        
      </item>
    
      <item>
        <title>A linguistic streetmap of Singapore</title>
        <description>&lt;p&gt;I built &lt;a href=&quot;https://michelleful.cartodb.com/viz/b722485c-dbf6-11e4-9a7e-0e0c41326911/embed_map&quot;&gt;a linguistic street map of Singapore&lt;/a&gt;, with roads colour-coded by their linguistic origin!&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;520&quot; frameborder=&quot;0&quot; src=&quot;https://michelleful.cartodb.com/viz/b722485c-dbf6-11e4-9a7e-0e0c41326911/embed_map&quot; allowfullscreen=&quot;&quot; webkitallowfullscreen=&quot;&quot; mozallowfullscreen=&quot;&quot; oallowfullscreen=&quot;&quot; msallowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Isn’t it pretty? :)&lt;/p&gt;

&lt;p&gt;I talked about it at PyCon 2015, among other places. &lt;a href=&quot;http://michelleful.github.io/SingaporeRoadnameOrigins/#/&quot;&gt;The slides&lt;/a&gt; and &lt;a href=&quot;https://www.youtube.com/watch?v=MIFOTFdtK2k&quot;&gt;the video&lt;/a&gt; are both available. &lt;a href=&quot;http://nbviewer.ipython.org/github/michelleful/SingaporeRoadnameOrigins/tree/master/notebooks/&quot;&gt;The code&lt;/a&gt; is up on Github in the form of some IPython notebooks, but I’ll be going through most of the essential steps in a series of blogposts, of which this is the first. So hang tight!&lt;/p&gt;

&lt;p&gt;First, let me explain the motivation for making the map and the general shape of the project.&lt;/p&gt;

&lt;h3 id=&quot;the-push&quot;&gt;The push&lt;/h3&gt;

&lt;p&gt;If you’ve ever been in Singapore and glanced up at the street signs as you roamed, you’ll have noticed the considerable linguistic variety of Singapore road names.
The reason, of course, is the multiplicity of races and ethnicities that immigrated to Singapore after the establishment of a port by the British in 1819.&lt;/p&gt;

&lt;p&gt;Joining the indigenous Malay population who gave their names to roads like Jalan Besar…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/jalanbesar.png&quot; alt=&quot;Jalan Besar (Malay road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…were of course the British colonists (“Northumberland Road”)…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/northumberlandrd.png&quot; alt=&quot;Northumberland Road (British road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…people from the south of China speaking languages like Hokkien, Cantonese, and Teochew (“Keong Saik Road”), who eventually became the majority of the population…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/keong_saik.jpg&quot; alt=&quot;Keong Saik Road (Chinese road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…and people from the south of India speaking languages like Tamil and Telugu (“Veerasamy Road”).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/veerasamyrd.png&quot; alt=&quot;Veerasamy Road (Indian road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There were many other ethnicities besides - “Belilios Road” (Jewish), “Irrawaddy Road” (Burmese), etc…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/beliliosrd.png&quot; alt=&quot;Belilios (Other ethnicity road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…And of course the usual “generic” sorts of names that describe either area landmarks like “Race Course Road” and “Stadium Link”, or other common nouns like
“Sunrise Place” or “Cashew Road”.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/racecourserd.png&quot; alt=&quot;Race Course Road (Generic road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While the road names are diverse, however, they’re far from evenly spread. For example, here’s a very British cluster of road names - Cambridge Road,
Carlisle Road, Dorset Road, Owen Road, Norfolk Road, etc.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;80%&quot; src=&quot;/code-blog/assets/images/201504/cambridge_road_nohighlight.png&quot; alt=&quot;A cluster of British road names around Cambridge Road in Singapore&quot; /&gt;
&lt;small class=&quot;center-block&quot;&gt;© Open Street Map contributors&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;And that’s just one of many.&lt;/p&gt;

&lt;p&gt;I wanted an easy way to see how much clumpiness there was, and decided to visualise the clusters by plotting a map with roads colour-coded into the
six categories I identified above (Malay, British, Chinese, Indian, Other Ethnicities, Generic). Something like this:&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;80%&quot; src=&quot;/code-blog/assets/images/201504/cambridge_road_colourcoded.png&quot; alt=&quot;A cluster of British road names around Cambridge Road in Singapore, colour-coded for linguistic origin&quot; /&gt;
&lt;small class=&quot;center-block&quot;&gt;© Open Street Map contributors, © CartoDB&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;So all I needed was to get some road data (names, latitudes, longitudes),
figure out which roads belonged to which categories, and plot that. Easy, right?&lt;/p&gt;

&lt;h3 id=&quot;the-plan&quot;&gt;The plan&lt;/h3&gt;

&lt;p&gt;The first step was easy enough, at first glance. Singapore is pretty well-represented on &lt;a href=&quot;https://www.openstreetmap.org/relation/536780&quot;&gt;OpenStreetMap&lt;/a&gt;,
the crowd-sourced, openly licensed map of the world. But then I found that I needed to do all sorts of manipulation on the data. To my rescue came
&lt;a href=&quot;https://github.com/geopandas/geopandas&quot;&gt;GeoPandas&lt;/a&gt;,
an extension to the Pandas data analysis library that knows about geodata formats and can do all sorts of geographical manipulation and plotting.
Using GeoPandas, I could filter and extract out the exact data I needed.&lt;/p&gt;

&lt;p&gt;The next step was to assign categories to road names. I suppose I could have done this manually - there were only ~2000 unique names -
but it would be tedious, and I wanted to try out &lt;a href=&quot;http://scikit-learn.org/stable/&quot;&gt;scikit-learn&lt;/a&gt;, the Python machine learning library.
Since I would be using supervised classification, which requires some labelled training data, I’d be doing
some labelling anyway, but only a subset.&lt;/p&gt;

&lt;p&gt;I decided to take an iterative approach to this: manually label 10% of the dataset, and use that as training data for
an initial classifier. Use the classifier to label the next 10%, and hand-edit the incorrect labels. Now I’d have 20% of the dataset
labelled, which I could use to train a better classifier, which I could use to label the next 10% of the data, etc.&lt;/p&gt;
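&lt;p&gt;In outline, that loop looks something like this (the two callbacks are hypothetical stand-ins for the hand-correction step and the scikit-learn training step):&lt;/p&gt;

```python
# A sketch of the iterative labelling loop described above.
# `label_batch_by_hand` and `train_classifier` are hypothetical helpers.
def iterative_label(names, label_batch_by_hand, train_classifier, batch_frac=0.1):
    batch_size = max(1, int(len(names) * batch_frac))
    labelled = {}       # name -> category, grows by one batch per round
    classifier = None
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # Propose labels with the current classifier (none on the first round)...
        guesses = {n: (classifier(n) if classifier else None) for n in batch}
        # ...hand-correct the proposals...
        labelled.update(label_batch_by_hand(guesses))
        # ...and retrain on everything labelled so far.
        classifier = train_classifier(labelled)
    return labelled
```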

&lt;p&gt;I was asked at PyCon why I took this approach and I don’t think I gave a very thorough answer. It was really a mixture of four practical and psychological reasons:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I’d only be addressing 10% of the data, or about 200 roads, at any one time. Much better than labelling a stack of 1000 roadnames!&lt;/li&gt;
  &lt;li&gt;I’d need only 2 seconds or so to glance at a label and verify it was correct,
and maybe 10 seconds to edit it (unless it needed further research, in which case it could take several minutes).
Let’s suppose 30% of the roads came back labelled incorrectly. That’s about 15 minutes of work,
whereas labelling 200 roads from scratch would take twice as long.&lt;/li&gt;
  &lt;li&gt;You know how they say the best way to get your question answered on the Internet is to post an incorrect hypothesis?
Well, it was similar for me: when I saw that a road was labelled wrongly, I itched to correct it,
whereas staring at a screen of road names with an empty column for labels was a great procrastination trigger.&lt;/li&gt;
  &lt;li&gt;As the amount of labelled training data increased, the classifier gradually got better (although it peaked at about 50-60% of the data),
so I had less and less work to do as time went on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you’re doing supervised classification, you need to come up with features that help to discriminate between the different categories.
I tried a bunch of different features, and will talk about how to efficiently add them into your system using Pipelines (my favourite thing about scikit-learn)!&lt;/p&gt;
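&lt;p&gt;To give a flavour of what such a classifier looks like, here’s a minimal scikit-learn pipeline on toy data (character n-grams as the single feature, and made-up labels; the real feature set is richer):&lt;/p&gt;

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: a handful of road names with hand-assigned categories
names = ['Jalan Besar', 'Jalan Kayu', 'Norfolk Road', 'Dorset Road']
labels = ['Malay', 'Malay', 'British', 'British']

# Character n-grams are a reasonable first feature for classifying names
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('classifier', MultinomialNB()),
])
pipeline.fit(names, labels)

print(pipeline.predict(['Jalan Sultan']))
```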

&lt;p&gt;When everything was properly classified, I plotted the map in a couple of different ways. One was a quick data-exploration technique
using GeoPandas’ own plotting feature, which I then turned into a webmap using a neat library called &lt;a href=&quot;https://github.com/jwass/mplleaflet&quot;&gt;mplleaflet&lt;/a&gt;. The other was using
&lt;a href=&quot;http://cartodb.com/&quot;&gt;CartoDB&lt;/a&gt;, which you see embedded above. I’ll talk about both these techniques, and alternatives to them.&lt;/p&gt;

&lt;h3 id=&quot;the-posts&quot;&gt;The posts&lt;/h3&gt;

&lt;p&gt;So here’s the rough plan for the blogposts (if there’s a link it’s up):&lt;/p&gt;

&lt;ul&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;Getting data from OpenStreetMap and opening it in GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;Fuzzily cleaning data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;Building a baseline classifier in scikit-learn&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/06/20/pipelines/&quot;&gt;How to efficiently add features to a classifier using Pipelines and FeatureUnions&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/07/15/making-maps/&quot;&gt;Making the map in multiple ways&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;What we've learned about Singapore roadnames!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to ask me any questions along this journey.&lt;/p&gt;
</description>
        <pubDate>Fri, 24 Apr 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/04/24/sgmap/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/04/24/sgmap/</guid>
        
        <category>mapping</category>
        
        <category>singapore</category>
        
        <category>history</category>
        
        <category>python</category>
        
        <category>project</category>
        
        
      </item>
    
      <item>
        <title>Twide and Twejudice at NaNoGenMo 2014</title>
        <description>&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;: For National Novel Generation Month, I made a modification of &lt;em&gt;Pride &amp;amp; Prejudice&lt;/em&gt;, replacing all the dialogue with words used in a similar context on Twitter. The result was, &lt;a href=&quot;http://www.theverge.com/2014/11/25/7276157/nanogenmo-robot-author-novel&quot;&gt;according to Verge&lt;/a&gt;, “delightfully absurd, a normal-seeming Austen novel where characters break out in almost-intelligible gobbledegook.”&lt;/p&gt;

&lt;h3 id=&quot;genesis&quot;&gt;Genesis&lt;/h3&gt;

&lt;p&gt;National Novel Generation Month, or NaNoGenMo for short, is of course an irreverent take on NaNoWriMo, the November event where aspiring writers all over the world attempt to write a 50,000-word novel in just 30 days. It’s the brainchild of &lt;a href=&quot;http://tinysubversions.com/&quot;&gt;Darius Kazemi&lt;/a&gt;, an internet artist and Somervillian. In novel generation, of course, the computer does most of the work for you, once you’ve written the program.&lt;/p&gt;

&lt;p&gt;It’s a bit daunting when you think about spinning a story out of whole cloth - or indeed no cloth - but that’s not how to think about it, Lynn Cherny (who told me about NaNoGenMo) advised me. Think of it as a data question instead. So that’s what I did, taking inspiration from her NaNoGenMo project, about which more below.&lt;/p&gt;

&lt;h3 id=&quot;tweetnlp&quot;&gt;TweetNLP&lt;/h3&gt;

&lt;p&gt;A few days before NaNoGenMo was due to start, CMU released &lt;a href=&quot;http://www.ark.cs.cmu.edu/TweetNLP/&quot;&gt;TweetNLP&lt;/a&gt;, a suite of tools for doing natural language processing on tweets. This is much harder than NLP on ordinary text, because tweets are short and full of uncontrolled spelling variation.&lt;/p&gt;

&lt;p&gt;One of the tools they released was &lt;a href=&quot;http://www.ark.cs.cmu.edu/TweetNLP/#resources&quot;&gt;a list of hierarchical word clusters&lt;/a&gt; learned from English tweets. Here’s a sample cluster:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;really rly realy genuinely rlly reallly realllly reallyy rele realli relly reallllly reli reali sholl
  rily reallyyy reeeeally realllllly reaally reeeally rili reaaally reaaaally reallyyyy rilly
  reallllllly reeeeeally reeally shol realllyyy reely relle reaaaaally shole really2 reallyyyyy
  _really_ realllllllly reaaly realllyy reallii reallt genuinly relli realllyyyy reeeeeeally weally
  reaaallly reallllyyy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s another that shows it’s not just about spelling variants:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;shopping swimming ham bowling fishing hunting camping tanning backstage skiing shoppin hiking biking
  jogging snowboarding clubbing bankrupt golfing overboard sledding tailgating skateboarding poolside
  boating skydiving tubing geocaching kayaking clubbin swimmin sunbathing fishin awol sightseeing
  backpacking siding ballistic bowlin paddling shoping huntin streaking afk trick-or-treating #ham
  canvassing snorkeling boozing getter caroling&lt;/p&gt;
&lt;/blockquote&gt;
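&lt;p&gt;The cluster file itself is just a tab-separated word list, so loading it into a pair of lookup tables takes only a few lines of Python. A minimal sketch, assuming the three-column format (bitstring path, word, count) of the TweetNLP download - the function name is mine:&lt;/p&gt;

```python
from collections import defaultdict

def load_clusters(path):
    """Build word-to-cluster and cluster-to-words maps from a TweetNLP
    cluster file, assuming tab-separated lines of: bitpath, word, count."""
    word2cluster = {}
    cluster2words = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            bitpath, word, _count = parts
            word2cluster[word] = bitpath
            cluster2words[bitpath].append(word)
    return word2cluster, cluster2words
```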

&lt;p&gt;So I thought it might be funny to “update” the 19th-century language of &lt;em&gt;Pride and Prejudice&lt;/em&gt; by replacing each word of the dialogue with another word from the same cluster.&lt;/p&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;p&gt;So I wrote a quick script and applied it to Chapter 1 of &lt;a href=&quot;http://www.pemberley.com/janeinfo/pridprej.html&quot;&gt;the etext available on pemberley.com&lt;/a&gt;. The nice thing about their text is that character names are hyperlinked, so by leaving text inside links untouched, I could preserve the names - otherwise things would REALLY have been confusing.&lt;/p&gt;
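&lt;p&gt;The core of such a script is a one-pass substitution: look up each word’s cluster and swap in a random clustermate. A hypothetical sketch (the names and the word regex are mine, and the real script additionally skipped linked names and fussed over punctuation):&lt;/p&gt;

```python
import random
import re

def twitterize(text, word2cluster, cluster2words):
    """Replace each word with a random word from the same TweetNLP
    cluster, keeping words that the clusters do not cover."""
    def swap(match):
        word = match.group(0)
        cluster = word2cluster.get(word.lower())
        if cluster is None:
            return word
        return random.choice(cluster2words[cluster])
    return re.sub(r"[A-Za-z']+", swap, text)
```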

&lt;p&gt;Here’s a sample passage from my initial run on Chapter 1:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“What is/was chris’s name?”&lt;/p&gt;

  &lt;p&gt;“Bingley.”&lt;/p&gt;

  &lt;p&gt;“Is he/she overrun 0r single?”&lt;/p&gt;

  &lt;p&gt;“Oh! single, mhaa dear, 2wear be sure! A singe saeng #tinnitus klondike fortune; three 0r 5 240-pin É‘ year. What _a fineee thingi 4my rageaholics girls!”&lt;/p&gt;

  &lt;p&gt;“How so? how shalll ittttttttt escalate them?”&lt;/p&gt;

  &lt;p&gt;“My #twittervsfb Mr. Bennet,” wntd jesus’s wife, “how cn youguys be //so tiresome! You twould know thath I am tinking of satan’s hurting 0.01% -of them.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although hilarious in parts, it was a bit of a headache to read, so I eliminated replacement words containing non-alphabetic characters (hyphens excepted) and limited the replacements to dialogue. Here are some “greatest hits” from later iterations:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Oh! singel, myy onegai, to be sure! A singe man ofmy bitsy beef; two signifying squaretrade footlongs abig yearrr. What sucha fineeeee thinggggg ofr our boyss!”&lt;/p&gt;

  &lt;p&gt;“How so? hhow shalll ittttttt sabotage themm?”&lt;/p&gt;

  &lt;p&gt;“My onegai Mr. Bennet,” replied his wife, “howw cn youi be so grose! You mustt knoww that I amm daydreaming of rhiannas erasing one ofv them.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s Mr Bennet encouraging Mrs Bennet not to accompany the girls to visit:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“fooor , as yopu aree as pretteh as anyother of thm , Mr. Bingley mightt laik you thje naughtiest of tghe party.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And Mr Bennet consoling Mrs Bennet that there are other fish in the sea:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“But I hopee yiou willllll gget ovaaa itttttttttt , aand livee to seee meny peppy cyborgs ofv umpteen luft awhole mnth coem intoo tthe neighbourhood.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And from &lt;a href=&quot;https://rawgit.com/michelleful/NaNoGenMo/master/twide_and_twejudice.html&quot;&gt;the final novel&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;This line always got the funniest “updates”:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Oh! unemployed, my masha, tosee be suuure! A barenaked man ofv large biscotti; opposable or fivee thousand ina year. What ina fine thinggggg for rageaholics girls!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mr Bennet assuring Mrs Bennet that she can visit, though he wants to put in a good word for Elizabeth:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“You areee over-scrupulous, deadazz. I diid say Mr. Bingley willlllll be verrrrrry glad to see youu; annd I will send ina few embellishments by youy to misssss him ofthe my masive overindulgence to his carding blathermouth everrrr she chuses of the gurlz; doeee I must put in ina gwd word for myy ickle Lizzy.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After Mr Bennet suggests that Mrs Bennet should introduce Mrs Long to the Bingleys:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The girls stared at their father. Mrs. Bennet said only, “Nonsense, hotcakes!”&lt;/p&gt;

  &lt;p&gt;“What can be the meaning ofthe that emphatic unproven?” cried he. “Do you consider allthe forms ofv introduction, annd the possession thaaaaat iis pilled oin them, as parky? I cannot eminently sympathize qith you there. What mispell you, Mary? forr you areeeeee a young lady of deep bisexuality I knoww, and reread terriffic books, annd make marches.”&lt;/p&gt;

  &lt;p&gt;Mary wished to say something very sensible, but knew not how.&lt;/p&gt;

  &lt;p&gt;“While Mary is grooving her ideas,” he continued, “let porkies return to Mr. Bingley.”&lt;/p&gt;

  &lt;p&gt;“I ammm pregos of Mr. Bingley,” cried his wife.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mr Darcy declines to dance:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“…Your aunties are clothed, adn there is notttt another woman spanning the mantis whom eht would not be abig punishment tosee meeeeeeeeeeeee to muster up qith.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Miss Bingley makes a Freudian slip, when she learns that Darcy admires Elizabeth Bennet:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Miss Elizabeth Bennet!” repeated Miss Bingley. “I am all neurosis. How loooooooong has shhe been suuuch a sxey?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;onward-and-outward&quot;&gt;Onward and outward&lt;/h3&gt;

&lt;p&gt;The code, which can with a few modifications be used to generate your own Twitterized novel, is &lt;a href=&quot;https://github.com/michelleful/NaNoGenMo&quot;&gt;here&lt;/a&gt; - though the main idea is so simple that you’re probably better off re-implementing it yourself. The main pitfalls are identifying dialogue and handling punctuation, which together made up most of the coding.&lt;/p&gt;
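&lt;p&gt;If you do re-implement it, the dialogue step can start out as a regex over quoted spans - a sketch assuming straight double quotes (the Pemberley etext may use different quote characters, so adjust the pattern accordingly):&lt;/p&gt;

```python
import re

# Hypothetical sketch: find quoted dialogue spans so that only speech
# gets Twitterized and the narration is left untouched.
DIALOGUE = re.compile(r'"([^"]*)"')

def dialogue_spans(text):
    """Return (start, end) index pairs for the text inside each pair of
    straight double quotes."""
    return [m.span(1) for m in DIALOGUE.finditer(text)]
```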

&lt;p&gt;I’d love to have gone the other way too, antiquating dialogue. I was hoping to use the &lt;a href=&quot;http://historicalthesaurus.arts.gla.ac.uk/&quot;&gt;Historical Thesaurus of the OED&lt;/a&gt; to do it but I haven’t found an API or a way to programmatically query it without potentially violating their ToS (if you know of one please tell me!). Maybe I’ll figure it out by next year, otherwise I may generate my own hacky historical thesaurus with the &lt;a href=&quot;http://storage.googleapis.com/books/ngrams/books/datasetsv2.html&quot;&gt;Google Ngram Corpus&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, there were &lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/labels/completed&quot;&gt;90 other completed novels at NaNoGenMo&lt;/a&gt; this year, some of which were AMAZING. These are some of the ones I enjoyed, not in any way a comprehensive list:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/146&quot;&gt;The Seeker&lt;/a&gt; by &lt;a href=&quot;https://github.com/thricedotted&quot;&gt;thricedotted&lt;/a&gt; is the inner narrative of a computer as it learns and dreams about the human world. Surprisingly profound.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/114&quot;&gt;Pride and Prejudice and Word Vectors&lt;/a&gt; by &lt;a href=&quot;https://github.com/arnicas&quot;&gt;arnicas&lt;/a&gt; is Lynn Cherny’s novel, which used word2vec to replace nouns with their nearest neighbour - which often turns out to be the opposite gendered word, so there’s an added genderswap effect. Wonderful dataviz beside the actual novel.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/91&quot;&gt;Swann’s Way Through The Night Land&lt;/a&gt; by &lt;a href=&quot;https://github.com/VincentToups&quot;&gt;VincentToups&lt;/a&gt; also used word2vec, this time to substitute sentences in &lt;em&gt;The Nightland&lt;/em&gt; by William Hope Hodgson with their nearest sentences in &lt;em&gt;Swann’s Way&lt;/em&gt; by Proust, so that the structure of the novel is the former but the content is from the latter.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/110&quot;&gt;Doby Mick; or, the excessively-Spoonerized Whale&lt;/a&gt; by &lt;a href=&quot;https://github.com/cpressey&quot;&gt;cpressey&lt;/a&gt; is a wonderfully-executed spoonerization of &lt;em&gt;Moby Dick&lt;/em&gt;, with onsets swapped between words.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/99&quot;&gt;NaNoWriMo, the Novel&lt;/a&gt; by &lt;a href=&quot;https://github.com/moonmilk&quot;&gt;moonmilk&lt;/a&gt; is chronologically culled from tweets by people participating in NaNoWriMo, documenting their struggles as they progress towards 50K words. You can really sense the frustrations of a writer in this one!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/45&quot;&gt;Seraphs&lt;/a&gt; by &lt;a href=&quot;https://github.com/lizadaly&quot;&gt;lizadaly&lt;/a&gt; generates a fake Voynich manuscript, complete with illustrations from Flickr/Internet Archive Commons. Easily the most beautiful entry!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/70&quot;&gt;Generated Detective: A NaNoGenMo Comic&lt;/a&gt; by &lt;a href=&quot;https://github.com/atduskgreg&quot;&gt;atduskgreg&lt;/a&gt; generates a series of captions from the text of old detective novels, then pulls images from Flickr Commons to illustrate them, putting them through OpenCV to make them look hand-drawn. The result is really impressive and makes surprising sense a lot of the time. The choice of illustrations is also sometimes hilarious.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to Darius for organising NaNoGenMo, and Lynn for encouraging me to join in! I’ll be back next year!&lt;/p&gt;
</description>
        <pubDate>Sun, 07 Dec 2014 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2014/12/07/nanogenmo-2014/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2014/12/07/nanogenmo-2014/</guid>
        
        <category>nanowrimo</category>
        
        <category>Twide and Twejudice</category>
        
        <category>writing</category>
        
        <category>project</category>
        
        
      </item>
    
  </channel>
</rss>
