<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Michelle Fullwood</title>
    <description>Explorations in language technology, Python, and other technical diversions</description>
    <link>http://michelleful.github.io/code-blog/code-blog/</link>
    <atom:link href="http://michelleful.github.io/code-blog/code-blog/feed.xml" rel="self" type="application/rss+xml" />
    <pubDate>Mon, 15 Jul 2019 22:28:51 +0000</pubDate>
    <lastBuildDate>Mon, 15 Jul 2019 22:28:51 +0000</lastBuildDate>
    <generator>Jekyll v3.8.5</generator>
    
      <item>
        <title>Natural language processing at PyGotham 2016</title>
        <description>&lt;p&gt;It’s been a week since I attended PyGotham 2016 in New York City. When I saw
the schedule, which was packed with natural language processing talks,
I knew I had to go. Plus, it was at the United Nations.
How cool is it to attend a conference at the UN??!!&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;Picture of me at PyGotham&quot; src=&quot;/code-blog/assets/images/201607/pygotham_michelleful.jpg&quot; style=&quot;display: block; margin-left: auto; margin-right: auto; width: 49%;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I had a great time attending those talks, which were uniformly excellent.
I also organised a Birds of a Feather (BoF) about NLP and got to meet a lot of
language-minded folk that way. Here’s a recap.&lt;/p&gt;

&lt;h3 id=&quot;teaching-and-doing-digital-humanities-with-jupyter-notebooks&quot;&gt;Teaching and Doing Digital Humanities with Jupyter Notebooks&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/mjlavin80&quot;&gt;Matt Lavin&lt;/a&gt; gave a really interesting talk combining a couple strands of his
work in the digital humanities: educating people about computational DH,
mainly via the medium of Jupyter notebooks, as well as his own research on
dating 19th and early 20th century horror novels to answer questions like:
did H. P. Lovecraft deliberately try to write in an older style than his
contemporaries to make his horror more…horrorful?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You may make questionable choices when processing your data and
  running your ML algorithms, and that’s okay so long as you document and justify
  your methods so other people (and future you) can follow your thought process.
  Jupyter notebooks, which interleave prose, code, and execution results,
  are great for this.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://mybinder.org/&quot;&gt;MyBinder&lt;/a&gt; enables you to make your Jupyter
  notebooks executable online, which is great for workshops,
  as it removes the need to get Jupyter notebook
  up and running on participants’ individual machines.&lt;/p&gt;

&lt;h3 id=&quot;summarizing-documents&quot;&gt;Summarizing documents&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;http://mike.place/talks/pygotham/#p1&quot;&gt;Slides&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many people I talked to on Saturday evening cited this as their favourite talk
of the day. &lt;a href=&quot;https://twitter.com/mikepqr&quot;&gt;Mike Williams&lt;/a&gt; gave a masterly overview of how to do extractive
summarization, starting with the “dumb” but still effective Luhn method that anyone
can implement with a few lines of code. (If you’ve seen SummaryBot on Reddit,
that’s how it works.) Then we worked up to Latent Dirichlet Allocation and
recurrent neural networks. It was all stupendously clear and everyone felt like
they came out of the talk with their brains embiggened.&lt;/p&gt;
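
&lt;p&gt;To make “a few lines of code” concrete, here’s a rough sketch of the Luhn idea:
score each sentence by how many of the document’s most frequent content words it
contains, then keep the top scorers. The stopword list and parameters below are my
own illustration, not Mike’s implementation:&lt;/p&gt;

```python
from collections import Counter
import re

STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in", "it", "that", "with"}

def luhn_summary(text, num_sentences=2, num_keywords=5):
    """Luhn-style extractive summarization: rank sentences by how many
    of the document's most frequent content words they contain."""
    sentences = re.split(r"[.!?]\s+", text.strip())
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]
    keywords = {w for w, _ in Counter(words).most_common(num_keywords)}
    def score(sentence):
        # count how many keyword tokens appear in this sentence
        return sum(1 for w in re.findall(r"[a-z']+", sentence.lower()) if w in keywords)
    return sorted(sentences, key=score, reverse=True)[:num_sentences]

text = ("Python is great. Python makes text processing easy. "
        "Cats sleep all day. Text processing with Python is fun.")
top = luhn_summary(text, num_sentences=1)
```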

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In future, when extracting bag-of-words features, try substituting
  &lt;a href=&quot;https://github.com/ryankiros/skip-thoughts&quot;&gt;skip-thought vectors&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;http://keras.io&quot;&gt;Keras&lt;/a&gt; looks like a really neat way of implementing neural networks
  (it’s higher-level than Theano/TensorFlow; in fact, it builds on them).&lt;/p&gt;

&lt;h3 id=&quot;everything-you-always-wanted-to-know-about-nlp-but-were-afraid-to-ask&quot;&gt;Everything you always wanted to know about NLP but were afraid to ask&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://docs.google.com/presentation/d/1rYZEd7-8sZGBzg75OOPvSkIfd1FHq_d4elptiZXzJj8&quot;&gt;Slides&lt;/a&gt; and &lt;a href=&quot;https://github.com/srbutler/pygotham16_NLP/blob/master/pyg16_NLPtalk.ipynb&quot;&gt;notebook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/staven_boulter&quot;&gt;Steven Butler&lt;/a&gt; and &lt;a href=&quot;https://twitter.com/deathandmaxes&quot;&gt;Max Schwartz&lt;/a&gt; gave a solid introduction to NLP on Friday
morning, covering a lot of ground from morphology through to semantics
in under an hour.
I think that was the first time I’d ever seen Morfessor (a classic approach
to the problem of morphological segmentation) taught in an intro NLP talk!
I really liked their emphasis on how knowledge of linguistics could help
with NLP tasks, especially when it comes to other languages when a pre-built
NLP library might not be available. If you’re looking to get started with NLP,
I highly recommend this talk when the video is out!&lt;/p&gt;

&lt;h3 id=&quot;higher-level-natural-language-processing-with-textacy&quot;&gt;Higher-level natural language processing with Textacy&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/bdewilde/pygotham_2016/blob/master/pygotham_2016.pdf&quot;&gt;Slides&lt;/a&gt;
and &lt;a href=&quot;https://github.com/bdewilde/pygotham_2016/blob/master/pygotham_2016.ipynb&quot;&gt;notebook&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Burton DeWilde, creator of the excellently-named library &lt;a href=&quot;https://github.com/chartbeat-labs/textacy&quot;&gt;textacy&lt;/a&gt;, gave
an overview of it. The library sits atop the also-excellent
&lt;a href=&quot;https://spacy.io/&quot;&gt;spaCy&lt;/a&gt;
and aims to provide a nice, performant API for higher-level NLP tasks such
as key term extraction and topic modelling, with many more features planned.&lt;/p&gt;

&lt;p&gt;A nice touch to the library is built-in data visualisations for seeing the
results of an analysis. For example, you can visualise the relationship between
top terms and topics after topic modelling in a termite plot with one line
of code:&lt;/p&gt;

&lt;p&gt;&lt;img alt=&quot;Termite plot from Textacy&quot; src=&quot;/code-blog/assets/images/201607/textacy_chart.png&quot; style=&quot;display: block; margin-left: auto; margin-right: auto; width: 49%;&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Burton also put out a call for contributors to &lt;code class=&quot;highlighter-rouge&quot;&gt;textacy&lt;/code&gt;. From meeting him this weekend,
I can say he’s a really nice guy, and &lt;code class=&quot;highlighter-rouge&quot;&gt;textacy&lt;/code&gt; has the makings of a great library,
so go contribute!&lt;/p&gt;

&lt;h3 id=&quot;others&quot;&gt;Others&lt;/h3&gt;

&lt;p&gt;In addition to the NLP-centric talks, there were loads of data science-themed talks.
Deep learning was a big theme. Slightly less mainstream machine learning
techniques like reinforcement learning and probabilistic graphical models
were also covered, albeit at a more introductory level.&lt;/p&gt;

&lt;p&gt;One non-NLP/ML talk I really enjoyed was &lt;a href=&quot;https://twitter.com/subyraman&quot;&gt;Suby Raman’s&lt;/a&gt;
“Making sense of 100 years of NYC opera with Python” (&lt;a href=&quot;https://pygothamsuby.herokuapp.com/#/?_k=cg5h8j&quot;&gt;slides&lt;/a&gt;), which was more dataviz-y
and gave good tips on scraping with &lt;code class=&quot;highlighter-rouge&quot;&gt;asyncio&lt;/code&gt;.
&lt;a href=&quot;http://subyraman.tumblr.com/post/101048131983/10-graphs-to-explain-the-metropolitan-opera&quot;&gt;His initial blogpost&lt;/a&gt; about
his project got a lot of attention in music social media and even made it to
the Washington Post. It was fun to hear about the aftermath of his post.
Something he emphasised that resonated with me was the need for
domain experts to learn to root around in data, since they know the really interesting
questions. Once they master the tools to answer those questions,
they’re unstoppable.&lt;/p&gt;

&lt;h3 id=&quot;pythons-of-a-feather-slither-togetherwait-what&quot;&gt;Pythons of a feather slither together…wait what?&lt;/h3&gt;

&lt;p&gt;On the first day of PyGotham, I ran into &lt;a href=&quot;https://twitter.com/weatherpattern&quot;&gt;Ray Cha&lt;/a&gt;, whom I had only met once before, at Maptime Boston, but who quickly turned into a friend over the weekend.
I told him I was thinking of doing an NLP BoF and he said he would totally
participate. With the risk of waiting alone and awkward in a room thus mitigated,
I registered a time on the BoF spreadsheet (it was really hard to find a timeslot
that didn’t clash with talks an NLP person would be into) and tweeted out an announcement.&lt;/p&gt;

&lt;blockquote class=&quot;twitter-tweet tw-align-center&quot; data-lang=&quot;en&quot;&gt;&lt;p lang=&quot;en&quot; dir=&quot;ltr&quot;&gt;NLP folks at &lt;a href=&quot;https://twitter.com/PyGotham&quot;&gt;@PyGotham&lt;/a&gt;: come stop by the natural language processing BoF 3-4pm Sunday! &lt;a href=&quot;https://t.co/RWpYYxIhne&quot;&gt;https://t.co/RWpYYxIhne&lt;/a&gt; &lt;a href=&quot;https://twitter.com/bjdewilde&quot;&gt;@bjdewilde&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/pygotham?src=hash&quot;&gt;#pygotham&lt;/a&gt; &lt;a href=&quot;https://twitter.com/hashtag/nlproc?src=hash&quot;&gt;#nlproc&lt;/a&gt;&lt;/p&gt;&amp;mdash; Michelle Fullwood (@michelleful) &lt;a href=&quot;https://twitter.com/michelleful/status/754396297768136704&quot;&gt;July 16, 2016&lt;/a&gt;&lt;/blockquote&gt;
&lt;script async=&quot;&quot; src=&quot;//platform.twitter.com/widgets.js&quot; charset=&quot;utf-8&quot;&gt;&lt;/script&gt;

&lt;p&gt;Just before the actual BoF I mentioned to Ray that I thought 6-10 was about
the right size for a BoF and we had 8 participants, so that was &lt;em&gt;juuuuuust&lt;/em&gt; right.
Many of them were speakers from the talks I mentioned above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Highlights&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://twitter.com/udibr&quot;&gt;Udi&lt;/a&gt; started our discussion off by
  sharing &lt;a href=&quot;https://github.com/udibr/headlines&quot;&gt;his implementation of a headline generator with RNNs in Keras&lt;/a&gt;.
  Not only were the results on Buzzfeed data super cool-looking, but the demo also
  reinforced that I should really take a good look at Keras.&lt;/p&gt;

&lt;p&gt;Matt took us through his novel-dating project once more and the
  whole group brainstormed other features to add to his machine learning model,
  sharing their own experiences. For example, Max has been doing some
  really neat authorship attribution stuff with blogs and Twitter and shared
  his findings from that.&lt;/p&gt;

&lt;p&gt;It turns out that Steven worked on non-concatenative morphological
  segmentation with Tagalog infixes, which is similar to my dissertation work on
  segmenting Arabic morphology! Small world!&lt;/p&gt;

&lt;p&gt;Burton and I discussed how to train a quote extractor from prose. &lt;code class=&quot;highlighter-rouge&quot;&gt;Textacy&lt;/code&gt;
  currently includes one, but it relies on the quotes being more or less correctly
  formatted, and my data is kind of messy. We were talking about
  using his extractor as a baseline and getting people to annotate while reading,
  then training a CRF on the resultant corpus.
  Adam Palay also suggested some resources we could look
  at that might already have annotated corpora.&lt;/p&gt;

&lt;p&gt;There was also general discussion of how to handle multilingual data,
  data ethics, and how to get started as a beginner.&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;As someone who’s generally shy about approaching strangers in the hallway,
I’ve found that giving talks is a great way to get people to come talk to me
instead. Of course, if you’re shy about public speaking, that can be just as
bad…so running a BoF was the happy medium for me. I got to meet some great
people and share ideas, which is basically the point of going to a conference.&lt;/p&gt;

&lt;p&gt;So thanks to the BoF participants, to PyGotham organizers, and to Ray for
making my weekend!&lt;/p&gt;
</description>
        <pubDate>Sat, 23 Jul 2016 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2016/07/23/nlp-at-pygotham-2016/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2016/07/23/nlp-at-pygotham-2016/</guid>
        
        <category>natural language processing</category>
        
        <category>python</category>
        
        <category>conferences</category>
        
        
      </item>
    
      <item>
        <title>Parsing Chinese text with Stanford NLP</title>
        <description>&lt;p&gt;I’m doing some natural language processing on (Mandarin) Chinese text right now,
using Stanford’s NLP tools, and I’m documenting the steps here.
I’m just calling the tools from the command line, in a Unix environment, so
if your use case is different from that, this probably won’t help you.&lt;/p&gt;

&lt;p&gt;The tools we’ll be using are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;The &lt;a href=&quot;http://nlp.stanford.edu/software/segmenter.shtml&quot;&gt;Stanford Word Segmenter, version 3.5.2&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;The &lt;a href=&quot;http://nlp.stanford.edu/software/lex-parser.shtml&quot;&gt;Stanford Parser, version 3.5.2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;step-1-segmenting-chinese-text&quot;&gt;Step 1: Segmenting Chinese text&lt;/h3&gt;

&lt;p&gt;Mandarin Chinese is written without spaces between words, for example:&lt;/p&gt;

&lt;p&gt;世界就是一个疯子的囚笼&lt;br /&gt;
“The world is a den of crazies.”&lt;/p&gt;

&lt;p&gt;That’s a sentence from the &lt;a href=&quot;http://tatoeba.org/eng/&quot;&gt;Tatoeba sentence corpus&lt;/a&gt;,
which is what I’m working on parsing, by the way.&lt;/p&gt;

&lt;p&gt;Unsurprisingly, all natural language processing on Chinese text
starts with word segmentation – we won’t get far by trying to interpret
that whole string as a single element. There are lots
of segmenters out there, including &lt;code class=&quot;highlighter-rouge&quot;&gt;jieba&lt;/code&gt; in Python, which I like, but they
may have different conventions for how they split things up. So if we’re going
to use the output of the segmentation in another Stanford tool downstream, it’s
best to stick to the Stanford Word Segmenter, whose usage is simple enough
with the script provided:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;./segment.sh pku path/to/input.file UTF-8 0 &lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; path/to/segmented.file&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The first argument can be either &lt;code class=&quot;highlighter-rouge&quot;&gt;pku&lt;/code&gt; (for Beijing (Peking) University)
or &lt;code class=&quot;highlighter-rouge&quot;&gt;ctb&lt;/code&gt; (for Chinese Treebank). According to the docs, &lt;code class=&quot;highlighter-rouge&quot;&gt;pku&lt;/code&gt; results
in “smaller vocabulary sizes and OOV rates on test data than CTB models”,
so I went with that.
The “0” at the end indicates that we want the single best guess at the segmentation,
without printing its associated probability.&lt;/p&gt;

&lt;p&gt;If you’re curious, the output of the segmenter on the sentence above is:&lt;/p&gt;

&lt;p&gt;世界   就   是   一个   疯子   的   囚笼&lt;/p&gt;

&lt;p&gt;which is an eminently sensible segmentation.&lt;/p&gt;
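
&lt;p&gt;Since the segmenter’s output is plain whitespace-delimited text, downstream code
can recover the tokens with a simple split; for example, in Python (the file path
in this sketch is hypothetical):&lt;/p&gt;

```python
def read_segmented(path):
    """Read the segmenter's output (one sentence per line)
    back into lists of tokens."""
    with open(path, encoding="utf-8") as f:
        return [line.split() for line in f]

# the same idea on a single segmented sentence:
tokens = "世界   就   是   一个   疯子   的   囚笼".split()
```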

&lt;p&gt;The load times on the segmenter are pretty horrible, so it’s worth stuffing
all your text into a single file and segmenting everything in one go.&lt;/p&gt;

&lt;h3 id=&quot;step-2-parsing&quot;&gt;Step 2: Parsing&lt;/h3&gt;

&lt;p&gt;The Stanford parser gives two different kinds of output: a constituency
parse, which shows the syntactic structure of the sentence:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;(ROOT
  (IP
    (NP (NN 世界))
    (VP
      (ADVP (AD 就))
      (VP (VC 是)
        (NP
          (DNP
            (NP
              (QP (CD 一)
                (CLP (M 个)))
              (NP (NN 疯子)))
            (DEG 的))
          (NP (NN 囚笼)))))))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
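
&lt;p&gt;If you want to work with these bracketed trees programmatically, libraries like
NLTK can read them for you, but the format is also simple enough to parse by hand.
Here’s a minimal sketch that turns a tree into nested Python lists (assuming
well-formed parser output):&lt;/p&gt;

```python
def parse_tree(s):
    """Parse a Penn-style bracketed tree into nested lists:
    '(NP (NN 世界))' becomes ['NP', ['NN', '世界']]."""
    tokens = s.replace("(", " ( ").replace(")", " ) ").split()
    def read(pos):
        if tokens[pos] == "(":
            node = []
            pos += 1
            while tokens[pos] != ")":
                child, pos = read(pos)
                node.append(child)
            return node, pos + 1  # skip the closing paren
        return tokens[pos], pos + 1
    tree, _ = read(0)
    return tree

tree = parse_tree("(ROOT (IP (NP (NN 世界)) (VP (VC 是))))")
```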

&lt;p&gt;And a dependency parse, which shows, broadly speaking, the grammatical relations
the words have to each other:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;nsubj(囚笼-8, 世界-1)
advmod(囚笼-8, 就-2)
cop(囚笼-8, 是-3)
nummod(个-5, 一-4)
clf(疯子-6, 个-5)
assmod(囚笼-8, 疯子-6)
case(疯子-6, 的-7)
root(ROOT-0, 囚笼-8)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;There are specialized dependency parsers out there, but the Stanford parser first
does a constituency parse and converts it to a dependency parse. This
approach &lt;a href=&quot;http://nlp.stanford.edu/pubs/lrecstanforddeps_final_final.pdf&quot;&gt;seems to work better in general&lt;/a&gt;.&lt;/p&gt;
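
&lt;p&gt;Each line of the typed-dependencies output has the same shape,
relation(head-index, dependent-index), so it’s easy to pull apart with a
regular expression. A quick sketch (note that it assumes tokens don’t themselves
contain hyphens):&lt;/p&gt;

```python
import re

DEP_LINE = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

def parse_dep(line):
    """Split one typedDependencies line, e.g. 'nsubj(囚笼-8, 世界-1)',
    into (relation, (head, head_index), (dependent, dependent_index))."""
    m = DEP_LINE.match(line)
    rel, head, head_idx, dep, dep_idx = m.groups()
    return rel, (head, int(head_idx)), (dep, int(dep_idx))

rel, head, dep = parse_dep("nsubj(囚笼-8, 世界-1)")
```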

&lt;p&gt;There are five Chinese parsing models supplied with the software, which
you can see by &lt;code class=&quot;highlighter-rouge&quot;&gt;less&lt;/code&gt;-ing the &lt;code class=&quot;highlighter-rouge&quot;&gt;stanford-parser-3.5.2-models.jar&lt;/code&gt; file.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/xinhuaFactoredSegmenting.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/xinhuaFactored.ser.gz&lt;/li&gt;
  &lt;li&gt;edu/stanford/nlp/models/lexparser/xinhuaPCFG.ser.gz&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href=&quot;http://nlp.stanford.edu/software/parser-faq.shtml#o&quot;&gt;The FAQ&lt;/a&gt;
says that the PCFG grammars are the fastest, but the factored grammars are the
most accurate. So choosing either &lt;code class=&quot;highlighter-rouge&quot;&gt;xinhuaFactored&lt;/code&gt; or &lt;code class=&quot;highlighter-rouge&quot;&gt;chineseFactored&lt;/code&gt;
is the way to go. The &lt;code class=&quot;highlighter-rouge&quot;&gt;xinhua&lt;/code&gt; models are trained on newswire data, while
the &lt;code class=&quot;highlighter-rouge&quot;&gt;chinese&lt;/code&gt; models include more varied types of text including some from
other regions, so select the model that best fits your data.&lt;/p&gt;

&lt;p&gt;In addition, there is a &lt;code class=&quot;highlighter-rouge&quot;&gt;xinhuaFactoredSegmenting&lt;/code&gt; model. This works on
unsegmented text, allowing us to bypass the segmentation procedure in Step 1.
However, this isn’t recommended as it doesn’t perform as well as the standalone
Segmenter.&lt;/p&gt;

&lt;p&gt;Now that we’ve chosen our model, it’s time to actually do the parsing.
There is a &lt;code class=&quot;highlighter-rouge&quot;&gt;lexparser-lang.sh&lt;/code&gt; helper script, but it assumes you’re using
GB18030 encoding for your Chinese text. It’s simple to edit the script
to include an &lt;code class=&quot;highlighter-rouge&quot;&gt;-encoding utf-8&lt;/code&gt; flag, but it’s not that much more difficult
to just construct the Java call yourself.&lt;/p&gt;

&lt;p&gt;Here’s how to get the constituency parse:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java
-mx500m
-cp stanford-parser.jar:stanford-parser-3.5.2-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-encoding utf-8
edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz
path/to/segmented.file &amp;gt; path/to/constituency.parsed.file
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;To get the dependency parse, just add an &lt;code class=&quot;highlighter-rouge&quot;&gt;outputFormat&lt;/code&gt; flag, and specify
&lt;code class=&quot;highlighter-rouge&quot;&gt;typedDependencies&lt;/code&gt;:&lt;/p&gt;

&lt;div class=&quot;highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;java
-mx500m
-cp stanford-parser.jar:stanford-parser-3.5.2-models.jar edu.stanford.nlp.parser.lexparser.LexicalizedParser
-encoding utf-8
-outputFormat typedDependencies
edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz
path/to/segmented.file &amp;gt; path/to/dependency.parsed.file
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Incidentally, the parse that was chosen for this sentence is &lt;em&gt;not&lt;/em&gt;
the intended reading – it’s interpreting the sentence as
“The world is the den of a single (unspecified) crazy person”.
Which seems scarily close to truth.&lt;/p&gt;

&lt;p&gt;You might therefore want to consider multiple parses.
To get them, we need to use one of the PCFG parsers
(not the factored parsers) and
add the flag &lt;code class=&quot;highlighter-rouge&quot;&gt;-printPCFGkBest n&lt;/code&gt;, where &lt;code class=&quot;highlighter-rouge&quot;&gt;n&lt;/code&gt; is 2 or more.&lt;/p&gt;

&lt;h3 id=&quot;troubleshooting&quot;&gt;Troubleshooting&lt;/h3&gt;

&lt;p&gt;The two errors I got while trying to do the parsing step had to do with
getting the appropriate Java version running and supplying the correct
classpath.&lt;/p&gt;

&lt;p&gt;Version 3.5.2 requires Java 8. If you don’t have it, it will turn up the
error &lt;code class=&quot;highlighter-rouge&quot;&gt;Unsupported major.minor version 52.0&lt;/code&gt;. If you get this error,
make sure that (a) you have Java 8 installed, and that
(b) &lt;code class=&quot;highlighter-rouge&quot;&gt;java&lt;/code&gt; invokes Java 8. To do the latter, do&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;update-alternatives &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; java&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;and select Java 8.&lt;/p&gt;

&lt;p&gt;The second error you may come across if you follow the commands supplied in
the docs is &lt;code class=&quot;highlighter-rouge&quot;&gt;Unable to resolve
&quot;edu/stanford/nlp/models/lexparser/chineseFactored.ser.gz&quot;
as either class path, filename or URL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you get this, check the classpath (&lt;code class=&quot;highlighter-rouge&quot;&gt;-cp&lt;/code&gt;) argument you’re passing to Java.
It should have two parts: the parser &lt;code class=&quot;highlighter-rouge&quot;&gt;.jar&lt;/code&gt; and the models &lt;code class=&quot;highlighter-rouge&quot;&gt;.jar&lt;/code&gt;, separated
by a colon (a semicolon on Windows).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;&lt;span class=&quot;nt&quot;&gt;-cp&lt;/span&gt; stanford-parser.jar:stanford-parser-3.5.2-models.jar&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;I’m really grateful that Stanford makes all this great software available,
and particularly for non-English languages. I hope this guide saves someone
some time in getting the Chinese parser working. If all goes well, I’ll be
sharing what I’ve been using it for soon.&lt;/p&gt;
</description>
        <pubDate>Thu, 10 Sep 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/09/10/parsing-chinese-with-stanford/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/09/10/parsing-chinese-with-stanford/</guid>
        
        <category>natural language processing</category>
        
        <category>java</category>
        
        <category>stanford-nlp</category>
        
        <category>parsing</category>
        
        <category>chinese</category>
        
        <category>mandarin</category>
        
        
      </item>
    
      <item>
        <title>Making maps in Python</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;4. Cleaning text data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;5. Building a street name classifier with scikit-learn&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/06/20/pipelines/&quot;&gt;6. Adding features with Pipelines and Feature Unions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;a-web-map-in-two-lines-of-python&quot;&gt;A web map in two lines of Python&lt;/h3&gt;

&lt;p&gt;Here’s how to make a map from a GeoPandas GeoDataFrame in one step:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;colormap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'accent'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/map_accent.png&quot; alt=&quot;Basic map&quot; /&gt;&lt;/p&gt;

&lt;p&gt;where &lt;code class=&quot;highlighter-rouge&quot;&gt;classification&lt;/code&gt; was the name of the column with our new
Malay/Chinese/British/Indian/Generic/Other labels on each road (row).&lt;/p&gt;

&lt;p&gt;What if we want to make this nice and interactive, like a &lt;a href=&quot;http://leafletjs.com/&quot;&gt;Leaflet&lt;/a&gt; map?
So we can pan and zoom and actually see street names?
There’s a library called &lt;a href=&quot;https://github.com/jwass/mplleaflet&quot;&gt;mplleaflet&lt;/a&gt;,
by Jake Wasserman, that can do this for you:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;mplleaflet&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mplleaflet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;display&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cartodb_positron'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;iframe width=&quot;800&quot; height=&quot;400&quot; src=&quot;/code-blog/assets/images/201506/sgmap2.html&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;(If you don’t see colours on that map, just reload the page.)&lt;/p&gt;

&lt;p&gt;To export it to an HTML page, you can do this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;mplleaflet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cartodb_positron'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'sgmap.html'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;We don’t have much control over colours here, but it would be nice to theme them,
associating Chinese with its traditional red, Malay with its traditional green,
etc. Here’s a hacky way to do it:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;labels&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sort&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# [u'British', u'Chinese', u'Generic', u'Indian', u'Malay', u'Other']&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# this is the order in which colours from a colourmap will be applied&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# British -&amp;gt; blue, Chinese -&amp;gt; red, etc...&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;my_colors&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'blue'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'red'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'gray'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'yellow'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'green'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'purple'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# create a colour map with these colours&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;matplotlib.colors&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSegmentedColormap&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;cmap&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSegmentedColormap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'my cmap'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;my_colors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# do the plot&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ax2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;column&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;colormap&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cmap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mplleaflet&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;show&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fig&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ax2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;figure&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;crs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tiles&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'cartodb_positron'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'sgmap2.html'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
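
&lt;p&gt;The pairing above depends on the labels coming out in sorted order: roughly speaking, the colormap hands out its colours to the sorted labels in sequence. We can sanity-check that in plain Python (the classification values here are hard-coded for illustration):&lt;/p&gt;

```python
# Hypothetical classification values, standing in for df['classification']
classifications = ['Chinese', 'British', 'Malay', 'Indian',
                   'Generic', 'Other', 'Chinese', 'British']

# Deduplicate and sort, as done before building the colormap
labels = sorted(set(classifications))
# alphabetical: British, Chinese, Generic, Indian, Malay, Other

my_colors = ['blue', 'red', 'gray', 'yellow', 'green', 'purple']

# Approximate the label-to-colour assignment the plot will make
label_to_color = dict(zip(labels, my_colors))
# e.g. 'British' -> 'blue', 'Malay' -> 'green'
```

&lt;p&gt;This is only a sketch: the real colormap is continuous and the plot samples it, but for six evenly spaced categories the assignment works out the same.&lt;/p&gt;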

&lt;h3 id=&quot;alternatives&quot;&gt;Alternatives&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;mplleaflet&lt;/code&gt; is awesome for exploratory data analysis, but you might want to have more control over how your map looks. For this, I recommend using one of the following:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;QGIS (C++ but has Python bindings)&lt;/li&gt;
  &lt;li&gt;Mapnik (C++ but has Python bindings)&lt;/li&gt;
  &lt;li&gt;TileMill (GUI built on top of Mapnik)&lt;/li&gt;
  &lt;li&gt;Folium (maybe, haven’t investigated fully)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A nice feature of TileMill is that it allows you to define your map styling using CartoCSS. For example, here’s how we would define the colours:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-css&quot; data-lang=&quot;css&quot;&gt;&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Malay'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;green&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'British'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;blue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Chinese'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;red&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Indian'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;yellow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Other'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;purple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;classification&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;'Generic'&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;py&quot;&gt;line-color&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;no&quot;&gt;gray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can also control the line width at various zoom levels:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-css&quot; data-lang=&quot;css&quot;&gt;&lt;span class=&quot;nt&quot;&gt;line-opacity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;.7&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;18&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;17&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;15&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1.5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;zoom&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nt&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;py&quot;&gt;line-width&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;m&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;If these are too fiddly, some web mapping solutions also use CartoCSS.
I really like &lt;a href=&quot;https://cartodb.com&quot;&gt;CartoDB&lt;/a&gt;, which is how I made my main map:&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;520&quot; frameborder=&quot;0&quot; src=&quot;http://michelleful.cartodb.com/viz/b722485c-dbf6-11e4-9a7e-0e0c41326911/embed_map&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;We can browse this map to look at clusters of street names, which are now conveniently colour-coded for our analysis!&lt;/p&gt;

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;It’s remarkably easy to make maps with GeoPandas and ancillary libraries like &lt;code class=&quot;highlighter-rouge&quot;&gt;mplleaflet&lt;/code&gt;, thanks to the developers of these libraries :)&lt;/p&gt;

&lt;p&gt;That’s all the technical stuff in this series. Next time, I’ll round everything off
and talk about what I learned about Singapore street names from doing this project.&lt;/p&gt;
</description>
        <pubDate>Wed, 15 Jul 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/07/15/making-maps/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/07/15/making-maps/</guid>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>mapping</category>
        
        <category>geopandas</category>
        
        <category>matplotlib</category>
        
        <category>mplleaflet</category>
        
        <category>cartodb</category>
        
        
      </item>
    
      <item>
        <title>Using Pipelines and FeatureUnions in scikit-learn</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;4. Cleaning text data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;5. Building a street name classifier with scikit-learn&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the &lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;last article&lt;/a&gt;, we built a baseline classifier for street names. The results were a bit disappointing at 55% accuracy. In this article, we’ll add more features, and streamline the code with &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;Pipeline&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; classes.&lt;/p&gt;

&lt;p&gt;I learned a lot about Pipelines and FeatureUnions from &lt;a href=&quot;http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html&quot;&gt;Zac Stewart’s article on the subject&lt;/a&gt;, which I recommend.&lt;/p&gt;

&lt;h3 id=&quot;adding-features&quot;&gt;Adding features&lt;/h3&gt;

&lt;p&gt;There’s a great paper called &lt;a href=&quot;http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf&quot;&gt;&lt;em&gt;A few useful things to know about machine learning&lt;/em&gt;&lt;/a&gt;
by Pedro Domingos, one of the most prominent researchers in the field, in which he says:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used…This is typically where most of the effort in a machine learning project goes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So far I’d only used n-grams. But there were other sources of information I wasn’t using. Some ideas I had for more features were:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Number of words in name
    &lt;ul&gt;
      &lt;li&gt;More words: likely to be Chinese (e.g. “Ang Mo Kio Avenue 1”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Average word length
    &lt;ul&gt;
      &lt;li&gt;Shorter: likely to be Chinese (e.g. “Ang Mo Kio”)&lt;/li&gt;
      &lt;li&gt;Longer: likely to be British or Indian (e.g. “Kadayanallur Street”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Are all the words in the dictionary?
    &lt;ul&gt;
      &lt;li&gt;Yes: likely to be Generic (e.g. “Cashew Road”). Funny exception: Boon Lay Way (Chinese)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Is the “road tag” Malay?
    &lt;ul&gt;
      &lt;li&gt;Yes: likely Malay (e.g. “Jalan Bukit Merah”, “Lorong Penchalak”, vs “Upper Thomson Road”, “Ang Mo Kio Avenue 1”)&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;
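
&lt;p&gt;These ideas can be sketched as small feature functions. The Malay road-tag set below is my own illustrative guess, not the list used in the project:&lt;/p&gt;

```python
# Assumed (incomplete) set of Malay road tags, for illustration only
MALAY_ROAD_TAGS = {'jalan', 'lorong', 'taman', 'lengkok'}

def num_words(name):
    """Number of words in the street name."""
    return len(name.split())

def avg_word_length(name):
    """Average word length in the street name."""
    words = name.split()
    return sum(len(w) for w in words) / len(words)

def has_malay_road_tag(name):
    """Malay road tags typically come first, e.g. 'Jalan Bukit Merah'."""
    return name.split()[0].lower() in MALAY_ROAD_TAGS

num_words('Ang Mo Kio Avenue 1')         # 5
avg_word_length('Kadayanallur Street')   # 9.0
has_malay_road_tag('Jalan Bukit Merah')  # True
```

&lt;p&gt;Each function maps a name to one number (or boolean), which is exactly the shape a feature column needs.&lt;/p&gt;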

&lt;p&gt;How to incorporate these into the previous code? Let’s look at the code we needed to create the n-gram feature matrix:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# build the feature matrices&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ngram_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'char'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# train the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# test the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;To add the new features, what we’re looking at is:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Writing functions that produce a feature vector for each feature&lt;/li&gt;
  &lt;li&gt;Repeating the &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;fit&lt;/code&gt; lines for each feature&lt;/li&gt;
  &lt;li&gt;Adding two lines of code where we combine the resultant &lt;code class=&quot;highlighter-rouge&quot;&gt;numpy&lt;/code&gt; matrices into one giant training feature matrix and one testing feature matrix&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This may not seem like a huge deal, but it is repetitive, and that repetition opens us up to errors such as calling &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt; on the testing data rather than just &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Fortunately, &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; gives us a better way: Pipelines.&lt;/p&gt;

&lt;h3 id=&quot;pipelines&quot;&gt;Pipelines&lt;/h3&gt;

&lt;p&gt;Another way to think about the code above is to imagine a pipeline that takes in our input data, puts it through a transformer – the n-gram counter – and then through a final estimator – the SVC classifier – to produce a trained model, which we can then use for prediction.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/simple_pipeline.png&quot; alt=&quot;Simple machine learning pipeline&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This is precisely what the &lt;code class=&quot;highlighter-rouge&quot;&gt;Pipeline&lt;/code&gt; class in &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; does:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# build the pipeline&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ppl&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
              &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ngram'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ngram_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'char'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)),&lt;/span&gt;
              &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'clf'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;   &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
      &lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# train the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ppl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# test the classifier&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Notice that this time, we’re operating on &lt;code class=&quot;highlighter-rouge&quot;&gt;data_train&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;data_test&lt;/code&gt;,
i.e. just the lists of road names. We didn’t have to manually create a separate
feature matrix for training and testing – the pipeline takes care of that.&lt;/p&gt;

&lt;h3 id=&quot;creating-a-new-transformer&quot;&gt;Creating a new transformer&lt;/h3&gt;

&lt;p&gt;Now we want to add a new feature – average word length. There’s no built-in
feature extractor like &lt;code class=&quot;highlighter-rouge&quot;&gt;CountVectorizer&lt;/code&gt; for this, so we’ll have to write our
own transformer. Here’s the code to do that. This time, instead of a list of
names, we’re going to start passing in a &lt;code class=&quot;highlighter-rouge&quot;&gt;Pandas&lt;/code&gt; dataframe, which has a column
for the street name and another column for the “road tag”
(Street, Avenue, Jalan, etc).&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;numpy&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;np&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.base&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;AverageWordLengthExtractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Takes in dataframe, extracts road name column, outputs average word length&quot;&quot;&quot;&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;average_word_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Helper code to compute average word length of a name&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mean&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;word&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;word&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()])&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;The workhorse of this feature extractor&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'road_name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;average_word_length&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;s&quot;&gt;&quot;&quot;&quot;Returns `self` unless something different happens in train and test&quot;&quot;&quot;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;
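
&lt;p&gt;Outside a pipeline, &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt; is just an &lt;code class=&quot;highlighter-rouge&quot;&gt;apply&lt;/code&gt; over the &lt;code class=&quot;highlighter-rouge&quot;&gt;road_name&lt;/code&gt; column. Here’s a toy sketch of that core logic on a made-up two-row dataframe:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Toy dataframe standing in for the real street-name data
df = pd.DataFrame({'road_name': ['Ang Mo Kio Avenue 1', 'Kadayanallur Street'],
                   'road_tag':  ['Avenue', 'Street']})

def average_word_length(name):
    # Same helper logic as in the transformer's average_word_length method
    return np.mean([len(word) for word in name.split()])

features = df['road_name'].apply(average_word_length)
# word lengths (3, 2, 3, 6, 1) -> 3.0 and (12, 6) -> 9.0
```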

&lt;p&gt;Unless you’re doing something more complicated where something different happens
in the training and testing phase (like when extracting n-grams),
this is the general pattern for a transformer:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.base&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;

&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SampleExtractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BaseEstimator&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TransformerMixin&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;__init__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# e.g. pass in a column name to extract&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;do_something_to&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;vars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# where the actual feature extraction happens&lt;/span&gt;

    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;self&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# generally does nothing&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now that we’ve created our transformer, it’s time to add it into the pipeline.&lt;/p&gt;

&lt;h3 id=&quot;featureunions&quot;&gt;FeatureUnions&lt;/h3&gt;

&lt;p&gt;We have a slight problem: we only know how to add transformers in series, but
what we need to do is to add our average word length transformer in parallel
with the n-gram extractor. Like this:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/more_complex_pipeline.png&quot; alt=&quot;Parallel machine learning pipeline&quot; /&gt;&lt;/p&gt;

&lt;p&gt;For this, there is &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;’s &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; class.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.pipeline&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FeatureUnion&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'feats'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;FeatureUnion&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ngram'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_count_pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# can pass in either a pipeline&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'ave'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AverageWordLengthExtractor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# or a transformer&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;])),&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'clf'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# classifier&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Notice that the first item in the &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; is &lt;code class=&quot;highlighter-rouge&quot;&gt;ngram_count_pipeline&lt;/code&gt;.
This is just a &lt;code class=&quot;highlighter-rouge&quot;&gt;Pipeline&lt;/code&gt; built from a column-extracting transformer
and &lt;code class=&quot;highlighter-rouge&quot;&gt;CountVectorizer&lt;/code&gt;. (The column extractor is necessary
now that we’re operating on a &lt;code class=&quot;highlighter-rouge&quot;&gt;Pandas&lt;/code&gt; dataframe
rather than sending the list of road names directly through the pipeline.)&lt;/p&gt;

&lt;p&gt;That’s perfectly okay: a pipeline is itself just a giant transformer, and
is treated as such. That makes it easy to write complex pipelines by
building smaller pieces and then putting them together in the end.&lt;/p&gt;
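&lt;p&gt;For concreteness, here’s a minimal sketch of what &lt;code class=&quot;highlighter-rouge&quot;&gt;ngram_count_pipeline&lt;/code&gt; could look like. The &lt;code class=&quot;highlighter-rouge&quot;&gt;TextExtractor&lt;/code&gt; name and its details are illustrative, not the exact code from this project:&lt;/p&gt;

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

class TextExtractor(BaseEstimator, TransformerMixin):
    """Pull a single text column out of a pandas DataFrame."""
    def __init__(self, column):
        self.column = column

    def transform(self, X, y=None):
        return X[self.column]  # a Series of strings, ready for CountVectorizer

    def fit(self, X, y=None):
        return self  # nothing to learn

# a sub-pipeline: extract the column, then count character n-grams
ngram_count_pipeline = Pipeline([
    ('extract', TextExtractor('road_name')),
    ('count', CountVectorizer(ngram_range=(1, 4), analyzer='char')),
])
```

&lt;p&gt;Because this sub-pipeline is itself a transformer, it can slot straight into the &lt;code class=&quot;highlighter-rouge&quot;&gt;FeatureUnion&lt;/code&gt; above.&lt;/p&gt;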

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;So what happened after adding in all these new features? Accuracy went up
to 65%, a decent improvement. Note that using Pipelines and FeatureUnions
did not in itself improve performance. They’re just another way of
organising your code for readability, reusability and easier experimentation.&lt;/p&gt;

&lt;p&gt;If you’re looking to do hyperparameter tuning (which I won’t explain here),
pipelines make that easy, as below:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.model_selection&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GridSearchCV&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;pg&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'clf__C'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GridSearchCV&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pipeline&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param_grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;best_params_&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# {'clf__C': 0.1}&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;best_score_&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# 0.702290076336&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Ultimately, after adding in more features, adding more data, and doing
hyperparameter tuning, I had about 75-80% accuracy, which was good enough for me.
I only had to hand-correct 20-25% of the roads, which didn’t seem too daunting.
I was ready to make my map. That’s what we’ll do in &lt;a href=&quot;/code-blog/2015/07/15/making-maps/&quot;&gt;the next article&lt;/a&gt;.&lt;/p&gt;
</description>
        <pubDate>Sat, 20 Jun 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/06/20/pipelines/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/06/20/pipelines/</guid>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>scikit-learn</category>
        
        <category>machine-learning</category>
        
        
      </item>
    
      <item>
        <title>Building a street name classifier with scikit-learn</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;4. Cleaning text data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this fifth article, we’ll look at how to build a classifier, classifying street names by linguistic origin, using &lt;a href=&quot;http://scikit-learn.org/stable/&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;step-1-pick-a-classification-schema&quot;&gt;Step 1: pick a classification schema&lt;/h3&gt;

&lt;p&gt;Often, when building a classifier, you have a pretty good idea of what you want to classify your items as: as spam or ham, as one of these six species of iris, etc. For me, it was a bit less clear. There’s the obvious “big four” ethnicities of Singapore: Chinese, Malay, Indian, and Other. But there are dialects (really, languages) of Chinese, ditto with Indian, and how does one split up “Other”?&lt;/p&gt;

&lt;p&gt;In the end, after some data exploration and some thought about what I wanted to see on the map, I went with:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Chinese (all dialects including Cantonese, Hokkien, Mandarin, etc)&lt;/li&gt;
  &lt;li&gt;Malay&lt;/li&gt;
  &lt;li&gt;Indian (all languages of the subcontinent)&lt;/li&gt;
  &lt;li&gt;British&lt;/li&gt;
  &lt;li&gt;Generic (Race Course Road, Sunrise Place)&lt;/li&gt;
  &lt;li&gt;Other (generally other languages).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Six seemed about right: reducing the number of categories would make for meaningless clusters; increasing the number of categories would result in an indecipherable map.&lt;/p&gt;

&lt;p&gt;So that’s Step 1 done.&lt;/p&gt;

&lt;h3 id=&quot;step-2-create-some-training-and-testing-data&quot;&gt;Step 2: create some training and testing data&lt;/h3&gt;

&lt;p&gt;To train the classifier, we need to give it some examples: MacPherson is a British name, Keng Lee is a Chinese name. So I went ahead and hand-coded about 10% of the dataset (200 street names).&lt;/p&gt;

&lt;p&gt;This was pretty tricky because even when you’ve picked a classification schema, it may not be obvious how to categorise individual items into those categories. For example, “Florence Road” is named after a Chinese woman, Florence Yeo. But the street name sounds pretty English, or perhaps it should be under Other since it’s derived from the Latin. So I came up with some guidelines for myself on how to categorise them. (“Florence Road” was classified Chinese, in the end – pretty much impossible for the classifier to get it right, but that’s how I wanted it in the map.)&lt;/p&gt;

&lt;p&gt;Once we have this data, we need to divide it into a train set and a test set. &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; gives us a function, &lt;code class=&quot;highlighter-rouge&quot;&gt;train_test_split&lt;/code&gt;, to do this easily:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.model_selection&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;train_test_split&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_true&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; \
    &lt;span class=&quot;n&quot;&gt;train_test_split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'road_name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'classification'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;test_size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Here, &lt;code class=&quot;highlighter-rouge&quot;&gt;data_train&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;data_test&lt;/code&gt; are the street names, while &lt;code class=&quot;highlighter-rouge&quot;&gt;y_train&lt;/code&gt; and &lt;code class=&quot;highlighter-rouge&quot;&gt;y_true&lt;/code&gt; are the corresponding classifications into British, Chinese, Malay, etc. We did an 80-20 split, which is quite standard.&lt;/p&gt;

&lt;h3 id=&quot;step-3-choose-features&quot;&gt;Step 3: Choose features&lt;/h3&gt;

&lt;p&gt;Classifiers don’t really work on strings like street names. They work on numbers, either integers or reals. So we need to find a way to convert our street names to something numeric that the classifier can sink its teeth into.&lt;/p&gt;

&lt;p&gt;One really common text feature is n-gram counts. These are overlapping substrings of length n. To make this concrete, take the street name “(Jalan) Malu-Malu”, focusing just on the “Malu-Malu” part.&lt;/p&gt;

&lt;p&gt;There are five 1-grams, or unigrams: “m” (count: 2), “a” (2), “l” (2), “u” (2), and “-” (1).&lt;/p&gt;

&lt;p&gt;The 2-grams, or bigrams, are “ma” (count: 2), “al” (2, notice the overlap!), “lu” (2), and so on. In addition, we often put a special character at the beginning and end, let’s call it “#”, so there’s also “#m” (count: 1), “u#” (count: 1).&lt;/p&gt;

&lt;p&gt;The 3-grams, or trigrams, are “##m” (count: 1), “#ma” (1), “mal” (2), etc. You get the picture.&lt;/p&gt;
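&lt;p&gt;If you want to check these counts yourself, a throwaway sketch with Python’s &lt;code class=&quot;highlighter-rouge&quot;&gt;collections.Counter&lt;/code&gt; does the job (the “#” padding is added by hand here):&lt;/p&gt;

```python
from collections import Counter

def char_ngrams(text, n, pad='#'):
    """Count character n-grams, padding each end with n-1 boundary markers."""
    padded = pad * (n - 1) + text + pad * (n - 1)
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

bigrams = char_ngrams('malu-malu', 2)
# bigrams['ma'] == 2, bigrams['#m'] == 1, bigrams['u#'] == 1

trigrams = char_ngrams('malu-malu', 3)
# trigrams['##m'] == 1, trigrams['mal'] == 2
```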

&lt;p&gt;Why pick n-grams? Basically, we need features that are simple to compute and that discriminate between the various categories. Here are the counts of one bigram, “ck”, in each category, along with example street names that contain it:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;British&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Chinese&lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;   Malay   &lt;/th&gt;
      &lt;th style=&quot;text-align: center&quot;&gt;Indian&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;23&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;17&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;0&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Alnwick&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Boon Teck&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Berwick&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Hock Chye&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;  Brickson  &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;Kheam Hock&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;…&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt;…&lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
      &lt;td style=&quot;text-align: center&quot;&gt; &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;So, when the classifier sees “ck” in a street name, it can say with confidence that it’s not Malay or Indian. Basically, n-grams are a quick and easy way to capture the orthotactic patterns of a language: what letter combinations are likely to occur?&lt;/p&gt;

&lt;p&gt;I promised that computing these would be easy. That’s because &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; has our back for computing these n-gram counts, in the form of the &lt;code class=&quot;highlighter-rouge&quot;&gt;CountVectorizer&lt;/code&gt; class. Here’s how to use it:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.feature_extraction.text&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# compute n-grams of size 1 through 4&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CountVectorizer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ngram_range&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;analyzer&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'char'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ngram_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;This gives us &lt;code class=&quot;highlighter-rouge&quot;&gt;X_train&lt;/code&gt;, a sparse matrix with one row
per street name, one column per n-gram, and each cell holding the count of that
n-gram in that street name.&lt;/p&gt;

&lt;p&gt;Notice that we call a different method on the training data, &lt;code class=&quot;highlighter-rouge&quot;&gt;fit_transform&lt;/code&gt;,
than on the test data, where it’s just &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt;. The reason is that
we need the exact same features in testing as in training.
There’s no point counting a new n-gram that appears only in the test set, since
the classifier will not have any information about how well it correlates
with the various labels.&lt;/p&gt;
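&lt;p&gt;You can see this behaviour with a toy example: an n-gram that appears only in the test data is silently ignored at &lt;code class=&quot;highlighter-rouge&quot;&gt;transform&lt;/code&gt; time.&lt;/p&gt;

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(ngram_range=(2, 2), analyzer='char')
vec.fit_transform(['abc'])     # the learned vocabulary is {'ab', 'bc'}

row = vec.transform(['abxy'])  # 'bx' and 'xy' were never seen in training...
print(row.sum())               # ...so only 'ab' is counted: prints 1
```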

&lt;h3 id=&quot;step-4-select-a-classifier&quot;&gt;Step 4: Select a classifier&lt;/h3&gt;

&lt;p&gt;There are a bunch of classification algorithms included in &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt;.
They all share the same API, so it’s really easy to swap them around. But
we need to know where to start. The &lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; folks helpfully provide
this diagram to pick a classification tool.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;http://scikit-learn.org/stable/_static/ml_map.png&quot; alt=&quot;Scikit-learn classifier choice diagram&quot; /&gt;&lt;/p&gt;

&lt;p&gt;If you follow the steps, you wind up at Linear SVC, so that’s what we’ll use.&lt;/p&gt;

&lt;h3 id=&quot;step-5-train-the-classifier&quot;&gt;Step 5: Train the classifier&lt;/h3&gt;

&lt;p&gt;First, the code:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.svm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;LinearSVC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;classifier&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Now, let’s get some intuition for what’s going on.&lt;/p&gt;

&lt;p&gt;We can think of each of our street names as a point in an n-dimensional feature space.
For the purposes of illustration, let’s pretend there are just 2 features, and that it
looks like this, with red crosses representing Chinese street names and blue dots
representing British street names.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm1_new.png&quot; alt=&quot;Plotting fake street names in 2-dimensional space.&quot; /&gt;&lt;/p&gt;

&lt;p&gt;What the Linear SVC classifier does is to draw a line in between the two sets of points
as best it can, with as large a margin as possible.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm2_new.png&quot; alt=&quot;How linear SVC works: draw a line between the two sets of points with as large a margin as possible&quot; /&gt;&lt;/p&gt;

&lt;p&gt;This line is our model.&lt;/p&gt;

&lt;p&gt;Now suppose we have two new points that we don’t know the labels of.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm3_new.png&quot; alt=&quot;Introducing two new unknown points&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The classifier looks at where they fall with respect to the line, and tells us whether they’re Chinese or British.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/code-blog/assets/images/201506/svm4_new.png&quot; alt=&quot;Classify the new points based on where they fall with respect to the line&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Obviously, I’ve simplified a lot of things.
In higher-dimensional space, the line becomes a hyperplane.
And of course, not all datasets fall so smoothly into separate camps.
But the basic intuition is still the same.&lt;/p&gt;
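&lt;p&gt;Here’s the same intuition in code, with two made-up 2-dimensional points per class (the numbers are purely illustrative):&lt;/p&gt;

```python
from sklearn.svm import LinearSVC

# pretend each street name boils down to just two numeric features
X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.8, 1.0]]
y = ['Chinese', 'Chinese', 'British', 'British']

model = LinearSVC().fit(X, y)

# new points are labelled according to the side of the line they fall on
print(model.predict([[0.1, 0.0], [0.9, 0.9]]))
```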

&lt;h3 id=&quot;step-6-test-the-classifier&quot;&gt;Step 6: Test the classifier&lt;/h3&gt;

&lt;p&gt;At the end of the last step, we had &lt;code class=&quot;highlighter-rouge&quot;&gt;model&lt;/code&gt;, a trained classifier object.
We can now use it to classify new data, as explained above,
and see how well it does by comparing its predictions
against the actual labels I hand-coded in Step 2.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;sklearn.metrics&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;accuracy_score&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;predict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;accuracy_score&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y_true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_test&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# 0.551818181818&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; has &lt;a href=&quot;http://scikit-learn.org/stable/modules/classes.html&quot;&gt;a bunch of metrics built in&lt;/a&gt;. Choose the one that best
reflects how you’ll use and assess the classifier. In my case, my workflow
was to use the classifier to predict the labels of streets I had never
hand-coded, and correct the ones that were incorrect, rather than doing
everything from scratch. I wanted to save time by having as few incorrect ones
as possible, so accuracy was the right metric. But if you have different priorities,
other metrics might make more sense.&lt;/p&gt;
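&lt;p&gt;For example, &lt;code class=&quot;highlighter-rouge&quot;&gt;classification_report&lt;/code&gt; breaks performance down into per-class precision and recall, which is useful when some categories matter more than others (toy labels below, not the real data):&lt;/p&gt;

```python
from sklearn.metrics import classification_report

truth     = ['Malay', 'Malay', 'Chinese', 'British', 'British']
predicted = ['Malay', 'Chinese', 'Chinese', 'British', 'British']

# per-class precision, recall and F1, rather than one overall number
print(classification_report(truth, predicted))
```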

&lt;h3 id=&quot;improving-the-classifier&quot;&gt;Improving the classifier&lt;/h3&gt;

&lt;p&gt;So we wound up with an accuracy of 55%. That sounds like chance, but it isn’t:
we had 6 categories, so chance is really 16.6%.&lt;/p&gt;

&lt;p&gt;There’s another super-dumb baseline: predict the most common class, Malay,
for everything. That would give us 35% accuracy, so we’re 20 percentage points
above that baseline.&lt;/p&gt;
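&lt;p&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;scikit-learn&lt;/code&gt; can compute this sort of baseline for you via &lt;code class=&quot;highlighter-rouge&quot;&gt;DummyClassifier&lt;/code&gt;, a handy sanity check (toy data for illustration):&lt;/p&gt;

```python
from sklearn.dummy import DummyClassifier

X = [[0], [1], [2], [3], [4]]                          # features are ignored
y = ['Malay', 'Malay', 'Malay', 'Chinese', 'British']  # 'Malay' is most common

# always predicts the most frequent training label
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
print(baseline.score(X, y))  # 3 of 5 correct: 0.6
```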

&lt;p&gt;Our likely upper bound is around 90%, because of names like “Florence” where it’s
really unclear. We’re 35% away from that, so it should be possible to make things
a lot better.&lt;/p&gt;

&lt;p&gt;Here are some ideas for improving it:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Use more data.&lt;/strong&gt; More training data is always better, but it’s more work.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Try other classifiers.&lt;/strong&gt; We could swap in another classifier for Linear SVC. It might help.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Add more features.&lt;/strong&gt; Yes! There’s a lot of information in the data that’s not reflected by n-grams. We could try that.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tune hyperparameters.&lt;/strong&gt; We invoked &lt;code class=&quot;highlighter-rouge&quot;&gt;LinearSVC&lt;/code&gt; with no arguments, but we can pass it hyperparameters that tweak how it works. This is pretty fiddly, so let’s see where the other strategies get us first.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;a href=&quot;/code-blog/2015/06/20/pipelines/&quot;&gt;the next article&lt;/a&gt;, I’ll talk about how to easily add more features to our classifier. Till then.&lt;/p&gt;
</description>
        <pubDate>Thu, 18 Jun 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/06/18/classifying-roads/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/06/18/classifying-roads/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        <category>scikit-learn</category>
        
        
      </item>
    
      <item>
        <title>Cleaning text data with fuzzywuzzy</title>
        <description>&lt;p&gt;Previous articles in this series:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;3. Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this fourth article, we’ll look at how to clean text data with the &lt;a href=&quot;https://github.com/seatgeek/fuzzywuzzy&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; library&lt;/a&gt; from SeatGeek.&lt;/p&gt;

&lt;h3 id=&quot;use-case&quot;&gt;Use case&lt;/h3&gt;

&lt;p&gt;The road data I downloaded from OpenStreetMap had some obvious errors among the street names,
mostly misspellings. For example, there was “Aljuneid Avenue 1” when the correct spelling
is “Aljunied”. This was problematic since (1) misspellings make our ultimate goal of classification difficult, and (2) we can’t unify roads that share a name, like “Aljunied Avenue 2”, giving us more work to do. I could have gone through the list manually, but it would have been time-consuming.&lt;/p&gt;

&lt;p&gt;My solution was to get a better list from outside OpenStreetMap, and match the less correct road names to it using a library called &lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt;, for fuzzy string matching. Here’s how it works:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;fuzzywuzzy&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Aljunied Avenue 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Aljunied Avenue 2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;Aljuneid Avenue 1&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Aljunied Avenue 1'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;94&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The first element of the return tuple indicates the closest match in the reference list,
and the second number is a score showing how close it is. An exact match is 100.&lt;/p&gt;

&lt;p&gt;Sometimes, when the correct road name wasn’t in the reference set either, the score
would be pretty low – which is as it should be!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Elgin Bridge'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Jalan Woodbridge'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Cantonment Close'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Jago Close'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;85&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;I decided to set a threshold of 90: above it, I would automatically accept the match
&lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; came up with; below it, I would
manually review the road name and decide what it should be.&lt;/p&gt;

&lt;h3 id=&quot;using-fuzzywuzzy-in-pandas&quot;&gt;Using fuzzywuzzy in Pandas&lt;/h3&gt;

&lt;p&gt;So what we want is to apply &lt;code class=&quot;highlighter-rouge&quot;&gt;process.extractOne()&lt;/code&gt; to the roadname column
of our dataframe. This was my first attempt:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'corrected'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'score'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;It took &lt;em&gt;forever&lt;/em&gt;! The reason is that &lt;code class=&quot;highlighter-rouge&quot;&gt;extractOne&lt;/code&gt; does a pairwise
comparison of every name in the dataframe against every name in
the canonical list. But when a name is already correct, which is the majority
of the time, none of those pairwise comparisons are needed.
So I added a preliminary exact-match check against the list of correct
names, which cut the running time down considerably.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# might want to make this a set for O(1) lookups&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;

    &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;process&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;extractOne&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;correct_roadnames&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;90&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;roadname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;score&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'corrected'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'score'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;zip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;correct_road&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;You can add other checks as well: for example, I only accepted a &amp;gt;90 match
if the number of words was the same. Use whatever makes sense for your particular
use case.&lt;/p&gt;
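&lt;p&gt;As a concrete sketch of such a check (using the standard library’s &lt;code class=&quot;highlighter-rouge&quot;&gt;difflib&lt;/code&gt; as a stand-in scorer so the snippet is self-contained; &lt;code class=&quot;highlighter-rouge&quot;&gt;process.extractOne&lt;/code&gt; would slot in the same way):&lt;/p&gt;

```python
import difflib

def similarity(a, b):
    # 0-100 similarity, roughly analogous to a fuzzywuzzy score
    return int(round(100 * difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()))

def correct_road(roadname, candidates, threshold=90):
    # Stand-in for process.extractOne: best candidate plus its score
    new_name = max(candidates, key=lambda c: similarity(roadname, c))
    score = similarity(roadname, new_name)
    # Only auto-accept a match that clears the threshold AND has the
    # same number of words, to avoid "Elgin Bridge" -> "Jalan Woodbridge"
    if score >= threshold and len(new_name.split()) == len(roadname.split()):
        return new_name, score
    return roadname, score  # keep the original for manual review
```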

&lt;h3 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h3&gt;

&lt;p&gt;After getting the corrected dataframe,
I went into OpenStreetMap and edited most of the
incorrect road names, so hopefully Singapore street names are mostly
correctly spelled now. The &lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; library was a big help in cutting
down the number of roads I needed to manually review, so I recommend
adding it to your data cleaning arsenal.&lt;/p&gt;
</description>
        <pubDate>Wed, 20 May 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        
      </item>
    
      <item>
        <title>Geodata manipulation with GeoPandas</title>
        <description>&lt;p&gt;Previous articles in this series are: &lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;1. Motivations and Methods&lt;/a&gt; and &lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;2. Obtaining OpenStreetMap data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this third article, we’ll look at how to manipulate geodata with GeoPandas and its related libraries.&lt;/p&gt;

&lt;h3 id=&quot;filtering-to-roads-within-singapore&quot;&gt;Filtering to roads within Singapore&lt;/h3&gt;

&lt;p&gt;Recall from last time that our first OSM data-gathering method, &lt;a href=&quot;https://mapzen.com/metro-extracts/&quot;&gt;Metro Extracts&lt;/a&gt;, gave us too many roads:
we got roads in Malaysia and Indonesia, and even some ferry lines.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-roads.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_toomanyroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - too many roads because of overly generous bounding box&quot; /&gt;&lt;/p&gt;

&lt;p&gt;But it also gave us the administrative boundary of Singapore.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-admin.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Inspecting the file shows we want just the first row&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ix&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# In an IPython Notebook, this will plot the Polygon&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_admin_boundary.png&quot; alt=&quot;Singapore administrative boundary&quot; /&gt;&lt;/p&gt;

&lt;p&gt;So now let’s filter to just the roads within these administrative boundaries. It’s as easy as one line:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;within&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Let’s plot that to make sure we got what we want:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_filteredroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - filtered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Yippee! And that’s just one of the functions made available by GeoPandas.
Take a look at &lt;a href=&quot;http://geopandas.readthedocs.org/en/latest/user.html&quot;&gt;this page&lt;/a&gt; to see what other kinds of manipulation you can do this way.&lt;/p&gt;

&lt;h3 id=&quot;clearing-up-a-pandas-misunderstanding&quot;&gt;Clearing up a Pandas misunderstanding&lt;/h3&gt;

&lt;p&gt;Let me take this opportunity to clear up a fundamental Pandas misunderstanding I had when trying to make this work, one that
other people might share. My first attempt at writing this code looked like this:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Here's the change. 'Singapura' is the Malay name for Singapore&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Singapura'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Let's check the type of this object&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;geopandas&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geoseries&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GeoSeries&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;within&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)]&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_roads&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/lonely_orchard_road.png&quot; alt=&quot;Incorrect filtering yields a single road&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I would always get precisely one road – the first road of &lt;code class=&quot;highlighter-rouge&quot;&gt;df&lt;/code&gt; – back. Jake Wasserman explained to me why. (You’re going to see his name
a lot in this series, because he helped me a lot with questions and code – thanks, Jake!)
&lt;code class=&quot;highlighter-rouge&quot;&gt;sg_boundary&lt;/code&gt; is a GeoSeries at this point, not a single value. The two vectors are therefore compared pairwise:
the first item of the series &lt;code class=&quot;highlighter-rouge&quot;&gt;df.geometry&lt;/code&gt; is compared with the first item of &lt;code class=&quot;highlighter-rouge&quot;&gt;sg_boundary&lt;/code&gt;,
the second item with the second, and so on. In this case, of course, there &lt;em&gt;is&lt;/em&gt; no second
item in the &lt;code class=&quot;highlighter-rouge&quot;&gt;sg_boundary&lt;/code&gt; GeoSeries, so the comparison returns False for that row and for
all subsequent rows.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;within&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;0      True
1     False
2     False
3     False
4     False
5     False&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And thus we’re left with just the first row of the GeoDataFrame &lt;code class=&quot;highlighter-rouge&quot;&gt;df&lt;/code&gt;, since that’s the only one whose index value is True.&lt;/p&gt;

&lt;p&gt;Moral of the story: be clear on whether you’re filtering against a scalar or a vector.&lt;/p&gt;
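&lt;p&gt;The same index-alignment behaviour can be reproduced with plain Pandas (a toy example, not the original geodata):&lt;/p&gt;

```python
import pandas as pd

df = pd.DataFrame({'name': ['Orchard', 'Lentor', 'Orchard']})

# Vector: a one-element Series. .eq() aligns on the index, so only
# row 0 gets a real comparison; rows 1 and 2 come out False.
one_element = pd.Series(['Orchard'])
vector_mask = df['name'].eq(one_element)

# Scalar: extract the single value first, and every row is compared.
scalar_mask = df['name'].eq(one_element.iloc[0])
```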

&lt;h3 id=&quot;something-a-bit-more-complicated&quot;&gt;Something a bit more complicated&lt;/h3&gt;

&lt;p&gt;Singapore road names are diverse and awesome. But on occasion (quite a lot of occasions, it must be admitted),
the road planners ran out of imagination and did things like this:&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/lentor.png&quot; alt=&quot;Lentor neighbourhood, where all the roads save two are named Lentor something&quot; /&gt;
&lt;small class=&quot;center-block&quot;&gt;© Open Street Map contributors&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;So each “road name” like “Lentor” represents not just one road but a potential multitude of roads.
Suppose we want to give a geographic identity to each of these names - say, the centroid of all the roads with the same base name.
Pandas/GeoPandas and the Shapely library make that fairly straightforward.&lt;/p&gt;

&lt;p&gt;First, we process the full road names in the GeoDataFrame to remove “tags” like “Avenue”, “Street”, etc., and modifiers like numbers.
We call the resultant column &lt;code class=&quot;highlighter-rouge&quot;&gt;road_name&lt;/code&gt;. We do a &lt;code class=&quot;highlighter-rouge&quot;&gt;groupby&lt;/code&gt; on this column to gather together all the roads with the same name.
We then call an aggregate function on this &lt;code class=&quot;highlighter-rouge&quot;&gt;groupby&lt;/code&gt; to merge all the LineStrings in the &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; column together into a MultiLineString.
Then we obtain the centroids of these MultiLineStrings.&lt;/p&gt;
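&lt;p&gt;The tag-stripping step isn’t spelled out in this post; a minimal sketch might look like this (the tag list here is hypothetical and far from exhaustive):&lt;/p&gt;

```python
import re

# Hypothetical, non-exhaustive set of road "tags"; a real cleanup
# would need the full inventory of Singapore road-name tags.
TAGS = {'Avenue', 'Street', 'Road', 'Lane', 'Drive', 'Close', 'Crescent'}

def base_name(full_name):
    # Drop a trailing number ("Lentor Avenue 1" -> "Lentor Avenue") ...
    words = re.sub(r'\s*\d+$', '', full_name).split()
    # ... then drop trailing tag words ("Lentor Avenue" -> "Lentor").
    while words and words[-1] in TAGS:
        words.pop()
    return ' '.join(words)

# df['road_name'] = df['name'].apply(base_name)  # then groupby('road_name')
```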

&lt;p&gt;Here’s the code, written by Jake Wasserman (slightly modified):&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;shapely.ops&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;centroids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;groupby&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'road_name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'geometry'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;agg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shapely&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;linemerge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;centroid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;road_name
Abingdon          POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.9798720899801 1.36742402697363&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Abu Talib                    POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.92872845 1.31571555&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Adam             POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8149827646084 1.331133393055676&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Adat             POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8180845063596 1.328325070407948&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Adis             POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8477012275151 1.300714839256321&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
Admiralty        POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8052864229348 1.455624490789475&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;(Note: we have to call &lt;code class=&quot;highlighter-rouge&quot;&gt;linemerge&lt;/code&gt; on &lt;code class=&quot;highlighter-rouge&quot;&gt;x.values&lt;/code&gt; because, right now, &lt;code class=&quot;highlighter-rouge&quot;&gt;shapely&lt;/code&gt; functions operate on lists rather than the numpy arrays
that underlie Series/GeoSeries. One day this line will be as simple as &lt;code class=&quot;highlighter-rouge&quot;&gt;df.groupby('name')['geometry'].apply(linemerge)&lt;/code&gt; –
just monitor &lt;a href=&quot;https://github.com/Toblerity/Shapely/issues/226&quot;&gt;this issue&lt;/a&gt;.)&lt;/p&gt;

&lt;p&gt;The output is a Pandas Series. The left-hand “column” is actually the index, and the right-hand column holds the values of the Series.
To turn it back into a GeoDataFrame, we can do:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;n&quot;&gt;centroids&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GeoDataFrame&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;centroids&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;reset_index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;And we get this, which was what we wanted:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-bash&quot; data-lang=&quot;bash&quot;&gt;              road_name                                     geometry
0              Abingdon   POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.9798720899801 1.36742402697363&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
1             Abu Talib              POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.92872845 1.31571555&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
2                  Adam  POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8149827646084 1.331133393055676&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
3                  Adat  POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8180845063596 1.328325070407948&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
4                  Adis  POINT &lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;103.8477012275151 1.300714839256321&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;h3 id=&quot;summary&quot;&gt;Summary&lt;/h3&gt;

&lt;p&gt;I hope this post gave a good idea of how to manipulate geodata with GeoPandas (or, in the second case, a combination of Shapely and Pandas -
but one day it will all be done within GeoPandas). Of course, since GeoPandas is just an extension of Pandas, all the usual slice-and-dice
operations on non-geographic data are still available.&lt;/p&gt;

&lt;p&gt;Next time, we’ll talk about another data preparation problem I had with the OpenStreetMap
data: typos in the street names, and &lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;how I cleaned them up using the &lt;code class=&quot;highlighter-rouge&quot;&gt;fuzzywuzzy&lt;/code&gt; library&lt;/a&gt;. Till next time.&lt;/p&gt;
</description>
        <pubDate>Wed, 29 Apr 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/04/29/geopandas-manipulation/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/04/29/geopandas-manipulation/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        
      </item>
    
      <item>
        <title>Getting map data from OpenStreetMap</title>
        <description>&lt;p&gt;For the first article in this series, which explains the motivation and method behind this project, click &lt;a href=&quot;/code-blog/2015/04/24/sgmap/&quot;&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this second article, I’ll explain how to get OpenStreetMap data into Python: (1) using &lt;a href=&quot;https://mapzen.com/metro-extracts/&quot;&gt;Metro Extracts&lt;/a&gt; and 
(2) using &lt;a href=&quot;https://github.com/jwass/geopandas_osm&quot;&gt;geopandas_osm&lt;/a&gt;.&lt;/p&gt;

&lt;h3 id=&quot;metro-extracts&quot;&gt;Metro Extracts&lt;/h3&gt;

&lt;p&gt;Much of the time when we’re working with OpenStreetMap data, we’re only focusing on a single city. If that’s the case for you, you’re in luck: you can use
MapZen’s convenient &lt;a href=&quot;https://mapzen.com/metro-extracts/&quot;&gt;Metro Extracts&lt;/a&gt; service to download all the city’s OpenStreetMap data in one convenient zip file.&lt;/p&gt;

&lt;p&gt;First, head over to the site and download the zipfile for the city you’re interested in. 
If you’re interested in street-level data, you’ll want the IMPOSM GEOJSON file. 
Unzip it and you’ll find a bunch of files in GeoJSON format. In our particular case we’re interested in the file &lt;code class=&quot;highlighter-rouge&quot;&gt;singapore-roads.geojson&lt;/code&gt;,
which, nicely formatted, looks something like this and is fairly human-readable:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-json&quot; data-lang=&quot;json&quot;&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Feature&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;properties&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;5436.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;osm_id&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;48673274.0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;residential&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;name&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;Montreal Drive&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;class&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;highway&quot;&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
  &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;geometry&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
      &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;type&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;LineString&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
        &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;coordinates&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;103.827628075898062&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.45001447378366&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;  &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;103.827546855256259&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.450088485988644&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                         &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;103.82724167016174&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;1.450461983594056&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; 
                         &lt;/span&gt;&lt;span class=&quot;err&quot;&gt;...&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;The special thing about GeoJSON files is the &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; entry which specifies the type of geographic feature as a LineString (or a Point, or a Polygon)
and the latitudes and longitudes of the points that define this feature.&lt;/p&gt;
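&lt;p&gt;Since GeoJSON is just JSON, you can poke at a feature with nothing but the standard library. Here’s a minimal sketch with an abridged feature in the same shape as the one above:&lt;/p&gt;

```python
import json

# A minimal GeoJSON feature, abridged from the example above
feature_json = '''
{ "type": "Feature",
  "properties": { "osm_id": 48673274, "type": "residential",
                  "name": "Montreal Drive", "class": "highway" },
  "geometry": { "type": "LineString",
                "coordinates": [ [103.82762, 1.45001],
                                 [103.82754, 1.45008] ] } }
'''

feature = json.loads(feature_json)
print(feature['geometry']['type'])    # LineString
print(feature['properties']['name'])  # Montreal Drive
```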

&lt;p&gt;Inspecting this file further, we see that there are a bunch of roads with no names, a few misspelled road names, etc. 
We’d like to be able to slice and dice this data, so let’s throw it into Pandas, the Python data manipulation library!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;pandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-roads.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Traceback&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;most&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recent&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;call&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;last&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;nb&quot;&gt;ValueError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Mixing&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dicts&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;non&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Series&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;may&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lead&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;to&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ambiguous&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ordering&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Sadface. But wait! Here comes GeoPandas to the rescue!&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-roads.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;59218&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;13&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;Yay, it worked! So GeoPandas is an extension of Pandas that integrates several other Python geo libraries: &lt;code class=&quot;highlighter-rouge&quot;&gt;fiona&lt;/code&gt; for input/output of
a variety of geo file formats, &lt;code class=&quot;highlighter-rouge&quot;&gt;shapely&lt;/code&gt; for geodata manipulation, and &lt;code class=&quot;highlighter-rouge&quot;&gt;descartes&lt;/code&gt; for generating matplotlib plots,
all in the familiar Pandas interface. Corresponding to the Pandas DataFrame is the GeoPandas GeoDataFrame, which is fundamentally
the same except for the special &lt;code class=&quot;highlighter-rouge&quot;&gt;geometry&lt;/code&gt; column (or GeoSeries) that GeoPandas knows how to manipulate. We’ll see more about
geodata manipulation in the next post in the series. For now, let’s quickly generate a plot of the data.&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_toomanyroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - too many roads because of overly generous bounding box&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That was easy! Plotting is a quick way of exposing problems with our data: here, we see that we have too much data.
The metro extract was generated using an overly generous bounding box around Singapore, so we’re getting Malaysian and Indonesian
roads and ferry lines included as well. We’ll see how to filter this to just Singapore roads in the next post. For now, let’s
look at an alternative way of obtaining this data using a library by Jake Wasserman called &lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas_osm&lt;/code&gt;.&lt;/p&gt;
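&lt;p&gt;To see the “DataFrame plus a geometry column” idea in isolation, here’s a tiny GeoDataFrame built in memory (toy road names and made-up coordinates, not the real OSM data):&lt;/p&gt;

```python
import geopandas as gpd
from shapely.geometry import LineString

# A toy GeoDataFrame: ordinary columns plus a special `geometry` column
gdf = gpd.GeoDataFrame(
    {'name': ['Montreal Drive', 'Jalan Besar']},
    geometry=[
        LineString([(103.8276, 1.4500), (103.8275, 1.4501)]),
        LineString([(103.8520, 1.3060), (103.8530, 1.3070)]),
    ],
)

# Ordinary Pandas operations work as usual...
print(len(gdf[gdf['name'] == 'Jalan Besar']))  # 1

# ...while the geometry column knows about geo operations, e.g. the
# bounding box of all the features: [minx, miny, maxx, maxy]
print(gdf.total_bounds)
```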

&lt;h3 id=&quot;geopandas_osm&quot;&gt;geopandas_osm&lt;/h3&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/jwass/geopandas_osm&quot;&gt;&lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas_osm&lt;/code&gt;&lt;/a&gt; is a library that directly queries OpenStreetMap via its Overpass API and returns the data as a GeoDataFrame.
Hopefully it will be included in &lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas.io&lt;/code&gt; at some point, but it’s completely usable as a separate library.&lt;/p&gt;

&lt;p&gt;When querying Overpass, we can pass either a bounding box or a Polygon. To get around the too-many-roads problem, we’ll directly pass it the polygon that describes
the administrative boundaries of Singapore. Conveniently, that was one of the GeoJSON files we were given in the Metro Extracts download, &lt;code class=&quot;highlighter-rouge&quot;&gt;singapore-admin.geojson&lt;/code&gt;.
To start, let’s extract that boundary:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gpd&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;read_file&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'singapore-admin.geojson'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;c&quot;&gt;# Inspecting the file, we see we want just the first row&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;admin_df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;iloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;geometry&lt;/span&gt;

&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;  &lt;span class=&quot;c&quot;&gt;# In an IPython Notebook, this will plot the Polygon&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_admin_boundary.png&quot; alt=&quot;Singapore administrative boundary&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Now we can use it to query OpenStreetMap via &lt;code class=&quot;highlighter-rouge&quot;&gt;geopandas_osm&lt;/code&gt; like so:&lt;/p&gt;

&lt;figure class=&quot;highlight&quot;&gt;&lt;pre&gt;&lt;code class=&quot;language-python&quot; data-lang=&quot;python&quot;&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;geopandas_osm.osm&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# Query for the highways within the `sg_boundary` polygon we extracted earlier from singapore-admin.geojson.&lt;/span&gt;
&lt;span class=&quot;c&quot;&gt;# NB this does take on the order of minutes to run&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;geopandas_osm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;osm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;query_osm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'way'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sg_boundary&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;recurse&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'down'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tags&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'highway'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;

&lt;span class=&quot;c&quot;&gt;# This gives us lots of columns we don't need, so we'll keep just the ones we do&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;==&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'LineString'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;][[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'highway'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'name'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'geometry'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]]&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;df&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;plot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/figure&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; src=&quot;/code-blog/assets/images/201504/singapore_filteredroads.png&quot; alt=&quot;Singapore roads plotted by GeoPandas - filtered&quot; /&gt;&lt;/p&gt;

&lt;p&gt;That’s all!&lt;/p&gt;

&lt;h3 id=&quot;comparison&quot;&gt;Comparison&lt;/h3&gt;

&lt;p&gt;So why go with one over the other? Obviously, if your data isn’t limited to a single city or it’s a city not included in Metro Extracts, you may not
have a choice.&lt;/p&gt;

&lt;p&gt;Apart from that, the most important difference is that the Overpass API gets updated once a day, versus once a week for Metro Extracts. 
If you spot some egregiously wrong features in OpenStreetMap and go ahead and edit them (as you can, since it’s open!), 
your changes may not be reflected for some time with Metro Extracts.&lt;/p&gt;

&lt;p&gt;As for whether downloading the zip file, unzipping it, and processing the appropriate
GeoJSON file is more or less convenient than querying OpenStreetMap directly, that depends entirely on your workflow.&lt;/p&gt;

&lt;p&gt;In &lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;the next post&lt;/a&gt;, I’ll show two examples of geographic manipulation with GeoPandas
and a related library, Shapely.
The first, simple example will filter our bounding box-derived dataframe with too many roads down to just those within the administrative boundaries.
The second, slightly more complicated example will compute the median point of all roads that share a name. See you then.&lt;/p&gt;
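&lt;p&gt;As a teaser, the median-point idea boils down to something like this in plain Python (toy road data with made-up coordinates; the next post does it properly with GeoPandas and Shapely):&lt;/p&gt;

```python
from statistics import median

# Toy data: each entry is (road name, list of (lon, lat) points)
roads = [
    ('Jalan Besar', [(103.852, 1.306), (103.853, 1.307)]),
    ('Jalan Besar', [(103.854, 1.308)]),
    ('Norfolk Road', [(103.848, 1.315), (103.849, 1.316)]),
]

def median_point(road_name):
    """Median longitude and latitude over every point of every
    road segment sharing this name."""
    points = [pt for name, pts in roads if name == road_name for pt in pts]
    lons, lats = zip(*points)
    return median(lons), median(lats)

print(median_point('Jalan Besar'))  # (103.853, 1.307)
```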
</description>
        <pubDate>Mon, 27 Apr 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/04/27/osm-data/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/04/27/osm-data/</guid>
        
        <category>mapping</category>
        
        <category>python</category>
        
        <category>project</category>
        
        <category>geopandas</category>
        
        
      </item>
    
      <item>
        <title>A linguistic streetmap of Singapore</title>
        <description>&lt;p&gt;I built &lt;a href=&quot;https://michelleful.cartodb.com/viz/b722485c-dbf6-11e4-9a7e-0e0c41326911/embed_map&quot;&gt;a linguistic street map of Singapore&lt;/a&gt;, with roads colour-coded by their linguistic origin!&lt;/p&gt;

&lt;iframe width=&quot;100%&quot; height=&quot;520&quot; frameborder=&quot;0&quot; src=&quot;https://michelleful.cartodb.com/viz/b722485c-dbf6-11e4-9a7e-0e0c41326911/embed_map&quot; allowfullscreen=&quot;&quot; webkitallowfullscreen=&quot;&quot; mozallowfullscreen=&quot;&quot; oallowfullscreen=&quot;&quot; msallowfullscreen=&quot;&quot;&gt;&lt;/iframe&gt;

&lt;p&gt;Isn’t it pretty? :)&lt;/p&gt;

&lt;p&gt;I talked about it at PyCon 2015, among other places. &lt;a href=&quot;http://michelleful.github.io/SingaporeRoadnameOrigins/#/&quot;&gt;The slides&lt;/a&gt; and &lt;a href=&quot;https://www.youtube.com/watch?v=MIFOTFdtK2k&quot;&gt;the video&lt;/a&gt; are both available. &lt;a href=&quot;http://nbviewer.ipython.org/github/michelleful/SingaporeRoadnameOrigins/tree/master/notebooks/&quot;&gt;The code&lt;/a&gt; is up on Github in the form of some IPython notebooks, but I’ll be going through most of the essential steps in a series of blogposts, of which this is the first. So hang tight!&lt;/p&gt;

&lt;p&gt;First, let me explain the motivation for making the map and the general shape of the project.&lt;/p&gt;

&lt;h3 id=&quot;the-push&quot;&gt;The push&lt;/h3&gt;

&lt;p&gt;If you’ve ever been in Singapore and glanced up at the street signs as you roamed, you’ll have noticed the considerable linguistic variety of Singapore road names.
The reason, of course, is the multiplicity of races and ethnicities that immigrated to Singapore after the establishment of a port by the British in 1819.&lt;/p&gt;

&lt;p&gt;Joining the indigenous Malay population who gave their names to roads like Jalan Besar…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/jalanbesar.png&quot; alt=&quot;Jalan Besar (Malay road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…were of course the British colonists (“Northumberland Road”)…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/northumberlandrd.png&quot; alt=&quot;Northumberland Road (British road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…people from the south of China speaking languages like Hokkien, Cantonese, and Teochew (“Keong Saik Road”), who eventually became the majority of the population…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/keong_saik.jpg&quot; alt=&quot;Keong Saik Road (Chinese road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…and people from the south of India speaking languages like Tamil and Telugu (“Veerasamy Road”).&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/veerasamyrd.png&quot; alt=&quot;Veerasamy Road (Indian road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;There were many other ethnicities besides - “Belilios Road” (Jewish), “Irrawaddy Road” (Burmese), etc…&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/beliliosrd.png&quot; alt=&quot;Belilios (Other ethnicity road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;…And of course the usual “generic” sorts of names that describe either area landmarks like “Race Course Road” and “Stadium Link”, or other common nouns like
“Sunrise Place” or “Cashew Road”.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;259&quot; height=&quot;194&quot; src=&quot;/code-blog/assets/images/201504/racecourserd.png&quot; alt=&quot;Race Course Road (Generic road name)&quot; /&gt;&lt;/p&gt;

&lt;p&gt;While the road names are diverse, however, they’re far from evenly spread. For example, here’s a very British cluster of road names - Cambridge Road,
Carlisle Road, Dorset Road, Owen Road, Norfolk Road, etc.&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;80%&quot; src=&quot;/code-blog/assets/images/201504/cambridge_road_nohighlight.png&quot; alt=&quot;A cluster of British road names around Cambridge Road in Singapore&quot; /&gt;
&lt;small class=&quot;center-block&quot;&gt;© Open Street Map contributors&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;And that’s just one of many.&lt;/p&gt;

&lt;p&gt;I wanted an easy way to see how much clumpiness there was, and decided to visualise the clusters by plotting a map with roads colour-coded into the
six categories I identified above (Malay, British, Chinese, Indian, Other Ethnicities, Generic). Something like this:&lt;/p&gt;

&lt;p&gt;&lt;img class=&quot;center-block&quot; width=&quot;80%&quot; src=&quot;/code-blog/assets/images/201504/cambridge_road_colourcoded.png&quot; alt=&quot;A cluster of British road names around Cambridge Road in Singapore, colour-coded for linguistic origin&quot; /&gt;
&lt;small class=&quot;center-block&quot;&gt;© Open Street Map contributors, © CartoDB&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;So all I needed was to get some road data (names, latitudes, longitudes),
figure out which roads belonged to which categories, and plot that. Easy, right?&lt;/p&gt;

&lt;h3 id=&quot;the-plan&quot;&gt;The plan&lt;/h3&gt;

&lt;p&gt;The first step was easy enough, at first glance. Singapore is pretty well-represented on &lt;a href=&quot;https://www.openstreetmap.org/relation/536780&quot;&gt;OpenStreetMap&lt;/a&gt;,
the crowd-sourced, openly licensed map of the world. But then I found that I needed to do all sorts of manipulation on the data. To my rescue came
&lt;a href=&quot;https://github.com/geopandas/geopandas&quot;&gt;GeoPandas&lt;/a&gt;,
an extension to the Pandas data analysis library that knows about geodata formats and can do all sorts of geographical manipulation and plotting.
Using GeoPandas, I could filter and extract out the exact data I needed.&lt;/p&gt;

&lt;p&gt;The next step was to assign categories to road names. I suppose I could have done this manually - there were only ~2000 unique names -
but it would be tedious, and I wanted to try out &lt;a href=&quot;http://scikit-learn.org/stable/&quot;&gt;scikit-learn&lt;/a&gt;, the Python machine learning library.
Since I would be using supervised classification, which requires some labelled training data, I’d be doing
some labelling anyway, but only a subset.&lt;/p&gt;

&lt;p&gt;I decided to take an iterative approach to this: manually label 10% of the dataset, and use that as training data for
an initial classifier. Use the classifier to label the next 10%, and hand-edit the incorrect labels. Now I’d have 20% of the dataset
labelled, which I could use to train a better classifier, which I could use to label the next 10% of the data, etc.&lt;/p&gt;
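&lt;p&gt;In outline, that loop looks something like this (the two callbacks are hypothetical stand-ins for the hand-correction step and the scikit-learn training step):&lt;/p&gt;

```python
# A sketch of the iterative labelling loop described above.
# `label_batch_by_hand` and `train_classifier` are hypothetical helpers.
def iterative_label(names, label_batch_by_hand, train_classifier, batch_frac=0.1):
    batch_size = max(1, int(len(names) * batch_frac))
    labelled = {}       # name -> category, grows by one batch per round
    classifier = None
    for start in range(0, len(names), batch_size):
        batch = names[start:start + batch_size]
        # Propose labels with the current classifier (none on the first round)...
        guesses = {n: (classifier(n) if classifier else None) for n in batch}
        # ...hand-correct the proposals...
        labelled.update(label_batch_by_hand(guesses))
        # ...and retrain on everything labelled so far.
        classifier = train_classifier(labelled)
    return labelled
```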

&lt;p&gt;I was asked at PyCon why I took this approach and I don’t think I gave a very thorough answer. It was really a mixture of four practical and psychological reasons:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;I’d only be addressing 10% of the data, or about 200 roads, at any one time. Much better than labelling a stack of 1000 roadnames!&lt;/li&gt;
  &lt;li&gt;I’d need only 2 seconds or so to glance at a label and verify it was correct,
and maybe 10 seconds to edit it (unless it needed further research, in which case it could take several minutes).
Let’s suppose 30% of the roads came back labelled incorrectly. That’s about 15 minutes of work,
whereas labelling 200 roads from scratch would take twice as long.&lt;/li&gt;
  &lt;li&gt;You know how they say the best way to get your question answered on the Internet is to post an incorrect hypothesis?
Well, it was similar for me: when I saw that a road was labelled wrongly, I itched to correct it,
whereas staring at a screen of road names with an empty column for labels was a great procrastination trigger.&lt;/li&gt;
  &lt;li&gt;As the amount of labelled training data increased, the classifier gradually got better (although it peaked at about 50-60% of the data),
so I had less and less work to do as time went on.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you’re doing supervised classification, you need to come up with features that help to discriminate between the different categories.
I tried a bunch of different features, and will talk about how to efficiently add them into your system using Pipelines (my favourite thing about scikit-learn)!&lt;/p&gt;
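&lt;p&gt;To give a flavour of what such a classifier looks like, here’s a minimal scikit-learn pipeline on toy data (character n-grams as the single feature, and made-up labels; the real feature set is richer):&lt;/p&gt;

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data: a handful of road names with hand-assigned categories
names = ['Jalan Besar', 'Jalan Kayu', 'Norfolk Road', 'Dorset Road']
labels = ['Malay', 'Malay', 'British', 'British']

# Character n-grams are a reasonable first feature for classifying names
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(analyzer='char', ngram_range=(1, 3))),
    ('classifier', MultinomialNB()),
])
pipeline.fit(names, labels)

print(pipeline.predict(['Jalan Sultan']))
```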

&lt;p&gt;When everything was properly classified, I plotted the map in a couple of different ways. One was a quick data-exploration technique
using GeoPandas’ own plotting feature, which I then turned into a webmap using a neat library called &lt;a href=&quot;https://github.com/jwass/mplleaflet&quot;&gt;mplleaflet&lt;/a&gt;. The other was using
&lt;a href=&quot;http://cartodb.com/&quot;&gt;CartoDB&lt;/a&gt;, which you see embedded above. I’ll talk about both these techniques, and alternatives to them.&lt;/p&gt;

&lt;h3 id=&quot;the-posts&quot;&gt;The posts&lt;/h3&gt;

&lt;p&gt;So here’s the rough plan for the blogposts (if there’s a link it’s up):&lt;/p&gt;

&lt;ul&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/04/27/osm-data/&quot;&gt;Getting data from OpenStreetMap and opening it in GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/04/29/geopandas-manipulation/&quot;&gt;Manipulating geodata with GeoPandas&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/05/20/cleaning-text-with-fuzzywuzzy/&quot;&gt;Fuzzily cleaning data with fuzzywuzzy&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/06/18/classifying-roads/&quot;&gt;Building a baseline classifier in scikit-learn&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/06/20/pipelines/&quot;&gt;How to efficiently add features to a classifier using Pipelines and FeatureUnions&lt;/a&gt;&lt;/li&gt;
  &lt;li style=&quot;text-decoration: underline;&quot;&gt;&lt;a href=&quot;/code-blog/2015/07/15/making-maps/&quot;&gt;Making the map in multiple ways&lt;/a&gt;&lt;/li&gt;
  &lt;li&gt;What we've learned about Singapore roadnames!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Feel free to ask me any questions along this journey.&lt;/p&gt;
</description>
        <pubDate>Fri, 24 Apr 2015 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2015/04/24/sgmap/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2015/04/24/sgmap/</guid>
        
        <category>mapping</category>
        
        <category>singapore</category>
        
        <category>history</category>
        
        <category>python</category>
        
        <category>project</category>
        
        
      </item>
    
      <item>
        <title>Twide and Twejudice at NaNoGenMo 2014</title>
        <description>&lt;p&gt;&lt;strong&gt;Summary&lt;/strong&gt;: For National Novel Generation Month, I made a modification of &lt;em&gt;Pride &amp;amp; Prejudice&lt;/em&gt;, replacing all the dialogue with words used in a similar context on Twitter. The result was, &lt;a href=&quot;http://www.theverge.com/2014/11/25/7276157/nanogenmo-robot-author-novel&quot;&gt;according to Verge&lt;/a&gt;, “delightfully absurd, a normal-seeming Austen novel where characters break out in almost-intelligible gobbledegook.”&lt;/p&gt;

&lt;h3 id=&quot;genesis&quot;&gt;Genesis&lt;/h3&gt;

&lt;p&gt;National Novel Generation Month, or NaNoGenMo for short, is of course an irreverent take on NaNoWriMo, the November event where aspiring writers all over the world attempt to write a 50,000-word novel in just 30 days. It’s the brainchild of &lt;a href=&quot;http://tinysubversions.com/&quot;&gt;Darius Kazemi&lt;/a&gt;, an internet artist and Somervillian. In novel generation, of course, the computer does most of the work for you, once you’ve written the program.&lt;/p&gt;

&lt;p&gt;It’s a bit daunting when you think about spinning a story out of whole cloth - or indeed no cloth - but that’s not how to think about it, Lynn Cherny (who told me about NaNoGenMo) advised me. Think of it as a data question instead. So that’s what I did, taking inspiration from her NaNoGenMo project, about which more below.&lt;/p&gt;

&lt;h3 id=&quot;tweetnlp&quot;&gt;TweetNLP&lt;/h3&gt;

&lt;p&gt;A few days before NaNoGenMo was due to start, CMU released &lt;a href=&quot;http://www.ark.cs.cmu.edu/TweetNLP/&quot;&gt;TweetNLP&lt;/a&gt;, a suite of tools for doing natural language processing on tweets. This is much harder than NLP on ordinary text, because tweets are short and full of uncontrolled spelling variation.&lt;/p&gt;

&lt;p&gt;One of the tools they released was &lt;a href=&quot;http://www.ark.cs.cmu.edu/TweetNLP/#resources&quot;&gt;a list of hierarchical word clusters&lt;/a&gt; learned from English tweets. Here’s a sample cluster:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;really rly realy genuinely rlly reallly realllly reallyy rele realli relly reallllly reli reali sholl
  rily reallyyy reeeeally realllllly reaally reeeally rili reaaally reaaaally reallyyyy rilly
  reallllllly reeeeeally reeally shol realllyyy reely relle reaaaaally shole really2 reallyyyyy
  _really_ realllllllly reaaly realllyy reallii reallt genuinly relli realllyyyy reeeeeeally weally
  reaaallly reallllyyy&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s another that shows it’s not just about spelling variants:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;shopping swimming ham bowling fishing hunting camping tanning backstage skiing shoppin hiking biking
  jogging snowboarding clubbing bankrupt golfing overboard sledding tailgating skateboarding poolside
  boating skydiving tubing geocaching kayaking clubbin swimmin sunbathing fishin awol sightseeing
  backpacking siding ballistic bowlin paddling shoping huntin streaking afk trick-or-treating #ham
  canvassing snorkeling boozing getter caroling&lt;/p&gt;
&lt;/blockquote&gt;
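&lt;p&gt;The cluster file itself is just a tab-separated word list, so loading it into a pair of lookup tables takes only a few lines of Python. A minimal sketch, assuming the three-column format (bitstring path, word, count) of the TweetNLP download - the function name is mine:&lt;/p&gt;

```python
from collections import defaultdict

def load_clusters(path):
    """Build word-to-cluster and cluster-to-words maps from a TweetNLP
    cluster file, assuming tab-separated lines of: bitpath, word, count."""
    word2cluster = {}
    cluster2words = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) != 3:
                continue
            bitpath, word, _count = parts
            word2cluster[word] = bitpath
            cluster2words[bitpath].append(word)
    return word2cluster, cluster2words
```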

&lt;p&gt;So I thought it might be funny to “update” the 19th-century language of &lt;em&gt;Pride and Prejudice&lt;/em&gt; by replacing each word of the dialogue with another word from the same cluster.&lt;/p&gt;

&lt;h3 id=&quot;results&quot;&gt;Results&lt;/h3&gt;

&lt;p&gt;So I wrote a quick script and applied it to Chapter 1 of &lt;a href=&quot;http://www.pemberley.com/janeinfo/pridprej.html&quot;&gt;the etext available on pemberley.com&lt;/a&gt;. The nice thing about their text is that character names are hyperlinked, so by leaving text inside links untouched, I could preserve the names - otherwise things would REALLY have been confusing.&lt;/p&gt;
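&lt;p&gt;The core of such a script is a one-pass substitution: look up each word’s cluster and swap in a random clustermate. A hypothetical sketch (the names and the word regex are mine, and the real script additionally skipped linked names and fussed over punctuation):&lt;/p&gt;

```python
import random
import re

def twitterize(text, word2cluster, cluster2words):
    """Replace each word with a random word from the same TweetNLP
    cluster, keeping words that the clusters do not cover."""
    def swap(match):
        word = match.group(0)
        cluster = word2cluster.get(word.lower())
        if cluster is None:
            return word
        return random.choice(cluster2words[cluster])
    return re.sub(r"[A-Za-z']+", swap, text)
```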

&lt;p&gt;Here’s a sample passage from my initial run on Chapter 1:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“What is/was chris’s name?”&lt;/p&gt;

  &lt;p&gt;“Bingley.”&lt;/p&gt;

  &lt;p&gt;“Is he/she overrun 0r single?”&lt;/p&gt;

  &lt;p&gt;“Oh! single, mhaa dear, 2wear be sure! A singe saeng #tinnitus klondike fortune; three 0r 5 240-pin É‘ year. What _a fineee thingi 4my rageaholics girls!”&lt;/p&gt;

  &lt;p&gt;“How so? how shalll ittttttttt escalate them?”&lt;/p&gt;

  &lt;p&gt;“My #twittervsfb Mr. Bennet,” wntd jesus’s wife, “how cn youguys be //so tiresome! You twould know thath I am tinking of satan’s hurting 0.01% -of them.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although hilarious in parts, it was a bit of a headache to read, so I eliminated replacement words containing non-alphabetic characters (hyphens excepted) and limited the replacements to dialogue. Here are some “greatest hits” from later iterations:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Oh! singel, myy onegai, to be sure! A singe man ofmy bitsy beef; two signifying squaretrade footlongs abig yearrr. What sucha fineeeee thinggggg ofr our boyss!”&lt;/p&gt;

  &lt;p&gt;“How so? hhow shalll ittttttt sabotage themm?”&lt;/p&gt;

  &lt;p&gt;“My onegai Mr. Bennet,” replied his wife, “howw cn youi be so grose! You mustt knoww that I amm daydreaming of rhiannas erasing one ofv them.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Here’s Mr Bennet encouraging Mrs Bennet not to accompany the girls to visit:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“fooor , as yopu aree as pretteh as anyother of thm , Mr. Bingley mightt laik you thje naughtiest of tghe party.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And Mr Bennet consoling Mrs Bennet that there are other fish in the sea:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“But I hopee yiou willllll gget ovaaa itttttttttt , aand livee to seee meny peppy cyborgs ofv umpteen luft awhole mnth coem intoo tthe neighbourhood.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And from &lt;a href=&quot;https://rawgit.com/michelleful/NaNoGenMo/master/twide_and_twejudice.html&quot;&gt;the final novel&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;This line always got the funniest “updates”:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Oh! unemployed, my masha, tosee be suuure! A barenaked man ofv large biscotti; opposable or fivee thousand ina year. What ina fine thinggggg for rageaholics girls!”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mr Bennet assuring Mrs Bennet that she can visit, though he wants to put in a good word for Elizabeth:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“You areee over-scrupulous, deadazz. I diid say Mr. Bingley willlllll be verrrrrry glad to see youu; annd I will send ina few embellishments by youy to misssss him ofthe my masive overindulgence to his carding blathermouth everrrr she chuses of the gurlz; doeee I must put in ina gwd word for myy ickle Lizzy.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;After Mr Bennet suggests that Mrs Bennet should introduce Mrs Long to the Bingleys:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;The girls stared at their father. Mrs. Bennet said only, “Nonsense, hotcakes!”&lt;/p&gt;

  &lt;p&gt;“What can be the meaning ofthe that emphatic unproven?” cried he. “Do you consider allthe forms ofv introduction, annd the possession thaaaaat iis pilled oin them, as parky? I cannot eminently sympathize qith you there. What mispell you, Mary? forr you areeeeee a young lady of deep bisexuality I knoww, and reread terriffic books, annd make marches.”&lt;/p&gt;

  &lt;p&gt;Mary wished to say something very sensible, but knew not how.&lt;/p&gt;

  &lt;p&gt;“While Mary is grooving her ideas,” he continued, “let porkies return to Mr. Bingley.”&lt;/p&gt;

  &lt;p&gt;“I ammm pregos of Mr. Bingley,” cried his wife.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mr Darcy declines to dance:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“…Your aunties are clothed, adn there is notttt another woman spanning the mantis whom eht would not be abig punishment tosee meeeeeeeeeeeee to muster up qith.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Miss Bingley makes a Freudian slip, when she learns that Darcy admires Elizabeth Bennet:&lt;/p&gt;

&lt;blockquote&gt;
  &lt;p&gt;“Miss Elizabeth Bennet!” repeated Miss Bingley. “I am all neurosis. How loooooooong has shhe been suuuch a sxey?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3 id=&quot;onward-and-outward&quot;&gt;Onward and outward&lt;/h3&gt;

&lt;p&gt;The code, which can with a few modifications be used to generate your own Twitterized novel, is &lt;a href=&quot;https://github.com/michelleful/NaNoGenMo&quot;&gt;here&lt;/a&gt; - though the main idea is so simple that you’re probably better off re-implementing it yourself. The main pitfalls are identifying dialogue and handling punctuation, which together made up most of the coding.&lt;/p&gt;
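&lt;p&gt;If you do re-implement it, the dialogue step can start out as a regex over quoted spans - a sketch assuming straight double quotes (the Pemberley etext may use different quote characters, so adjust the pattern accordingly):&lt;/p&gt;

```python
import re

# Hypothetical sketch: find quoted dialogue spans so that only speech
# gets Twitterized and the narration is left untouched.
DIALOGUE = re.compile(r'"([^"]*)"')

def dialogue_spans(text):
    """Return (start, end) index pairs for the text inside each pair of
    straight double quotes."""
    return [m.span(1) for m in DIALOGUE.finditer(text)]
```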

&lt;p&gt;I’d love to have gone the other way too, antiquating dialogue. I was hoping to use the &lt;a href=&quot;http://historicalthesaurus.arts.gla.ac.uk/&quot;&gt;Historical Thesaurus of the OED&lt;/a&gt; to do it but I haven’t found an API or a way to programmatically query it without potentially violating their ToS (if you know of one please tell me!). Maybe I’ll figure it out by next year, otherwise I may generate my own hacky historical thesaurus with the &lt;a href=&quot;http://storage.googleapis.com/books/ngrams/books/datasetsv2.html&quot;&gt;Google Ngram Corpus&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Also, there were &lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/labels/completed&quot;&gt;90 other completed novels at NaNoGenMo&lt;/a&gt; this year, some of which were AMAZING. These are some of the ones I enjoyed, not in any way a comprehensive list:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/146&quot;&gt;The Seeker&lt;/a&gt; by &lt;a href=&quot;https://github.com/thricedotted&quot;&gt;thricedotted&lt;/a&gt; is the inner narrative of a computer as it learns and dreams about the human world. Surprisingly profound.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/114&quot;&gt;Pride and Prejudice and Word Vectors&lt;/a&gt; by &lt;a href=&quot;https://github.com/arnicas&quot;&gt;arnicas&lt;/a&gt; is Lynn Cherny’s novel, which used word2vec to replace nouns with their nearest neighbour - which often turns out to be the opposite gendered word, so there’s an added genderswap effect. Wonderful dataviz beside the actual novel.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/91&quot;&gt;Swann’s Way Through The Night Land&lt;/a&gt; by &lt;a href=&quot;https://github.com/VincentToups&quot;&gt;VincentToups&lt;/a&gt; also used word2vec, this time to substitute sentences in &lt;em&gt;The Nightland&lt;/em&gt; by William Hope Hodgson with their nearest sentences in &lt;em&gt;Swann’s Way&lt;/em&gt; by Proust, so that the structure of the novel is the former but the content is from the latter.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/110&quot;&gt;Doby Mick; or, the excessively-Spoonerized Whale&lt;/a&gt; by &lt;a href=&quot;https://github.com/cpressey&quot;&gt;cpressey&lt;/a&gt; is a wonderfully-executed spoonerization of &lt;em&gt;Moby Dick&lt;/em&gt;, with onsets swapped between words.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/99&quot;&gt;NaNoWriMo, the Novel&lt;/a&gt; by &lt;a href=&quot;https://github.com/moonmilk&quot;&gt;moonmilk&lt;/a&gt; is chronologically culled from tweets by people participating in NaNoWriMo, documenting their struggles as they progress towards 50K words. You can really sense the frustrations of a writer in this one!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/45&quot;&gt;Seraphs&lt;/a&gt; by &lt;a href=&quot;https://github.com/lizadaly&quot;&gt;lizadaly&lt;/a&gt; generates a fake Voynich manuscript, complete with illustrations from Flickr/Internet Archive Commons. Easily the most beautiful entry!&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;&lt;a href=&quot;https://github.com/dariusk/NaNoGenMo-2014/issues/70&quot;&gt;Generated Detective: A NaNoGenMo Comic&lt;/a&gt; by &lt;a href=&quot;https://github.com/atduskgreg&quot;&gt;atduskgreg&lt;/a&gt; generates a series of captions from the text of old detective novels, then pulls images from Flickr Commons to illustrate them, putting them through OpenCV to make them look hand-drawn. The result is really impressive and makes surprising sense a lot of the time. The choice of illustrations is also sometimes hilarious.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks to Darius for organising NaNoGenMo, and Lynn for encouraging me to join in! I’ll be back next year!&lt;/p&gt;
</description>
        <pubDate>Sun, 07 Dec 2014 00:00:00 +0000</pubDate>
        <link>http://michelleful.github.io/code-blog/code-blog/2014/12/07/nanogenmo-2014/</link>
        <guid isPermaLink="true">http://michelleful.github.io/code-blog/code-blog/2014/12/07/nanogenmo-2014/</guid>
        
        <category>nanowrimo</category>
        
        <category>Twide and Twejudice</category>
        
        <category>writing</category>
        
        <category>project</category>
        
        
      </item>
    
  </channel>
</rss>
