Nullege Got a Simple API

I’m very glad to announce that Nullege now has its own REST API. The first version only supports searching for code examples. Let me know if you need more functions.

A huge thanks to Nicco, who proposed this idea and gave me a lot of input on defining a minimal but useful interface. On top of that, he wrote a Python wrapper for our API.

If, for whatever reason, you prefer to use the REST API directly, here is a little sample.

The API is really simple. You specify a query via ‘cq’, and it returns JSON like the following:

    "name": "file.close",
    "samples": [
            "project": "shedskin",
            "repository": "git://",
            "file": "examples/",
            "lines": [
            "nullege_file": "
            "project": "matplotlib",
            "repository": "git://",
            "file": "py4science/examples/logistic/sethna_ori/
            "lines": [
            "nullege_file": "

Here, a sample is a file that contains the class or method you are searching for. ‘file’ gives its path relative to the repository root. ‘lines’ lists the line numbers where this class or method is referenced. Since a file may call a method multiple times, ‘lines’ is an array. ‘nullege_file’ points to the raw Python file in our index.
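
To make the field descriptions concrete, here is a minimal sketch that flattens a response into one entry per reference. The response text is hypothetical: it only mimics the shape described above, and every project name, path, and line number in it is made up.

```python
import json

# Hypothetical response text shaped like the sample above; all values are invented.
RESPONSE_TEXT = """
{
  "name": "file.close",
  "samples": [
    {"project": "demo_project", "file": "pkg/io_utils.py", "lines": [12, 40]},
    {"project": "other_project", "file": "tools/dump.py", "lines": [7]}
  ]
}
"""


def list_references(response_text):
    """Return one (project, file, line) tuple per reference in the response."""
    data = json.loads(response_text)
    refs = []
    for sample in data.get("samples", []):
        # 'lines' is an array because one file may reference the name several times.
        for line in sample.get("lines", []):
            refs.append((sample["project"], sample["file"], line))
    return refs


if __name__ == "__main__":
    for project, path, line in list_references(RESPONSE_TEXT):
        print(f"{project}: {path}:{line}")
```

Flattening to one tuple per line number is just a convenience for printing; a real client might keep the samples grouped by file.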

To get more samples, you can use a combination of ‘start’ and ‘count’. Here, ‘start’ is a 0-based index and ‘count’ specifies how many samples you want to get. If the server returns fewer than ‘count’ results, you have reached the end.
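
The paging rule can be sketched as a small loop. This is only an illustration under stated assumptions: `BASE_URL` is a placeholder (the real endpoint is not given here), and `fetch_page` is a hypothetical callable you supply that performs the HTTP request and returns the list of samples for one page.

```python
from urllib.parse import urlencode

# Placeholder only: the real endpoint is not given in this post.
BASE_URL = "http://api.example.invalid/search"


def build_url(query, start=0, count=10):
    """Build the URL a real fetch_page might request (parameter names from the post)."""
    return BASE_URL + "?" + urlencode({"cq": query, "start": start, "count": count})


def iter_samples(fetch_page, query, count=10):
    """Yield samples page by page until a page has fewer than `count` results.

    fetch_page(query, start, count) is a hypothetical callable that returns
    the list of samples for one page.
    """
    start = 0  # 'start' is a 0-based index
    while True:
        samples = fetch_page(query, start, count)
        yield from samples
        if len(samples) < count:
            break  # a short page means we have reached the end
        start += len(samples)


if __name__ == "__main__":
    fake_index = [{"file": f"f{i}.py"} for i in range(25)]
    fake_fetch = lambda query, start, count: fake_index[start:start + count]
    print(build_url("file.close", 10, 10))
    print(len(list(iter_samples(fake_fetch, "file.close"))))
```

Passing the fetch function in keeps the loop testable without touching the network.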

Email me if you have any ideas on how to make this API more useful.


Nullege Internals: The Whole Picture

This entry is part 4 of 4 in the series Nullege Internals

As I said in the introduction, Nullege has 4 major components. I’ve already explained the indexer. In this post, I’ll explain the whole picture and the remaining 3 components.

Before I started, I knew 2 things for sure. First, Nullege would not make any significant money, so I couldn’t rent a server farm for it. The whole process had to be able to finish on my home computer. Second, I was going to make a lot of mistakes, so the system had to let me try things in fast iterations.

Based on these assumptions, I decided to break the process into small tasks and run them sequentially. I allow parallelism within a task, but no two tasks can run in parallel. This rule ensures that I can interrupt, change, and resume the process at any time.
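
One way to picture this rule is a runner that executes tasks strictly in order and remembers which ones have finished, so an interrupted run can pick up where it stopped. This is a minimal sketch, not the actual Nullege code: the JSON state file and the task names are assumptions made for illustration.

```python
import json
import os
import tempfile


def run_pipeline(tasks, state_path):
    """Run (name, func) tasks in order, skipping those already recorded as done."""
    done = []
    if os.path.exists(state_path):
        with open(state_path) as fh:
            done = json.load(fh)
    for name, func in tasks:
        if name in done:
            continue  # finished in an earlier run
        func()  # a task may use parallelism internally; tasks never overlap
        done.append(name)
        with open(state_path, "w") as fh:
            json.dump(done, fh)  # checkpoint after every completed task
    return done


if __name__ == "__main__":
    log = []
    tasks = [("crawl", lambda: log.append("crawl")),
             ("index", lambda: log.append("index"))]
    with tempfile.TemporaryDirectory() as tmp:
        state = os.path.join(tmp, "done.json")
        run_pipeline(tasks, state)
        run_pipeline(tasks, state)  # the second run skips both tasks
    print(log)
```

Because only one task runs at a time, the checkpoint is always a simple prefix of the task list, which is what makes interrupting and resuming safe.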

Now I’ll introduce the remaining 3 components and several small tasks that happen between those major components.

The first component is the web crawler. It contains several scripts that download data from PyPI, GitHub, and SourceForge. I also separated the scripts for downloading data from the scripts for parsing it, so I can change a parser without putting extra load on those third-party websites. With help from Xml2obj and PyQuery, this component is by far the easiest one.

Once we have information about open source projects, the source code crawler needs to download their source code. This step is a little bit harder. We need to handle special cases like untrusted certificates, unresponsive SVN servers, etc. We also need to carefully avoid downloading branches and tags, because they can be huge. PySVN helps, but many times things still go wrong and we have to kill the crawler process. Killableprocess is handy here.
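
The "kill it when it hangs" idea can be sketched with the standard library alone. Note that this swaps in `subprocess.run` with a timeout rather than Killableprocess, and the commands shown are stand-ins for a real checkout command. It only kills the direct child process and makes no attempt to clean up grandchildren.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its exit code, or None if it was killed on timeout."""
    try:
        completed = subprocess.run(
            cmd,
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed and reaped the child at this point.
        return None
    return completed.returncode


if __name__ == "__main__":
    # Stand-in for an unresponsive server: a child process that just sleeps.
    hung = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_with_timeout(hung, 1))
    print(run_with_timeout([sys.executable, "-c", "pass"], 30))
```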

Before giving the source code to the indexer, I fix the permission bits. Then I try to install each project into its own sandbox, so the indexer can import code from those projects. Again, Killableprocess is a big help.

Then the indexer comes in and does its job, which I’ve explained in another post.

After the indexer has found tons of samples and saved them into a MySQL database, I convert the data into another set of tables optimized for the web portal to query, export the converted data, and upload it, along with all the source code, to my web server.

The web portal runs on the web server. It searches its database and finds samples for you.

That’s it. It took me about 6 months of spare time, mostly spent on optimizing, so that the whole process can finish within several days and the result can fit on my VPS. If you want me to explain anything in greater detail, or have an idea about Nullege, you are more than welcome to leave a comment.


Python Indexer: a hybrid way

This entry is part 3 of 4 in the series Nullege Internals

Currently, my Python indexer uses a hybrid approach. It relies on both static and dynamic information, with more emphasis on the static side. Let’s take a simple sample:

from wx import Frame
f = Frame(...)

For each Python file, the indexer first converts it to an Abstract Syntax Tree (AST). Thanks to Python’s compiler package, this step is pretty straightforward. Once the indexer gets an AST, it tries to find all names in the current namespace. In Python, a name can come from 4 sources: built-in names, the names we imported, function arguments, and local variables. A special case here is ‘self’, which we can track separately. In the above sample, we’ll get ‘Frame’ and ‘f’. We can see that Frame’s full name is wx.Frame, and that f’s type is the return type of Frame().
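
As a rough illustration of the import-tracking part, the sketch below maps each imported name to its full dotted name. It is not the indexer's actual code: it uses the standard `ast` module (the `compiler` package exists only in Python 2), and it ignores built-ins, function arguments, local variables, and scoping entirely.

```python
import ast


def imported_names(source):
    """Map each name bound by an import statement to its full dotted name."""
    names = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                if alias.name != "*":  # star imports cannot be resolved statically
                    names[alias.asname or alias.name] = f"{node.module}.{alias.name}"
        elif isinstance(node, ast.Import):
            for alias in node.names:
                if alias.asname:
                    names[alias.asname] = alias.name  # "import a.b as c" binds c
                else:
                    top = alias.name.split(".")[0]
                    names[top] = top  # "import a.b" binds only the top-level "a"
    return names


if __name__ == "__main__":
    print(imported_names("from wx import Frame\nf = Frame(...)\n"))
```

Parsing never imports anything, so the sample runs even if wx is not installed.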

In Python, it is generally impossible to know the return type of an arbitrary function. But in some special cases, we can cheat. Here we can import wx.Frame and see that it is a class. Then we can guess that Frame() is a constructor and that f is an instance of Frame. At this point, we can see that f.Show is a sample for wx.Frame.Show.
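
The "cheat" can be sketched end to end in a few lines. This is a simplified illustration, not the real indexer: it handles only `from module import Name` followed by a plain `x = Name(...)` assignment, ignores scopes, and uses `collections.OrderedDict` in the demo so it runs without wx installed.

```python
import ast
import importlib
import inspect


def _is_class(module_name, attr):
    """Dynamic step: import the module and check whether the name is a class."""
    try:
        obj = getattr(importlib.import_module(module_name), attr, None)
    except ImportError:
        return False
    return inspect.isclass(obj)


def find_method_samples(source):
    """Return sorted (line, full_name) pairs for attribute uses on known instances."""
    nodes = list(ast.walk(ast.parse(source)))
    imported = {}  # local name -> (module, attribute)
    for node in nodes:
        if isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            for alias in node.names:
                imported[alias.asname or alias.name] = (node.module, alias.name)
    instances = {}  # variable name -> full class name
    for node in nodes:
        if (isinstance(node, ast.Assign) and len(node.targets) == 1
                and isinstance(node.targets[0], ast.Name)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id in imported):
            module_name, attr = imported[node.value.func.id]
            if _is_class(module_name, attr):  # calling a class: guess a constructor
                instances[node.targets[0].id] = f"{module_name}.{attr}"
    samples = []
    for node in nodes:
        if (isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name)
                and node.value.id in instances):
            samples.append((node.lineno, f"{instances[node.value.id]}.{node.attr}"))
    return sorted(samples)


if __name__ == "__main__":
    SOURCE = "from collections import OrderedDict\nd = OrderedDict()\nd.popitem()\n"
    print(find_method_samples(SOURCE))
```

The static passes find the names; the single dynamic check (`inspect.isclass`) is what justifies treating the call as a constructor.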

Real cases need more work to track namespaces, but you get the idea. These simple steps cover the majority of my data. As I said before, the goal of this project is to find useful samples, not all samples.


Python Indexer: the original way

This entry is part 2 of 4 in the series Nullege Internals

The Python indexer takes Python files and finds function calls in them. My goal is to find samples that I can copy, so accuracy is more important than completeness. It’s okay to miss some samples, as long as I find good ones.

For static languages like C++ and Java, the compiler can help. Actually, there are sites doing exactly what I want, but for C++ and Java.

However, for Python and other dynamic languages, this approach doesn’t work. Many things remain unknown until the code actually runs. So my original idea was simple: ask the interpreter for help. I modified the Python interpreter to log every function call, along with things like the caller and the arguments passed in. Then I tried to run every Python file I could get, hoping that the most popular functions would get hit at some point. The initial results made me excited for a while. It generated huge log files and found many samples for popular libraries like wxPython and NumPy.

My excitement didn’t last long. This approach hit a dead end, mainly for 2 reasons.

The first problem is a widely used design pattern called Facade. People wrap things together to provide a cleaner interface, or to hide platform-dependent details. But the interpreter doesn’t care about interfaces meant for humans, and always gives me the name of the real implementation. For example, it will show ‘posixpath.join’ instead of ‘os.path.join’. That’s absolutely right, but not as useful as I wanted it to be. When I searched for samples of ‘os.path.join’, I got 0 results. I tried several ways to map ‘posixpath’ back to ‘os.path’, but none worked for all scenarios.
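
The mismatch is easy to reproduce in a stock interpreter. This small sketch builds a name from the function object's own attributes, which is roughly the information a runtime log has to work with; `reported_name` is a helper made up for this illustration.

```python
import os


def reported_name(func):
    """Full name as seen at runtime: the defining module plus the function name."""
    return f"{func.__module__}.{func.__name__}"


if __name__ == "__main__":
    # Prints "posixpath.join" on POSIX systems and "ntpath.join" on Windows,
    # never "os.path.join", because os.path is only an alias for one of them.
    print(reported_name(os.path.join))
```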

The second problem is more deadly. Many libraries require external input, like a configuration file, a database, an API key, or some user interaction. My Python indexer ran code blindly, so it covered initialization code pretty well but usually didn’t hit any interesting parts.

I abandoned this approach after a month. But it was not a total failure. While trying to solve the first problem, I found a way that works better, the way I’m using today. I’ll explain it in the next post.



Nullege Internals

This entry is part 1 of 4 in the series Nullege Internals

Ever since I launched this site to the public, people have kept asking how I did it. This request has been sitting at the top of my feature request list since day one. So I decided to create this series to share what I’ve learned while developing this site.

I started this project in late 2009. Back then, I was working on a desktop tool called Hacker’s Clock. It was based on wxPython, a Python wrapper for a cross-platform GUI library called wxWidgets, which is written in C++. Both wxWidgets and wxPython are great. wxWidgets has every feature I wanted, and wxPython has interfaces for pretty much everything in wxWidgets. However, wxPython’s documentation is not as complete as its interface. A lot of the time, I had to refer to the documentation for the C++ or Ruby interface. Sometimes I could find a sample on Google Code Search. That helped a lot, but this kind of luck didn’t happen often. For an OO language like Python, Google Code Search usually works better on comments than on actual code. In an OO language, you need to understand the language to know what a variable refers to and where a method call actually goes. That is why I started Nullege, a search engine that understands Python.

Nullege has a web crawler, a source code crawler, a Python indexer, and a web portal. In the following posts, I’ll introduce them one by one, starting with the most interesting one, the Python indexer.
