Saturday, December 14, 2013

Plot XKCD style charts from Google Ngram data using Python

In this GitHub repo you'll find two Python scripts: one fetches data from the Google Ngram Viewer, and the other creates XKCD style plots from the CSV data returned by the first script.

I previously made a blog post introducing the script for retrieving data from the Google Ngram Viewer, but it now has a lot more functionality. With the exception of one or two advanced features available in the Google Ngram Viewer web interface, the getngrams.py script is a fully functional command-line interface to the Google Ngram Viewer. To learn how to use the script, check out the README file in the GitHub repo.

The accompanying xkcd.py script provides automatic creation of XKCD style plots when retrieving ngram data using getngrams.py. Simply add the -plot flag to your query and a line chart in a .png file will be created alongside the CSV data file.

For example, to create the plot shown above, you could run:

python getngrams.py railroad,radio,television,internet -startYear=1900 -endYear=2000 -plot -caseInsensitive

There are other ways to plot the ngram data as well, so read the plotting section of the README to learn more.
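If you would rather plot a CSV file yourself, here is a minimal sketch of the idea in Python 3 using matplotlib's built-in plt.xkcd() style. It is not the code from xkcd.py. The helper names read_ngram_csv and plot_xkcd are made up for this example, and the CSV layout (a year column first, then one column per ngram) is an assumption, so check it against the file that getngrams.py actually writes.

```python
import csv


def read_ngram_csv(path):
    """Read a CSV into (years, {column name: [values]}).

    Assumption: the first column holds the year and every other
    column holds the values for one ngram.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        years = []
        series = {name: [] for name in header[1:]}
        for row in reader:
            years.append(int(row[0]))
            for name, value in zip(header[1:], row[1:]):
                series[name].append(float(value))
    return years, series


def plot_xkcd(path, out_png):
    """Draw every series in the CSV as an XKCD style line chart."""
    # Imported here so the CSV helper works without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    years, series = read_ngram_csv(path)
    with plt.xkcd():  # hand-drawn look for everything created in this block
        fig, ax = plt.subplots()
        for name, values in series.items():
            ax.plot(years, values, label=name)
        ax.set_xlabel("year")
        ax.legend()
        fig.savefig(out_png)
    plt.close(fig)
```

Calling plot_xkcd("ngrams.csv", "ngrams.png") (both file names are placeholders) would write the chart next to the data file.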

Friday, May 3, 2013

Analysis of Country-Level Search Engine Market Shares

Here is an IPython Notebook where I look at the market shares of search engines across countries and time. First I look at each country one-by-one, then I compute the Herfindahl-Hirschman Index (HHI), a standard measure of market concentration, for each country and compare the values over time.

As shown in the Notebook, Google dominates the search engine market in most countries, except China where government regulation has largely pushed Google out of the market. Russia also tells a slightly different story as Google has fairly strong competition from the local search engine, Yandex.
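For readers unfamiliar with the HHI, it is the sum of the squared market shares of all firms in a market. With shares expressed in percent it ranges from near 0 (many tiny firms) to 10,000 (a single monopolist). Here is a small sketch of the calculation; the share values are made up for illustration and are not taken from the Notebook.

```python
def hhi(shares_percent):
    """Return the Herfindahl-Hirschman Index for market shares given in percent."""
    return sum(share ** 2 for share in shares_percent)


# Hypothetical market shares in percent, for illustration only.
concentrated = [90.0, 5.0, 3.0, 2.0]
competitive = [25.0, 25.0, 25.0, 25.0]

print(hhi(concentrated))  # 8138.0
print(hhi(competitive))   # 2500.0
```

A market dominated by one search engine lands close to the 10,000 ceiling, while an evenly split market scores much lower.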

Friday, March 15, 2013

MapReduce Word Count with GPUs using Pycuda

I recently came across an article demonstrating how to count words in a txt file using GPUs with a MapReduce algorithm. Having access to a monster rig at work with 4 NVIDIA Tesla C2075 GPUs, I decided to give it a try. Their code didn't work out of the box, but I made some changes to it and managed to get it working while also speeding it up a bit.

Here is my version of their script:

from pycuda import gpuarray
from pycuda.reduction import ReductionKernel
import pycuda.autoinit
import numpy as np
import time


def createCudawckernal():
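    # 32 is the ASCII code for a space. The mapper yields 1 wherever a space
    # is followed by a non-space character (the start of a word), else 0.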
    mapper = "(a[i] == 32)*(b[i] != 32)"
    reducer = "a+b"
    cudafunctionarguments = "char* a, char* b"
    wckernal = ReductionKernel(np.dtype(np.float32), neutral="0",
                               reduce_expr=reducer, map_expr=mapper,
                               arguments=cudafunctionarguments)
    return wckernal


def createBigDataset(filename):
    print "Reading data"
    dataset = np.fromfile(filename, dtype=np.uint8)
    # Original data plus 100 appended copies (101 copies in total)
    dataset = np.tile(dataset, 101)
    print "Dataset size = ", len(dataset)
    return dataset


def wordCount(wckernal, bignumpyarray):
    print "Uploading array to gpu"
    gpudataset = gpuarray.to_gpu(bignumpyarray)
    datasetsize = len(bignumpyarray)
    start = time.time()
    wordcount = wckernal(gpudataset[:-1], gpudataset[1:]).get()
    stop = time.time()
    seconds = (stop-start)
    estimatepersecond = (datasetsize/seconds)/(1024*1024*1024)
    print "Word count took ", seconds*1000, " milliseconds"
    print "Estimated throughput ", estimatepersecond, " Gigabytes/s"
    return wordcount

if __name__ == "__main__":
    print 'Downloading the .txt file'
    from urllib import urlretrieve
    txtfileurl = 'https://s3.amazonaws.com/econpy/shakespeare.txt'
    urlretrieve(txtfileurl, 'shakespeare.txt')
    print 'Go Baby Go!'
    bignumpyarray = createBigDataset("shakespeare.txt")
    wckernal = createCudawckernal()
    wordcount = wordCount(wckernal, bignumpyarray)
    print 'Word Count: %s' % wordcount

Assuming you have an NVIDIA GPU and PyCUDA installed, save the script above as gpu_wordcount.py (it is written for Python 2), then open a terminal and run:

    python gpu_wordcount.py

The output on my machine looks like this:

    Downloading the .txt file
    Reading data
    Dataset size =  101250379
    Uploading array to gpu
    Word count took  8.30793380737  milliseconds
    Estimated throughput  11.3502064216  Gigabytes/s
    Word Count: 17726106.0

The script downloads a 1MB txt file, reads it into a NumPy array, appends 100 copies of the data to the original (just to try it out easily with a dataset that is significantly larger than 1MB), then counts the words in the array by counting every space character that is followed by a non-space character.
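To make the map and reduce steps concrete, here is a plain Python sketch of the same calculation on the CPU. The function name count_word_starts is made up for this example. Two things follow from the mapper expression: the first word is not counted unless the text starts with a space, and newlines or tabs are not treated as separators.

```python
def count_word_starts(data):
    """Count positions where a space (ASCII 32) is followed by a non-space byte."""
    data = bytearray(data)  # iterate over integer byte values
    # zip(data, data[1:]) pairs each byte with its successor, playing the
    # role of gpudataset[:-1] and gpudataset[1:] in the GPU script.
    # The generator is the "map" step and sum() is the "reduce" step.
    return sum(1 for a, b in zip(data, data[1:]) if a == 32 and b != 32)


sample = b"to be or  not to be"  # six words, with a double space after "or"
print(count_word_starts(sample))  # 5
```

The double space is counted only once, because the first of the two spaces is followed by another space.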

Comparing the GPU script to a regular Python script that does the same thing, the GPU script was 355 times faster! Eventually I'd like to build out this example to count the frequency of all the unique words in a txt file, rather than just counting the number of total words. Easier said than done, so if you have any advice I'm all ears!
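For reference, counting unique word frequencies on the CPU is short in plain Python. This sketch is not a GPU solution; it is only a baseline that a future GPU version could be checked against. The function name word_frequencies is made up for this example, and lowercasing the text before splitting is my own choice.

```python
from collections import Counter


def word_frequencies(text):
    """Return a Counter mapping each lowercased word to its number of occurrences."""
    # str.split() with no arguments splits on any run of whitespace.
    return Counter(text.lower().split())


freqs = word_frequencies("To be or not to be")
print(freqs["to"])   # 2
print(freqs["not"])  # 1
print(len(freqs))    # 4 unique words
```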

Thursday, March 7, 2013

Querying and Analyzing Google Domestic Trends Data

Similar to Google Trends, Google Domestic Trends is a set of indices created by aggregating search volumes for groups of queries that are related to a specific sector.

In this IPython Notebook, I run some statistical tests in Python on Google Domestic Trends data, using searches by automotive buyers (queries such as "cars, kelly blue book, auto, used cars, toyota, autotrader") to try to predict the volume of search queries related to automotive financing (queries such as "lease, mileage, loan calculator, auto loan, car payment").
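As a rough illustration of what predicting one index from another can look like, here is a sketch of a simple least-squares fit in plain Python 3. It is not necessarily the method used in the Notebook. The function name ols_fit and the index values are made up for this example.

```python
def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Made-up index values, for illustration only.
buyers = [1.0, 2.0, 3.0, 4.0]
financing = [3.0, 5.0, 7.0, 9.0]

slope, intercept, r_squared = ols_fit(buyers, financing)
print(slope, intercept, r_squared)
```

With real data the fit will not be perfect, and the R-squared value gives a first impression of how much of the variation in the financing index the buyers index accounts for.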

I also do some basic tests of periodicity in the data, as well as provide a Python wrapper for querying Google Domestic Trends to return a pandas DataFrame.
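One basic way to look for periodicity is the sample autocorrelation: a series that repeats every k steps correlates strongly with itself shifted by k. The sketch below shows the idea in plain Python 3. It is not necessarily the test used in the Notebook, and the function name autocorrelation and the series are made up for this example.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


# Made-up series that repeats every 4 steps, for illustration only.
series = [0.0, 1.0, 0.0, -1.0] * 6

# The lag with the highest autocorrelation suggests the period.
best_lag = max(range(1, 7), key=lambda lag: autocorrelation(series, lag))
print(best_lag)  # 4
```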