Protected: Readability scoring of the United Nations Corpus

This content is password protected. To view it, drop me a line at info@danielpradilla.info, and enter your password below:

Recommender system for finding subject matter experts using the Enron email corpus

This is a little project to create a recommender system to find mentors inside an organization, using Natural Language Processing. It started as an excuse to build a data visualization I had in mind: an interactive word cloud that… did something. When I started, I didn’t know anything about Topic Modeling, Topic Extraction, or Natural Language Processing; and fell head first into a rabbit hole.

TL;DR:

Topic extraction is deep and potentially rewarding. Sanitize properly. SpaCy and Gensim are your friends. Search YouTube for knowledge. This is related to “Topic Extraction from Scientific Literature for Competency Management” and “The Author-Topic Model for Authors and Documents“. Get the code of this project at https://github.com/danielpradilla/enron-playground

»

Linear Optimization with or-tools

 

Getting started

Over the last couple of months I’ve been getting my feet wet with linear programming and mathematical optimisation. I got a sense of how it all worked from this Discrete Optimisation course in Coursera and googling around I discovered that there are a ton of tools out there to help you solve optimisation problems. Makes sense, why would you want to implement a solving algorithm from scratch when some of the best minds in the history of mankind have already given it a shot?

Solvers, as these tools are often called, can reach the hundreds of thousands of dollars… and they are worth every penny! But since I wanted to play with one without forking up the dough, I narrowed my search down to open-source options.

Hans Mittelmann from the Arizona State University performs regular automated benchmarks on different mathematical optimisation tools. He publishes the benchmarks at http://plato.asu.edu/bench.html. Following what I read in the results, I picked Google Optimization Tools (OR-Tools) because it performed fairly well and… well, because if you’re looking for a sinister tool to model and solve hard problems in the human world, you can rarely go wrong with Google.

GLOP has C++ and Python APIs. I’m better at Python and I expected to quickly put together a web front-end for this, so I picked the latter one.

TL;DR. Just gimme the code

You can get the full code here.

 

The Problem

I picked a variation of the knapsack problem: a grocery-shopping example in which you try to maximise the number of calories you can buy with a limited budget. I got the problem from this blog post:

http://www.jasq.org/just-another-scala-quant/new-agey-interviews-at-the-grocery-startup

Basically: You walk into a grocery store with a grocery bag and some cash, to buy groceries for a week. You need to follow these rules:

1. Your bag can hold ten pounds.
2. You have $100
3. You need about 2000 calories a day, so a weekly shopping trip is about 14,000 calories.
4. You must purchase at least 4 ounces of each grocery item.

These are the groceries you can by and their price per pound:

Ham:     650 cals,  $4
Lettuce:  70 cals,  $1.5
Cheese: 1670 cals,  $5
Tuna:    830 cals, $20
Bread:  1300 cals,  $1.20

 

Installing OR-Tools

Follow the instructions at https://developers.google.com/optimization/introduction/installing. You will need Python and Python setuptools installed in your machine.

 

Coding the problem

We can split the coding of the problem into 6 elements:

1. Identify the problem

We tell or-tools that we are attempting to solve a linear programming problem. We create a solver variable that is going to contain all the necessary items to solve the problem.

from ortools.linear_solver import pywraplp
solver = pywraplp.Solver('SolveSimpleSystem',pywraplp.Solver.GLOP_LINEAR_PROGRAMMING)

 

2. Ingest the input

We are going to send our table of possible groceries, calories and prices as a nested list:

food = [['ham',650, 4],
 ['lettuce',70,1.5],
 ['cheese',1670,5],
 ['tuna',830,20],
 ['bread',1300,1.20]]

 

3. Configure the decision variables

We need 5 decision variables which contain how many pounds of each product you are going to buy. Instead of creating 5 variables, we can create a list of size 5 (the size of the food list). Each item in the list will contain a decision variable.

Each decision variable will be created with a call to the NumVar method of the solver variable, passing the minimum amount of groceries we can buy, the maximum (infinity), and a unique name for the variable (contained in the previously-defined food list).

    #food is a list of groceries, calories and prices
    variable_list = [[]] * len(food)
    for i in range(0, len(food)):
        #you must buy at least minShop of each
        variable_list[i] = solver.NumVar(minShop, solver.infinity(), str(food[i][0]))

They pythonic way of writing that loop is using a list comprehension. However I’m using the loop for readability.

 
    #same thing but with comprehension
    variable_list=[solver.NumVar(minShop, solver.infinity(), str(food[i][0])) for i in range(0, len(food))]

 

4. Configure the constraints

This is where most of the magic happens. We will create one constraint per “rule” specified in the problem description.

In linear programming each constraint is specified in terms of addition of the decision variables:

lower bound <= var1+var2+var3… <=upper bound

In the knapsack problem, the conversion is pretty straightforward. In some other cases, you have to re-think and re-model your problem in these terms.

We will create a list of 3 constraints, calling Constraint(lower bound, upper bound) for each one, and then walk the variables list and call SetCoefficient for each of the variables.

 
    #Define the constraints    
    constraint_list=[]
    #Constraint 1: totalWeight<maxWeight
    #ham + lettuce + cheese + tuna + bread <= maxWeight
    constraint_list.append(solver.Constraint(0, maxWeight))
    for i in range(0, len(food)):
        constraint_list[0].SetCoefficient(variable_list[i],1)

    #Constraint 2: totalPrice<=maxCost 
    constraint_list.append(solver.Constraint(0, maxCost)) 
    for i in range(0, len(food)): 
        constraint_list[1].SetCoefficient(variable_list[i],food[i][2]) 

    #Constraint 3: totalCalories>=minCals
    constraint_list.append(solver.Constraint(minCals, minCals + 100))
    for i in range(0, len(food)):
        constraint_list[2].SetCoefficient(variable_list[i],food[i][1])

Note that the 4th rule of the problem, “You must purchase at least 4 ounces of each grocery item,” is already coded in the variables definition.

 

5. Configure the objective function

Similar to the constraint definition, the goal function is specified in terms of addition of the decision variables:

goal = Maximize/Minimize (var1+var2+var3…)

If we wish to minimize cost, we walk our variable list, get the price from the food list, and set the objective.

 
        for i in range(0, len(variable_list[)):
            objective.SetCoefficient(variable_list[i], food[i][2])
        objective.SetMinimization()

Say we wanted to maximize calories intake. We would do the same, but taking the calories value from the food list, and setting a maximization goal

# Define our objective: maximizing calories
for i in range(0, len(food)):
    objective.SetCoefficient(variable_list[i], food[i][1])
objective.SetMaximization()

 

6. Solve!

After all these configuration steps, we just call the solve method against the solver variable and print out a solution if we find it.

 
    result_status = solve(solver)

    if result_status == solver.OPTIMAL:
        print('Successful solve.')
        # The problem has an optimal solution.
        print(('Problem solved in %f milliseconds' % solver.wall_time()))
        # The objective value of the solution.
        print(('Optimal objective value = %f' % solver.Objective().Value()))
        # The value of each variable in the solution.
        var_sum=0
        for variable in variable_list:
            print(('%s = %f' % (variable.name(), variable.solution_value())))
            var_sum+=variable.solution_value()
        print(('Variable sum = %f' % var_sum));

        print('Advanced usage:')
        print(('Problem solved in %d iterations' % solver.iterations()))

        for variable in variable_list:
            print(('%s: reduced cost = %f' % (variable.name(), variable.reduced_cost())))
        
        activities = solver.ComputeConstraintActivities()
        for i, constraint in enumerate(constraint_list):
            print(('constraint %d: dual value = %f\n'
              '               activity = %f' %
              (i, constraint.dual_value(), activities[constraint.index()])))

    elif result_status == solver.INFEASIBLE:
        print('No solution found.')
    elif result_status == solver.POSSIBLE_OVERFLOW:
        print('Some inputs are too large and may cause an integer overflow.')

 

You can get the full code here.

 

Next steps

We got a solution, but you need to know a little bit of python to run this program, change its inputs or read the solution. Wouldn’t it be nice to have some sort of user-friendly UI? This will be the subject of future posts.

Building a web front-end with falcon and gunicorn

Containerizing the solution with docker

 

How to tidy up and organize music on your PC o Mac

Do you have a lot of Untitled / Without Artist / Without Album songs?

All music players allow you to sort and catalog the songs, as long as you know the titles and names of the artists. But what happens when many of your songs are unidentified?

There are some tools out there like Tune Up and Jaikoz, but you have to pay for them.

So if you’re looking for a free way to sort, catalog, tidy up and organize the music you’ve downloaded, the solution is Picard, a Free mp3 classifying application that relies on Musicbrainz, the second largest database of music on the internet (and the only free one).

To begin, click “add files” or “add folder”. I recommend that you test with a folder with few songs.
(If you do not know where your songs are, you’ll usually find them in the “Music” folder in your user directory, at the same level as “Documents”)

Once you have added the files, click on “Lookup”.

At the end of the analysis, you’ll have a list on the right hand side with the found albums, each one with songs inside. If you are very lucky, all your songs will be identified. If you are like the rest of us, Picard will have identified only some of the songs.

Typically, these types of programs are very good identifying English Pop / Rock. So I use Salsa to test how good they are. In this case, I used a folder with twenty mp3 songs by Ismael Rivera and it only didn’t recognize 6. Not bad.

It is at this point that we can “help” Picard by dragging mp3 from the left to the correct album to the right (or in between albums on the right, you can also do that). Once you do, Picard will recognize the song and assign to it the correct data!

You can help Picard further by going to Options, Fingerprinting and activating Fingerprinting with AmpliFIND. This will allow you to use the “Scan” icon to identify songs by their sound pattern.

 

When you are finished, choose the album on the right and click on “Save”. Then remove it from the list and continue working with the rest.

 

I recommend that you also download the plugin to download album art. To do so, visit the plugins page and download the latest version of Cover Art Downloader. Then, within the Picard options, go to the plugins section, click on “Install Plugin” and look for the coverart.py file in your downloads folder.

 

 

A swiss railway clock in D3.js

Screen Shot 2015-06-05 at 5.26.54 PM

 

The other day I saw this cool JavaScript+CSS clock and I immediately thought that I wanted to create an analog clock in D3.js. First of all as an exercise, but also with the aim —perhaps— of smuggling it in to a future dashboard 😉

It seemed straightforward: just paint the clock and re-paint it each second. Right? Well, yes if you’re lazy. The better way is to paint the clock and periodically rotate the hands.

As with most things on the internet, turns out I was late to the party. Some much more clever people have built D3 clocks and even a multi-clock D3 visualization for displaying multiple time zones.

Good. I had some ideas and a starting point.

I’m in love with the Swiss Federal Railways (SBB) clock, designed by Hans Hilfiker. It’s an emblem of the most effective train system in the world, and a testimony of exquisite swiss modernism. I wanted to make that.

There’s already an SVG visualization of the SBB clock (here’s the code). As I was checking it out, it made me realize for the first time that the axis of rotation is not at the end of the hands, but like at 10% of its length. And that extra length is different for the seconds hand. Also, that ball at the end of the seconds hand was going to be tricky.

As with all great designs of that era, everything is proportional in that clock. I noticed that I would be able to derive all the measurements from the width of the minute ticks. So I defined a starting measurement unit and multiplied away!

The key for doing the hands rotation in D3 is to enclose each hand in its own g element and rotate those g elements using setInterval. The rotation of the hands in the SBB clock is smooth, and thankfully D3 offers transitions for seamless interpolations.

The seconds and minutes hands have to rotate 6˚ each time (360˚÷60) and the hours hand has to rotate 30˚. Since the clock shows only 12 hours, you have to multiply 30˚ by the modulo 12 of the hours.

   
clockHand.transition().duration(1000).ease('linear').attr('transform', function(d,i) {
	if (d.unit==='hours'){
		return 'rotate('+d.numeric%12 * 30+')';
	} else {
		return 'rotate('+d.numeric * 6+')';
	}
});

I was able to quickly put together a very rudimentary version and tweak the design little by little. Then I decided not to waste my first tests, got a little bit carried away and externalized all the functions that defined the characteristics of the clock. I ended up with a module that enables you to create multiple instances of the clock, each with its own configuration, defined in the ‘face’ variable of the configuration object.

d3clock({
	target:'#sbb',
	face:'sbb',
	width:500,
	//date:'Mon May 25 2015 10:09:37',
	TZOffset:{
		hours:0
	}
});



You can get the code for the library here.

Mind you, this implementation has its differences with the SBB clock (perhaps that will spare me from a takedown notice?):

  • The SBB clock makes a 2 second pause at the end of each rotation. Yes, crazy, look.
  • The hands in the original clock are not rectangles, but very tall truncated triangles.