{"id":2245,"date":"2018-11-20T14:30:52","date_gmt":"2018-11-20T18:30:52","guid":{"rendered":"https:\/\/www.danielpradilla.info\/blog\/?p=2245"},"modified":"2018-12-04T22:49:14","modified_gmt":"2018-12-04T22:49:14","slug":"recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus","status":"publish","type":"post","link":"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/","title":{"rendered":"Recommender system for finding subject matter experts using the Enron email corpus"},"content":{"rendered":"<p><a href=\"https:\/\/unsplash.com\/photos\/JYBBcCbRaFc\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2247\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/alexandra-633017-unsplash\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash.jpg\" data-orig-size=\"1000,750\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"alexandra-633017-unsplash\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash-300x225.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash.jpg\" class=\"aligncenter size-full wp-image-2247\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash.jpg\" 
alt=\"\" width=\"1000\" height=\"750\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash.jpg 1000w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash-300x225.jpg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/alexandra-633017-unsplash-768x576.jpg 768w\" sizes=\"auto, (max-width: 1000px) 100vw, 1000px\" \/><\/a><\/p>\n<p>This is a little project to create a recommender system to find mentors inside an organization, using Natural Language Processing. It started as an excuse to build a data visualization I had in mind: an interactive word cloud that did something. When I started, I didn&#8217;t know anything about Topic Modeling, Topic Extraction, or Natural Language Processing; and fell head first into a rabbit hole.<\/p>\n<h2>TL;DR:<\/h2>\n<p>Topic extraction is deep and potentially rewarding. Sanitize properly. <a href=\"https:\/\/spacy.io\/\">SpaCy<\/a> and <a href=\"https:\/\/radimrehurek.com\/gensim\/\">Gensim<\/a> are your friends.\u00a0<a href=\"https:\/\/www.youtube.com\/results?search_query=topic+extraction\">Search YouTube<\/a> for knowledge. This is related to &#8220;<a href=\"http:\/\/ceur-ws.org\/Vol-403\/paper5.pdf\">Topic Extraction from Scientific Literature for Competency Management<\/a>&#8221; and &#8220;<a href=\"https:\/\/mimno.infosci.cornell.edu\/info6150\/readings\/398.pdf\">The Author-Topic Model for Authors and Documents<\/a>&#8220;. Get the code of this project at\u00a0<a href=\"https:\/\/github.com\/danielpradilla\/enron-playground\">https:\/\/github.com\/danielpradilla\/enron-playground<\/a><\/p>\n<p><!--more--><\/p>\n<h2>General Description<\/h2>\n<p>Imagine we would like to know who is the best person to ask about a subject inside a company \u2013a potential mentor. 
One way would be to infer each person&#8217;s speciality from their main body of work: emails.<\/p>\n<p>If I lived in another world in which privacy is not an obvious concern \u2013or if I worked at Google\u2013 reading other people&#8217;s email would be totally kosher. In the normal, privacy-compliant world, this remains a purely academic exercise.<\/p>\n<p>However, we do have access to a publicly released corpus of emails to work with: <a href=\"https:\/\/www.cs.cmu.edu\/~enron\/\">the Enron email dataset<\/a>.<\/p>\n<p>My first idea was to use a named entity recognizer (NER), because if one were designing a recommender system for an energy company, one of the use cases would be to suggest whom to ask about a very specific technical issue. At the time, I found <a href=\"https:\/\/explosion.ai\/demos\/displacy-ent\">SpaCy<\/a> to have a nice NER for Python.<\/p>\n<p>To identify the mentors, I assumed that whoever wrote an email about a subject knew something about it. I&#8217;m not\u00a0<em>that<\/em>\u00a0naive, I know that&#8217;s not always the case, but hey, that&#8217;s what I had! I could create a subject-person distribution and argue that the people at the top of each subject knew something about it. Looking at the corpus, it seemed I had to extract the &#8220;from&#8221; field and the body of each email to build this distribution. I could use the <a href=\"https:\/\/docs.python.org\/3\/library\/email.parser.html\">email<\/a> module and <a href=\"https:\/\/www.crummy.com\/software\/BeautifulSoup\/\">Beautiful Soup<\/a> for HTML emails. With a little bit of text mining, I could transform this bunch of files \u2013over 600,000!\u2013 into a structured dataset.<\/p>\n<p>Before I dived in, I had to clean the text. I thought that a couple of regular expressions would suffice, but these emails had some surprises in store:<\/p>\n<ul>\n<li>The emails themselves were stored in folders that mirrored the folder structure of the owner&#8217;s email client. 
I wasn&#8217;t counting on people using different email clients that exported different folder structures.<\/li>\n<li>There was a considerable amount of repeated emails, with multiple copies stored in different folders. I needed to ignore those if I wanted to build an accurate distribution of person-subject.<\/li>\n<li>A lot of these emails were responses or email threads. I needed to find a way to extract only the original text that belonged to the author of the email. Otherwise, I could mis-attribute the text.<\/li>\n<li>I found a ton of extraneous characters that rendered the text unreadable.<\/li>\n<\/ul>\n<p><a href=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004.jpeg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2251\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/enron-playground-presentation-004\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004.jpeg\" data-orig-size=\"1024,768\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"enron-playground-presentation.004\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004-300x225.jpeg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004-1024x768.jpeg\" class=\"alignnone 
wp-image-2251 size-medium\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004-300x225.jpeg\" alt=\"\" width=\"300\" height=\"225\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004-300x225.jpeg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004-768x576.jpeg 768w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.004.jpeg 1024w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><a href=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003.jpeg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2250\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/enron-playground-presentation-003\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003.jpeg\" data-orig-size=\"1024,768\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"enron-playground-presentation.003\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003-300x225.jpeg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003-1024x768.jpeg\" class=\"alignnone 
wp-image-2250 size-medium\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003-300x225.jpeg\" alt=\"\" width=\"300\" height=\"225\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003-300x225.jpeg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003-768x576.jpeg 768w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.003.jpeg 1024w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><a href=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002.jpeg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2249\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/enron-playground-presentation-002\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002.jpeg\" data-orig-size=\"1024,768\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"enron-playground-presentation.002\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002-300x225.jpeg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002-1024x768.jpeg\" class=\"alignnone 
wp-image-2249 size-medium\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002-300x225.jpeg\" alt=\"\" width=\"300\" height=\"225\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002-300x225.jpeg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002-768x576.jpeg 768w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.002.jpeg 1024w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><a href=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001.jpeg\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2248\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/enron-playground-presentation-001\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001.jpeg\" data-orig-size=\"1024,768\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"enron-playground-presentation.001\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001-300x225.jpeg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001-1024x768.jpeg\" class=\"alignnone 
wp-image-2248 size-medium\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001-300x225.jpeg\" alt=\"\" width=\"300\" height=\"225\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001-300x225.jpeg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001-768x576.jpeg 768w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/enron-playground-presentation.001.jpeg 1024w\" sizes=\"auto, (max-width: 300px) 100vw, 300px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>This is the group of regular expressions that produced sufficiently clean bodies.<\/p>\n<pre class=\"brush: python; title: ; notranslate\" title=\"\">\r\n\r\nimport re\r\n\r\n# Patterns that mark where forwarded, quoted or machine-generated text begins\r\nstop_regexes = &#x5B;\r\n    re.compile(r'----\\s*Forwarded by'),\r\n    re.compile(r'----\\s*Original Message'),\r\n    re.compile(r'_{20}'),\r\n    re.compile(r'\\*{20}'),\r\n    re.compile(r'={20}'),\r\n    re.compile(r'-{20}'),\r\n    re.compile(r'\\son \\d{2}\\\/\\d{2}\\\/\\d{2,4} \\d{2}:\\d{2}:\\d{2} (AM|PM)$', re.I),\r\n    re.compile(r'\\d{2}\\\/\\d{2}\\\/\\d{2,4} \\d{2}:\\d{2} (AM|PM)', re.I),\r\n    re.compile(r'^\\s?&gt;?(From|To):\\s?', re.I),\r\n    re.compile(r'LOG MESSAGES:', re.I),\r\n    re.compile(r'=3D=3D', re.I),\r\n    re.compile(r'Memo from.*on \\d{2}\\s(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(Nov|Dec)(?:ember)?),', re.I),\r\n    re.compile(r'Outlook Migration Team', re.I),\r\n    re.compile(r'PERSON~', re.I),\r\n    re.compile(r'&gt;&gt;&gt;', re.I)\r\n]\r\n\r\n<\/pre>\n<p>Since the data set is static, I didn&#8217;t have to repeat this step on every run, so I stored the extracted and sanitized data. 
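<\/p>\n<p>For reference, the sanitizing step itself can be sketched roughly as follows: walk through the body line by line and stop at the first line that matches any stop pattern, so that only the text typed by the sender is kept. This is a minimal sketch under assumptions, not the code from the repository: the <code>strip_quoted_text<\/code> helper, the shortened pattern list and the sample message are all made up for illustration.<\/p>\n

```python
import re

# Shortened, illustrative subset of the stop patterns (an assumption, not the full list).
STOP_REGEXES = [
    re.compile(r'----\s*Forwarded by'),
    re.compile(r'----\s*Original Message'),
    re.compile(r'_{20}'),
    re.compile(r'-{20}'),
]


def strip_quoted_text(body, patterns=STOP_REGEXES):
    """Keep the lines of body that come before the first line matching a stop pattern."""
    kept = []
    for line in body.splitlines():
        if any(pattern.search(line) for pattern in patterns):
            break  # everything from here on is forwarded or quoted text
        kept.append(line)
    return '\n'.join(kept).strip()


# Hypothetical message: a short reply followed by the quoted original.
sample = (
    'Sounds good, see you on Monday.\n'
    '\n'
    '-----Original Message-----\n'
    'From: someone\n'
    'Can we meet next week?'
)

print(strip_quoted_text(sample))
```

<p>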
Since I was going to be creating a web application that most probably would feed from a REST endpoint, it would be beneficial to store the data in a format that is the closest to what the API provides. I went with MongoDB as a database, that way I could &#8220;think in JSON&#8221; all the way from the database to the user interface.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2252\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/solution\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution.jpg\" data-orig-size=\"824,102\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"solution\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution-300x37.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution.jpg\" class=\"aligncenter size-full wp-image-2252\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution.jpg\" alt=\"\" width=\"824\" height=\"102\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution.jpg 824w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution-300x37.jpg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/solution-768x95.jpg 768w\" sizes=\"auto, (max-width: 824px) 100vw, 824px\" \/><\/p>\n<p>(this is close to the 
approach I took in that other post about\u00a0<a href=\"https:\/\/www.danielpradilla.info\/blog\/linear-optimization-with-or-tools-building-a-web-front-end-with-falcon-and-gunicorn\/\">building a web front-end with falcon and gunicorn<\/a>)<\/p>\n<p>If the emails were to be ingested dynamically, I would&#8217;ve chosen the <a href=\"https:\/\/www.elastic.co\">Elastic Stack<\/a> for the task, but since it was a one-off thing, I created a Python script to sanitize and then store the data in Mongo. That\u00a0<a href=\"https:\/\/github.com\/danielpradilla\/enron-playground\/blob\/master\/src\/01_load_to_mongo.py\">first script<\/a> took around 30 minutes altogether to identify and store 251K unique records. To avoid duplicates, I created an index out of the &#8220;from&#8221;, &#8220;subject&#8221; and &#8220;date&#8221; fields. Arguably, few people are capable of sending two different emails with the same subject at the same exact millisecond!<\/p>\n<p>&nbsp;<\/p>\n<h2>Text Mining \u2013 Entity Extraction<\/h2>\n<p>As I mentioned above, my plan was to use a Named Entity Recognizer. I was looking for subject matter experts, and subject matter experts deal with named entities, right? I could count how often each author used each entity and infer from that how much the sender knows about a subject. 
Or, at least, its familiarity with the subject.<\/p>\n<p><a href=\"https:\/\/explosion.ai\/demos\/displacy-ent\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2254\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/e285c413-bf11-4464-9cf7-0c851ad3a7b0-2\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1.png\" data-orig-size=\"981,167\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"E285C413-BF11-4464-9CF7-0C851AD3A7B0\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1-300x51.png\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1.png\" class=\"aligncenter size-full wp-image-2254\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1.png\" alt=\"\" width=\"981\" height=\"167\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1.png 981w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1-300x51.png 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E285C413-BF11-4464-9CF7-0C851AD3A7B0-1-768x131.png 768w\" sizes=\"auto, (max-width: 981px) 100vw, 981px\" 
\/><\/a><\/p>\n<p>I invested a few hours in that. I used <a href=\"https:\/\/spacy.io\/\">spaCy<\/a> to <a href=\"https:\/\/en.wikipedia.org\/wiki\/Lemmatisation\">lemmatize<\/a> the words \u2013an important step, as I wanted to build my model using root words that appear in the dictionary, words that made sense. I wanted &#8220;writes&#8221;, &#8220;wrote&#8221; and &#8220;writing&#8221; to be turned into &#8220;write&#8221;, not &#8220;writ&#8221;. Lemmatization is generally slower than stemming, but it produces more readable results from the user-interface perspective.<\/p>\n<p>As I was going deeper into the subject, I stumbled upon the wonderful world of <a href=\"https:\/\/en.wikipedia.org\/wiki\/Topic_model\">topic extraction<\/a>. This was my first time doing NLP, so I called <a href=\"http:\/\/hectorpalacios.net\/\">H\u00e9ctor Palacios<\/a>, who is my AI\/NLP\/ML go-to guy, and he confirmed that what I was attempting to do was topic extraction and that I should look into <a href=\"https:\/\/en.wikipedia.org\/wiki\/Latent_Dirichlet_allocation\">Latent Dirichlet Allocation<\/a> (LDA).<\/p>\n<p>LDA produces a list of topics in the form of a distribution of words per topic. The output looks like a bunch of words grouped together in lists. However, I was hellbent on finding those entities with the NER. LDA is an unsupervised algorithm and that seemed a lesser solution to me. A bag of words around certain topics? How can that be better than &#8220;this person writes a lot about the XYZ generator&#8221;? Also, how could topics made out of bags of words possibly make sense?<\/p>\n<p>However, the NER approach had a problem: I was going to need to manually parse a distribution of all the detected entities and manually build some kind of related-entities dictionary, because the person who knows a lot about the XYZ generator might also know about the ABC generator but hasn&#8217;t said much about it. I would have to manually enrich the detected entities. 
So, 15 hours into this project, I decided to give LDA a try.<\/p>\n<p>&nbsp;<\/p>\n<h2>Natural Language Processing \u2013 Topic Extraction<\/h2>\n<p>Using <a href=\"https:\/\/radimrehurek.com\/gensim\/\">gensim<\/a> \u2013which seems to be the most popular python library for topic modeling,\u2013 I ran a simple topic extraction just to see what I got in the output. Each line is one topic:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2255\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/e8f7b275-1066-49c4-b165-d01d4a38886c\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C.jpg\" data-orig-size=\"1024,238\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"E8F7B275-1066-49C4-B165-D01D4A38886C\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C-300x70.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C-1024x238.jpg\" class=\"aligncenter size-full wp-image-2255\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C.jpg\" alt=\"\" width=\"1024\" height=\"238\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C.jpg 1024w, 
https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C-300x70.jpg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/E8F7B275-1066-49C4-B165-D01D4A38886C-768x179.jpg 768w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/p>\n<p>See that natural almost-self-explanatory grouping?<\/p>\n<ul>\n<li>Topic #2 is about a gas installation in Shackleton, Saskatchewan (with a bit of Sara Shackleton, VP of Enron North America, thrown in the mix).<\/li>\n<li>Topic #7 is arguably about the newsworthy <a href=\"https:\/\/en.wikipedia.org\/wiki\/California_electricity_crisis\">California energy crisis.<\/a><\/li>\n<\/ul>\n<p>After seeing the results I realized that I needed to start trusting the literature produced by people much more intelligent than me!<\/p>\n<p>Given the results above I revised the approach:<\/p>\n<ol>\n<li>Find the topics in the corpus.<\/li>\n<li>Get the words for each topic<\/li>\n<li>Get all the words used by all the authors<\/li>\n<li>Calculate frequency of word usage per author<\/li>\n<li>Find the topics that match each author&#8217;s word frequencies<\/li>\n<\/ol>\n<p>In the example above, if a person used the words &#8220;LNG&#8221;, &#8220;gas&#8221; and &#8220;Shackleton&#8221; a lot, arguably this person is a candidate to know a lot about topic #2.<\/p>\n<p>As I was gearing towards creating this distribution of word frequencies, I found that the gensim library actually provides <a href=\"https:\/\/radimrehurek.com\/gensim\/models\/atmodel.html\">author-topic modeling<\/a>, which outputs the topic distribution of an author. <strong>Exactly<\/strong> what I was looking for. The author-topic model returns a probability distribution of how likely a topic is to be expressed by an author. 
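<\/p>\n<p>Once those per-author distributions exist, finding candidate mentors for a topic is mostly a sorting problem. The sketch below assumes the distributions have already been pulled out of the model into a plain dictionary; the addresses, the probabilities and the <code>top_experts<\/code> helper are made up for illustration, and nothing here calls gensim.<\/p>\n

```python
# Hypothetical author-topic distributions: author -> list of (topic_id, probability).
author_topics = {
    'alice@example.com': [(2, 0.70), (7, 0.30)],
    'bob@example.com': [(2, 0.15), (7, 0.85)],
    'carol@example.com': [(2, 0.55), (5, 0.45)],
}


def top_experts(author_topics, topic_id, n=3):
    """Rank authors by the probability they assign to topic_id, highest first."""
    scored = []
    for author, distribution in author_topics.items():
        probability = dict(distribution).get(topic_id, 0.0)
        if probability > 0:
            scored.append((author, probability))
    # Sort by probability only, so ties keep their original (insertion) order.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]


print(top_experts(author_topics, topic_id=2, n=2))
```

<p>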
I found two relevant papers about what I was attempting to do: &#8220;<a href=\"http:\/\/ceur-ws.org\/Vol-403\/paper5.pdf\">Topic Extraction from Scientific Literature for Competency Management<\/a>&#8221; and &#8220;<a href=\"https:\/\/mimno.infosci.cornell.edu\/info6150\/readings\/398.pdf\">The Author-Topic Model for Authors and Documents<\/a>&#8221;<\/p>\n<figure id=\"attachment_2256\" aria-describedby=\"caption-attachment-2256\" style=\"width: 510px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2256\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/df65d0c6-d01b-42e2-b91a-ae9d1b57959b\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/DF65D0C6-D01B-42E2-B91A-AE9D1B57959B.jpg\" data-orig-size=\"510,369\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"DF65D0C6-D01B-42E2-B91A-AE9D1B57959B\" data-image-description=\"\" data-image-caption=\"&lt;p&gt;What&amp;#8217;s the probability that Shelly will write about  each topic.&lt;\/p&gt;\n\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/DF65D0C6-D01B-42E2-B91A-AE9D1B57959B-300x217.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/DF65D0C6-D01B-42E2-B91A-AE9D1B57959B.jpg\" class=\"size-full wp-image-2256\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/DF65D0C6-D01B-42E2-B91A-AE9D1B57959B.jpg\" alt=\"\" 
width=\"510\" height=\"369\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/DF65D0C6-D01B-42E2-B91A-AE9D1B57959B.jpg 510w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/DF65D0C6-D01B-42E2-B91A-AE9D1B57959B-300x217.jpg 300w\" sizes=\"auto, (max-width: 510px) 100vw, 510px\" \/><figcaption id=\"caption-attachment-2256\" class=\"wp-caption-text\">The probability that Shelly will write about each topic.<\/figcaption><\/figure>\n<h3>How do I know if the list of topics is a &#8220;good&#8221; list of topics?<\/h3>\n<p>Gensim provides an automated way to measure topic coherence. A good model will generate coherent topics, that is, topics with high coherence scores. Good topics are semantically coherent: all their words seem to go together, and it&#8217;s relatively easy for a person to describe the topic with a short label. If you want to know how this measure was designed, read &#8220;<a href=\"https:\/\/people.cs.umass.edu\/~wallach\/publications\/mimno11optimizing.pdf\">Optimizing Semantic Coherence in Topic Models<\/a>&#8221;.<\/p>\n<p>Topic coherence is affected by the number of topics you choose to extract and the number of iterations you use to build the model. A higher number of iterations will generally produce a better model, but training will take longer. In some cases, you will get better results by reducing or expanding your desired number of topics. I started from 5 topics and 3 iterations and waited almost forever for a not-very-good list of topics. 
I ended up with an acceptable list at 10 topics and 2 iterations.<\/p>\n<p>Check out &#8220;<a href=\"http:\/\/svn.aksw.org\/papers\/2015\/WSDM_Topic_Evaluation\/public.pdf\">Exploring the Space of Topic Coherence Measures<\/a>&#8221; if you want to know about other coherence measurements.<\/p>\n<p>I applied additional filtering before building the model, excluding empty emails, 2-word emails, all stop words and high-occurrence words \u2013words which appeared in more than 50% of the emails.<\/p>\n<p>The latest run took 191 minutes on a single thread. This is the list of topics and how I labelled them:<\/p>\n<ul>\n<li>Topic 1 <strong>Corporate<\/strong>: <em>provide year enron employee help new company contact sincerely plan<\/em><\/li>\n<li>Topic 2 <strong>IT Services<\/strong>: <em>receive click email information mail message free service offer access<\/em><\/li>\n<li>Topic 3 <strong>Investor<\/strong>: <em>industry investment investor technology news international market policy company announce<\/em><\/li>\n<li>Topic 4 <strong>e-commerce<\/strong>: <em>visit click great special today home friend link new change<\/em><\/li>\n<li>Topic 5 <strong>Scheduling<\/strong>: <em>thank time know let meeting date work schedule good like<\/em><\/li>\n<li>Topic 6 <strong>Planning<\/strong>: <em>look know like think good year new time work plan<\/em><\/li>\n<li>Topic 7 <strong>Sports<\/strong>: <em>game play good week season team sunday player start free<\/em><\/li>\n<li>Topic 8 <strong>Contracts<\/strong>: <em>gas know thank question let deal contract day change attach<\/em><\/li>\n<li>Topic 9 <strong>Headquarters<\/strong>: <em>enron thank attach fax agreement phone know houston legal let<\/em><\/li>\n<li>Topic 10 <strong>California<\/strong>: <em>california market good issue jeff ferc state energy commission power<\/em><\/li>\n<\/ul>\n<p>I stored all the topics and all the author-topic distributions in MongoDB.<\/p>\n<p>&nbsp;<\/p>\n<h2>User Interface \u2013 API and web 
application<\/h2>\n<p>As a reminder, the whole idea is to build a recommender system, so the application needed at least three main features:<\/p>\n<ul>\n<li>Show the list of topics. Allow the user to select a topic.<\/li>\n<li>Show the list of experts for the selected topic.<\/li>\n<li>Show a sample of the emails produced by the selected expert, as a way to confirm that our selection appeals to us.<\/li>\n<\/ul>\n<p>I used <a href=\"https:\/\/gunicorn.org\/\">gunicorn<\/a> to serve <a href=\"https:\/\/github.com\/danielpradilla\/enron-playground\/tree\/master\/src\/www\/api\">a simple API<\/a> that provides JSON objects for the topics, the author-topics and the body of the emails.<\/p>\n<p>I created a D3.js bubble-chart word cloud, color-coded by topic, in which the radius of each bubble is proportional to the probability of the word appearing in that topic.<\/p>\n<p><a href=\"https:\/\/github.com\/danielpradilla\/enron-playground\/blob\/master\/src\/www\/js\/d3bubble.js\"><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2261\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/bubbles2\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/bubbles2.jpg\" data-orig-size=\"643,267\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"bubbles2\" data-image-description=\"\" data-image-caption=\"\" 
data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/bubbles2-300x125.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/bubbles2.jpg\" class=\"aligncenter wp-image-2261 size-full\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/bubbles2.jpg\" alt=\"\" width=\"643\" height=\"267\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/bubbles2.jpg 643w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/bubbles2-300x125.jpg 300w\" sizes=\"auto, (max-width: 643px) 100vw, 643px\" \/><\/a><\/p>\n<p>When you click on any word of a topic, 9 information cards appear, each one with a recommended mentor.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2259\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/mentors\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors.jpg\" data-orig-size=\"800,174\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"mentors\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors-300x65.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors.jpg\" class=\"size-full wp-image-2259 aligncenter\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors.jpg\" 
alt=\"\" width=\"800\" height=\"174\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors.jpg 800w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors-300x65.jpg 300w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/mentors-768x167.jpg 768w\" sizes=\"auto, (max-width: 800px) 100vw, 800px\" \/><\/p>\n<p>When you click on a mentor, a list of their emails appears, and if you click on a particular email, you can inspect its body below.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" data-attachment-id=\"2260\" data-permalink=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/emails\/\" data-orig-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/emails.jpg\" data-orig-size=\"500,393\" data-comments-opened=\"1\" data-image-meta=\"{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}\" data-image-title=\"emails\" data-image-description=\"\" data-image-caption=\"\" data-medium-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/emails-300x236.jpg\" data-large-file=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/emails.jpg\" class=\"aligncenter size-full wp-image-2260\" src=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/emails.jpg\" alt=\"\" width=\"500\" height=\"393\" srcset=\"https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/emails.jpg 500w, https:\/\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/11\/emails-300x236.jpg 300w\" 
sizes=\"auto, (max-width: 500px) 100vw, 500px\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>For the user interface, I built a web page using the <a href=\"https:\/\/semantic-ui.com\/\">Semantic UI framework<\/a> (semantic, get it?). This is an animation of how it works:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full\" src=\"https:\/\/danielpradilla.info\/media\/topic-bubbles.gif\" width=\"986\" height=\"769\" \/><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>This visualization was inspired by The New York Times&#8217;\u00a0<a href=\"https:\/\/archive.nytimes.com\/www.nytimes.com\/interactive\/2012\/09\/04\/us\/politics\/democratic-convention-words.html#Obama\">At the Democratic Convention, the Words Being Used<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h2>What could be better<\/h2>\n<p>I was only partially satisfied with this list of topics. I did not like some of the terms proposed by the model \u2013a lot of &#8220;thanks&#8221; in there\u2013 and I suspect that the data was skewed towards a few very prolific emailers. One way to improve these results would be to normalize the sample so that all users are equally represented. Another thing that could be attempted is to build a Frankenstein between this and a named-entity recognizer (NER), and throw everything into Elasticsearch, to have a search engine of Topics and Entities.<\/p>\n<p>I would argue that a recommender system that involves people should weight social reputation when possible. Maybe we could improve this model if we knew the job description of each author. 
Perhaps weighting people according to their expected speciality or their position in the hierarchy could help us fine-tune the results.<\/p>\n<p>&nbsp;<\/p>\n<h2>More<\/h2>\n<p>If you are interested in this topic, I can recommend:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.coursera.org\/learn\/python-text-mining\">Applied Text Mining course (Coursera)<\/a><\/li>\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=BuMu-bdoVrU\">Topic Modeling with Python (YouTube)<\/a><\/li>\n<li><a href=\"https:\/\/www.youtube.com\/watch?v=Eeg1DEeWUjA\">Recommender Systems (YouTube)<\/a><\/li>\n<li><a href=\"https:\/\/www.coursera.org\/learn\/recommender-systems-introduction\">Introduction to Recommender Systems course (Coursera)<\/a><\/li>\n<\/ul>\n<p>You may find all the code for this project at <a href=\"https:\/\/github.com\/danielpradilla\/enron-playground\">https:\/\/github.com\/danielpradilla\/enron-playground<\/a><\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>This is a little project to create a recommender system to find mentors inside an organization, using Natural Language Processing. It started as an excuse to build a data visualization I had in mind: an interactive word cloud that did something. 
When I started, I didn&#8217;t know anything about Topic Modeling, Topic Extraction, or Natural&hellip; <a class=\"more-link\" href=\"https:\/\/www.danielpradilla.info\/blog\/recommender-system-for-finding-subject-matter-experts-using-the-enron-email-corpus\/\">Continue reading <span class=\"screen-reader-text\">Recommender system for finding subject matter experts using the Enron email corpus<\/span><\/a><\/p>\n","protected":false},"author":2,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[174,331],"tags":[353,354,356,341,355],"class_list":["post-2245","post","type-post","status-publish","format-standard","hentry","category-bestof","category-software-development-en-en","tag-data-science","tag-natural-language-processing","tag-nlp","tag-python","tag-topic-extraction","entry"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack_shortlink":"https:\/\/wp.me\/p1tlzy-Ad","jetpack_sharing_enabled":true,"jetpack-related-posts":[{"id":2263,"url":"https:\/\/www.danielpradilla.info\/blog\/readability-scoring-of-the-united-nations-corpus\/","url_meta":{"origin":2245,"position":0},"title":"Readability scoring of the United Nations Corpus","author":"Daniel Pradilla","date":"01\/12\/2018","format":false,"excerpt":"Imagine you could estimate how hard would be to read a document, before reading it. Imagine you could do it for entire batches of documents you need to process. Imagine you could have a recommender system that would help you prioritize unread documents according to their difficulty. 
A bit of\u2026","rel":"","context":"In &quot;Best of&quot;","block_context":{"text":"Best of","link":"https:\/\/www.danielpradilla.info\/blog\/category\/bestof\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/12\/future-scoring-diagram-1.png?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/12\/future-scoring-diagram-1.png?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/12\/future-scoring-diagram-1.png?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2018\/12\/future-scoring-diagram-1.png?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":2152,"url":"https:\/\/www.danielpradilla.info\/blog\/hello-p5js\/","url_meta":{"origin":2245,"position":1},"title":"Having fun coding with p5.js","author":"Daniel Pradilla","date":"01\/06\/2015","format":false,"excerpt":"p5.js is an effort to port the ideas and concepts of the Processing programming language to JavaScript. 
Even though there's already processing.js \u2013which transcodes Processing code into JavaScript\u2013, p5.js is built with extensibility in mind, trough plugins, and instead of writing Processing code to be transcoded, you write pure JavaScript.\u2026","rel":"","context":"In &quot;Software Dev.&quot;","block_context":{"text":"Software Dev.","link":"https:\/\/www.danielpradilla.info\/blog\/category\/software-development-en-en\/"},"img":{"alt_text":"p5js","src":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2015\/06\/p5js.jpg?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2015\/06\/p5js.jpg?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2015\/06\/p5js.jpg?resize=525%2C300 1.5x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2015\/06\/p5js.jpg?resize=700%2C400 2x"},"classes":[]},{"id":1969,"url":"https:\/\/www.danielpradilla.info\/blog\/how-to-present-statistics\/","url_meta":{"origin":2245,"position":2},"title":"How to present statistics without boring your audience","author":"Daniel Pradilla","date":"08\/08\/2013","format":false,"excerpt":"A few days ago I found a very valuable, yet free resource for improving the way we report statistics. 
Making Data Meaningful is a series of short, sweet and free ebooks created by the United Nations Economic Commission for Europe as a practical tool to improve the way charts, tables\u2026","rel":"","context":"In &quot;Project Mgmt.&quot;","block_context":{"text":"Project Mgmt.","link":"https:\/\/www.danielpradilla.info\/blog\/category\/projectmanagement-en\/"},"img":{"alt_text":"boring_lecture","src":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2013\/08\/boring_lecture.jpeg?resize=350%2C200","width":350,"height":200},"classes":[]},{"id":1973,"url":"https:\/\/www.danielpradilla.info\/blog\/improve-sap-business-objects\/","url_meta":{"origin":2245,"position":3},"title":"How to improve your Business Objects charts","author":"Daniel Pradilla","date":"13\/08\/2013","format":false,"excerpt":"Business Objects, SAP's BI platform, is notoriously bad for data visualization. Somehow, it empowers the developers to make all the wrong decisions at the same time and create really ugly and unusable \"dashboards\". Lately, I've seen my share of ugly bobip visualizations, like the one above. 
Which would seem ok\u2026","rel":"","context":"In &quot;Project Mgmt.&quot;","block_context":{"text":"Project Mgmt.","link":"https:\/\/www.danielpradilla.info\/blog\/category\/projectmanagement-en\/"},"img":{"alt_text":"disaster, disguised as a \"dashboard\"","src":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2013\/08\/Slide-5-SAP-BusinessObjects-4.0-Event-Insight2.jpg?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2013\/08\/Slide-5-SAP-BusinessObjects-4.0-Event-Insight2.jpg?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2013\/08\/Slide-5-SAP-BusinessObjects-4.0-Event-Insight2.jpg?resize=525%2C300 1.5x"},"classes":[]},{"id":2407,"url":"https:\/\/www.danielpradilla.info\/blog\/5-lessons-i-learned-from-my-first-foray-into-clawdbot-a-local-agentic-assistant\/","url_meta":{"origin":2245,"position":4},"title":"5 lessons I learned playing with Clawdbot, a local agentic assistant","author":"Daniel Pradilla","date":"26\/01\/2026","format":false,"excerpt":"I\u2019ve spent the last few weeks playing with Clawdbot. My instance is named Clawd. If you haven\u2019t seen this category yet: think \u201cchat assistant\u201d, but with hands. It can run commands, write files, poke your integrations, and generally do the annoying glue-work you normally do by tab-switching and copy\/pasting. 
TL;DR\u2026","rel":"","context":"In &quot;Best of&quot;","block_context":{"text":"Best of","link":"https:\/\/www.danielpradilla.info\/blog\/category\/bestof\/"},"img":{"alt_text":"Clawdbot local agentic assistant","src":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2026\/01\/clawdbot.jpg?resize=350%2C200&ssl=1","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2026\/01\/clawdbot.jpg?resize=350%2C200&ssl=1 1x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2026\/01\/clawdbot.jpg?resize=525%2C300&ssl=1 1.5x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2026\/01\/clawdbot.jpg?resize=700%2C400&ssl=1 2x"},"classes":[]},{"id":1939,"url":"https:\/\/www.danielpradilla.info\/blog\/drawing-the-world-by-hand\/","url_meta":{"origin":2245,"position":5},"title":"Drawing the world by hand","author":"Daniel Pradilla","date":"02\/04\/2013","format":false,"excerpt":"At Lucerne's Gletschergarten,\u00a0among old maps, models and reliefs of the Swiss Alps, we'll find an expo from Ueli L\u00e4uppi, a local cartographer that makes hand drawings and colorings of maps using a particular projection that highlights a thorough representation of the mountains. 
Moreover, L\u00e4uppi has moved his studio to the\u2026","rel":"","context":"In &quot;Best of&quot;","block_context":{"text":"Best of","link":"https:\/\/www.danielpradilla.info\/blog\/category\/bestof\/"},"img":{"alt_text":"","src":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2012\/11\/pluviosidad.jpg?resize=350%2C200","width":350,"height":200,"srcset":"https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2012\/11\/pluviosidad.jpg?resize=350%2C200 1x, https:\/\/i0.wp.com\/www.danielpradilla.info\/blog\/wp-content\/uploads\/2012\/11\/pluviosidad.jpg?resize=525%2C300 1.5x"},"classes":[]}],"_links":{"self":[{"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/posts\/2245","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/users\/2"}],"replies":[{"embeddable":true,"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/comments?post=2245"}],"version-history":[{"count":0,"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/posts\/2245\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/media?parent=2245"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/categories?post=2245"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.danielpradilla.info\/blog\/wp-json\/wp\/v2\/tags?post=2245"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}