<?xml version="1.0" encoding="UTF-8"?>
<response>
  <status>success</status>
  <result>
    <project>
      <id>47102</id>
      <name>expertfinder</name>
      <created_at>2008-12-16T03:48:26Z</created_at>
      <updated_at>2009-01-10T16:45:10Z</updated_at>
      <description>Problem

Our project addresses the following problem: given an academic paper or any general textual query, find the researcher best suited to discuss it, based on their previous research. For example, given a paper as a PDF file, we would like to know who in the UW Computer Science department would be best suited to review it. Or, given the phrase 'probabilistic motion planning', who are the people to talk to in the top Computer Science departments in the US?

Solution

To answer this question, we will build a statistical profile for each researcher in a list, based on their published papers, and compute the cosine similarity between the query and each profile to find the best match. Our program will consist roughly of the following modules:

Obtain the data and do some basic processing 

Given the URL of a person's web page, download all of that person's publications (in PDF format).

Convert each PDF file to a text file.

Clean up the text: tokenize, remove stop words, and possibly apply stemming.

Repeat the above procedure for a list of URLs of multiple researchers' web pages (perhaps given in a file).

Build a bag-of-words statistical profile for each researcher.

Given a string, PDF file, or text file as a query, transform it into the appropriate format, calculate the cosine similarity of the document to each of the profiles, and output a sorted list of best matches.

Consider other applications of the statistical profiles, such as clustering people doing similar work.</description>
      <homepage_url>http://code.google.com/p/expertfinder</homepage_url>
      <download_url></download_url>
      <url_name>expertfinder</url_name>
      <user_count>0</user_count>
      <average_rating></average_rating>
      <rating_count>0</rating_count>
      <analysis_id>418200</analysis_id>
      <analysis>
        <id>418200</id>
        <project_id>47102</project_id>
        <updated_at>2009-12-02T14:39:27Z</updated_at>
        <logged_at>2009-12-02T14:39:03Z</logged_at>
        <min_month>2007-05-01T00:00:00Z</min_month>
        <max_month>2007-05-01T00:00:00Z</max_month>
        <twelve_month_contributor_count>0</twelve_month_contributor_count>
        <total_code_lines>3373</total_code_lines>
        <main_language_id>5</main_language_id>
        <main_language_name>Java</main_language_name>
      </analysis>
      <licenses>
        <license>
          <name>gpl</name>
          <nice_name>GNU General Public License 2.0</nice_name>
        </license>
      </licenses>
    </project>
  </result>
</response>
