Browsing projects by Tag(s)


Beautiful Soup parses XML and HTML as seen in the wild, and provides a variety of methods and Pythonic idioms for iterating and searching the parse tree. Beautiful Soup development is now done at https://www.launchpad.net/beautifulsoup. The discussion forum is still at http://groups.google.com/group/beautifulsoup/.

4.8
   
  0 reviews  |  14 users  |  1,966 lines of code  |  0 current contributors  |  Analyzed 9 days ago
 
 

Scrapy is a fast high-level scraping and web crawling framework, used to crawl websites and extract structured data from their pages. It can be used for a wide range of purposes, from data mining to monitoring and automated testing.

5.0
 
  0 reviews  |  14 users  |  17,047 lines of code  |  39 current contributors  |  Analyzed 11 days ago
 
 

pyquery

5.0
 
  0 reviews  |  3 users  |  2,120 lines of code  |  6 current contributors  |  Analyzed 3 days ago
 
 

ScraperWiki is an online tool for screen scraping and data mining.

0
 
  0 reviews  |  1 user  |  155,830 lines of code  |  9 current contributors  |  Analyzed 9 days ago
 
 

Pythonic Crawling / Scraping Framework Built on Eventlet

Features
* High-speed web crawler built on Eventlet.
* Supports database engines such as PostgreSQL, MySQL, Oracle, and SQLite.
* Command-line tools.
* Extract data using your favourite tool: XPath or PyQuery (a jQuery-like library for Python).
* Cookie handlers.
* Very easy to use (see the example).

Documentation: http://packages.python.org/crawley/

0
 
  0 reviews  |  1 user  |  3,527 lines of code  |  1 current contributor  |  Analyzed 8 days ago
 
 

Perl web scraping toolkit

5.0
 
  0 reviews  |  1 user  |  1,053 lines of code  |  1 current contributor  |  Analyzed 3 days ago
 
 

An HTML extractor in JavaScript.

Usage
* jhe_im(extract_conditions...) returns the inner HTML of the tags that match the extract conditions.
* jhe_om(extract_conditions...) returns the outer HTML (the tag itself plus its contents) of the tags that match the extract conditions.
* jhe_ma(extract_conditions..., attributeName) returns the value of the named attribute on the tags that match the extract conditions.
* jhe_mt(extract_conditions...) returns the text inside the tags that match the extract conditions.

Each method returns an array.

Extract conditions
The extract conditions are a variable-length argument list used to locate the position you want. Each condition can be one of:
* htmlTagName, such as 'div', 'a'...: the name of the tag to locate.
* @attributeName=attributeValue, such as '@class=red', '@id=container': the tag's attribute must equal the given value.
* @@attributeName=attributeRegexValue, such as '@@class=\w+', '@id=1-9': the tag's attribute must match the given regular expression.
* ^htmlTagName: the first tag in the current HTML string must be htmlTagName.
* >htmlTagName: the tag immediately following the previous tag must be htmlTagName.

Examples
* "div1 ".jhe_im("div") return: ["div1 "]
* "div1".jhe_ma("div", "id") return: ["attr_div1"]
* "div1 ".jhe_mt("div") return: ["div1"]
* "div1div2".jhe_om("div") return: ["div2", "div2"]
* "div1div2 content".jhe_im("div", "@id=div2") return 'div2 content'
* "div1div2".jhe_im("div", "p") return ["div1", "div2"]
* "div1div2".jhe_im("div", ">p") return ["div1"]
* "11 div2".jhe_im("^div") return []
* "div211 ".jhe_im("^div") return ["div2"]

There are more examples in the unit tests.
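The idea behind the tag and @attributeName=attributeValue conditions can be sketched in Python using only the standard library. This is an illustration, not the project's JavaScript implementation: the names match_text and _TextMatcher are hypothetical, only exact-match attribute conditions are handled, and a matching tag nested inside another match is folded into the outer result.

```python
from html.parser import HTMLParser


class _TextMatcher(HTMLParser):
    """Collects the text inside every element that matches a tag name
    and a set of exact attribute values (hypothetical helper)."""

    def __init__(self, tag, wanted):
        super().__init__()
        self.tag = tag
        self.wanted = wanted
        self.depth = 0      # > 0 while inside a matching element
        self.results = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Track nesting of same-named tags so the right end tag closes the match.
            if tag == self.tag:
                self.depth += 1
            return
        found = dict(attrs)
        if tag == self.tag and all(found.get(k) == v for k, v in self.wanted.items()):
            self.depth = 1
            self._buf = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self._buf))

    def handle_data(self, data):
        if self.depth:
            self._buf.append(data)


def match_text(html, tag, *conditions):
    """Return the text of each `tag` element whose attributes satisfy
    every '@name=value' condition (assumed, simplified syntax)."""
    wanted = {}
    for cond in conditions:
        name, _, value = cond.lstrip("@").partition("=")
        wanted[name] = value
    parser = _TextMatcher(tag, wanted)
    parser.feed(html)
    parser.close()
    return parser.results


page = '<div id="a">one <b>x</b></div><div id="b">two</div>'
print(match_text(page, "div"))
print(match_text(page, "div", "@id=b"))
```

The regular-expression (@@), first-tag (^) and next-tag (>) conditions are left out of this sketch.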

0
 
  0 reviews  |  0 users  |  1,538 lines of code  |  0 current contributors  |  Analyzed 7 days ago
 
 

Python program that scrapes all types of web sources for torrents, whether an RSS feed, an Atom feed, or even a website that doesn't have aggregating feeds. Supports episodes (series), smart downloading, and regexp filtering, and hopefully a nice simple GUI, but at least a nice modular GUI configurable through a config file.

0
 
  0 reviews  |  0 users  |  0 current contributors  |  Analyzed 8 days ago
 
 

Content Extractor is professional data-mining software that organizes collected information for convenient use. You can use it for regular automatic data collection or for manual extraction of any web content. The program is very accurate and collects data from pages associated with the specified source. It comes with a handy built-in browser and can save data in MS Excel .xml, .html and .csv formats. Please read the Content Extractor Quickstart to start working with Content Extractor. If you want to give some feedback, don't hesitate to contact us. All comments are welcome at mc.vertix@gmail.com

0
 
  0 reviews  |  0 users  |  31,839 lines of code  |  0 current contributors  |  Analyzed 4 days ago
 
 

This project aims to use Manchester City Council meeting minutes as a basis for highlighting the good work that Councillors do, and to find out further information on the activity of Councillors and the decision-making process of our Council. We will be using a Minute Scraper as a basis for a Meeting Attendee List, and moving on from that towards http://TheyWorkForYou.com-like functionality, but based on Manchester Council. It's early days, but we had a hack day on Saturday 29th November, 2008 and came up with a few basic scraping scripts and an idea of how it might move forward. On Saturday February 7th, 2009 we met up, and more progress was made. Everyone went away motivated to do more. On Saturday March 7th, 2009 we met up once again, teaming up with the National Hack The Government Day; see http://tyoc.co.uk for the details. The next tyoc collaboration day will be held on Saturday May 2nd, where it's hoped to make progress on this and other development streams. Please get in touch (http://ianmoss.com/contact) if you'd like to help out, and we can add you to our mailing list. If you can't make this one, you are welcome to contribute online, and further dates that you may be able to make will be announced shortly after that. This project is taking place under the banner The Year Of Collaboration. More information about this can be found at http://alteris.blogspot.com, or why not see what people have been tweeting about it: http://search.twitter.com/search?q=%23tyoc Please use #tyoc to tag these events. The Year Of Collaboration will have its own website (at tyoc.co.uk) in due course.

0
 
  0 reviews  |  0 users  |  10,550 lines of code  |  0 current contributors  |  Analyzed about 6 hours ago
 
 
 
 

Copyright © 2013 Black Duck Software, Inc. and its contributors, Some Rights Reserved. Unless otherwise marked, this work is licensed under a Creative Commons Attribution 3.0 Unported License. Ohloh ® and the Ohloh logo are trademarks of Black Duck Software, Inc. in the United States and/or other jurisdictions. All other trademarks are the property of their respective holders.