Flags and Lollipops

Tuesday, December 09, 2008

Connotea Ian, today

Look at that focus! The dedication!




  • New hardware is here

  • Code is all set up

  • Database files are being moved over now

  • Testing is imminent

  • Don't lose hope


Friday, December 05, 2008

Strip HTML tags from a string, Ruby edition

Get Hpricot.


require 'hpricot'
# Parse the marked-up string into a document
page = Hpricot("<b>some marked up <i>text</i></b>")
# Print only the text content, with the tags removed
puts page.to_plain_text


Interestingly, the Hpricot FAQ says:


Q: How do I strip all HTML tags from a page?
A: Use regex replace!
A2: The regex is ok, but will break in some cases, even with valid html. Try the to_plain_text or inner_text methods instead.


Wednesday, December 03, 2008

Strip HTML tags from a string, Python edition

Obtain Beautiful Soup.


from BeautifulSoup import BeautifulSoup

''.join(BeautifulSoup(page).findAll(text=True))



where 'page' is your string of text and HTML.

I'm not a Pythonista, so there might be a nicer way of doing it (Beautiful Soup is a lot of overhead). You might want to expand on this a bit to make sure spacing is handled OK, that you can keep certain tags, and so on. Feel free to post corrections or better suggestions in the comments.
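If the overhead is a concern, here's a minimal sketch of the same idea using only the standard library's html.parser module (Python 3) instead of Beautiful Soup. The names TextExtractor and strip_tags are made up for this example, and I haven't tested it against badly broken markup.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of a document (hypothetical helper)."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text found between tags
        self.chunks.append(data)


def strip_tags(page):
    """Return the text of 'page' with all tags removed."""
    parser = TextExtractor()
    parser.feed(page)
    parser.close()
    return "".join(parser.chunks)


print(strip_tags("<b>some marked up <i>text</i></b>"))
```

Note that this keeps the contents of script and style elements as text, so you'd probably want to skip those for real pages.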

Just don't use one-line <(?:.*?)> regular expressions. No, really.
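To see why, here's a small sketch of that pattern run against a perfectly valid tag whose attribute value happens to contain a ">" character. The sample markup is invented for illustration.

```python
import re

# The one-line pattern, applied to valid HTML whose attribute
# value contains a ">" character
TAG_RE = re.compile(r"<(?:.*?)>")

html = '<a title="1 > 0">link</a>'
stripped = TAG_RE.sub("", html)

# The first match stops at the ">" inside the attribute value,
# so the rest of the tag leaks into the output instead of just "link"
print(repr(stripped))
```

The output still contains the tail end of the opening tag, which is exactly the kind of breakage a real parser avoids.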


Saturday, October 25, 2008

Academia.edu

I like academia.edu. The academic family tree idea is pretty cool (I know that the concept has been around for a while in various guises but their implementation is pretty slick) and I like the fact that new visitors can arrive and be interacting with the site within minutes. It's also nice to see an academic networking site that, well, doesn't look like Facebook.

I'm also impressed by the speed at which they've been throwing up refinements and bug fixes... and by the adverts on PhD. Canny marketing (good work, poorly paid but well-fed intern)! The academia.edu team are a smart bunch of people, which is probably how they got funding in the first place.

For balance, what's not good about it? The Flash freezes my Mac on an empty cache... and the .edu TLD is really only for educational institutions, not commercial enterprises (vetting only started in 2001; academia.edu was first registered back in '99). Tsk! Ironically, my other bugbear is that I can't join properly because I work for a commercial enterprise and not an accredited educational institution.


Saturday, May 24, 2008

Disappointed with Popfly

Popfly is the mashup editor that Microsoft released last year. The idea is good. The 3D graphics are good. Silverlight is a bit buggy in Firefox (sidebars don't always redraw properly) but that's OK.

If you're going to create a web 2.0 mashup builder, though, don't you think it'd be a good idea to provide some Atom support?


Monday, May 19, 2008

Meta-analysis

The journal platform team here at NPG just rolled out machine-readable metadata for the papers we publish, in Dublin Core, PRISM (good PRISM, not to be confused with evil PRISM) and Google metadata formats.

No more scraping to get the citation for a paper automatically; it's all in the HEAD:


<meta name="citation_journal_title" content="Nature" />
<meta name="citation_publisher" content="Nature Publishing Group" />
<meta name="citation_authors" content="Paul Schenk, Isamu Matsuyama, Francis Nimmo" />
<meta name="citation_title" content="True polar wander on Europa from global-scale small-circle depressions" />
<meta name="citation_volume" content="453" />
<meta name="citation_issue" content="7193" />
<meta name="citation_firstpage" content="368" />
<meta name="citation_doi" content="doi:10.1038/nature06911" />


Useful for apps like Zotero and Connotea (which before now downloaded two files each time you bookmarked a Nature paper: the page itself and then the linked EndNote file to parse).
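As a rough sketch of how a script might pick these tags up, here's one way to do it with Python's standard-library html.parser. The names CitationMetaParser and read_citation are made up for this example, the sample page is a cut-down copy of the tags shown above, and this isn't how Zotero or Connotea actually do it.

```python
from html.parser import HTMLParser


class CitationMetaParser(HTMLParser):
    """Gather <meta name="citation_..."> tags into a dict (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.citation = {}

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are routed here as well
        if tag != "meta":
            return
        attrs = dict(attrs)
        name = attrs.get("name") or ""
        content = attrs.get("content")
        if name.startswith("citation_") and content is not None:
            # Drop the "citation_" prefix to get a shorter key
            self.citation[name[len("citation_"):]] = content


def read_citation(html):
    """Return the citation metadata found in an HTML string."""
    parser = CitationMetaParser()
    parser.feed(html)
    return parser.citation


sample = """<html><head>
<meta name="citation_journal_title" content="Nature" />
<meta name="citation_volume" content="453" />
<meta name="citation_doi" content="doi:10.1038/nature06911" />
</head><body></body></html>"""

citation = read_citation(sample)
print(citation["journal_title"], citation["volume"], citation["doi"])
```

In practice you'd fetch the page first and feed the response body to read_citation; fetching is left out here to keep the sketch self-contained.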

The metadata will be there for all papers going forward and back through some of the archives.

For fulltext indexing of papers behind the paywall you can use the linked OTMI file (I only just saw Twease, which does just that), although there's only OTMI for Nature papers at the moment, I think.

Lastly, at some point in the future we're aiming to put XMP metadata in our PDFs, which should make it much easier for scripts and applications (like Papers) to look at PDF files on your filesystem and work out what they represent.


