AutoScholar 1.0.0 2008-03-24

To my delight, the University of Texas at Austin lets all accepted students access the scientific literature, even before the summer orientation sessions. I tried it out today and started downloading some papers. Click, click, click, let the browser load, unload, let the PDF program load in the background, click back to the browser since kpdf wants to grab attention: lots and lots of useless steps. This is obviously much more work than the O(n) of just getting the papers directly. But there was a pattern. So, having been deprived of real science and literature for the past four years, I took this opportunity to write a script to fetch as much science as I can handle -- automatically. The result is AutoScholar, a hacky Perl script that is in desperate need of improvement. AutoScholar should work with any university using EZproxy (this is most of them -- chances are, your uni is using EZproxy). Just type in the name of the paper you want. Ask and ye shall receive.

fetch.pl.txt <-- perl script

Drink away.

Future improvements

- Automatic bibliography management (generate BibTeX for the results)
- BibTeX integration (input and output)
- Log files detailing the different options for downloading the file
- ncurses
- Extract references from the paper.
----- Spidering the web of papers.
-- (1) PDF -> HTML -> TXT conversion, extract the references.
-- (2) Extract references from the website on which you find the PDF (not good if you download directly from Google Scholar).
----- (2.1) Nature provides RIS files. PROLA shows citations on a separate HTML page, one click away from the article.
--------> Include easily extendable modules to extract citations. When citations cannot be found on the page, dump the HTML output and make a silent error report (don't quit). Extensions will include checks on the URL to figure out what website it is, and then have a specialized way of getting the references, either by a few extra clicks or by downloading a certain document, whatever.
- Email/IM-based readings --> Generate sentences of text at a time, allowing rapid annotation on the part of the user with respect to the text coming at the screen. Send diagrams via DirectConnect in real time whenever a figure is referenced.
- Include "Feel Lucky" parameter so that the first Google Scholar result is chosen (not the "Get this article" link and not the "All x versions" link) even if the link is not a PDF.
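
The "easily extendable modules" idea from the list above could be sketched roughly like this. It is a Python sketch, not what fetch.pl does; the registry, the decorator, the hostname `journal.example`, and its `<li class="ref">` markup are all made up for illustration. The point is only the shape: check the URL, hand off to a site-specific extractor, and on failure dump the HTML into a report instead of quitting.

```python
import re
from urllib.parse import urlparse

# Hypothetical registry: hostname suffix -> citation extractor function.
EXTRACTORS = {}

def extractor(host_suffix):
    """Register a function as the citation extractor for one site."""
    def register(func):
        EXTRACTORS[host_suffix] = func
        return func
    return register

@extractor("journal.example")
def example_citations(url, html):
    # Made-up markup: this imaginary site lists references as <li class="ref">.
    return re.findall(r'<li class="ref">(.*?)</li>', html, flags=re.S)

def extract_citations(url, html, error_log):
    """Dispatch on the URL's host; never raise, only record failures."""
    host = urlparse(url).hostname or ""
    reason = "no extractor for this site"
    for suffix, func in EXTRACTORS.items():
        if host == suffix or host.endswith("." + suffix):
            try:
                citations = func(url, html)
                reason = "extractor found no citations"
            except Exception as exc:
                citations, reason = [], "extractor failed: %r" % (exc,)
            if citations:
                return citations
            break
    # Silent error report: keep the HTML for later inspection and carry on.
    error_log.append({"url": url, "reason": reason, "html": html})
    return []
```

Supporting a new site would then mean adding one decorated function, and the error log doubles as a to-do list of sites that still need one.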



OCR is not going well:



Some notes:
Ways to grab the PDF:
- Google Scholar immediate link if PDF. If not PDF, go to the other techniques (directly below). If none of those are going to work, then "click the link" method. *
- Google Scholar "Get this article" link (goes to UT site, I have this down)
- Google Scholar "All X versions" link, where you repeat either (1) immediate link or (2) Get this article for each version.
- Alternatively: Ambiguity Resolver Script --- show the other titles of the Google Scholar search results and let the user select which one is the right paper.
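
The order of attempts in the list above can be written down as a small fallback chain. This is a Python sketch under loud assumptions: the dictionary keys (`title_link`, `get_article_link`, `versions`) are invented names for whatever the scraper pulls out of a search result, and "is it a PDF" is guessed from the file extension alone.

```python
def looks_like_pdf(url):
    # Assumption: judge by extension only; a real check would look at the
    # Content-Type header of the response.
    return bool(url) and url.lower().split("?")[0].endswith(".pdf")

def direct_source(result):
    """Steps 1 and 2: immediate PDF link, then the "Get this article" link."""
    if looks_like_pdf(result.get("title_link")):
        return ("immediate", result["title_link"])
    if result.get("get_article_link"):
        return ("get_this_article", result["get_article_link"])
    return None

def pick_pdf_source(result):
    """Return (method, url) for the first strategy that applies, else None."""
    found = direct_source(result)
    if found:
        return found
    # Step 3: repeat steps 1 and 2 for each entry behind "All X versions".
    for version in result.get("versions", []):
        found = direct_source(version)
        if found:
            return found
    # Last resort: "click the link" and search the journal's own site.
    if result.get("title_link"):
        return ("click_the_link", result["title_link"])
    return None
```

Returning the method name alongside the URL would also feed the planned log file of download options.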

* This should bring the user to a particular journal database's website. Here, search for the text 'PDF' and 'references' and get that content. In some cases, the websites are not well configured, so a search box has to be found and results have to be followed *until* you get to the page that links to the PDF and references for that specific article. Two options:
** Code for a general site and just find the 'search engine' assuming the form is always the same across different websites (bad)
** Code in some specialty stuff for dealing with a particular database.
*** Let the user directly specify the use of this database, or the default is to just search Google Scholar. (apt-get --db=APS --author=Feynman --year=1949 ----> ncurses interface of all results, unless you specify --gf (grab first)).
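
A minimal sketch of that command line, in Python with argparse. The flags `--db`, `--author`, `--year`, and `--gf` come from the note above; the program name `autoscholar`, the default value `scholar`, and the positional title words are my assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="autoscholar",  # assumed name; the note borrows apt-get's shape
        description="Fetch a paper by title, author, or year.")
    parser.add_argument("--db", default="scholar",
                        help="database to search (default: Google Scholar)")
    parser.add_argument("--author", help="author name to search for")
    parser.add_argument("--year", type=int, help="publication year")
    parser.add_argument("--gf", action="store_true",
                        help="grab first: skip the selection interface")
    parser.add_argument("title", nargs="*", help="words of the paper title")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Something like `autoscholar --db=APS --author=Feynman --year=1949` would then parse into `db="APS"`, `author="Feynman"`, `year=1949`, with `gf` left off so the interactive list still appears.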
Define a file format for specifying a 'bundle': PDF, references, BibTeX, any sort of relevant data like that.
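
One possible shape for such a bundle is a small JSON manifest that sits next to the PDF. Nothing here is decided: the field names and the version number are placeholders, sketched in Python.

```python
import json

BUNDLE_VERSION = 1  # hypothetical format version

def make_bundle(pdf_path, references=None, bibtex=None, metadata=None):
    """Collect everything known about one paper into a plain dict."""
    return {
        "bundle_version": BUNDLE_VERSION,
        "pdf": pdf_path,                       # path to the downloaded PDF
        "references": list(references or []),  # one string per reference
        "bibtex": bibtex,                      # BibTeX entry, or None
        "metadata": dict(metadata or {}),      # title, authors, year, ...
    }

def dump_bundle(bundle):
    """Serialize a bundle to JSON text."""
    return json.dumps(bundle, indent=2, sort_keys=True)

def load_bundle(text):
    """Parse JSON text back into a bundle, rejecting unknown versions."""
    bundle = json.loads(text)
    if bundle.get("bundle_version") != BUNDLE_VERSION:
        raise ValueError("unsupported bundle version")
    return bundle
```

Keeping the manifest as plain text means it can be inspected by hand and extended with new fields later.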

- After grabbing the PDF, grab metadata, as much as possible. Then also grab any link with 'references' in it. The references might be on the page with the abstract, or they might be somewhere else. I guess this will vary from site to site, so I need to make a "site-checking interface" where I can just plug in if-thens and be done with it.
- If no references were found on the publisher's website, then check if the PDF is text or image. If text, then extract the references. If image, then flag it and alert a human.
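
The text-or-image check and the reference extraction might look like this once some external tool has turned the PDF into plain text (the conversion itself is not shown). This is a heuristic Python sketch: the 200-letter threshold, the status strings, and the assumption that entries are numbered "[1]" or "1." are all mine.

```python
import re

MIN_LETTERS = 200  # arbitrary cutoff: less text than this suggests a scan

def references_from_text(text):
    """Return (status, references) for text pulled out of a PDF."""
    letters = sum(ch.isalpha() for ch in text)
    if letters < MIN_LETTERS:
        # Probably an image-only PDF: flag it for a human.
        return ("needs_human", [])
    # Take the last "References" or "Bibliography" heading on its own line.
    headings = list(re.finditer(
        r"^[ \t]*(references|bibliography)[ \t]*$", text, flags=re.I | re.M))
    if not headings:
        return ("no_references_found", [])
    tail = text[headings[-1].end():]
    # Assume each entry starts a line with "[12]" or "12." followed by space.
    entries = re.split(r"^[ \t]*(?:\[\d+\]|\d+\.)[ \t]+", tail, flags=re.M)
    refs = [" ".join(entry.split()) for entry in entries if entry.strip()]
    return ("ok", refs)
```

Unnumbered author-year reference lists would need a different splitter, which is one more reason to keep the publisher's website as the first choice.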



See also Pubget - it's like PubMed but you get PDFs immediately. :) Web interface only. AutoScholar can become its CLI. At the moment, here's its entire web linkdomain: David Rothman, Ian Connor, some YouTube vid, moustaki.org, and pubget-blog/pubget-test. Yep. As of 2008-04-21.


2008-06-03: http://blog.programmableweb.com/2008/05/30/arxiv-an-api-for-research-grade-information/ -- integrate the arXiv API into AutoScholar (as suggested by Russell Hanson)
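
A first stab at that integration, sketched in Python: build a title query against the arXiv API's query endpoint and pull (title, PDF link) pairs out of the Atom feed it returns. The helper names are mine, no network call is made here, and the assumption that the PDF link carries `type="application/pdf"` should be checked against a live response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace

def arxiv_query_url(title, max_results=5):
    """Build a query URL that searches arXiv titles for an exact phrase."""
    params = {
        "search_query": 'ti:"%s"' % title,
        "start": 0,
        "max_results": max_results,
    }
    return ARXIV_API + "?" + urlencode(params)

def parse_feed(xml_text):
    """Return a list of (title, pdf_url) pairs from an Atom feed."""
    results = []
    for entry in ET.fromstring(xml_text).iter(ATOM + "entry"):
        # Titles may be wrapped across lines; collapse the whitespace.
        title = " ".join(entry.findtext(ATOM + "title", default="").split())
        pdf = None
        for link in entry.findall(ATOM + "link"):
            # Assumption: the PDF link is marked by its MIME type.
            if link.get("type") == "application/pdf":
                pdf = link.get("href")
        results.append((title, pdf))
    return results
```

Fetching `arxiv_query_url("some title")` and passing the response body to `parse_feed` would give AutoScholar one more source to try before falling back to the publisher's site.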