Having recently bought a Sony PRS-600 (a fairly good reader with a nice touch screen; if you read Polish, see my review), I became interested in ebook management. For a Linux user the only reasonable option seems to be Calibre, a useful application which not only lets me manage my reader but also provides a well-designed ebook database.
One of the nice Calibre features is that once you enter a book's ISBN, plenty of useful information (canonical versions of the author name and book title, description, cover, even tags) can be downloaded automatically. But, for some reason, the application does not detect the ISBN itself. I repeated the sequence (open a book, go down a page or a few, copy the ISBN, go back to Calibre, open the book data, paste the ISBN) a few times and decided it was boring and could be automated.
So I wrote a short script which performs this very action.
Purpose
The script analyses the calibre database (it assumes calibre is
already installed and properly configured), looks for books without
an ISBN, then tries to find their ISBN by scanning the leading pages. If an ISBN is found,
the script saves it (updates the Calibre metadata of the given book). No other metadata
changes are performed.
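To illustrate the scanning step, here is a minimal sketch (not the actual script code) of looking for an ISBN in the leading part of a book's extracted text. The pattern, the function name guess_isbn and the max_chars limit are all assumptions made for this example, and it is written for current Python 3 rather than 2.6.

```python
import re

# Hypothetical pattern: the word ISBN, a short run of non-digits, then 10 or
# 13 digits optionally separated by hyphens or spaces (an ISBN-10 may end in X).
ISBN_RE = re.compile(
    r"ISBN[^0-9]{0,20}?((?:97[89][- ]?)?(?:[0-9][- ]?){9}[0-9Xx])",
    re.IGNORECASE,
)


def guess_isbn(text, max_chars=20000):
    """Return the first ISBN-like string near the start of text, or None."""
    match = ISBN_RE.search(text[:max_chars])
    if match is None:
        return None
    # Normalise: drop separators and upper-case a trailing "x".
    return re.sub(r"[- ]", "", match.group(1)).upper()
```

Only the beginning of the text is searched because the ISBN normally sits on the copyright page, and limiting the search keeps the run time down on big files.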
Later on the ISBN can be used to grab the book metadata and/or book cover
inside the Calibre GUI. Just start Calibre and look for books with the ISBN
set and metadata missing, for example using a query like:
isbn:~[0-9] not publisher:~[a-z]
(the above means: isbn contains some digit, publisher does not contain any
letter). Then select the appropriate books (I prefer to handle them in batches of no more than 10-20
so I can review the changes easily), right-click, expand the Edit Metadata Information
submenu and pick Download Metadata (or some other Download option).
Prerequisites
The script has been developed and used on Ubuntu Linux. It should work on other platforms (if the necessary tools are installed), including Windows and Mac, but I haven't
tested it.
Calibre must be installed, properly configured and have
some books in the database (otherwise it does not make sense to run the script).
The calibredb
command must be in PATH (alternatively, the CALIBREDB variable at the beginning
of the script can be modified
to contain the full path to calibredb).
Command-line tools for extracting text from PDF, DOC and DjVu files
must be installed and present in PATH. On Ubuntu Linux or Debian Linux
they can be installed from the standard repositories; just install the
following packages: poppler-utils, catdoc, djvulibre-bin, either using
a GUI or by running
$ sudo apt-get install poppler-utils catdoc djvulibre-bin
Python 2.6 is required (the script uses features of tempfile and
subprocess introduced in 2.6). Also, the lxml library must be installed.
On Debian or Ubuntu just install the following packages: python2.6
and python-lxml, for example by:
$ sudo apt-get install python2.6 python-lxml
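Before running the script it can be handy to verify that the external commands are really reachable through PATH. The sketch below is not part of the script; calibredb is named above, while pdftotext, catdoc and djvutxt are my assumption about what the listed packages provide, so adjust the list to your setup. It is written for current Python 3 (shutil.which does not exist in 2.6).

```python
import shutil

# calibredb is named in the text; the other three names are assumptions about
# the commands installed by poppler-utils, catdoc and djvulibre-bin.
REQUIRED_COMMANDS = ["calibredb", "pdftotext", "catdoc", "djvutxt"]


def missing_commands(commands=REQUIRED_COMMANDS):
    """Return the commands from the list that cannot be found in PATH."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Not found in PATH: " + ", ".join(missing))
    else:
        print("All required commands found.")
```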
Download and Installation
The script is available here (to download just click raw and save the file as guess_and_add_isbn.py in any folder of your choice).
Usage
Open a terminal or console, check whether PATH is set properly, then
run:
$ python guess_and_add_isbn.py
and wait for the script to finish.
Note: it may take some time, especially on bigger databases.
The script can be run while Calibre is running (it will notify
the running Calibre about data changes). There is a minor annoyance in that case
(every time a book is updated, Calibre refreshes the book list and forgets
which books were selected), so I do not recommend searching or editing books
while the script is running.
The script can be safely re-run (for example after new books are added).
Source code
Official repository: http://bitbucket.org/Mekk/calibre_utils
I got this script to work on Windows 7 64 bit. It took some doing, but it does work. Thanks for your work!
Install djvu
install xpdf
install catdoc
add all to path
install python 2.6
install reg hack to make setuptools installer work
http://bugs.python.org/setuptools/file65/python-fix.txt
install python setuptools 2.2.4
install lxml 2.2.4 with "easy_install lxml==2.2.4"
I've noticed that some ISBNs it will not pull out. Would it be possible to add various other regexes with an OR so that it might pick up other ISBN variations?
Glad to hear!
Different regexps can be tried, in many ways. Which syntaxes do you mean (just show some examples)?
I ran it on 5 books, and 2 of them put the publisher in the ISBN field while 3 worked.
One book had this format:
Includes index. ISBN 0-7897-3669-1 1. Database management. 2. Microsoft Access. I. Title. QA76.9.D3M395252 2007
and it ended up with Que in the ISBN field.
Another book had ISBN spelled out instead of abbreviated, etc.
I suppose some checking could be done as well, so that only 10 or 13 digit values are allowed to be put in the isbn field.
I will run some more books through the script and post better results.
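The check suggested above (accept only 10 or 13 digit values) can go one step further by also verifying the ISBN check digit, which would reject a stray word such as "Que" as well as most misread numbers. This is a minimal sketch in current Python 3, not code from the script; the function name is_valid_isbn is made up for the example.

```python
import re


def is_valid_isbn(candidate):
    """Return True if candidate is an ISBN-10 or ISBN-13 with a correct check digit."""
    digits = candidate.replace("-", "").replace(" ", "").upper()
    if re.fullmatch(r"[0-9]{9}[0-9X]", digits):
        # ISBN-10: weights 10..1, X stands for 10, sum must be divisible by 11.
        total = sum(
            (10 - i) * (10 if ch == "X" else int(ch))
            for i, ch in enumerate(digits)
        )
        return total % 11 == 0
    if re.fullmatch(r"[0-9]{13}", digits):
        # ISBN-13: weights alternate 1, 3, 1, 3, ...; sum must be divisible by 10.
        total = sum(
            int(ch) * (1 if i % 2 == 0 else 3)
            for i, ch in enumerate(digits)
        )
        return total % 10 == 0
    return False
```

For example, is_valid_isbn("0-7897-3669-1") accepts the number from the book quoted above, while is_valid_isbn("Que") is rejected.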
All of these formats failed to be detected.
ISBN-13 (pbk): 978-1-59059-982-2 ISBN-10 (pbk): 1-59059-982-9 ISBN-13 (electronic): 978-1-4302-0634-7
ISBN-13 (pbk): 978-1-4302-1632-2 ISBN-13 (electronic): 978-1-4302-1633-9
ISBN, print ed. 0-936348-07-0 ISBN, PDF ed. 0-936348-08-9
I updated the script to handle all cases you quoted. It's available in the same place as previously
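For readers curious how such labelled variants can be matched, here is one possible pattern. It is a sketch under my own assumptions, not the regular expression actually used in the updated script; the names LABELLED_ISBN_RE and find_all_isbns are invented for the example, and it targets current Python 3.

```python
import re

LABELLED_ISBN_RE = re.compile(
    r"ISBN"                 # the label itself
    r"(?:[- ]?1[03])?"      # optional -10 / -13 suffix
    r"[^0-9\n]{0,30}?"      # qualifiers such as " (pbk): " or ", print ed. "
    r"((?:97[89][- ]?)?(?:[0-9][- ]?){9}[0-9Xx])"  # 13 or 10 digits with separators
    r"(?![0-9])",           # do not stop in the middle of a longer number
    re.IGNORECASE,
)


def find_all_isbns(text):
    """Return every labelled ISBN found in text, with separators removed."""
    return [
        re.sub(r"[- ]", "", match.group(1)).upper()
        for match in LABELLED_ISBN_RE.finditer(text)
    ]
```

Returning all matches instead of only the first leaves the caller free to prefer, say, the ISBN-13 or the electronic edition when a book lists several.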
Are you able to modify the script to look for ISBN numbers in *.chm files in calibre library?
It should be possible, I will take a look
I was wondering about using archmage, extracting the contents and then parsing the first page, because the ISBN is usually on the first page (I have never seen otherwise).
I already did it (using archmage) and will publish this code soon.
Done
sweeet! It retrieves data for some chm books, but it
crashes:
Traceback (most recent call last):
  File "guess_and_add_isbn.py", line 309, in <module>
    for item, isbn in find_files_with_new_isbn():
  File "guess_and_add_isbn.py", line 276, in find_files_with_new_isbn
    txt = grab_file_text_for_analysis(fl)
  File "guess_and_add_isbn.py", line 213, in grab_file_text_for_analysis
    return routine(file_path)
  File "guess_and_add_isbn.py", line 196, in grab_file_text_chm
    return "\n".join( html.fromstring( "".join(lines) ).itertext() )
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 603, in fromstring
    doc = document_fromstring(html, parser=parser, base_url=base_url, **kw)
  File "/usr/lib/python2.6/dist-packages/lxml/html/__init__.py", line 511, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2532, in lxml.etree.fromstring (src/lxml/lxml.etree.c:48270)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 576, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64260)
lxml.etree.XMLSyntaxError: None
If you could add some verbose output to the script (which file it is about to parse), I might be able to isolate the books and maybe see what is wrong with them, or just set them aside.
I fixed this problem (or at least I hope so as I don't have any chm file which fails this way), check the current version.
Details:
My script parses the HTML output of archmage to throw away HTML tags (so I don't need to consider them in regular expressions). As your case proves, this parsing sometimes fails, so I decided to just catch such exceptions, print a warning if they happen, and then search for the ISBN in the raw HTML text. It means that in such a case the ISBN is a little less likely to be found, but I think it is a reasonable compromise.
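The fallback described above can be sketched as follows. This is an illustration only: it swaps in the standard library HTMLParser instead of lxml so that the example has no dependencies, it is written for current Python 3, and the names strip_tags and text_for_isbn_search are invented here.

```python
import sys
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects the text found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html_text):
    """Return the text content of html_text with the tags thrown away."""
    collector = _TextCollector()
    collector.feed(html_text)
    collector.close()
    return "\n".join(collector.chunks)


def text_for_isbn_search(html_text, parse=strip_tags):
    """Try to strip tags; on any parser failure warn and fall back to raw HTML."""
    try:
        return parse(html_text)
    except Exception as exc:  # broad on purpose: any parse error triggers the fallback
        sys.stderr.write("warning: HTML parsing failed (%s), using raw text\n" % exc)
        return html_text
```

Passing the parser as an argument makes the fallback easy to exercise with a function that deliberately raises.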
Reporting bugs on this site looks so bad that I decided to use bitbucket for reporting. I hope it improves our communication ;]
Not sure what exactly you mean, but bitbucket is a good place to report bugs, which is why I enabled bug tracking for the project (let me also mention that it is an even better place to offer patches).
Here comes my really ugly solution. Since I have calibre on Windows, I used PowerGREP to parse all the .pdf files in the calibre subfolders. I collect data with this search:
and collecting the following:
%MATCHFILEN% UPDATE "books" set isbn='\2' where id=%FILE%;
Thus:
\2 is the second "back reference"
I use MATCHFILEN to filter later (I use only the first match)
FILE includes the path+filename
The matches are collected in a text file, and PSPAD is good enough to manipulate it.
Thanks to calibre the path includes the ID, so I use a stupid trim to extract the ID:
1) eliminate annoying text between "()",
2) search and eliminate "([A-z]*)"
3) replace remaining ").*;" with ";"
4) replace "id=.*(" with "id="
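The same ID extraction can be done in a few lines of Python instead of a chain of search-and-replace steps. The sketch below relies on the calibre library layout in which each book folder name ends with the book ID in parentheses; the function name calibre_id_from_path and the example paths are hypothetical.

```python
import re
from pathlib import PureWindowsPath

# The book folder name ends with "(<id>)", e.g. "Some Title (42)".
_ID_RE = re.compile(r"\((\d+)\)$")


def calibre_id_from_path(path):
    """Return the calibre book ID encoded in the book folder name, or None."""
    # PureWindowsPath accepts both backslashes and slashes as separators.
    folder = PureWindowsPath(path).parent.name
    match = _ID_RE.search(folder)
    return int(match.group(1)) if match else None
```

Anchoring the pattern at the end of the folder name means that other parenthesised text in the title, such as "(2nd ed.)", is ignored without a separate clean-up step.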
A bit of supervision and a spreadsheet to test the isbn. Fix the errors and merge with isbn in calibre.
Now come the clowns with sqlitespy. Open the db. First copy the creation steps for the 3rd trigger in the table "books" and then drop it.
Execute the Updates
Create the trigger
And that's all.
Some books use a line just with the numbers of ISBN. Then I use:
(|ISBN-10|ISBN-13|ISBN[\: \(\)A-z]*?)([0-9]+[0-9_\-X ]*)$
Hope this helps
Thank you so much for this, Marcin. It really is awesome. I got it working 3 or 4 days ago and it was able to find ISBNs for almost all my books. I just updated Calibre to 0.70 (it was 0.69-something) and the script has now stopped working. I'm going to revert to 0.69 for the time being ('cos given a choice between 0.70 and your script, the answer's obvious), but here's the error:
C:\Users\Andrew\Desktop>python c:\python26\guess_and_add_isbn.py
Usage: calibredb.exe list [options]
List the books available in the calibre database.
Whenever you pass arguments to calibredb.exe that have spaces in them, enclose the arguments in quotation marks.
calibredb.exe: error: no such option: --output-format
Traceback (most recent call last):
  File "c:\python26\guess_and_add_isbn.py", line 314, in <module>
    for item, isbn in find_files_with_new_isbn():
  File "c:\python26\guess_and_add_isbn.py", line 278, in find_files_with_new_isbn
    for item in locate_potential_isbn_files():
  File "c:\python26\guess_and_add_isbn.py", line 249, in locate_potential_isbn_files
    tree = objectify.parse(StringIO(out))
  File "lxml.objectify.pyx", line 1860, in lxml.objectify.parse (src/lxml/lxml.objectify.c:18814)
  File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1517, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71540)
  File "parser.pxi", line 1545, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:71812)
  File "parser.pxi", line 1424, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:70673)
  File "parser.pxi", line 938, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:67442)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 576, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64260)
lxml.etree.XMLSyntaxError: None
It looks like the calibre author removed the XML output option from the list command.
I will take a look at it, but it may take a few days before I have time to solve it.
I believe the problem is now solved. See the new article.