cb2Bib: PDF Reference Import
Articles in PDF or other formats that can be converted to plain text can be
processed and indexed by the cb2Bib. Files can be selected using the Select Files
button, or by dragging them from the desktop or the file manager to the PDFImport
dialog panel. Files are converted to plain text by using any external translation
tool or script. This tool, and optionally its parameters, are set in the cb2Bib
configure dialog. See the Configuring Utilities section for
details.
Once the file is converted, the text and, optionally, the preparsed metadata are
sent to the cb2Bib for reference recognition. This is the usual two-step process.
First, text is optionally preprocessed, using a simple set of rules and/or any
external script or tool. See Configuring Clipboard. Second, text is
processed for reference extraction. The cb2Bib so far uses two methods. One
considers the text as a full pattern, which is checked against the user's set of
regular expressions. The better designed these rules are, the better and more
reliable the extraction will be. The second method, used when no regular expression
matches the text, considers instead a set of predefined subpatterns. See Field Recognition Rules.
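The two methods can be pictured with a minimal sketch. It is not the cb2Bib implementation: the patterns, the field names, and the function recognize are hypothetical, and the expressions do not follow the regexps.txt syntax. It only illustrates trying whole-text patterns first and falling back to per-field subpatterns.

```python
import re

# Hypothetical user patterns: each one must match the whole text and names
# the fields it captures. Illustrative only, not the regexps.txt syntax.
FULL_PATTERNS = [
    re.compile(
        r"^(?P<journal>[^\n]+)\n"
        r"(?P<volume>\d+) \((?P<year>\d{4})\) (?P<pages>\d+-\d+)\n"
        r"(?P<title>[^\n]+)\n"
        r"(?P<author>[^\n]+)$"
    ),
]

# Hypothetical predefined subpatterns, tried field by field as a fallback.
SUBPATTERNS = {
    "doi": re.compile(r"\b(10\.\d{4,9}/[^\s\"<>]+)"),
    "year": re.compile(r"\b((?:19|20)\d{2})\b"),
    "pages": re.compile(r"\b(\d+\s*-\s*\d+)\b"),
}


def recognize(text):
    """Return (method, fields) for a block of converted text."""
    stripped = text.strip()
    # Method 1: the text as a full pattern, checked against user expressions.
    for pattern in FULL_PATTERNS:
        match = pattern.match(stripped)
        if match:
            return "full pattern", match.groupdict()
    # Method 2: no full pattern matched, so collect whichever subpatterns hit.
    fields = {}
    for name, pattern in SUBPATTERNS.items():
        hit = pattern.search(stripped)
        if hit:
            fields[name] = hit.group(1)
    return "subpatterns", fields
```

In this sketch a full match yields every field at once, while the fallback typically yields only a few, which is why a later query may be needed to complete the reference.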
At this point users can interact and supervise their references, right before
saving them. Allowing user intervention is and has been a design goal in the
cb2Bib. Thus, at this point, the cb2Bib invites users to check their references.
Poorly translated characters, accented letters, 'forgotten' words, or some minor
formatting in the titles might be worth considering. See Text Extraction for a
description of the intricacies of PDF to text conversion. In addition, if too few
fields were extracted, one might perform a network query. If, say, only the DOI
was caught, there is a good chance that such a query will fill the remaining fields.
The references are saved from the cb2Bib main panel. Once Save is pressed, and
depending on the configuration, see Configuring Documents, the document
file will be either renamed, copied, moved or simply linked onto the
file field of the reference. If Insert BibTeX metadata to document
files is checked, the current reference will also be inserted into the document
itself.
When several files are going to be indexed, the sequence can be as follows:
- Process next after saving
Once files are loaded and Process is pressed, the PDFImport dialog can be minimized
(but not closed) for convenience. All operations required to completely fill the
desired fields (e.g. dynamic bookmarks, open DOI, etc., which might be required if
the data in the document is not complete) are at this point accessible from the main
panel. The link in the file field will persist, regardless of which operations
(e.g. clipboard copying) are needed, until the
reference is saved. The source file can be opened at any time by right clicking the
file line edit. Once the reference is saved, the next file will be
automatically processed. To skip a document file without saving its reference,
press the Process button.
- Unsupervised processing
In this operation mode, all files will be sequentially processed, following the
chosen steps and rules. If the process is successful, the reference is
automatically saved, and the next file is processed. If it is not, the file
is skipped and no reference is saved. While processing, the clipboard is disabled
for safety. Once finished, this box is unchecked, to avoid a possible accidental
saving of a void reference. Network queries that require intervention, i.e., whose
result is launching a given page, are skipped. Processing continues until all
files are processed. However, it will stop to avoid a file being overwritten as a
result of a repeated key. In this case, it will resume after manual renaming and
saving. See also The cb2Bib Command Line, commands '--txt2bib' and '--doc2bib'.
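The control flow of unsupervised processing can be summarized with a short sketch. It is a simplified model under stated assumptions, not cb2Bib code: run_unsupervised, the extract callable, and the "key" entry are hypothetical names introduced for illustration.

```python
def run_unsupervised(files, extract, existing_keys):
    """Process files in order and return (saved, skipped, stopped_at).

    `extract` is a hypothetical callable that returns a dict with a "key"
    entry when extraction succeeds, or None when it fails.
    """
    saved, skipped = [], []
    for name in files:
        reference = extract(name)
        if reference is None:
            # Failed extraction: the file is skipped and nothing is saved.
            skipped.append(name)
            continue
        if reference["key"] in existing_keys:
            # Repeated key: stop here so that no file gets overwritten.
            return saved, skipped, name
        existing_keys.add(reference["key"])
        saved.append((name, reference))
    # All files were processed without hitting a repeated key.
    return saved, skipped, None
```

A non-empty third return value marks the file at which the run stopped and where manual renaming and saving would be needed before resuming.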
Automatic Extraction: Questions and Answers
- When does cb2Bib do automatic extractions? The cb2Bib is conceived as a
lightweight tool to extract references and manage bibliographies in a simple,
fast, and accurate way. Accuracy is better achieved in semi-automatic extractions.
Such extractions are handy, and allow user intervention and verification. However,
in cases where one has accumulated a large number of unindexed documents,
automatic processing can be convenient. The cb2Bib does automatic extraction when,
in PDFImport mode, 'Unsupervised processing' is checked, or, in command line mode,
when typing
cb2bib --doc2bib *.pdf tmp_references.bib, or, on
Windows, c2bconsole.exe instead of cb2bib.
- Are PDFImport and command line modes equivalent? Yes. There are,
however, two minor differences. First, PDFImport adds each reference to the
current BibTeX file, as this behavior is the normal one in cb2Bib. On the other
hand, command line mode will, instead, overwrite tmp_references.bib
if it exists, as this is the expected behavior for almost all command line tools.
Second, for now, command line mode does not follow the configuration option
'Check Repeated On Save'.
- How do I do automatic extraction? To test and learn about automatic
extractions, the cb2Bib distribution includes a set of four PDF files that mimic a
paper title page. For these files, the distribution also includes a regular
expression, in file regexps.txt, capable of extracting the reference
fields, provided the pdftotext flags are set to their default values.
Processing these files should, therefore, be automatic, and four messages stating
Processed as 'PDF Import Example' should be seen in the logs. Note
that extractions are configurable. Reading Configuration will provide additional,
useful information.
- Why are some entries not saved and files not renamed? Once you move
from the fabricated examples to real cases, you will realize that some of the
files, while being processed, are not renamed and their corresponding BibTeX data
is not written. For each document file, cb2Bib converts its first page to text,
and from this text it attempts to extract the bibliographic reference. By design,
when extraction fails, cb2Bib does nothing: no file is moved, no BibTeX is
written. This way, you know that the remaining files in the origin directory need
special, manual attention. Extractions are always seen as failed, unless
reliable data is found in the text.
- What is reliable data? Note that computer processing of
natural texts, such as extracting the bibliographic data from a title page, is
still an approximate procedure. The cb2Bib tries several strategies: 1) allow
for including user regular expressions very specific to the extraction at hand,
2) use metadata if available, 3) guess what is reasonable, and,
based on this, make customized queries. cb2Bib then considers extracted data
reliable if i) the data comes from a match to a user-supplied regular
expression, ii) the document contains BibTeX metadata, or iii) a guess is
transformed through a query into formatted bibliographic data. As formatted
bibliographic data, cb2Bib only understands BibTeX, and, as an exception, PubMed
XML data. However, it allows external processing if needed. Any other data,
metadata, guesses, and guesses on query results are considered unreliable
data.
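The reliability rule above can be written down as a small decision function. This is only a sketch: the source labels and the function is_reliable are hypothetical names chosen here, not identifiers from cb2Bib.

```python
# Hypothetical labels for where a reference's data came from.
RELIABLE_SOURCES = {
    "user_regexp",       # i) match to a user-supplied regular expression
    "bibtex_metadata",   # ii) BibTeX metadata found in the document
    "query_bibtex",      # iii) a guess turned into BibTeX by a query
    "query_pubmed_xml",  # iii) a guess turned into PubMed XML by a query
}

# Everything else is treated as unreliable: plain metadata, guesses,
# and guesses made on query results.
UNRELIABLE_SOURCES = {"plain_metadata", "guess", "guess_on_query_result"}


def is_reliable(source):
    """Return True only for sources listed as reliable in the text above."""
    return source in RELIABLE_SOURCES
```

Under this model, an automatic extraction would save a reference only when is_reliable returns True for its source; any other case leaves the file untouched.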
- Is metadata reliable data? No. Only author, title, and keywords in
standard PDF metadata can be mapped to their corresponding bibliographic fields.
Furthermore, publishers most often misuse these three keys, placing, for instance,
DOI in title, or setting author to, perhaps, the document typesetter. Only BibTeX
XMP metadata is considered reliable, and only documents already processed with
cb2Bib or JabRef will have it. If you consider that a set of PDF files does
contain reliable data, you may force cb2Bib to accept it using the command line
switch --sloppy together with --doc2bib.
- How successful is automatic extraction? As follows from the given
definition of reliable data, running automatic extractions without ad hoc
regexps.txt and netqinf.txt files will certainly give a
zero success ratio. In practice, scenario 3) often applies: cb2Bib guesses several
fields, and, based on the out-of-the-box netqinf.txt file, it obtains
from the web either BibTeX or PubMed XML data. Thus, biologists, for instance,
usually have success ratios close to 100%, since PubMed is almost complete for
them, and its data is extremely accurate.
- What can I do to increase success ratio? First, set your favorite journals
in file abbreviations.txt. Besides increasing the chances of journal
name recognition, it will provide consistency across your BibTeX database. In
general, do not write regular expressions to extract directly from the PDF text.
Conversion is often poor. Special characters often break lines, thus breaking
your regular expressions too. Write customized queries. For instance, if your
PDFs have the DOI on the title page, set the simple query
journal=The Journal of Everything|
query=http://dx.doi.org/<<doi>>
capture_from_query=
referenceurl_prefix=
referenceurl_sufix=
pdfurl_prefix=
pdfurl_sufix=
action=htm2txt_query
Then, if it is feasible to extract the reference from the document's web
page using a regular expression, include it in file regexps.txt.
Note that querying in cb2Bib has been designed with minority fields of research
in mind, for which established databases might not be available. If cb2Bib
fails to make reasonable guesses, you might then consider writing very simple
regular expressions to extract directly from the PDF text. For instance, obtain
the title only. The subsequent query step can then provide the remaining
information. Note also that, especially for old documents, the journal name is
often missing from the paper title page. If you need to process a series of such
papers, consider using a simple script that, in the cb2Bib preprocessing step,
adds this missing information.
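A preprocessing script of the kind just mentioned could look like the sketch below. It rests on assumptions that should be checked against Configuring Clipboard: that the script is called with an input file and an output file as its two arguments, and that prepending the journal name on its own line is enough for it to be recognized. The function add_journal is a hypothetical helper, and the journal name is the placeholder used in the query example.

```python
import sys

# Placeholder journal name for the series of papers being processed.
JOURNAL = "The Journal of Everything"


def add_journal(text, journal=JOURNAL):
    """Prepend the journal name unless the text already mentions it."""
    if journal.lower() in text.lower():
        return text
    return journal + "\n" + text


if __name__ == "__main__" and len(sys.argv) == 3:
    # Assumed calling convention: script.py input_file output_file
    source_path, target_path = sys.argv[1], sys.argv[2]
    with open(source_path, encoding="utf-8") as source:
        converted = source.read()
    with open(target_path, "w", encoding="utf-8") as target:
        target.write(add_journal(converted))
```

The check for an existing mention keeps the script harmless when it is run on documents that already carry the journal name.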
- Does successful extraction mean accurate extraction? No. An extraction
is successful if reliable data, as defined above, is found in the text, in the
metadata, or in the text returned by a query. Reference accuracy relies on whether
or not user regular expressions are robust, BibTeX metadata is correct, a guess is
appropriate, a set of queries can correct a partially incorrect guess, and the
text returned by a query is accurate. In general, well designed sets of regular
expressions are accurate. Publishers' abstract pages and PubMed are accurate. But,
some publishers are still using images for non-ASCII characters, and PubMed
algorithms may drop author middle names if a given author has 'too many names'.
Expect convenience over accuracy on other sources.