Metadata in scientific documents is, unfortunately, rarely appreciated and not widely used. For bibliographic metadata the situation is even more disappointing: there is no accepted format specification, and the reliability of publishers’ metadata, when present at all, is questionable in many cases.
cb2Bib reads all XMP (a specific XML standard devised for metadata
storage) packets found in the document. It then parses the XML strings
looking for nodes and attributes with key names meaningful to
bibliographic references. If a given bibliographic field is found in
multiple packets, cb2Bib will take the last one, which most often, and
according to the PDF specs, is the most up-to-date one. The fields
file, which would be the document itself, and
pages, which is usually the
actual number of pages, are skipped.
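The reading step described above can be sketched in a few lines of Python. This is only an illustration, not cb2Bib's actual implementation: the function name `read_fields`, the set of "meaningful" key names, and the skipped field names (`file` and `pages`) are assumptions made for the example.

```python
import re
import xml.etree.ElementTree as ET

# Matches the body of every XMP packet embedded in a raw byte stream.
PACKET_RE = re.compile(rb"<\?xpacket begin.*?\?>(.*?)<\?xpacket end.*?\?>", re.DOTALL)

# Assumed, illustrative list of key names treated as bibliographic.
KNOWN_KEYS = {"author", "title", "journal", "year", "volume", "doi", "file", "pages"}
# Fields that are read but deliberately ignored.
SKIPPED_KEYS = {"file", "pages"}


def local_name(qualified):
    """Drop the '{namespace}' part that ElementTree puts in front of names."""
    return qualified.rsplit("}", 1)[-1].lower()


def read_fields(raw):
    """Collect bibliographic fields from all packets; later packets win."""
    fields = {}
    for match in PACKET_RE.finditer(raw):
        try:
            root = ET.fromstring(match.group(1).strip())
        except ET.ParseError:
            continue  # ignore malformed packets
        for node in root.iter():
            # Node text, including nested rdf:li text, with whitespace collapsed.
            text = " ".join("".join(node.itertext()).split())
            candidates = [(local_name(node.tag), text)]
            candidates += [(local_name(k), v.strip()) for k, v in node.attrib.items()]
            for key, value in candidates:
                if key in KNOWN_KEYS and key not in SKIPPED_KEYS and value:
                    fields[key] = value  # overwriting keeps the last occurrence
    return fields
```

Because the dictionary entry is simply overwritten, a field that appears in several packets ends up with the value from the last packet in the file.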
The metadata is then summarized in the cb2Bib clipboard panel as, for instance:
[Bibliographic Metadata <title>arXiv:0705.0751v1 [cs.IR] 5 May 2007</title> /Bibliographic Metadata]
This data, whenever the user considers it to be correct, can be easily
imported by the built-in ‘Heuristic Guess’ capability. On the other
hand, if keys are found with the prefix bibtex, cb2Bib will assume the
document does contain bibliographic metadata, and it will only consider
the keys having this prefix. Assuming therefore that the metadata is
bibliographic, cb2Bib will automatically import the reference. This way,
if using PDFImport, BibTeX-aware documents will be processed as
successfully recognized, without requiring any user-supplied regular
expression.
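The prefix rule can be illustrated with a small Python sketch. It assumes that keys arrive as qualified names such as `bibtex:author` or `dc:title`; the function name `select_fields` and the returned flag are hypothetical and only serve the example.

```python
def select_fields(pairs):
    """Apply the 'bibtex' prefix rule to (qualified_key, value) pairs.

    Returns (fields, auto_import). If any key carries the bibtex prefix,
    only those keys are kept and the reference is considered safe to
    import automatically; otherwise all keys are kept for manual review.
    """
    prefix = "bibtex:"
    bibtex_only = {}
    everything = {}
    for key, value in pairs:
        lowered = key.lower()
        if lowered.startswith(prefix):
            bibtex_only[lowered[len(prefix):]] = value
        # Keep the bare key name, whatever its namespace prefix was.
        everything[lowered.split(":")[-1]] = value
    if bibtex_only:
        return bibtex_only, True
    return everything, False
```

For example, a document carrying both `dc:title` and `bibtex:title` would yield only the `bibtex` value, flagged for automatic import.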
Once an extracted reference is saved and there is a document attached to it, cb2Bib will optionally insert the bibliographic metadata into the document itself. cb2Bib writes an XMP packet as, for instance
<bibtex:author>P. Constans</bibtex:author>
<bibtex:journal>arXiv 0705.0751</bibtex:journal>
<bibtex:title>Approximate textual retrieval</bibtex:title>
<bibtex:type>article</bibtex:type>
<bibtex:year>2007</bibtex:year>
which is similar to JabRef's format but differs in that cb2Bib sticks strictly to BibTeX and avoids (perhaps unnecessary) syntax specialization in author strings.
The BibTeX fields file and
id are skipped when writing. The former for
the reason mentioned above, and the latter because it is easily
generated by specialized BibTeX software according to each user’s
preferences. LaTeX escaped characters for non-ASCII letters are
converted to UTF-8, as XMP already specifies this encoding.
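As a rough illustration of the writing rules, the following Python sketch turns a reference into `bibtex`-prefixed elements. It is not cb2Bib's code: the function names are hypothetical, the skipped field names (`file` and `id`) are assumed, and the LaTeX conversion covers only a handful of simple accent commands, leaving any enclosing braces in place.

```python
import re
import unicodedata
from xml.sax.saxutils import escape

# Fields assumed to be left out of the written packet.
SKIP_ON_WRITE = {"file", "id"}

# LaTeX accent commands mapped to Unicode combining marks (small subset).
ACCENTS = {"'": "\u0301", "`": "\u0300", '"': "\u0308", "^": "\u0302", "~": "\u0303"}
ACCENT_RE = re.compile(r"""\\(['`"^~])(?:\{([A-Za-z])\}|([A-Za-z]))""")


def latex_to_utf8(text):
    """Convert forms such as \\'e or \\"{u} to precomposed UTF-8 letters."""
    def replace(match):
        letter = match.group(2) or match.group(3)
        return unicodedata.normalize("NFC", letter + ACCENTS[match.group(1)])
    return ACCENT_RE.sub(replace, text)


def bibtex_xmp_lines(reference):
    """Return one <bibtex:key>value</bibtex:key> line per written field."""
    lines = []
    for key in sorted(reference):
        if key in SKIP_ON_WRITE:
            continue
        value = escape(latex_to_utf8(reference[key]))  # escape &, <, > for XML
        lines.append(f"<bibtex:{key}>{value}</bibtex:{key}>")
    return lines
```

The lines produced this way would still need to be wrapped in a complete XMP packet, with the appropriate namespace declaration, before being handed to a writer.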
The actual writing of the packet into the document is performed by ExifTool, an excellent Perl program written by Phil Harvey. See https://sno.phy.queensu.ca/~phil/exiftool/. ExifTool supports several document formats for writing. The most relevant here are PostScript and PDF. For PDF documents, metadata is written as an incremental update of the document. This exactly preserves the binary structure of the document, and changes can be easily reversed or modified if so desired. Whenever ExifTool is unable to insert metadata, e.g., because the document format is not supported or it has structural errors, cb2Bib will issue an informational message, and the document will remain untouched.
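For readers who want to try the same operation by hand, the sketch below builds ExifTool command lines from Python. The text above does not state which arguments cb2Bib passes to ExifTool, so the commands are assumptions based on ExifTool's documented options: `-xmp<=FILE` writes the contents of FILE as the XMP block, and `-PDF-update:all=` reverts the incremental updates ExifTool made to a PDF. The function names and file paths are hypothetical.

```python
import shutil
import subprocess


def write_command(packet_path, document_path):
    """Command that writes the XMP packet stored in packet_path into the document."""
    return ["exiftool", f"-xmp<={packet_path}", document_path]


def revert_command(document_path):
    """Command that undoes ExifTool's incremental updates to a PDF."""
    return ["exiftool", "-PDF-update:all=", document_path]


def insert_metadata(packet_path, document_path):
    """Run ExifTool; return True on success, False if missing or failed."""
    if shutil.which("exiftool") is None:
        return False  # ExifTool is not installed or not on PATH
    result = subprocess.run(
        write_command(packet_path, document_path),
        capture_output=True,
        text=True,
    )
    return result.returncode == 0
```

By default ExifTool keeps a backup copy of the file with `_original` appended to its name, which gives a second way to recover the unmodified document.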