cb2Bib: Search BibTeX files for references
Description
- Search pattern
Patterns and composite patterns can be approximate strings,
strings, regular expressions, or wildcard filters. Patterns admit Unicode
characters. The scope of each pattern can be the reference as a whole or
a particular reference field. The fields year,
file, and journal receive special treatment. The field
year has the qualifiers Exact, Newer, and
Older. The field file can optionally refer to either the
filename or the contents of that file. Finally, for journal, the
input pattern is also expanded to the journal full name, if available, and both
forms are checked against the actual contents of the journal field and, if
available, its expanded contents. For example, typing 'ijqc' retrieves all
references whose journal is 'Int. J. Quantum Chem.', and typing
'chemistry' retrieves any of 'J. Math. Chem.', 'J. Phys. Chem.', etc. This
expansion is not performed when the pattern scope is set to all.
- Search scope
By default, searches are performed on the current BibTeX output file. If Scan
all BibTeX files is checked, the search extends to all BibTeX files
(extension .bib) present in the current directory. It might therefore be convenient
to group all reference files in one common directory, or to link them into that
directory. When Scan linked documents is checked, and at least one pattern
scope is all or file, the contents of the file named in the
file field are converted to text and scanned for the given pattern. See the
Configuring Utilities section to configure the external to-text converter.
- Search modifier
The cb2Bib converts TeX encoded characters to Unicode when parsing the references.
This allows, for instance, the pattern 'Møller' to retrieve either
'Møller' or 'M{\o}ller', regardless of how the BibTeX reference is
written. When Simplify source is checked, the reference and the converted PDF
files are simplified to plain ASCII. Thus, the pattern '\bMoller\b' will hit any
of 'Møller', 'M{\o}ller', or 'Moller'. Additionally, all non-word
characters are removed, preserving only the ASCII word structure of the source.
Note that source simplification is only performed for patterns whose scope is
all or file contents, and that, so far, the cb2Bib implements only a
subset of such conversions. The implemented TeX to Unicode conversions can be easily
checked by entering a reference. The Unicode to ASCII letter-only conversion, on
the other hand, is the same one that the cb2Bib uses to write the reference IDs
and, hence, to rename dropped files. The cb2Bib can also understand minor
subscript and superscript formatting. For instance, the pattern 'H2O' will retrieve
'H2O' from a BibTeX string 'H$_{2}$O'.
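The search modifier steps described above can be illustrated with a minimal Python sketch. It is not the cb2Bib implementation: the conversion tables, the function names tex_to_unicode and simplify, and the choice to replace runs of non-word characters with a single space are all assumptions made for illustration, and only the 'ø' and simple sub/superscript cases from the text are covered.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny conversion table (assumption, not cb2Bib's own).
TEX_TO_UNICODE = {
    r"{\o}": "ø",
    r"{\O}": "Ø",
}

# Letters without a Unicode decomposition need an explicit ASCII mapping.
UNICODE_TO_ASCII = {"ø": "o", "Ø": "O"}

# Matches minor sub/superscript markup such as $_{2}$ or $^{+}$.
SUB_SUPERSCRIPT = re.compile(r"\$[_^]\{([^{}]*)\}\$")


def tex_to_unicode(text):
    """Convert the few TeX sequences known to this sketch to Unicode."""
    for tex, uni in TEX_TO_UNICODE.items():
        text = text.replace(tex, uni)
    # Flatten sub/superscripts, e.g. H$_{2}$O becomes H2O.
    return SUB_SUPERSCRIPT.sub(r"\1", text)


def simplify(text):
    """Reduce text to plain ASCII words separated by single spaces."""
    text = "".join(UNICODE_TO_ASCII.get(ch, ch) for ch in text)
    # Split accented letters into base letter plus combining mark,
    # then drop everything that is not ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Assumption: non-word characters become word separators.
    return re.sub(r"\W+", " ", text).strip()


def hits(pattern, source):
    """Search a regular expression in the simplified source."""
    return re.search(pattern, simplify(tex_to_unicode(source))) is not None
```

With these definitions, hits(r"\bMoller\b", source) is true for a source containing any of 'Møller', 'M{\o}ller', or 'Moller', and tex_to_unicode("H$_{2}$O") yields a string that the pattern 'H2O' matches.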
Notes
- The cb2Bib uses an internal cache to speed up the search of linked files. By
default, data is stored as current_file.bib.c2b. It might be more
convenient, however, to set up a temporary directory outside the user data backup
directories. See Search In Files Cache Directory in Configuring Files. When a linked file is
processed for the first time, the cb2Bib performs several string manipulations, such
as removing end-of-line hyphenations. This process is time-consuming for very
large files.
- The approximate string search is described in reference
http://arxiv.org/abs/0705.0751v1. It reduces the chance of missing a
hit due to transcription and decoding errors in the document files. Approximate
string search is also a form of serendipitous information retrieval.
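To make the idea of error-tolerant matching concrete, here is a minimal Python sketch. It does not reproduce the method of the cited reference; it swaps in a generic edit-distance substring search (Sellers' variant of the Levenshtein recurrence). The function names and the max_errors parameter are hypothetical.

```python
def min_substring_distance(pattern, text):
    """Smallest edit distance between pattern and any substring of text."""
    # prev[i] is the cost of aligning pattern[:i] with the best substring
    # ending at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = prev[-1]
    for ch in text:
        # A match may start at any text position at no cost.
        curr = [0]
        for i, pch in enumerate(pattern, start=1):
            cost = 0 if pch == ch else 1
            curr.append(min(prev[i] + 1,          # extra character in text
                            curr[i - 1] + 1,      # missing character in text
                            prev[i - 1] + cost))  # match or substitution
        prev = curr
        best = min(best, prev[-1])
    return best


def approx_contains(pattern, text, max_errors=1):
    """Case-insensitive test: does text contain pattern within max_errors edits?"""
    return min_substring_distance(pattern.lower(), text.lower()) <= max_errors
```

For example, approx_contains("Moller", "C. Mo1ler and M. S. Plesset") succeeds even though a decoding error turned one 'l' into '1', which is the kind of missed hit an approximate search is meant to avoid.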