Releases: PolMine/cwbtools
Blackbird
Flying Panda
New features
- New method `encode()` to prospectively supersede the `CorpusData` class. Includes argument `properties` #13.
- New function `corpus_reload()` for convenient unloading/reloading of corpora #68.
- New utility function `registry_set_name()` #13.
Minor improvements
- `cwb_get_url()` will get CWB v3.5 installation files #63.
- `corpus_remove()` returns `FALSE` (rather than failing with ERROR) when the corpus does not exist. More telling messages.
- `p_attribute_encode()` has new argument `quietly` passed into RcppCWB functions `cwb_compress()`, `cwb_huffcode()` and `cwb_compress_rdx()` to control verbosity.
- Method `$encode()` of `CorpusData` class has new argument `quietly` passed into `p_attribute_encode()`.
- Method `$encode()` has new argument `reload` to trigger unloading and reloading the corpus, to make s-attributes available #57.
- The `CorpusData$encode()` method uses messages from the cli package #59.
- Outdated documentation of `p_attribute_encode()` rewritten, including explanation of argument `compress` and simplification of sample code #61.
- Corrected inconsistencies in the vignette #55.
- `s_attribute_encode()` coerces input `values` to `character` (rather than failing) #62.
- The validity of attribute names is checked by `s_attribute_encode()`, `p_attribute_encode()` and `CorpusData$encode()` using a new (internal) function; a telling message is issued if non-ASCII or uppercase characters are used. The documentation has been augmented accordingly #48.
- For method "R", `p_attribute_encode()` checks whether files for the encoded p-attribute exist and fails gracefully with a telling error message if yes #4.
- Argument `compress` defaults to `FALSE` as corpus compression is not stable on Windows #3.
- Functions `corpus_as_tarball()` and `corpus_copy()` now have `registry_file_parse(corpus, registry_dir)[["home"]]` as default value, so that values are more consistent across `corpus_*` functions #18.
- `cwb_get_bindir()` tries to find the `cwb-config` system utility, if it is on the path.
- `s_attribute_encode()` issues a warning on Windows when using s-attribute 'id' #69.
- Replaced `normalizePath()` by `fs::path()` in `p_attribute_encode()` #65.
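
The attribute name check mentioned in the list above (a message if non-ASCII or uppercase characters are used) could look roughly like the following. This is a minimal sketch in base R: `check_attribute_name()` is a hypothetical helper, not the internal cwbtools function, and the rules cwbtools actually applies may differ.

```r
# Hypothetical helper for illustration; not the internal cwbtools function.
# Returns TRUE if the name has neither non-ASCII nor uppercase characters,
# and issues a message describing the problem otherwise.
check_attribute_name <- function(x) {
  stopifnot(is.character(x), length(x) == 1L, !is.na(x))

  # Any code point above 127 is outside the ASCII range
  non_ascii <- any(utf8ToInt(enc2utf8(x)) > 127L)
  # A name that changes when lowercased contains uppercase characters
  uppercase <- !identical(x, tolower(x))

  if (non_ascii) {
    message(sprintf("attribute name '%s' includes non-ASCII characters", x))
  }
  if (uppercase) {
    message(sprintf("attribute name '%s' includes uppercase characters", x))
  }
  !(non_ascii || uppercase)
}
```

Returning a logical value rather than stopping leaves it to the caller to decide whether an invalid name is an error.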
Improved documentation
Solid Path
- Package names, software names and API are wrapped in single quotes in the DESCRIPTION files, to follow section 1.1.1 of 'Writing R extensions' #43.
- References in the description of the DESCRIPTION file have been standardized #44.
- To meet CRAN requirements, any remaining usage of `install.packages()` has been removed from the package. Using argument `pkg` of `corpus_install()` will install corpora found in a package as system corpora defined in the default registry directory #46.
- The vignettes 'opennnlp.Rmd' and 'sentences.Rmd' have been removed from the package; they are now part of the PolMine Cookbook repository at https://github.com/PolMine/cookbook. Packages 'NLP' and 'openNLP' are no longer suggested and the `install.packages()` call (though not evaluated) is omitted. Part of the fix for #46.
- The `fs::path()` function replaces base R `file.path()` throughout to solidify the generation of paths and to improve the readability of the code.
- `p_attribute_encode()` checks that the `character` vector `token_stream` does not exceed the CWB corpus size limit (2^31 - 1) #40.
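
The size check on `token_stream` can be pictured as follows. This is a sketch under assumptions, not the code in `p_attribute_encode()`: `check_token_stream()` and its `limit` argument are hypothetical names. The limit of 2^31 - 1 stated above coincides with `.Machine$integer.max` in R.

```r
# Hypothetical helper for illustration; not the cwbtools implementation.
# The limit defaults to 2^31 - 1, which equals .Machine$integer.max in R.
check_token_stream <- function(token_stream, limit = .Machine$integer.max) {
  stopifnot(is.character(token_stream))
  if (length(token_stream) > limit) {
    stop(
      sprintf(
        "token stream has %s tokens, exceeding the corpus size limit of %s",
        format(length(token_stream), big.mark = ","),
        format(limit, big.mark = ",")
      ),
      call. = FALSE
    )
  }
  invisible(TRUE)
}
```

The `limit` argument exists only so that the check can be exercised with small vectors; a real token stream near the limit would not fit comfortably in memory.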
Houston Calling
- Ensure that `zenodo_get_tarball()` fails gracefully if Zenodo is temporarily not available.
Secret Spell
- New function `p_attribute_rename()`, corresponding to `s_attribute_rename()`.
- `p_attribute_encode()` will remove the [p_attr].corpus file as suggested by cwb-makeall (if `compress` is `TRUE`).
- Assumptions about the statement of an info file in registry files are relaxed: the line starting with "INFO" is not required.
- Internally, functionality from the `fs` package for a consistent handling of paths (such as `fs::path()`) is used more widely (#36).
- Assumptions about the definition of a version in the name of a corpus tarball are relaxed. If possible, the version is taken from the properties (i.e. the registry file).
- New function `zenodo_get_tarball()` for downloading corpus tarballs from Zenodo. Restricted access can be handled too (personalized URL with token).
- Function `corpus_install()` has new argument `load` to control whether the corpus is loaded after installation.
Hemicycle
NEW FEATURES
- Assumptions about the directory structure in a corpus tarball are somewhat relaxed: The name of the data directory may also be "data" (not just "indexed_corpora") and data files need not necessarily be in a subdirectory of the data directory. This makes downloading and installing the Europarl and the Dickens corpus possible.
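
As an illustration of the relaxed directory assumptions, the lookup of the data directory in an unpacked tarball might resemble the sketch below. `find_data_dir()` is a hypothetical helper and only covers the directory name ("indexed_corpora" or "data"); how cwbtools handles data files that are not in a subdirectory is not shown.

```r
# Hypothetical helper for illustration; not the cwbtools implementation.
# Looks for a data directory under either accepted name and returns the
# first one that exists, or NA if none is found.
find_data_dir <- function(unpacked_dir,
                          candidates = c("indexed_corpora", "data")) {
  paths <- file.path(unpacked_dir, candidates)
  hits <- paths[dir.exists(paths)]
  if (length(hits) == 0L) NA_character_ else hits[1]
}
```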
MINOR IMPROVEMENTS
- The dependency on the devtools package can be dropped as one consequence of removing the Europarl vignette.
- The dependency on the usethis package has been removed.
- The sentences-vignette is more robust by explicitly creating a temporary registry directory.
BUG FIXES
- A unit test that involves calling `cwb_install()` is skipped on Solaris to ensure that Solaris CRAN tests will not fail: A CWB binary is not available for Solaris.
DOCUMENTATION FIXES
- The vignette "europarl.Rmd" is dropped altogether: Putting corpora into packages is not the recommended approach any more.
Il Postino
NEW FEATURES
- It is now possible to install a corpus from S3 by stating an S3-URI as argument `tarball` of `corpus_install()`.
- A new argument `checksum` for the `corpus_install()` function introduces functionality to check the integrity of a downloaded corpus tarball. If the tarball is downloaded from Zenodo (by stating a DOI using argument `doi`), the md5 checksum included in the record's metadata is extracted internally and used for checking.
- A new vignette explains how an existing CWB corpus can be enhanced using openNLP.
- The function `corpus_copy()` will accept a new argument `remove`. If `TRUE` (the default value is `FALSE`), files that have been copied will be removed. Removing files is reasonable to handle disk space parsimoniously if the source corpus is at a temporary location where nobody will miss it.
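
The integrity check behind the `checksum` argument amounts to comparing an expected md5 hash with the hash of the downloaded file. A minimal sketch using `tools::md5sum()` from base R follows; `verify_md5()` is a hypothetical helper, and how `corpus_install()` performs the comparison internally is an assumption.

```r
# Hypothetical helper for illustration; not the cwbtools implementation.
# Compares the md5 hash of a file with an expected hash (case-insensitive).
verify_md5 <- function(file, expected) {
  # tools::md5sum() returns a named character vector (NA if unreadable)
  actual <- unname(tools::md5sum(file))
  identical(tolower(actual), tolower(expected))
}
```

A missing or unreadable file yields `NA` from `tools::md5sum()`, so the comparison returns `FALSE` rather than raising an error.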
MINOR IMPROVEMENTS
- The `corpus_install()` function will abort with a warning and return value `FALSE` rather than an error if the DOI is not offered by Zenodo.
- If `corpus_install()` is used to install a corpus from a tarball present locally, a somewhat confusing message suggested that the tarball was downloaded. This message is not shown any more.
- Extracting a corpus tarball present locally involved copying the tarball to a temporary location before extracting it. This step, which consumed more disk space than necessary (inefficient and potentially problematic with large corpora), is now omitted.
- The function `cwb_install()` now exposes the previously hardcoded internal `cwb_dir` as an argument `cwb_dir`; the function returns the directory where the CWB is installed rather than a `NULL` value.
- The function `cwb_get_bindir()` now introduces an argument `bindir`.
- Argument `compress` of `p_attribute_encode()` now has default value `FALSE` (#29).
- Examples in documentation of `p_attribute_encode()` have been adapted so that the GitHub Action unit test passes on Windows.
- If the user aborts because installing the same version anew would remove an existing corpus, this will not result in an error message any more, but in return value `FALSE` (#25).
BUG FIXES
- To avoid an issue with a false negative issued by `RCurl::url.exists()`, this function has been replaced by `httr::http_error()` (#31).
- The `corpus_install()` function still showed some progress messages even when `verbose` was set as `FALSE` (argument not passed to `corpus_copy()`). Fixed.
- The code in the vignette on adding a sentence annotation was not executed when building the package and a bug in the code went unnoticed. Fixed (#17).
- The `get_encoding()` method would return `NA` if `localeToCharset()` fails to infer charset from locale. In this case, UTF-8 is assumed.
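
The UTF-8 fallback described for `get_encoding()` can be sketched in a few lines of base R. `guess_charset()` is a hypothetical name, and the actual method in cwbtools may do more than this.

```r
# Hypothetical helper for illustration; not the cwbtools method.
# utils::localeToCharset() may return NA if the charset cannot be inferred
# from the locale; in that case UTF-8 is assumed.
guess_charset <- function(charset = utils::localeToCharset()[1]) {
  if (is.na(charset)) "UTF-8" else charset
}
```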
DOCUMENTATION FIXES
- A misleading, deprecated example in a dontrun section of the general package documentation has been removed (#23). The vignette includes a working and tested example of how to encode the REUTERS corpus.
Straight No Chaser
NEW FEATURES
- The (weak) dependency on the polmineR package (it was in the 'Suggests:' section of the DESCRIPTION file) has been removed. Changes are purely internal (higher-level polmineR functions have been replaced by lower-level RcppCWB functions, some tests were re-written). Dropping the dependency has the advantage that there is a much clearer structure of dependencies now (RcppCWB -> cwbtools -> polmineR).
MINOR IMPROVEMENTS
- A remaining CLI formatting issue has been removed from the user dialogue for modifying the .Renviron file.
- Unit tests used a test download of the United Nations General Assembly (UNGA) corpus from Zenodo. To reduce the time required for testing the package, a test download of the (much smaller) GermaParlSample corpus is now performed.
Apple Picker
NEW FEATURES
- The `corpus_install()` function gives much better and nicer reports on steps performed during corpus downloads. User dialogues have been reworked thoroughly to provide better user guidance.
- The `use_corpus_registry_envvar()` function is called by `corpus_install()` and will amend the .Renviron file as appropriate if the user so desires.
- To resolve a DOI, the 'zen4R' package is used, to extract information on the whereabouts of a corpus tarball efficiently from the Zenodo API.
- A `corpus_testload()` function has been implemented to check whether a (newly installed) corpus is accessible.
MINOR IMPROVEMENTS
- Extracting the version number from the corpus tarball is somewhat more forgiving if the version number does not start with "v".
- The registry file for a newly downloaded corpus is refreshed only if a temporary registry directory is used.
- To remedy the fairly common error that the path to the info file is not stated correctly in the registry file, a fallback mechanism will look up potential alternatives to an info file stated wrongly.
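
The more forgiving version extraction mentioned in the list above (a leading "v" is optional) can be illustrated with a regular expression. This is a sketch only: `extract_version()` is a hypothetical helper, and the three-part version scheme and the file names used with it are assumptions, not the naming convention cwbtools requires.

```r
# Hypothetical helper for illustration; not the cwbtools implementation.
# Extracts a version such as "1.2.3" from a tarball name, whether or not
# the version is prefixed with "v". Returns NA if no version is found.
extract_version <- function(tarball) {
  fname <- basename(tarball)
  m <- regmatches(fname, regexpr("v?[0-9]+\\.[0-9]+\\.[0-9]+", fname))
  if (length(m) == 0L) return(NA_character_)
  sub("^v", "", m)
}
```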
BUG FIXES
- The JSON string returned from Zenodo may include newline strings that are escaped such that they cannot be processed by `jsonlite::fromJSON()`. The auxiliary function to get and process information from Zenodo now ensures that newline characters are escaped such that they can be processed.
- The `corpus_copy()` function did not set the path to the info file to the new data directory. Corrected.
- The `corpus_install()` function failed when the `registry_dir` got a `NULL` value from the default call to `cwbtools::cwb_registry_dir()`. But if the directories are created, the registry directory is there. Fixed.
- Removed a bug (faulty assignment) that would prevent the path of a registry file from being handled correctly (i.e. wrapped in quotation marks) by `registry_file_compose()` when the path includes any whitespace characters.
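
The quoting rule referred to in the last item (paths with whitespace are wrapped in quotation marks) can be sketched as follows. `quote_if_whitespace()` is a hypothetical helper; `registry_file_compose()` may implement the rule differently.

```r
# Hypothetical helper for illustration; not part of cwbtools.
# Wraps a path in double quotes if it contains any whitespace character,
# and returns it unchanged otherwise.
quote_if_whitespace <- function(path) {
  if (grepl("[[:space:]]", path)) sprintf('"%s"', path) else path
}
```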
DOCUMENTATION FIXES
- A problem with updating the `curl` dependency of `cwbtools` that may arise when `devtools::install_github()` is used is addressed in an extended explanation in the README.md file on how to install the development version of `cwbtools` using `remotes::install_github()` (#21).
Late Vintage
This is a minor release that anticipates an upcoming change in R's `matrix` class, which will inherit from the `array` class starting with R 4.0.
MINOR IMPROVEMENTS
- The `pkg_add_corpus()` function will now create the cwb directories (registry and data directory) if necessary. Previously, these directories were required to exist before moving a corpus into a package, making it necessary to put dummy files into packages to keep R CMD build from issuing warnings and git from dropping these directories. Creating the directories on demand is a precondition for a CRAN release of data packages (#11).
BUG FIXES
- In the upcoming R version 4.0, the `matrix` class will inherit from class `array`. The new package version now takes into account that `length(class(matrix(1:4,2,2)))` will return the value 2.
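
To illustrate why this matters: code that compares `class(x)` with a single string assumes a class vector of length 1, which no longer holds for matrices from R 4.0 on. The sketch below shows a check that works before and after the change; `is_matrix_safe()` is a hypothetical name and not a function of the package.

```r
# Hypothetical helper for illustration; not part of cwbtools.
# From R 4.0 on, class(matrix(1:4, 2, 2)) is c("matrix", "array"), so
# class(x) == "matrix" yields a logical vector of length 2.
# inherits() returns a single TRUE/FALSE regardless of the R version.
is_matrix_safe <- function(x) inherits(x, "matrix")
```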
DOCUMENTATION FIXES
- The NEWS file now follows the styleguide such that `pkgdown::build_site()` will generate a proper changelog page.