Title: Import and Handling for Plain and Formatted Text Files
Description: Functions for importing and handling text files and formatted text files with additional meta-data, including '.csv', '.tab', '.json', '.xml', '.html', '.pdf', '.doc', '.docx', '.rtf', '.xls', '.xlsx', and others.
Authors: Kenneth Benoit [aut, cre, cph], Adam Obeng [aut], Kohei Watanabe [ctb], Akitaka Matsuo [ctb], Paul Nulty [ctb], Stefan Müller [ctb]
Maintainer: Kenneth Benoit <[email protected]>
License: GPL-3
Version: 0.91
Built: 2024-11-11 04:45:50 UTC
Source: https://github.com/quanteda/readtext
A set of functions for importing and handling text files and formatted text files with additional meta-data, including .csv, .tab, .json, .xml, .xls, .xlsx, and others.
readtext makes it easy to import text files in various formats, including the use of operating system filemasks ("glob" pattern matches) to load groups of files at once, including files spread across multiple directories or sub-directories. readtext can also read multiple files into R from compressed archive files such as .gz, .zip, .tar.gz, etc. Finally, readtext reads in the document-level meta-data associated with texts, if those texts are in a format (e.g. .csv, .json) that includes additional, non-textual data.
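A minimal sketch of these access patterns (the glob path in the first call is hypothetical; the .zip archive is the one bundled with the package and reused in the encoding examples further down):

# read a group of files with a glob pattern (hypothetical directory)
# rt <- readtext("~/mytexts/*.txt")

# read all text files contained in a compressed archive bundled with the package
rt <- readtext(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"))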
Package options:
readtext_verbosity: Default verbosity for messages produced when reading files. See readtext().
Ken Benoit, Adam Obeng, and Paul Nulty
Useful links: https://github.com/quanteda/readtext
An accessor function to return the texts from a readtext object as a character vector, with names matching the document names.
## S3 method for class 'readtext'
as.character(x, ...)
x: the readtext object whose texts will be extracted
...: further arguments passed to or from other methods
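A short usage sketch (the data path reuses the package's bundled UDHR texts, which also appear in the readtext() examples below):

DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
txt <- as.character(rt)   # a named character vector of the texts
names(txt)                # names match the document names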
data_char_encodedtexts is a 10-element character vector with 10 different encodings.

data_char_encodedtexts

Format: An object of class character of length 10.
## Not run:
Encoding(data_char_encodedtexts)
data.frame(labelled = names(data_char_encodedtexts),
           detected = encoding(data_char_encodedtexts)$all)
## End(Not run)
A set of translations of the Universal Declaration of Human Rights, plus one or two other miscellaneous texts, for testing the text input functions that need to translate different input encodings.
Source: The Universal Declaration of Human Rights resources, https://www.un.org/en/about-us/universal-declaration-of-human-rights
## Not run:
# unzip the files to a temporary directory
FILEDIR <- tempdir()
unzip(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"),
      exdir = FILEDIR)

# get encoding from filename
filenames <- list.files(FILEDIR, "\\.txt$")
# strip the extension
filenames <- gsub(".txt$", "", filenames)
parts <- strsplit(filenames, "_")
fileencodings <- sapply(parts, "[", 3)
fileencodings

# find out which conversions are unavailable (through iconv())
cat("Encoding conversions not available for this platform:")
notAvailableIndex <- which(!(fileencodings %in% iconvlist()))
fileencodings[notAvailableIndex]

# try readtext
require(quanteda)
txts <- readtext(paste0(FILEDIR, "/", "*.txt"))
substring(texts(txts)[1], 1, 80)  # gibberish
substring(texts(txts)[4], 1, 80)  # hex
substring(texts(txts)[40], 1, 80) # hex

# read them in again
txts <- readtext(paste0(FILEDIR, "/", "*.txt"), encoding = fileencodings)
substring(texts(txts)[1], 1, 80)  # English
substring(texts(txts)[4], 1, 80)  # Arabic, looking good
substring(texts(txts)[40], 1, 80) # Cyrillic, looking good
substring(texts(txts)[7], 1, 80)  # Chinese, looking good
substring(texts(txts)[26], 1, 80) # Hindi, looking good

txts <- readtext(paste0(FILEDIR, "/", "*.txt"),
                 encoding = fileencodings,
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language", "inputEncoding"))
encodingCorpus <- corpus(txts, source = "Created by encoding-tests.R")
summary(encodingCorpus)
## End(Not run)
Detect the encoding of texts in a character or readtext object and report on the most likely encoding for each document. Useful in detecting the encoding of input texts, so that a source encoding can be (re)specified when inputting a set of texts using readtext(), prior to constructing a corpus.
encoding(x, verbose = TRUE, ...)
x: character vector, corpus, or readtext object whose texts' encodings will be detected.
verbose: if TRUE (the default), report the most likely encoding detected for each text.
...: additional arguments passed to stri_enc_detect.
Based on stri_enc_detect, which is in turn based on the ICU libraries. See the ICU User Guide, https://unicode-org.github.io/icu/userguide/.
## Not run:
encoding(data_char_encodedtexts)
# show detected value for each text, versus known encoding
data.frame(labelled = names(data_char_encodedtexts),
           detected = encoding(data_char_encodedtexts)$all)

# Russian text, Windows-1251
myreadtext <- readtext("https://kenbenoit.net/files/01_er_5.txt")
encoding(myreadtext)
## End(Not run)
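The detect-then-respecify workflow described above can be sketched as follows, reusing the bundled encoded-texts archive. This is a minimal sketch: it assumes the $all element of the encoding() result (seen in the examples above) lines up one detected encoding per document, in file order.

# detect the most likely encoding of each document ...
rt <- readtext(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"))
detected <- encoding(rt)$all   # assumption: one detected encoding per document
# ... then re-read, supplying the detected encodings as the source encodings
rt2 <- readtext(system.file("extdata", "data_files_encodedtexts.zip", package = "readtext"),
                encoding = detected)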
Read texts and (if any) associated document-level meta-data from one or more source files. The texts come from the textual component of each source file, and the document-level metadata ("docvars") come from either the file contents or the filenames.
readtext(
  file,
  ignore_missing_files = FALSE,
  text_field = NULL,
  docid_field = NULL,
  docvarsfrom = c("metadata", "filenames", "filepaths"),
  dvsep = "_",
  docvarnames = NULL,
  encoding = NULL,
  source = NULL,
  cache = TRUE,
  verbosity = readtext_options("verbosity"),
  ...
)
file: the complete filename(s) to be read. This is designed to automagically handle a number of common scenarios, so the value can be a "glob"-type wildcard value. Supported single-file formats include plain text (.txt), .csv, .tab/.tsv, .json, .xml, .html, .pdf, .doc, .docx, .rtf, .xls, and .xlsx (the formats listed in the package Description). In addition, file can be a wildcard expression matching multiple files and file types at once, including files in multiple directories or sub-directories, a compressed archive (.gz, .zip, .tar.gz), or a remote URL.
ignore_missing_files: if FALSE (the default), an error is raised when the file argument does not resolve to at least one existing file; if TRUE, missing files are ignored.
text_field, docid_field: a variable (column) name or column number indicating where to find the texts that form the documents for the corpus and their identifiers. This must be specified for file types containing multiple fields, such as .csv and .json (see the examples below).
docvarsfrom: used to specify that docvars should be taken from the filenames ("filenames") or the file paths ("filepaths"), when their elements encode document-level variables delimited by dvsep; the default "metadata" takes docvars from the file contents, where available.
dvsep: separator (a regular expression character string) used in filenames to delimit docvar elements, if docvarsfrom = "filenames" or "filepaths" is used.
docvarnames: character vector of variable names for the docvars created from the filenames or file paths; if unsupplied, default names are used.
encoding: vector: either the encoding of all files, or one encoding for each file.
source: used to specify specific formats of some input file types, such as JSON or HTML.
cache: if TRUE, save a remote file to a temporary folder rather than re-downloading it; only used when file is a remote URL.
verbosity: level of verbosity for messages produced when reading files; defaults to readtext_options("verbosity"). See readtext_options().
...: additional arguments passed through to the low-level file reading function for the given file type.
a data.frame consisting of the columns doc_id and text, which contain a document identifier and the texts respectively, with any additional columns consisting of document-level variables either found in the file containing the texts, or created through the readtext call.
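As a concrete illustration of this return structure, reading the bundled inaugural-address CSV (used again in the examples below) yields a data.frame whose first two columns are doc_id and text, with the remaining CSV columns carried along as docvars; building a quanteda corpus from the result is a common next step (a sketch, assuming quanteda is installed):

DATA_DIR <- system.file("extdata/", package = "readtext")
rt <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"))
names(rt)      # "doc_id", "text", then the docvar columns from the CSV
str(rt$text)   # the texts themselves, as a character vector
# corp <- quanteda::corpus(rt)   # optional: construct a corpus from the readtext object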
## Not run:
## get the data directory
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")

## read in some text data
# all UDHR files
(rt1 <- readtext(paste0(DATA_DIR, "/txt/UDHR/*")))

# manifestos with docvars from filenames
(rt2 <- readtext(paste0(DATA_DIR, "/txt/EU_manifestos/*.txt"),
                 docvarsfrom = "filenames",
                 docvarnames = c("unit", "context", "year", "language", "party"),
                 encoding = "LATIN1"))

# recurse through subdirectories
(rt3 <- readtext(paste0(DATA_DIR, "/txt/movie_reviews/*"),
                 docvarsfrom = "filepaths", docvarnames = "sentiment"))

## read in csv data
(rt4 <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv")))

## read in tab-separated data
(rt5 <- readtext(paste0(DATA_DIR, "/tsv/dailsample.tsv"), text_field = "speech"))

## read in JSON data
(rt6 <- readtext(paste0(DATA_DIR, "/json/inaugural_sample.json"), text_field = "texts"))

## read in pdf data
# UDHR
(rt7 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                 docvarsfrom = "filenames",
                 docvarnames = c("document", "language")))
Encoding(rt7$text)

## read in Word data (.doc)
(rt8 <- readtext(paste0(DATA_DIR, "/word/*.doc")))
Encoding(rt8$text)

## read in Word data (.docx)
(rt9 <- readtext(paste0(DATA_DIR, "/word/*.docx")))
Encoding(rt9$text)

## use elements of path and filename as docvars
(rt10 <- readtext(paste0(DATA_DIR, "/pdf/UDHR/*.pdf"),
                  docvarsfrom = "filepaths", dvsep = "[/_.]"))
## End(Not run)
Get or set global options affecting functions across readtext.
readtext_options(..., reset = FALSE, initialize = FALSE)
...: options to be set, as key = value pairs, same as for options().
reset: logical; if TRUE, reset all readtext options to their default values.
initialize: logical; if TRUE, set only those options that have not already been defined, which is useful for establishing initial values.
Currently available options are:
verbosity: Default verbosity for messages produced when reading files. See readtext().
When called using a key = value pair (where key can be a label or quoted character name), the option is set and TRUE is returned invisibly.

When called with no arguments, a named list of the package options is returned.

When called with reset = TRUE as an argument, all options are reset to their default values, and TRUE is returned invisibly.
## Not run:
# save the current options
(opt <- readtext_options())

# set higher verbosity
readtext_options(verbosity = 3)

# read something in here
if (!interactive()) pkgload::load_all()
DATA_DIR <- system.file("extdata/", package = "readtext")
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))

# reset to saved options
readtext_options(opt)
## End(Not run)