Title: Wrapper to the 'spaCy' 'NLP' Library
Description: An R wrapper to the 'Python' 'spaCy' 'NLP' library, from <https://spacy.io>.
Authors: Kenneth Benoit [cre, aut, cph], Akitaka Matsuo [aut], Johannes Gruber [ctb], European Research Council [fnd] (ERC-2011-StG 283794-QUANTESS)
Maintainer: Kenneth Benoit <[email protected]>
License: GPL-3
Version: 1.3.1
Built: 2024-12-17 05:28:07 UTC
Source: https://github.com/quanteda/spacyr
An R wrapper to the Python (Cython) spaCy NLP system, from https://spacy.io, nicely integrated with quanteda. spacyr is designed to provide easy access to the powerful functionality of spaCy in a simple format.
Ken Benoit and Akitaka Matsuo
Useful links: https://spacy.io, https://spacyr.quanteda.io.
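As a quick orientation, a typical spacyr session follows an initialize-parse-finalize pattern. This is a minimal sketch only; it assumes spaCy and the en_core_web_sm model are already installed (e.g. via spacy_install()).
## Not run:
library("spacyr")
spacy_initialize(model = "en_core_web_sm")
# one row per token: doc_id, sentence_id, token_id, token, lemma, pos, entity
parsed <- spacy_parse("spacyr wraps spaCy for R users.")
head(parsed)
# shut down the background Python process when done
spacy_finalize()
## End(Not run)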
data_char_paragraph
A sample of text from the Irish budget debate of 2010 (531 tokens long).
An object of class character of length 1.
data_char_sentences
A character object consisting of 30 short documents in plain text format for testing. Each document is one or two brief sentences.
An object of class character of length 30.
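These bundled objects can be passed directly to the package's functions; for instance, a sketch assuming spaCy has been initialized:
## Not run:
spacy_initialize()
# tokenize the 30 bundled test sentences; returns one list element per document
toks <- spacy_tokenize(data_char_sentences)
length(toks)
## End(Not run)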
From an object parsed by spacy_parse(), extract the entities as a separate object, or convert the multi-word entities into single "tokens" consisting of the concatenated elements of the multi-word entities.
entity_extract(x, type = c("named", "extended", "all"), concatenator = "_")
entity_consolidate(x, concatenator = "_")
x: output from spacy_parse()
type: type of named entities, either "named", "extended", or "all"
concatenator: the character(s) used to join the elements of multi-word named entities
entity_extract() returns a data.frame of all named entities, containing the following fields:
doc_id: name of the document containing the entity
sentence_id: the ID of the sentence containing the entity, within the document
entity: the named entity
entity_type: the type of the named entity (e.g. PERSON, ORG, PERCENT)
entity_consolidate() returns a modified data.frame of parsed results, where the named entities have been combined into a single "token". Currently, dependency parsing is removed when this consolidation occurs.
## Not run:
spacy_initialize()

# entity extraction
txt <- "Mr. Smith moved to San Francisco in December."
parsed <- spacy_parse(txt, entity = TRUE)
entity_extract(parsed)
entity_extract(parsed, type = "all")

# consolidating multi-word entities
txt <- "The House of Representatives voted to suspend aid to South Dakota."
parsed <- spacy_parse(txt, entity = TRUE)
entity_consolidate(parsed)
## End(Not run)
From an object parsed by spacy_parse(), extract the multi-word noun phrases as a separate object, or convert the multi-word noun phrases into single "tokens" consisting of the concatenated elements of the multi-word noun phrases.
nounphrase_extract(x, concatenator = "_")
nounphrase_consolidate(x, concatenator = "_")
x: output from spacy_parse()
concatenator: the character(s) used to join elements of multi-word noun phrases
nounphrase_extract() returns a data.frame of all noun phrases, containing the following fields:
doc_id: name of the document containing the noun phrase
sentence_id: the ID of the sentence containing the noun phrase, within the document
nounphrase: the noun phrase
root: the root token of the noun phrase
nounphrase_consolidate() returns a modified data.frame of parsed results, where the noun phrases have been combined into a single "token". Currently, dependency parsing is removed when this consolidation occurs.
## Not run:
spacy_initialize()

# noun phrase extraction
txt <- "Mr. Smith moved to San Francisco in December."
parsed <- spacy_parse(txt, nounphrase = TRUE)
nounphrase_extract(parsed)

# consolidating multi-word noun phrases
txt <- "The House of Representatives voted to suspend aid to South Dakota."
parsed <- spacy_parse(txt, nounphrase = TRUE)
nounphrase_consolidate(parsed)
## End(Not run)
Download spaCy language models
spacy_download_langmodel(lang_models = "en_core_web_sm", force = FALSE)
lang_models: character; language models to be installed. Defaults to "en_core_web_sm".
force: logical; if TRUE, install the language model(s) even if already present.
Invisibly returns the installation log.
## Not run:
# install medium-sized model
spacy_download_langmodel("en_core_web_md")

# install several models with spaCy
spacy_install(lang_models = c("en_core_web_sm", "de_core_news_sm"))

# install transformer-based model
spacy_download_langmodel("en_core_web_trf")
## End(Not run)
Deprecated. spacyr now always uses a virtual environment, making this function redundant.
spacy_download_langmodel_virtualenv(...)
...: not used
This function extracts named entities from texts, based on the entity tag ent attributes of document objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#section-named-entities).
spacy_extract_entity(
  x,
  output = c("data.frame", "list"),
  type = c("all", "named", "extended"),
  multithread = TRUE,
  ...
)
x: a character object or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
output: type of returned object, either "data.frame" or "list"
type: type of named entities, either "all", "named", or "extended"
multithread: logical; if TRUE, the processing is parallelized
...: unused
When output = "data.frame" is selected, the function returns a data.frame with the following fields:
text: contents of the entity
entity_type: type of entity (e.g. ORG for organizations)
start_id: serial number ID of the starting token; this number corresponds to the token numbering in the data.frame returned by spacy_tokenize(x) with default options
length: number of words (tokens) included in the named entity (e.g. for the entity "New York Stock Exchange", length = 4)
Either a list or a data.frame of tokens.
## Not run:
spacy_initialize()
txt <- c(doc1 = "The Supreme Court is located in Washington D.C.",
         doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_entity(txt)
spacy_extract_entity(txt, output = "list")
## End(Not run)
This function extracts noun phrases from documents, based on the noun_chunks attributes of document objects parsed by spaCy (see https://spacy.io/usage/linguistic-features#noun-chunks).
spacy_extract_nounphrases(
  x,
  output = c("data.frame", "list"),
  multithread = TRUE,
  ...
)
x: a character object or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
output: type of returned object, either "data.frame" or "list"
multithread: logical; if TRUE, the processing is parallelized
...: unused
When output = "data.frame" is selected, the function returns a data.frame with the following fields:
text: contents of the noun phrase
root_text: contents of the root token
start_id: serial number ID of the starting token; this number corresponds to the token numbering in the data.frame returned by spacy_tokenize(x) with default options
root_id: serial number ID of the root token
length: number of words (tokens) included in the noun phrase (e.g. for the noun phrase "individual car owners", length = 3)
Either a list or a data.frame of tokens.
## Not run:
spacy_initialize()
txt <- c(doc1 = "Natural language processing is a branch of computer science.",
         doc2 = "Paul earned a postgraduate degree from MIT.")
spacy_extract_nounphrases(txt)
spacy_extract_nounphrases(txt, output = "list")
## End(Not run)
While running spaCy on Python through R, a Python process is always running in the background, and the R session will take up a lot of memory (typically over 1.5GB). spacy_finalize() terminates the Python process and frees up the memory it was using.
spacy_finalize()
Akitaka Matsuo
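A sketch of the full session lifecycle, assuming spaCy and a language model are installed:
## Not run:
spacy_initialize()
parsed <- spacy_parse("One quick parse before shutting down.")
# terminate the background Python process and release its memory
spacy_finalize()
# a fresh spacy_initialize() call is required before parsing again
## End(Not run)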
Initialize spaCy to call from R.
spacy_initialize(model = "en_core_web_sm", entity = TRUE, ...)
model: language model to be loaded by spaCy. Example: "en_core_web_sm".
entity: logical; if FALSE, named entity recognition is turned off in spaCy, which can speed up parsing
...: not used.
Akitaka Matsuo, Johannes B. Gruber
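For example, to initialize with a non-default model (a sketch; it assumes the de_core_news_sm model has already been downloaded, e.g. via spacy_download_langmodel()):
## Not run:
spacy_download_langmodel("de_core_news_sm")
spacy_initialize(model = "de_core_news_sm")
spacy_parse("Berlin ist die Hauptstadt von Deutschland.")
spacy_finalize()
## End(Not run)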
Install spaCy in a self-contained environment, including specified language models.
spacy_install(
  version = "latest",
  lang_models = "en_core_web_sm",
  ask = interactive(),
  force = FALSE,
  ...
)
version: character; spaCy version to install (see Details).
lang_models: character; language models to be installed. Defaults to "en_core_web_sm".
ask: logical; ask whether to proceed during the installation. By default, questions are only asked in interactive sessions.
force: logical; if TRUE, install spaCy/the language models even if already present.
...: not used.
The function checks whether a suitable installation of Python is present on the system and installs one via reticulate::install_python() otherwise. It then creates a virtual environment with the necessary packages in the default location chosen by reticulate::virtualenv_root().
If you want to install a different version of Python than the default, you should call reticulate::install_python() directly. If you want to create or use a different virtual environment, you can use, e.g., Sys.setenv(SPACY_PYTHON = "path/to/directory").
## Not run:
# install the latest version of spaCy
spacy_install()

# update spaCy
spacy_install(force = TRUE)

# install an older version
spacy_install(version = "3.1.0")

# install with GPU enabled
spacy_install(version = "cuda-autodetect")

# install on Apple ARM processors
spacy_install(version = "apple")

# install an old custom version
spacy_install(version = "[cuda-autodetect]==3.2.0")

# install several models with spaCy
spacy_install(lang_models = c("en_core_web_sm", "de_core_news_sm"))

# install spaCy to an existing virtual environment
Sys.setenv(RETICULATE_PYTHON = "path/to/python")
spacy_install()
## End(Not run)
Deprecated. spacy_install() now installs to a virtual environment by default.
spacy_install_virtualenv(...)
...: not used
The spacy_parse() function calls spaCy to both tokenize and tag the texts, and returns a data.table of the results. The function provides options on the type of tagset returned, either the coarse "google" (universal) tagset (pos) or the "detailed" language-specific tagset (tag), as well as lemmatization (lemma). Dependency parsing and named entity recognition are available as options. If full_parse = TRUE is provided, the function returns the most extensive list of the parsing results from spaCy.
spacy_parse(
  x,
  pos = TRUE,
  tag = FALSE,
  lemma = TRUE,
  entity = TRUE,
  dependency = FALSE,
  nounphrase = FALSE,
  multithread = TRUE,
  additional_attributes = NULL,
  ...
)
x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
pos: logical; whether to return the universal dependencies POS tagset (https://universaldependencies.org/u/pos/)
tag: logical; whether to return detailed part-of-speech tags specific to the language model
lemma: logical; include lemmatized tokens in the output (lemmatization may not work properly for non-English models)
entity: logical; if TRUE, report named entities
dependency: logical; if TRUE, analyse and tag dependencies
nounphrase: logical; if TRUE, analyse and tag noun phrases
multithread: logical; if TRUE, the processing is parallelized
additional_attributes: a character vector; this option is for extracting additional attributes of tokens from spaCy. When the names of attributes are supplied, the output data.frame will contain additional variables corresponding to the names of the attributes. For instance, additional_attributes = c("is_punct") adds an is_punct column to the output.
...: not used directly
a data.frame of tokenized, parsed, and annotated tokens
## Not run:
spacy_initialize()

# See Chap 5.1 of the NLTK book, http://www.nltk.org/book/ch05.html
txt <- "And now for something completely different."
spacy_parse(txt)
spacy_parse(txt, pos = TRUE, tag = TRUE)
spacy_parse(txt, dependency = TRUE)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_parse(txt2, entity = TRUE, dependency = TRUE)

txt3 <- "We analyzed the Supreme Court with three natural language processing tools."
spacy_parse(txt3, entity = TRUE, nounphrase = TRUE)
spacy_parse(txt3, additional_attributes = c("like_num", "is_punct"))
## End(Not run)
Efficient tokenization (without POS tagging, dependency parsing, lemmatization, or named entity recognition) of texts using spaCy.
spacy_tokenize(
  x,
  what = c("word", "sentence"),
  remove_punct = FALSE,
  remove_url = FALSE,
  remove_numbers = FALSE,
  remove_separators = TRUE,
  remove_symbols = FALSE,
  padding = FALSE,
  multithread = TRUE,
  output = c("list", "data.frame"),
  ...
)
x: a character object, a quanteda corpus, or a TIF-compliant corpus data.frame (see https://github.com/ropenscilabs/tif)
what: the unit for splitting the text, either "word" or "sentence"
remove_punct: remove punctuation tokens
remove_url: remove tokens that look like a URL or email address
remove_numbers: remove tokens that look like a number (e.g. "334", "3.1415", "fifty")
remove_separators: remove spaces as separators when all other remove functionalities (e.g. remove_punct) are set to FALSE
remove_symbols: remove symbols
padding: if TRUE, leave an empty string where the removed tokens previously existed
multithread: logical; if TRUE, the processing is parallelized
output: type of returned object, either "list" or "data.frame"
...: not used directly
Either a list or a data.frame of tokens.
## Not run:
spacy_initialize()
txt <- "And now for something completely different."
spacy_tokenize(txt)

txt2 <- c(doc1 = "The fast cat catches mice.\nThe quick brown dog jumped.",
          doc2 = "This is the second document.",
          doc3 = "This is a \"quoted\" text.")
spacy_tokenize(txt2)
## End(Not run)
Removes the virtual environment created by spacy_install().
spacy_uninstall(confirm = interactive())
confirm: logical; confirm before uninstalling spaCy?
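A minimal sketch of removing and later recreating the environment; setting confirm = FALSE to skip the prompt is an assumption useful in non-interactive scripts:
## Not run:
# remove the spacyr virtual environment without prompting
spacy_uninstall(confirm = FALSE)
# a later spacy_install() recreates the environment from scratch
spacy_install()
## End(Not run)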
Upgrade spaCy (to a specific version).
spacy_upgrade(
  version = "latest",
  lang_models = NULL,
  ask = interactive(),
  force = TRUE,
  ...
)
version: character; spaCy version to install (see Details).
lang_models: character; language models to be installed. Defaults to NULL.
ask: logical; ask whether to proceed during the installation. By default, questions are only asked in interactive sessions.
force: logical; if TRUE, install spaCy/the language models even if already present.
...: passed on to spacy_install()
The function checks whether a suitable installation of Python is present on the system and installs one via reticulate::install_python() otherwise. It then creates a virtual environment with the necessary packages in the default location chosen by reticulate::virtualenv_root().
If you want to install a different version of Python than the default, you should call reticulate::install_python() directly. If you want to create or use a different virtual environment, you can use, e.g., Sys.setenv(SPACY_PYTHON = "path/to/directory").
## Not run:
# install the latest version of spaCy
spacy_install()

# update spaCy
spacy_install(force = TRUE)

# install an older version
spacy_install(version = "3.1.0")

# install with GPU enabled
spacy_install(version = "cuda-autodetect")

# install on Apple ARM processors
spacy_install(version = "apple")

# install an old custom version
spacy_install(version = "[cuda-autodetect]==3.2.0")

# install several models with spaCy
spacy_install(lang_models = c("en_core_web_sm", "de_core_news_sm"))

# install spaCy to an existing virtual environment
Sys.setenv(RETICULATE_PYTHON = "path/to/python")
spacy_install()
## End(Not run)