Title: | Multilingual Stopword Lists |
---|---|
Description: | Provides multiple sources of stopwords, for use in text analysis and natural language processing. |
Authors: | Kenneth Benoit [aut, cre], David Muhr [aut], Kohei Watanabe [aut] |
Maintainer: | Kenneth Benoit <[email protected]> |
License: | MIT + file LICENSE |
Version: | 2.4 |
Built: | 2024-11-22 02:50:29 UTC |
Source: | https://github.com/quanteda/stopwords |
Provides a stopwords() function to return character vectors of stopwords for different languages, using the ISO-639-1 language codes, and allows for different sources of stopwords to be defined.
snowball
The Snowball stopword lists for multiple languages. Most of these have been ported from the quanteda stopword lists (in versions < 1.0 of that package).
stopwords-iso
The collection taken from https://github.com/stopwords-iso/stopwords-iso/.
smart
The English-language stopword list from the SMART information retrieval system.
misc
A few additional stopword lists, including the non-Snowball word lists from quanteda versions < 1.0.
marimo
Stopword lists compiled by Kohei Watanabe.
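As a minimal sketch of how the sources listed above can be compared, assuming the stopwords package is installed, the following requests English stopwords from two of them. The variable names are illustrative only.

```r
library(stopwords)

# Request English stopwords from two different sources.
sw_snowball <- stopwords(language = "en", source = "snowball")
sw_smart <- stopwords(language = "en", source = "smart")

# Compare the sizes of the two lists.
length(sw_snowball)
length(sw_smart)

# Show a few words that are in the SMART list but not in the Snowball list.
head(setdiff(sw_smart, sw_snowball))
```

Because the lists differ in size and coverage, the choice of source can change which tokens survive stopword removal.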
Kenneth Benoit, David Muhr, and Kohei Watanabe
Stopword lists for ancient Greek and Latin. These lists are far more extensive than the ancient Greek and Latin lists from the Perseus Digital Library.
An object of class list of length 2.
As there is no 2-letter code for ancient Greek in ISO-639-1, we use "grc" to denote Greek (as per ISO-639-3).
stopwords(language = "grc", source = "ancient")
stopwords(language = "la", source = "ancient")
Aurélien Berra, Ancient Greek and Latin stopwords, doi: 10.5281/zenodo.1165205. See https://github.com/aurelberra/stopwords/blob/master/rationale.md.
Stopword lists that include specific parts of speech, maintained by Kohei Watanabe.
An object of class list of length 9.
In the original data, these are multi-level lists. If you wish to use them as lists, please access the data object directly.
stopwords(language = "en", source = "marimo")
The English version was adopted from the Snowball collection, and then extended and translated into other languages by contributors. Names of contributors are in the header of the original YAML files.
# access English pronouns directly
stopwords::data_stopwords_marimo$en$pronoun
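The direct access shown above can be extended into a small sketch that inspects the available categories and filters tokens by one of them. This assumes the stopwords package is installed; the `tokens` vector is made up for illustration, and only the `pronoun` category named in the example above is relied on.

```r
# Take the multi-level English entry from the marimo data object.
marimo_en <- stopwords::data_stopwords_marimo$en

# List the categories available for English.
names(marimo_en)

# Flatten the pronoun category to a plain character vector.
pronouns <- unlist(marimo_en$pronoun, use.names = FALSE)

# Remove only pronouns from a made-up token vector, keeping other words.
tokens <- c("she", "reads", "it", "daily")
tokens[!tokens %in% pronouns]
```

Working with a single category like this removes only one part of speech, rather than the whole flattened list returned by stopwords(language = "en", source = "marimo").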
Other, miscellaneous stopword lists.
An object of class list of length 5.
stopwords(language, source = "misc")
The Arabic stopwords come from https://sites.google.com/site/kevinbouge/stopwords-lists.
The Catalan stopwords come from http://latel.upf.edu/morgana/altres/pub/ca_stop.htm.
The Greek stopwords were supplied by Carsten Schwemmer (see https://github.com/quanteda/quanteda/issues/282).
The Gujarati stopwords are taken from https://github.com/gujarati-ir/Gujarati-Stop-Words and modified by Chandrakant Bhogayata.
The Chinese stopwords are taken from the Baidu stopword list (see http://www.baiduguide.com/baidu-stopwords/).
Stopword lists for 23 languages from the Python NLTK library.
An object of class list of length 23.
stopwords(language = "en", source = "nltk")
https://github.com/nltk/nltk_data/blob/gh-pages/packages/corpora/stopwords.zip
Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
Stopword lists for ancient Greek and Latin. As there is no 2-letter code for ancient Greek in ISO-639-1, we use "grc" to denote Greek (as per ISO-639-3).
An object of class list of length 2.
stopwords(language = "grc", source = "perseus")
stopwords(language = "la", source = "perseus")
The Perseus Digital Library. See https://wiki.digitalclassicist.org/Stopwords_for_Greek_and_Latin and https://wiki.digitalclassicist.org/Perseus_Digital_Library.
Stopword lists based on the SMART (System for the Mechanical Analysis and Retrieval of Text) information retrieval system, developed at Cornell University in the 1960s.
An object of class list of length 1.
stopwords(language = "en", source = "smart")
The English stopword list is taken from online appendix 11 of Lewis et al. (2004).
Lewis, David D., et al. (2004). "RCV1: A new benchmark collection for text categorization research." Journal of Machine Learning Research 5: 361-397.
Snowball stopword list
An object of class list of length 15.
Provides stopword lists in multiple languages, based on the Snowball stemmer's word lists.
stopwords(language, source = "snowball")
The main stopword lists are taken from the Snowball stemmer project in different languages (see https://snowballstem.org/projects.html).
The stopword lists can be found in http://snowball.tartarus.org/dist/snowball_all.tgz.
The Stopwords ISO Dataset is the most comprehensive collection of stopwords for multiple languages. The collection follows the ISO 639-1 language code.
A named list of length 57, of character vectors that represent stopwords in 57 languages. To see the languages available, use stopwords_getlanguages().
stopwords(language, source = "stopwords-iso")
https://github.com/stopwords-iso/stopwords-iso/
This function returns character vectors of stopwords for different languages, using the ISO-639-1 language codes, and allows for different sources of stopwords to be defined.
The default source is the Snowball stopwords collection, but other sources are also available.
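A typical use of the returned vector is to filter tokens. The following is a minimal sketch, assuming the stopwords package is installed; the sample sentence and the whitespace tokenization via base R's strsplit() are illustrative choices, not part of the package.

```r
library(stopwords)

# A made-up sentence, lowercased and split on whitespace.
text <- "The quick brown fox jumps over the lazy dog"
tokens <- strsplit(tolower(text), "\\s+")[[1]]

# Keep only the tokens that are not in the default (Snowball) English list.
content_tokens <- tokens[!tokens %in% stopwords("en", source = "snowball")]
content_tokens
```

Lowercasing before matching matters in this sketch because the comparison with `%in%` is case-sensitive.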
stopwords(language = "en", source = "snowball", simplify = TRUE)
language | specify language of stopwords by ISO 639-1 code |
source | specify a stopwords source. To list the currently available options, use stopwords_getsources() |
simplify | logical; if |
The language codes for each stopword list use the two-letter ISO code from https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes. For backwards compatibility, the full English language names used in the quanteda package may also be used, although these are deprecated.
A character vector containing the stopwords, or a list of character vectors if simplify = FALSE.
stopwords("en")
stopwords("de")
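To illustrate the two return shapes described above, the following sketch calls stopwords() with both values of simplify on the marimo source, which is stored as a multi-level list. It assumes the stopwords package is installed; the exact structure returned with simplify = FALSE is an assumption here and is only inspected, not relied upon.

```r
library(stopwords)

# Default: a flat character vector.
flat <- stopwords("en", source = "marimo", simplify = TRUE)
class(flat)
length(flat)

# With simplify = FALSE the original structure should be kept (assumption).
nested <- stopwords("en", source = "marimo", simplify = FALSE)
str(nested, max.level = 1)

# Either shape can be flattened to a plain character vector when needed.
all_words <- unlist(nested, use.names = FALSE)
length(all_words)
```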
Lists the available stopwords language codes for a given stopwords source. See https://en.wikipedia.org/wiki/ISO_639-1 for details of the language codes.
stopwords_getlanguages(source)
source |
the source of the stopwords |
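One way to use this function is to check that a language is available before requesting it. The sketch below assumes the stopwords package is installed; `get_stopwords_or_empty()` is a hypothetical helper written for this example and is not part of the package.

```r
# Hypothetical helper: return an empty vector instead of failing when the
# requested language is not offered by the given source.
get_stopwords_or_empty <- function(language, source = "snowball") {
  available <- stopwords::stopwords_getlanguages(source)
  if (!language %in% available) {
    return(character(0))
  }
  stopwords::stopwords(language, source = source)
}

# "en" is available in the Snowball source; "xx" is not a language code.
length(get_stopwords_or_empty("en"))
length(get_stopwords_or_empty("xx"))
```

Returning an empty vector keeps a downstream filtering step usable for unsupported languages, at the cost of silently removing nothing; raising a warning instead would be an equally reasonable design.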
Returns a character vector of the stopword sources available from the stopwords package.
stopwords_getsources()
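Combined with stopwords_getlanguages(), this function can give a quick overview of the package's coverage. The following is a minimal sketch, assuming the stopwords package is installed; the variable names are illustrative.

```r
library(stopwords)

# All sources known to the installed version of the package.
sources <- stopwords_getsources()

# Count how many languages each source offers.
languages_per_source <- vapply(
  sources,
  function(src) length(stopwords_getlanguages(src)),
  integer(1)
)
languages_per_source
```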