Title: | Natural Language Processing for Meta Analysis |
---|---|
Description: | Given a CSV file with titles and abstracts, the package creates a document-term matrix that is lemmatized and stemmed and can directly be used to train machine learning methods for automatic title-abstract screening in the preparation of a meta analysis. |
Authors: | Nico Bruder [aut] , Samuel Zimmermann [aut] , Johannes Vey [aut] , Maximilian Pilz [aut, cre] , Institute of Medical Biometry - University of Heidelberg [cph] |
Maintainer: | Maximilian Pilz <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.3.9000 |
Built: | 2025-01-06 06:05:05 UTC |
Source: | https://github.com/imbi-heidelberg/metanlp |
Stop words usually carry no useful information for classifying whether a paper should be included in or excluded from a meta-analysis, so they should not be part of the document-term matrix. This function deletes stop words automatically.
delete_stop_words(object, ...)

## S4 method for signature 'MetaNLP'
delete_stop_words(object, ...)
object |
A MetaNLP object, whose data frame is to be modified. |
... |
Language of the stop words. Defaults to "english". |
This function can delete stop words of different languages. Supported languages are english, french, german, russian and spanish. Language names are case sensitive.
An object of class MetaNLP.
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
obj <- delete_stop_words(obj, "english")
Some words offer no additional information for classifying whether a paper should be included in or excluded from a meta-analysis, so they should not be part of the document-term matrix. This function removes the corresponding columns of the word count matrix, given a character vector of words to delete.
delete_words(object, delete_list)

## S4 method for signature 'MetaNLP,character'
delete_words(object, delete_list)
object |
A MetaNLP object, whose data frame is to be modified |
delete_list |
A character vector containing the words to be deleted |
The words in delete_list can be given as they appear in the text. delete_words lemmatizes and stems them to match the columns of the document-term matrix.
An object of class MetaNLP
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
del_words <- c("beautiful", "considering", "found")
obj <- delete_words(obj, del_words)
The MetaNLP package provides methods to quickly transform a CSV file with titles and abstracts into an R data frame that can be used for automatic title-abstract screening using machine learning.
MetaNLP is the base class of the package MetaNLP. An object is initialized by passing the path to a CSV file and constructs a data frame whose column names are the words that occur in the titles and abstracts and whose cells contain the word frequencies for each paper.
MetaNLP(
  file,
  bounds = c(2, Inf),
  word_length = c(3, Inf),
  language = "english",
  ...
)
file |
Either the path to the CSV file or a data frame containing the abstracts |
bounds |
An integer vector of length 2. The first value specifies the minimum number of appearances a word needs to become a column of the word count matrix, the second value specifies the maximum number. Defaults to c(2, Inf). |
word_length |
An integer vector of length 2. The first value specifies the minimum number of characters a word needs to become a column of the word count matrix, the second value specifies the maximum number. Defaults to c(3, Inf). |
language |
The language for lemmatization and stemming. Supported
languages are |
... |
Additional arguments passed on to |
An object of class MetaNLP contains a slot data_frame where the document-term matrix is stored as a data frame.
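The shape of such a document-term data frame can be illustrated with a minimal base R sketch. This is not the package's implementation: the helper build_dtm is hypothetical, it skips lemmatization and stemming, and it assumes that bounds refers to the total count of a word across all papers.

```r
# Hypothetical helper: build a plain word count data frame from raw texts.
# Assumption: bounds is applied to the total count of a word over all texts.
build_dtm <- function(texts, bounds = c(2, Inf), word_length = c(3, Inf)) {
  # split lower-cased texts into words and drop empty tokens
  tokens <- strsplit(tolower(texts), "[^a-z]+")
  tokens <- lapply(tokens, function(tok) tok[nzchar(tok)])
  vocab <- sort(unique(unlist(tokens)))

  # one row per text, one column per word
  counts <- matrix(0, nrow = length(texts), ncol = length(vocab),
                   dimnames = list(NULL, vocab))
  for (i in seq_along(tokens)) {
    tab <- table(tokens[[i]])
    if (length(tab) > 0) counts[i, names(tab)] <- as.numeric(tab)
  }

  # keep only words within the frequency and length limits
  total <- colSums(counts)
  keep <- total >= bounds[1] & total <= bounds[2] &
    nchar(vocab) >= word_length[1] & nchar(vocab) <= word_length[2]
  as.data.frame(counts[, keep, drop = FALSE])
}

texts <- c("A randomized trial of aspirin",
           "Aspirin trial in adults",
           "Cohort study of diet")
dtm <- build_dtm(texts)
print(dtm)
```

With the default limits, only words that occur at least twice and have at least three characters remain as columns.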
The CSV file must have a column ID to identify each paper, a column title with the corresponding titles of the papers and a column abstract which contains the abstracts. If the CSV stores training data, a column decision should exist, indicating whether an abstract is included in the meta-analysis. This column is optional, because there is no decision for test data yet. Allowed values in this column are "yes" and "no", or "include" and "exclude", as well as "maybe". The value "maybe" is handled as "yes"/"include".
An object of class MetaNLP
To ensure correct processing of the data when there are special characters (e.g. "é" or "ü"), make sure that the CSV file is correctly encoded as UTF-8.
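An input file in the expected layout could be created along the following lines. The titles, abstracts and file name are made up for illustration; only the column names and the allowed decision values come from the description above.

```r
# Illustrative training data with the required columns
papers <- data.frame(
  ID = 1:3,
  title = c("Aspirin in adults", "Diet and sleep", "Statins in the elderly"),
  abstract = c("A randomized trial of aspirin.",
               "A cohort study of diet and sleep quality.",
               "A randomized trial of statins."),
  decision = c("include", "exclude", "maybe"),
  stringsAsFactors = FALSE
)

# Write the file explicitly as UTF-8 so that special characters survive
csv_path <- file.path(tempdir(), "papers.csv")
write.csv(papers, csv_path, row.names = FALSE, fileEncoding = "UTF-8")

# Read it back to check the layout
papers_read <- read.csv(csv_path, fileEncoding = "UTF-8",
                        stringsAsFactors = FALSE)
print(colnames(papers_read))
```

The resulting path (or the data frame itself) is the kind of input the file argument of MetaNLP() expects.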
The stemming algorithm makes use of the C libstemmer library generated by Snowball. When German texts are stemmed, umlauts are replaced by their non-umlaut equivalent, so "ä" becomes "a" etc.
Maintainer: Maximilian Pilz [email protected] (ORCID)
Authors:
Nico Bruder [email protected] (ORCID)
Samuel Zimmermann [email protected] (ORCID)
Johannes Vey [email protected] (ORCID)
Other contributors:
Institute of Medical Biometry - University of Heidelberg [copyright holder]
Useful links:
Report bugs at https://github.com/imbi-heidelberg/MetaNLP/issues
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
This method creates a word cloud from a MetaNLP object. The word size indicates the frequency of the words.
## S4 method for signature 'MetaNLP,missing'
plot(
  x,
  y = NULL,
  max.words = 70,
  colors = c("snow4", "darkgoldenrod1", "turquoise4", "tomato"),
  decision = c("total", "include", "exclude"),
  ...
)
x |
A MetaNLP object to plot |
y |
not used |
max.words |
Maximum number of words in the word cloud |
colors |
Character vector with the colors in |
decision |
Stratify word cloud by decision. Default is no stratification. |
... |
Additional parameters for wordcloud |
nothing
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
plt <- plot(obj)
This function takes a MetaNLP object (the training data) and the test data. It creates the document-term matrix from the test data and matches its columns to the columns of the given training MetaNLP object: columns that appear in the test document-term matrix but not in the training document-term matrix are removed, and columns that appear in the training document-term matrix but not in the test document-term matrix are added as columns of zeros.
read_test_data(object, ...)

## S4 method for signature 'MetaNLP'
read_test_data(object, file, ...)
object |
The MetaNLP object created from the training data. |
... |
Further arguments to |
file |
Either the path to the test data csv, the data frame containing the papers or a MetaNLP object |
An object of class MetaNLP
path_train <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
path_test <- system.file("extdata", "test_data_changed.csv", package = "MetaNLP", mustWork = TRUE)
obj_train <- MetaNLP(path_train)
obj_test <- MetaNLP(path_test)
to_test_obj <- read_test_data(obj_train, obj_test)
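The column matching described for read_test_data can be sketched in base R on two plain data frames. The helper align_columns is hypothetical and only mirrors the documented behaviour (drop test-only columns, add training-only columns as zeros); it is not the package's code and ignores any identifier or decision columns.

```r
# Hypothetical helper: give the test data frame exactly the training columns
align_columns <- function(train_df, test_df) {
  train_cols <- colnames(train_df)

  # drop columns that only occur in the test data
  shared <- intersect(colnames(test_df), train_cols)
  aligned <- test_df[, shared, drop = FALSE]

  # add columns that only occur in the training data, filled with zeros
  for (col in setdiff(train_cols, shared)) {
    aligned[[col]] <- rep(0, nrow(aligned))
  }

  # use the column order of the training data
  aligned[, train_cols, drop = FALSE]
}

train <- data.frame(cancer = c(1, 0), trial = c(2, 1))
test <- data.frame(trial = 3, placebo = 1)
aligned <- align_columns(train, test)
print(aligned)
```

After alignment, a model trained on the training columns can be applied to the test data without a mismatch in features.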
When using non-English languages, the column names of the document-term matrix can contain special characters. These might lead to encoding problems when the matrix is used to train a machine learning model. This function automatically replaces all special characters by the nearest equivalent character, e.g. "é" would be replaced by "e".
replace_special_characters(object)

## S4 method for signature 'MetaNLP'
replace_special_characters(object)
object |
An object of class MetaNLP. |
An object of class MetaNLP, where the column names do not have special characters anymore.
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path, language = "french")
obj <- replace_special_characters(obj)
As the document-term matrix grows quickly with an increasing number of abstracts, it can easily reach several thousand columns. Thus, it can be important to extract the columns that carry most of the information for the decision-making process. This function uses a generalized linear model combined with elastic net regularization to extract these features. In contrast to an unpenalized regression model or an L2 penalty (ridge regression), elastic net (like LASSO) sets some regression coefficients to exactly 0. The selected features are therefore exactly the features with a non-zero coefficient.
select_features(object, ...)

## S4 method for signature 'MetaNLP'
select_features(object, alpha = 0.8, lambda = "avg", seed = NULL, ...)
object |
An object of class MetaNLP. |
... |
Additional arguments for cv.glmnet. An important
option might be |
alpha |
The elastic net mixing parameter, with |
lambda |
The weight parameter of the penalty. The possible values are
|
seed |
A numeric value which is used as a local seed for this function. Default is NULL. |
The computational aspects are executed by the glmnet package. First, a model is fitted via glmnet. The elastic net parameter alpha can be specified by the user. The parameter lambda, which determines the weight of the penalty, can either be chosen via cross validation (using cv.glmnet) or by giving a numeric value.
An object of class MetaNLP, where the columns were selected via elastic net.
By using a fixed value for lambda, the number of features to be selected can easily be adjusted via the parameter alpha. The smaller alpha is chosen, the more columns remain in the resulting data frame; the larger alpha is chosen, the fewer columns are kept.
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
obj2 <- select_features(obj, alpha = 0.7, lambda = "min")
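The idea of keeping the columns with non-zero coefficients can be sketched directly with glmnet on simulated counts. This is only an illustration under assumptions: the helper select_nonzero_terms is hypothetical, the data are random, a fixed numeric lambda is used instead of cross validation, and the code is skipped if glmnet is not installed.

```r
# Hypothetical helper: names of the terms with a non-zero elastic net coefficient
select_nonzero_terms <- function(x, y, alpha = 0.8, lambda = 0.05) {
  fit <- glmnet::glmnet(x, y, family = "binomial",
                        alpha = alpha, lambda = lambda)
  coefs <- as.matrix(stats::coef(fit))
  selected <- rownames(coefs)[coefs[, 1] != 0]
  setdiff(selected, "(Intercept)")
}

if (requireNamespace("glmnet", quietly = TRUE)) {
  set.seed(42)
  n_papers <- 200
  # simulated word counts for ten terms
  x <- matrix(rpois(n_papers * 10, lambda = 1), nrow = n_papers,
              dimnames = list(NULL, paste0("term", 1:10)))
  # simulated decision that depends on the first two terms only
  y <- rbinom(n_papers, 1, plogis(1.5 * x[, "term1"] - 1.5 * x[, "term2"]))

  selected <- select_nonzero_terms(x, y, alpha = 0.8, lambda = 0.05)
  print(selected)
}
```

Rerunning the sketch with a smaller alpha and the same lambda typically returns more terms, which matches the note above on tuning the number of selected features.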
Returns a quick overview of the most frequent word stems, grouped into included and excluded papers.
## S4 method for signature 'MetaNLP'
summary(object, n = 5, stop_words = FALSE, ...)
object |
An object of class MetaNLP. |
n |
Number of most frequent words to be displayed. |
stop_words |
Boolean to decide whether stop words shall be included in
the summary. |
... |
Additional parameters for |
A list of most frequent words.
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
summary(obj, n = 8)
This function saves the document-term matrix of a MetaNLP object as a CSV file.
write_csv(object, ...)

## S4 method for signature 'MetaNLP'
write_csv(object, path, type = c("train", "test"), ...)
object |
An object of class MetaNLP. |
... |
Additional arguments for write.table, e.g. encoding
as |
path |
Path where to save the csv. |
type |
Specifies if the document-term matrix should be saved as
"train_wcm.csv" or "test_wcm.csv". If the user wants to use another file name,
the whole path including the file name should be given as the |
If a path to a specific folder is given (but the path name does not end with ".csv"), the file is saved in this folder as "train_wcm.csv" or "test_wcm.csv". By providing a path ending with ".csv", the user can override the default naming convention and the file is saved according to this path.
nothing
path <- system.file("extdata", "test_data.csv", package = "MetaNLP", mustWork = TRUE)
obj <- MetaNLP(path)
obj2 <- delete_stop_words(obj)
write_path <- tempdir()
write_csv(obj2, path = write_path)
file.remove(file.path(write_path, "train_wcm.csv"))
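The file naming rule of write_csv can be expressed as a small base R sketch. The helper resolve_csv_path is hypothetical and only restates the documented convention (a path ending in ".csv" is used as is, otherwise "train_wcm.csv" or "test_wcm.csv" is appended); it is not taken from the package source.

```r
# Hypothetical helper: derive the output file from path and type
resolve_csv_path <- function(path, type = c("train", "test")) {
  type <- match.arg(type)
  if (grepl("\\.csv$", path)) {
    path                                        # full file name given by the user
  } else {
    file.path(path, paste0(type, "_wcm.csv"))   # default name inside the folder
  }
}

print(resolve_csv_path("results"))
print(resolve_csv_path("results", type = "test"))
print(resolve_csv_path("results/my_matrix.csv"))
```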