Title: | Analyse Text Documents Using Ecological Tools |
---|---|
Description: | A set of functions to analyse and compare texts, using classical text mining functions, as well as those from theoretical ecology. |
Authors: | Rebaudo Francois (IRD, UMR EGCE, IRD, CNRS, Univ. Paris-Saclay) |
Maintainer: | Rebaudo Francois <[email protected]> |
License: | GPL-2 |
Version: | 0.1.12 |
Built: | 2025-02-14 03:11:31 UTC |
Source: | https://github.com/frareb/inpdfr |
Performs a cluster analysis on the basis of the word-occurrence data.frame using the hclust function.
doCluster( wordF, myMethod = "ward.D2", gp = FALSE, nbGp = 5, getPlot = TRUE, mwidth = 800, mheight = 800, formatType = "png", ... )
wordF |
The data.frame containing word occurrences. |
myMethod |
The method used for the cluster analysis, see the hclust function (default "ward.D2"). |
gp |
A logical to specify if groups should be made. |
nbGp |
An integer specifying the number of groups. Ignored if gp = FALSE. |
getPlot |
If TRUE, the plot is saved as an image file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
... |
Additional arguments from the hclust function. |
An object of class hclust.
data("wordOccuDF")
doCluster(wordF = wordOccuDF, myMethod = "ward.D2", getPlot = FALSE)
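As a hedged illustration (not part of the original documentation), the returned object behaves like a standard hclust result, so base R tools apply:

data("wordOccuDF")
hc <- doCluster(wordF = wordOccuDF, myMethod = "ward.D2", getPlot = FALSE)
plot(hc)           # dendrogram of the documents
cutree(hc, k = 2)  # cut the tree into 2 groups of documents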
Performs a k-means cluster analysis on the basis of the word-occurrence data.frame using the kmeans function.
doKmeansClust( wordF, nbClust = 4, nbIter = 10, algo = "Hartigan-Wong", getPlot = TRUE, mwidth = 800, mheight = 800, formatType = "png", ... )
wordF |
The data.frame containing word occurrences. |
nbClust |
The number of clusters. |
nbIter |
The number of iterations allowed. |
algo |
The algorithm used (see the kmeans function). |
getPlot |
If TRUE, the plot is saved as an image file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
... |
Additional arguments from the kmeans function. |
An object of class kmeans (see the kmeans function).
data("wordOccuDF")
doKmeansClust(wordF = wordOccuDF, nbClust = 2, getPlot = FALSE)
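As a hedged illustration, and assuming the returned object is a regular stats::kmeans result as stated above, its standard components can be inspected directly:

data("wordOccuDF")
km <- doKmeansClust(wordF = wordOccuDF, nbClust = 2, getPlot = FALSE)
km$cluster  # cluster assignment of each document
km$centers  # cluster centres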
Uses the entropart package to analyse the word-occurrence data.frame, considering words as species and documents as communities.
doMetacomEntropart( wordF, getPlot = c(TRUE, TRUE, TRUE, TRUE), getTextSink = c(TRUE, TRUE, TRUE, TRUE), mwidth = 800, mheight = 800, formatType = "png" )
wordF |
The data.frame containing word occurrences. |
getPlot |
A vector with four logical values. If TRUE, the corresponding plots are saved as image files. |
getTextSink |
A vector with four logical values. If TRUE, the corresponding results are saved in text files. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
A MetaCommunity object (see the entropart package).
## Not run:
data("wordOccuDF")
doMetacomEntropart(wordF = wordOccuDF)
## End(Not run)
Uses the Metacommunity function of the metacom package to analyse the word-occurrence data.frame, considering words as species and documents as communities.
doMetacomMetacom( wordF, numSim = 10, limit = "Inf", getPlot = TRUE, getTextSink = TRUE, mwidth = 800, mheight = 800, formatType = "png" )
wordF |
The data.frame containing word occurrences. |
numSim |
Number of simulated null matrices, see the Metacommunity function of the metacom package. |
limit |
An integer to limit the number of words to use in the analysis. |
getPlot |
If TRUE, the plot is saved as an image file. |
getTextSink |
If TRUE, the results are saved in a text file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
An object of class Metacommunity.
data("wordOccuDF")
doMetacomMetacom(wordF = wordOccuDF, getPlot = FALSE, getTextSink = FALSE)
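As a hedged sketch, and assuming the returned object is the four-element Metacommunity result that IdentifyStructure (documented below) expects, the metacommunity structure can be classified directly:

data("wordOccuDF")
mc <- doMetacomMetacom(wordF = wordOccuDF, getPlot = FALSE, getTextSink = FALSE)
IdentifyStructure(metacom.obj = mc)  # classification of the metacommunity structure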
Excludes stop words from the word-occurrence data.frame. excludeStopWords uses the parallel package to perform parallel computation.
excludeStopWords(wordF, lang = "English")
wordF |
The data.frame containing word occurrences. |
lang |
The language used ("French", "English", "Spanish"). |
The word-occurrence data.frame.
## Not run:
excludeStopWords(wordF = myDF, lang = "French")
## End(Not run)
A vector containing stop words in French.
exclusionList_FR
A vector with 173 elements (character), with UTF-8 characters escaped using stringi::stri_escape_unicode(exclusionList_FR).
Adapted from www.ranks.nl/stopwords.
A vector containing stop words in Spanish.
exclusionList_SP
A vector with 190 elements (character), with UTF-8 characters escaped using stringi::stri_escape_unicode(exclusionList_SP).
Adapted from www.ranks.nl/stopwords.
A vector containing stop words in English.
exclusionList_UK
A vector with 542 elements (character).
Adapted from www.ranks.nl/stopwords.
A quick way to compute a set of analysis from the word-occurrence data.frame.
getAllAnalysis( dataset, wcloud = TRUE, sumStats = TRUE, freqW = TRUE, clust = TRUE, metacom = TRUE )
dataset |
A single word-occurrence data.frame. |
wcloud |
A logical to perform the word cloud analysis. |
sumStats |
A logical to perform the summary statistics analysis. |
freqW |
A logical to perform the word frequency analysis. |
clust |
A logical to perform the cluster analysis. |
metacom |
A logical to perform the metacommunity analysis. |
A set of analyses available from the inpdfr package.
## Not run:
data("wordOccuDF")
getAllAnalysis(dataset = wordOccuDF, wcloud = FALSE, sumStats = FALSE)
## End(Not run)
Lists files in a specified directory, sorted by extension. The function takes into account .txt and .pdf files, based on the strsplit function.
getListFiles(mywd)
mywd |
A string containing the working directory. |
A list of length 2 with file names sorted by extension (pdf and txt).
getListFiles(mywd = getwd())
Returns the most frequent words and plots their frequencies per document.
getMostFreqWord( wordF, numWords, getPlot = TRUE, mwidth = 1024, mheight = 800, formatType = "png" )
wordF |
The data.frame containing word occurrences. |
numWords |
The number of words to be returned. |
getPlot |
If TRUE, the plot is saved as an image file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
The numWords most frequent words.
data("wordOccuDF")
getMostFreqWord(wordF = wordOccuDF, numWords = 5, getPlot = FALSE)
Tests for correlation between the most frequent words.
getMostFreqWordCor( wordF, numWords, getPlot = c(TRUE, TRUE), getTextSink = TRUE, mwidth = 1024, mheight = 1024, formatType = "png" )
wordF |
The data.frame containing word occurrences. |
numWords |
The number of words to be returned. |
getPlot |
A vector with two logical values. If TRUE, the corresponding plots are saved as image files. |
getTextSink |
If TRUE, the results are saved in a text file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
A list with the correlation matrix and the p-value matrix.
data("wordOccuDF")
getMostFreqWordCor(wordF = wordOccuDF, numWords = 5,
  getPlot = c(FALSE, FALSE), getTextSink = FALSE)
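As a hedged illustration (not part of the original documentation), str() shows how the correlation matrix and the p-value matrix are stored in the returned list:

data("wordOccuDF")
res <- getMostFreqWordCor(wordF = wordOccuDF, numWords = 5,
  getPlot = c(FALSE, FALSE), getTextSink = FALSE)
str(res)  # a list holding the correlation matrix and the p-value matrix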
getPDF returns a word-occurrence data.frame from PDF files. It needs XPDF in order to run (http://www.foolabs.com/xpdf/download.html), and uses the parallel package to perform parallel computation.
getPDF( myPDFs, minword = 1, maxword = 20, minFreqWord = 1, pathToPdftotext = "" )
myPDFs |
A character vector containing PDF file names. |
minword |
An integer specifying the minimum number of letters per word in the returned data.frame. |
maxword |
An integer specifying the maximum number of letters per word in the returned data.frame. |
minFreqWord |
An integer specifying the minimum word frequency in the returned data.frame. |
pathToPdftotext |
A character containing an alternative path to the XPDF pdftotext program. |
getPDF uses the XPDF pdftotext program to extract the content of PDF files into TXT files. If pdftotext is not in the PATH, an alternative is to provide the full path to the program in the pathToPdftotext parameter.
A list of lists, each with a word-occurrence data.frame and the corresponding file name.
## Not run:
getPDF(myPDFs = "mypdf.pdf")
## End(Not run)
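As a hedged sketch, pathToPdftotext can point directly to the pdftotext executable when it is not in the PATH; the file name and path below are hypothetical placeholders:

## Not run:
# hypothetical paths; adjust to your own PDF file and XPDF installation
getPDF(myPDFs = "mypdf.pdf", pathToPdftotext = "C:/xpdf/bin64/pdftotext.exe")
## End(Not run)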
getStopWords returns a list of stopwords.
getStopWords()
A list of vectors with stopwords for French, English, and Spanish languages.
getStopWords()
Performs a barplot of the number of unique words per document using the barplot function.
getSummaryStatsBARPLOT( wordF, getPlot = TRUE, mwidth = 480, mheight = 480, formatType = "png", ... )
wordF |
The data.frame containing word occurrences. |
getPlot |
If TRUE, the plot is saved as an image file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
... |
Additional arguments from the barplot function. |
The number of unique words per document.
data("wordOccuDF")
getSummaryStatsBARPLOT(wordF = wordOccuDF, getPlot = FALSE)
Plots a histogram of the number of words, excluding stop words, using the hist function.
getSummaryStatsHISTO( wordF, getPlot = TRUE, mwidth = 800, mheight = 800, formatType = "png", ... )
wordF |
The data.frame containing word occurrences. |
getPlot |
If TRUE, the plot is saved as an image file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
... |
Additional arguments from the hist function. |
data("wordOccuDF")
getSummaryStatsHISTO(wordF = wordOccuDF, getPlot = FALSE)
Plot a scatter plot with the proportion of documents using similar words.
getSummaryStatsOCCUR( wordF, getPlot = TRUE, mwidth = 800, mheight = 800, formatType = "png" )
wordF |
The data.frame containing word occurrences. |
getPlot |
If TRUE, the plot is saved as an image file. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
A data.frame containing the proportion of documents and the number of similar words.
## Not run:
getSummaryStatsOCCUR(wordF = myDF)
## End(Not run)
Extract text from TXT files and return a word-occurrence data.frame.
getTXT(myTXTs)
myTXTs |
A character vector containing TXT file names (or complete path to these files). |
A list of lists, each with a word-occurrence data.frame and the corresponding file name.
## Not run:
data("loremIpsum")
loremIpsum01 <- loremIpsum[1:100]
loremIpsum02 <- loremIpsum[101:200]
loremIpsum03 <- loremIpsum[201:300]
loremIpsum04 <- loremIpsum[301:400]
loremIpsum05 <- loremIpsum[401:500]
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum01, file = "RESULTS/loremIpsum01.txt")
write(x = loremIpsum02, file = "RESULTS/loremIpsum02.txt")
write(x = loremIpsum03, file = "RESULTS/loremIpsum03.txt")
write(x = loremIpsum04, file = "RESULTS/loremIpsum04.txt")
write(x = loremIpsum05, file = "RESULTS/loremIpsum05.txt")
wordOccuFreq <- getTXT(myTXTs = list.files(path = paste0(getwd(), "/RESULTS/"),
  pattern = "loremIpsum", full.names = TRUE))
file.remove(list.files(full.names = TRUE, path = paste0(getwd(), "/RESULTS"),
  pattern = "loremIpsum"))
## End(Not run)
A quick way to obtain the word-occurrence data.frame from a set of documents.
getwordOccuDF(mywd, language = "English", excludeSW = TRUE)
mywd |
A character variable containing the working directory. |
language |
The language used ("French", "English", "Spanish"). |
excludeSW |
A logical to exclude stop words. |
A single word-occurrence data.frame.
## Not run:
data("loremIpsum")
loremIpsum01 <- loremIpsum[1:100]
loremIpsum02 <- loremIpsum[101:200]
loremIpsum03 <- loremIpsum[201:300]
loremIpsum04 <- loremIpsum[301:400]
loremIpsum05 <- loremIpsum[401:500]
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum01, file = "RESULTS/loremIpsum01.txt")
write(x = loremIpsum02, file = "RESULTS/loremIpsum02.txt")
write(x = loremIpsum03, file = "RESULTS/loremIpsum03.txt")
write(x = loremIpsum04, file = "RESULTS/loremIpsum04.txt")
write(x = loremIpsum05, file = "RESULTS/loremIpsum05.txt")
wordOccuDF <- getwordOccuDF(mywd = paste0(getwd(), "/RESULTS"), excludeSW = FALSE)
file.remove(list.files(full.names = TRUE, path = paste0(getwd(), "/RESULTS"),
  pattern = "loremIpsum"))
## End(Not run)
Returns the most frequent words.
getXFreqWord(wordF, occuWords)
wordF |
The data.frame containing word occurrences. |
occuWords |
The minimum number of occurrences for words to be returned. |
A vector with most frequent words.
data("wordOccuDF")
getXFreqWord(wordF = wordOccuDF, occuWords = 5)
Identifies structure (or quasi-structure) and outputs a classification.
IdentifyStructure(metacom.obj)
metacom.obj |
The result of the 'Metacommunity' function, containing a list of 4 elements: the empirical matrix being tested, and results for coherence, turnover, and boundary clumping. |
Tad Dallas <[email protected]>. The identifyStructure function is no longer maintained in the metacom package (see https://github.com/taddallas/metacom). This function was copied from version 1.4.4 of the metacom package, with a minor modification (fixing the warning "the condition has length > 1 and only the first element will be used").
Outputs a classification of the metacommunity.
Quasi structures, as well as 'random' and 'Gleasonian' structures, may not strictly be discernable through the EMS approach, as they rely on inferring a result from a non-significant test ('accepting the null'), which is typically a bad idea.
The inpdfr package allows analysing and comparing PDF/TXT documents using both classical text mining tools and those from theoretical ecology. In the latter, words are considered as species and documents as communities, therefore allowing analysis at the community and metacommunity levels. The inpdfr package provides three categories of functions: functions to extract and process text into a word-occurrence data.frame, functions to analyse the word-occurrence data.frame with standard and ecological tools, and functions to use inpdfr through a Gtk2 graphical user interface (GitHub version only).
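As a hedged end-to-end sketch assembled from the functions documented in this manual (assuming the working directory contains .txt or .pdf documents), a typical session builds the word-occurrence data.frame and then runs standard and ecological analyses:

## Not run:
library(inpdfr)
# build the word-occurrence data.frame from the documents in the working directory
wordOccu <- getwordOccuDF(mywd = getwd(), language = "English", excludeSW = TRUE)
# classical text mining: most frequent words
getMostFreqWord(wordF = wordOccu, numWords = 10, getPlot = FALSE)
# cluster analysis of the documents
doCluster(wordF = wordOccu, myMethod = "ward.D2", getPlot = FALSE)
# metacommunity analysis (documents as communities, words as species)
doMetacomMetacom(wordF = wordOccu, getPlot = FALSE, getTextSink = FALSE)
## End(Not run)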
A vector containing a Lorem Ipsum text for testing purposes.
loremIpsum
A vector with 556 elements (character); each element corresponds to a line in the original text.
Plots a word cloud from the word-occurrence data.frame using the wordcloud function.
makeWordcloud( wordF, wcFormat = "png", wcminFreq = 3, wcmaxWords = Inf, wcRandOrder = FALSE, wcCol = RColorBrewer::brewer.pal(8, "Dark2"), getPlot = c(TRUE, TRUE), mwidth = 1000, mheight = 1000, formatType = "png" )
wordF |
The data.frame containing word occurrences. |
wcFormat |
Output format for the word cloud (deprecated, only "png"). |
wcminFreq |
Minimum word frequency for words to be plotted (see the wordcloud function). |
wcmaxWords |
Maximum number of words to be plotted (see the wordcloud function). |
wcRandOrder |
Plot words in random order (see the wordcloud function). |
wcCol |
Colors for the words (see the wordcloud function). |
getPlot |
A vector with two logical values. If TRUE, the corresponding plots are saved as image files. |
mwidth |
The width of the plot in pixels. |
mheight |
The height of the plot in pixels. |
formatType |
The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp"). |
## Not run:
makeWordcloud(wordF = myDF)
## End(Not run)
Merge word-occurrence data.frames into a single data.frame.
mergeWordFreq(wordF)
wordF |
The data.frame containing word occurrences. |
A single word-occurrence data.frame with each column corresponding to a text file.
## Not run:
data("loremIpsum")
loremIpsum01 <- loremIpsum[1:100]
loremIpsum02 <- loremIpsum[101:200]
loremIpsum03 <- loremIpsum[201:300]
loremIpsum04 <- loremIpsum[301:400]
loremIpsum05 <- loremIpsum[401:500]
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum01, file = "RESULTS/loremIpsum01.txt")
write(x = loremIpsum02, file = "RESULTS/loremIpsum02.txt")
write(x = loremIpsum03, file = "RESULTS/loremIpsum03.txt")
write(x = loremIpsum04, file = "RESULTS/loremIpsum04.txt")
write(x = loremIpsum05, file = "RESULTS/loremIpsum05.txt")
wordOccuFreq <- getTXT(myTXTs = list.files(path = paste0(getwd(), "/RESULTS/"),
  pattern = "loremIpsum", full.names = TRUE))
wordOccuDF <- mergeWordFreq(wordF = wordOccuFreq)
file.remove(list.files(full.names = TRUE, path = paste0(getwd(), "/RESULTS"),
  pattern = "loremIpsum"))
## End(Not run)
Processes vectors containing words into a data.frame of word occurrences.
postProcTxt(txt, minword = 1, maxword = 20, minFreqWord = 1)
txt |
A vector containing text. |
minword |
An integer specifying the minimum number of letters per word in the returned data.frame. |
maxword |
An integer specifying the maximum number of letters per word in the returned data.frame. |
minFreqWord |
An integer specifying the minimum word frequency in the returned data.frame. |
A data.frame (freq = occurrences, stem = stem words, word = words), sorted by word occurrences.
Extract text from txt files and pre-process content.
preProcTxt(filetxt, encodingIn = "UTF-8", encodingOut = "UTF-8")
filetxt |
A character containing the name of a txt file. |
encodingIn |
Encoding of the text file (default = "UTF-8"). |
encodingOut |
Encoding of the text extracted (default = "UTF-8"). |
A character vector with the content of the pre-processed txt file (one element per line).
## Not run:
data("loremIpsum")
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum, file = "RESULTS/loremIpsum.txt")
preProcTxt(filetxt = paste0(getwd(), "/RESULTS/loremIpsum.txt"))
file.remove(list.files(full.names = TRUE, path = paste0(getwd(), "/RESULTS"),
  pattern = "loremIpsum"))
## End(Not run)
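As a hedged sketch combining the two low-level steps documented above (myfile.txt is a hypothetical plain-text document), preProcTxt extracts the text and postProcTxt turns it into a word-occurrence data.frame:

## Not run:
txtLines <- preProcTxt(filetxt = "myfile.txt")      # one element per line
wordDF <- postProcTxt(txt = txtLines, minword = 3)  # freq / stem / word columns
head(wordDF)
## End(Not run)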
Delete spaces in file names located in the current working directory.
quitSpaceFromChars(vectxt)
vectxt |
A vector containing character entries corresponding to the names of files in the current working directory. |
The function returns a logical for each file, with TRUE if the file has been found, and FALSE otherwise.
## Not run:
quitSpaceFromChars(c("my pdf.pdf", "my other pdf.pdf"))
## End(Not run)
Truncate the word-occurrence data.frame.
truncNumWords(wordF, maxWords)
wordF |
The data.frame containing word occurrences. |
maxWords |
The maximum number of words in the data.frame. |
The data.frame containing word occurrences.
## Not run:
truncNumWords(wordF = myWordOccurrenceDF, maxWords = 50)
## End(Not run)
Lorem Ipsum word occurrences.
wordOccuDF
A data.frame containing word names and occurrences for testing purposes.