Package 'inpdfr'

Title: Analyse Text Documents Using Ecological Tools
Description: A set of functions to analyse and compare texts, using classical text mining functions, as well as those from theoretical ecology.
Authors: Rebaudo Francois (IRD, UMR EGCE, IRD, CNRS, Univ. ParisSaclay)
Maintainer: Rebaudo Francois <[email protected]>
License: GPL-2
Version: 0.1.12
Built: 2025-02-14 03:11:31 UTC
Source: https://github.com/frareb/inpdfr

Help Index


Performs a cluster analysis on the basis of the word-occurrence data.frame.

Description

Performs a cluster analysis on the basis of the word-occurrence data.frame using hclust function.

Usage

doCluster(
  wordF,
  myMethod = "ward.D2",
  gp = FALSE,
  nbGp = 5,
  getPlot = TRUE,
  mwidth = 800,
  mheight = 800,
  formatType = "png",
  ...
)

Arguments

wordF

The data.frame containing word occurrences.

myMethod

The method to compute distances, see dist function.

gp

A logical to specify if groups should be made.

nbGp

An intger to specify the number of groups. Ignored if gp=FALSE.

getPlot

If TRUE, save the cluster plot in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

...

Additional arguments from the hclust function.

Value

An object of class hclust.

Examples

data("wordOccuDF")
doCluster(wordF = wordOccuDF, myMethod = "ward.D2", getPlot = FALSE)

Performs a k-means cluster analysis on the basis of the word-occurrence data.frame.

Description

Performs a k-means cluster analysis on the basis of the word-occurrence data.frame using kmeans function.

Usage

doKmeansClust(
  wordF,
  nbClust = 4,
  nbIter = 10,
  algo = "Hartigan-Wong",
  getPlot = TRUE,
  mwidth = 800,
  mheight = 800,
  formatType = "png",
  ...
)

Arguments

wordF

The data.frame containing word occurrences.

nbClust

The number of clusters.

nbIter

The number of iterations allowed.

algo

The algoritm used (see kmeans).

getPlot

If TRUE, save the k-means cluster plot in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

...

Additional arguments from the kmeans function.

Value

An object of class kmeans (see kmeans).

Examples

data("wordOccuDF")
doKmeansClust(wordF = wordOccuDF, nbClust = 2, getPlot = FALSE)

Performs an analysis of ecological diversity and structure.

Description

Uses the entropart-package to analyse the word-occurrence data.frame, considering words as species and documents as communities.

Usage

doMetacomEntropart(
  wordF,
  getPlot = c(TRUE, TRUE, TRUE, TRUE),
  getTextSink = c(TRUE, TRUE, TRUE, TRUE),
  mwidth = 800,
  mheight = 800,
  formatType = "png"
)

Arguments

wordF

The data.frame containing word occurrences.

getPlot

A vector with four logical values. If getPlot[1]==TRUE, the MetaCommunity object is plotted and saved in the RESULTS directory. If getPlot[2]==TRUE, the DivPart analisis is plotted and saved in the RESULTS directory. If getPlot[3]==TRUE, the DivEst analisis is plotted and saved in the RESULTS directory. If getPlot[4]==TRUE, the DivProfile analisis is plotted and saved in the RESULTS directory.

getTextSink

A vector with four logical values. If getTextSink[1]==TRUE, the MetaCommunity object is saved in the RESULTS directory. If getTextSink[2]==TRUE, the DivPart analisis is saved in the RESULTS directory. If getTextSink[3]==TRUE, the DivEst analisis is saved in the RESULTS directory. If getTextSink[4]==TRUE, the DivProfile analisis is saved in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

Value

A MetaCommunity object (see entropart-package).

Examples

## Not run: 
data("wordOccuDF")
doMetacomEntropart(wordF = wordOccuDF)

## End(Not run)

Performs a metacomunity analysis.

Description

Use the package Metacommunity to analyse the word-occurrence data.frame, considering words as species and documents as communities.

Usage

doMetacomMetacom(
  wordF,
  numSim = 10,
  limit = "Inf",
  getPlot = TRUE,
  getTextSink = TRUE,
  mwidth = 800,
  mheight = 800,
  formatType = "png"
)

Arguments

wordF

The data.frame containing word occurrences.

numSim

Number of simulated null matrices, see Metacommunity.

limit

An integer to limit the number of words to use in the analysis.

getPlot

If TRUE, save the plot in the RESULTS directory.

getTextSink

If TRUE, save the console output in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

Value

An object of class Metacommunity.

Examples

data("wordOccuDF")
doMetacomMetacom(wordF = wordOccuDF, getPlot = FALSE, getTextSink = FALSE)

Exclude StopWords form the word-occurrence data.frame.

Description

Exclude StopWords form the word occurrences data.frame. excludeStopWords uses parallel to perform parallel computation.

Usage

excludeStopWords(wordF, lang = "English")

Arguments

wordF

The data.frame containing word occurrences.

lang

The language used ("French", "English", "Spanish").

Value

The word-occurrence data.frame.

Examples

## Not run: 
excludeStopWords(wordF = myDF, lang = "French")

## End(Not run)

Stop words in French.

Description

A vector containing stop words in French.

Usage

exclusionList_FR

Format

A vector with 173 elements (character), with UTF-8 characters escaped using stringi::stri_escape_unicode(exclusionList_FR).

Source

Adapted from www.ranks.nl/stopwords.


Stop words in Spanish.

Description

A vector containing stop words in Spanish

Usage

exclusionList_SP

Format

A vector with 190 elements (character), with UTF-8 characters escaped using stringi::stri_escape_unicode(exclusionList_SP).

Source

Adapted from www.ranks.nl/stopwords.


Stop words in English.

Description

A vector containing stop words in English.

Usage

exclusionList_UK

Format

A vector with 542 elements (character).

Source

Adapted from www.ranks.nl/stopwords.


A quick way to compute a set of analysis from the word-occurrence data.frame.

Description

A quick way to compute a set of analysis from the word-occurrence data.frame.

Usage

getAllAnalysis(
  dataset,
  wcloud = TRUE,
  sumStats = TRUE,
  freqW = TRUE,
  clust = TRUE,
  metacom = TRUE
)

Arguments

dataset

A single word-occurrrence data.frame.

wcloud

A logical to for word cloud analysis.

sumStats

A logical to for summary statistics analysis.

freqW

A logical to for word frequency analysis.

clust

A logical to for cluster analysis.

metacom

A logical to for metacommunity analysis.

Value

A set of analyses available from the inpdfr package.

Examples

## Not run: 
data("wordOccuDF")
getAllAnalysis(dataset = wordOccuDF, wcloud = FALSE, sumStats = FALSE)

## End(Not run)

List files in a specified directory sorted by extension.

Description

List files in a specified directory sorted by extension. The function takes into account .txt and .pdf files based on strsplit function.

Usage

getListFiles(mywd)

Arguments

mywd

A string containing the working directory.

Value

A list of length 2 with file names sorted by extension (pdf and txt).

Examples

getListFiles(mywd = getwd())

Returns most frequent words.

Description

Returns most frequent words and plots their frequencies per document.

Usage

getMostFreqWord(
  wordF,
  numWords,
  getPlot = TRUE,
  mwidth = 1024,
  mheight = 800,
  formatType = "png"
)

Arguments

wordF

The data.frame containing word occurrences.

numWords

The number of words to be returned.

getPlot

If TRUE, save a scatter plot in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

Value

The numWords most frequent words.

Examples

data("wordOccuDF")
getMostFreqWord(wordF = wordOccuDF, numWords = 5, getPlot = FALSE)

Test for correlation between the most frequent words.

Description

Test for correlation between the most frequent words.

Usage

getMostFreqWordCor(
  wordF,
  numWords,
  getPlot = c(TRUE, TRUE),
  getTextSink = TRUE,
  mwidth = 1024,
  mheight = 1024,
  formatType = "png"
)

Arguments

wordF

The data.frame containing word occurrences.

numWords

The number of words to be returned.

getPlot

A vector with two logical values. If plots[1]==TRUE, an image of the correlation matrix is saved in the RESULTS directory. If plots[2]==TRUE, the image of the p-value matrix associated with the correlation is saved in the RESULTS directory.

getTextSink

If TRUE, save the correlation matrix and the associated p-values in a text file in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

Value

A list with the correlation matrix and the p-value matrix.

Examples

data("wordOccuDF")
getMostFreqWordCor(
  wordF = wordOccuDF, 
  numWords = 5, 
  getPlot = c(FALSE, FALSE), 
  getTextSink = FALSE)

Extract text from PDF files and return a word-occurrence data.frame.

Description

getPDF returns a word-occurrence data.frame from PDF files. It needs XPDF in order to run (http://www.foolabs.com/xpdf/download.html), and uses parallel to perform parallel computation.

Usage

getPDF(
  myPDFs,
  minword = 1,
  maxword = 20,
  minFreqWord = 1,
  pathToPdftotext = ""
)

Arguments

myPDFs

A character vector containing PDF file names.

minword

An integer specifying the minimum number of letters per word into the returned data.frame.

maxword

An integer to specifying the maximum number of letters per word into the returned data.frame.

minFreqWord

An integer specifying the minimum word frequency into the returned data.frame.

pathToPdftotext

A character containing an alternative path to XPDF pdftotext function, see Details section.

Details

getPDF uses XPDF pdftotext function to extract the content of PDF files into a TXT file. If pdftotext is not in the PATH, an alternative is to provide the full path of the program into the pathToPdftotext parameter.

Value

A list of list with word-occurrence data.frame and file name.

Examples

## Not run: 
getPDF(myPDFs = "mypdf.pdf")

## End(Not run)

Load a list of stopwords.

Description

getStopWords returns a list of stopwords.

Usage

getStopWords()

Value

A list of vectors with stopwords for French, English, and Spanish languages.

Examples

getStopWords()

Perform a barplot with the number of unique words per document

Description

Perform a barplot with the number of unique words per document using barplot function.

Usage

getSummaryStatsBARPLOT(
  wordF,
  getPlot = TRUE,
  mwidth = 480,
  mheight = 480,
  formatType = "png",
  ...
)

Arguments

wordF

The data.frame containing word occurrences.

getPlot

If TRUE, save the bar plot in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

...

Additional arguments from barplot function.

Value

The number of unique words per document.

Examples

data("wordOccuDF")
getSummaryStatsBARPLOT(wordF = wordOccuDF, getPlot = FALSE)

Plot an histogram with the number of words excluding stop words

Description

Plot a histogram with the number of words excluding stop words using hist function.

Usage

getSummaryStatsHISTO(
  wordF,
  getPlot = TRUE,
  mwidth = 800,
  mheight = 800,
  formatType = "png",
  ...
)

Arguments

wordF

The data.frame containing word occurrences.

getPlot

If TRUE, save the plot in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

...

Additional arguments from hist function.

Examples

data("wordOccuDF")
getSummaryStatsHISTO(wordF = wordOccuDF, getPlot = FALSE)

Plot a scatter plot with the proportion of documents using similar words.

Description

Plot a scatter plot with the proportion of documents using similar words.

Usage

getSummaryStatsOCCUR(
  wordF,
  getPlot = TRUE,
  mwidth = 800,
  mheight = 800,
  formatType = "png"
)

Arguments

wordF

The data.frame containing word occurrences.

getPlot

If TRUE, save the scatter plot in the RESULTS directory.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

Value

A data.frame containing the proportion of documents and the number of similar words.

Examples

## Not run: 
getSummaryStatsOCCUR(wordF = myDF)

## End(Not run)

Extract text from TXT files and return a word-occurrence data.frame.

Description

Extract text from TXT files and return a word-occurrence data.frame.

Usage

getTXT(myTXTs)

Arguments

myTXTs

A character vector containing TXT file names (or complete path to these files).

Value

A list of list with word-occurrence data.frame and file name.

Examples

## Not run: 
data("loremIpsum")
loremIpsum01 <- loremIpsum[1:100]
loremIpsum02 <- loremIpsum[101:200]
loremIpsum03 <- loremIpsum[201:300]
loremIpsum04 <- loremIpsum[301:400]
loremIpsum05 <- loremIpsum[401:500]
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum01, file = "RESULTS/loremIpsum01.txt")
write(x = loremIpsum02, file = "RESULTS/loremIpsum02.txt")
write(x = loremIpsum03, file = "RESULTS/loremIpsum03.txt")
write(x = loremIpsum04, file = "RESULTS/loremIpsum04.txt")
write(x = loremIpsum05, file = "RESULTS/loremIpsum05.txt")
wordOccuFreq <- getTXT(myTXTs = list.files(path = paste0(getwd(), 
  "/RESULTS/"), pattern = "loremIpsum", full.names = TRUE))
file.remove(list.files(full.names = TRUE, 
  path = paste0(getwd(), "/RESULTS"), pattern = "loremIpsum"))

## End(Not run)

A quick way to obtain the word-occurrence data.frame from a set of documents.

Description

A quick way to obtain the word-occurrence data.frame from a set of documents.

Usage

getwordOccuDF(mywd, language = "English", excludeSW = TRUE)

Arguments

mywd

A character variable containing the working directory.

language

The language used ("French", "English", "Spanish").

excludeSW

A logical to exclude stop words.

Value

A single word-occurrrence data.frame.

Examples

## Not run: 
data("loremIpsum")
loremIpsum01 <- loremIpsum[1:100]
loremIpsum02 <- loremIpsum[101:200]
loremIpsum03 <- loremIpsum[201:300]
loremIpsum04 <- loremIpsum[301:400]
loremIpsum05 <- loremIpsum[401:500]
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum01, file = "RESULTS/loremIpsum01.txt")
write(x = loremIpsum02, file = "RESULTS/loremIpsum02.txt")
write(x = loremIpsum03, file = "RESULTS/loremIpsum03.txt")
write(x = loremIpsum04, file = "RESULTS/loremIpsum04.txt")
write(x = loremIpsum05, file = "RESULTS/loremIpsum05.txt")
wordOccuDF <- getwordOccuDF(mywd = paste0(getwd(), "/RESULTS"),
  excludeSW = FALSE)
file.remove(list.files(full.names = TRUE, 
  path = paste0(getwd(), "/RESULTS"), pattern = "loremIpsum"))

## End(Not run)

Returns most frequent words

Description

Returns most frequent words

Usage

getXFreqWord(wordF, occuWords)

Arguments

wordF

The data.frame containing word occurrences.

occuWords

The minimum number of occurrences for words to be returned.

Value

A vector with most frequent words.

Examples

data("wordOccuDF")
getXFreqWord(wordF = wordOccuDF, occuWords = 5)

Copy of the identifyStructure function from Tad Dallas metacom package.

Description

Identifies structure (or quasi-structure) and outputs a classification.

Usage

IdentifyStructure(metacom.obj)

Arguments

metacom.obj

The result of the 'Metacommunity' function, containing a list of 4 elements; the empirical matrix being tested, and results for coherence, turnover, and boundary clumping.

Details

Tad Dallas <[email protected]> identifyStructure function no longer maintained in metacom package. see https://github.com/taddallas/metacom. This function was copy-pasted from version 1.4.4 of package metacom with minor modification (fix warning: the condition has length > 1 and only the first element will be used).

Value

Ouputs a classification of the metacommunity.

Note

Quasi structures, as well as 'random' and 'Gleasonian' structures, may not strictly be discernable through the EMS approach, as they rely on inferring a result from a non-significant test ('accepting the null'), which is typically a bad idea.


inpdfr: A package to analyse PDF Files Using Ecological Tools.

Description

The inpdfr package allows analysing and comparing PDF/TXT documents using both classical text mining tools and those from theoretical ecolgy. In the later, words are considered as species and documents as communities, therefore allowing analysis at the community and metacommunity levels. The inpdfr package provides three cathegories of functions: functions to extract and process text into a word-occurrence data.frame, functions to analyse the word-occurrence data.frame with standard and ecological tools, and functions to use inpdfr through a Gtk2 Graphical User Interface (GitHub version only).


Lorem Ipsum text.

Description

A vector containing a Lorem Ipsum text for testing purposes.

Usage

loremIpsum

Format

A vector with 556 elements, each element corresponds to a line in the original text (character).

Source

https://lipsum.com/.


Word cloud based on the word-occurrence data.frame.

Description

Plot a word cloud from the word-occurrence data.frame using wordcloud function.

Usage

makeWordcloud(
  wordF,
  wcFormat = "png",
  wcminFreq = 3,
  wcmaxWords = Inf,
  wcRandOrder = FALSE,
  wcCol = RColorBrewer::brewer.pal(8, "Dark2"),
  getPlot = c(TRUE, TRUE),
  mwidth = 1000,
  mheight = 1000,
  formatType = "png"
)

Arguments

wordF

The data.frame containing word occurrences.

wcFormat

Output format for the word cloud (deprecated, only "png").

wcminFreq

Minimum word frequency for words to be ploted (see wordcloud).

wcmaxWords

Maximum number of words to be ploted (see wordcloud).

wcRandOrder

Plot words in random order (see wordcloud).

wcCol

Color words (see wordcloud).

getPlot

A vector with two logical values. If plots[1]==TRUE, a word cloud is made for each document. If plots[2]==TRUE, a word cloud is made for the combinaison of all documents.

mwidth

The width of the plot in pixels.

mheight

The height of the plot in pixels.

formatType

The format for the output file ("eps", "pdf", "png", "svg", "tiff", "jpeg", "bmp").

Examples

## Not run: 
makeWordcloud(wordF = myDF)

## End(Not run)

Merge word-occurrence data.frames into a single data.frame.

Description

Merge word-occurrence data.frames into a single data.frame.

Usage

mergeWordFreq(wordF)

Arguments

wordF

The data.frame containing word occurrences.

Value

A single word-occurrrence data.frame with each column corresponding to a text file.

Examples

## Not run: 
data("loremIpsum")
loremIpsum01 <- loremIpsum[1:100]
loremIpsum02 <- loremIpsum[101:200]
loremIpsum03 <- loremIpsum[201:300]
loremIpsum04 <- loremIpsum[301:400]
loremIpsum05 <- loremIpsum[401:500]
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum01, file = "RESULTS/loremIpsum01.txt")
write(x = loremIpsum02, file = "RESULTS/loremIpsum02.txt")
write(x = loremIpsum03, file = "RESULTS/loremIpsum03.txt")
write(x = loremIpsum04, file = "RESULTS/loremIpsum04.txt")
write(x = loremIpsum05, file = "RESULTS/loremIpsum05.txt")
wordOccuFreq <- getTXT(myTXTs = list.files(path = paste0(getwd(), 
  "/RESULTS/"), pattern = "loremIpsum", full.names = TRUE))
wordOccuDF <- mergeWordFreq(wordF = wordOccuFreq)
file.remove(list.files(full.names = TRUE, 
  path = paste0(getwd(), "/RESULTS"), pattern = "loremIpsum"))

## End(Not run)

Prossess vectors containing words into a data.frame of word occurrences.

Description

Prossess vectors containing words into a data.frame of word occurrences.

Usage

postProcTxt(txt, minword = 1, maxword = 20, minFreqWord = 1)

Arguments

txt

A vector containing text.

minword

An integer specifying the minimum number of letters per word into the returned data.frame.

maxword

An integer to specifying the maximum number of letters per word into the returned data.frame.

minFreqWord

An integer specifying the minimum word frequency into the returned data.frame.

Value

A data.frame (freq = occurrences, stem = stem words, word = words), sorted by word occurrences.


Extract text from txt files and pre-process content.

Description

Extract text from txt files and pre-process content.

Usage

preProcTxt(filetxt, encodingIn = "UTF-8", encodingOut = "UTF-8")

Arguments

filetxt

A character containing the name of a txt file.

encodingIn

Encoding of the text file (default = "UTF-8").

encodingOut

Encoding of the text extracted (default = "UTF-8").

Value

A character vector with the content of the pre-process txt file (one element per line).

Examples

## Not run: 
data("loremIpsum")
subDir <- "RESULTS"
dir.create(file.path(getwd(), subDir), showWarnings = FALSE)
write(x = loremIpsum, file = "RESULTS/loremIpsum.txt")
preProcTxt(filetxt = paste0(getwd(), "/RESULTS/loremIpsum.txt"))
file.remove(list.files(full.names = TRUE, 
  path = paste0(getwd(), "/RESULTS"), pattern = "loremIpsum"))

## End(Not run)

Delete spaces in file names.

Description

Delete spaces in file names located in the current working directory.

Usage

quitSpaceFromChars(vectxt)

Arguments

vectxt

A vector containing character entries corresponding to the names of files in the current working directory.

Value

The function returns a logical for each file, with TRUE if the file has been found, and FALSE otherwise.

Examples

## Not run: 
quitSpaceFromChars(c("my pdf.pdf","my other pdf.pdf"))

## End(Not run)

Truncate the word-occurrence data.frame.

Description

Truncate the word-occurrence data.frame.

Usage

truncNumWords(wordF, maxWords)

Arguments

wordF

The data.frame containing word occurrences.

maxWords

The maximum number of words in the data.frame.

Value

The data.frame containing word occurrences.

Examples

## Not run: 
truncNumWords(wordF = myWordOccurrenceDF, maxWords = 50)

## End(Not run)

Lorem Ipsum word occurrences.

Description

Lorem Ipsum word occurrences.

Usage

wordOccuDF

Format

A data.frame containing word name and occurences for testing purposes.