The inpdfr package is primarily designed for analysing and comparing PDF and/or TXT documents. For this vignette we used PDF articles from the Journal of Statistical Software, available at https://www.jstatsoft.org. Specifically, we used the 10 articles from volume 68 (2015), which can be freely downloaded from the journal website. We used these files for the purpose of this vignette, but I encourage you to test the inpdfr package with your own publications, or those of your lab and colleagues.
The package uses XPDF (http://www.xpdfreader.com/download.html) for PDF-to-text extraction. You need to install XPDF before using the inpdfr package. Depending on your operating system, you may need to restart your computer after installing XPDF. If you do not want to use XPDF, you can extract the content of your PDF files with the method of your choice and then store the content in TXT files. The only function making use of XPDF is getPDF, which can be substituted with the getTXT function.
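If you choose that route, here is a minimal sketch of such an extraction, assuming the pdftools package is installed (it is not a dependency of inpdfr): the text of each PDF is saved in a TXT file with the same base name, so that getTXT can pick it up later.
# A sketch assuming the pdftools package is installed (not required by inpdfr).
mywd <- "/home/user/myWD/JSS/"
pdfFiles <- list.files(mywd, pattern = "\\.pdf$", full.names = TRUE)
for (f in pdfFiles) {
  pages <- pdftools::pdf_text(f)         # one character string per page
  txtFile <- sub("\\.pdf$", ".txt", f)   # e.g., v68i01.pdf -> v68i01.txt
  writeLines(paste(pages, collapse = "\n"), txtFile)
}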
First, specify the directory where your files are located:
mywd <- "/home/user/myWD/JSS/"
Then list your PDF files using the getListFiles function:
listFilesExt <- getListFiles(mywd)
#> $pdf
#> [1] "v68i01.pdf" "v68i02.pdf" "v68i03.pdf" "v68i04.pdf"
#> [5] "v68i05.pdf" "v68i06.pdf" "v68i07.pdf" "v68i08.pdf"
#> [9] "v68i09.pdf" "v68i10.pdf"
#>
#> $txt
#> NULL
To extract text from the PDF files, use the getPDF function:
wordFreqPDF <- getPDF(myPDFs = listFilesExt$pdf)
#> [[1]]$wc
#> freq stem word
#> 626 264 the the
#> 32 141 and and
#> ... ... ... ...
#>
#> [[1]]$name
#> [1] "v68i01"
#>
#> ...
You will get a list where each element corresponds to a list composed of a data.frame (freq = word frequency; stem = stem word; word = word) and the name of the original PDF file without the extension. If you also have TXT files, you can use the getTXT function, which works similarly. To merge the results of the PDF and TXT extractions, use the append function as shown below:
wordFreqPDF <- getPDF(myPDFs = listFilesExt$pdf)
wordFreqTXT <- getTXT(myTXTs = listFilesExt$txt)
wordFreq <- append(wordFreqPDF, wordFreqTXT)
To exclude stop words, use the excludeStopWords function, which takes the previously created list and the language as arguments:
wordFreq <- excludeStopWords(wordF = wordFreq, lang = "English")
#> [[1]]$wc
#> freq stem word
#> 135 101 covari covariates
#> 144 46 data data
#> ... ... ... ...
#>
#> [[1]]$name
#> [1] "v68i01"
#>
#> ...
In our case, “the” and “and” were removed from the data.frame.
Optionally, you can truncate the number of words in each data.frame using the truncNumWords function:
wordFreq <- truncNumWords(maxWords = Inf, wordF = wordFreq)
Specifying maxWords = Inf won’t truncate the data.frames.
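For example, a run limiting each data.frame to 100 words could look like the sketch below (the value 100 is an arbitrary choice used here for illustration):
# A sketch: keep at most 100 words per document (100 is an arbitrary value).
wordFreqTrunc <- truncNumWords(maxWords = 100, wordF = wordFreq)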
To obtain a word occurrence data.frame, each element of the wordFreq list must be merged. This operation is performed with the mergeWordFreq function:
mergedD <- mergeWordFreq(wordF = wordFreq)
#> word v68i01 v68i02 v68i03 v68i04 v68i05 ...
#> stem2076 model 11 25 89 420 253 ...
#> stem722 data 46 95 11 43 18 ...
#> ...
All these tasks can be performed with the getwordOccuDF function, which takes the working directory and the language as arguments:
mergedD <- getwordOccuDF(mywd = "/home/user/myWD/JSS/", language = "English")
A folder named “RESULTS” is created in your working directory and contains the output files for each analysis performed.
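To see which output files have been generated, a simple base R call can list the contents of this folder (assuming it was created inside the mywd directory defined above):
# List the output files produced in the RESULTS folder (base R only).
list.files(file.path(mywd, "RESULTS"))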
Simple manipulations can easily be performed from the word occurrence data.frame. The number of words (excluding stop words) can be computed as follows:
numWords <- apply(mergedD[,2:ncol(mergedD)], MARGIN = 2, FUN = sum)
Or the number of unique words:
numUniqueWords <- apply(mergedD[,2:ncol(mergedD)],
MARGIN = 2, FUN = function(i) {length(i[i > 0])})
Considering the number of words as an “area” and the number of unique words as “species”, we can easily build a “species-area relationship” analysis (commonly log-log transformed):
plot(x = log(numWords), y = log(numUniqueWords), pch = 16)
text(x = log(numWords), y = log(numUniqueWords),
  labels = names(mergedD[, 2:ncol(mergedD)]), cex = 0.7, pos = 3)
lmSAR <- lm(log(numUniqueWords) ~ log(numWords))
summary(lmSAR)
abline(lmSAR)
#> Call:
#> lm(formula = log(numUniqueWords) ~ log(numWords))
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -0.31186 -0.03016 0.02944 0.05401 0.15724
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 1.4693 1.2606 1.166 0.27738
#> log(numWords) 0.6246 0.1524 4.099 0.00344 **
#> ---
#> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#>
#> Residual standard error: 0.1316 on 8 degrees of freedom
#> Multiple R-squared: 0.6775, Adjusted R-squared: 0.6372
#> F-statistic: 16.8 on 1 and 8 DF, p-value: 0.003441
The slope is significantly different from zero, indicating that longer articles have a higher “species richness” in terms of unique words. Additional analyses from this perspective can be found in numerous ecological R packages. A good place to start is the vegan package.
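As an illustration of this ecological perspective, here is a minimal sketch, assuming the vegan package is installed (it is not required by inpdfr), that computes the Shannon diversity and the number of unique words of each document from the word occurrence data.frame:
# A sketch assuming the vegan package is installed (not required by inpdfr).
# Documents are treated as sites (rows) and words as species (columns).
commMatrix <- t(mergedD[, 2:ncol(mergedD)])       # documents in rows, words in columns
vegan::diversity(commMatrix, index = "shannon")   # Shannon diversity of each document
vegan::specnumber(commMatrix)                     # number of unique words per document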
Assuming your word occurrence data.frame is named “mergedD”, you can compute word clouds with the makeWordcloud function. The getPlot argument controls which word clouds are computed: if getPlot[1] is TRUE, a word cloud is made for each file; if getPlot[2] is TRUE, a word cloud is made for the whole set of documents.
makeWordcloud(wordF = mergedD, wcminFreq = 50, wcmaxWords = Inf,
  wcRandOrder = FALSE, wcCol = RColorBrewer::brewer.pal(8, "Dark2"),
  getPlot = c(FALSE, TRUE))
The function getSummaryStatsBARPLOT computes a barplot of the number of unique words per document and returns the corresponding values.
getSummaryStatsBARPLOT(wordF = mergedD)
#> [1] 597 800 969 1019 838 788 541 822 823 568
The function getSummaryStatsHISTO computes a histogram of the number of words per document.
getSummaryStatsHISTO(wordF = mergedD)
The function getSummaryStatsOCCUR computes a scatter plot of the proportion of documents sharing the same words, and returns the corresponding table.
getSummaryStatsOCCUR(wordF = mergedD)
#> dfTableP[, 3] dfTableP[, 2]
#> 1 (0,0.1] 2296
#> 2 (0.1,0.2] 467
#> 3 (0.2,0.3] 217
#> 4 (0.3,0.4] 142
#> 5 (0.4,0.5] 128
#> 6 (0.5,0.6] 101
#> 7 (0.6,0.7] 58
#> 8 (0.7,0.8] 67
#> 9 (0.8,0.9] 62
#> 10 (0.9,1] 57
In our example, we can see that 57 words were used in 90 to 100% of the articles, 58 words were used in 60 to 70% of the articles, and 2296 words were specific to a single article. A comparison with a special issue should give a different distribution, with more words common to all articles.
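Related to the last class of this table, here is a base R sketch extracting the words that occur in every single document of the corpus:
# A sketch in base R: words present in all documents of the corpus.
nDocs <- ncol(mergedD) - 1                        # number of documents
inAllDocs <- rowSums(mergedD[, -1] > 0) == nDocs  # TRUE for words found in every document
as.character(mergedD[inAllDocs, "word"])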
The function getMostFreqWord returns the most frequent words in the word occurrence data.frame. It also computes a scatter plot of the frequency of each word in each document.
getMostFreqWord(wordF = mergedD, numWords = 10)
#> [1] "model" "data" "prior" "package" "variable"
#> [6] "test" "values" "estimate" "statistical" "set"
You may want to normalize the frequency by the number of words in each document. This can be easily done (in our example, the most frequent words are the same, but the corresponding scatter plots differ):
mergedDNorm <- data.frame(word = as.character(mergedD[, 1]),
  t(t(mergedD[, 2:ncol(mergedD)]) /
    apply(mergedD[, 2:ncol(mergedD)], MARGIN = 2, FUN = sum)) * 100)
getMostFreqWord(wordF = mergedDNorm, numWords = 10)
#> [1] "model" "data" "prior" "package" "variable"
#> [6] "test" "values" "estimate" "statistical" "set"
The function getMostFreqWordCor computes the correlations between the most frequent words. Images of the correlation matrices are also provided in the “RESULTS” folder. In our set of PDFs, we can see for example that “model” is significantly correlated with “prior”, or that “statistical” is significantly correlated with “variable”:
getMostFreqWordCor(wordF = mergedD, numWords = 10)
#> $cor
#> model data prior package variable ...
#> model 1.0000000 -0.3835130 0.9564451 0.194231566 0.2746420 ...
#> data -0.3835130 1.0000000 -0.2039196 0.108141861 0.5050901 ...
#> prior 0.9564451 -0.2039196 1.0000000 0.188332000 0.2856416 ...
#> package 0.1942316 0.1081419 0.1883320 1.000000000 0.4097255 ...
#> variable 0.2746420 0.5050901 0.2856416 0.409725475 1.0000000 ...
#> test -0.1517949 -0.3209755 -0.1651302 0.278446147 -0.3004002 ...
#> values -0.3621747 0.5553868 -0.3198548 -0.232902022 0.5250301 ...
#> estimate -0.1736209 -0.3753552 -0.1576511 -0.009370034 -0.5549054 ...
#> statistical 0.4792956 0.2101346 0.5621689 0.183642865 0.6776569 ...
#> set 0.2755856 -0.2963624 0.2279321 -0.130621530 0.2758461 ...
#>
#> $pval
#> model data prior package variable ...
#> model 0.000000e+00 0.27394969 1.493632e-05 0.5907879 0.44252229 ...
#> data 2.739497e-01 0.00000000 5.720168e-01 0.7661868 0.13646367 ...
#> prior 1.493632e-05 0.57201681 0.000000e+00 0.6023278 0.42369314 ...
#> package 5.907879e-01 0.76618681 6.023278e-01 0.0000000 0.23963804 ...
#> variable 4.425223e-01 0.13646367 4.236931e-01 0.2396380 0.00000000 ...
#> test 6.754944e-01 0.36584180 6.484673e-01 0.4359677 0.39903200 ...
#> values 3.037403e-01 0.09557406 3.676131e-01 0.5172745 0.11916335 ...
#> estimate 6.314473e-01 0.28514377 6.635822e-01 0.9795048 0.09592275 ...
#> statistical 1.610148e-01 0.56009587 9.074744e-02 0.6115571 0.03130403 ...
#> set 4.408923e-01 0.40570937 5.265050e-01 0.7190909 0.44044280 ...
The function getXFreqWord returns the words which have been found at least X times in the set of documents.
getXFreqWord(wordF = mergedD, occuWords = 200)
#> [1] "model" "data" "prior"
#> [4] "package" "variable" "test"
#> [7] "values" "estimate" "statistical"
#> [10] "set" "miss" "parameter"
#> [13] "covariance" "coefficient" "imputation"
#> [16] "number" "journal" "equating"
#> [19] "function" "results" "method"
#> [22] "software" "average"
The function doCA performs a correspondence analysis on the basis of the word occurrence data.frame, with the associated plot.
doCA(wordF = mergedD)
#> Principal inertias (eigenvalues):
#> 1 2 3 4 5 6 7 ...
#> Value 0.500619 0.472982 0.442737 0.411129 0.401536 0.389836 0.374116 ...
#> Percentage 13.83% 13.07% 12.24% 11.36% 11.1% 10.77% 10.34% ...
#>
#> Rows:
#> ...
#>
#> Columns:
#> v68i01 v68i02 v68i03 v68i04 v68i05 v68i06 ...
#> Mass 0.062176 0.095228 0.126130 0.165061 0.104936 0.097500 ...
#> ChiDist 2.405088 2.046214 1.685774 1.438954 1.636650 1.864347 ...
#> Inertia 0.359653 0.398717 0.358441 0.341773 0.281083 0.338890 ...
#> Dim. 1 -0.011166 0.768156 0.513466 0.475236 0.383365 0.238685 ...
#> Dim. 2 0.794833 1.180390 -0.146855 -1.655962 -1.109124 0.946243 ...
The function doCluster performs a cluster analysis with the associated dendrogram.
doCluster(wordF = mergedD, myMethod = "ward.D2", gp = FALSE, nbGp = 3)
#> Call:
#> stats::hclust(...)
#>
#> Cluster method : ward.D2
#> Distance : euclidean
#> Number of objects: 10
The function doKmeansClust performs a k-means cluster analysis with the associated cluster plot.
doKmeansClust(wordF = mergedD, nbClust = 4, nbIter = 10, algo = "Hartigan-Wong")
#> K-means clustering with 4 clusters of sizes 1, 2, 6, 1
#>
#> Cluster means:
#> v68i01 v68i02 v68i03 v68i04 v68i05 v68i06 v68i07 ...
#> 1 501.5376 549.1102 589.0849 738.2107 569.1072 593.9343 0.0000 ...
#> 2 392.8950 422.6009 221.9471 526.7186 221.9471 484.8954 579.0960 ...
#> 3 257.0480 290.6130 431.3375 632.9743 418.1930 320.8966 536.3371 ...
#> 4 622.2564 641.1404 632.0633 0.0000 421.3739 680.6240 738.2107 ...
#>
#> Clustering vector:
#> v68i01 v68i02 v68i03 v68i04 v68i05 v68i06 v68i07 v68i08 v68i09 v68i10
#> 3 3 2 4 2 3 1 3 3 3
#>
#> Within cluster sum of squares by cluster:
#> [1] 0.0 220295.9 672154.6 0.0
#> (between_SS / total_SS = 67.2 %)
#>
#> Available components:
#>
#> [1] "cluster" "centers" "totss" "withinss" ...
Discussing the analyses performed here is beyond the scope of this vignette. Briefly, the function doMetacomEntropart uses the entropart package and its functions DivEst, DivPart, DivProfile, and MetaCommunity. Results are provided as plots or TXT files in the “RESULTS” folder. Words are considered as species, word occurrences as abundances, and documents as communities.
doMetacomEntropart(wordF = mergedD)
#> Meta-community (class 'MetaCommunity') made of 25170 individuals in 10
#> communities and 3595 species.
#>
#> Its sample coverage is 0.973223513822329
#>
#> Community weights are:
#> v68i01 v68i02 v68i03 v68i04 v68i05 v68i06 v68i07 v68i08 v68i09 v68i10
#> 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1
#> Community sample numbers of individuals are:
#> v68i01 v68i02 v68i03 v68i04 v68i05 v68i06 v68i07 v68i08 v68i09 v68i10
#> 2517 3855 5106 6682 4248 3947 3723 4334 3439 2631
#> Community sample coverages are:
#> v68i01 v68i02 v68i03 v68i04 v68i05 v68i06 v68i07 ...
#> 0.9070729 0.9250581 0.9265726 0.9440365 0.9268099 0.9176794 0.9444160 ...
#>
#> ...
Again, discussing the analyses performed here is beyond the scope of this vignette. Briefly, the function doMetacomMetacom uses the metacom package and its metacom function. Just like before, words are considered as species, word occurrences as abundances, and documents as communities (allowing the metacommunity analysis).
doMetacomMetacom(wordF = mergedD, numSim = 10, limit = "Inf")
#> [1] "Identified community structure: Random"
#> $Comm
#> ...
#>
#> $Coherence
#> output
#> embedded absences 20251
#> z 1.91875604260688
#> pval 0.0550152153018832
#> sim.mean 22109.1
#> sim.sd 968.387829791809
#> method r1
#>
#> $Turnover
#> output
#> replacements 5961653
#> z -2.67166401618644
#> pval 0.00754761772110864
#> sim.mean 3470121.1
#> sim.sd 932576.80790133
#> method r1
#>
#> $Boundary
#> index P df
#> 1 0 0.4234722 3592
All these tasks can be performed with the getAllAnalysis function, which takes the word occurrence data.frame as argument:
getAllAnalysis(dataset = mergedD)
To load the RGtk2 GUI, use the loadGUI function, available only in the GitHub version of the package (https://github.com/frareb/inpdfr):
loadGUI()
All functions used to build the GUI were made available so that any developer can easily access their content. They are not intended to be used by end users, but given the scarcity of RGtk2 resources on the web, I thought they should be available in this package in the hope that they could be useful for other projects. Please feel free to use them under the terms of the package licence, but do not expect backward compatibility in future versions of this package. These functions are listed below:
From this point, considering words as species, word occurrences as abundances, and documents as communities, a wide range of analyses from theoretical ecology becomes available in R. Some examples are rank-abundance curves, species-area relationships, or “single large or several small” (SLOSS) analyses. All of them provide interesting ways to compare and analyse sets of documents.
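As a final illustration, here is a minimal base R sketch of a rank-abundance curve for the whole corpus, ranking words by their total number of occurrences and plotting them on a log scale:
# A sketch in base R: rank-abundance curve for the whole corpus.
totalOccu <- rowSums(mergedD[, 2:ncol(mergedD)])  # total occurrences of each word
plot(sort(totalOccu, decreasing = TRUE), log = "y",
  type = "l", xlab = "Word rank", ylab = "Number of occurrences")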