While trying to figure out how encoding works in R packages when writing French DESCRIPTION and manual files, I wondered whether other packages on CRAN have been written in French or in any other non-English language.
library(tidyverse)
Looking into the CRAN database
First, we load the CRAN database, which holds the DESCRIPTION fields of every package. We keep only the fields that can help us look for non-English packages. (For the meaning of every field, read Writing R Extensions.)
to_keep <- c("Package", "Title", "Description", "Encoding", "Language")
cran_db <- tools::CRAN_package_db()[to_keep] %>% as_tibble()
glimpse(cran_db)
## Rows: 17,802
## Columns: 5
## $ Package <chr> "A3", "aaSEA", "AATtools", "ABACUS", "abbyyR", "abc", "abc~
## $ Title <chr> "Accurate, Adaptable, and Accessible Error Metrics for Pre~
## $ Description <chr> "Supplies tools for tabulating and analyzing the results o~
## $ Encoding <chr> NA, "UTF-8", "UTF-8", "UTF-8", NA, NA, NA, "UTF-8", "UTF-8~
## $ Language <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA~
Let's see how many packages set the Encoding field in their DESCRIPTION.
nb_of_na <- cran_db %>% filter(is.na(Encoding)) %>% count(Encoding) %>% pull(n)
ggplot(cran_db %>% filter(!is.na(Encoding))) +
geom_bar(aes(x = Encoding), fill = "skyblue") +
labs(title = "Nb of package with Encoding field", subtitle = glue::glue("{nb_of_na} packages with unset Encoding field")) +
theme_light()
Are there any packages that set a Language field in their DESCRIPTION?
nb_of_na <- cran_db %>% filter(is.na(Language)) %>% count(Language) %>% pull(n)
ggplot(cran_db %>% filter(!is.na(Language))) +
geom_bar(aes(x = Language), fill = "skyblue") +
labs(title = "Nb of package with Language field", subtitle = glue::glue("{nb_of_na} packages with unset Language field")) +
coord_flip() +
theme_light()
So some packages do declare a Language field in their DESCRIPTION. According to Writing R Extensions, this field means:
A 'Language' field can be used to indicate if the package documentation is not in English: this should be a comma-separated list of standard (not private use or grandfathered) IETF language tags as currently defined by RFC 5646 (https://tools.ietf.org/html/rfc5646, see also https://en.wikipedia.org/wiki/IETF_language_tag), i.e., use language subtags which in essence are 2-letter ISO 639-1 (https://en.wikipedia.org/wiki/ISO_639-1) or 3-letter ISO 639-3 (https://en.wikipedia.org/wiki/ISO_639-3) language codes.
Which packages are those?
cran_db %>%
filter(Language == "fr") %>%
pander::pandoc.table(split.tables = Inf)
Package | Title | Description | Encoding | Language |
---|---|---|---|---|
SARP.moodle | XML Output Functions for Easy Creation of Moodle Questions | Provides a set of basic functions for creating Moodle XML output files suited for importing questions in Moodle (a learning management system, see https://moodle.org/ for more information). | UTF-8 | fr |
We see that neither of these packages has its Title or Description field in French. Moreover, one package uses UTF-8 encoding and the other Latin-1.
So the Language field does not tell us much about the language actually used in the DESCRIPTION. I am not sure it serves any purpose other than information.
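One practical wrinkle: because the Language field is a comma-separated list of tags, an exact match such as Language == "fr" misses values like "en, es" or "pt-BR". Here is a minimal base R sketch of a more tolerant match. The sample values are copied from Language fields that show up in this post, while `primary_subtags()` and `has_lang()` are hypothetical helper names introduced only for illustration. Treating the non-standard `pt_BR` like `pt-BR` is my own assumption, not something the R manual prescribes.

```r
# Toy values mirroring Language fields seen in this post (not the full CRAN database)
language_field <- c(NA, "fr", "en, es", "pt-BR", "pt_BR", "en-GB")

# Hypothetical helper: split one Language field into its primary language subtags
primary_subtags <- function(x) {
  if (is.na(x)) return(character(0))
  tags <- trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  # Keep what precedes the first "-" (or the non-standard "_"), lower-cased
  tolower(sub("[-_].*$", "", tags))
}

subtags <- lapply(language_field, primary_subtags)

# Hypothetical helper: does each field declare the given language code?
has_lang <- function(code) {
  vapply(subtags, function(s) code %in% s, logical(1))
}

has_lang("es")  # TRUE only for "en, es"
has_lang("pt")  # TRUE for both "pt-BR" and "pt_BR"
```

With the real data, the same idea could replace the exact comparison in the filter, assuming the helpers above are defined first.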
Trying to detect the language of packages
How can we get some insight into the language of CRAN packages?
We can analyse the CRAN database using Google's Compact Language Detector. rOpenSci has made two wrappers for it: one for version 2 (cld2) and one for version 3 (cld3).
We will use version 2, as cld3 is still experimental.
cran_db_with_lang <- cran_db %>%
mutate(detected = cld2::detect_language(Description, lang_code = FALSE))
nb_of_lg_en <- cran_db_with_lang %>% filter(detected == "ENGLISH") %>% count(detected) %>% pull(n)
ggplot(cran_db_with_lang %>% filter(detected != "ENGLISH")) +
geom_bar(aes(x = detected), fill = "skyblue") +
labs(title = "Language of packages", subtitle = glue::glue("{nb_of_lg_en} packages detected as English")) +
coord_flip() +
theme_light()
Very few packages seem to have a non-English Description field.
Package | Title | Description | Encoding | Language | detected |
---|---|---|---|---|---|
casen | Metodos De Estimacion Con Disenio Probabilistico y Estratificado en Encuesta CASEN (Estimation Methods with Probabilistic Stratified Sampling in CASEN Survey) | Funciones para realizar estadistica descriptiva e inferencia con el disenio complejo de la Encuesta CASEN (Encuesta de Caracterizacion Socio-Economica). Incluye datasets que permiten armonizar los codigos de comunas que cambian entre anios y permite convertir a los codigos oficiales de SUBDERE. (Functions to compute descriptive and inferential statistics with CASEN Survey [Socio-Economic Characterization Survey] complex design. Includes datasets to harmonize commune codes that change across years and allows to convert to official SUBDERE codes.) | UTF-8 | NA | SPANISH |
censo2017 | Base de Datos de Facil Acceso del Censo 2017 de Chile (2017 Chilean Census Easy Access Database) | Provee un acceso conveniente a mas de 17 millones de registros de la base de datos del Censo 2017. Los datos fueron importados desde el DVD oficial del INE usando el Convertidor REDATAM creado por Pablo De Grande. Esta paquete esta documentado intencionalmente en castellano asciificado para que funcione sin problema en diferentes plataformas. (Provides convenient access to more than 17 million records from the Chilean Census 2017 database. The datasets were imported from the official DVD provided by the Chilean National Bureau of Statistics by using the REDATAM converter created by Pablo De Grande and in addition it includes the maps accompanying these datasets.) | UTF-8 | NA | SPANISH |
chilemapas | Mapas de las Divisiones Politicas y Administrativas de Chile (Maps of the Political and Administrative Divisions of Chile) | Mapas terrestres con topologias simplificadas. Estos mapas no tienen precision geodesica, por lo que aplica el DFL-83 de 1979 de la Republica de Chile y se consideran referenciales sin validez legal. No se incluyen los territorios antarticos y bajo ningun evento estos mapas significan que exista una cesion u ocupacion de territorios soberanos en contra del Derecho Internacional por parte de Chile. Esta paquete esta documentado intencionalmente en castellano asciificado para que funcione sin problema en diferentes plataformas. (Terrestrial maps with simplified toplogies. These maps lack geodesic precision, therefore DFL-83 1979 of the Republic of Chile applies and are considered to have no legal validity. Antartic territories are excluded and under no event these maps mean there is a cession or occupation of sovereign territories against International Laws from Chile. This package was intentionally documented in asciified spanish to make it work without problem on different platforms.) | UTF-8 | NA | SPANISH |
confreq | Configural Frequencies Analysis Using Log-Linear Modeling | Offers several functions for Configural Frequencies Analysis (CFA), which is a useful statistical tool for the analysis of multiway contingency tables. CFA was introduced by G. A. Lienert as 'Konfigurations Frequenz Analyse - KFA'. Lienert, G. A. (1971). Die Konfigurationsfrequenzanalyse: I. Ein neuer Weg zu Typen und Syndromen. Zeitschrift für Klinische Psychologie und Psychotherapie, 19(2), 99–115. | UTF-8 | NA | GERMAN |
covid19italy | The 2019 Novel Coronavirus COVID-19 (2019-nCoV) Italy Dataset | Provides a daily summary of the Coronavirus (COVID-19) cases in Italy by country, region and province level. Data source: Presidenza del Consiglio dei Ministri - Dipartimento della Protezione Civile http://www.protezionecivile.it/. | UTF-8 | NA | ITALIAN |
datos | Traduce al Español Varios Conjuntos de Datos de Práctica | Provee una versión traducida de los siguientes conjuntos de datos: 'airlines', 'airports', 'AwardsManagers', 'babynames', 'Batting', 'diamonds', 'faithful', 'fueleconomy', 'Fielding', 'flights', 'gapminder', 'gss_cat', 'iris', 'Managers', 'mpg', 'mtcars', 'atmos', 'People', 'Pitching', 'planes', 'presidential', 'table1', 'table2', 'table3', 'table4a', 'table4b', 'table5', 'vehicles', 'weather', 'who'. English: It provides a Spanish translated version of the datasets listed above. | UTF-8 | es | SPANISH |
ExpDes.pt | Pacote Experimental Designs (Portugues) | Pacote para análise de delineamentos experimentais (DIC, DBC e DQL), experimentos em esquema fatorial duplo (em DIC e DBC), experimentos em parcelas subdivididas (em DIC e DBC), experimentos em esquema fatorial duplo com um tratamento adicional (em DIC e DBC), experimentos em fatorial triplo (em DIC e DBC) e experimentos em esquema fatorial triplo com um tratamento adicional (em DIC e DBC), fazendo analise de variancia e comparacao de multiplas medias (para tratamentos qualitativos), ou ajustando modelos de regressao ate a terceira potencia (para tratamentos quantitativos); analise de residuos (Ferreira, Cavalcanti and Nogueira, 2014) doi:10.4236/am.2014.519280. | UTF-8 | NA | PORTUGUESE |
geouy | Geographic Information of Uruguay | The toolbox have functions to load and process geographic information for Uruguay. And extra-function to get address coordinates and orthophotos through the uruguayan 'IDE' API https://www.gub.uy/infraestructura-datos-espaciales/tramites-y-servicios/servicios/servicio-direcciones-geograficas. | UTF-8 | en, es | SPANISH |
guaguas | Nombres Inscritos en Chile (1920 - 2019) | Datos de nombres inscritos en Chile entre 1920 y 2019, de acuerdo al Servicio de Registro Civil. Este paquete incluye todos los nombres con al menos 15 ocurrencias anuales. English: Chilean baby names registered in the Civil Registry Service. This package contains all names used at least 15 times per year, from 1920 to 2019. | UTF-8 | NA | SPANISH |
HBV.IANIGLA | Modular Hydrological Model | The HBV hydrological model (Bergström, S. and Lindström, G., (2015) doi:10.1002/hyp.10510) has been split in modules to allow the user to build his/her own model. This version was developed by the author in IANIGLA-CONICET (Instituto Argentino de Nivologia, Glaciologia y Ciencias Ambientales - Consejo Nacional de Investigaciones Cientificas y Tecnicas) for hydroclimatic studies in the Andes. HBV.IANIGLA incorporates routines for clean and debris covered glacier melt simulations. | UTF-8 | NA | SPANISH |
ibb | R Wrapper for Istanbul Municipality Open Data Portal | Call wrappers for Istanbul Metropolitan Municipality's Open Data Portal (Turkish: Istanbul Büyüksehir Belediyesi Açik Veri Portali) at https://data.ibb.gov.tr/en/. | UTF-8 | NA | TURKISH |
KenSyn | Knowledge Synthesis in Agriculture - From Experimental Network to Meta-Analysis | Demo and dataset accompaying the books : De l'analyse des réseaux expérimentaux à la méta-analyse: Méthodes et applications avec le logiciel R pour les sciences agronomiques et environnementales (Published 2018-06-28, Quae, for french version) by David Makowski, Francois Piraux and Francois Brun - https://www.quae.com/produit/1514/9782759228164/de-l-analyse-des-reseaux-experimentaux-a-la-meta-analyse Knowledge Synthesis in Agriculture : from Experimental Network to Meta-Analysis (in preparation for 2018-06, Springer , for English version) by David Makowski, Francois Piraux and Francois Brun A full description of all the material is in both books. ACKNOWLEDGMENTS : The French network 'RMT modeling and data analysis for agriculture' (http://www.modelia.org) have contributed to the development of this R package. This project and network are lead by ACTA (French Technical Institute for Agriculture) and was funded by a grant from the Ministry of Agriculture and Fishing of France. | UTF-8 | NA | FRENCH |
LabRS | Laboratorio di 'Ricerca Sociale con R' | Dati, scripts e funzioni per il libro 'Ricerca sociale con R. Concetti e funzioni base per la ricerca sociale' (Datasets, scripts and functions to support the book 'Ricerca sociale con R. Concetti e funzioni base per la ricerca sociale'). | UTF-8 | NA | ITALIAN |
labstatR | Libreria Del Laboratorio Di Statistica Con R | Insieme di funzioni di supporto al volume 'Laboratorio di Statistica con R', Iacus-Masarotto, MacGraw-Hill Italia, 2006. This package contains sets of functions defined in 'Laboratorio di Statistica con R', Iacus-Masarotto, MacGraw-Hill Italia, 2006. Function names and docs are in italian as well. | NA | NA | ITALIAN |
LSAmitR | Daten, Beispiele und Funktionen zu 'Large-Scale Assessment mit R' | Dieses R-Paket stellt Zusatzmaterial in Form von Daten, Funktionen und R-Hilfe-Seiten für den Herausgeberband Breit, S. und Schreiner, C. (Hrsg.). (2016). 'Large-Scale Assessment mit R: Methodische Grundlagen der österreichischen Bildungsstandardüberprüfung.' Wien: facultas. (ISBN: 978-3-7089-1343-8, https://www.bifie.at/node/3770) zur Verfügung. | UTF-8 | de | GERMAN |
MSMwRA | Multivariate Statistical Methods with R Applications | Data sets in the book entitled 'Multivariate Statistical Methods with R Applications', H.Bulut (2018). The book will be published in Turkish and the original name of this book will be 'R Uygulamalari ile Cok Degiskenli Istatistiksel Yontemler'. | NA | NA | TURKISH |
MultivariateAnalysis | Pacote Para Analise Multivariada | Package with multivariate analysis methodologies for experiment evaluation. The package estimates dissimilarity measures, builds dendrograms, obtains MANOVA, principal components, canonical variables, etc. (Pacote com metodologias de analise multivariada para avaliação de experimentos. O pacote estima medidas de dissimilaridade, construi de dendogramas, obtem a MANOVA, componentes principais, variáveis canônicas, etc.) | UTF-8 | pt-BR | PORTUGUESE |
MVar.pt | Analise multivariada (brazilian portuguese) | Pacote para analise multivariada, tendo funcoes que executam analise de correspondencia simples (CA) e multipla (MCA), analise de componentes principais (PCA), analise de correlacao canonica (CCA), analise fatorial (FA), escalonamento multidimensional (MDS), analise discriminante linear (LDA) e quadratica (QDA), analise de cluster hierarquico e nao hierarquico, regressao linear simples e multipla, analise de multiplos fatores (MFA) para dados quantitativos, qualitativos, de frequencia (MFACT) e dados mistos, biplot, scatter plot, projection pursuit (PP), grant tour e outras funcoes uteis para a analise multivariada. | NA | pt_BR | PORTUGUESE |
orloca.es | Spanish version of orloca package. Modelos de localizacion en investigacion operativa | Help and demo in Spanish of the orloca package. (Ayuda y demo en espanol del paquete orloca.) Objetos y metodos para manejar y resolver el problema de localizacion de suma minima, tambien conocido como problema de Fermat-Weber. El problema de localizacion de suma minima busca un punto tal que la suma ponderada de las distancias a los puntos de demanda se minimice. Vease 'The Fermat-Weber location problem revisited' por Brimberg, Mathematical Programming, 1, pag. 71-76, 1995. <DOI: 10.1007/BF01592245>. Se usan algoritmos generales de optimizacion global para resolver el problema, junto con el metodo adhoc Weiszfeld, vease 'Sur le point pour lequel la Somme des distance de n points donnes est minimum', por Weiszfeld, Tohoku Mathematical Journal, First Series, 43, pag. 355-386, 1937 o 'On the point for which the sum of the distances to n given points is minimum', por E. Weiszfeld y F. Plastria, Annals of Operations Research, 167, pg. 7-41, 2009. DOI:10.1007/s10479-008-0352-z. | NA | es | SPANISH |
PortalHacienda | Acceder Con R a Los Datos Del Portal De Hacienda | Obtener listado de datos, acceder y extender series del Portal de Datos de Hacienda.Las proyecciones se realizan con 'forecast', Hyndman RJ, Khandakar Y (2008) doi:10.18637/jss.v027.i03. Search, download and forecast time-series from the Ministry of Economy of Argentina. Forecasts are built with the 'forecast' package, Hyndman RJ, Khandakar Y (2008) doi:10.18637/jss.v027.i03. | UTF-8 | NA | SPANISH |
praktikum | Kvantitatiivsete meetodite praktikumi asjad / Functions used in the course 'Quantitative methods in behavioural sciences' (SHPH.00.004), University of Tartu | Kasulikud funktsioonid kvantitatiivsete mudelite kursuse (SHPH.00.004) jaoks | NA | et | ESTONIAN |
proustr | Tools for Natural Language Processing in French | Tools for Natural Language Processing in French and texts from Marcel Proust's collection 'A La Recherche Du Temps Perdu'. The novels contained in this collection are 'Du cote de chez Swann', 'A l'ombre des jeunes filles en fleurs','Le Cote de Guermantes', 'Sodome et Gomorrhe I et II', 'La Prisonniere', 'Albertine disparue', and 'Le Temps retrouve'. | UTF-8 | NA | FRENCH |
qha | Qualitative Harmonic Analysis | Multivariate description of the state changes of a qualitative variable by Correspondence Analysis and Clustering. See: Deville, J.C., & Saporta, G. (1983). Correspondence analysis, with an extension towards nominal time series. Journal of econometrics, 22(1-2), 169-189. Corrales, M.L., & Pardo, C.E. (2015) doi:10.15332/s2027-3355.2015.0001.01. Analisis de datos longitudinales cualitativos con analisis de correspondencias y clasificacion. Comunicaciones en Estadistica, 8(1), 11-32. | latin1 | NA | SPANISH |
qqr | Data from Brazilian Soccer Championship | Get data about the Brazilian soccer championship since 2014. Official data can be found at https://www.cbf.com.br/futebol-brasileiro/competicoes/campeonato-brasileiro-serie-a/. | UTF-8 | NA | PORTUGUESE |
Rarefy | Rarefaction Methods | Includes functions for the calculation of spatially and non-spatially explicit rarefaction curves using different indices of taxonomic, functional and phylogenetic diversity. The user can also rarefy any biodiversity metric as provided by a self-written function (or an already existent one) that gives as output a vector with the values of a certain index of biodiversity calculated per plot (Ricotta, C., Acosta, A., Bacaro, G., Carboni, M., Chiarucci, A., Rocchini, D., Pavoine, S. (2019) doi:10.1016/j.ecolind.2019.105606; Bacaro, G., Altobelli, A., Cameletti, M., Ciccarelli, D., Martellos, S., Palmer, M. W., … Chiarucci, A. (2016) doi:10.1016/j.ecolind.2016.04.026; Bacaro, G., Rocchini, D., Ghisla, A., Marcantonio, M., Neteler, M., & Chiarucci, A. (2012) doi:10.1016/j.ecocom.2012.05.007). | UTF-8 | NA | ITALIAN |
rasterdiv | Diversity Indices for Numerical Matrices | Providing functions to calculate indices of diversity on numerical matrices based on information theory. The rationale behind the package is described in Rocchini, Marcantonio and Ricotta (2017) doi:10.1016/j.ecolind.2016.07.039 and Rocchini, Marcantonio, …, Ricotta (2021) doi:10.1101/2021.01.23.427872. | UTF-8 | en-GB | ITALIAN |
RcmdrPlugin.EACSPIR | Plugin de R-Commander para el Manual 'EACSPIR' | Este paquete proporciona una interfaz grafica de usuario (GUI) para algunos de los procedimientos estadisticos detallados en un curso de 'Estadistica aplicada a las Ciencias Sociales mediante el programa informatico R' (EACSPIR). LA GUI se ha desarrollado como un Plugin del programa R-Commander. | NA | es | SPANISH |
Sofi | Interfaz interactiva con fines didacticos | Este paquete tiene la finalidad de ayudar a aprender de una forma interactiva, teniendo ejemplos y la posibilidad de resolver nuevos al mismo tiempo. Apuntes de clase interactivos. | UTF-8 | es | SPANISH |
SpatialRegimes | Spatial Constrained Clusterwise Regression | A collection of functions for estimating spatial regimes, aggregations of neighboring spatial units that are homogeneous in functional terms. The term spatial regime, therefore, should not be understood as a synonym for cluster. More precisely, the term cluster does not presuppose any functional relationship between the variables considered, while the term regime is linked to a regressive relationship underlying the spatial process. For more information, please see Postiglione, P., Andreano, M.S., Benedetti R. (2013) doi:10.1007/s10614-012-9325-z , Andreano, M.S., Benedetti, R., and Postiglione, P. (2017) doi:10.1007/s11135-016-0415-1 , Bille', A.G., Benedetti, R., and Postiglione, P. (2017) doi:10.1080/17421772.2017.1286373. | NA | NA | ITALIAN |
swissdd | Get Swiss Federal and Cantonal Vote Results from Opendata.swiss | Builds upon the real time data service as well as the archive for national votes https://opendata.swiss/api/3/action/package_show?id=echtzeitdaten-am-abstimmungstag-zu-eidgenoessischen-abstimmungsvorlagen and cantonal votes https://opendata.swiss/api/3/action/package_show?id=echtzeitdaten-am-abstimmungstag-zu-kantonalen-abstimmungsvorlagen. It brings the results of Swiss popular votes, aggregated at the geographical level of choice, into R. Additionally, it allows to retrieve data from the Swissvotes-Database, one of the most comprehensive data platforms on Swiss referendums and initiatives https://swissvotes.ch/page/dataset/swissvotes_dataset.csv. | UTF-8 | NA | GERMAN |
Tratamentos.ad | 'Pacote Para Analise De Experimentos Com Testemunhas Adicionais'' | Pacote para a analise de experimentos com um ou dois fatores com testemunhas adicionais conduzidos no delineamento inteiramente casualizado ou em blocos casualizados. 'Package for the analysis of one or two-factor experiments with additional controls conducted in a completely randomized design or in a randomized block design'. | UTF-8 | pt-BR | PORTUGUESE |
The detection by cld2 does not seem to be 100% correct: for example, Rarefy and rasterdiv have English descriptions but are detected as Italian.
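One rough way to spot suspicious detections is to compare the cld2 result with the Language field when a package declares one. The base R sketch below uses a handful of rows copied from the table above. The `iso_code` lookup is a hypothetical hand-written mapping covering only the languages in this sample, not something provided by cld2.

```r
# Toy rows copied from the table above (declared Language vs. cld2 result)
pkgs <- data.frame(
  Package  = c("datos", "LSAmitR", "rasterdiv", "geouy", "Sofi"),
  Language = c("es", "de", "en-GB", "en, es", "es"),
  detected = c("SPANISH", "GERMAN", "ITALIAN", "SPANISH", "SPANISH"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from cld2 language names to ISO 639-1 codes (sample only)
iso_code <- c(SPANISH = "es", GERMAN = "de", ITALIAN = "it")

# A detection "agrees" when its code is the primary subtag of a declared tag
agrees <- mapply(function(lang, det) {
  tags <- trimws(strsplit(lang, ",", fixed = TRUE)[[1]])
  unname(iso_code[det]) %in% sub("[-_].*$", "", tags)
}, pkgs$Language, pkgs$detected, USE.NAMES = FALSE)

# Packages whose detected language contradicts the declared one
pkgs[!agrees, ]
```

On this sample only rasterdiv is flagged. Note that the check is lenient: geouy passes because it declares both en and es, even though its description is written in English.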
However, it helps us isolate some non-English packages, and it highlights that English is the way to go to avoid encoding issues and package-specific configuration. I will now keep looking into the packages I just found to see whether they have the same encoding issues as French packages… and it seems they do!