Reticulate
Use R and Python together! Note: this lesson is for reticulate specifically. It’s not a Python lesson. But if you don’t know Python you should still be able to follow along with a friend.
Setup
Create an RStudio project, then create a Quarto document.
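If you prefer to script this step from the console instead of clicking through RStudio's menus, one optional way (the usethis package and file name here are just one choice, not something the lesson requires) looks like this:
# Optional: script the project setup instead of using RStudio's menus
install.packages("usethis")
usethis::create_project("intro-reticulate")  # create and open a new RStudio project
file.create("intro-reticulate.qmd")          # blank Quarto document to work in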
Install Python
So many ways. Too many ways! Here’s one. Do this from the R console.
install.packages("reticulate")
reticulate::install_miniconda()
Create a virtual environment
You likely keep all your installed R packages in one library. The standard practice in Python is to create separate environments for projects instead. Conda helps us do that. Still in the console!
reticulate::conda_create(
  "intro-reticulate",
  packages = c("jupyter", "numpy", "pandas", "scikit-learn")
)
reticulate::use_condaenv("intro-reticulate")
reticulate::py_config() # Are you in the right place?
We just created a Conda environment and installed some useful Python packages: jupyter is roughly the Python equivalent of knitr, numpy lets you work with numerical arrays, pandas lets you work with data frames, and scikit-learn is for machine learning.
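If you want to confirm that reticulate can actually see those packages in the new environment, one quick check (py_module_available() ships with reticulate) is:
# Sanity check: are the Python modules importable from this environment?
reticulate::py_module_available("pandas")   # should return TRUE
reticulate::py_module_available("sklearn")  # note: scikit-learn is imported as "sklearn"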
Tidy Tuesday
Learn reticulate with a Tidy Tuesday exercise. Specifically Numbats.
Load the data
We can do that with R. In your Quarto document, create an R code chunk and download the Tidy Tuesday data.
# install.packages("tidytuesdayR")
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
✔ ggplot2 3.4.0 ✔ purrr 1.0.1
✔ tibble 3.1.7 ✔ dplyr 1.1.0
✔ tidyr 1.3.0 ✔ stringr 1.5.0
✔ readr 2.1.2 ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
numbats_tidytuesday <- tidytuesdayR::tt_load(2023, week = 10)
--- Compiling #TidyTuesday Information for 2023-03-07 ----
--- There is 1 file available ---
--- Starting Download ---
Downloading file 1 of 1: `numbats.csv`
--- Download complete ---
numbats <- numbats_tidytuesday$numbats
numbats
# A tibble: 805 × 16
decimalLatitude decimalLongitude eventDate scientificName
<dbl> <dbl> <dttm> <chr>
1 -37.6 146. NA Myrmecobius fasciatus
2 -35.1 150. 2014-06-05 02:00:00 Myrmecobius fasciatus
3 -35 118. NA Myrmecobius fasciatus
4 -34.7 118. NA Myrmecobius fasciatus
5 -34.6 117. NA Myrmecobius fasciatus
6 -34.6 117. NA Myrmecobius fasciatus
7 -34.6 118. NA Myrmecobius fasciatus
8 -34.6 117. NA Myrmecobius fasciatus
9 -34.6 117. NA Myrmecobius fasciatus
10 -34.6 117. NA Myrmecobius fasciatus
# … with 795 more rows, and 12 more variables: taxonConceptID <chr>,
# recordID <chr>, dataResourceName <chr>, year <dbl>, month <chr>,
# wday <chr>, hour <dbl>, day <date>, dryandra <lgl>, prcp <dbl>, tmax <dbl>,
# tmin <dbl>
Numbat circadian rhythms
Were numbats sighted during the day or at night? Create another R chunk and categorize each sighting’s time of day as day or night. The counts look pretty even! But could environmental covariates influence that?
numbats <- numbats %>%
  mutate(is_day = hour >= 6 & hour <= 18) %>%
  drop_na(is_day, prcp, tmax)
count(numbats, is_day)
# A tibble: 2 × 2
is_day n
<lgl> <int>
1 FALSE 31
2 TRUE 26
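As an optional side check before modelling, ordinary dplyr code can compare the weather covariates between day and night sightings (this uses only the columns already in numbats):
# Compare average precipitation and maximum temperature by time of day
numbats %>%
  group_by(is_day) %>%
  summarize(mean_prcp = mean(prcp), mean_tmax = mean(tmax), n = n())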
R to Python
Let’s use Python to figure out if precipitation and temperature affect the likelihood of seeing numbats at night. Create a Python code chunk.
# This is like "library()" in R
from sklearn.linear_model import LogisticRegression
# Use `r.___` to access R objects
enviro = r.numbats[["prcp", "tmax"]]
is_day = r.numbats["is_day"]
# Fit a classifier (clf)
clf = LogisticRegression(random_state=0).fit(enviro, is_day)
Well done! You just fit a classifier in Python to data you loaded and cleaned with R. That’s pretty cool! Let’s make some predictions on a reference grid.
import numpy as np
import pandas as pd
from sklearn.utils.extmath import cartesian
# The precipitation and temperature values we want to make predictions for
prcp = np.arange(r.numbats["prcp"].min(), r.numbats["prcp"].max(), 0.1)
tmax = np.arange(r.numbats["tmax"].min(), r.numbats["tmax"].max(), 0.2)
# All combinations of prcp and tmax
ref_grid = pd.DataFrame(cartesian((prcp, tmax)), columns=["prcp", "tmax"])
# Predict the probability of "is_day". predict_proba() returns two columns,
# p(!is_day) and p(is_day). Remember `[:, 1]` gets the *second* column
# because Python uses 0-indexing
ref_grid["is_day"] = clf.predict_proba(ref_grid)[:, 1]
ref_grid
prcp tmax is_day
0 0.0 13.7 0.837053
1 0.0 13.9 0.832279
2 0.0 14.1 0.827395
3 0.0 14.3 0.822400
4 0.0 14.5 0.817291
... ... ... ...
14701 11.3 38.5 0.000025
14702 11.3 38.7 0.000024
14703 11.3 38.9 0.000024
14704 11.3 39.1 0.000023
14705 11.3 39.3 0.000022
[14706 rows x 3 columns]
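If you prefer to build the reference grid on the R side instead, an equivalent (not what the lesson does, just a comparison) uses tidyr::expand_grid; note that seq() includes the upper endpoint while np.arange() excludes it, so the grids differ slightly at the edges:
# Optional R equivalent of the reference grid built above in Python
ref_grid_r <- tidyr::expand_grid(
  prcp = seq(min(numbats$prcp), max(numbats$prcp), by = 0.1),
  tmax = seq(min(numbats$tmax), max(numbats$tmax), by = 0.2)
)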
Python to R
Now we’ll use R to visualize the model predictions generated in Python. Make another R code chunk.
library(reticulate)
# Use py$____ to get Python objects
predict_prcp <- py$ref_grid %>%
  # For each precipitation value, get the prediction at the median tmax
  group_by(prcp) %>%
  summarize(is_day = is_day[tmax == median(numbats$tmax)])

numbats %>%
  mutate(is_day = as.numeric(is_day)) %>%
  ggplot(aes(prcp, is_day)) +
  geom_point() +
  geom_line(data = predict_prcp) +
  theme_classic()
Is this a bad model? You know it! But that’s not the point.
Wrap up
The reticulate package connects R and Python
Python installations are a lot more variable (and tricky!) than R
You can use both R and Python code, sharing data, in Quarto documents
R -> Python with r.___
Python -> R with py$___ (must library(reticulate) first!)
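A minimal round-trip sketch, assuming the intro-reticulate environment from earlier is active (so pandas is available to Python):
library(reticulate)
x <- head(mtcars)                            # an R data frame
py_run_string("described = r.x.describe()")  # Python reads R objects via r.___
py$described                                 # R reads Python objects via py$___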