An important note: in this section we read in the connectivity data, process it, and save it to a directory. To do so, the code must read model_depth_and_distance.nc and save the data we download to predictable places, so it is important that you run the code from a known location. In RStudio, the easiest way to do this is to go to the "Session" menu, choose "Set Working Directory", and then "To Source File Location". Your data files will then be saved into the directory this file is in, and the code will find model_depth_and_distance.nc in the connectivityData directory within it. You can also call setwd('/directory/with/data') explicitly to specify where the data will be kept.
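If you want to check that things are set up correctly before running anything, the working directory and the expected data location can be verified with base R (file.path() and file.exists() are base functions; the path below simply mirrors the layout described above, and the .nc file will only be present once it has been downloaded at least once):
getwd() #should print the directory this file is in
file.exists(file.path('connectivityData','EZfateFiles','model_depth_and_distance.nc')) #TRUE once the file has been downloaded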
The functions to retrieve the connectivity data are contained in the file connectivityUtilities.R, so we must source that file to load the functions into R. If you look at this file, you will see where the variable rootDataURL is defined. rootDataURL gives the location where the connectivity data is stored on a server, and it is here that you can switch between downloading the 2007-2022 global data and the 2007-2020 North and South America data. The latter was the initial test data, and was the first to be run for all depths and vertical behaviors.
source('connectivityUtilities.R')
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Linking to GEOS 3.11.0, GDAL 3.5.3, PROJ 9.1.0; sf_use_s2() is TRUE
##
##
## Attaching package: 'foreach'
##
##
## The following objects are masked from 'package:purrr':
##
## accumulate, when
##
##
## Loading required package: parallel
## [1] "file connectivityData/EZfateFiles/model_depth_and_distance.nc exists, no download"
Once this is done, you will see the function getConnectivityData(regionName,depth,year,verticalBehavior,month,minPLD,maxPLD,dataDir='connectivityData') defined. This function checks whether the requested data has already been downloaded and placed in dataDir. If it has, it reads it from dataDir and returns it. If it has not, it downloads it from the servers at the University of New Hampshire into dataDir and then loads it. If the data is ever updated on the server, you must delete it from dataDir to get the new data.
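The caching behavior described above follows a familiar check-then-download pattern. The sketch below is only an illustration of that pattern, not the actual code in connectivityUtilities.R; the file name and URL are placeholders:
#illustrative sketch only: look for a local copy, download it if absent, then read it
localFile<-file.path('connectivityData','someConnectivityFile.RDS') #hypothetical file name
if (!file.exists(localFile)) {
  download.file(url=paste0('https://example.edu/EZfate/',basename(localFile)), #placeholder URL
                destfile=localFile,mode='wb')
}
E<-readRDS(localFile)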
The function getConnectivityData takes the following arguments:
- regionName: A string naming the region the particles are released in (for example, 'theAmericas'); the available regions are shown in Plot_OceanRegions.RMD.
- depth: An integer. The depth the particles start at. You can only choose depths that exist on the server; these are given on the EZfate github page, and are currently 1, 10, 20 and 40 meters.
- year: Either an integer year (for example, 2007), or the string "climatology" if you prefer the particle tracks for all years that have been run. Again, see the github page for the details.
- verticalBehavior: A string that can be "fixed" for particles that stay at their starting depth, or "starts" for those that can be advected vertically. Other vertical behaviors may be added in the future; see the github page for what has been done.
- month: The month the particles are released over.
- minPLD: The minimum time in days the particles drift with the currents. Again, since the data is pre-computed, you can only use certain PLDs; currently you can choose 2, 4, 6, ..., 60 days.
- maxPLD: The maximum time in days the particles drift with the currents. For now, this should be the same as minPLD.
- dataDir: A string that defaults to 'connectivityData', giving the directory to save the data in. If you do not give an absolute path (e.g. '/home/name/..' or 'C:'), the data will be saved relative to the same directory as this file. If you are OK with the default, you may omit this argument.
So let's get the climatological data for particles released around North and South America in May at 1 m depth, with the 'starts' vertical behavior, drifting with the currents for 18 days.
You must create the directory "connectivityData" in the same directory as this file, and, in RStudio, use the "Session" menu item "Set Working Directory" to choose "To Source File Location".
The first time you run the code block below, it will download the data. After that it will just read it from the local copy.
regionName<-'theAmericas'
depth<-1
year<-'climatology'
verticalBehavior<-'starts'
month<-5
minPLD<-18; maxPLD<-minPLD
E1<-getConnectivityData(regionName,depth,year,verticalBehavior,month,minPLD,maxPLD)
## [1] "file connectivityData/theAmericas/1m/starts/climatology_month05_minPLD18_maxPLD18.RDS exists, no download"
You will notice the data.frame E1 in the workspace, with a large number of observations and six variables. The structure of this data will be discussed in the next section, but for now it is worth noting that it is big and that (if you look at it) it does not have any latitude or longitude data. It is also only for a single month, May (month 5). Let's pretend that the critter we are studying releases larvae in May and June. So in the rest of this section we will:
- get the corresponding connectivity data for June,
- add latitude and longitude to both months,
- subset both months to the East Coast of North America,
- combine the two months into a single data.frame, and
- save the result to a .RDS file so it can be reloaded later.
Getting the June data is the same as getting the May data, except that we change the month variable to 6. The other arguments need not be redefined, since they were set above. The June data is stored in the data.frame E2.
month<-6
E2<-getConnectivityData(regionName,depth,year,verticalBehavior,month,minPLD,maxPLD)
## [1] "file connectivityData/theAmericas/1m/starts/climatology_month06_minPLD18_maxPLD18.RDS exists, no download"
To save disk space, and to reduce the amount of data transferred across the network, the latitude and longitude data are not included in the connectivity data; instead, the data are stored as indices into the circulation model grid. However, for manipulating and plotting the connectivity it is necessary to have the latitude and longitude data, so we must add it using the addLatLon() function in the connectivityUtilities.R file.
E1<-addLatLon(E1)
E2<-addLatLon(E2)
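As a quick check that this worked, the starting-point coordinates should now appear as ordinary numeric columns (the plotting code below assumes addLatLon() names them lonFrom and latFrom):
head(E1[,c('lonFrom','latFrom')]) #the starting longitude and latitude of each connection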
We will discuss how the connectivity data in E1 and E2 is organized, and how to plot it, in the next section (04_dataStructure_andBasicPlotting.RMD), but let's make a quick plot of the spatial extent of E1 and E2 here so you can see what you have:
library("rnaturalearth")
library("rnaturalearthdata")
##
## Attaching package: 'rnaturalearthdata'
## The following object is masked from 'package:rnaturalearth':
##
## countries110
world <- ne_countries(scale = "medium", returnclass = "sf") #get coastline data
class(world)
## [1] "sf" "data.frame"
p<-ggplot(data = world) + geom_sf() +
coord_sf(xlim= c(-170, -22), ylim = c(-70, 80), expand = FALSE)+
geom_point(data = E1, aes(x = lonFrom, y = latFrom), size = 1, shape = 23, fill = "green",color='green')
print(p) #this makes the figure appear
The full American dataset has 468,276 starting points, and manipulating all of them can be slow and awkward. So connectivityUtilities.R has routines to take subsets of the connectivity data in many ways. Here the data is trimmed to only include points within a polygon that contains the East Coast of North America from Miami to the Gulf of Maine. We start by defining two vectors, lonPoly and latPoly, that contain the points that define the polygon (these were obtained from Google Maps). These are then made into a polygon with the WGS84 datum using sf's st_*() functions. Note that the first and last points of the polygon must be the same, to close the polygon.
latPoly<-c(25.0,38.2,46.47,45.21,44.21,27.95,25.12,25.0)
lonPoly<-c(-80.5,-85.5,-65.42,-62.57,-65.47,-73.86,-79.53,-80.5)
limitPoly=st_polygon(list(cbind(lonPoly,latPoly)))
limitPoly<-st_sfc(limitPoly,crs=4326) #make into an sf geometry column (sfc); 4326 is the EPSG code for the WGS84 datum
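As an optional sanity check (sf is already loaded by connectivityUtilities.R), you can confirm that the polygon was built correctly before using it:
st_is_valid(limitPoly) #should return TRUE; a self-intersecting ring would return FALSE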
Now the subsetConnectivity_byPolygon() function can be used to remove all points that do not lie within the polygon. It can be set so that we do not include points that leave the polygon (trimTo=TRUE) or so that we do include them (trimTo=FALSE). The implications of this will be discussed below, but for now, let's include connectivity that starts within the polygon but then leaves it.
E1<-subsetConnectivity_byPolygon(E1,limitPoly,trimTo=FALSE)
E2<-subsetConnectivity_byPolygon(E2,limitPoly,trimTo=FALSE)
And now let's plot the starting points that remain, along with the polygon in red, to see what we have left.
p<-ggplot(data = world) + geom_sf() + geom_sf(data=limitPoly,fill=NA,color='red') + #add limit polygon
coord_sf(xlim= c(-85, -60), ylim = c(23, 48), expand = FALSE)+
geom_point(data = E1, aes(x = lonFrom, y = latFrom), size = 1, shape = 23, fill = "green",color='black')
print(p) #this makes the figure appear
Now we have two months of connectivity in two data.frames, E1 and E2. We would like to combine them into a single data.frame E using the combineConnectivityData() function in connectivityUtilities.R. This can take a while, especially for big connectivity data.frames, which is why we subset the data before combining it.
E<-combineConnectivityData(E1,E2)
## Time to combine connectivity data:: 31.63 sec elapsed
Since loading, trimming and combining the data takes a significant amount of time, it is useful to save the connectivity data in E to a .RDS file, so we can simply reload it in later work. This file is called, entirely arbitrarily, mayJuneEastCoast_connectivity.RDS.
saveRDS(E,file='mayJuneEastCoast_connectivity.RDS')
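In later sessions you can then skip the download, trimming and combining steps entirely and just reload the saved object with base R's readRDS():
E<-readRDS('mayJuneEastCoast_connectivity.RDS') #reload the combined May and June East Coast connectivity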
In the next section, 04_dataStructure_and_basicPlotting.RMD, the contents of the connectivity data.frame E will be described, and basic dispersal plots will be demonstrated.