Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2021

©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)

06. Data and Statistics: New Zealand Earthquakes, 2018 Swedish National Election and Pubs in Open Street Maps of DL & SE

Earthquakes

Here is an extract from https://en.wikipedia.org/wiki/Earthquake:

An earthquake (also known as a quake, tremor or temblor) is the shaking of the surface of the Earth resulting from a sudden release of energy in the Earth's lithosphere that creates seismic waves. Earthquakes can range in size from those that are so weak that they cannot be felt to those violent enough to propel objects and people into the air, and wreak destruction across entire cities. The seismicity, or seismic activity, of an area is the frequency, type, and size of earthquakes experienced over a period of time. The word tremor is also used for non-earthquake seismic rumbling.

At the Earth's surface, earthquakes manifest themselves by shaking and displacing or disrupting the ground. When the epicenter of a large earthquake is located offshore, the seabed may be displaced sufficiently to cause a tsunami. Earthquakes can also trigger landslides and, occasionally, volcanic activity.

In its most general sense, the word earthquake is used to describe any seismic event—whether natural or caused by humans—that generates seismic waves. Earthquakes are caused mostly by rupture of geological faults but also by other events such as volcanic activity, landslides, mine blasts, and nuclear tests. An earthquake's point of initial rupture is called its hypocenter or focus. The epicenter is the point at ground level directly above the hypocenter.

With some basic background on earthquakes in hand (seismologists are the domain experts in seismology, the study of earthquakes), we will analyse data fetched from New Zealand, which lies in the Ring of Fire along with Japan and the western coastlines of the American continents all the way down to Chile, as rendered below.

Here are typical epicentres all across the globe.

To make the ideas sink in, the following blog post reviewing probability theory by Terry Tao, one of the best mathematicians alive, is highly recommended, especially for students of Mathematics.

Live Data-fetching Exercise Now

Go to https://quakesearch.geonet.org.nz/ and download data on NZ earthquakes.

My attempt above to zoom out to include both islands of New Zealand (NZ) and get one year of data, using the Last Year button choice on this site, gave the following two URLs for downloading the data as CSV:

https://quakesearch.geonet.org.nz/csv?bbox=163.52051,-49.23912,182.19727,-32.36140&startdate=2017-06-01&enddate=2018-05-17T14:00:00

https://quakesearch.geonet.org.nz/csv?bbox=163.52051,-49.23912,182.19727,-32.36140&startdate=2017-5-17T13:00:00&enddate=2017-06-01

What should you do now?

Try to DOWNLOAD your own CSV data and store it in a file named my_earthquakes.csv (NOTE: rename the file when you download so you don't replace the file earthquakes.csv!) inside the folder named data that is inside the same directory that this notebook is in.

Browser settings: Make sure you have changed your browser settings so you get asked where a downloaded file should be saved each time you download.
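The download can also be scripted instead of clicked. The following is a minimal sketch, assuming the query-string layout seen in the URLs above (bbox, startdate, enddate) stays valid; the function names `quakesearch_csv_url` and `download_quakes` are made up for illustration, and the bbox order (west longitude, south latitude, east longitude, north latitude) is inferred from the example URL rather than taken from the GeoNet documentation.

```python
from urllib.parse import urlencode
from urllib.request import urlretrieve

# Base address taken from the example URLs in the text.
BASE_URL = "https://quakesearch.geonet.org.nz/csv"


def quakesearch_csv_url(bbox, startdate, enddate):
    """Build a CSV download URL (hypothetical helper).

    bbox is assumed to be (west_lon, south_lat, east_lon, north_lat),
    as inferred from the example URL; dates are passed through as strings.
    """
    bbox_str = ",".join(f"{value:.5f}" for value in bbox)
    # Keep commas and colons unescaped so the URL looks like the site's own.
    query = urlencode(
        {"bbox": bbox_str, "startdate": startdate, "enddate": enddate},
        safe=",:",
    )
    return f"{BASE_URL}?{query}"


def download_quakes(url, path="data/my_earthquakes.csv"):
    """Save the CSV at url to path (needs network access; data/ must exist)."""
    urlretrieve(url, path)
    return path


nz_bbox = (163.52051, -49.23912, 182.19727, -32.36140)
url = quakesearch_csv_url(nz_bbox, "2017-06-01", "2018-05-17T14:00:00")
print(url)
# download_quakes(url)  # uncomment to actually fetch the file
```

Writing to data/my_earthquakes.csv by default keeps the shared file data/earthquakes.csv untouched.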

Let's analyse the measured earthquakes in data/earthquakes.csv

This will ensure we are all looking at the same file!

But feel free to play with your own data/my_earthquakes.csv on the side.

Exercise:

Grab lat, lon, magnitude
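One possible way to approach this exercise is sketched below with csv.reader. It assumes the CSV has a header row containing columns named latitude, longitude and magnitude (possibly padded with spaces); check the header of your own file, since these names are an assumption here. The function name `grab_lat_lon_mag` and the two sample rows are made up purely so the sketch runs without the real file.

```python
import csv
import io


def grab_lat_lon_mag(csv_file):
    """Return a list of (latitude, longitude, magnitude) float triples.

    csv_file is any open text file (or file-like object) whose first row
    is a header. The column names used below are assumptions.
    """
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader)]
    i_lat = header.index("latitude")
    i_lon = header.index("longitude")
    i_mag = header.index("magnitude")
    triples = []
    for row in reader:
        if not row:  # skip blank lines
            continue
        triples.append((float(row[i_lat]), float(row[i_lon]), float(row[i_mag])))
    return triples


# Made-up rows standing in for the real file (NOT real earthquake data).
sample = io.StringIO(
    "publicid,origintime,longitude, latitude, magnitude, depth\n"
    "toy1,2018-01-01T00:00:00.000Z,174.5,-41.25,2.5,10.0\n"
    "toy2,2018-01-02T00:00:00.000Z,176.0,-38.5,3.0,5.0\n"
)
quakes = grab_lat_lon_mag(sample)
print(quakes)

# With the course file one would instead write something like:
# with open("data/earthquakes.csv", newline="") as f:
#     quakes = grab_lat_lon_mag(f)
```

Looking columns up by name rather than by fixed position keeps the sketch working if the column order differs between downloads.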

More on Data and Statistics

Recall that a statistic is any measurable function of the data: $T: \mathbb{X} \rightarrow \mathbb{T}$.

Thus, a statistic $T$ is also an RV that takes values in the measurable space $\mathbb{T}$.

When $x \in \mathbb{X}$ is the observed data, $T(x)=t$ is the observed statistic of the observed data $x$.
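To make the notation concrete, here is a small sketch in which the observed data $x$ is a list of magnitudes and two statistics, the sample mean and the sample maximum, are evaluated on it. The numbers are invented for illustration and the function names `T_mean` and `T_max` are not from the course code.

```python
from statistics import mean

# Observed data x: a made-up sample of magnitudes (illustration only).
x = [2.5, 3.0, 4.5, 2.0]


def T_mean(data):
    """Sample mean: a statistic mapping the data to a real number."""
    return mean(data)


def T_max(data):
    """Sample maximum: another statistic of the same data."""
    return max(data)


t_mean = T_mean(x)  # observed statistic t = T(x)
t_max = T_max(x)
print(t_mean, t_max)
```

Both functions depend only on the data, which is what makes them statistics; different choices of $T$ summarise the same $x$ in different ways.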

Can you imagine what $\Omega$ actually is for experiments or statistical models of earthquakes? In a nutshell, one can say that $\Omega$ here is everything that has been going on with planet Earth. However, by limiting ourselves to the data space $\mathbb{X}$ of the random variable $X$, we limit ourselves to the probability triple:

$$ (\Omega, \mathcal{F}_{\Omega},\mathbb{P}) \enspace, $$

up to what is measurable by the data $X$

$$ (\mathbb{X}, \mathcal{F}_{\mathbb{X}},\mathbb{P}) \enspace, $$

as induced by the triple $(\Omega, \mathcal{F}_{\Omega},\mathbb{P})$ through $X(\omega)$ and probability of any event $A$ in the sigma-algebra $\mathcal{F}_\mathbb{X}$, i.e.,

$$ \mathbb{P}(A) := \mathbb{P} \left( X^{[-1]}(A) \right) = \mathbb{P} \left( \{\omega \in \Omega: X(\omega) \in A \} \right), \quad \text{where } X^{[-1]}(A) \in \mathcal{F}_{\Omega}, \quad \forall A \in \mathcal{F}_{\mathbb{X}} \enspace. $$
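The idea of an induced probability, where the probability of an event $A$ in the data space is the probability of its preimage under $X$, can be checked on a toy example. The sketch below is not about earthquakes: it assumes a fair six-sided die as $\Omega$ and takes $X$ to be the parity of the outcome, purely for illustration.

```python
from fractions import Fraction

# Toy sample space: a fair die, with P on singletons (an assumption).
omega = [1, 2, 3, 4, 5, 6]
P = {w: Fraction(1, 6) for w in omega}


def X(w):
    """Random variable: 1 if the outcome is odd, 0 if it is even."""
    return w % 2


def preimage(A):
    """The set of outcomes w in omega with X(w) in A."""
    return {w for w in omega if X(w) in A}


def induced_probability(A):
    """Probability of event A in the data space, via the preimage of A."""
    return sum((P[w] for w in preimage(A)), Fraction(0))


print(preimage({1}))             # the odd outcomes
print(induced_probability({1}))  # probability that X is odd
```

On this finite example every subset is measurable, so the measurability condition on the preimage holds automatically; it only becomes a real restriction on richer spaces.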

Thus, we can leave $\Omega$ and $\mathcal{F}_{\Omega}$ to be "abstract" and focus on what we can actually measure, i.e., the data $X$ with data space $\mathbb{X}$ and its sigma algebra $\mathcal{F}_\mathbb{X}$. This approach allows $(\mathbb{X}, \mathcal{F}_{\mathbb{X}},\mathbb{P})$ to even evolve over time as we make progress in instrumentation (to observe more) and refine our statistical models to evolve $\mathbb{P}$ appropriately for the possibly refined sigma algebra.

We can think of a sequence of such refinements, in which the data space and the sigma algebra of what is measurable improve, as the following sequence of probability triples:

$$ (\mathbb{X}_1, \mathcal{F}_{\mathbb{X}_1},\mathbb{P}_1), (\mathbb{X}_2, \mathcal{F}_{\mathbb{X}_2},\mathbb{P}_2), \ldots $$

This is a simple way to understand how two or more experimenters (scientists, engineers, decision-makers in an organisation) may have different statistical models because they have different data spaces. The natural implications for the assumptions that the inverse maps $X_1^{[-1]}, X_2^{[-1]}, \ldots$ into $(\Omega, \mathcal{F}_{\Omega},\mathbb{P}_{[\cdot]})$ must satisfy simply follow from the axiomatic logical language of probability theory.

Clearly, all of this matters as we are interested in making decisions (classification, estimation, hypothesis testing, etc.) that aid us in taking actions through an appropriate statistic $T$ of the data $X$, which is merely another measurable mapping from

$$ (\mathbb{X}, \mathcal{F}_{\mathbb{X}},\mathbb{P}) \quad \text{to} \quad (\mathbb{T}, \mathcal{F}_{\mathbb{T}},\mathbb{P}) \enspace. $$

We generally assume that $\mathcal{F}_{\mathbb{X}}$ and $\mathcal{F}_{\mathbb{T}}$ are Borel sigma algebras generated by Borel sets in topological spaces $\mathbb{X}$ and $\mathbb{T}$, respectively. Recall:

In mathematics, a Borel set is any set in a topological space that can be formed from open sets (or, equivalently, from closed sets) through the operations of countable union, countable intersection, and relative complement. Borel sets are named after Émile Borel.

For a topological space X, the collection of all Borel sets on X forms a σ-algebra, known as the Borel algebra or Borel σ-algebra. The Borel algebra on X is the smallest σ-algebra containing all open sets (or, equivalently, all closed sets).

Borel sets are important in measure theory, since any measure defined on the open sets of a space, or on the closed sets of a space, must also be defined on all Borel sets of that space. Any measure defined on the Borel sets is called a Borel measure.

Let's Play Live with other datasets, shall we?

Swedish 2018 National Election Data

Swedish Election Outcomes 2018

See: http://www.lamastex.org/datasets/public/elections/2018/sv/README!

This was obtained by processing using the scripts at:

You already have this dataset in your /data directory.

Counting total votes per party

Let's quickly load the data using csv.reader and count the number of votes for each party over all of Sweden next.
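A minimal sketch of such a count with csv.reader and collections.Counter follows. The column names party and votes, the function name `count_votes` and the sample rows are all assumptions made for illustration; inspect the header of the actual election file in your data directory and adjust the names accordingly.

```python
import csv
import io
from collections import Counter


def count_votes(csv_file, party_col="party", votes_col="votes"):
    """Sum the votes per party over all rows of a CSV with a header row.

    party_col and votes_col are assumed column names, not confirmed ones.
    """
    reader = csv.reader(csv_file)
    header = [name.strip() for name in next(reader)]
    i_party = header.index(party_col)
    i_votes = header.index(votes_col)
    totals = Counter()
    for row in reader:
        if row:  # skip blank lines
            totals[row[i_party]] += int(row[i_votes])
    return totals


# Invented rows (NOT real election results), one row per district and party.
sample = io.StringIO(
    "region,district,party,votes\n"
    "North,District 1,Party A,120\n"
    "North,District 1,Party B,80\n"
    "South,District 2,Party A,30\n"
    "South,District 2,Party B,70\n"
)
totals = count_votes(sample)
for party, votes in totals.most_common():
    print(party, votes)
```

A Counter is convenient here because missing parties start at zero and most_common() gives the parties sorted by total votes.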

Geospatial adventures: Pubs in Open Street Maps of DL & SE

Say you want to visit some places of interest in Germany. This tutorial on Open Street Map's Overpass API shows you how to get the locations of "amenity"="biergarten" in Germany.
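As a rough sketch of what such a request can look like, the following builds an Overpass QL query for a given amenity within a country and sends it to a public Overpass endpoint using only the standard library. The helper names `build_overpass_query` and `fetch_overpass` are made up here, the endpoint is one of several public instances, and selecting the country by its ISO 3166-1 code is one common approach rather than necessarily the one used in the tutorial.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# A public Overpass API instance (others exist; availability may vary).
OVERPASS_URL = "https://overpass-api.de/api/interpreter"


def build_overpass_query(country_code, amenity):
    """Return Overpass QL selecting an amenity inside a country area.

    country_code is an ISO 3166-1 alpha-2 code such as "DE".
    """
    return f"""
[out:json];
area["ISO3166-1"="{country_code}"][admin_level=2];
(
  node["amenity"="{amenity}"](area);
  way["amenity"="{amenity}"](area);
  rel["amenity"="{amenity}"](area);
);
out center;
""".strip()


def fetch_overpass(query):
    """Send the query and return the decoded JSON (needs network access)."""
    url = OVERPASS_URL + "?" + urlencode({"data": query})
    with urlopen(url) as response:
        return json.load(response)


query = build_overpass_query("DE", "biergarten")
print(query)
# result = fetch_overpass(query)  # uncomment to query the live API
```

Keeping the country code and the amenity as parameters means the same builder covers other tags and countries, for example "pub" and "SE".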

We may come back to https://www.openstreetmap.org later. If we don't then you know where to go for openly available data for geospatial statistical analysis.

Pubs in Sweden

With a minor modification to the above code we can view amenity=pub in Sweden.
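To view such results on a map or in a scatter plot, the JSON returned by the Overpass API first has to be reduced to coordinates. The sketch below assumes the response was requested with `out center`, so that nodes carry lat and lon directly while ways and relations carry them under a center key; the function name `element_coordinates` and the sample response are invented for illustration.

```python
def element_coordinates(overpass_json):
    """Extract (lat, lon) pairs from a decoded Overpass JSON response."""
    coords = []
    for element in overpass_json.get("elements", []):
        if element.get("type") == "node":
            lat, lon = element.get("lat"), element.get("lon")
        else:
            # Ways and relations have a "center" when "out center" is used.
            center = element.get("center", {})
            lat, lon = center.get("lat"), center.get("lon")
        if lat is not None and lon is not None:
            coords.append((lat, lon))
    return coords


# Invented response in the shape Overpass returns (NOT real pubs).
toy_response = {
    "elements": [
        {"type": "node", "id": 1, "lat": 59.5, "lon": 18.0,
         "tags": {"amenity": "pub"}},
        {"type": "way", "id": 2, "center": {"lat": 57.75, "lon": 12.0},
         "tags": {"amenity": "pub"}},
        {"type": "relation", "id": 3, "tags": {"amenity": "pub"}},
    ]
}
pub_coords = element_coordinates(toy_response)
print(pub_coords)
```

Elements without usable coordinates are skipped rather than raising an error, which seems a reasonable default for exploratory plotting.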