Introduction to Data Science: A Comp-Math-Stat Approach

1MS041, 2020

01. BASH Unix Shell

1. Dropping into BASH (Unix Shell) and using basic Shell commands
   - pwd --- print working directory
   - ls --- list files in current working directory
   - mkdir --- make directory
   - cd --- change directory
   - man ls --- manual page for ls (man works the same way for any command)
2. Grabbing files from the internet using curl

def showURL(url, ht=500):
    """Return an IFrame of the url to show in notebook with height ht"""
    from IPython.display import IFrame 
    return IFrame(url, width='95%', height=ht) 
showURL('https://en.wikipedia.org/wiki/Bash_(Unix_shell)',400)

1. Dropping into BASH (Unix Shell)

Placing %%sh on the first line of a code cell runs the rest of the cell as Unix shell commands, which gives us access to the shell from within the notebook.

Let us run pwd to print the working directory.

%%sh
pwd

/Users/avelin/git/1MS041/master/jp

%%sh
# this is a comment in BASH shell as it is preceded by '#'
ls # list the contents of this working directory

00.ipynb
01.ipynb
02.ipynb
03.ipynb
04.ipynb
05.ipynb
06.ipynb
07.ipynb
08.ipynb
09.ipynb
10.ipynb
11.ipynb
12.ipynb
13.ipynb
EXAM2019.ipynb
README.md
Untitled.ipynb
data
images
myHist.png

%%sh
mkdir mydir

%%sh
cd mydir
pwd
ls -al

/Users/avelin/git/1MS041/master/jp/mydir
total 0
drwxr-xr-x  2 sage sage  64 Aug 17 11:40 .
drwxr-xr-x 24 sage sage 768 Aug 17 11:40 ..

%%sh
pwd

/Users/avelin/git/1MS041/master/jp
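The last pwd above reports the original directory again, not mydir. Each %%sh cell runs in its own shell process, so a cd in one cell does not carry over to the next. The sketch below imitates this behaviour inside a single script by using a subshell; the scratch directory created with mktemp is only an illustrative stand-in for mydir, and the variable names are made up for this example.

```shell
# Command substitution $( ... ) runs its commands in a subshell,
# much like a separate %%sh cell runs in a separate shell process.
start_dir=$(pwd)                        # remember where we started
demo_dir=$(mktemp -d)                   # hypothetical scratch directory

inside_dir=$(cd "$demo_dir" && pwd)     # the cd only affects the subshell
after_dir=$(pwd)                        # the parent shell has not moved

echo "started in: $start_dir"
echo "inside cd:  $inside_dir"
echo "afterwards: $after_dir"

rmdir "$demo_dir"                       # clean up the scratch directory
```

This is why the later cells begin with cd mydir again whenever they need to work inside that directory.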

"Use the source" by `man`-ning the unknown `command`

By evaluating the next cell, you are using the manual pages to find out more about the command ls. You can learn more about any command, say command, by typing man command in the BASH shell.

The output of the next cell with command man ls will look something like the following:

LS(1)                            User Commands                           LS(1)

NAME
       ls - list directory contents

SYNOPSIS
       ls [OPTION]... [FILE]...

DESCRIPTION
       List  information  about  the FILEs (the current directory by default).
       Sort entries alphabetically if none of -cftuvSUX nor --sort  is  speci‐
       fied.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.

       -a, --all
              do not ignore entries starting with .

       -A, --almost-all
              do not list implied . and ..
...
...
...
   Exit status:
       0      if OK,

       1      if minor problems (e.g., cannot access subdirectory),

       2      if serious trouble (e.g., cannot access command-line argument).

AUTHOR
       Written by Richard M. Stallman and David MacKenzie.

REPORTING BUGS
       GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
       Report ls translation bugs to <http://translationproject.org/team/>

COPYRIGHT
       Copyright © 2017 Free Software Foundation, Inc.   License  GPLv3+:  GNU
       GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This  is  free  software:  you  are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       Full documentation at: <http://www.gnu.org/software/coreutils/ls>
       or available locally via: info '(coreutils) ls invocation'

GNU coreutils 8.28               January 2018                            LS(1)

%%sh
man ls

LS(1)                            User Commands                           LS(1)

NAME
       ls - list directory contents

SYNOPSIS
       ls [OPTION]... [FILE]...

DESCRIPTION
       List  information  about  the FILEs (the current directory by default).
       Sort entries alphabetically if none of -cftuvSUX nor --sort  is  speci‐
       fied.

       Mandatory  arguments  to  long  options are mandatory for short options
       too.

       -a, --all
              do not ignore entries starting with .

       -A, --almost-all
              do not list implied . and ..

       --author
              with -l, print the author of each file

       -b, --escape
              print C-style escapes for nongraphic characters

       --block-size=SIZE
              scale sizes by SIZE before printing them; e.g., '--block-size=M'
              prints sizes in units of 1,048,576 bytes; see SIZE format below

       -B, --ignore-backups
              do not list implied entries ending with ~

       -c     with -lt: sort by, and show, ctime (time of last modification of
              file status information); with -l: show ctime and sort by  name;
              otherwise: sort by ctime, newest first

       -C     list entries by columns

       --color[=WHEN]
              colorize  the output; WHEN can be 'always' (default if omitted),
              'auto', or 'never'; more info below

       -d, --directory
              list directories themselves, not their contents

       -D, --dired
              generate output designed for Emacs' dired mode

       -f     do not sort, enable -aU, disable -ls --color

       -F, --classify
              append indicator (one of */=>@|) to entries

       --file-type
              likewise, except do not append '*'

       --format=WORD
              across -x, commas -m, horizontal -x, long -l, single-column  -1,
              verbose -l, vertical -C

       --full-time
              like -l --time-style=full-iso

       -g     like -l, but do not list owner

       --group-directories-first
              group directories before files;

              can   be  augmented  with  a  --sort  option,  but  any  use  of
              --sort=none (-U) disables grouping

       -G, --no-group
              in a long listing, don't print group names

       -h, --human-readable
              with -l and/or -s, print human readable sizes (e.g., 1K 234M 2G)

       --si   likewise, but use powers of 1000 not 1024

       -H, --dereference-command-line
              follow symbolic links listed on the command line

       --dereference-command-line-symlink-to-dir
              follow each command line symbolic link

              that points to a directory

       --hide=PATTERN
              do not list implied entries matching shell  PATTERN  (overridden
              by -a or -A)

       --indicator-style=WORD
              append indicator with style WORD to entry names: none (default),
              slash (-p), file-type (--file-type), classify (-F)

       -i, --inode
              print the index number of each file

       -I, --ignore=PATTERN
              do not list implied entries matching shell PATTERN

       -k, --kibibytes
              default to 1024-byte blocks for disk usage

       -l     use a long listing format

       -L, --dereference
              when showing file information for a symbolic link, show informa‐
              tion  for  the file the link references rather than for the link
              itself

       -m     fill width with a comma separated list of entries

       -n, --numeric-uid-gid
              like -l, but list numeric user and group IDs

       -N, --literal
              print raw entry names (don't treat e.g. control characters  spe‐
              cially)

       -o     like -l, but do not list group information

       -p, --indicator-style=slash
              append / indicator to directories

       -q, --hide-control-chars
              print ? instead of nongraphic characters

       --show-control-chars
              show nongraphic characters as-is (the default, unless program is
              'ls' and output is a terminal)

       -Q, --quote-name
              enclose entry names in double quotes

       --quoting-style=WORD
              use quoting style WORD for entry names: literal, locale,  shell,
              shell-always, shell-escape, shell-escape-always, c, escape

       -r, --reverse
              reverse order while sorting

       -R, --recursive
              list subdirectories recursively

       -s, --size
              print the allocated size of each file, in blocks

       -S     sort by file size, largest first

       --sort=WORD
              sort  by  WORD instead of name: none (-U), size (-S), time (-t),
              version (-v), extension (-X)

       --time=WORD
              with -l, show time as WORD instead of default modification time:
              atime  or  access  or  use  (-u); ctime or status (-c); also use
              specified time as sort key if --sort=time (newest first)

       --time-style=STYLE
              with -l, show times using style STYLE: full-iso, long-iso,  iso,
              locale,  or  +FORMAT;  FORMAT  is interpreted like in 'date'; if
              FORMAT  is  FORMAT1<newline>FORMAT2,  then  FORMAT1  applies  to
              non-recent  files  and FORMAT2 to recent files; if STYLE is pre‐
              fixed with 'posix-', STYLE takes effect only outside  the  POSIX
              locale

       -t     sort by modification time, newest first

       -T, --tabsize=COLS
              assume tab stops at each COLS instead of 8

       -u     with  -lt:  sort by, and show, access time; with -l: show access
              time and sort by name; otherwise: sort by  access  time,  newest
              first

       -U     do not sort; list entries in directory order

       -v     natural sort of (version) numbers within text

       -w, --width=COLS
              set output width to COLS.  0 means no limit

       -x     list entries by lines instead of by columns

       -X     sort alphabetically by entry extension

       -Z, --context
              print any security context of each file

       -1     list one file per line.  Avoid '\n' with -q or -b

       --help display this help and exit

       --version
              output version information and exit

       The  SIZE  argument  is  an  integer and optional unit (example: 10K is
       10*1024).  Units are K,M,G,T,P,E,Z,Y  (powers  of  1024)  or  KB,MB,...
       (powers of 1000).

       Using  color  to distinguish file types is disabled both by default and
       with --color=never.  With --color=auto, ls emits color codes only  when
       standard  output is connected to a terminal.  The LS_COLORS environment
       variable can change the settings.  Use the dircolors command to set it.

   Exit status:
       0      if OK,

       1      if minor problems (e.g., cannot access subdirectory),

       2      if serious trouble (e.g., cannot access command-line argument).

AUTHOR
       Written by Richard M. Stallman and David MacKenzie.

REPORTING BUGS
       GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
       Report ls translation bugs to <http://translationproject.org/team/>

COPYRIGHT
       Copyright © 2016 Free Software Foundation, Inc.   License  GPLv3+:  GNU
       GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
       This  is  free  software:  you  are free to change and redistribute it.
       There is NO WARRANTY, to the extent permitted by law.

SEE ALSO
       Full documentation at: <http://www.gnu.org/software/coreutils/ls>
       or available locally via: info '(coreutils) ls invocation'

GNU coreutils 8.25               February 2017                           LS(1)
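As a small illustration of two options described in the manual page above, the following sketch compares ls, ls -a and ls -A in a scratch directory. The directory and the file names visible.txt and .hidden are invented for this example and are not part of the course material.

```shell
demo_dir=$(mktemp -d)                    # hypothetical scratch directory
touch "$demo_dir/visible.txt" "$demo_dir/.hidden"

plain=$(ls "$demo_dir")                  # default listing
all=$(ls -a "$demo_dir")                 # -a: do not ignore entries starting with .
almost_all=$(ls -A "$demo_dir")          # -A: like -a, but without . and ..

echo "ls:"
echo "$plain"
echo "ls -a:"
echo "$all"
echo "ls -A:"
echo "$almost_all"

rm -r "$demo_dir"                        # clean up
```

Trying an option on a throwaway directory like this is a quick way to check that you have read its manual entry correctly.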

2. Grabbing files from the internet using curl

%%sh
cd mydir
curl -O http://lamastex.org/datasets/public/SOU/sou/20170228.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29323  100 29323    0     0  22872      0  0:00:01  0:00:01 --:--:-- 22855

%%sh
ls mydir/

20170228.txt
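The -O option used above tells curl to save the download under the file name at the end of the URL, which is why the file appears as 20170228.txt. The sketch below shows -O next to -o, which lets you choose the local name. So that it runs without network access, it fetches from a local file:// URL instead of the course server; the scratch directories and the names sample.txt and renamed.txt are assumptions made only for this example.

```shell
src_dir=$(mktemp -d)                     # stands in for the remote server
dest_dir=$(mktemp -d)                    # stands in for mydir
printf 'line one\nline two\n' > "$src_dir/sample.txt"

cd "$dest_dir" || exit 1                 # stop if the directory is missing

# -O saves under the name at the end of the URL, -s hides the progress meter,
# -f makes curl exit with a non-zero status on HTTP errors
curl -f -s -O "file://$src_dir/sample.txt"

# -o lets us pick the local file name ourselves
curl -f -s -o renamed.txt "file://$src_dir/sample.txt"

ls "$dest_dir"
```

With a real http:// URL the same options apply; only the source of the file changes.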

%%sh
cd mydir/
head 20170228.txt

Donald J. Trump 

February 28, 2017 
Thank you very much. Mr. Speaker, Mr. Vice President, members of Congress, the first lady of the United States ... 
... and citizens of America, tonight, as we mark the conclusion of our celebration of Black History Month, we are reminded of our nation's path toward civil rights and the work that still remains to be done. 
Recent threats ... 
Recent threats targeting Jewish community centers and vandalism of Jewish cemeteries, as well as last week's shooting in Kansas City, remind us that while we may be a nation divided on policies, we are a country that stands united in condemning hate and evil in all of its very ugly forms. 
Each American generation passes the torch of truth, liberty and justice, in an unbroken chain all the way down to the present. That torch is now in our hands. And we will use it to light up the world. 
I am here tonight to deliver a message of unity and strength, and it is a message deeply delivered from my heart. A new chapter ... 
... of American greatness is now beginning. A new national pride is sweeping across our nation. And a new surge of optimism is placing impossible dreams firmly within our grasp. What we are witnessing today is the renewal of the American spirit. Our allies will find that America is once again ready to lead.

To have more fun with all SOU addresses

Do the following:

%%sh
mkdir -p mydir # create the directory 'mydir' if it does not already exist
cd mydir # change into this mydir directory
rm -f sou.tar.gz # remove any earlier copy of sou.tar.gz in mydir (no error if there is none)
curl -O http://lamastex.org/datasets/public/SOU/sou.tar.gz # download the archive, keeping its remote file name

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3566k  100 3566k    0     0   333k      0  0:00:10  0:00:10 --:--:--  285k
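The -p and -f options are what make the cell above safe to evaluate more than once: mkdir -p does not complain if the directory already exists, and rm -f does not complain if there is nothing to remove. The following sketch checks the exit status of each command in a scratch directory; the scratch location and the variable names are assumptions for illustration only.

```shell
work_dir=$(mktemp -d)            # hypothetical scratch location for the demo
cd "$work_dir" || exit 1

mkdir -p mydir                   # first call creates mydir
mkdir -p mydir                   # second call: mydir exists, still succeeds
mkdir_status=$?

rm -f mydir/sou.tar.gz           # nothing to remove, but -f suppresses the error
rm_status=$?

echo "mkdir -p exit status: $mkdir_status"
echo "rm -f exit status:    $rm_status"
```

An exit status of 0 means success; without -p or -f, the repeated commands would report an error and return a non-zero status.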

%%sh
pwd
ls -lh mydir

/Users/avelin/git/1MS041/master/jp
total 4.1M
-rw-r--r-- 1 sage sage  29K Aug 17 11:40 20170228.txt
-rw-r--r-- 1 sage sage 3.5M Aug 17 11:40 sou.tar.gz

%%sh
cd mydir 
tar zxvf sou.tar.gz

sou/
sou/18111105.txt
sou/20040120.txt
sou/19061203.txt
sou/18411207.txt
sou/19091207.txt
sou/18701205.txt
sou/19410106.txt
sou/18571208.txt
sou/18891203.txt
sou/18341201.txt
sou/19660112.txt
sou/17981208.txt
sou/19610130.txt
sou/18140920.txt
sou/18011208.txt
sou/18811206.txt
sou/18281202.txt
sou/19840125.txt
sou/18611203.txt
sou/18731201.txt
sou/19400103.txt
sou/19630114.txt
sou/19281204.txt
sou/19221208.txt
sou/19031207.txt
sou/18681209.txt
sou/18431206.txt
sou/18861206.txt
sou/19261207.txt
sou/19271206.txt
sou/19141208.txt
sou/18791201.txt
sou/19131202.txt
sou/19041206.txt
sou/18001111.txt
sou/18041108.txt
sou/20010227.txt
sou/18621201.txt
sou/19251208.txt
sou/19700122.txt
sou/19790125.txt
sou/19870127.txt
sou/20050202.txt
sou/18331203.txt
sou/17961207.txt
sou/18021215.txt
sou/18771203.txt
sou/19890209.txt
sou/18301206.txt
sou/18121104.txt
sou/19580109.txt
sou/20110125.txt
sou/19450106.txt
sou/18031017.txt
sou/19301202.txt
sou/18661203.txt
sou/19520109.txt
sou/19620111.txt
sou/18531205.txt
sou/19610112.txt
sou/19430107.txt
sou/19960123.txt
sou/17911025.txt
sou/18211203.txt
sou/18951207.txt
sou/18901201.txt
sou/18721202.txt
sou/20140128.txt
sou/18361205.txt
sou/18101205.txt
sou/18081108.txt
sou/18961204.txt
sou/18871206.txt
sou/18781202.txt
sou/19480107.txt
sou/19001203.txt
sou/18421206.txt
sou/18241207.txt
sou/18131207.txt
sou/19500104.txt
sou/20010920.txt
sou/19940125.txt
sou/19850206.txt
sou/18541204.txt
sou/17921106.txt
sou/19800121.txt
sou/19311208.txt
sou/18461208.txt
sou/19161205.txt
sou/19121203.txt
sou/19370106.txt
sou/19151207.txt
sou/19051205.txt
sou/19021202.txt
sou/18321204.txt
sou/18671203.txt
sou/18651204.txt
sou/19510108.txt
sou/18581206.txt
sou/18161203.txt
sou/19390104.txt
sou/19321206.txt
sou/18641206.txt
sou/20070123.txt
sou/18691206.txt
sou/17991203.txt
sou/18551231.txt
sou/19440111.txt
sou/19910129.txt
sou/18921206.txt
sou/18061202.txt
sou/19470106.txt
sou/19590109.txt
sou/18151205.txt
sou/18751207.txt
sou/18981205.txt
sou/20090224.txt
sou/18071027.txt
sou/18171212.txt
sou/18821204.txt
sou/19211206.txt
sou/18371205.txt
sou/19181202.txt
sou/19720120.txt
sou/18601203.txt
sou/19530107.txt
sou/18741207.txt
sou/19460121.txt
sou/19350104.txt
sou/19201207.txt
sou/18591219.txt
sou/18221203.txt
sou/19231206.txt
sou/19730202.txt
sou/19291203.txt
sou/19820126.txt
sou/20060131.txt
sou/18501202.txt
sou/17931203.txt
sou/18711204.txt
sou/18441203.txt
sou/17900108.txt
sou/18561202.txt
sou/18381203.txt
sou/19071203.txt
sou/19570110.txt
sou/19171204.txt
sou/18181116.txt
sou/19420106.txt
sou/18201114.txt
sou/20130212.txt
sou/20030128.txt
sou/19340103.txt
sou/19970204.txt
sou/19810116.txt
sou/18841201.txt
sou/19101206.txt
sou/18351207.txt
sou/17951208.txt
sou/19530202.txt
sou/18971206.txt
sou/18231202.txt
sou/18831204.txt
sou/19490105.txt
sou/18991205.txt
sou/18391202.txt
sou/18481205.txt
sou/18631208.txt
sou/19930217.txt
sou/19111205.txt
sou/19740130.txt
sou/20160112.txt
sou/19880125.txt
sou/19690114.txt
sou/19360103.txt
sou/20120124.txt
sou/18091129.txt
sou/19680117.txt
sou/18851208.txt
sou/20100127.txt
sou/19191202.txt
sou/19670110.txt
sou/18451202.txt
sou/19241203.txt
sou/20080128.txt
sou/18291208.txt
sou/19980127.txt
sou/18401205.txt
sou/18471207.txt
sou/19540107.txt
sou/18191207.txt
sou/18801206.txt
sou/18491204.txt
sou/18881203.txt
sou/19950124.txt
sou/19380103.txt
sou/18761205.txt
sou/18931203.txt
sou/18051203.txt
sou/18271204.txt
sou/19900131.txt
sou/18941202.txt
sou/20150120.txt
sou/17941119.txt
sou/18261205.txt
sou/19710122.txt
sou/17901208.txt
sou/19081208.txt
sou/19550106.txt
sou/17971122.txt
sou/19560105.txt
sou/19640108.txt
sou/20020129.txt
sou/19920128.txt
sou/18511202.txt
sou/18311206.txt
sou/19770112.txt
sou/18521206.txt
sou/19760119.txt
sou/18251206.txt
sou/19860204.txt
sou/19830125.txt
sou/19780119.txt
sou/19650104.txt
sou/19990119.txt
sou/20000127.txt
sou/19600107.txt
sou/19011203.txt
sou/19750115.txt
sou/18911209.txt
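In tar zxvf, z means the archive is gzip-compressed, x means extract, v prints each name as it is processed, and f says that the next argument is the archive file. Replacing x with t lists the contents without extracting anything, which is a useful check before unpacking an unfamiliar archive. The sketch below builds a tiny stand-in archive to show this; the name demo.tar.gz and its single file are made up for the example.

```shell
work_dir=$(mktemp -d)                    # hypothetical scratch directory
cd "$work_dir" || exit 1

# build a tiny stand-in for sou.tar.gz
mkdir sou
echo "placeholder text" > sou/17900108.txt
tar czf demo.tar.gz sou                  # c = create, z = gzip, f = archive name
rm -r sou                                # remove the originals

listing=$(tar tzf demo.tar.gz)           # t = list contents without extracting
echo "$listing"

tar xzf demo.tar.gz                      # x = extract (add v to print each name)
cat sou/17900108.txt
```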

After running the two cells above that download and extract sou.tar.gz, you should have the SOU (State of the Union) addresses from 1790 to 2016 in mydir/sou. Evaluating the ls command in the next cell should show the SOU files, something like the following:

total 11M
-rw------- 1 raazesh raazesh 6.6K Feb 18  2016 17900108.txt
-rw------- 1 raazesh raazesh 8.3K Feb 18  2016 17901208.txt
-rw------- 1 raazesh raazesh  14K Feb 18  2016 17911025.txt
...
...
...
-rw------- 1 raazesh raazesh  39K Feb 18  2016 20140128.txt
-rw------- 1 raazesh raazesh  38K Feb 18  2016 20150120.txt
-rw------- 1 raazesh raazesh  31K Feb 18  2016 20160112.txt

%%sh
ls -lh mydir/sou

total 11M
-rw------- 1 sage sage 6.6K Feb 18  2016 17900108.txt
-rw------- 1 sage sage 8.3K Feb 18  2016 17901208.txt
-rw------- 1 sage sage  14K Feb 18  2016 17911025.txt
-rw------- 1 sage sage  13K Feb 18  2016 17921106.txt
-rw------- 1 sage sage  12K Feb 18  2016 17931203.txt
-rw------- 1 sage sage  18K Feb 18  2016 17941119.txt
-rw------- 1 sage sage  13K Feb 18  2016 17951208.txt
-rw------- 1 sage sage  17K Feb 18  2016 17961207.txt
-rw------- 1 sage sage  13K Feb 18  2016 17971122.txt
-rw------- 1 sage sage  14K Feb 18  2016 17981208.txt
-rw------- 1 sage sage 9.1K Feb 18  2016 17991203.txt
-rw------- 1 sage sage 8.2K Feb 18  2016 18001111.txt
-rw------- 1 sage sage  19K Feb 18  2016 18011208.txt
-rw------- 1 sage sage  13K Feb 18  2016 18021215.txt
-rw------- 1 sage sage  14K Feb 18  2016 18031017.txt
-rw------- 1 sage sage  13K Feb 18  2016 18041108.txt
-rw------- 1 sage sage  17K Feb 18  2016 18051203.txt
-rw------- 1 sage sage  17K Feb 18  2016 18061202.txt
-rw------- 1 sage sage  14K Feb 18  2016 18071027.txt
-rw------- 1 sage sage  16K Feb 18  2016 18081108.txt
-rw------- 1 sage sage  11K Feb 18  2016 18091129.txt
-rw------- 1 sage sage  15K Feb 18  2016 18101205.txt
-rw------- 1 sage sage  14K Feb 18  2016 18111105.txt
-rw------- 1 sage sage  20K Feb 18  2016 18121104.txt
-rw------- 1 sage sage  20K Feb 18  2016 18131207.txt
-rw------- 1 sage sage  13K Feb 18  2016 18140920.txt
-rw------- 1 sage sage  19K Feb 18  2016 18151205.txt
-rw------- 1 sage sage  20K Feb 18  2016 18161203.txt
-rw------- 1 sage sage  26K Feb 18  2016 18171212.txt
-rw------- 1 sage sage  26K Feb 18  2016 18181116.txt
-rw------- 1 sage sage  28K Feb 18  2016 18191207.txt
-rw------- 1 sage sage  21K Feb 18  2016 18201114.txt
-rw------- 1 sage sage  34K Feb 18  2016 18211203.txt
-rw------- 1 sage sage  28K Feb 18  2016 18221203.txt
-rw------- 1 sage sage  38K Feb 18  2016 18231202.txt
-rw------- 1 sage sage  49K Feb 18  2016 18241207.txt
-rw------- 1 sage sage  53K Feb 18  2016 18251206.txt
-rw------- 1 sage sage  46K Feb 18  2016 18261205.txt
-rw------- 1 sage sage  42K Feb 18  2016 18271204.txt
-rw------- 1 sage sage  44K Feb 18  2016 18281202.txt
-rw------- 1 sage sage  62K Feb 18  2016 18291208.txt
-rw------- 1 sage sage  89K Feb 18  2016 18301206.txt
-rw------- 1 sage sage  42K Feb 18  2016 18311206.txt
-rw------- 1 sage sage  46K Feb 18  2016 18321204.txt
-rw------- 1 sage sage  46K Feb 18  2016 18331203.txt
-rw------- 1 sage sage  79K Feb 18  2016 18341201.txt
-rw------- 1 sage sage  63K Feb 18  2016 18351207.txt
-rw------- 1 sage sage  72K Feb 18  2016 18361205.txt
-rw------- 1 sage sage  68K Feb 18  2016 18371205.txt
-rw------- 1 sage sage  69K Feb 18  2016 18381203.txt
-rw------- 1 sage sage  79K Feb 18  2016 18391202.txt
-rw------- 1 sage sage  54K Feb 18  2016 18401205.txt
-rw------- 1 sage sage  48K Feb 18  2016 18411207.txt
-rw------- 1 sage sage  49K Feb 18  2016 18421206.txt
-rw------- 1 sage sage  47K Feb 18  2016 18431206.txt
-rw------- 1 sage sage  55K Feb 18  2016 18441203.txt
-rw------- 1 sage sage  94K Feb 18  2016 18451202.txt
-rw------- 1 sage sage 106K Feb 18  2016 18461208.txt
-rw------- 1 sage sage  95K Feb 18  2016 18471207.txt
-rw------- 1 sage sage 125K Feb 18  2016 18481205.txt
-rw------- 1 sage sage  45K Feb 18  2016 18491204.txt
-rw------- 1 sage sage  49K Feb 18  2016 18501202.txt
-rw------- 1 sage sage  78K Feb 18  2016 18511202.txt
-rw------- 1 sage sage  59K Feb 18  2016 18521206.txt
-rw------- 1 sage sage  57K Feb 18  2016 18531205.txt
-rw------- 1 sage sage  61K Feb 18  2016 18541204.txt
-rw------- 1 sage sage  69K Feb 18  2016 18551231.txt
-rw------- 1 sage sage  63K Feb 18  2016 18561202.txt
-rw------- 1 sage sage  81K Feb 18  2016 18571208.txt
-rw------- 1 sage sage  97K Feb 18  2016 18581206.txt
-rw------- 1 sage sage  73K Feb 18  2016 18591219.txt
-rw------- 1 sage sage  83K Feb 18  2016 18601203.txt
-rw------- 1 sage sage  41K Feb 18  2016 18611203.txt
-rw------- 1 sage sage  49K Feb 18  2016 18621201.txt
-rw------- 1 sage sage  37K Feb 18  2016 18631208.txt
-rw------- 1 sage sage  36K Feb 18  2016 18641206.txt
-rw------- 1 sage sage  54K Feb 18  2016 18651204.txt
-rw------- 1 sage sage  44K Feb 18  2016 18661203.txt
-rw------- 1 sage sage  70K Feb 18  2016 18671203.txt
-rw------- 1 sage sage  60K Feb 18  2016 18681209.txt
-rw------- 1 sage sage  46K Feb 18  2016 18691206.txt
-rw------- 1 sage sage  51K Feb 18  2016 18701205.txt
-rw------- 1 sage sage  38K Feb 18  2016 18711204.txt
-rw------- 1 sage sage  24K Feb 18  2016 18721202.txt
-rw------- 1 sage sage  59K Feb 18  2016 18731201.txt
-rw------- 1 sage sage  54K Feb 18  2016 18741207.txt
-rw------- 1 sage sage  72K Feb 18  2016 18751207.txt
-rw------- 1 sage sage  40K Feb 18  2016 18761205.txt
-rw------- 1 sage sage  48K Feb 18  2016 18771203.txt
-rw------- 1 sage sage  48K Feb 18  2016 18781202.txt
-rw------- 1 sage sage  70K Feb 18  2016 18791201.txt
-rw------- 1 sage sage  41K Feb 18  2016 18801206.txt
-rw------- 1 sage sage  24K Feb 18  2016 18811206.txt
-rw------- 1 sage sage  19K Feb 18  2016 18821204.txt
-rw------- 1 sage sage  24K Feb 18  2016 18831204.txt
-rw------- 1 sage sage  54K Feb 18  2016 18841201.txt
-rw------- 1 sage sage 119K Feb 18  2016 18851208.txt
-rw------- 1 sage sage  91K Feb 18  2016 18861206.txt
-rw------- 1 sage sage  31K Feb 18  2016 18871206.txt
-rw------- 1 sage sage  55K Feb 18  2016 18881203.txt
-rw------- 1 sage sage  77K Feb 18  2016 18891203.txt
-rw------- 1 sage sage  68K Feb 18  2016 18901201.txt
-rw------- 1 sage sage  95K Feb 18  2016 18911209.txt
-rw------- 1 sage sage  80K Feb 18  2016 18921206.txt
-rw------- 1 sage sage  75K Feb 18  2016 18931203.txt
-rw------- 1 sage sage  96K Feb 18  2016 18941202.txt
-rw------- 1 sage sage  88K Feb 18  2016 18951207.txt
-rw------- 1 sage sage  93K Feb 18  2016 18961204.txt
-rw------- 1 sage sage  72K Feb 18  2016 18971206.txt
-rw------- 1 sage sage 121K Feb 18  2016 18981205.txt
-rw------- 1 sage sage  91K Feb 18  2016 18991205.txt
-rw------- 1 sage sage 116K Feb 18  2016 19001203.txt
-rw------- 1 sage sage 114K Feb 18  2016 19011203.txt
-rw------- 1 sage sage  57K Feb 18  2016 19021202.txt
-rw------- 1 sage sage  89K Feb 18  2016 19031207.txt
-rw------- 1 sage sage 102K Feb 18  2016 19041206.txt
-rw------- 1 sage sage 144K Feb 18  2016 19051205.txt
-rw------- 1 sage sage 135K Feb 18  2016 19061203.txt
-rw------- 1 sage sage 159K Feb 18  2016 19071203.txt
-rw------- 1 sage sage 113K Feb 18  2016 19081208.txt
-rw------- 1 sage sage  83K Feb 18  2016 19091207.txt
-rw------- 1 sage sage  42K Feb 18  2016 19101206.txt
-rw------- 1 sage sage 141K Feb 18  2016 19111205.txt
-rw------- 1 sage sage 150K Feb 18  2016 19121203.txt
-rw------- 1 sage sage  21K Feb 18  2016 19131202.txt
-rw------- 1 sage sage  25K Feb 18  2016 19141208.txt
-rw------- 1 sage sage  44K Feb 18  2016 19151207.txt
-rw------- 1 sage sage  13K Feb 18  2016 19161205.txt
-rw------- 1 sage sage  22K Feb 18  2016 19171204.txt
-rw------- 1 sage sage  31K Feb 18  2016 19181202.txt
-rw------- 1 sage sage  28K Feb 18  2016 19191202.txt
-rw------- 1 sage sage  16K Feb 18  2016 19201207.txt
-rw------- 1 sage sage  34K Feb 18  2016 19211206.txt
-rw------- 1 sage sage  35K Feb 18  2016 19221208.txt
-rw------- 1 sage sage  41K Feb 18  2016 19231206.txt
-rw------- 1 sage sage  42K Feb 18  2016 19241203.txt
-rw------- 1 sage sage  65K Feb 18  2016 19251208.txt
-rw------- 1 sage sage  62K Feb 18  2016 19261207.txt
-rw------- 1 sage sage  53K Feb 18  2016 19271206.txt
-rw------- 1 sage sage  49K Feb 18  2016 19281204.txt
-rw------- 1 sage sage  68K Feb 18  2016 19291203.txt
-rw------- 1 sage sage  29K Feb 18  2016 19301202.txt
-rw------- 1 sage sage  36K Feb 18  2016 19311208.txt
-rw------- 1 sage sage  26K Feb 18  2016 19321206.txt
-rw------- 1 sage sage  14K Feb 18  2016 19340103.txt
-rw------- 1 sage sage  21K Feb 18  2016 19350104.txt
-rw------- 1 sage sage  22K Feb 18  2016 19360103.txt
-rw------- 1 sage sage  17K Feb 18  2016 19370106.txt
-rw------- 1 sage sage  28K Feb 18  2016 19380103.txt
-rw------- 1 sage sage  23K Feb 18  2016 19390104.txt
-rw------- 1 sage sage  19K Feb 18  2016 19400103.txt
-rw------- 1 sage sage  19K Feb 18  2016 19410106.txt
-rw------- 1 sage sage  20K Feb 18  2016 19420106.txt
-rw------- 1 sage sage  26K Feb 18  2016 19430107.txt
-rw------- 1 sage sage  22K Feb 18  2016 19440111.txt
-rw------- 1 sage sage  48K Feb 18  2016 19450106.txt
-rw------- 1 sage sage 171K Feb 18  2016 19460121.txt
-rw------- 1 sage sage  37K Feb 18  2016 19470106.txt
-rw------- 1 sage sage  30K Feb 18  2016 19480107.txt
-rw------- 1 sage sage  21K Feb 18  2016 19490105.txt
-rw------- 1 sage sage  30K Feb 18  2016 19500104.txt
-rw------- 1 sage sage  23K Feb 18  2016 19510108.txt
-rw------- 1 sage sage  30K Feb 18  2016 19520109.txt
-rw------- 1 sage sage  56K Feb 18  2016 19530107.txt
-rw------- 1 sage sage  43K Feb 18  2016 19530202.txt
-rw------- 1 sage sage  37K Feb 18  2016 19540107.txt
-rw------- 1 sage sage  46K Feb 18  2016 19550106.txt
-rw------- 1 sage sage  51K Feb 18  2016 19560105.txt
-rw------- 1 sage sage  26K Feb 18  2016 19570110.txt
-rw------- 1 sage sage  30K Feb 18  2016 19580109.txt
-rw------- 1 sage sage  30K Feb 18  2016 19590109.txt
-rw------- 1 sage sage  35K Feb 18  2016 19600107.txt
-rw------- 1 sage sage  40K Feb 18  2016 19610112.txt
-rw------- 1 sage sage  31K Feb 18  2016 19610130.txt
-rw------- 1 sage sage  39K Feb 18  2016 19620111.txt
-rw------- 1 sage sage  31K Feb 18  2016 19630114.txt
-rw------- 1 sage sage  19K Feb 18  2016 19640108.txt
-rw------- 1 sage sage  25K Feb 18  2016 19650104.txt
-rw------- 1 sage sage  30K Feb 18  2016 19660112.txt
-rw------- 1 sage sage  41K Feb 18  2016 19670110.txt
-rw------- 1 sage sage  29K Feb 18  2016 19680117.txt
-rw------- 1 sage sage  24K Feb 18  2016 19690114.txt
-rw------- 1 sage sage  25K Feb 18  2016 19700122.txt
-rw------- 1 sage sage  26K Feb 18  2016 19710122.txt
-rw------- 1 sage sage  23K Feb 18  2016 19720120.txt
-rw------- 1 sage sage 9.7K Feb 18  2016 19730202.txt
-rw------- 1 sage sage  29K Feb 18  2016 19740130.txt
-rw------- 1 sage sage  25K Feb 18  2016 19750115.txt
-rw------- 1 sage sage  30K Feb 18  2016 19760119.txt
-rw------- 1 sage sage  28K Feb 18  2016 19770112.txt
-rw------- 1 sage sage  26K Feb 18  2016 19780119.txt
-rw------- 1 sage sage  20K Feb 18  2016 19790125.txt
-rw------- 1 sage sage  20K Feb 18  2016 19800121.txt
-rw------- 1 sage sage 213K Feb 18  2016 19810116.txt
-rw------- 1 sage sage  31K Feb 18  2016 19820126.txt
-rw------- 1 sage sage  33K Feb 18  2016 19830125.txt
-rw------- 1 sage sage  30K Feb 18  2016 19840125.txt
-rw------- 1 sage sage  25K Feb 18  2016 19850206.txt
-rw------- 1 sage sage  20K Feb 18  2016 19860204.txt
-rw------- 1 sage sage  22K Feb 18  2016 19870127.txt
-rw------- 1 sage sage  28K Feb 18  2016 19880125.txt
-rw------- 1 sage sage  28K Feb 18  2016 19890209.txt
-rw------- 1 sage sage  21K Feb 18  2016 19900131.txt
-rw------- 1 sage sage  22K Feb 18  2016 19910129.txt
-rw------- 1 sage sage  27K Feb 18  2016 19920128.txt
-rw------- 1 sage sage  39K Feb 18  2016 19930217.txt
-rw------- 1 sage sage  42K Feb 18  2016 19940125.txt
-rw------- 1 sage sage  51K Feb 18  2016 19950124.txt
-rw------- 1 sage sage  36K Feb 18  2016 19960123.txt
-rw------- 1 sage sage  39K Feb 18  2016 19970204.txt
-rw------- 1 sage sage  42K Feb 18  2016 19980127.txt
-rw------- 1 sage sage  43K Feb 18  2016 19990119.txt
-rw------- 1 sage sage  44K Feb 18  2016 20000127.txt
-rw------- 1 sage sage  25K Feb 18  2016 20010227.txt
-rw------- 1 sage sage  17K Feb 18  2016 20010920.txt
-rw------- 1 sage sage  23K Feb 18  2016 20020129.txt
-rw------- 1 sage sage  32K Feb 18  2016 20030128.txt
-rw------- 1 sage sage  30K Feb 18  2016 20040120.txt
-rw------- 1 sage sage  30K Feb 18  2016 20050202.txt
-rw------- 1 sage sage  31K Feb 18  2016 20060131.txt
-rw------- 1 sage sage  32K Feb 18  2016 20070123.txt
-rw------- 1 sage sage  34K Feb 18  2016 20080128.txt
-rw------- 1 sage sage  33K Feb 18  2016 20090224.txt
-rw------- 1 sage sage  41K Feb 18  2016 20100127.txt
-rw------- 1 sage sage  39K Feb 18  2016 20110125.txt
-rw------- 1 sage sage  40K Feb 18  2016 20120124.txt
-rw------- 1 sage sage  37K Feb 18  2016 20130212.txt
-rw------- 1 sage sage  39K Feb 18  2016 20140128.txt
-rw------- 1 sage sage  38K Feb 18  2016 20150120.txt
-rw------- 1 sage sage  31K Feb 18  2016 20160112.txt

%%sh
head mydir/sou/17900108.txt

George Washington 

January 8, 1790 
Fellow-Citizens of the Senate and House of Representatives: 
I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. 
In resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom. 
Among the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace. 
A free people ought not only to be armed, but disciplined; to which end a uniform and well-digested plan is requisite; and their safety and interest require that they should promote such manufactories as tend to render them independent of others for essential, particularly military, supplies. 
The proper establishment of the troops which may be deemed indispensable will be entitled to mature consideration. In the arrangements which may be made respecting it it will be of importance to conciliate the comfortable support of the officers and soldiers with a due regard to economy. 
There was reason to hope that the pacific measures adopted with regard to certain hostile tribes of Indians would have relieved the inhabitants of our southern and western frontiers from their depredations, but you will perceive from the information contained in the papers which I shall direct to be laid before you (comprehending a communication from the Commonwealth of Virginia) that we ought to be prepared to afford protection to those parts of the Union, and, if necessary, to punish aggressors.

%%sh
head mydir/sou/20160112.txt

Barack Obama 

January 12, 2016 
Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: 
Tonight marks the eighth year I've come here to report on the State of the Union. And for this final one, I'm going to try to make it shorter. I know some of you are antsy to get back to Iowa. 
I also understand that because it's an election season, expectations for what we'll achieve this year are low. Still, Mr. Speaker, I appreciate the constructive approach you and the other leaders took at the end of last year to pass a budget and make tax cuts permanent for working families. So I hope we can work together this year on bipartisan priorities like criminal justice reform, and helping people who are battling prescription drug abuse. We just might surprise the cynics again. 
But tonight, I want to go easy on the traditional list of proposals for the year ahead. Don't worry, I've got plenty, from helping students learn to write computer code to personalizing medical treatments for patients. And I'll keep pushing for progress on the work that still needs doing. Fixing a broken immigration system. Protecting our kids from gun violence. Equal pay for equal work, paid leave, raising the minimum wage. All these things still matter to hardworking families; they are still the right thing to do; and I will not let up until they get done. 
But for my final address to this chamber, I don't want to talk just about the next year. I want to focus on the next five years, ten years, and beyond. 
I want to focus on our future. 
We live in a time of extraordinary change, change that's reshaping the way we live, the way we work, our planet and our place in the world. It's change that promises amazing medical breakthroughs, but also economic disruptions that strain working families. It promises education for girls in the most remote villages, but also connects terrorists plotting an ocean away. It's change that can broaden opportunity, or widen inequality. And whether we like it or not, the pace of this change will only accelerate.

An interesting analysis of the textual content of the State of the Union (SoU) addresses by all US presidents was done in:

Alix Rule, Jean-Philippe Cointet, and Peter S. Bearman, Lexical shifts, substantive changes, and continuity in State of the Union discourse, 1790–2014, PNAS 2015 112 (35) 10837-10844; doi:10.1073/pnas.1512221112.

Fig. 5. A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.

You will be able to carry out such analyses and/or critically reflect on the mathematical statistical assumptions made in such analyses, as you learn more during your programme of study after successfully completing this course.
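
As a very rough first taste of such text analysis, the sketch below counts the most frequent word tokens in a single sentence taken from the 1790 address shown above. It is not the method of the cited paper; the function name `top_terms` and the simple tokenisation rule (lowercase letters and apostrophes only) are assumptions made purely for illustration.

```python
import re
from collections import Counter

def top_terms(text, k=3):
    """Return the k most frequent lowercase word tokens in text (hypothetical helper)."""
    # Crude tokenisation: runs of letters and apostrophes, after lowercasing.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(k)

# One sentence from the 1790 address displayed above.
sample = ("To be prepared for war is one of the most effectual "
          "means of preserving peace.")
print(top_terms(sample, k=1))  # [('of', 2)]
```

A real analysis would at least remove very common words such as "of" and "the" before ranking terms.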

How was the `sou.tgz` file created?¶

If you are curious, read: http://lamastex.org/datasets/public/SOU/README.md.

Briefly, the text below shows how a website with the SoU addresses was scraped by Paul Brouwers and adapted by Raazesh Sainudiin. A data scientist, and more generally a researcher interested in making statistical inference from data that is readily available online in a particular format, is expected to be comfortable with such web-scraping tasks (which can be done in more graceful and robust ways using specialised Python libraries). Such tasks, also known as Extract-Load-Transform (ELT) operations, are often time-consuming and expensive, and they are the necessary first step towards statistical inference.

A bit of bash and lynx to achieve the scraping of the state of the union addresses of the US Presidents,¶

by Paul Brouwers¶

The code below is mainly there to show how the text content of each state of the union address was scraped from the following URL:

http://stateoftheunion.onetwothree.net/texts/index.html

Such data acquisition tasks are usually the first and crucial step in a data scientist's workflow.

We have done this and put the data in the distributed file system for easy loading into our notebooks for further analysis. This keeps us from having to install unix programs like lynx, sed, etc. that are needed in the shell script below.

for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html | grep texts | grep -v index | sed 's/.*http/http/') ; do lynx --dump $i | tail -n+13 | head -n-14 | sed 's/^\s\+//' | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt ; done

Or in a more atomic form:

for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html \
        | grep texts \
        | grep -v index \
        | sed 's/.*http/http/')
do
        lynx --dump $i \
               | tail -n+13 \
               | head -n-14 \
               | sed 's/^\s\+//' \
               | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' \
               > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt
done
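
To make two steps of the shell pipeline above more concrete, here is a minimal Python sketch of (1) filtering the link list that `lynx --dump` prints and (2) deriving the output file name from the 8-digit date in each URL. It needs no network access. The helper names `address_urls` and `output_filename`, the sample dump lines, and the `.html` ending of the address URLs are all assumptions for illustration; only the index URL and the `17900108.txt`-style file names come from the material above.

```python
import re

def address_urls(dump_lines):
    """Keep lines mentioning 'texts' but not 'index', trimmed to start at 'http'."""
    urls = []
    for line in dump_lines:
        if "texts" in line and "index" not in line:
            # Greedy match, as in sed: drop everything before the last 'http'.
            urls.append(re.sub(r".*http", "http", line.strip()))
    return urls

def output_filename(url):
    """Use the last run of 8 digits in url as the file name, plus '.txt'."""
    match = re.search(r".*([0-9]{8})", url)
    # Like sed, leave the text unchanged when the pattern does not match.
    return (match.group(1) if match else url) + ".txt"

# Hypothetical lines in the style of the link list printed by lynx --dump.
sample_dump = [
    "   1. http://stateoftheunion.onetwothree.net/texts/index.html",
    "   2. http://stateoftheunion.onetwothree.net/texts/17900108.html",
    "   3. http://stateoftheunion.onetwothree.net/texts/20160112.html",
]

for url in address_urls(sample_dump):
    print(url, "->", output_filename(url))
```

The remaining steps of the pipeline (`tail`, `head` and the line-joining `sed` commands) trim the page header and footer and reflow the paragraphs; they are not reproduced here.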

Assignment 1, PROBLEM 1¶

Maximum Points = 2

Finding out the number of lines and words in a file¶

Evaluate the following two cells by replacing X with the right command-line option to the wc command in order to find:

the number of lines in data/earthquakes_small.csv and
the number of words in data/earthquakes_small.csv

Finally, update the following cell by replacing XXX with the right integer answers, respectively, for:

NumberOfLinesIn_earthquakes_small_csv_file and
NumberOfWordsIn_earthquakes_small_csv_file

Here is a brief synopsis of wc that you would get from running man wc as follows:

%%sh
man wc

WC(1)                     BSD General Commands Manual                    WC(1)

NAME
     wc -- word, line, character, and byte count

SYNOPSIS
     wc [-clmw] [file ...]

DESCRIPTION
     The wc utility displays the number of lines, words, and bytes contained in each input file, or standard input (if no file is specified) to the standard output.  A line is defined as a string of characters delimited by a <newline> character.  Characters beyond the final <newline> character will not be included in the line count.

     A word is defined as a string of characters delimited by white space characters.  White space characters are the set of characters for which the iswspace(3) function returns true.  If more than one input file is specified, a line of cumulative counts for all the files is displayed on a separate line after the output for the last file.

     The following options are available:

     -c      The number of bytes in each input file is written to the standard output.  This will cancel out any prior usage of the -m option.

     -l      The number of lines in each input file is written to the standard output.

     -m      The number of characters in each input file is written to the standard output.  If the current locale does not support multibyte
             characters, this is equivalent to the -c option.  This will cancel out any prior usage of the -c option.

     -w      The number of words in each input file is written to the standard output.

     When an option is specified, wc only reports the information requested by that option.  The order of output always takes the form of line, word, byte, and file name.  The default action is equivalent to specifying the -c, -l and -w options.
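
The definitions in the manual page can be checked on a small in-memory example. The sketch below is not how wc is implemented; the function name `wc_counts` is an assumption, and splitting on ASCII whitespace only approximates the iswspace(3) rule quoted above. Note how the characters after the final newline contribute to the word and byte counts but not to the line count.

```python
def wc_counts(data: bytes):
    """Return (lines, words, bytes) for data, following the definitions quoted above."""
    lines = data.count(b"\n")   # a line is delimited by a <newline> character
    words = len(data.split())   # runs of characters delimited by (ASCII) white space
    return lines, words, len(data)

# Three rows of text, the last one without a trailing newline.
sample = b"a b\nc d e\nlast"
print(wc_counts(sample))  # (2, 6, 14)
```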

%%sh
# replace X in the next line with the right option to find the number of lines
wc -X data/earthquakes_small.csv

%%sh
# replace X in the next line with the right option to find the number of words
wc -X data/earthquakes_small.csv

# Write your answers below by replacing XXX; do not modify anything else!

NumberOfLinesIn_earthquakes_small_csv_file = XXX
NumberOfWordsIn_earthquakes_small_csv_file = XXX

Local Test for Assignment 1, PROBLEM 1¶

Evaluate the cell below to make sure your answer is valid. You should not modify anything in the cell below when evaluating it to do a local test of your solution. Sometimes you may need to include and evaluate code snippets from lecture notebooks in cells above to make the local test work correctly (see error messages for clues). This is meant to help you become efficient at recalling materials covered in lectures that relate to this problem. Such local tests will generally not be available in the exam.

# Evaluate this cell locally to make sure you have the answer as a non-negative integer
try:
    assert(NumberOfLinesIn_earthquakes_small_csv_file > -1)
    print("Good! You have 0 or more lines as your answer. Hopefully it is correct!")
except AssertionError:
    print("Try Again. You seem to not have a valid number of lines as your answer.")
try:
    assert(NumberOfWordsIn_earthquakes_small_csv_file > -1)
    print("Good! You have 0 or more words as your answer. Hopefully it is correct!")
except AssertionError:
    print("Try Again. You seem to not have a valid number of words as your answer.")

Introduction to Data Science: A Comp-Math-Stat Approach¶