©2021 Raazesh Sainudiin, Benny Avelin. Attribution 4.0 International (CC BY 4.0)
pwd --- print working directory
ls --- list files in current working directory
mkdir --- make directory
cd --- change directory
man ls --- manual pages for any command
curl
def showURL(url, ht=500):
    """Return an IFrame of the url to show in notebook with height ht"""
    from IPython.display import IFrame
    return IFrame(url, width='95%', height=ht)
showURL('https://en.wikipedia.org/wiki/Bash_(Unix_shell)',400)
By starting a code cell with %%sh we can run commands as we would at the BASH (Unix shell) command prompt.
Let us run pwd to print the working directory.
%%sh
pwd
/home/user/datascience-intro/raaz/1MS041/master/jp
%%sh
# this is a comment in BASH shell as it is preceded by '#'
ls # list the contents of this working directory
%%sh
mkdir -p mydir
%%sh
cd mydir
pwd
ls -al
/home/user/datascience-intro/raaz/1MS041/master/jp/mydir
total 3600
drwxr-xr-x  3 user user       5 Sep 14 11:42 .
drwxr-xr-x 20 user user      68 Sep 14 12:04 ..
-rw-r--r--  1 user user   29323 Sep 14 11:42 20170228.txt
drwxr-xr-x  2 user user     232 Sep 14 11:42 sou
-rw-r--r--  1 user user 3652403 Sep 14 11:42 sou.tar.gz
%%sh
pwd
/home/user/datascience-intro/raaz/1MS041/master/jp
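Note that the last pwd prints the parent directory again, even though the previous cell ran cd mydir. Each %%sh cell appears to start a fresh shell, so a cd only lasts until the end of its own cell. The following is a minimal Python sketch of the same behaviour, assuming a POSIX shell is available; the helper name run_cell and the temporary directory are illustrative and not part of the notebook.

```python
import os
import subprocess
import tempfile


def run_cell(commands, cwd):
    """Run shell commands in a fresh shell started in cwd and return its output."""
    result = subprocess.run(commands, shell=True, cwd=cwd,
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as parent:
    os.mkdir(os.path.join(parent, "mydir"))
    # First "cell": the cd takes effect only inside this shell process.
    first = run_cell("cd mydir && pwd", cwd=parent)
    # Second "cell": a new shell starts from the parent directory again.
    second = run_cell("pwd", cwd=parent)
    print("first cell ended in: ", first)
    print("second cell started in:", second)
```

To stay in a directory across several commands, put the cd and the commands that depend on it in the same cell, as the cells in this notebook do.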
man-ning the unknown command
By evaluating the next cell, you are using the manual pages to find more about the command ls. You can learn more about any command called command by typing man command in the BASH shell.
The output of the next cell with command man ls will look something like the following:
LS(1) User Commands LS(1)
NAME
ls - list directory contents
SYNOPSIS
ls [OPTION]... [FILE]...
DESCRIPTION
List information about the FILEs (the current directory by default).
Sort entries alphabetically if none of -cftuvSUX nor --sort is speci‐
fied.
Mandatory arguments to long options are mandatory for short options
too.
-a, --all
do not ignore entries starting with .
-A, --almost-all
do not list implied . and ..
...
...
...
Exit status:
0 if OK,
1 if minor problems (e.g., cannot access subdirectory),
2 if serious trouble (e.g., cannot access command-line argument).
AUTHOR
Written by Richard M. Stallman and David MacKenzie.
REPORTING BUGS
GNU coreutils online help: <http://www.gnu.org/software/coreutils/>
Report ls translation bugs to <http://translationproject.org/team/>
COPYRIGHT
Copyright © 2017 Free Software Foundation, Inc. License GPLv3+: GNU
GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
SEE ALSO
Full documentation at: <http://www.gnu.org/software/coreutils/ls>
or available locally via: info '(coreutils) ls invocation'
GNU coreutils 8.28 January 2018 LS(1)
%%sh
## uncomment by removing '#' in the next line and try executing this cell
# man ls
%%sh
cd mydir
curl -O http://lamastex.org/datasets/public/SOU/sou/20170228.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 29323  100 29323    0     0  39097      0 --:--:-- --:--:-- --:--:-- 39045
%%sh
ls mydir/
20170228.txt sou sou.tar.gz
%%sh
cd mydir/
head 20170228.txt
Donald J. Trump February 28, 2017 Thank you very much. Mr. Speaker, Mr. Vice President, members of Congress, the first lady of the United States ... ... and citizens of America, tonight, as we mark the conclusion of our celebration of Black History Month, we are reminded of our nation's path toward civil rights and the work that still remains to be done. Recent threats ... Recent threats targeting Jewish community centers and vandalism of Jewish cemeteries, as well as last week's shooting in Kansas City, remind us that while we may be a nation divided on policies, we are a country that stands united in condemning hate and evil in all of its very ugly forms. Each American generation passes the torch of truth, liberty and justice, in an unbroken chain all the way down to the present. That torch is now in our hands. And we will use it to light up the world. I am here tonight to deliver a message of unity and strength, and it is a message deeply delivered from my heart. A new chapter ... ... of American greatness is now beginning. A new national pride is sweeping across our nation. And a new surge of optimism is placing impossible dreams firmly within our grasp. What we are witnessing today is the renewal of the American spirit. Our allies will find that America is once again ready to lead.
Do the following:
%%sh
mkdir -p mydir # first create a directory called 'mydir'
cd mydir # change into this mydir directory
rm -f sou.tar.gz # remove any file in mydir called sou.tar.gz
curl -O http://lamastex.org/datasets/public/SOU/sou.tar.gz
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3566k  100 3566k    0     0  1563k      0  0:00:02  0:00:02 --:--:-- 1563k
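As an aside, the same four steps (make the directory, enter it, remove any old copy, download) can be sketched in Python with only the standard library. This is an illustration under assumptions, not the notebook's own method: the function name fetch_into is hypothetical, and Path.unlink(missing_ok=True) needs Python 3.8 or later.

```python
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlretrieve


def fetch_into(directory, url):
    """Download url into directory, keeping the remote file name."""
    target_dir = Path(directory)
    # Like `mkdir -p`: create the directory, no error if it already exists.
    target_dir.mkdir(parents=True, exist_ok=True)
    # Like `curl -O`: name the local file after the last part of the URL path.
    target = target_dir / Path(urlparse(url).path).name
    # Like `rm -f`: remove an old copy, no error if there is none.
    target.unlink(missing_ok=True)
    urlretrieve(url, str(target))
    return target
```

Calling fetch_into("mydir", "http://lamastex.org/datasets/public/SOU/sou.tar.gz") would then play the role of the cell above, without needing curl to be installed.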
%%sh
pwd
ls -lh mydir
/home/user/datascience-intro/raaz/1MS041/master/jp
total 25K
-rw-r--r-- 1 user user  29K Sep 14 12:07 20170228.txt
drwxr-xr-x 2 user user  232 Sep 14 11:42 sou
-rw-r--r-- 1 user user 3.5M Sep 14 12:07 sou.tar.gz
%%sh
cd mydir
tar zxvf sou.tar.gz
After running the above two cells, you should have all the SOU (State of the Union) addresses. By evaluating the next cell's ls ... command you should see the SOU files like the following:
total 11M
-rw------- 1 raazesh raazesh 6.6K Feb 18 2016 17900108.txt
-rw------- 1 raazesh raazesh 8.3K Feb 18 2016 17901208.txt
-rw------- 1 raazesh raazesh 14K Feb 18 2016 17911025.txt
...
...
...
-rw------- 1 raazesh raazesh 39K Feb 18 2016 20140128.txt
-rw------- 1 raazesh raazesh 38K Feb 18 2016 20150120.txt
-rw------- 1 raazesh raazesh 31K Feb 18 2016 20160112.txt
%%sh
ls -lh mydir/sou
%%sh
head mydir/sou/17900108.txt
George Washington January 8, 1790 Fellow-Citizens of the Senate and House of Representatives: I embrace with great satisfaction the opportunity which now presents itself of congratulating you on the present favorable prospects of our public affairs. The recent accession of the important state of North Carolina to the Constitution of the United States (of which official information has been received), the rising credit and respectability of our country, the general and increasing good will toward the government of the Union, and the concord, peace, and plenty with which we are blessed are circumstances auspicious in an eminent degree to our national prosperity. In resuming your consultations for the general good you can not but derive encouragement from the reflection that the measures of the last session have been as satisfactory to your constituents as the novelty and difficulty of the work allowed you to hope. Still further to realize their expectations and to secure the blessings which a gracious Providence has placed within our reach will in the course of the present important session call for the cool and deliberate exertion of your patriotism, firmness, and wisdom. Among the many interesting objects which will engage your attention that of providing for the common defense will merit particular regard. To be prepared for war is one of the most effectual means of preserving peace. A free people ought not only to be armed, but disciplined; to which end a uniform and well-digested plan is requisite; and their safety and interest require that they should promote such manufactories as tend to render them independent of others for essential, particularly military, supplies. The proper establishment of the troops which may be deemed indispensable will be entitled to mature consideration. In the arrangements which may be made respecting it it will be of importance to conciliate the comfortable support of the officers and soldiers with a due regard to economy. 
There was reason to hope that the pacific measures adopted with regard to certain hostile tribes of Indians would have relieved the inhabitants of our southern and western frontiers from their depredations, but you will perceive from the information contained in the papers which I shall direct to be laid before you (comprehending a communication from the Commonwealth of Virginia) that we ought to be prepared to afford protection to those parts of the Union, and, if necessary, to punish aggressors.
%%sh
head mydir/sou/20160112.txt
Barack Obama January 12, 2016 Mr. Speaker, Mr. Vice President, Members of Congress, my fellow Americans: Tonight marks the eighth year I've come here to report on the State of the Union. And for this final one, I'm going to try to make it shorter. I know some of you are antsy to get back to Iowa. I also understand that because it's an election season, expectations for what we'll achieve this year are low. Still, Mr. Speaker, I appreciate the constructive approach you and the other leaders took at the end of last year to pass a budget and make tax cuts permanent for working families. So I hope we can work together this year on bipartisan priorities like criminal justice reform, and helping people who are battling prescription drug abuse. We just might surprise the cynics again. But tonight, I want to go easy on the traditional list of proposals for the year ahead. Don't worry, I've got plenty, from helping students learn to write computer code to personalizing medical treatments for patients. And I'll keep pushing for progress on the work that still needs doing. Fixing a broken immigration system. Protecting our kids from gun violence. Equal pay for equal work, paid leave, raising the minimum wage. All these things still matter to hardworking families; they are still the right thing to do; and I will not let up until they get done. But for my final address to this chamber, I don't want to talk just about the next year. I want to focus on the next five years, ten years, and beyond. I want to focus on our future. We live in a time of extraordinary change, change that's reshaping the way we live, the way we work, our planet and our place in the world. It's change that promises amazing medical breakthroughs, but also economic disruptions that strain working families. It promises education for girls in the most remote villages, but also connects terrorists plotting an ocean away. It's change that can broaden opportunity, or widen inequality. 
And whether we like it or not, the pace of this change will only accelerate.
An interesting analysis of the textual content of the State of the Union (SoU) addresses by all US presidents was done in:
Fig. 5. A river network captures the flow across history of US political discourse, as perceived by contemporaries. Time moves along the x axis. Clusters on semantic networks of 300 most frequent terms for each of 10 historical periods are displayed as vertical bars. Relations between clusters of adjacent periods are indexed by gray flows, whose density reflects their degree of connection. Streams that connect at any point in history may be considered to be part of the same system, indicated with a single color.
You will be able to carry out such analyses and/or critically reflect on the mathematical statistical assumptions made in such analyses, as you learn more during your programme of study after successfully completing this course.
How was the sou.tgz file created?
If you are curious, read: http://lamastex.org/datasets/public/SOU/README.md.
Briefly, this is how a website with SOU was scraped by Paul Brouwers and adapted by Raazesh Sainudiin. A data scientist, and more generally a researcher interested in making statistical inference from data that is readily available online in a particular format, is expected to be comfortable with such web-scraping tasks (which can be done in more graceful and robust ways using specialised Python libraries). Such tasks, also known as Extract-Load-Transform (ELT) operations, are often time-consuming and expensive, and they are the necessary first step towards extracting value from data.
The code below is mainly there to show how the text content of each state of the union address was scraped from the following URL:
Such data acquisition tasks are usually the first and crucial step in a data scientist's workflow.
We have done this and put the data in the distributed file system for easy loading into our notebooks for further analysis. This keeps us from having to install unix programs like lynx, sed, etc. that are needed in the shell script below.
for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html | grep texts | grep -v index | sed 's/.*http/http/') ; do lynx --dump $i | tail -n+13 | head -n-14 | sed 's/^\s\+//' | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt ; done
Or in a more atomic form:
for i in $(lynx --dump http://stateoftheunion.onetwothree.net/texts/index.html \
             | grep texts \
             | grep -v index \
             | sed 's/.*http/http/')
do
  lynx --dump $i \
    | tail -n+13 \
    | head -n-14 \
    | sed 's/^\s\+//' \
    | sed -e ':a;N;$!ba;s/\(.\)\n/\1 /g' -e 's/\n/\n\n/' \
    > $(echo $i | sed 's/.*\([0-9]\{8\}\).*/\1/').txt
done
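The last stage of the loop names each output file after the eight-digit date found in its URL. Here is a minimal Python sketch of that single step, for readers who find regular expressions easier to follow outside sed. The function name date_stamp and the example URL are assumptions made for illustration; the actual link format on the scraped site is not shown in this notebook.

```python
import re


def date_stamp(url):
    r"""Mimic sed 's/.*\([0-9]\{8\}\).*/\1/': keep the last run of 8 digits."""
    # The greedy .* consumes as much as it can, so the captured group is the
    # last 8-digit run in the string, just as in the sed expression.
    match = re.match(r".*([0-9]{8}).*", url)
    # Like sed, leave the input unchanged when the pattern does not match.
    return match.group(1) if match else url


# Hypothetical URL, used only to illustrate the extraction.
example = "http://stateoftheunion.onetwothree.net/texts/17900108.html"
print(date_stamp(example) + ".txt")
```

The resulting name has the same YYYYMMDD.txt shape as the files listed in mydir/sou above.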
If you have time and are curious about how each of the components connected by | operators in the above pipeline works, try reading man echo, man sed, man grep, man head, man tail, and man lynx or lynx --help. If a command like lynx is not on your system, then you can install it with some work (mostly googling).
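To make three of the pipeline stages concrete, here is a rough Python sketch of what tail -n+13, head -n-14 and sed 's/^\s\+//' do to the lines of a dumped page. The function name trim_page and the made-up input are assumptions for illustration only; the sketch does not reproduce the final sed stage that joins lines into paragraphs.

```python
import re


def trim_page(lines, skip_head=12, skip_tail=14):
    """Drop leading and trailing page furniture, then strip leading whitespace."""
    # tail -n+13 starts output at line 13, i.e. it drops the first 12 lines;
    # head -n-14 prints everything except the last 14 lines.
    end = len(lines) - skip_tail
    body = lines[skip_head:max(end, skip_head)]
    # sed 's/^\s\+//' removes whitespace at the start of each line.
    return [re.sub(r"^\s+", "", line) for line in body]


# Hypothetical stand-in for the output of `lynx --dump`: 40 indented lines.
dump = [f"   line {n}" for n in range(1, 41)]
body = trim_page(dump)
print(body[0], "...", body[-1], f"({len(body)} lines kept)")
```

The numbers 13 and 14 in the shell script are specific to the layout of the scraped pages, which is why they are exposed as parameters here.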
%%sh
## uncomment by removing '#' in the next line and try executing this cell by pressing Ctrl-Enter to see if lynx is installed
#lynx --help
So using lynx is not that difficult. Suppose you want to dump the contents of https://lamastex.github.io/research/#available-student-projects to stdout (standard output). You can do the following:
%%sh
## uncomment by removing '#' in the next line and try executing this cell if lynx is installed
#lynx --dump https://lamastex.github.io/research/#available-student-projects
Hopefully, you had fun with BASH! Now let us put BASH to use.