14. How can I work on a remote server?#

Today we will connect to a remote server and learn new bash commands for working with the content of files.

14.1. What are remote servers and HPC systems?#

diagram illustrating a remote connection to a login node and compute cluster

14.2. Connecting to Seawulf#

We connect with secure shell or ssh from our terminal (GitBash or Putty on windows) to URI’s teaching High Performance Computing (HPC) Cluster Seawulf.

Our login is the part of your uri e-mail address before the @.

ssh -l brownsarahm seawulf.uri.edu

When it logs in it looks like this and requires you to change your password. They configure it with a default and with it past expired.

Note

This block is sort of weird, because it is interactive terminal. I have rendered it all as output, but broken it down to separate chunks to add explanation.

let me know your thoughts on this choice for a community badge

The authenticity of host 'seawulf.uri.edu (131.128.217.210)' can't be established.
ECDSA key fingerprint is SHA256:RwhTUyjWLqwohXiRw+tYlTiJEbqX2n/drCpkIwQVCro.
Are you sure you want to continue connecting (yes/no/[fingerprint])? y
Please type 'yes', 'no' or the fingerprint: yes

Follow the instruction to type yes

I will tell you how to find your default password if you missed class (do not want to post it publicly). Comment on your experience report PR to ask for this information.

Warning: Permanently added 'seawulf.uri.edu,131.128.217.210' (ECDSA) to the list of known hosts.
brownsarahm@seawulf.uri.edu's password:

It does not show charachters when you type your password, but it works when you press enter

Then it requires you to change your password

You are required to change your password immediately (root enforced)
WARNING: Your password has expired.
You must change your password now and login again!

To change, it asks for you current (default) password first,

Important

You use the default password when prompted for your username’s password. Then again when it asks for the (current) UNIX password:. Then you must type the same, new password twice.

Choose a new password you will remember, we will come back to this server

Changing password for user brownsarahm.
Changing password for brownsarahm.
(current) UNIX password:

then the new one twice

New password:
Retype new password:
passwd: all authentication tokens updated successfully.
Connection to seawulf.uri.edu closed.

after you give it a new password, then it logs you out and you have to log back in.

brownsarahm@~ $ ssh -l brownsarahm seawulf.uri.edu

When you log in it shows you information about your recent logins.

We have logged into our home directory which is empty

[brownsarahm@seawulf ~]$ ls
[brownsarahm@seawulf ~]$ pwd
/home/brownsarahm

Notice that the prompt says uriusername@seawulf to indicate that you are logged into the server, not working locally.

14.3. Downloading files#

wget allows you to get files from the web.

[brownsarahm@seawulf ~]$ wget http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz
--2022-03-08 12:58:09--  http://www.hpc-carpentry.org/hpc-shell/files/bash-lesson.tar.gz
Resolving www.hpc-carpentry.org (www.hpc-carpentry.org)... 104.21.33.152, 172.67.146.136
Connecting to www.hpc-carpentry.org (www.hpc-carpentry.org)|104.21.33.152|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12534006 (12M) [application/gzip]
Saving to: ‘bash-lesson.tar.gz’

100%[======================================>] 12,534,006  4.19MB/s   in 2.9s   

2022-03-08 12:58:12 (4.19 MB/s) - ‘bash-lesson.tar.gz’ saved [12534006/12534006]

Note that this is a reasonably sized download and it finished very quickly. This is because the download happened on the remote server not your laptop. The server has a high quality hard-wired connection to the internet that is very fast, unlike the wifi in our classroom.

This is an advantage of using a remote system. If your connection is slow, but stable enough to connect, you can do the work on a different computer that has better connection.

Now we see we have the file.

[brownsarahm@seawulf ~]$ ls
bash-lesson.tar.gz

we see the new file!

14.4. Unzipping a file on the command line#

This file is compressed.

We can use man tar to see the manual aka man file of the tar program to learn how it works. You can also read man files online from GNU where you can choose your format, this page shows the full version.

[brownsarahm@seawulf ~]$ tar -xvf bash-lesson.tar.gz
dmel-all-r6.19.gtf
dmel_unique_protein_isoforms_fb_2016_01.tsv
gene_association.fb
SRR307023_1.fastq
SRR307023_2.fastq
SRR307024_1.fastq
SRR307024_2.fastq
SRR307025_1.fastq
SRR307025_2.fastq
SRR307026_1.fastq
SRR307026_2.fastq
SRR307027_1.fastq
SRR307027_2.fastq
SRR307028_1.fastq
SRR307028_2.fastq
SRR307029_1.fastq
SRR307029_2.fastq
SRR307030_1.fastq
SRR307030_2.fastq

This command uses the tar program and:

  • v makes it verbose (to output all that info while it works)

  • x makes it extract

  • f option accepts the file name to work on

We can see what it did with ls

[brownsarahm@seawulf ~]$ ls
bash-lesson.tar.gz                           SRR307026_1.fastq
dmel-all-r6.19.gtf                           SRR307026_2.fastq
dmel_unique_protein_isoforms_fb_2016_01.tsv  SRR307027_1.fastq
gene_association.fb                          SRR307027_2.fastq
SRR307023_1.fastq                            SRR307028_1.fastq
SRR307023_2.fastq                            SRR307028_2.fastq
SRR307024_1.fastq                            SRR307029_1.fastq
SRR307024_2.fastq                            SRR307029_2.fastq
SRR307025_1.fastq                            SRR307030_1.fastq
SRR307025_2.fastq                            SRR307030_2.fastq

Note: To extract files to a different directory use the option --directory --directory path/to/directory

14.5. Using a compute note#

14.6. Working with large files#

One of these files, contains the entire genome for the common fruitfly, let’s take a look at it:

[brownsarahm@seawulf ~]$ cat dmel-all-r6.19.gtf
X	FlyBase	gene	19961297	19969323	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3";
2L	FlyBase	3UTR	782825	782885	.	+	.	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";

Warning

this output is truncated for display purposes

We see that this actually take a long time to output and is way tooo much information to actually read. In fact, in order to make the website work, I had to cut that content using command line tools, my text editor couldn’t open the file and GitHub was unhappy when I pushed it.

Note

I initially used head and tail to pick out the excerpts of the terminal output that I needed, but this year I reused what I had done before

For a file like this, we don’t really want to read the whole file but we do need to know what it’s strucutred like in order to design programs to work with it.

head lets us look at the first 10 lines.

[brownsarahm@seawulf ~]$ head dmel-all-r6.19.gtf
X	FlyBase	gene	19961297	19969323	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3";
X	FlyBase	mRNA	19961689	19968479	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	5UTR	19961689	19961845	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19961689	19961845	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19963955	19964071	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19964782	19964944	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19965006	19965126	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19965197	19965511	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19965577	19966071	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
X	FlyBase	exon	19966183	19967012	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";

We use the man file for head to look at its options.

Historically there were only the single character options, but these are limited (you can only have 52) and cryptic. They were good though when you only could type and had minimal memory. the longer ones are good because they are more descriptive.

The long versions are better to use in scripts because they are easier for someone else to read. The single character versions save time when you are working interactively.

We can use the -n parameter to change the number.

And, tails shows the last few.

[brownsarahm@seawulf ~]$ tail dmel-all-r6.19.gtf

which in this case looks mostly the same

2L	FlyBase	exon	782124	782181	.	+	.	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	exon	782238	782441	.	+	.	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	exon	782495	782885	.	+	.	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	start_codon	781297	781299	.	+	0	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	CDS	781297	782048	.	+	0	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	CDS	782124	782181	.	+	1	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	CDS	782238	782441	.	+	0	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	CDS	782495	782821	.	+	0	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	stop_codon	782822	782824	.	+	0	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";
2L	FlyBase	3UTR	782825	782885	.	+	.	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";

We can also see how much content is in the file wc give a word count and with its -l parameter gives us the number of lines.

[brownsarahm@seawulf ~]$ wc -l dmel-all-r6.19.gtf
542048 dmel-all-r6.19.gtf

Over five hundred forty thousand lines is a lot.

How can we get the number of lines in each of the .fastq files?

[brownsarahm@seawulf ~]$ wc -l *.fastw
wc: *.fastw: No such file or directory

note that in my typo, it tells me no files matched my pattern.

[brownsarahm@seawulf ~]$ wc -l *.fastq
   20000 SRR307023_1.fastq
   20000 SRR307023_2.fastq
   20000 SRR307024_1.fastq
   20000 SRR307024_2.fastq
   20000 SRR307025_1.fastq
   20000 SRR307025_2.fastq
   20000 SRR307026_1.fastq
   20000 SRR307026_2.fastq
   20000 SRR307027_1.fastq
   20000 SRR307027_2.fastq
   20000 SRR307028_1.fastq
   20000 SRR307028_2.fastq
   20000 SRR307029_1.fastq
   20000 SRR307029_2.fastq
   20000 SRR307030_1.fastq
   20000 SRR307030_2.fastq
  320000 total

when it does work, we also get the total.

We can use redirects as before to save these to a file:

[brownsarahm@seawulf ~]$ wc -l *.fastq > linecounts.txt
[brownsarahm@seawulf ~]$ cat linecounts.txt
   20000 SRR307023_1.fastq
   20000 SRR307023_2.fastq
   20000 SRR307024_1.fastq
   20000 SRR307024_2.fastq
   20000 SRR307025_1.fastq
   20000 SRR307025_2.fastq
   20000 SRR307026_1.fastq
   20000 SRR307026_2.fastq
   20000 SRR307027_1.fastq
   20000 SRR307027_2.fastq
   20000 SRR307028_1.fastq
   20000 SRR307028_2.fastq
   20000 SRR307029_1.fastq
   20000 SRR307029_2.fastq
   20000 SRR307030_1.fastq
   20000 SRR307030_2.fastq
  320000 total

We can also search files, without loading them all into memory or displaying them, with grep:

[brownsarahm@seawulf ~]$ grep Act5c dmel-all-r6.19.gtf
[brownsarahm@seawulf ~]$ grep mRNA dmel-all-r6.19.gtf

this output a lot, so the output is truncated here

X	FlyBase	mRNA	19961689	19968479	.	+	.	gene_id "FBgn0031081"; gene_symbol "Nep3"; transcript_id "FBtr0070000"; transcript_symbol "Nep3-RA";
2L	FlyBase	mRNA	781276	782885	.	+	.	gene_id "FBgn0041250"; gene_symbol "Gr21a"; transcript_id "FBtr0331651"; transcript_symbol "Gr21a-RB";

and we can combine grep with wc to count occurences.

[brownsarahm@seawulf ~]$ grep mRNA dmel-all-r6.19.gtf | wc -l
34025

14.7. Feedback#

Note

will fill in later

14.8. Prepare for Next Class#

  1. Read about connection protocols in general and specifically https and ssh. Wikipedia is a good source to start from,and use additional sources to verify anything you find confusing. Be sure you have the basic terminology down and bring questions to class. Plan to check off your questions as they are answered during class on Tuesady and then submit others in your experience reflection.

14.9. Review today’s class#

Important

This is an integrative 2x badge.

integrative 2: The PR title must be review-2023-10-26 verbatim to get the double credit

create a vocab-quiz.md file with 10 mutliple choice questions that cover topics from at least 5 different class sessions. Each question should have 4 options, 1 correct and 3 that represent a reasonable, but incorrect idea someone may have. Questions should check that a person understand the key terms of the first half of the course. For each option explain why it is/not correct in a way that would help clarify someone’s confusion if they had picked that answer instead of the correct answer.

**For this badge you must request a review from @brownsarahm **

Use the following syntax:

Question text 

- [ ] a wrong answer
- [ ] another wrong 
- [x] correct answer marked with x
- [ ] another wong

---

[section title](link/to/notes)
- explanation for first wrong
- explanation for second wrong
- key point about correct
- explantion for third wrong 

---

Next question 

14.10. Practice#

Important

This is an integrative 2x badge.

integrative 2: The PR title must be practice-2023-10-26 verbatim to get the double credit

create a midterm.md file in your kwl repo with 10 mutliple choice questions that cover topics from at least 5 different class sessions. Each question should have 4 options, 1 correct and 3 that represent a reasonable, but incorrect idea someone may have. All 10 questions should check understanding of key concepts, not only terminology or the name of a command. For each option explain why it is/not correct in a way that would help clarify someone’s confusion if they had picked that answer instead of the correct answer. Each solution block must contain a link to the applicable class notes. **For this badge you must request a review from @brownsarahm **

Use the following syntax:

Question text 

- [ ] a wrong answer
- [ ] another wrong 
- [x] correct answer marked with x
- [ ] another wong

---

[section title](link/to/notes)
- explanation for first wrong
- explanation for second wrong
- key point about correct
- explantion for third wrong 

---

Next question 

14.11. Experience Report Evidence#

No specific files.

14.12. Questions After Today’s Class#

14.12.1. Will we be incorporating remote server use with coding at all in this class? Maybe some bash scripts?#

Yes. Some in lab tomorrow and more after we learn building C code.

14.12.2. When would we want to use as many resources as possible but the reomte server would not be appropriate because we still need to connect to it?#

This is an interesting question, but I can not come up with such a scenario.

14.12.3. What is a real use case of using a remote machine to do computing?#

My research students, as do Professor Daniels’ use a server to run their code because the problems are too big to do on a laptop.

ML works that way in industry too. As does real genomics research both in universities and in pharma companies.

14.12.4. How do I set up SSH keys for secure access to a remote Git server in Git Bash?(If possible)#

We will set up ssh keys for server access in class next week and you will then be assigned to add them to git from GitHub docs.

14.12.5. Is it expensive to have a remote server?#

At face value, yes, but they’re typically shared costs. It might cost $10-50k to add a unit to the server, but then each use is not.

You could look up billing for AWS, MSFT Azure, or Google Cloud to see hourly use rates on their servers.

Also, the cost is compared to people’s salary being spent waiting around. Even \(50k, that lasts for only 3 years, could be worth it, if you consider it saves time of people who each have a salary over over \)100k/ year.

14.12.6. how do we set up our own remote server?#

This is a good explore badge topic, but not something I have personally ever done. If you actually build it out, it could be a build badge.

14.12.7. What are other large remote servers that are free to use, and are there unsafe servers?#

Most are not really free. Some give free access to students, many give free trials.

Unsafe would be like insecure in this sense that the data is not secure.

14.12.8. Would it be worthwhile to look into using a remote server for synchronizing school work between my laptop and my desktop (work beyond the scope of git/GitHub)?#

That’s not the point of an HPC or server like we used today. That is a good use for GitHub, or a cloud service that is file-focused like DropBox.

14.12.9. What’s the point of Seawulf?#

To learn about using remote servers. It’s separate from the research servers so that we doing low priority class work do not slow down people doing research. The research ones are also funded with grant dollars and Seawulf is not.

14.12.10. What would be an instance that we may need to use a remote system for in school?#

in CSC411 there is a homework server so that all students have the same environment.

14.12.11. How exactly computer intensive is graphics and pretty shapes, such that it is a necessity to cut them out for maximum computer resources available?#

This is a good explore badge topic to monitor over a bit of time how much of your compute resources are being used by visualization vs the actual computation.

You could do a small experiment where you print vs do not and time how long the code takes. This could be done in any language.

14.13. Questions after this topic in previous semesters#

14.13.1. What is meant by the term “gene symbol” ?#

For the purpose of this class it is just a bit of the content in the file. For completeness, since the file is genomic data is is a name for the gene at that line.

14.13.2. When would I use these remote servers most often?#

Remote servers are used for production code, and for large computations in any sort of science setting or machine learning setting.

14.13.3. What type of data structure for storage does the server we used today follow? How does it version track and store files?#

It is a regular unix system with a standard file system. If we want to track versions, we use git. (or another version control system)

14.13.4. Will the files on our seawulf, still be available after exiting.#

Yes, but not forever. Seawulf is a teaching server that makes no promise of backing up or keeping your data forever.

14.13.5. What are .fastq and .gz files?#

.gz is a zip file, .fastq is a plain text output of a sequencing tool.

14.13.6. Can we connect to other servers other than the URI server?#

Yes, if you have credentials, the ssh works basically the same way. We will learn one more thing that will give you more ability to use other servers next week.

14.13.7. What is the point of a remote server?#

More compute power.

14.13.8. Do the files download directly to my computer or to the remote server?#

The file goes to the system where the command is run, in this case, it went to the the server.

14.13.9. where is the ssh server downloaded on my computer?#

ssh is the connection protocol or the set of rules by which our information is sent.

The server is another computer, you were using that computer through your local terminal.

We did not create any files on your local system today.

14.13.10. Is access to the chmod command restricted on computers and servers?#

on a file by file basis typically, eys.

14.13.11. Can I access the remote files using vscode and write code?#

Theoretically yes, but a more typical workflow would be to edit large files locally running vscode on your system then send the code to the remote server to run it. You might send one file at a time directly there or push from your local system to for example GitHub and then from GitHub on the server.

14.13.12. Will I work with files more like this or locally as a software engineer?#

Maybe not this type specifically, but you will almost surely encounter a large file at some point. You will also likely encounter a server for some reason.

14.13.13. for files that are impossible to open with text editors because they’re too long, is it possible to edit parts of the file just through terminal commands?#

Yes! This is a good explore badge as well.

14.13.14. How can I zip from tthe terminal?#

zip file file see the man

14.13.15. Are you able to run a website through a terminal?#

You could launch a web server from a terminal. You can also launch a browser.

For he course website when I build the pdf, version a bash script launches a browser, uses the browsers print to pdf function and closes the browser.