
Commit fe25c0a

first round of updates
1 parent 54023e3 commit fe25c0a

23 files changed

+15792
-620
lines changed

bonus-git-and-github.qmd

Lines changed: 365 additions & 0 deletions
@@ -0,0 +1,365 @@
1+
---
2+
title: "Running a Reproducible Analysis"
3+
subtitle: "Bonus content: Git & GitHub"
4+
author: "Brian Gural, Justin Landis"
5+
format:
6+
html:
7+
number-sections: true
8+
toc: true
9+
editor: visual
10+
---
11+
12+
## Reproducible science... *in-silico*??
13+
14+
- Bioinformaticians are people too
15+
- We need to make sure our research is well documented and reproducible just like bench scientists
16+
- Projects can get complex, messy, and very computationally demanding
17+
18+
## How can computational projects get derailed?
19+
20+
It turns out that computational biologists need to be careful with how they manage their code and data. [Leaving everything on your personal/lab computer comes with a lot of risk.]{.underline}
21+
22+
You can reduce the risk of a mishap by housing data on UNC's cloud computing service, **Longleaf**, and putting your code on **GitHub**. Both of these provide you with backups that can be accessed from anywhere with an internet connection.
23+
24+
```{r dog_gif,echo=FALSE, fig.align = 'center', out.width = "70%", fig.cap = "The impending destruction of their laptop doesn't bother this dog since they use remote computing and GitHub"}
25+
knitr::include_graphics("data/project-day-1-files/thisisfine.gif")
26+
```
27+
28+
Don't think it's worth it? Here are some moments that made UNC researchers wish they had used these tools:
29+
30+
::: panel-tabset
31+
## Broken laptops, crushed dreams
32+
33+
"...there was a time where my computer just stopped letting me log in and needed to be wiped so if I wasn't using Longleaf I would have lost everything. I did lose a nice powerpoint."
34+
35+
"In undergrad I was using local storage only on the desktop in my advisor's office. There was some big failure with IT one day (tbh I still don't know what happened) and I lost all my code"
36+
37+
"Our collaborator lost the hard drive with the raw RNAseq data, dooming my first 1st author publication. His collaborator saved the day with a backup he had on Longleaf"
38+
39+
## "How did you make this figure from 2018?"
40+
41+
"An undergrad left their code in `/pine`\* when they went home over the summer and it got deleted, so they had to re-write their code from scratch (which delayed the project as a whole)"
42+
43+
"Only having GitHub as a memory of projects I did in grad school, being able to search and find bits of code so I don't have to rewrite them."
44+
45+
"People emailing 2 years after a paper is published asking about obscure details of simulation etc."
46+
47+
\*`pine` was UNC's *temporary* data storage space
48+
:::
49+
50+
## Good computational practices 101
51+
52+
Computational projects ought to be approached with the same expectations of rigor and reproducibility expected of a bench project. This means that the work needs to be [well documented]{.underline}, things need to be [properly stored]{.underline}, and everything should be [organized clearly enough for someone to reproduce it]{.underline}.
53+
54+
Thankfully, we're not the first researchers to run into these problems. A whole suite of tools and services exist to manage these issues:
55+
56+
::::::: columns
57+
::: {.column width="60%"}
58+
- Documenting everything
59+
60+
- Storing data & getting resources
61+
62+
- Keeping your R project organized
63+
:::
64+
65+
::: {.column width="5%"}
66+
<!-- empty column to create gap -->
67+
:::
68+
69+
::: {.column width="30%"}
70+
`git`/GitHub
71+
72+
Longleaf
73+
74+
RStudio Projects
75+
:::
76+
77+
::: {.column width="5%"}
78+
<!-- empty column to create gap -->
79+
:::
80+
:::::::
81+
82+
## Suffering from manual version control? `git` can help.
83+
84+
What is **version control** exactly? At its core, it's a way of keeping track of the changes made to files. Before now, you've probably used a system like this:
85+
86+
::: panel-tabset
87+
## Before GitHub
88+
89+
``` markdown
90+
paper_draft1.doc
91+
paper_draft2.doc
92+
paper_reviewed_by_john.doc
93+
paper_draft3_comments_incorporated.doc
94+
paper_final_draft.doc
95+
paper_final_reviewed.doc
96+
paper_final_submission.doc
97+
paper_final_submission_revised.doc
98+
paper_final_submission_revised_v2.doc
99+
paper_published_version.doc
100+
```
101+
102+
```{r charlie_gif,echo=FALSE, fig.align = 'center', out.width = "70%", fig.cap = "Charlie explains how his file naming system makes perfect sense"}
103+
knitr::include_graphics("data/project-day-1-files/charlie-day.gif")
104+
```
105+
106+
## After GitHub
107+
108+
With `git`, **you can update a file while keeping a detailed log of the changes**.
109+
110+
``` markdown
111+
* 9a2b3c4 - Add published version of the paper (2024-04-29)
112+
* 8f7e6d5 - Revise submission after additional feedback, version 2 (2024-04-25)
113+
* 7d6c5b4 - Update submission based on post-submission feedback (2024-04-20)
114+
* 6c5b4a3 - Prepare final version for submission (2024-04-15)
115+
* 5b4a392 - Finalize draft after thorough review (2024-04-10)
116+
* 4a39881 - Incorporate feedback from final review (2024-04-05)
117+
* 3928717 - Update draft, incorporate feedback from John (2024-04-01)
118+
* 2871606 - Add second draft of the paper (2024-03-28)
119+
* 1760505 - Initial draft of the paper (2024-03-25)
120+
```
121+
122+
```{r thumbs_up_gif,echo=FALSE, fig.align = 'center', out.width = "70%", fig.cap = "Kevin, age 6, finds GitHub to be totally radical"}
123+
knitr::include_graphics("data/project-day-1-files/thumbs_up.gif")
124+
```
125+
:::
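
A log like the one in the "After GitHub" tab can be reproduced in a scratch repository. This is a minimal sketch, not the authors' own workflow: the file name `paper.md`, the commit messages, and the `Example User` identity are placeholders chosen for illustration.

```bash
set -euo pipefail

# Work in a throwaway directory so nothing real is touched
repo="$(mktemp -d)"
cd "$repo"
git init --quiet

# Placeholder identity, stored only in this scratch repository
git config user.name "Example User"
git config user.email "user@example.com"

# First version of the file
echo "Draft 1" > paper.md
git add paper.md
git commit --quiet -m "Initial draft of the paper"

# Overwrite the same file; -a stages changes to files git already tracks
echo "Draft 2" > paper.md
git commit --quiet -am "Add second draft of the paper"

echo "Draft 2 with John's comments" > paper.md
git commit --quiet -am "Update draft, incorporate feedback from John"

# One line per commit: short hash, message, date
git log --pretty=format:'* %h - %s (%ad)' --date=short
echo

n_commits="$(git rev-list --count HEAD)"
```

There is only ever one `paper.md` on disk, yet every earlier version remains recoverable from the history.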
126+
127+
### Go on, `git`!
128+
129+
`git` is a version control system used to record changes to files. [GitHub]{.blue} uses `git` to help users host/review code and manage projects.
130+
131+
[`git`/GitHub matter because they:]{.underline}
132+
133+
- Track every version of every script
134+
135+
- Publicly document your work
136+
137+
- Allow for new versions of projects to `branch`
138+
139+
- Make it easy to collaborate
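
To make the `branch` idea concrete, here is a small sketch in a scratch repository. The branch name `try-new-normalization`, the file `analysis.R`, and the `Example User` identity are hypothetical.

```bash
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# A first commit on the default branch
echo 'plot(1:10)' > analysis.R
git add analysis.R
git commit --quiet -m "Add analysis script"

# Create a new branch and switch to it in one step
git checkout --quiet -b try-new-normalization

# Commits made now land on the new branch only
echo '# experimental change' >> analysis.R
git commit --quiet -am "Try a different normalization"

# List all branches; the current one is marked with an asterisk
git branch

current_branch="$(git rev-parse --abbrev-ref HEAD)"
```

The original branch is untouched, so an experiment that doesn't pan out can simply be abandoned.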
140+
141+
## Longleaf: The darling of UNC bioinformaticians
142+
143+
Longleaf is UNC's high-performance computing (HPC) cluster. It's basically a ton of computers and storage. It's accessible from anywhere with an internet connection and offers a lot of storage: labs typically start with 40 TB, and users get 10 TB. You've also been using it this whole time! RStudio OnDemand is hosted on Longleaf.
144+
145+
::: callout-tip
146+
There are a ton of reasons to use LL:
147+
148+
- Many scripts can be run at once, with your computer off
149+
150+
- It has A LOT more resources than a typical computer
151+
152+
- Easy to share files!
153+
154+
- Dedicated technical support via ITS
155+
:::
156+
157+
## Connecting Longleaf and Github
158+
159+
GitHub and Longleaf can each be daunting to novice programmers, so let's walk through how to set them up together.
160+
161+
The setup is going to amount to three general steps:
162+
163+
1. Introduce our GitHub and Longleaf accounts to each other with something called an `SSH` key
164+
    - SSH (**s**ecure **sh**ell) keys are cryptographic key pairs that let two computer systems verify each other's identity without a password
165+
2. Make our first repository (project) on Github
166+
3. Learn how to get scripts from Github to Longleaf and update changes we made on Longleaf back to Github
167+
168+
Before we can start that, we're going to need to know just a tad about **terminals** and **Bash**. These topics could be a whole course unto themselves, but in a nutshell you can think of them like this:
169+
170+
**Terminals**, also called **command lines**, are text-based software for interacting with your computer. RStudio has a built-in terminal, on the tab next to "Console".
171+
172+
**Bash** is a shell: a program that understands and carries out the instructions you type in the terminal. Writing those instructions is usually called "shell scripting". Bash is very common on Linux and Mac computers. Longleaf uses Linux, so working on Longleaf means using a bit of Bash.
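
To give a feel for what typing Bash into a terminal looks like, here is a short sketch of a few everyday commands. The directory name `practice` and file name `note.txt` are made up for illustration, and everything happens in a temporary directory.

```bash
set -euo pipefail

# Print the directory we are currently in
pwd

# Make a temporary scratch directory and move into it
scratch="$(mktemp -d)"
cd "$scratch"

# Create a new directory, then step inside it
mkdir practice
cd practice

# Write a line of text into a file
echo "hello from Bash" > note.txt

# List the files here, then print the file's contents
ls
cat note.txt
```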
173+
174+
::: callout-tip
175+
This course isn't meant to teach you shell scripting, and you aren't expected to fully understand some of the Bash commands we'll run. If you'd like to learn more, please refer to the following cheat sheets on [common bash commands](https://github.com/RehanSaeed/Bash-Cheat-Sheet) and [scripting in bash](https://devhints.io/bash) as helpful resources!
176+
:::
177+
178+
## Linking via `SSH`
179+
180+
Let's start by getting the terminal open. In the top left, click View -\> Move Focus To Terminal. It should've opened in the panel that also contains tabs for "Console" and "Background Jobs".
181+
182+
Next, let's find the SSH key associated with your Longleaf account. We'll run the following bash command in the terminal that we just opened (be aware that terminals are notoriously finicky with copy and paste): `cat ~/.ssh/id_rsa.pub`
183+
184+
::: callout-caution
185+
Do [**NOT**]{.underline} forget the `.pub` extension in the command above. RSA keys come in pairs: a private key and a public key. The private key (i.e. the file that does not have the `.pub` extension) should [**NEVER**]{.underline} be shared. Sharing your private key will allow malicious actors to interact with services that cache your public key [as if they were you]{.underline}!
186+
:::
187+
188+
This prints your public SSH key in the terminal. Select the whole line (it should start with `ssh-rsa`) and copy it to your clipboard.
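
If `cat` reports that the file does not exist, the key pair may not have been created yet. The sketch below shows what generating an RSA key pair looks like, assuming `ssh-keygen` (part of OpenSSH) is available. It writes to a temporary directory with an empty passphrase purely for demonstration. For a real key, drop the `-f` and `-N` options so that `ssh-keygen` prompts for the location (the default is `~/.ssh/id_rsa`) and a passphrase.

```bash
set -euo pipefail

# Demonstration only: keep the generated keys out of ~/.ssh
demo_dir="$(mktemp -d)"

# -t rsa   : key type
# -b 4096  : key length in bits
# -N ""    : empty passphrase (for this demo only)
# -f PATH  : where to write the private key; the public key gets ".pub" appended
ssh-keygen -q -t rsa -b 4096 -N "" -f "$demo_dir/id_rsa"

# Two files appear: the private key and the public key
ls "$demo_dir"

# The public key is a single line that begins with the key type
cut -c 1-7 "$demo_dir/id_rsa.pub"
```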
189+
190+
Now, let's go over to Github and set up the key.
191+
192+
1. Go to your profile settings on GitHub and select the "SSH and GPG keys" section
193+
194+
2. Add new key
195+
196+
3. Name the key (should remind you that this is the key for Longleaf)
197+
198+
4. Paste the copied ssh key from the `cat` step above
199+
200+
5. Create!
201+
202+
With that done, we need to tell `git` on Longleaf who we are, so that our commits are attributed to our GitHub account:
203+
204+
1. Go back to the RStudio terminal
205+
206+
2. `git config --global user.name "your-github-username"`
207+
208+
- Keep the quotes around your username!
209+
210+
3. `git config --global user.email your.email.linkedwithgithub`
211+
212+
- No quotes this time!
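
To confirm that the two settings were saved, they can be read back. A minimal sketch; the `(not set)` fallback text is just a label used here, not something `git` prints.

```bash
set -euo pipefail

# --get prints the stored value; if nothing is stored, fall back to a label
name="$(git config --global --get user.name || echo "(not set)")"
email="$(git config --global --get user.email || echo "(not set)")"

echo "user.name:  $name"
echo "user.email: $email"
```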
213+
214+
## Making a repository {#sec-repo}
215+
216+
Great, we've connected Longleaf and Github (a herculean task for a beginner programmer!). What we'll want to do next is make a repository (repo), which you could think of as a self-contained project folder. Let's go back to Github:
217+
218+
1. In the "Repositories" tab of your profile, click "New"
219+
220+
2. Give it a name, maybe "example_repo"
221+
222+
3. Check the "Add a README file" box
223+
224+
4. Add a `.gitignore` with an R template
225+
226+
5. Create the repo!
227+
228+
We'll explain the significance of the README and `.gitignore` steps a bit later. For now, let's go over to our new repo on our profile. To get it onto Longleaf, we can:
229+
230+
1. Click on the green "Code" box
231+
232+
2. Click on "SSH"
233+
234+
3. Copy the SSH URL right below that! It should end in `.git`
235+
236+
4. Let's create a new directory to house our experimental projects. In the terminal window of RStudio, do the following:
237+
238+
```{bash}
239+
#| eval: false
240+
cd ~ #set working directory to home directory
241+
mkdir learning-R #create a new directory "learning-R"
242+
cd learning-R #set "learning-R" as our working directory
243+
```
244+
245+
::: {.callout-note collapse="true" title="Git Project Organization"}
246+
Git repositories may be cloned anywhere on your file system that makes sense to you. The above is a simple example of project organization in which we may create more repositories under the `learning-R` directory. Ultimately the choice is yours on how you wish to organize your git projects.
247+
:::
248+
249+
5. Now we can clone our new github repository locally!
250+
251+
```{bash}
252+
#| eval: false
253+
# add what you copied in step 3 behind "clone".
254+
# Your command may look something like this
255+
git clone git@github.com:<username>/<reponame>.git
256+
```
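
After cloning, it is worth checking what arrived. The sketch below is self-contained, so it uses a small repository on the local disk as a stand-in for the one on GitHub. The paths, the name `example_repo`, and the `Example User` identity are assumptions for illustration. With a real clone, only the last four commands are needed.

```bash
set -euo pipefail
workdir="$(mktemp -d)"

# Stand-in for the repository that lives on GitHub
git init --quiet "$workdir/remote_repo"
cd "$workdir/remote_repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"
echo "# example_repo" > README.md
git add README.md
git commit --quiet -m "Initial commit"

# Clone it into a projects directory, as in the steps above
mkdir "$workdir/learning-R"
cd "$workdir/learning-R"
git clone --quiet "$workdir/remote_repo" example_repo

# Step into the clone and look around
cd example_repo
ls -a            # shows README.md and the hidden .git/ directory
git remote -v    # shows where "origin" points
git status       # reports the current branch and any pending changes
```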
257+
258+
## RProjects
259+
260+
In this section, we will be discussing recommendations for organization. From @sec-repo, we can see that our project is fairly empty:
261+
262+
```
263+
.
264+
├── .git/
265+
├── .gitignore
266+
└── README.md
267+
```
268+
269+
It is up to us to bring some organization to our project!
270+
271+
::: callout-note
272+
The hidden directory `.git/` and file `.gitignore` will be covered next, in @sec-git.
273+
:::
274+
275+
Organization can greatly improve the experience of coding. It is a way for us to show our future selves some kindness as well as anyone who may maintain our work in the future.
276+
277+
Below is an example of how we may set up directories for our project:
278+
279+
```
280+
.
281+
├── .git/
282+
├── .gitignore
283+
├── data/
284+
├── outputs/
285+
├── R/
286+
├── scripts/
287+
└── README.md
288+
```
289+
290+
Here, we have created 4 new directories: `data/`, `outputs/`, `R/` and `scripts/`. These directory names imply something to the viewer about their contents and provide quick navigation.
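
These four directories can be created from the terminal in one command. A minimal sketch, run in a temporary stand-in for the repository so that nothing real is modified:

```bash
set -euo pipefail

# Stand-in for the cloned repository
project="$(mktemp -d)/example_repo"
mkdir -p "$project"
cd "$project"

# -p means "no error if the directory already exists"
mkdir -p data outputs R scripts

# List only the directories
ls -d */
```

Note that `git` tracks files, not directories, so an empty directory will not show up on GitHub until it contains at least one committed file.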
291+
292+
`data/` and `outputs/` are perhaps the most self explanatory. We will reserve `data` for files we may read in for our analysis, and `outputs` as a place for us to store our results.
293+
294+
`R/` and `scripts/` may be a bit nuanced. Perhaps the project author will keep helpful reusable R code in the `R/` directory, while the `scripts/` directory may be analysis workflows that are called from the command line.
295+
296+
A project may, or may not, require this level of granularity. You may choose different directory names as well. However, directory names should remain easy to interpret, and the files inside should match what the name leads a reader to expect. Thinking critically about your project structure will ultimately save you time when you return to the project.
297+
298+
Here are some general guidelines to follow:
299+
300+
**Organization Do's**
301+
302+
- Place files in some sort of relative structure
303+
- Use descriptive file and directory names
304+
- Format dates with `YYYY-MM-DD` <!-- files will be ordered by date by your file system -->
305+
306+
**Organization Do Not's**
307+
308+
- Place all files at the top level of the repository
309+
- Stratify files across too many directories <!-- can make this just as hard to find -->
310+
- Use spaces in file names! <!-- escaping spaces on the command line with git is annoying -->
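
As a small illustration of the naming advice, the current date can be generated in `YYYY-MM-DD` form and built into a file name. The name `volcano-plot.png` is a hypothetical example.

```bash
set -euo pipefail

# %F is shorthand for %Y-%m-%d, e.g. 2024-04-29
today="$(date +%F)"

# Descriptive, dated, and free of spaces
outfile="outputs/${today}_volcano-plot.png"
echo "$outfile"
```

Because the year comes first, files named this way sort chronologically in any alphabetical listing.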
311+
312+
### README! ... Please? 😢
313+
314+
As the name implies, this file is intended to be read by anyone who happens upon your repository. Think of this as extra documentation that you, as the project owner, may use to communicate aspects of the repository. Include helpful information about your project such as:
315+
316+
- A brief description of the project
317+
- What is the scope of the project?
318+
- What problem does it solve?
319+
- Intended use cases!
320+
- Examples of what it should not be used for!
321+
- How to get started or use the project
322+
- How to contribute or report a bug
323+
- Project Status: active development, stable, or abandoned
324+
325+
If your repository is being viewed on [github.com](https://github.com), then the top-level `README.md` is displayed like a landing page.
326+
327+
::: callout-tip
328+
You may also place a `README.md` within subdirectories, allowing you to communicate more specific information within.
329+
330+
Here are examples of what you may include in a `data/README.md` file
331+
332+
- explanation of what data was collected and how it was processed.
333+
334+
- documentation of the columns of a `.csv` file. What the acronyms and data labels mean and how they should be interpreted.
335+
:::
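
As a sketch of what a `data/README.md` might contain, the snippet below writes one from the terminal. The file `counts.csv` and its column names are invented for illustration and should be replaced with a description of your own data.

```bash
set -euo pipefail

# Temporary stand-in for the project directory
project="$(mktemp -d)"
mkdir -p "$project/data"

# Everything between <<'EOF' and EOF is written to the file verbatim
cat > "$project/data/README.md" <<'EOF'
# data/

Input files read by the analysis scripts.

## counts.csv (hypothetical example file)

| Column    | Meaning                            |
|-----------|------------------------------------|
| gene_id   | Gene identifier                    |
| sample_id | Sample label, matches the metadata |
| count     | Raw read count (not normalized)    |
EOF

cat "$project/data/README.md"
```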
336+
337+
## Using git {#sec-git}
338+
339+
- `.git/` is a hidden directory where `git` stores the repository's history and settings. It is generally wise to leave this directory alone.
340+
341+
- `.gitignore` uses rules to exclude files by name pattern or location
342+
343+
- Don't upload data or large files!
344+
345+
- Changes need to be staged with `git add`
346+
347+
    - `git add .` stages all changes in the current directory that are not excluded by the `.gitignore`
348+
349+
- `git add -i` opens an interactive adding session
350+
351+
- Commit staged changes with a note to your future self
352+
353+
- `git commit -m "Hi future me, this is what I changed"`
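
The ignore, stage, and commit steps fit together as sketched below, in a scratch repository. The file names and the `Example User` identity are placeholders. The single rule `data/` is a deliberately simple `.gitignore`, much shorter than GitHub's R template.

```bash
set -euo pipefail

repo="$(mktemp -d)"
cd "$repo"
git init --quiet
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# One file we want to track and one we don't
mkdir data scripts
echo "sample,count" > data/big_counts.csv
echo 'print("hello")' > scripts/analysis.R

# Ignore everything under data/
echo "data/" > .gitignore

# Stage everything that is not ignored, then review what was staged
git add .
git status --short

# Record the staged changes with a message
git commit --quiet -m "Add analysis script and ignore the data directory"

# Files that git is now tracking; nothing from data/ should be listed
tracked="$(git ls-files)"
echo "$tracked"
```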
354+
355+
::: {.callout-note collapse="true" title="note: Using `vim`"}
356+
If you were to write\
357+
`git commit`\
358+
and omit the `-m "your message here"` portion, `git` will force you to write a message by opening a text editor. The default text editor is usually `vim`, which may be tricky to navigate.
359+
360+
To enter insert mode, press `i` on your keyboard, and then you may begin writing your message. When you have finished, press `ESC` to return to command mode. To exit vim, type `:wq` and press Enter, which tells vim to write your changes and quit.
361+
:::
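
If `vim` feels awkward, `git` can be told to open a different editor for commit messages. A minimal sketch, assuming the `nano` editor is installed on the system you are working on (an assumption; check with `which nano`):

```bash
set -euo pipefail

# Use nano whenever git needs a message typed in an editor
git config --global core.editor "nano"

# Read the setting back to confirm it was saved
git config --global --get core.editor
```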
362+
363+
- Commits are pushed to branches
364+
365+
- For the main branch, use `git push origin main`
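
The push step can be tried safely without touching GitHub. In the sketch below, a bare repository on the local disk plays the role of the GitHub remote. The paths and the `Example User` identity are assumptions for illustration. With a real clone from GitHub, only the `git add`, `git commit`, and `git push` lines are needed.

```bash
set -euo pipefail
workdir="$(mktemp -d)"

# Stand-in for GitHub: a bare repository (history only, no working files)
git init --quiet --bare "$workdir/remote.git"

# Clone it; the clone remembers the stand-in as "origin"
git clone --quiet "$workdir/remote.git" "$workdir/example_repo"
cd "$workdir/example_repo"
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Make a change, stage it, and commit it
echo "# example_repo" > README.md
git add README.md
git commit --quiet -m "Add README"

# Make sure the branch is called "main", whatever git's default name is
git branch -M main

# Send the commits on "main" to the remote called "origin"
git push --quiet origin main

# Check from the remote's side that the commit arrived
pushed="$(git --git-dir="$workdir/remote.git" log --oneline main)"
echo "$pushed"
```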
