From fe23901626454e58e63db413d27067a5c4074e72 Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 13:13:25 -0400 Subject: [PATCH 1/8] Update 01_Setting_up.md --- Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md index a1db3a7a..ae317227 100644 --- a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md +++ b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md @@ -51,6 +51,8 @@ _You should see `advanced_shell.zip` as part of the output to the screen._ **4.** Finally, to **decompress the folder**: +## Comment-Upen: As we are already in terminal in above step and going back to terminal in step below. it might be easier to type `unzip advanced_shell.zip` rather then going back to GUI. + * Double click on advanced_shell.zip on a mac. This will automatically inflate the folder. * If you are on windows, press and hold (or right-click) the folder, select Extract All..., and then follow the instructions. From bae7b5016f9891d07559f24dfa2e7593ff738e0a Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 14:03:54 -0400 Subject: [PATCH 2/8] Update 02_Regular_expressions.md --- .../lessons/02_Regular_expressions.md | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md index 9a8e47bf..172ccc05 100644 --- a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md +++ b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md @@ -15,6 +15,8 @@ In this lesson, we will: ## Getting Started +## Comment-Upen: Even though we mention that grep was introduced in the previous workshops,
I think participants would find a brief introduction to grep useful before we go in depth. We could then introduce our toy file `catch.txt`. + Before we get started, let's take a briefly look at the `catch.txt` file in a `less` buffer in order to get an idea of what the file looks like: ``` @@ -23,6 +25,9 @@ less catch.txt In here, you can see that we have a variety of case differences and misspellings. These differences are not exhaustive, but they will be helpful in exploring how regular expressions are implemented in `grep`. + +## Comment-Upen: Before introducing cautions and extended regular expression which we say, we won't be using too many. I think, If I am a participant and in beginner level, I would be more interested to just use grep on our catch.txt file to start with. use double quotation to start with, And explan the difference between no quotation, single quotataion, and double quotation with dummy error we can produce down below. may be we can demonstrate few simple flags we can use with grep like -c for counting, -n for printing line number, using -v to print negative results and others. We can use double quotations in all the examples and ask participants what will happen if we don't use quotation marks or use single quotation. Ask them to do it to practice grep with differnt flags. and introduce the importance of quotations and cases where those will be useful. + ## A bit more depth on grep There are two principles that we should discuss more, the `-E` option and the use of quotation marks. @@ -31,6 +36,8 @@ There are two principles that we should discuss more, the `-E` option and the us There is a `-E` option when using `grep` that allows the user to use what is considered "extended regular expressons". We won't use too many of these types of regular expressions and we will point them out when we need them. If you want to make it a habit to always use the `-E` option when using regular expressions in `grep` it is a bit more safe. 
+## Comment-Upen: I would explain what we meant by safe. + ### Quotations When using grep it is usually not required to put your search term in quotes. However, if you would like to use `grep` to do certain types of searches, it is better or *safer* to wrap your search term in quotations, and likely double quotations. Let's briefly discuss the differences: @@ -86,7 +93,7 @@ Will return: ``` C${at}CH ``` - +## Comment-Upen: Maybe this take home message can go to the bottom of the page and bullet point 1. ### grep Depth Take-Home In conclusion, while these are all mostly edge cases, we believe that it is generally a good habit to wrap the expressions that you use for `grep` in double quotations and also use the `-E` option. This practice will not matter for the overwhelming number of cases, but it is sometimes difficult to remember these edge cases and thus it is mofe safe to just build them into a habit. Of course, your preferences may vary. From 1f74c3a8a0edebdd2d0551ea03496ec2d587bf6c Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 14:13:29 -0400 Subject: [PATCH 3/8] Update 02_Regular_expressions.md --- .../lessons/02_Regular_expressions.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md index 172ccc05..910ed83c 100644 --- a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md +++ b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md @@ -472,6 +472,8 @@ C${at}CH COTCH ``` +## Comment-Upen: Having a multi-fasta file in our demo data and using that to count number of sequences with grep -c "^>" my.fasta, may be finding the starting codon "ATG" or stop codon "TAA" or extracting "cds" between ATG and TAA. using grep with -A 1 and -B 1 to get the header and sequence information of the sequence using a small part of the sequence. 
may be using using primerpairs to locate the pcr amplicon region, would give participants some basic realworld example of grep application. I mean a few of these examples but not too many. and I think this will align well with bioinformatic examples in other lesson in this workshop, just a thought. + *** ## Exercises From 03b5929478ebd18b37a717b80699e825aa2bd5a7 Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 14:41:56 -0400 Subject: [PATCH 4/8] Update 03_sed.md --- Finding_and_summarizing_colossal_files/lessons/03_sed.md | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/Finding_and_summarizing_colossal_files/lessons/03_sed.md b/Finding_and_summarizing_colossal_files/lessons/03_sed.md index 95002b65..06422b5e 100644 --- a/Finding_and_summarizing_colossal_files/lessons/03_sed.md +++ b/Finding_and_summarizing_colossal_files/lessons/03_sed.md @@ -159,6 +159,7 @@ Lastly, you can use `N~n` in the address to indicate that you want to apply the ``` sed '1~2 s/an/replacement/g' ecosystems.txt ``` +## Comment-Upen: The tilde in the above command didn't work on my computer; sed reports that it is not a valid command. I am using a Mac with an Apple M3 chip, the latest model, and I suppose many participants will have a similar configuration? ## Bioinformatics Example @@ -178,6 +179,7 @@ cat my_fastq.fq.gz | sed -n '1~4p' > quality_scores.txt ``` The first half of the pipe prints the file and the sed command grabs every forth line. Try it with the `Mov10_oe_1.subset.fq` file in the advanced_shell directory! +## Comment-Upen: There is no my_fastq.fq.gz in our training material folder. Also, just my_fastq.gz or my_fq.gz would be fine as a file name. Again, the tilde doesn't work on my machine. ## Deletion @@ -262,24 +264,26 @@ You can also ***c***hange entire lines in `sed` using the `c` command. We could sed '1 c header' ecosystems.txt ``` +## Comment-Upen: The above command doesn't work on my laptop. 
It instead prints: `sed: 1: "1 c header": command c expects \ followed by text` This can also be utilized in conjunction with the `A,B` interval syntax, but we should be aware that it will replace ALL lines in that interval with a SINGLE line. ``` sed '1,3 c header' ecosystems.txt ``` - +## Comment-Upen: Same as above; this doesn't work on my Mac. You can also replace every *n*-th line starting at *N*-th line using the `N~n` address syntax: ``` sed '1~3 c header' ecosystems.txt ``` +## Comment-Upen: The ~ in the above command is reported as invalid on my Mac. Lastly, you can also replace lines match a pattern: ``` sed '/jaguar/ c header' ecosystems.txt ``` - +## Comment-Upen: Error on the above command: `sed: 1: "/jaguar/ c header": command c expects \ followed by text` ## Multiple expressions From dedbac932be0d00f4a31f05f979cc3cda55170b8 Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 15:34:45 -0400 Subject: [PATCH 5/8] Update AWK_module.md --- .../lessons/AWK_module.md | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/Finding_and_summarizing_colossal_files/lessons/AWK_module.md b/Finding_and_summarizing_colossal_files/lessons/AWK_module.md index 997bb95e..d3bf28c2 100644 --- a/Finding_and_summarizing_colossal_files/lessons/AWK_module.md +++ b/Finding_and_summarizing_colossal_files/lessons/AWK_module.md @@ -166,6 +166,7 @@ Were seals ever observed in any of the other parks, note that `||` is or in awk +## Comment-Upen: Neither of the options above prints anything on my laptop. **** @@ -266,6 +267,7 @@ To simply extract the Yosemite data (column 3). We use the second part: ```bash awk -F "," '$2 ~ "coyote"' ``` +## Comment-Upen: Maybe add the file name animal_observations_edited.txt at the end of the above command. If someone enters it as written, the terminal will just hang, waiting for input. to separate the comma separated fields of column 3 and ask which lines have the string coyote in field 2. 
We want to print the entire comma separated list (i.e., column 3) to test our code which is the default behavior of `awk` in this case. @@ -377,6 +379,8 @@ samtools view -S -b ${sam}.sam > ${sam}.bam done ``` +## Comment-Upen: We are not running this workshop on a cluster, right? Running the above chunk, which calls samtools, might be a problem. + This actually combines a number of basic and intermediate shell topics such as [positional parameters]([positional_params.md](https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/positional_params.html)), [for loops](https://hbctraining.github.io/Training-modules/Accelerate_with_automation/lessons/loops_and_scripts.html), and `awk`! * We start with a for loop that counts from 1 to 10 @@ -391,6 +395,8 @@ With our new `awk` expertise let's take a look at that `awk` command alone! ```bash awk -v awkvar="${i}" 'NR==awkvar' samples.txt ``` +## Comment-Upen: There is no samples.txt in the workshop material folder? + We have not encountered -v yet. The correct syntax is `-v var=val` which assign the value val to the variable var, before execution of the program begins. So what we are doing is creating our own variable within our `awk` program, calling it `awkvar` and assigning it the value of `${i}` which will be a number between 1 and 10 (see for loop above). `${i}` and thus `awkvar` will be different for each loop. 
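The `-v` mechanism described above can be tried without a cluster or `samtools`. The following is a minimal sketch, assuming a hypothetical three-line `samples.txt` that is created on the fly in a temporary directory (the workshop materials may not include one); the sample names are made up for illustration.

```shell
#!/usr/bin/env bash
# Create a small stand-in for samples.txt (hypothetical sample names).
workdir=$(mktemp -d)
printf '%s\n' sampleA sampleB sampleC > "${workdir}/samples.txt"

# Loop over the line numbers. -v copies the shell variable i into the awk
# variable awkvar before the program runs, and NR==awkvar keeps only that line.
picked=""
for i in 1 2 3; do
    sam=$(awk -v awkvar="${i}" 'NR==awkvar' "${workdir}/samples.txt")
    echo "line ${i}: ${sam}"
    picked="${picked}${sam} "
done
```

Each pass through the loop prints one sample name, which is the value the lesson's loop would go on to use in `${sam}.sam` and `${sam}.bam`.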
From 8e7309c885dd6710711b5ed93b252330bd65aef6 Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 15:37:12 -0400 Subject: [PATCH 6/8] Update 01_Setting_up.md --- Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md index ae317227..90cbbde0 100644 --- a/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md +++ b/Finding_and_summarizing_colossal_files/lessons/01_Setting_up.md @@ -51,7 +51,7 @@ _You should see `advanced_shell.zip` as part of the output to the screen._ **4.** Finally, to **decompress the folder**: -## Comment-Upen: As we are already in terminal in above step and going back to terminal in step below. it might be easier to type `unzip advanced_shell.zip` rather then going back to GUI. +## Comment-Upen: As we are already in the terminal, it might be easier to type `unzip advanced_shell.zip` rather than going back to the GUI. * Double click on advanced_shell.zip on a mac. This will automatically inflate the folder. * If you are on windows, press and hold (or right-click) the folder, select Extract All..., and then follow the instructions. 
From 498a5cdb7d5fc52d261a1217e921bd6cffe3817f Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 15:40:21 -0400 Subject: [PATCH 7/8] Update 02_Regular_expressions.md --- .../lessons/02_Regular_expressions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md index 910ed83c..23ea0a26 100644 --- a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md +++ b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md @@ -26,7 +26,7 @@ less catch.txt In here, you can see that we have a variety of case differences and misspellings. These differences are not exhaustive, but they will be helpful in exploring how regular expressions are implemented in `grep`. -## Comment-Upen: Before introducing cautions and extended regular expression which we say, we won't be using too many. I think, If I am a participant and in beginner level, I would be more interested to just use grep on our catch.txt file to start with. use double quotation to start with, And explan the difference between no quotation, single quotataion, and double quotation with dummy error we can produce down below. may be we can demonstrate few simple flags we can use with grep like -c for counting, -n for printing line number, using -v to print negative results and others. We can use double quotations in all the examples and ask participants what will happen if we don't use quotation marks or use single quotation. Ask them to do it to practice grep with differnt flags. and introduce the importance of quotations and cases where those will be useful. +## Comment-Upen: Before introducing the cautions and extended regular expressions (which we say we won't use much),
I think a beginner-level participant would be more interested in first trying grep on the catch.txt file with simple examples. We could then explain the difference between no quotation marks, single quotation marks, and double quotation marks, using the dummy errors we produce further down. Maybe we can demonstrate a few simple grep flags, such as -c for counting, -n for printing line numbers, -v for printing non-matching lines, and others. We can use double quotation marks in all the examples and ask participants what will happen if we use no quotation marks or single quotation marks, then have them try it to practice grep with different flags. That would introduce the importance of quotation marks and the cases where they are useful. Just a thought! ## A bit more depth on grep From bbcdf31df1d63f1c034767e96a503835a0a26864 Mon Sep 17 00:00:00 2001 From: Upendra Raj Bhattarai Date: Wed, 3 Apr 2024 15:43:04 -0400 Subject: [PATCH 8/8] Update 02_Regular_expressions.md --- .../lessons/02_Regular_expressions.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md index 23ea0a26..fb86f35e 100644 --- a/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md +++ b/Finding_and_summarizing_colossal_files/lessons/02_Regular_expressions.md @@ -472,7 +472,7 @@ C${at}CH COTCH ``` -## Comment-Upen: Having a multi-fasta file in our demo data and using that to count number of sequences with grep -c "^>" my.fasta, may be finding the starting codon "ATG" or stop codon "TAA" or extracting "cds" between ATG and TAA. using grep with -A 1 and -B 1 to get the header and sequence information of the sequence using a small part of the sequence. may be using using primerpairs to locate the pcr amplicon region, would give participants some basic realworld example of grep application. I mean a few of these examples but not too many. 
and I think this will align well with bioinformatic examples in other lesson in this workshop, just a thought. +## Comment-Upen: Having a multi-FASTA or multi-FASTQ file in our demo data would give participants some basic real-world examples of grep in use: counting the number of sequences with `grep -c "^>" my.fasta`, finding the start codon "ATG" or stop codon "TAA", extracting the "cds" between ATG and TAA, using grep with -A 1 and -B 1 to get the header and sequence from a small part of the sequence, and maybe using primer pairs to locate the PCR amplicon region. I mean a few of these examples, but not too many. I think this would align well with the bioinformatics examples in other lessons in this workshop. Just a thought. + ***
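A few of the FASTA ideas in the comment above could look like the following. This is only a sketch under stated assumptions: the file `my.fasta` is built on the fly with made-up sequences, the fragment `AAATTT` is an arbitrary search term, and every sequence is assumed to sit on a single line (wrapped FASTA files would need to be unwrapped first).

```shell
#!/usr/bin/env bash
# Build a tiny multi-FASTA file (made-up sequences, for illustration only).
workdir=$(mktemp -d)
fasta="${workdir}/my.fasta"
printf '%s\n' \
    '>seq1' 'ATGGCCTAA' \
    '>seq2' 'CCGGTTAGC' \
    '>seq3' 'ATGAAATTTTAA' > "${fasta}"

# Count records: every FASTA header line starts with ">".
n_records=$(grep -c "^>" "${fasta}")

# Count sequence lines that begin with the start codon ATG.
n_atg=$(grep -v "^>" "${fasta}" | grep -c "^ATG")

# Find the sequence containing the fragment AAATTT and print it together
# with its header (-B 1 prints one line of context before each match).
hit=$(grep -B 1 "AAATTT" "${fasta}")

echo "records: ${n_records}"
echo "sequences starting with ATG: ${n_atg}"
echo "${hit}"
```

Double quotation marks are used around every pattern, in line with the habit the lesson recommends.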