202 changes: 202 additions & 0 deletions pretext/AdditionalTopics/BigO.ptx
@@ -0,0 +1,202 @@
<?xml version="1.0"?>
<section xml:id="extratopics-bigo">
<title>Big O Analysis</title>

<p>A common question that comes up when programming is: "How long will my program take to run?" Even if a program produces the correct output, it is unacceptable if it takes
too long to finish. There is a problem, though: it is impossible to reliably say exactly how long a program will take to run.
It depends on too many things. The capabilities of the computer running the code, what else is running on that computer, and the size of the input are just
some of the factors that would need to be considered.
</p>

<p>To simplify this issue, we'll give up on estimating exactly how long a program will run and instead look at the biggest factor that affects
its run time: the size of the input. If we wrote a program that ran for 60 seconds on 100 megabytes of input data, how should we expect the program to
react to 200 megabytes of input data? Maybe it would run in 120 seconds (twice the data for twice the run time)? Maybe it would still run in 60 seconds,
assuming the extra data isn't used. Or maybe the program would run for far longer. The issue is that we don't know the relationship between the size
of the input data and the behavior of the program.</p>

<p>This is where <term>Big O Analysis</term> comes in. Big O is a notation computer scientists use to describe the relationship between the size
of the input data and the behavior of the program. Big O expressions are written like mathematical functions of a variable n, where n represents
the size of the input data provided to the program. The Big O function tells us how n affects the time the program will take to complete.</p>

<p>Consider the example from before: a program takes 60 seconds to run on 100 megabytes of input data, and we'd like to know (roughly)
how long it might take on 200 megabytes of input data. If we know the run time of the program is the function f(n) = n^2, with n being
the size of the data, we now have enough information to make a guess. If n is doubled, then the time the program runs for will quadruple! (2*n)^2 = 4 * n^2.</p>
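<p>The scaling argument above can be sketched in a few lines of Python. This is an illustration only; the function name and the baseline figures (60 seconds at 100 megabytes) come from the example, and the program itself is hypothetical:</p>

```python
def quadratic_time_estimate(n, baseline_n=100, baseline_seconds=60):
    """Estimate the run time of a hypothetical O(n^2) program,
    given that it takes 60 seconds on 100 megabytes of input."""
    scale = (n / baseline_n) ** 2
    return baseline_seconds * scale

# Doubling the input quadruples the run time: (2*n)^2 = 4 * n^2
print(quadratic_time_estimate(100))  # 60.0 seconds
print(quadratic_time_estimate(200))  # 240.0 seconds
```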

<p>The formal mathematical notation for Big O is denoted with a capital O (a <em>big o</em>!) followed by parentheses.
Inside of the <c>O()</c> is most commonly some term of n. In our previous example, we would say the program has O(n^2) behavior.</p>

<p>Different functions of n have different <em>magnitudes</em>, which helps us to quantify how quick or slow an algorithm is relative to the input size <c>n</c>.
From left to right, left being the quickest time and right being the slowest time, we typically see these complexities:</p>

<p><c>O(1)</c>, <c>O(log n)</c>, <c>O(n)</c>, <c>O(n log n)</c>, <c>O(n^2)</c>, <c>O(n^3)</c>, <c>O(2^n)</c>, <c>O(n!)</c>.</p>

<p>Big O is like a limit in that only the most significant terms matter as <c>n</c> gets bigger and bigger. We typically expect n to be very, VERY large because
small inputs aren't as strongly affected by time limits. If a program takes 0.001 seconds to run with most normal data, is it really a big deal if it takes 0.004
seconds on occasion? What if we were dealing with a program that had to run for a month though? Now that factor of four starts to hurt a lot more.</p>

<p>There is another important aspect that we have ignored up to this point: programs can often have wildly different behavior depending on their input.
Consider a contrived example:</p>
<program language="python">
<input>
var = input()
if 'a' in var:
    while True:
        print("run forever!")
else:
    print("done")
</input>
</program>

<p>In this program, the size of the input doesn't matter as much as whether the input string contains a letter "a" or not. If it does, the program runs forever.
If it doesn't, the program ends almost immediately. How do we reconcile this with our Big O notation? The answer is to be a pessimist. We adopt the assumption that
everything that can happen to slow down our program will happen. In the code above, we assume that the input ALWAYS will contain an "a". This assumption is broadly
known as the "worst case". Big O notation uses this assumption in every instance you will see it (at least in this class).</p>
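<p>The worst-case idea is easy to see in a simple linear search. This is a sketch of our own; the function name and the example data are ours:</p>

```python
def linear_search(items, target):
    """Return the index of target in items, or -1 if absent.
    Best case: target is the first item (one step).
    Worst case: target is absent, and we examine all n items.
    Big O assumes the worst case, so this is O(n)."""
    for i, item in enumerate(items):
        if item == target:
            return i
    return -1

data = [4, 8, 15, 16, 23, 42]
print(linear_search(data, 4))    # 0  (best case: found immediately)
print(linear_search(data, 99))   # -1 (worst case: scanned everything)
```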

<p>Let's look at some more examples:</p>
<program language="python">
<input>
sum = 1 + 1
print(sum)
</input>
</program>
<p>This code has a Big O of <c>O(1)</c>, also referred to as <term>constant time</term>. This is because the program does nothing with its input. In fact, it doesn't
even take input! Constant time operations are typically things in code which <em>do not</em> loop. A constant time program suggests it will always finish in a
consistent amount of time, no matter what happens.</p>

<p>Now, let's check out an example with a loop:</p>
<program language="python">
<input>
def example_func(n):
    for i in range(n):
        print(i)
</input>
</program>
<p>As you can see, this function simply prints the numbers from <c>0</c> to <c>n - 1</c>. Each print takes a little time, so a larger n means a longer program run time.
We denote the complexity of <c>example_func</c> as <c>O(n)</c>,
because whether <c>n = 100</c> or <c>n = 10000000</c>, the run time grows in direct proportion to <c>n</c>.</p>

<p>In the last code example, <c>O(n)</c> was the complexity for <em>all</em> cases, because the loop <em>always</em> goes to <c>n</c>.</p>

<image source="ExtraTopics/Figures/complexity.png" width="75%" alt="Big O Complexity Graph"/>

<p>This figure shows complexities as a graph and which ones are considered "desirable" or at least "acceptable". Context mostly determines if these are "good" terms or not,
but do strive to never write something worse than <c>O(n^3)</c>!</p>

<p>It may be difficult to appreciate the implications of these terms when first seeing them. Let's say we have algorithms with the following complexities, and they
all run in the same time (1 millisecond) for n = 10. This table shows what happens as we increase the size of the input:</p>

<table>
<tabular>
<row>
<cell>
n
</cell>
<cell>
<c>O(log(n))</c>
</cell>
<cell>
<c>O(n)</c>
</cell>
<cell>
<c>O(n^3)</c>
</cell>
<cell>
<c>O(2^n)</c>
</cell>
</row>
<row>
<cell>
10
</cell>
<cell>
1 ms
</cell>
<cell>
1 ms
</cell>
<cell>
1 ms
</cell>
<cell>
1 ms
</cell>
</row>
<row>
<cell>
11
</cell>
<cell>
1 ms
</cell>
<cell>
1.1 ms
</cell>
<cell>
~1.3 ms
</cell>
<cell>
2 ms
</cell>
</row>
<row>
<cell>
20
</cell>
<cell>
1.3 ms
</cell>
<cell>
2 ms
</cell>
<cell>
8 ms
</cell>
<cell>
1 s
</cell>
</row>
<row>
<cell>
100
</cell>
<cell>
2 ms
</cell>
<cell>
10 ms
</cell>
<cell>
1 s
</cell>
<cell>
10^16 years
</cell>
</row>
<row>
<cell>
100000
</cell>
<cell>
5 ms
</cell>
<cell>
10 s
</cell>
<cell>
31 years
</cell>
<cell>
:)
</cell>
</row>
</tabular>
</table>
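<p>The entries in this table can be reproduced by scaling each complexity from its n = 10 baseline of 1 millisecond. A sketch (the helper name is ours, and the table rounds for presentation):</p>

```python
import math

def scaled_ms(f, n, baseline_n=10, baseline_ms=1.0):
    """Project the run time at size n for a growth function f,
    assuming the program takes 1 ms at n = 10."""
    return baseline_ms * f(n) / f(baseline_n)

growth = {
    "O(log n)": lambda n: math.log(n),
    "O(n)":     lambda n: n,
    "O(n^3)":   lambda n: n ** 3,
    "O(2^n)":   lambda n: 2.0 ** n,
}

# Reproduce the n = 20 row: roughly 1.3 ms, 2 ms, 8 ms, and ~1 s.
for name, f in growth.items():
    print(name, round(scaled_ms(f, 20), 1), "ms at n = 20")
```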

<p>As you can see, what started off as a negligible difference exploded into a totally unacceptable time for larger input sizes applied to larger Big O terms. Examples like these are precisely why
computer scientists are so fixated on Big O. 100000 data points is not a lot of data. Large tech companies are often running code on billions or
trillions of data points, and anything less than the most efficient code won't be able to run at scale.</p>

<p>We will end this section with a disclaimer. We have only covered the bare basics of Big O here. If you continue to study computer science,
you'll have more opportunities to explore it in much more detail, including seeing the formal definition of Big O and learning how to determine the Big O of your own code.
For this class, we only ask you to be familiar with Big O notation and to have a basic intuition for what it communicates.</p>
</section>
53 changes: 53 additions & 0 deletions pretext/AdditionalTopics/BinaryRepresentations.ptx
@@ -0,0 +1,53 @@
<?xml version="1.0"?>
<section xml:id="extratopics-binaryrepresentations">
<title>Binary Representations</title>

<p>Have you ever seen all of the "hacker" 01010110101010s in the movies? As you might
know, this is called binary. While it's not actually how hacking works, binary is still the basis of all computing.
Every word that you are reading right now was transmitted to your computer as a series of 1s and 0s. Although you won't
be typing 0s and 1s at a keyboard all day, binary is still useful to know.</p>

<p>Quick background: binary is a numbering system, just like decimal (the numbering system we normally use).
Decimal uses the digits 0-9, but binary only uses the digits 0 and 1, which are called bits.
In other words, binary is just a different way of counting.</p>

<p>Believe it or not, this is indirectly how you've been counting your entire life. For instance, in decimal numbering (base 10):</p>

<p><c>1023</c> (base 10) = (1 * 10^3) + (0 * 10^2) + (2 * 10^1) + (3 * 10^0)</p>

<p>There are even more numbering systems, like hexadecimal and octal, but you only need to understand binary for this course.</p>

<p>Binary deals with powers of two (hence the name). Reading from right to left and starting at position 0,
each bit controls one power of two: if the bit is 0, it is "off" and contributes nothing; if the bit is 1, it is "on"
and contributes 2 raised to the power of its position. Binary numbering is also called <term>base 2</term>
because of this. For instance:</p>

<p><c>1000</c> (base 2) = (1 * 2^3) + (0 * 2^2) + (0 * 2^1) + (0 * 2^0) = 8</p>
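<p>Python can check this expansion for us; the built-in <c>int(s, 2)</c> parses a string of bits as a base 2 number:</p>

```python
# 1000 in base 2: only the 2^3 bit is "on"
assert int("1000", 2) == (1 * 2**3) + (0 * 2**2) + (0 * 2**1) + (0 * 2**0)
print(int("1000", 2))  # 8
```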

<p><term>Converting decimal to binary</term>: A quick way to convert decimal to binary is to find the largest
power of 2 that fits into the number, writing a 1 for each power that fits and a 0 for each that doesn't. Subtract
each power that fits from the number and repeat with the next smaller power until we hit 0. For instance:</p>

<p>Example: Convert 78 to binary</p>
<p>1. If we think about all of our powers of 2, 2^7 = 128 is too large (128 &gt; 78), so we know 2^6 is where we'll start our number, and we need a 1 in that position. We now have: <c>1xxxxxx</c>.</p>
<p>2. 78 - 64 = 14, which is our remainder from the last digit. 2^5 = 32 &gt; 14, so we know 2^5 is a 0. We now have: <c>10xxxxx</c>.</p>
<p>3. 78 - 64 = 14, which is our remainder from the last digit. 2^4 = 16 &gt; 14, so we know 2^4 is a 0. We now have: <c>100xxxx</c>.</p>
<p>4. 78 - 64 = 14, which is our remainder from the last digit. 2^3 = 8 &lt; 14, so we know 2^3 is a 1 because it fits in! We now have: <c>1001xxx</c>.</p>
<p>5. 78 - 64 - 8 = 6, which is our remainder from the last digit. 2^2 = 4 &lt; 6, so we know 2^2 is a 1 because it fits in! We now have: <c>10011xx</c>.</p>
<p>6. 78 - 64 - 8 - 4 = 2, which is our remainder from the last digit. 2^1 = 2, which fits into 2 exactly, so we know 2^1 is a 1! We now have: <c>100111x</c>.</p>
<p>7. 78 - 64 - 8 - 4 - 2 = 0, so we are done and can fill any remainders with a 0 bit.</p>
<p>Our final answer is: <c>1001110</c> (base 2)</p>
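<p>The steps above can be sketched as a short function. This is our own illustration of the greedy method, not a standard library routine; it walks the powers of 2 from largest to smallest and emits a 1 whenever a power fits:</p>

```python
def to_binary(n):
    """Convert a positive integer to a binary string using the
    greedy method: subtract the largest power of 2 that fits."""
    # Find the largest power of 2 that fits into n.
    power = 1
    while power * 2 <= n:
        power *= 2
    bits = ""
    while power >= 1:
        if power <= n:      # this power fits: emit a 1 and subtract it
            bits += "1"
            n -= power
        else:               # too large: emit a 0
            bits += "0"
        power //= 2
    return bits

print(to_binary(78))  # 1001110
```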

<note>
Typically when we write binary, we'll see our bits in groups of 4, because our binary sequences are normally
some multiple of 4, like 8, 16, or 32. Because of this, we would add a leading zero and
write our previous answer as: <c>0100</c> <c>1110</c> (base 2).
</note>

<p><term>Converting binary to decimal</term>: As mentioned above, you can simply look at each bit,
and add 2 to the power of its position if the bit is 1.</p>
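<p>That rule can be written as a short loop (again a sketch of ours, not a built-in):</p>

```python
def to_decimal(bits):
    """Sum 2**position for every bit that is 1, counting
    positions from the right starting at 0."""
    total = 0
    for position, bit in enumerate(reversed(bits)):
        if bit == "1":
            total += 2 ** position
    return total

print(to_decimal("1001110"))  # 78
```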

<p>As usual in math, there are a few different ways to arrive at the same conclusion. These are not
the only ways to do conversions. If these explanations don't make sense to you, ask your instructor
or search online for another explanation.</p>
</section>
53 changes: 53 additions & 0 deletions pretext/AdditionalTopics/DataScience.ptx
@@ -0,0 +1,53 @@
<?xml version="1.0"?>
<section xml:id="extratopics-datascience">
<title>Data Science</title>

<p>Data science is a multidisciplinary field which combines computer science,
math, and other domains to answer questions using data.</p>

<p>As the world moves more and more towards storing and analyzing large amounts of data,
data science is a vital skill for you to be familiar with, whether you're a computer science major or not. It is also
a very common and useful application of programming, which is why we're discussing it in this class.</p>

<p>Data science is perhaps best defined by describing what data science looks like. <term>The data science process</term> consists of four steps:</p>

<ol>
<li>Obtaining data</li>
<li>Cleaning the data</li>
<li>Exploring the data</li>
<li>Predicting unknowns</li>
</ol>

<p><term>Obtaining the data</term>: We live in a time where data is more abundant than ever before. Getting hold of data can involve gathering it yourself,
purchasing it, or taking advantage of the many, many sites online which have a plethora of data
available for free (and sometimes paid) use. If you are getting your data from some third party, it will likely come as a .csv file, a .json file, or a SQL database.</p>

<p><term>Cleaning the data</term>: This can vary, but ultimately you need to prepare your data
in a way that makes it easily usable in the next steps. Often data starts out "noisy" or contains errors. In this step you may
fix things in the data, change missing data, or correct wrong data.</p>

<p>Cleaning is regularly considered the longest step in this process! Data can come in all sorts of different
formats, with anomalies, with blanks, and so much more. What "fixing" the data even means often depends on
context and your own goals.</p>
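<p>As a small, hypothetical illustration of cleaning: the data below is made up, and filling blanks with the average is just one of many reasonable choices.</p>

```python
# Hypothetical sensor readings; None marks missing values.
readings = [20.5, None, 21.0, 19.5, None, 20.0]

# One common cleaning choice: fill blanks with the average
# of the values we do have.
known = [r for r in readings if r is not None]
average = sum(known) / len(known)
cleaned = [average if r is None else r for r in readings]

print(cleaned)  # [20.5, 20.25, 21.0, 19.5, 20.25, 20.0]
```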

<p><term>Exploring the data</term>: Now that the data is prepared, we can do some analysis on it! As the term suggests, exploring the data is about coming to better
understand it. You often don't know what is interesting or useful about data when you first encounter it. You may need to do some sort of statistical
analysis to uncover the interesting aspects, or you may want to graph values and look for relationships and trends visually.</p>
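<p>Exploration often starts with simple summary statistics. A sketch using Python's standard <c>statistics</c> module (the visitor counts here are invented):</p>

```python
import statistics

# Made-up daily visitor counts for a small website.
visitors = [120, 135, 128, 410, 131, 125, 129]

print("mean:  ", statistics.mean(visitors))
print("median:", statistics.median(visitors))
print("max:   ", max(visitors))
# The mean is pulled well above the median by the single
# spike (410) -- an anomaly worth exploring further.
```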

<p><term>Predicting unknowns</term>: Having come to understand the data better, you can now use it to create new knowledge. These days, this step typically involves
using machine learning models. These techniques can generally be split into three groups:</p>

<ol>
<li><term>Supervised Learning</term>: With supervised learning, we try to construct a model that describes the relationship between inputs and outputs (regularly
referred to as "labels"). Knowing what labels we want in advance is what makes a method "supervised". For example, we could create a model to guess whether an email
is spam or not based on its contents; the label here is "spam" or "not spam". Or we could try to guess what the stock price will be for our favorite company based
on how it has performed in the last few weeks. The label here would be the predicted stock price.</li>
<li><term>Unsupervised Learning</term>: Contrasting with supervised learning, with unsupervised learning we don't know the labels in advance. An example here could be
using social media data to automatically identify friend groups. We don't know in advance how many groups we'll find or what their nature will be. Because of this, it
can be harder to guess what kind of results unsupervised learning will produce.</li>
<li><term>Semi-Supervised Learning:</term> Semi-supervised learning is an attempt to capture the best aspects of both supervised and unsupervised learning. With these
approaches we start with some data that has labels and also some data that doesn't. To use a previous example, we could take a collection of emails, only some of
which have been labeled as spam or not, and still try to construct a reliable method for identifying new emails as spam. If it goes well, then we've saved ourselves
a lot of time that would have otherwise been spent labeling emails.</li>
</ol>
</section>
19 changes: 18 additions & 1 deletion pretext/AdditionalTopics/Glossary.ptx
@@ -2,11 +2,24 @@
<section xml:id="dictionaries_glossary">
<title>Glossary</title>
<glossary sorted="False">
<gi>
<title>Binary search</title>
<p>A searching algorithm that repeatedly halves the portion of a sorted list that could contain the target; an improvement upon linear search.</p>
</gi>
<gi>
<title>Big O Notation</title>
<p>A notation computer scientists use to describe the relationship between the size
of the input data and the behavior of the program, denoted with O() and some function of n inside the parentheses.</p>
</gi>
<gi>
<title>constant time</title>
<p>A Big O time complexity of <c>O(1)</c>.</p>
</gi>
<gi>
<title>comprehension</title>
<p>A specific Python construction that allows collection types to be created and filled with a single line of code.</p>
</gi>
<gi>
<title>dictionary</title>
<p>A collection of key-value pairs that maps from keys to values. The keys
can be any immutable type, and the values can be any type.</p>
@@ -21,6 +34,10 @@
<p>One of the pairs of items in a dictionary. Values are looked up in a
dictionary by key.</p>
</gi>
<gi>
<title>Linear search</title>
<p>A searching algorithm where you look through a list in linear order (directly from start to end).</p>
</gi>
<gi>
<title>mapping type</title>
<p>A mapping type is a data type comprised of a collection of keys and