Commit b4586a1

from rossum ntb

dburian committed Sep 26, 2024
1 parent 49197a1 commit b4586a1
Showing 11 changed files with 327 additions and 0 deletions.
40 changes: 40 additions & 0 deletions contrastive_loss.md
@@ -0,0 +1,40 @@
# Contrastive loss

Contrastive loss was introduced by [Hadsell, Chopra, LeCun
(2006)](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf).

## Context

Contrastive loss was the first loss used to train [metric
learning](./metric_learning.md) models. As such it is very simple, but for
supervised training it has since been surpassed. Drawbacks of contrastive loss:

- it converges slowly, since it doesn't use the full batch information, unlike
  [N-pair loss](./n_pair_loss.md)

## Formula

Contrastive loss penalizes the model for similar pairs that are far apart and
for distinct pairs that are too close:

$$
L(x_1, x_2) = y D(x_1, x_2) + (1 - y)\max(0, m - D(x_1, x_2))
$$

where
- $x_1$, $x_2$ are training samples
- $y$ is similarity (1 = similar, 0 = distinct)
- $D$ is distance metric (in the paper it was L2)
- $m$ is margin

The margin $m$ ensures that distinct points which are already very far apart
won't be adjusted. The motivation is that the priority is having similar points
close; pushing already-distant points even further apart may get in the way, so
they are ignored.
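
A minimal PyTorch sketch of the formula above (the original paper squares both
terms; this follows the non-squared form written here):

```python
import torch

def contrastive_loss(z1, z2, y, margin=1.0):
    # z1, z2: (batch, dim) embeddings; y: (batch,) with 1 = similar, 0 = distinct
    d = torch.norm(z1 - z2, p=2, dim=1)                        # L2 distance, as in the paper
    similar_term = y * d                                       # pull similar pairs together
    distinct_term = (1 - y) * torch.clamp(margin - d, min=0)   # push distinct pairs apart, up to the margin
    return (similar_term + distinct_term).mean()
```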


## Training data

The authors propose a fairly simple way to generate training data: for each
sample, generate all positive pairs (neighbors up to a constant similarity) and
all negative pairs, then sample from them uniformly.
18 changes: 18 additions & 0 deletions deberta.md
@@ -0,0 +1,18 @@
# DeBERTa

DeBERTa -- Decoding-Enhanced BERT with disentangled Attention -- is a model from
Microsoft introduced by [He et al. (2021)](https://arxiv.org/pdf/2006.03654).
The paper introduces three novelties:

- adding absolute position encodings to the *output* of a BERT-like transformer
to help with MLM prediction
- disentangling computation of embedding and position attention scores
- adversarial training method (not explored)

The paper reports significant performance bumps over similarly sized models
despite training on half of the data. The mentioned benchmarks are MNLI, SQuAD,
RACE and SuperGLUE.

## Disentangled attention computation


49 changes: 49 additions & 0 deletions git_worktrees.md
@@ -0,0 +1,49 @@
# Git Worktrees

Git worktrees allow you to have multiple working directories and thereby have
multiple branches checked out at the same time. However, they often come with
some hassle, some of which can be avoided.

Source: [Using git worktrees in a clean
way](https://morgan.cugerone.com/blog/how-to-use-git-worktree-and-in-a-clean-way/).


## Bare clone

Instead of using the typical clone, which gives you one *main* worktree (also
known as the [working directory](./git_three_trees.md)) and other *linked*
worktrees, it is recommended to clone a bare repo. The reason is that it is
clearer what each directory holds. Linked worktrees can be deleted without any
fuss, but deleting your main worktree, which holds the `.git` directory, will
purge the entire code base (together with unpushed history).

Create a bare repo like so:

```bash
# Inside your project folder
git clone --bare git@.... .bare
```

To tell git to use that repo in your project folder, add a link to it:

```bash
echo "gitdir: ./.bare" > .git
```

While bare repositories make the project folder a lot cleaner, they have one
major disadvantage. With `--bare` added to the clone command, no remote-tracking
branches or related configuration are created. This means that you won't be able
to check out remote branches. The solution is to add this configuration yourself:

```bash
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"
```
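
After that, a fetch populates the remote-tracking branches (assuming the remote
is named `origin`):

```bash
git fetch origin
```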

## Add worktrees

All worktree-related operations are done with `git worktree`.
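
A few illustrative examples (the branch and path names are made up, the
subcommands themselves are standard `git worktree` operations):

```bash
# Check out an existing branch into a new worktree
git worktree add ./feature-x feature-x

# Create a new branch and a worktree for it in one go
git worktree add -b hotfix ./hotfix origin/main

# List all worktrees and remove one that is no longer needed
git worktree list
git worktree remove ./hotfix
```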





41 changes: 41 additions & 0 deletions guide_to_training_nn.md
@@ -0,0 +1,41 @@
# Guide to training neural nets

This note is a gist of [Andrej Karpathy's
post](https://karpathy.github.io/2019/04/25/recipe/) about training neural
networks. There are quite a few good tips, but the blog post is dense and long,
so I'm writing down my take to fully grasp all the mentioned ideas.

Deep learning is not typical software development. Neural networks differ in two
important ways:

#### Training neural networks cannot be abstracted away completely

Although many libraries try to make training neural networks as simple as 2
lines of code, it cannot be abstracted away completely. There will always be
quirks that you cannot debug without full knowledge of what is going on: from
[tokenization](./tokenization_gotchas.md) and data to the loss function and the
backpropagation algorithm.

#### Training of neural networks fails silently

Normally when software has a bug there is a big red error popping up, more often
than not describing what exactly went wrong. Training NNs isn't like that. For
example, a neural network can learn to work around issues in the input data
without giving you any indication that it is doing so, except that its
performance will be slightly lower.

For the two reasons above, you should go slowly, increasing complexity
gradually, not in big steps. If you add too much complexity at once, debugging
can get out of hand quite quickly.

## Recipe

### Study the data

### Implement end-to-end training and evaluation pipeline

### Overfit

### Regularize


10 changes: 10 additions & 0 deletions matrix_norms.md
@@ -19,3 +19,13 @@ Frobenius norm is a special case of the $L_{p, q}$ norm for $p = q = 2$.
$$
||A||_F = \sqrt{\sum_i \sum_j |a_{i, j}|^2} = \sqrt{trace(A^TA)}
$$

## Nuclear norm

Nuclear norm is the Schatten $p$-norm with $p=1$. For a matrix $A \in
\mathbb{R}^{m \times n}$ with [singular values](./svd.md) $\sigma_i(A)$, $i \in
\{1, \ldots, \min{\{m, n\}}\}$, it is defined as:

$$
||A||_{\ast} = \text{trace}(\sqrt{A^TA}) = \sum_{i=1}^{\min{\{m, n\}}} \sigma_i(A)
$$
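
A quick NumPy sanity check (purely illustrative):

```python
import numpy as np

A = np.random.randn(4, 3)
sigma = np.linalg.svd(A, compute_uv=False)    # singular values of A
print(sigma.sum())                            # sum of singular values
print(np.linalg.norm(A, ord="nuc"))           # NumPy's built-in nuclear norm, same value
```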
17 changes: 17 additions & 0 deletions metric_learning.md
@@ -0,0 +1,17 @@
# Metric learning

In metric learning we map data samples onto a real-valued vector space. The task
is to do it in such a way that **the relative positions** of two data samples in
the vector space **reflect their similarity**. This allows us to quickly compute
the similarity between any two data samples, which is handy in retrieval.

## Methods

- [Contrastive](./contrastive_loss.md)
- [Triplet](./triplet_loss.md)


## Sources

- [Survey of Deep Metric learning (DML)
methods](https://hav4ik.github.io/articles/deep-metric-learning-survey)
78 changes: 78 additions & 0 deletions n_pair_loss.md
@@ -0,0 +1,78 @@
# N-pair loss

N-pair loss is a loss for supervised [metric learning](./metric_learning.md)
introduced by [Sohn
(2016)](https://proceedings.neurips.cc/paper/2016/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf).
It is a natural progression of the [Triplet loss](./triplet_loss.md) that tries
to extract more information from a given batch.

## Loss formula

Notation:
- $x$ -- anchor input
- $x^{+}$ -- positive input
- $x_i$ -- negative input to $x$
- $f$ -- normalized representation of $x$
- $f^{+}$ -- normalized representation of $x^{+}$
- $f_i$ -- normalized representation of $x_i$

Then the loss for $x$ is:

$$
\mathcal{L}\left(\{x, x^{+}, \{x_i\}_{i=1}^{N-1}\}; f\right)
= \log\left(
1 + \sum_{i=1}^{N-1} \exp\left(f^\top f_i - f^\top f^{+}\right)
\right)
$$

Note that this is identical to the classical softmax (cross-entropy) loss, with
the positive playing the role of the correct class:

$$
\begin{aligned}
\log\left(
  1 + \sum_{i=1}^{N-1} \exp\left(f^\top f_i - f^\top f^{+}\right)
\right) &=
\log\left(
  \frac{\exp(f^\top f^{+}) + \sum_{i=1}^{N-1}\exp(f^\top f_i)}{\exp(f^\top f^{+})}
\right) \\
&= -\log\left(
  \frac{\exp(f^\top f^{+})}{
    \exp(f^\top f^{+}) + \sum_{i=1}^{N-1}\exp(f^\top f_i)
  }
\right)
\end{aligned}
$$

## Efficient batching

Instead of computing a single loss for each batch of N+1 inputs (N-1 negatives,
1 positive, 1 anchor), the authors propose to create a batch of 2N inputs
composed of N (anchor, positive) pairs from N different classes (whose
embeddings the loss should pull apart). From the 2N inputs we can compute N
losses:

- take the $j$-th anchor and positive as $x$ and $x^{+}$, and the other N-1
  positives as the negatives $x_i$ (see the sketch below)
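
A minimal PyTorch sketch of this batching, assuming `anchors` and `positives`
are row-aligned, L2-normalized embeddings of shape (N, d), one pair per class
(the names are illustrative, and the paper additionally regularizes the
embedding norms):

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    # anchors[j] and positives[j] come from class j, so for anchor j the
    # off-diagonal positives act as the N-1 negatives.
    logits = anchors @ positives.T                                  # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)  # diagonal = positive pair
    # Cross-entropy over each row is exactly the softmax form derived above,
    # averaged over the N anchors.
    return F.cross_entropy(logits, targets)
```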

## Hard class mining

Triplet loss relies on mining hard instances to speed up convergence. The
authors of N-pair loss propose to mine hard classes instead of instances:

1. Choose a large number of classes C, with 2 randomly sampled instances from
   each.
2. Get the sampled instances' embeddings.
3. Greedily build a batch of N classes (a rough sketch follows this list):
   1. Start by taking a random class $i$ (represented by a random instance of
      that class).
   2. Choose class $j$ next if it violates the triplet constraint the most
      w.r.t. the already selected classes.
   3. Repeat until N classes are selected.
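
One way the greedy selection could look; this is purely illustrative (the names,
the use of dot-product similarity, and the margin handling are my assumptions,
not the paper's exact procedure):

```python
import numpy as np

def greedy_class_mining(emb_a, emb_b, n_classes, margin=0.0):
    # emb_a, emb_b: (C, d) embeddings of two instances per class, rows aligned by class.
    C = emb_a.shape[0]
    selected = [np.random.randint(C)]
    while len(selected) < n_classes:
        best_class, best_violation = None, -np.inf
        for j in range(C):
            if j in selected:
                continue
            # Class j violates the triplet constraint w.r.t. a selected class i
            # when anchor i is more similar to j's instance than to its own positive.
            violation = max(
                emb_a[i] @ emb_a[j] + margin - emb_a[i] @ emb_b[i]
                for i in selected
            )
            if violation > best_violation:
                best_class, best_violation = j, violation
        selected.append(best_class)
    return selected
```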

## Results

The authors report better results than
- triplet loss w/ hard class mining (no hard instance mining)
- classical softmax loss

However, the best-performing version of N-pair loss is often the one with hard
class mining, which is undoubtedly the most costly of all the evaluated losses.
8 changes: 8 additions & 0 deletions ordinal_regression.md
@@ -0,0 +1,8 @@
# Ordinal regression

Ordinal regression is a type of regression for predicting ordinal variables
(i.e. variables whose values are categorical but can be ordered). It's a type of
regression where only the relative ordering of the predicted values matters, not
the exact predicted values.

([Wikipedia](https://en.wikipedia.org/wiki/Ordinal_regression))
51 changes: 51 additions & 0 deletions python_multiprocessing.md
@@ -0,0 +1,51 @@
# Multiprocessing in Python

By default Python code is synchronous -- it runs in a single process and a
single thread. However, the `multiprocessing` module allows for execution in
multiple processes. There are a few gotchas that one needs to be aware of.

## Forking is bad

The default method of obtaining another process (on Linux) is to fork the
current one. Forking copies the memory of the parent process to the child
process; however, it doesn't copy everything. This can [cause
deadlocks](https://pythonspeed.com/articles/python-multiprocessing/). So it is
recommended to use another start method such as `'spawn'`:

```python
from multiprocessing import get_context

with get_context('spawn').Pool() as p:
    p.imap(...)
```

## When spawning, the worker function should be in another module

With the `spawn` method, a new Python interpreter is created. It has no clue
about the globals in the parent process and needs to import them. So if the
worker function is in the main module of the parent process, there is a risk of
running the multiprocessing code again in the child process. Apparently there
are some safeguards, so it doesn't happen. The solution is to either put the
worker function in another module, or to guard the multiprocessing code with
`if __name__ == '__main__':`.
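
A minimal sketch of the guarded pattern (`square` is just an illustrative
worker function):

```python
from multiprocessing import get_context

def square(x):
    # Worker function; with 'spawn' the child processes re-import this module to find it.
    return x * x

if __name__ == "__main__":
    # The guard keeps the pool setup from re-running when children import the module.
    with get_context("spawn").Pool(processes=4) as pool:
        print(list(pool.imap(square, range(10))))
```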

## Introducing `joblib`

With the above-mentioned hassles in mind, I found the experience of using
`joblib` much better. It is an external package with no further dependencies,
and it works like so:

```python
from joblib import Parallel, delayed

with Parallel(
    n_jobs=32,      # number of worker processes (-1 means use all CPUs)
    verbose=11,     # above 0 it logs progress, above 10 it reports on every iteration
    batch_size=1,   # number of tasks dispatched to each worker at once
) as parallel:
    result = parallel(
        delayed(expensive_fn)(args, kwarg1=other_argument)
        for args in prepared_job_args
    )
```

In my experience it is as fast as or faster than spawn-based multiprocessing,
you don't have to create single-argument functions and place them in separate
modules, and I've encountered zero unexpected errors.
Empty file added tokenization_gotchas.md
Empty file.
15 changes: 15 additions & 0 deletions triplet_loss.md
@@ -0,0 +1,15 @@
# Triplet loss

Introduced as part of the [FaceNet paper by Schroff et al.
(2015)](https://arxiv.org/pdf/1503.03832).

## Context

Triplet loss can be seen as the successor to [Contrastive
loss](./contrastive_loss.md), in its time becoming the state-of-the-art loss for
[metric learning](./metric_learning.md).

## Loss
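
For reference, the standard form from the paper, with anchor $a$, positive $p$,
negative $n$, embedding function $f$ and margin $\alpha$:

$$
\mathcal{L} = \sum_{(a, p, n)} \max\left(0,\; \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + \alpha\right)
$$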


