Commit b4586a1

from rossum ntb

dburian committed Sep 26, 2024
1 parent 49197a1 commit b4586a1
Showing 11 changed files with 327 additions and 0 deletions.
40 changes: 40 additions & 0 deletions contrastive_loss.md
@@ -0,0 +1,40 @@
# Contrastive loss

Contrastive loss was introduced by [Hadsell, Chopra, LeCun
(2006)](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf).

## Context

Contrastive loss was the first loss used to train [metric
learning](./metric_learning.md) models. As such it is very simple, but for
supervised training it has since been surpassed. Drawbacks of contrastive loss:

- it converges slowly, since it doesn't use the full batch information, unlike
  [N-pair loss](./n_pair_loss.md)

## Formula

Contrastive loss penalizes the model for similar pairs that are far apart and
for distinct pairs that are too close:

$$
L(x_1, x_2) = y D(x_1, x_2) + (1 - y)\max(0, m - D(x_1, x_2))
$$

where
- $x_1$, $x_2$ are training samples
- $y$ is similarity (1 = similar, 0 = distinct)
- $D$ is distance metric (in the paper it was L2)
- $m$ is margin

The margin $m$ ensures that distinct points which are already very far apart
won't be adjusted. The motivation is that the priority is having similar points
close; pushing already-distant points even further apart may get in the way, so
they are ignored.
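
A minimal PyTorch sketch of the formula above (the original paper squares both
terms; this follows the non-squared form written here):

```python
import torch

def contrastive_loss(z1, z2, y, margin=1.0):
    # z1, z2: (batch, dim) embeddings; y: (batch,) with 1 = similar, 0 = distinct
    d = torch.norm(z1 - z2, p=2, dim=1)                        # L2 distance, as in the paper
    similar_term = y * d                                       # pull similar pairs together
    distinct_term = (1 - y) * torch.clamp(margin - d, min=0)   # push distinct pairs apart, up to the margin
    return (similar_term + distinct_term).mean()
```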


## Training data

The authors propose a fairly simple way to generate training data: for each
sample, generate all positive pairs (neighbors up to a constant similarity) and
all negative pairs, then sample from them uniformly.
18 changes: 18 additions & 0 deletions deberta.md
@@ -0,0 +1,18 @@
# DeBERTa

DeBERTa -- Decoding-Enhanced BERT with disentangled Attention -- is a model from
Microsoft introduced by [He et al. (2021)](https://arxiv.org/pdf/2006.03654).
The paper introduces three novelties:

- adding absolute position encodings to the *output* of a BERT-like transformer
to help with MLM prediction
- disentangling computation of embedding and position attention scores
- adversarial training method (not explored)

The paper reports significant performance bumps over similarly sized models
despite training on half of the data. The mentioned benchmarks are MNLI, SQuAD,
RACE and SuperGLUE.

## Disentangled attention computation


49 changes: 49 additions & 0 deletions git_worktrees.md
@@ -0,0 +1,49 @@
# Git Worktrees

Git worktrees allow you to have multiple working directories and thereby have
multiple branches checked out at the same time. However, they often come with
some hassle, some of which can be avoided.

Source: [Using git worktrees in a clean
way](https://morgan.cugerone.com/blog/how-to-use-git-worktree-and-in-a-clean-way/).


## Bare clone

Instead of using the typical clone, which gives you one *main* worktree (also
known as the [working directory](./git_three_trees.md)) and other *linked*
worktrees, it is recommended to clone a bare repo. The reason is that it is
clearer what each directory holds. Linked worktrees can be deleted without any
fuss, but deleting your main worktree, which holds the `.git` directory, will
purge the entire code base (together with unpushed history).

Create a bare repo like so:

```bash
# Inside your project folder
git clone --bare git@.... .bare
```

To tell git to use that repo in your project folder, add a link to it:

```bash
echo "gitdir: ./.bare" > .git
```

While bare repositories make the project folder a lot cleaner, they have one
major disadvantage. With `--bare` added to the clone command, no remote-tracking
branches or related configuration are created. This means that you won't be able
to check out remote branches. The solution is to add this configuration yourself:

```bash
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"
```
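
After that, a fetch populates the remote-tracking branches (assuming the remote
is named `origin`):

```bash
git fetch origin
```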

## Add worktrees

All worktree-related operations are done with `git worktree`.
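
A few illustrative examples (the branch and path names are made up, the
subcommands themselves are standard `git worktree` operations):

```bash
# Check out an existing branch into a new worktree
git worktree add ./feature-x feature-x

# Create a new branch and a worktree for it in one go
git worktree add -b hotfix ./hotfix origin/main

# List all worktrees and remove one that is no longer needed
git worktree list
git worktree remove ./hotfix
```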





41 changes: 41 additions & 0 deletions guide_to_training_nn.md
@@ -0,0 +1,41 @@
# Guide to training neural nets

This note is a gist of [Andrej Karpathy's
post](https://karpathy.github.io/2019/04/25/recipe/) about training neural
networks. There are quite a few good tips, but the blog post is dense and long,
so I'm writing down my take to fully grasp all the mentioned ideas.

Deep learning is not typical software development. Neural networks differ in two
important ways:

#### Training neural networks cannot be abstracted away completely

Although many libraries try to make training neural networks as simple as 2
lines of code, it cannot be abstracted away completely. There will always be
quirks that you cannot debug without full knowledge of what is going on: from
[tokenization](./tokenization_gotchas.md) and data to the loss function and the
backpropagation algorithm.

#### Training of neural networks fails silently

Normally when software has a bug there is a big red error popping up, more often
than not describing what exactly went wrong. Training NNs isn't like that. For
example, a neural network can learn to work around issues in the input data
without giving you any indication that it is doing so, except that its
performance will be slightly lower.

For the two reasons above, you should go slowly, increasing complexity
gradually, not in big steps. If you add too much complexity at once, debugging
can get out of hand quite quickly.

## Recipe

### Study the data

### Implement end-to-end training and evaluation pipeline

### Overfit

### Regularize


10 changes: 10 additions & 0 deletions matrix_norms.md
@@ -19,3 +19,13 @@ Frobenius norm is a special case of the $L_{p, q}$ norm for $p = q = 2$.
$$
||A||_F = \sqrt{\sum_i \sum_j |a_{i, j}|^2} = \sqrt{trace(A^TA)}
$$

## Nuclear norm

Nuclear norm is the Schatten $p$-norm with $p=1$. For a matrix $A \in
\mathbb{R}^{m \times n}$ with [singular values](./svd.md) $\sigma_i(A)$, $i \in
\{1, \ldots, \min{\{m, n\}}\}$, it is defined as:

$$
||A||_{\ast} = \text{trace}(\sqrt{A^TA}) = \sum_{i=1}^{\min{\{m, n\}}} \sigma_i(A)
$$
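
A quick NumPy sanity check (purely illustrative):

```python
import numpy as np

A = np.random.randn(4, 3)
sigma = np.linalg.svd(A, compute_uv=False)    # singular values of A
print(sigma.sum())                            # sum of singular values
print(np.linalg.norm(A, ord="nuc"))           # NumPy's built-in nuclear norm, same value
```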
17 changes: 17 additions & 0 deletions metric_learning.md
@@ -0,0 +1,17 @@
# Metric learning

In metric learning we map data samples onto a real-valued vector space. The task
is to do it in such a way that **the relative positions** of two data samples in
the vector space **reflect their similarity**. This allows us to quickly compute
the similarity between any two data samples, which is handy in retrieval.

## Methods

- [Contrastive](./contrastive_loss.md)
- [Triplet](./triplet_loss.md)


## Sources

- [Survey of Deep Metric learning (DML)
methods](https://hav4ik.github.io/articles/deep-metric-learning-survey)
78 changes: 78 additions & 0 deletions n_pair_loss.md
@@ -0,0 +1,78 @@
# N-pair loss

N-pair loss is a loss for supervised [metric learning](./metric_learning.md)
introduced by [Sohn
(2016)](https://proceedings.neurips.cc/paper/2016/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf).
It is a natural progression of the [Triplet loss](./triplet_loss.md) that tries
to extract more information from a given batch.

## Loss formula

Notation:
- $x$ -- anchor input
- $x^{+}$ -- positive input
- $x_i$ -- negative input to $x$
- $f$ -- normalized representation of $x$
- $f^{+}$ -- normalized representation of $x^{+}$
- $f_i$ -- normalized representation of $x_i$

Then the loss for $x$ is:

$$
\mathcal{L}\left(\{x, x^{+}, \{x_i\}_{i=1}^{N-1}\}; f\right)
= \log\left(
1 + \sum_{i=1}^{N-1} \exp\left(f^\top f_i - f^\top f^{+}\right)
\right)
$$

Note that this is identical to the classical softmax (cross-entropy) loss, with
the positive playing the role of the correct class:

$$
\begin{aligned}
\log\left(
  1 + \sum_{i=1}^{N-1} \exp\left(f^\top f_i - f^\top f^{+}\right)
\right) &=
\log\left(
  \frac{\exp(f^\top f^{+}) + \sum_{i=1}^{N-1}\exp(f^\top f_i)}{\exp(f^\top f^{+})}
\right) \\
&= -\log\left(
  \frac{\exp(f^\top f^{+})}{
    \exp(f^\top f^{+}) + \sum_{i=1}^{N-1}\exp(f^\top f_i)
  }
\right)
\end{aligned}
$$

## Efficient batching

Instead of computing a single loss for each batch of N+1 inputs (N-1 negatives,
1 positive, 1 anchor), the authors propose to create a batch of 2N inputs
composed of N (anchor, positive) pairs from N different classes (whose
embeddings the loss should pull apart). From the 2N inputs we can compute N
losses:

- take the $j$-th anchor and positive as $x$ and $x^{+}$, and the other N-1
  positives as the negatives $x_i$ (see the sketch below)
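
A minimal PyTorch sketch of this batching, assuming `anchors` and `positives`
are row-aligned, L2-normalized embeddings of shape (N, d), one pair per class
(the names are illustrative, and the paper additionally regularizes the
embedding norms):

```python
import torch
import torch.nn.functional as F

def n_pair_loss(anchors, positives):
    # anchors[j] and positives[j] come from class j, so for anchor j the
    # off-diagonal positives act as the N-1 negatives.
    logits = anchors @ positives.T                                  # (N, N) similarity matrix
    targets = torch.arange(anchors.size(0), device=anchors.device)  # diagonal = positive pair
    # Cross-entropy over each row is exactly the softmax form derived above,
    # averaged over the N anchors.
    return F.cross_entropy(logits, targets)
```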

## Hard class mining

Triplet loss relies on mining hard instances to speed up convergence. The
authors of N-pair loss propose to mine hard classes instead of instances:

1. Choose a large number of classes C, with 2 randomly sampled instances from
   each.
2. Get the sampled instances' embeddings.
3. Greedily build a batch of N classes (a rough sketch follows this list):
   1. Start by taking a random class $i$ (represented by a random instance of
      that class).
   2. Choose class $j$ next if it violates the triplet constraint the most
      w.r.t. the already selected classes.
   3. Repeat until N classes are selected.
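
One way the greedy selection could look; this is purely illustrative (the names,
the use of dot-product similarity, and the margin handling are my assumptions,
not the paper's exact procedure):

```python
import numpy as np

def greedy_class_mining(emb_a, emb_b, n_classes, margin=0.0):
    # emb_a, emb_b: (C, d) embeddings of two instances per class, rows aligned by class.
    C = emb_a.shape[0]
    selected = [np.random.randint(C)]
    while len(selected) < n_classes:
        best_class, best_violation = None, -np.inf
        for j in range(C):
            if j in selected:
                continue
            # Class j violates the triplet constraint w.r.t. a selected class i
            # when anchor i is more similar to j's instance than to its own positive.
            violation = max(
                emb_a[i] @ emb_a[j] + margin - emb_a[i] @ emb_b[i]
                for i in selected
            )
            if violation > best_violation:
                best_class, best_violation = j, violation
        selected.append(best_class)
    return selected
```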

## Results

The authors report better results than
- triplet loss w/ hard class mining (no hard instance mining)
- classical softmax loss

However, the best-performing version of N-pair loss is often the one with hard
class mining, which is undoubtedly the most costly of all the evaluated losses.
8 changes: 8 additions & 0 deletions ordinal_regression.md
@@ -0,0 +1,8 @@
# Ordinal regression

Ordinal regression is a type of regression for predicting ordinal variables
(i.e. variables whose values are categorical but can be ordered). It's a type of
regression where only the relative ordering of the predicted values matters, not
the exact predicted values.

([Wikipedia](https://en.wikipedia.org/wiki/Ordinal_regression))
51 changes: 51 additions & 0 deletions python_multiprocessing.md
@@ -0,0 +1,51 @@
# Multiprocessing in Python

By default Python code is synchronous -- it runs in a single process and a
single thread. However, the `multiprocessing` module allows for execution in
multiple processes. There are a few gotchas that one needs to be aware of.

## Forking is bad

The default method of obtaining another process (on Linux) is to fork the
current one. Forking copies the memory of the parent process to the child
process; however, it doesn't copy everything. This can [cause
deadlocks](https://pythonspeed.com/articles/python-multiprocessing/). So it is
recommended to use another start method such as `'spawn'`:

```python
from multiprocessing import get_context

with get_context('spawn').Pool() as p:
    p.imap(...)
```

## When spawning, the worker function should be in another module

With the `spawn` method, a new Python interpreter is created. It has no clue
about the globals in the parent process and needs to import them. So if the
worker function is in the main module of the parent process, there is a risk of
running the multiprocessing code again in the child process. Apparently there
are some safeguards, so it doesn't happen. The solution is to either put the
worker function in another module, or to guard the multiprocessing code with
`if __name__ == '__main__':`.
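
A minimal sketch of the guarded pattern (`square` is just an illustrative
worker function):

```python
from multiprocessing import get_context

def square(x):
    # Worker function; with 'spawn' the child processes re-import this module to find it.
    return x * x

if __name__ == "__main__":
    # The guard keeps the pool setup from re-running when children import the module.
    with get_context("spawn").Pool(processes=4) as pool:
        print(list(pool.imap(square, range(10))))
```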

## Introducing `joblib`

With the above-mentioned hassles in mind, I found the experience of using
`joblib` much better. It is an external package with no further dependencies,
and it works like so:

```python
from joblib import Parallel, delayed

with Parallel(
    n_jobs=32,      # number of worker processes (-1 means use all CPUs)
    verbose=11,     # above 0 it logs progress, above 10 it reports on every iteration
    batch_size=1,   # number of tasks dispatched to each worker at once
) as parallel:
    result = parallel(
        delayed(expensive_fn)(args, kwarg1=other_argument)
        for args in prepared_job_args
    )
```

In my experience it is as fast as or faster than spawn-based multiprocessing,
you don't have to create single-argument functions and place them in separate
modules, and I've encountered zero unexpected errors.
Empty file added tokenization_gotchas.md
Empty file.
15 changes: 15 additions & 0 deletions triplet_loss.md
@@ -0,0 +1,15 @@
# Triplet loss

Introduced as part of the [FaceNet paper by Schroff et al.
(2015)](https://arxiv.org/pdf/1503.03832).

## Context

Triplet loss can be seen as the successor to [Contrastive
loss](./contrastive_loss.md), in its time becoming the state-of-the-art loss for
[metric learning](./metric_learning.md).

## Loss
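
For reference, the standard form from the paper, with anchor $a$, positive $p$,
negative $n$, embedding function $f$ and margin $\alpha$:

$$
\mathcal{L} = \sum_{(a, p, n)} \max\left(0,\; \|f(a) - f(p)\|_2^2 - \|f(a) - f(n)\|_2^2 + \alpha\right)
$$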


