# Contrastive loss

Contrastive loss was introduced by [Hadsell, Chopra, LeCun
(2006)](http://yann.lecun.com/exdb/publis/pdf/hadsell-chopra-lecun-06.pdf).

## Context

Contrastive loss was the first loss used for [metric
learning](./metric_learning.md). As such it is very simple, but it has since
been surpassed for supervised training. Drawbacks of contrastive loss:

- it converges slowly, since it doesn't use the full batch information, unlike
  N-pair loss

## Formula

Contrastive loss penalizes the model for similar pairs that are far apart and
for dissimilar pairs that are close together:
$$
L(x_1, x_2) = y \, D(x_1, x_2) + (1 - y) \max(0, m - D(x_1, x_2))
$$

where
- $x_1$, $x_2$ are training samples
- $y$ is the similarity label (1 = similar, 0 = dissimilar)
- $D$ is a distance metric (L2 in the paper)
- $m$ is the margin

The margin $m$ ensures that dissimilar points which are already far apart are
not adjusted. The motivation is that the priority is keeping similar points
close; pushing already-distant points even further apart may get in the way, so
they are ignored.
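
As a minimal sketch of the formula (assuming PyTorch, a batch of embedding
pairs `x1`, `x2`, and labels `y`; the function name is made up):

```python
import torch
import torch.nn.functional as F


def contrastive_loss(x1, x2, y, margin=1.0):
    """y = 1 for similar pairs, 0 for dissimilar ones."""
    d = F.pairwise_distance(x1, x2)  # L2 distance, as in the paper
    loss = y * d + (1 - y) * torch.clamp(margin - d, min=0)
    return loss.mean()
```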

## Training data

The authors propose a quite simple way to generate training data: for each
sample, generate all positive pairs (neighbors up to a constant similarity) and
all negative pairs, then sample uniformly.

# DeBERTa

DeBERTa -- Decoding-enhanced BERT with disentangled attention -- is a model from
Microsoft introduced by [He et al. (2021)](https://arxiv.org/pdf/2006.03654).
The paper introduces three novelties:

- adding absolute position encodings to the *output* of a BERT-like transformer
  to help with MLM prediction
- disentangling the computation of embedding (content) and position attention
  scores
- an adversarial training method (not explored here)

The paper reports significant performance bumps over similarly sized models
despite training on half of the data. The mentioned benchmarks are MNLI, SQuAD,
RACE and SuperGLUE.

## Disentangled attention computation
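
From my memory of the paper (worth double-checking against He et al., 2021),
the attention score between tokens $i$ and $j$ is decomposed into
content-to-content, content-to-position and position-to-content terms (the
position-to-position term is dropped, since relative positions carry no
content):

$$
A_{i,j} = Q_i^c (K_j^c)^\top + Q_i^c (K_{\delta(i,j)}^r)^\top + K_j^c (Q_{\delta(j,i)}^r)^\top
$$

where $Q^c$, $K^c$ are projections of the content embeddings, $Q^r$, $K^r$ are
projections of the relative position embeddings, and $\delta(i,j)$ is the
relative distance between $i$ and $j$.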

# Git Worktrees

Git worktrees allow you to have multiple working directories and thereby have
multiple branches checked out at the same time. However, they often come with
some hassle, some of which can be avoided.

Source: [Using git worktrees in a clean
way](https://morgan.cugerone.com/blog/how-to-use-git-worktree-and-in-a-clean-way/).
## Bare clone

Instead of using the typical clone and having one *main* worktree (also known as
the [working directory](./git_three_trees.md)) and other *linked* worktrees, it
is recommended to clone a bare repo. The reason is that it is clearer what each
directory holds. Linked worktrees can be deleted without any fuss, but deleting
your main worktree, which holds the `.git` directory, would purge the entire
code base (together with unpushed history).

Create the bare repo like so:

```bash
# Inside your project folder
git clone --bare git@.... .bare
```

To tell git to use that repo in your project folder, add a link to it:

```bash
echo "gitdir: ./.bare" > .git
```

While bare repositories make the project folder a lot cleaner, they have one
major disadvantage. With `--bare` added to the clone command, no remote-tracking
branches or related configuration are created. This means that you won't be able
to check out remote branches. The solution is to add this configuration yourself:

```bash
git config remote.origin.fetch "+refs/heads/*:refs/remotes/origin/*"
```
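
After adding the refspec, a plain fetch populates the remote-tracking branches
(my usual follow-up step, not necessarily from the source post):

```bash
git fetch origin
```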

## Add worktrees

All worktree-related operations are done with the `git worktree` command.
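
A few common operations as a sketch (the paths and branch names are made up):

```bash
# Create a worktree for an existing branch
git worktree add ../my-feature my-feature

# Create a worktree together with a new branch
git worktree add -b hotfix ../hotfix

# List and clean up worktrees
git worktree list
git worktree remove ../hotfix
```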

# Guide to training neural nets

This note is a gist of [Andrej Karpathy's
post](https://karpathy.github.io/2019/04/25/recipe/) about training neural
networks. There are quite a few good tips, but the blog post is dense and long,
so I'm writing down my take to fully grasp all the mentioned ideas.

Deep learning is not typical software development. Neural networks differ in two
important ways:

#### Training neural networks cannot be abstracted away completely

Although many libraries try to make training neural networks as simple as 2
lines of code, training neural networks cannot be abstracted away completely.
There will always be quirks that you cannot debug without full knowledge of what
is going on: from [tokenization](./tokenization_gotchas.md) and data to the loss
function and the backpropagation algorithm.

#### Training of neural networks fails silently

Normally when software has a bug, there is a big red error popping up, more
often than not describing what exactly went wrong. Training NNs isn't like that.
E.g. a neural network can learn to work around issues in the input data without
giving you any indication that it is doing so, except that its performance will
be slightly lower.

For the two reasons above, you should go slowly, increasing the complexity in
small steps, not big ones. If you add too much complexity at once, debugging can
get out of hand quite quickly.

## Recipe

### Study the data

### Implement end-to-end training and evaluation pipeline

### Overfit

### Regularize

# Metric learning

In metric learning we map data samples onto a real-valued vector space. The task
is to do it in such a way that **the relative positions** of two data samples in
the vector space **reflect their similarity**. This allows us to quickly compute
the similarity between any two data samples, which is handy in retrieval.
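
As a minimal sketch of why this is handy for retrieval (NumPy, names are made
up; assumes the embeddings are L2-normalized so the dot product is cosine
similarity):

```python
import numpy as np


def retrieve(query_emb: np.ndarray, corpus_emb: np.ndarray, k: int = 5) -> np.ndarray:
    """query_emb: (dim,) query embedding; corpus_emb: (num_items, dim) indexed items."""
    scores = corpus_emb @ query_emb   # cosine similarity per item
    return np.argsort(-scores)[:k]    # indices of the k most similar items
```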

## Methods

- [Contrastive](./contrastive_loss.md)
- [Triplet](./triplet_loss.md)

## Sources

- [Survey of Deep Metric learning (DML)
  methods](https://hav4ik.github.io/articles/deep-metric-learning-survey)

# N-pair loss

N-pair loss is a loss for supervised [metric learning](./metric_learning.md)
introduced by [Sohn
(2016)](https://proceedings.neurips.cc/paper/2016/file/6b180037abbebea991d8b1232f8a8ca9-Paper.pdf).
It is a natural progression of the [Triplet loss](./triplet_loss.md), trying to
extract more information from a given batch.

## Loss formula

Notation:
- $x$ -- anchor input
- $x^{+}$ -- positive input
- $x_i$ -- negative input to $x$
- $f$ -- normalized representation of $x$
- $f^{+}$ -- normalized representation of $x^{+}$
- $f_i$ -- normalized representation of $x_i$

Then the loss for $x$ is:

$$
\mathcal{L}\left(\{x, x^{+}, \{x_i\}_{i=1}^{N-1}\}; f\right)
= \log\left(
  1 + \sum_{i=1}^{N-1} \exp\left(f^\top f_i - f^\top f^{+}\right)
\right)
$$

Note that this is identical to the classical softmax (cross-entropy) loss, with
the positive playing the role of the correct class:

$$
\begin{aligned}
\log\left(
  1 + \sum_{i=1}^{N-1} \exp\left(f^\top f_i - f^\top f^{+}\right)
\right) &=
\log\left(
  \frac{\exp(f^\top f^{+}) + \sum_{i=1}^{N-1}\exp(f^\top f_i)}{\exp(f^\top f^{+})}
\right) \\
&=-\log\left(
  \frac{\exp(f^\top f^{+})}{
    \exp(f^\top f^{+}) + \sum_{i=1}^{N-1}\exp(f^\top f_i)
  }
\right)
\end{aligned}
$$

## Efficient batching

Instead of computing a single loss for each batch of N+1 inputs (N-1 negatives,
1 positive, 1 anchor), the authors propose to create a batch of 2N inputs
composed of N anchor-positive pairs from N different classes (whose embeddings
the loss should pull apart). From the 2N inputs we can compute N losses:

- take the $j$-th anchor and positive as $x$ and $x^{+}$, and the other N-1
  positives as the negatives $x_i$ (see the sketch below)
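
A minimal sketch of the batched loss in PyTorch (assuming the 2N inputs are
already embedded into N anchor and N positive vectors, one pair per class; the
paper's embedding-norm regularization is omitted):

```python
import torch
import torch.nn.functional as F


def n_pair_loss(anchors: torch.Tensor, positives: torch.Tensor) -> torch.Tensor:
    """anchors, positives: (N, d) embeddings, row j belonging to class j."""
    logits = anchors @ positives.T  # (N, N): entry (j, i) is f_j^T f_i^+
    targets = torch.arange(anchors.size(0), device=anchors.device)
    # For row j, the j-th positive is the "correct class" and the other N-1
    # positives act as negatives -- exactly the softmax form derived above.
    return F.cross_entropy(logits, targets)
```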

## Hard class mining

Triplet loss relies on mining of hard instances to speed up convergence. The
authors of N-pair loss propose to mine classes instead of instances:

1. Choose a large number of classes C, with 2 randomly sampled instances from
   each
2. Get the sampled instances' embeddings
3. Greedily create a batch of N classes by:
   1. Randomly taking a class $i$ (a random instance of the randomly chosen
      class $i$)
   2. Choosing as the next class the class $j$ that violates the triplet
      constraint the most w.r.t. the already selected classes
   3. Repeating

## Results

The authors report better results than
- triplet loss w/ hard class mining (no hard instance mining)
- classical softmax loss

However, the best-performing version of N-pair loss is often the one with hard
class mining, which is undoubtedly the most costly of all the evaluated losses.

# Ordinal regression

Ordinal regression is a type of regression for predicting ordinal variables,
i.e. variables whose values are categorical but can be ordered (e.g. ratings
such as "bad" < "ok" < "good"). Only the relative ordering of the predicted
values matters, not the exact predicted values.

([Wikipedia](https://en.wikipedia.org/wiki/Ordinal_regression))

# Multiprocessing in Python

By default Python code is synchronous -- it runs in a single process and a
single thread. However, the `multiprocessing` module allows for execution in
multiple processes. There are a few gotchas that one needs to be aware of.

## Forking is bad

The default method for obtaining another process (on Linux) is to fork the
current one. Forking copies the memory of the parent process to the child
process; however, it doesn't copy everything (e.g. running threads). This can
[cause deadlocks](https://pythonspeed.com/articles/python-multiprocessing/). So
it is recommended to use another start method such as 'spawn':

```python
from multiprocessing import get_context

with get_context('spawn').Pool() as p:
    p.imap(...)
```

## When spawning, the worker function should be from another module

With the `spawn` method, a fresh Python interpreter is started. It has no clue
about the globals of the parent process and needs to re-import them. So if the
worker function lives in the main module of the parent process, there is a risk
of running the multiprocessing code again in the child process. Apparently there
are some safeguards, so it doesn't happen. The solution is to either keep the
worker function in another module, or to put the multiprocessing code behind an
`if __name__ == '__main__':` guard.
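
A minimal sketch of the guard pattern (names are made up; ideally the worker
would live in a separate module, but the guard alone is enough for this toy
example):

```python
# main.py
from multiprocessing import get_context


def square(x):
    return x * x


if __name__ == "__main__":
    # The spawned child re-imports this module, but the guard keeps it from
    # starting its own pool recursively.
    with get_context("spawn").Pool() as pool:
        print(list(pool.imap(square, range(10))))
```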

## Introducing `joblib`

With the above-mentioned hassles in mind, I found the experience of using
`joblib` much better. It is an external package with no further dependencies.
And it works like so:

```python
from joblib import Parallel, delayed

with Parallel(
    n_jobs=32,     # number of parallel workers (-1 means all CPUs)
    verbose=11,    # above 0 it logs something, above 10 it reports on every iteration
    batch_size=1,  # number of tasks dispatched to each worker at once
) as parallel:
    result = parallel(
        delayed(expensive_fn)(args, kwarg1=other_argument)
        for args in prepared_job_args
    )
```

In my experience it is as fast as or faster than spawn-based multiprocessing,
you don't have to create single-argument functions and place them in separate
modules, and I've encountered zero unexpected errors.

# Triplet loss

Introduced as part of the [FaceNet paper by Schroff et al.
(2015)](https://arxiv.org/pdf/1503.03832).

## Context

Triplet loss can be seen as a successor to [Contrastive
loss](./contrastive_loss.md), in its time becoming the state-of-the-art loss for
[metric learning](./metric_learning.md).

## Loss
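
From my memory of the FaceNet paper (worth double-checking against Schroff et
al., 2015), the loss over a set of triplets is:

$$
L = \sum_{i} \max\left(0,\;
  \lVert f(x_i^a) - f(x_i^p) \rVert_2^2
  - \lVert f(x_i^a) - f(x_i^n) \rVert_2^2
  + \alpha
\right)
$$

where $x^a$ is the anchor, $x^p$ a positive sample of the same class, $x^n$ a
negative sample, $f$ the (normalized) embedding and $\alpha$ the margin enforced
between positive and negative pairs.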