Development (#197)
* File cluster proof of concept (#162)

* Implement files data-access object

* Migrate to files-dao

* Update server tests

* Fix regression

* Make module naming more idiomatic

* Resolve linting issues

* Support expunging by session scope

* Draft matches extraction

* Filter matches by distance

* Test query matches with cycles

* Test match loading

* Emulate matches pagination

* Update MatchesDAO tests

* Migrate server.api to MatchesDAO

* Update api/matches tests

* Test multiple hops with cycles

* Update file matches page

* Update cluster page to consume new matches format

* Improve graph container responsiveness

* Fix linting issues

* Improve graph responsiveness

* Implement generic loading trigger

* Support matches filtering

* Implement dynamic matches loading

* Fix loading trigger message

* Fix match reducers

* Improve dynamic match loading

* Implement neighbor loading

* Adjust graph style

* Enable zooming

* Enable cluster navigation

* Fix tooltips

* Make edge opacity dynamic

* Make color scheme static

* Handle playback issues (#165)

* Handle missing file in thumbnail endpoint

* Handle missing file in watch endpoint

* Display playback/loading errors

* Handle file missing in database case

* Use bundled flv.js

By default react-player tries to lazy-load the playback SDK from a CDN.
But the application must be able to play video files when an Internet
connection is not available. To solve that we bundle flv.js and
initialize the global variable consumed by react-player's FilePlayer.

See https://www.npmjs.com/package/react-player#sdk-overrides
See cookpete/react-player#605 (comment)

* Suppress excessive flv.js error logs (#149)

* Coding style improvements and small refactors

* Fix matches loading (#166) (#168)

* Support config tags (#167)

* Support tags by PathReprStorage

* Support tags by SQLiteReprStorage

* Update extract_features.py to support tags

* Add missing dependency

* Make config tag opaque

* Update extract_features.py script

* Delete obsolete code

* Update repr storage tests

* Update generate_matches.py script

* Update template_matching.py script

* Update general_tests

* Add missing unit-test dependency

* Optimize module dependencies

* Implement side-by-side match comparison view (#155) (#169)

* Refactor file cluster page

* Refactor VideoInformationPane

* Make video details elements collapsible

* Move distance element to common package

* Add comparable file component

* Implement reusable file summary

* Implement match file list

* Implement mother file view

* Fetch matched files scenes

* Setup comparison page routing

* Reset video player on file change

* Fix matched file header

* Improve distance style

* Hook up compare button

* Fix match duplication

* Fix linting issues

* Make compare button primary-colored

* Added missing dependency

* Removed unnecessary files

* Additional refactor and linting fixes

* Improve cluster view (#171)

* Refactor FileSummaryHeader

* Refactor linear list item

* Extract match file id from URL

* Navigate to comparison page on edge click

* Navigate to compare from single match preview

* Improve cluster tooltips

* Implement node highlighting

* Improve link popover

* Fix linting issues

* Add navigation from comparison page (#172) (#177)

* Handle partial processing results (#178)

* Handle missing data in backend (#135)

* Test missing data processing

* Fix go back navigation (#179)

* Fix go-back navigation

* Preserve filters and search results

Previous file filters and fetched files should not be updated when
we navigate from the video-details page back to the collection page.

* Handle missing video length

* Fix match category selector

* Ensure initial file loading

* Add match file enumeration & sorting (#181)

* Add match enumeration (#175)

* Sort files by match score

* Handle uninitialized ref

* Add image version option (#180)

* Ensure shell is bash

* Refactor setup script

* Select image version during setup (#97)

* Add make-goal for forced setup update

* Improve scripts output

* Add docker update and purge goals (#114)

* Use arrow-select for pre-built option

* Add various UI improvements (#173, #174, #176) (#182)

* Handle dense layout

* Add active filters indication (#176)

* Preserve file view type during session (#174)

* Enlarge links hitbox (#173)

* Fix link click handler

* Augmented dataset evaluation pipeline

* Modifications to support benchmarking script

* Add cluster API endpoint (#183) (#184)

* Separate cluster and matches endpoints (#183)

* Update server tests

* Test matches endpoint

* Update REST client

* Refactor API client

* Create generic hook for entity loading

* Consume matches and cluster API separately

* Manage fields inclusion

* Update immediate match loading

* Update comparison page

* Remove trash

* Remove trash

* Refactor ui state management (#185, #189) (#190)

* Move helpers to separate modules

* Move file cache to separate package

* Move file matches state to separate package

* Move cluster state to separate package

* Move file-list state to separate package

* Collect reusable prop-types in a public package

* Extract entity fetching logic

* Refactor file-cluster state

Use generic approach to manage file-cluster state.

* Refactor file-matches state

Use generic approach to manage file-matches state.

* Update server client

* Update matches params (#189)

* Fix file cluster update

* Disable false-positive linting issue

* Create post_push

* Simplify docker-compose workflow (#187, #188) (#191)

* Simplify docker-compose workflow

Use `build` and `image` simultaneously in the compose file. As a result
no shared configuration is needed and all configuration is managed by
environment variables (located in the .env file).

To use prebuilt images the user will need to run `make pull`.
To use locally built images (if production mode is disabled) the user
will need to run `make build`. That's it.

* Commit missing script

* Add revision information to docker images (#188)

* Benchmarking improvements

* Update and rename post_push to build

* Delete build

* Parse exif general encoded date (#137) (#193)

* Update db schema

* Fix media-info output parsing

* Parse exif date-time

* Use date-time on backend

* Update tests

* Use numeric timestamps on frontend

* Handle NaN-representation of missing values

The EXIF metadata is pre-processed by the pandas.json_normalize
method, which replaces any missing value with NaN. As a result a
float value may appear in place of a string.
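
For illustration only (not part of this commit), a minimal sketch of that
failure mode, assuming pandas and hypothetical EXIF field names:

```python
import math

import pandas as pd

# Hypothetical EXIF records: the second one lacks the string field entirely.
records = [
    {"General_FileExtension": "mp4", "General_Duration": 10.0},
    {"General_Duration": 7.5},
]

df = pd.json_normalize(records)

value = df.loc[1, "General_FileExtension"]
print(type(value))  # <class 'float'> -- NaN silently replaced the string

# Guard before treating the value as a string:
if isinstance(value, float) and math.isnan(value):
    value = None
```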

* Handle pandas missing data representation

pandas.Series heuristically determines the type of the underlying data
and tries to represent missing values according to that data type. In
the case of datetime data, missing values are represented by
pandas.NaT, which is not compatible with the SQLAlchemy framework. To
fix that we have to explicitly transform NaT values to None.
See https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#datetimes
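
A minimal sketch of the NaT-to-None conversion described above (the
column name is hypothetical and the actual server code may differ):

```python
import pandas as pd

df = pd.DataFrame(
    {"exif_General_Encoded_Date": pd.to_datetime(["2020-11-20 12:00:00", None])}
)

# pandas stores the missing datetime as NaT, which SQLAlchemy cannot bind.
print(df["exif_General_Encoded_Date"].iloc[1])  # NaT

# Cast to object dtype and replace every missing value (including NaT)
# with None before handing rows to SQLAlchemy.
clean = df.astype(object).where(df.notnull(), None)
print(clean["exif_General_Encoded_Date"].iloc[1])  # None
```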

* Fix deprecated import

* Fix database URI

* Handle more date columns

* Update video metadata model (#139) (#196)

* Update db schema

* Update db access logic

* Update frontend

* Update server tests

* Update storage tests

* Additional error handling for dealing with missing files

* Added missing frame-level features check

* Quick merge fix

Co-authored-by: Stepan Anokhin <38860530+stepan-anokhin@users.noreply.github.com>
Co-authored-by: Felipe Batista <fsbatista1@gmail.com>
Co-authored-by: Stepan Anokhin <stepan.anokhin@gmail.com>
4 people authored Nov 20, 2020
1 parent bf49fbd commit 31d0d80
Showing 208 changed files with 10,534 additions and 3,544 deletions.
30 changes: 23 additions & 7 deletions .mk/docker.mk
@@ -1,16 +1,22 @@

.PHONY: docker-setup

## Setup environment variables required for docker-compose
## Setup environment variables required for docker-compose if needed
docker-setup:
@scripts/docker-setup.sh

.PHONY: docker-setup-update

## Update environment variables required for docker-compose
docker-setup-update:
@scripts/docker-setup.sh --force-update


.PHONY: docker-run

## Run application using docker-compose
docker-run: docker-setup
@scripts/docker-run.sh
sudo docker-compose up -d


.PHONY: docker-stop
@@ -19,11 +25,21 @@ docker-run: docker-setup
docker-stop:
sudo docker-compose stop

.PHONY: docker-build

## Build docker images for docker-compose application
docker-build:
sudo docker-compose build
@scripts/docker-build.sh


.PHONY: docker-pull

## Update docker images (rebuild local or pull latest from repository depending on configuration).
docker-pull:
sudo docker-compose pull

.PHONY: docker-purge

## Rebuild docker images
docker-rebuild:
sudo docker-compose rm -s -f
sudo docker-compose build
## Shut-down docker-compose application and remove all its images and volumes.
docker-purge:
sudo docker-compose down --rmi all -v --remove-orphans --timeout 0
22 changes: 19 additions & 3 deletions Makefile
@@ -18,12 +18,28 @@ stop: docker-stop

.PHONY: setup

## Setup docker-compose application (generate .env file)
## Setup docker-compose application (generate .env file if needed)
setup: docker-setup

## Rebuild docker-compose images
rebuild: docker-rebuild
.PHONY: update-setup

## Update docker-compose application (regenerate .env file)
setup-update: docker-setup-update

.PHONY: purge

## Remove docker-compose application and all its images and volumes.
purge: docker-purge

# Define default goal
.DEFAULT_GOAL := help

.PHONY: build

## Build Docker images locally.
build: docker-build

.PHONY: pull

## Pull images from Docker Hub
pull: docker-pull
31 changes: 31 additions & 0 deletions benchmarks/augmented_dataset/config.yml
@@ -0,0 +1,31 @@
sources:
root: data/augmented_dataset
extensions:
- mp4
- ogv
- webm
- avi

repr:
directory: data/benchmark_output/representations


processing:
frame_sampling: 1
save_frames: true
match_distance: 0.75
video_list_filename: video_dataset_list.txt
filter_dark_videos: true
filter_dark_videos_thr: 2
min_video_duration_seconds: 3
detect_scenes: true
pretrained_model_local_path: null
keep_fileoutput: true

database:
use: false
uri: postgres://postgres:admin@localhost:5432/videodeduplicationdb

templates:
source_path: data/templates/test-group/CCSI Object Recognition External/

3,064 changes: 3,064 additions & 0 deletions benchmarks/augmented_dataset/labels.csv

Large diffs are not rendered by default.

121 changes: 121 additions & 0 deletions benchmarks/evaluate.py
@@ -0,0 +1,121 @@
import pandas as pd
from glob import glob
from utils import get_result, download_dataset, get_frame_sampling_permutations
import os
from winnow.utils import resolve_config
import click
from winnow.utils import scan_videos
import subprocess
import shlex
import numpy as np
import json

pd.options.mode.chained_assignment = None

@click.command()

@click.option(
'--benchmark', '-b',
help='Name of the benchmark to be evaluated',
default='augmented_dataset')

@click.option(
'--force-download', '-fd',
help='Force download of the dataset (even if an existing directory for the dataset has been detected)',
default=False, is_flag=True)

@click.option(
'--overwrite', '-o',
help='Force feature extraction, even if we detect that signatures have already been processed.',
default=False, is_flag=True)


def main(benchmark, force_download, overwrite):

config_path = os.path.join('benchmarks', benchmark, 'config.yml')
config = resolve_config(config_path)
source_folder = config.sources.root

videos = scan_videos(source_folder, '**')

if len(videos) == 0 or force_download:

download_dataset(source_folder, url='https://winnowpre.s3.amazonaws.com/augmented_dataset.tar.xz')

videos = scan_videos(source_folder, '**')

print(f'Videos found after download:{len(videos)}')

if len(videos) > 0:

print('Video files found. Checking for existing signatures...')

signatures_path = os.path.join(
config.repr.directory,
'video_signatures', '**',
'**.npy')

signatures = glob(os.path.join(signatures_path), recursive=True)

if len(signatures) == 0 or overwrite:

# Load signatures and labels
#
command = f'python extract_features.py -cp {config_path}'
command = shlex.split(command)
subprocess.run(command, check=True)

# Check if signatures were generated properly
signatures = glob(os.path.join(signatures_path), recursive=True)

assert len(signatures) > 0, 'No signature files were found.'

available_df = pd.read_csv(
os.path.join(
'benchmarks',
benchmark,
'labels.csv'))
frame_level = glob(
os.path.join(
config.repr.directory,
'frame_level', '**',
'**.npy'), recursive=True)

signatures_permutations = get_frame_sampling_permutations(
list(range(1, 6)),
frame_level)

scoreboard = dict()

for fs, sigs in signatures_permutations.items():

results_analysis = dict()

for r in np.linspace(0.1, 0.25, num=10):

results = []

for i in range(5):

mAP, pr_curve = get_result(
available_df,
sigs,
ratio=r,
file_index=frame_level)
results.append(mAP)

results_analysis[r] = results

scoreboard[fs] = results_analysis

results_file = open('benchmarks/scoreboard.json', "w")
json.dump(scoreboard, results_file)
print('Saved scoreboard on {}'.format('benchmarks/scoreboard.json'))

else:

print(f'Please review the dataset (@ {source_folder})')

if __name__ == '__main__':

main()
159 changes: 159 additions & 0 deletions benchmarks/utils.py
@@ -0,0 +1,159 @@
import pandas as pd
import numpy as np
from winnow.feature_extraction.loading_utils import evaluate, calculate_similarities, global_vector
from winnow.feature_extraction.utils import load_image, download_file
from winnow.feature_extraction import SimilarityModel
from collections import defaultdict
import os
import shutil
from glob import glob

def get_queries(min_num_of_samples, df, col='original_filename'):

fc = df[col].value_counts()
msk = fc >= min_num_of_samples

return fc[msk].index.values


def get_query_dataset(df, query, ratio=.22, col='original_filename'):

msk = df[col] == query
occ = df.loc[msk, :]
negative = df.loc[~msk, :]
n_positive_samples = len(occ)
positive_head = occ.sample(1)['new_filename'].values[0]

query_total = n_positive_samples / ratio
to_be_sampled = int(query_total - n_positive_samples)
confounders = negative.sample(to_be_sampled)
confounders.loc[:, 'label'] = 'X'
occ.loc[:, 'label'] = 'E'
merged = pd.concat([confounders, occ])

query_d = dict()

for i, row in merged.iterrows():

query_d[row['new_filename']] = row['label']

return positive_head, query_d


def get_ground_truth(available_df, queries, min_samples=4, ratio=0.2):

ground_truth = dict()

for query in queries:

head, query_ds = get_query_dataset(available_df, query, ratio=ratio)

ground_truth[head] = query_ds

return ground_truth


def convert_ground_truth(gt, base_to_idx):

queries = list(gt.keys())

qi = {base_to_idx[x]: i+1 for i, x in enumerate(queries)}

new_ds = dict()

for k, v in gt.items():

sub_d = dict()

for kk, vv in v.items():

sub_d[base_to_idx[kk]] = vv

new_ds[qi[base_to_idx[k]]] = sub_d

return new_ds


def get_result(df,
signatures,
min_samples=4,
ratio=0.25,
all_videos=False,
file_index=None):

if file_index is None:

signatures_data = np.array([np.load(x) for x in signatures])
basename = [os.path.basename(x)[:-4] for x in signatures]

else:

basename = [os.path.basename(x)[:-4] for x in file_index]
signatures_data = np.array(signatures)
signatures = file_index

basename_to_idx = {x: i for i, x in enumerate(basename)}

queries = get_queries(min_samples, df)
query_idx = [basename_to_idx[x] for x in queries]
similarities = calculate_similarities(query_idx, signatures_data)

ground_truth = get_ground_truth(df, queries, ratio=ratio)
final_gt = convert_ground_truth(ground_truth, basename_to_idx)
mAP, pr_curve = evaluate(final_gt, similarities, all_videos=all_videos)
return mAP, pr_curve


def download_dataset(
dst,
url="https://winnowpre.s3.amazonaws.com/augmented_dataset.tar.xz"):

if not os.path.exists(dst):

os.makedirs(dst)

number_of_files = len(glob(dst + '/**'))
print('Files Found',number_of_files)

if number_of_files < 2:

print('Downloading sample dataset to:{}'.format(dst))

fp = os.path.join(dst, 'dataset.tar.gz')
if not os.path.isfile(fp):

download_file(fp, url)
# unzip files
print('unpacking', fp)
shutil.unpack_archive(fp, dst)
# Delete tar
os.unlink(fp)
else:
print('Files have already been downloaded')


def get_frame_sampling_permutations(frame_samplings, frame_level_files):

d = defaultdict(list)

for v in frame_level_files:

data = np.load(v)

for frame_sampling in frame_samplings:

d[frame_sampling].append(data[::frame_sampling])

sm = SimilarityModel()

signatures = defaultdict(list)
for fs in d.keys():

video_level = np.array([global_vector(x) for x in d[fs]])
signatures[fs].append(
sm.predict_from_features(
video_level.reshape(
video_level.shape[0],
video_level.shape[2])))

return signatures