Development (#197)
* File cluster proof of concept (#162)

* Implement files data-access object

* Migrate to files-dao

* Update server tests

* Fix regression

* Make module naming more idiomatic

* Resolve linting issues

* Support expunging by session scope

* Draft matches extraction

* Filter matches by distance

* Test query matches with cycles

* Test match loading

* Emulate matches pagination

* Update MatchesDAO tests

* Migrate server.api to MatchesDAO

* Update api/matches tests

* Test multiple hops with cycles

* Update file matches page

* Update cluster page to consume new matches format

* Improve graph container responsiveness

* Fix linting issues

* Improve graph responsiveness

* Implement generic loading trigger

* Support matches filtering

* Implement dynamic matches loading

* Fix loading trigger message

* Fix match reducers

* Improve dynamic match loading

* Implement neighbor loading

* Adjust graph style

* Enable zooming

* Enable cluster navigation

* Fix tooltips

* Make edge opacity dynamic

* Make color scheme static

* Handle playback issues (#165)

* Handle missing file in thumbnail endpoint

* Handle missing file in watch endpoint

* Display playback/loading errors

* Handle file missing in database case

* Use bundled flv.js

By default react-player tries to lazy-load the playback SDK from a CDN.
But the application must be able to play video files when an Internet
connection is not available. To solve that we bundle flv.js and
initialize the global variable consumed by react-player's FilePlayer.

See https://www.npmjs.com/package/react-player#sdk-overrides
See cookpete/react-player#605 (comment)

* Suppress excessive flv.js error logs (#149)

* Coding style improvements and small refactors

* Fix matches loading (#166) (#168)

* Support config tags (#167)

* Support tags by PathReprStorage

* Support tags by SQLiteReprStorage

* Update extract_features.py to support tags

* Add missing dependency

* Make config tag opaque

* Update extract_features.py script

* Delete obsolete code

* Update repr storage tests

* Update generate_matches.py script

* Update template_matching.py script

* Update general_tests

* Add missing unit-test dependency

* Optimize module dependencies

* Implement side-by-side match comparison view (#155) (#169)

* Refactor file cluster page

* Refactor VideoInformationPane

* Make video details elements collapsible

* Move distance element to common package

* Add comparable file component

* Implement reusable file summary

* Implement match file list

* Implement mother file view

* Fetch matched files scenes

* Setup comparison page routing

* Reset video player on file change

* Fix matched file header

* Improve distance style

* Hook up compare button

* Fix match duplication

* Fix linting issues

* Make compare button primary-colored

* Added missing dependency

* Removed unnecessary files

* Additional refactor and linting fixes

* Improve cluster view (#171)

* Refactor FileSummaryHeader

* Refactor linear list item

* Extract match file id from URL

* Navigate to comparison page on edge click

* Navigate to compare from single match preview

* Improve cluster tooltips

* Implement node highlighting

* Improve link popover

* Fix linting issues

* Add navigation from comparison page (#172) (#177)

* Handle partial processing results (#178)

* Handle missing data in backend (#135)

* Test missing data processing

* Fix go back navigation (#179)

* Fix go-back navigation

* Preserve filters and search results

Previous file filters and fetched files should not be updated when
we navigate from the video-details page back to the collection page.

* Handle missing video length

* Fix match category selector

* Ensure initial file loading

* Add match file enumeration & sorting (#181)

* Add match enumeration (#175)

* Sort files by match score

* Handle uninitialized ref

* Add image version option (#180)

* Ensure shell is bash

* Refactor setup script

* Select image version during setup (#97)

* Add make-goal for forced setup update

* Improve scripts output

* Add docker update and purge goals (#114)

* Use arrow-select for pre-built option

* Add various UI improvements (#173, #174, #176) (#182)

* Handle dense layout

* Add active filters indication (#176)

* Preserve file view type during session (#174)

* Enlarge links hitbox (#173)

* Fix link click handler

* Augmented dataset evaluation pipeline

* Modifications to support benchmarking script

* Add cluster API endpoint (#183) (#184)

* Separate cluster and matches endpoints (#183)

* Update server tests

* Test matches endpoint

* Update REST client

* Refactor API client

* Create generic hook for entity loading

* Consume matches and cluster API separately

* Manage fields inclusion

* Update immediate match loading

* Update comparison page

* Remove trash

* Remove trash

* Refactor ui state management (#185, #189) (#190)

* Move helpers to separate modules

* Move file cache to separate package

* Move file matches state to separate package

* Move cluster state to separate package

* Move file-list state to separate package

* Collect reusable prop-types in a public package

* Extract entity fetching logic

* Refactor file-cluster state

Use generic approach to manage file-cluster state.

* Refactor file-matches state

Use generic approach to manage file-matches state.

* Update server client

* Update matches params (#189)

* Fix file cluster update

* Disable false-positive linting issue

* Create post_push

* Simplify docker-compose workflow (#187, #188) (#191)

* Simplify docker-compose workflow

Use `build` and `image` simultaneously in the compose file. As a result
no shared configuration is needed and all configuration is managed by
environment variables (located in the .env file).

To use prebuilt images the user will need to run `make pull`.
To use locally built images (if production mode is disabled) the user
will need to run `make build`. That's it.

* Commit missing script

* Add revision information to docker images (#188)

* Benchmarking improvements

* Update and rename post_push to build

* Delete build

* Parse exif general encoded date (#137) (#193)

* Update db schema

* Fix media-info output parsing

* Parse exif date-time

* Use date-time on backend

* Update tests

* Use numeric timestamps on frontend

* Handle NaN-representation of missing values

The EXIF metadata is pre-processed by the pandas.json_normalize
method, which replaces any missing value with NaN. As a result a
float value may appear in place of a string.
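
For illustration only (not part of this commit), a minimal sketch of that
failure mode, assuming pandas and hypothetical EXIF field names:

```python
import math

import pandas as pd

# Hypothetical EXIF records: the second one lacks the string field entirely.
records = [
    {"General_FileExtension": "mp4", "General_Duration": 10.0},
    {"General_Duration": 7.5},
]

df = pd.json_normalize(records)

value = df.loc[1, "General_FileExtension"]
print(type(value))  # <class 'float'> -- NaN silently replaced the string

# Guard before treating the value as a string:
if isinstance(value, float) and math.isnan(value):
    value = None
```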

* Handle pandas missing data representation

pandas.Series heuristically determines the type of the underlying data
and tries to represent missing values according to that data type. In
the case of datetime data, missing values are represented by
pandas.NaT, which is not compatible with the SQLAlchemy framework. To
fix that we have to explicitly transform NaT values to None.
See https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html#datetimes
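
A minimal sketch of the NaT-to-None conversion described above (the
column name is hypothetical and the actual server code may differ):

```python
import pandas as pd

df = pd.DataFrame(
    {"exif_General_Encoded_Date": pd.to_datetime(["2020-11-20 12:00:00", None])}
)

# pandas stores the missing datetime as NaT, which SQLAlchemy cannot bind.
print(df["exif_General_Encoded_Date"].iloc[1])  # NaT

# Cast to object dtype and replace every missing value (including NaT)
# with None before handing rows to SQLAlchemy.
clean = df.astype(object).where(df.notnull(), None)
print(clean["exif_General_Encoded_Date"].iloc[1])  # None
```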

* Fix deprecated import

* Fix database URI

* Handle more date columns

* Update video metadata model (#139) (#196)

* Update db schema

* Update db access logic

* Update frontend

* Update server tests

* Update storage tests

* Additional error handling for dealing with missing files

* Added missing frame-level features check

* Quick merge fix

Co-authored-by: Stepan Anokhin <38860530+stepan-anokhin@users.noreply.github.com>
Co-authored-by: Felipe Batista <fsbatista1@gmail.com>
Co-authored-by: Stepan Anokhin <stepan.anokhin@gmail.com>
4 people authored Nov 20, 2020
1 parent bf49fbd commit 31d0d80
Showing 208 changed files with 10,534 additions and 3,544 deletions.
30 changes: 23 additions & 7 deletions .mk/docker.mk
@@ -1,16 +1,22 @@

.PHONY: docker-setup

## Setup environment variables required for docker-compose
## Setup environment variables required for docker-compose if needed
docker-setup:
@scripts/docker-setup.sh

.PHONY: docker-setup-update

## Update environment variables required for docker-compose
docker-setup-update:
@scripts/docker-setup.sh --force-update


.PHONY: docker-run

## Run application using docker-compose
docker-run: docker-setup
@scripts/docker-run.sh
sudo docker-compose up -d


.PHONY: docker-stop
@@ -19,11 +25,21 @@ docker-run: docker-setup
docker-stop:
sudo docker-compose stop

.PHONY: docker-build

## Build docker images for docker-compose application
docker-build:
sudo docker-compose build
@scripts/docker-build.sh


.PHONY: docker-pull

## Update docker images (rebuild local or pull latest from repository depending on configuration).
docker-pull:
sudo docker-compose pull

.PHONY: docker-purge

## Rebuild docker images
docker-rebuild:
sudo docker-compose rm -s -f
sudo docker-compose build
## Shut-down docker-compose application and remove all its images and volumes.
docker-purge:
sudo docker-compose down --rmi all -v --remove-orphans --timeout 0
22 changes: 19 additions & 3 deletions Makefile
@@ -18,12 +18,28 @@ stop: docker-stop

.PHONY: setup

## Setup docker-compose application (generate .env file)
## Setup docker-compose application (generate .env file if needed)
setup: docker-setup

## Rebuild docker-compose images
rebuild: docker-rebuild
.PHONY: update-setup

## Update docker-compose application (regenerate .env file)
setup-update: docker-setup-update

.PHONY: purge

## Remove docker-compose application and all its images and volumes.
purge: docker-purge

# Define default goal
.DEFAULT_GOAL := help

.PHONY: build

## Build Docker images locally.
build: docker-build

.PHONY: pull

## Pull images from Docker Hub
pull: docker-pull
31 changes: 31 additions & 0 deletions benchmarks/augmented_dataset/config.yml
@@ -0,0 +1,31 @@
sources:
root: data/augmented_dataset
extensions:
- mp4
- ogv
- webm
- avi

repr:
directory: data/benchmark_output/representations


processing:
frame_sampling: 1
save_frames: true
match_distance: 0.75
video_list_filename: video_dataset_list.txt
filter_dark_videos: true
filter_dark_videos_thr: 2
min_video_duration_seconds: 3
detect_scenes: true
pretrained_model_local_path: null
keep_fileoutput: true

database:
use: false
uri: postgres://postgres:admin@localhost:5432/videodeduplicationdb

templates:
source_path: data/templates/test-group/CCSI Object Recognition External/

3,064 changes: 3,064 additions & 0 deletions benchmarks/augmented_dataset/labels.csv

Large diffs are not rendered by default.

121 changes: 121 additions & 0 deletions benchmarks/evaluate.py
@@ -0,0 +1,121 @@
import pandas as pd
from glob import glob
from utils import get_result, download_dataset, get_frame_sampling_permutations
import os
from winnow.utils import resolve_config
import click
from winnow.utils import scan_videos
import subprocess
import shlex
import numpy as np
import json

pd.options.mode.chained_assignment = None

@click.command()

@click.option(
'--benchmark', '-b',
help='Name of the benchmark to be evaluated',
default='augmented_dataset')

@click.option(
'--force-download', '-fd',
help='Force download of the dataset (even if an existing directory for the dataset has been detected)',
default=False, is_flag=True)

@click.option(
'--overwrite', '-o',
help='Force feature extraction, even if we detect that signatures have already been processed.',
default=False, is_flag=True)


def main(benchmark, force_download, overwrite):

config_path = os.path.join('benchmarks', benchmark, 'config.yml')
config = resolve_config(config_path)
source_folder = config.sources.root

videos = scan_videos(source_folder, '**')

if len(videos) == 0 or force_download:

download_dataset(source_folder, url='https://winnowpre.s3.amazonaws.com/augmented_dataset.tar.xz')

videos = scan_videos(source_folder, '**')

print(f'Videos found after download:{len(videos)}')

if len(videos) > 0:

print('Video files found. Checking for existing signatures...')

signatures_path = os.path.join(
config.repr.directory,
'video_signatures', '**',
'**.npy')

signatures = glob(os.path.join(signatures_path), recursive=True)

if len(signatures) == 0 or overwrite:

# Load signatures and labels
#
command = f'python extract_features.py -cp {config_path}'
command = shlex.split(command)
subprocess.run(command, check=True)

# Check if signatures were generated properly
signatures = glob(os.path.join(signatures_path), recursive=True)

assert len(signatures) > 0, 'No signature files were found.'

available_df = pd.read_csv(
os.path.join(
'benchmarks',
benchmark,
'labels.csv'))
frame_level = glob(
os.path.join(
config.repr.directory,
'frame_level', '**',
'**.npy'), recursive=True)

signatures_permutations = get_frame_sampling_permutations(
list(range(1, 6)),
frame_level)

scoreboard = dict()

for fs, sigs in signatures_permutations.items():

results_analysis = dict()

for r in np.linspace(0.1, 0.25, num=10):

results = []

for i in range(5):

mAP, pr_curve = get_result(
available_df,
sigs,
ratio=r,
file_index=frame_level)
results.append(mAP)

results_analysis[r] = results

scoreboard[fs] = results_analysis

results_file = open('benchmarks/scoreboard.json', "w")
json.dump(scoreboard, results_file)
print('Saved scoreboard on {}'.format('benchmarks/scoreboard.json'))

else:

print(f'Please review the dataset (@ {source_folder})')

if __name__ == '__main__':

main()
159 changes: 159 additions & 0 deletions benchmarks/utils.py
@@ -0,0 +1,159 @@
import pandas as pd
import numpy as np
from winnow.feature_extraction.loading_utils import evaluate, calculate_similarities, global_vector
from winnow.feature_extraction.utils import load_image, download_file
from winnow.feature_extraction import SimilarityModel
from collections import defaultdict
import os
import shutil
from glob import glob

def get_queries(min_num_of_samples, df, col='original_filename'):

fc = df[col].value_counts()
msk = fc >= min_num_of_samples

return fc[msk].index.values


def get_query_dataset(df, query, ratio=.22, col='original_filename'):

msk = df[col] == query
occ = df.loc[msk, :]
negative = df.loc[~msk, :]
n_positive_samples = len(occ)
positive_head = occ.sample(1)['new_filename'].values[0]

query_total = n_positive_samples / ratio
to_be_sampled = int(query_total - n_positive_samples)
confounders = negative.sample(to_be_sampled)
confounders.loc[:, 'label'] = 'X'
occ.loc[:, 'label'] = 'E'
merged = pd.concat([confounders, occ])

query_d = dict()

for i, row in merged.iterrows():

query_d[row['new_filename']] = row['label']

return positive_head, query_d


def get_ground_truth(available_df, queries, min_samples=4, ratio=0.2):

ground_truth = dict()

for query in queries:

head, query_ds = get_query_dataset(available_df, query, ratio=ratio)

ground_truth[head] = query_ds

return ground_truth


def convert_ground_truth(gt, base_to_idx):

queries = list(gt.keys())

qi = {base_to_idx[x]: i+1 for i, x in enumerate(queries)}

new_ds = dict()

for k, v in gt.items():

sub_d = dict()

for kk, vv in v.items():

sub_d[base_to_idx[kk]] = vv

new_ds[qi[base_to_idx[k]]] = sub_d

return new_ds


def get_result(df,
signatures,
min_samples=4,
ratio=0.25,
all_videos=False,
file_index=None):

if file_index is None:

signatures_data = np.array([np.load(x) for x in signatures])
basename = [os.path.basename(x)[:-4] for x in signatures]

else:

basename = [os.path.basename(x)[:-4] for x in file_index]
signatures_data = np.array(signatures)
signatures = file_index

basename_to_idx = {x: i for i, x in enumerate(basename)}

queries = get_queries(min_samples, df)
query_idx = [basename_to_idx[x] for x in queries]
similarities = calculate_similarities(query_idx, signatures_data)

ground_truth = get_ground_truth(df, queries, ratio=ratio)
final_gt = convert_ground_truth(ground_truth, basename_to_idx)
mAP, pr_curve = evaluate(final_gt, similarities, all_videos=all_videos)
return mAP, pr_curve


def download_dataset(
dst,
url="https://winnowpre.s3.amazonaws.com/augmented_dataset.tar.xz"):

if not os.path.exists(dst):

os.makedirs(dst)

number_of_files = len(glob(dst + '/**'))
print('Files Found',number_of_files)

if number_of_files < 2:

print('Downloading sample dataset to:{}'.format(dst))

fp = os.path.join(dst, 'dataset.tar.gz')
if not os.path.isfile(fp):

download_file(fp, url)
# unzip files
print('unpacking', fp)
shutil.unpack_archive(fp, dst)
# Delete tar
os.unlink(fp)
else:
print('Files have already been downloaded')


def get_frame_sampling_permutations(frame_samplings, frame_level_files):

d = defaultdict(list)

for v in frame_level_files:

data = np.load(v)

for frame_sampling in frame_samplings:

d[frame_sampling].append(data[::frame_sampling])

sm = SimilarityModel()

signatures = defaultdict(list)
for fs in d.keys():

video_level = np.array([global_vector(x) for x in d[fs]])
signatures[fs].append(
sm.predict_from_features(
video_level.reshape(
video_level.shape[0],
video_level.shape[2])))

return signatures