zip ha url parsing
yohhaan committed May 16, 2024
1 parent 8345d78 commit 8590fed
Showing 6 changed files with 51 additions and 25 deletions.
22 changes: 22 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,22 @@
+{
+    "name": "ha_topics_classification",
+    "build": {
+        "dockerfile": "../Dockerfile"
+    },
+    "postCreateCommand": "",
+    "customizations": {
+        "vscode": {
+            "extensions": [
+                // Other helpers
+                "shardulm94.trailing-spaces",
+                "stkb.rewrap" // rewrap comments after n characters on one line
+            ],
+            "settings": {
+                // General settings
+                "files.eol": "\n",
+                "rewrap.autoWrap.enabled": true,
+                "editor.formatOnSave": true
+            }
+        }
+    }
+}
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
+ha_urls.zip
+ha_urls/*
+!ha_urls/.gitkeep
1 change: 1 addition & 0 deletions Dockerfile
@@ -4,6 +4,7 @@ RUN apt-get update && apt-get install -y python3 \
     python3-pip \
     libusb-1.0-0-dev \
     parallel \
+    unzip \
     locales && \
     apt-get clean autoclean && \
     apt-get autoremove
6 changes: 3 additions & 3 deletions README.md
@@ -6,11 +6,11 @@ Classification of HTTP Archive origins by the Topics API.

 1. Clone this repository along with its submodule with: `git clone --recurse-submodules <HTTPS or SSH URL>`.

-2. Replace the `origins.txt` file with the domains to classify (just the FQDN).
+2. Place the `.csv` files with the HA origins under ha_urls.

-3. Launch classification (we recommend using a `screen` session):
+3. Launch classification (we recommend using a `screen` session):
     - (if you have dependencies installed): `./classify_origins.sh`
-        - **System Dependencies:** `python3`, GNU `parallel`
+        - **System Dependencies:** `python3`, GNU `parallel`, `unzip`
         - **Python Dependencies:** `pandas`, `tflite-support`
     - (if using Docker):
         ```
44 changes: 22 additions & 22 deletions classify_origins.sh
@@ -4,28 +4,28 @@
 model_version=chrome5 #latest as of May 14, 2024
 classification_type=topics-api #replicates Chrome Classification

 # Variables
-input_path=origins.txt
-domains_path=domains.txt
-classified_path=classification_${model_version}_${classification_type}.tsv
+ha_dir=ha_urls #folder with all the csv files with HA origins

-#check if domains exist and
-if [ ! -f $classified_path ]
-then
+for csv_file in ${ha_dir}/*.csv
+do
+    domains_file="${csv_file%.*}"_domains.txt
+    classified_file="${csv_file%.*}"_classification_${model_version}_${classification_type}.tsv

-    #some preprocessing to make sure there is no http(s):// or duplicates
-    #if already enforced, feel free to remove and pass directly input_path to
-    #parallel command
-    sed -r "s/^https?:\/\///" $input_path > $domains_path
-    sort -u $domains_path -o $domains_path
+    if [ ! -f $classified_file ]
+    then
+        #some preprocessing to make sure there is no http(s):// or duplicates
+        #if already enforced, feel free to remove and pass directly input_path to
+        #parallel command, remove url header line too
+        sed -r "s/^https?:\/\/(.*)\/?$/\1/p" $csv_file > $domains_file
+        sed -i -r "s/(.*)\/$/\1/p" $domains_file
+        sed -i '1d' $domains_file
+        sort -u $domains_file -o $domains_file
+        #Classification
+        #Header
+        echo "domain\ttopic_id" > $classified_file
+        #Parallel inference
+        parallel -X --bar -N 1000 -a $domains_file -I @@ "python3 topics_classifier/classify.py -mv $model_version -ct $classification_type -i @@ >> $classified_file"
+        echo "Classification Done: $csv_file"
+    fi
+done
-
-    #Classification
-    #Header
-    echo "domain\ttopic_id" > $classified_path
-    #Parallel inference
-    parallel -X --bar -N 1000 -a $domains_path -I @@ "python3 topics_classifier/classify.py -mv $model_version -ct $classification_type -i @@ >> $classified_path"
-
-    #rm domains input
-    rm $domains_path
-
-fi
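
The preprocessing in the new loop (strip the scheme, strip a trailing slash, drop the header row, de-duplicate) could be sketched as a single pipeline. This is a minimal sketch, not the committed code: `normalize_origins` is a hypothetical helper name, and it assumes each CSV has exactly one header line followed by one URL per line. In the committed `sed` calls, the `p` flag without `-n` prints every matching line twice, and the later `sort -u` removes those duplicates; the sketch simply omits the flag.

```shell
#!/usr/bin/env bash
# Sketch only: normalize_origins is a hypothetical helper, not part of the commit.
# Assumes a CSV with a single header line followed by one URL per line.
normalize_origins() {
    tail -n +2 "$1" |                              # drop the header line
        sed -E -e 's#^https?://##' -e 's#/$##' |   # strip the scheme, then one trailing slash
        sort -u                                    # sort and de-duplicate
}

# Demo on a throwaway file (the URLs are made-up examples)
demo_csv=$(mktemp)
printf 'url\nhttps://example.com/\nhttp://example.com\nhttps://a.org/\n' > "$demo_csv"
normalize_origins "$demo_csv"   # prints a.org, then example.com
rm -f "$demo_csv"
```

Under these assumptions, the output of `normalize_origins "$csv_file"` could be redirected to `$domains_file` in place of the three `sed` calls and the in-place `sort`.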
Empty file added ha_urls/.gitkeep
Empty file.
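
Each CSV under `ha_urls` gets its own intermediate and output file, named by stripping the extension with `${csv_file%.*}`. The sketch below illustrates that naming and a header write that does not depend on how `echo` treats backslashes: bash's builtin `echo` prints `\t` literally unless `-e` or `xpg_echo` is set, while `printf` interprets it consistently. `output_paths` and `write_header` are hypothetical helper names, and `ha_urls/sample.csv` is a made-up input path.

```shell
#!/usr/bin/env bash
# Sketch only: output_paths and write_header are hypothetical helpers.
# The two variables mirror the values set at the top of classify_origins.sh.
model_version=chrome5
classification_type=topics-api

# Print the two paths derived from one CSV file, one per line.
output_paths() {
    local stem="${1%.*}"   # drop the last extension: ha_urls/a.csv -> ha_urls/a
    printf '%s\n' "${stem}_domains.txt"
    printf '%s\n' "${stem}_classification_${model_version}_${classification_type}.tsv"
}

# Write the TSV header with a real tab character between the column names.
write_header() {
    printf 'domain\ttopic_id\n' > "$1"
}

output_paths "ha_urls/sample.csv"
# prints:
#   ha_urls/sample_domains.txt
#   ha_urls/sample_classification_chrome5_topics-api.tsv
```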
