
Investigate the opportunities of a .NET-based parser #113

Closed
17 of 18 tasks
philipmat opened this issue Aug 26, 2020 · 4 comments · Fixed by #114

Comments

@philipmat
Owner

philipmat commented Aug 26, 2020

The dotnet_parser branch has a dumb, unoptimized .NET Core-based parser, and the results are promising, to say the least: on labels, it's about 3.4x faster than the Python version (22s vs 75s).

$ time python3 run.py --export=label /Users/af59986/Dev/tmp/discogs /Users/af59986/Dev/tmp/discogs/csv-dir
Processing      labels: 1571873labels [01:14, 21041.64labels/s]                                            
python3 run.py --export=label /Users/af59986/Dev/tmp/discogs   74.51s user 0.44s system 99% cpu 1:15.05 total
$ time dotnet run bin/Release/netcoreapp3.1/discogs.dll -- ~/Dev/tmp/discogs/discogs_20200806_labels.xml.gz
0  - bin/Release/netcoreapp3.1/discogs.dll; 1  - /Users/af59986/Dev/tmp/discogs/discogs_20200806_labels.xml.gz
Variant2: /Users/af59986/Dev/tmp/discogs/discogs_20200806_labels.xml.gz
Found 1,571,873 label. Wrote them to /Users/af59986/Dev/tmp/discogs/label.csv; /Users/af59986/Dev/tmp/discogs/label_url.csv; /Users/af59986/Dev/tmp/discogs/label_image.csv.
dotnet run bin/Release/netcoreapp3.1/discogs.dll --   27.20s user 2.34s system 131% cpu 22.435 total

On releases.xml.gz (.NET version without track artists): 42:38 for .NET vs 1:45:16 for Python.

$ time python3 run.py --export=release /Users/af59986/Dev/tmp/discogs /Users/af59986/Dev/tmp/discogs/csv-dir
Processing    releases: 12867980releases [1:45:15, 2037.40releases/s]                                      
python3 run.py --export=release /Users/af59986/Dev/tmp/discogs   5753.31s user 48.92s system 91% cpu 1:45:16.46 total
$ time dotnet run bin/Release/netcoreapp3.1/discogs.dll -- ~/Dev/tmp/discogs/discogs_20200806_releases.xml.gz
0  - bin/Release/netcoreapp3.1/discogs.dll; 1  - /Users/af59986/Dev/tmp/discogs/discogs_20200806_releases.xml.gz
Variant2: /Users/af59986/Dev/tmp/discogs/discogs_20200806_releases.xml.gz
Parsing done. Writing streams.                                                                                          
Found 12,867,980 releases. Wrote them to.....
dotnet run bin/Release/netcoreapp3.1/discogs.dll --   3182.56s user 240.38s system 133% cpu 42:38.93 total
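The `cpu` percentages in the `time` summaries above can be cross-checked from the other three numbers on the same line. The sketch below assumes the shell reports CPU utilisation as (user + system) / wall-clock time and truncates it to a whole percent; that formula is an assumption about the shell's `time` format, not something stated in the issue. The helper names are made up for illustration.

```python
def to_seconds(clock: str) -> float:
    """Convert 'SS', 'MM:SS' or 'H:MM:SS' (fractions allowed) to seconds."""
    seconds = 0.0
    for part in clock.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def cpu_percent(user_s: float, system_s: float, wall_clock: str) -> int:
    """Assumed formula: (user + system) / wall-clock, truncated to a whole percent."""
    return int(100 * (user_s + system_s) / to_seconds(wall_clock))


# (user seconds, system seconds, wall-clock total) copied from the transcripts above.
runs = {
    "python labels": (74.51, 0.44, "1:15.05"),
    "dotnet labels": (27.20, 2.34, "22.435"),
    "python releases": (5753.31, 48.92, "1:45:16.46"),
    "dotnet releases": (3182.56, 240.38, "42:38.93"),
}

for name, (user_s, system_s, wall_clock) in runs.items():
    print(f"{name}: {cpu_percent(user_s, system_s, wall_clock)}% cpu")
```

Under that assumption the computed values match the reported 99%, 131%, 91% and 133%. A figure above 100% means the .NET runs kept more than one core busy on average, while the Python runs stayed within a single core.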

Performance Numbers

Note: test conditions (same OS, files, etc.) are consistent only within each row, so compare times only for the same file.

| File | Record Count | Python | C# |
|---|---|---|---|
| discogs_20200806_artists.xml.gz | 7,046,615 | 6:22 | 2:35 |
| discogs_20200806_labels.xml.gz | 1,571,873 | 1:15 | 0:22 |
| discogs_20200806_masters.xml.gz | 1,734,371 | 3:56 | 1:57 |
| discogs_20200806_releases.xml.gz | 12,867,980 | 1:45:16 | 42:38 |
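To turn the times above into speedup factors, a short sketch like the following works. The timings are copied from the table; the function and variable names are made up for illustration.

```python
def to_seconds(clock: str) -> int:
    """Convert 'M:SS' or 'H:MM:SS' to whole seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


def speedup(python_time: str, csharp_time: str) -> float:
    """How many times faster the C# run was than the Python run."""
    return to_seconds(python_time) / to_seconds(csharp_time)


# file -> (Python time, C# time), as listed in the table
timings = {
    "artists": ("6:22", "2:35"),
    "labels": ("1:15", "0:22"),
    "masters": ("3:56", "1:57"),
    "releases": ("1:45:16", "42:38"),
}

for name, (python_time, csharp_time) in timings.items():
    print(f"{name}: {speedup(python_time, csharp_time):.2f}x")
# artists: 2.46x
# labels: 3.41x
# masters: 2.02x
# releases: 2.47x
```

So the C# parser is roughly 2x to 3.4x faster depending on the file, with labels showing the largest gain.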

TODO:

  • labels parser (smallest file) that creates equivalent files to the python parser
  • releases parser (largest file)
    • compare times with python parser
    • compare release csv files with python csv files
  • artists
  • masters
  • compressed csv files
  • progress bar indicating conversion status
  • "accurate" API counts - might provide in a patch, if requested.
  • tests
  • GH build actions
  • binary production for major platforms
  • changes to database
  • command line arguments
  • verbose flag - might provide in a patch, if we can figure out what information counts as verbose
  • dry-run flag
  • provide platform builds; can get them from the built artifacts and attach them to the release
  • update README with running instructions
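For the "compare release csv files with python csv files" item, one possible approach is an order-insensitive comparison of the two outputs. This is only a sketch, not the check actually used for this issue: `compare_csv` and `load_rows` are hypothetical helpers, and it assumes both parsers write UTF-8 CSV files and that row order does not matter.

```python
import csv
from collections import Counter


def load_rows(path):
    """Read a CSV file into a multiset of rows, so duplicate rows are counted."""
    with open(path, newline="", encoding="utf-8") as handle:
        return Counter(tuple(row) for row in csv.reader(handle))


def compare_csv(python_csv, dotnet_csv):
    """Return (rows only in the Python output, rows only in the .NET output)."""
    python_rows = load_rows(python_csv)
    dotnet_rows = load_rows(dotnet_csv)
    # Counter subtraction keeps only rows that appear more often on the left side.
    return python_rows - dotnet_rows, dotnet_rows - python_rows


if __name__ == "__main__":
    import sys

    only_python, only_dotnet = compare_csv(sys.argv[1], sys.argv[2])
    print(f"{sum(only_python.values())} rows only in {sys.argv[1]}")
    print(f"{sum(only_dotnet.values())} rows only in {sys.argv[2]}")
```

Both files are held in memory, which is fine for labels but may be too heavy for the 12.8 million release rows; sorting both files and comparing them as streams would be the alternative there.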
@philipmat philipmat self-assigned this Aug 26, 2020
@philipmat philipmat linked a pull request Aug 31, 2020 that will close this issue
@MuleaneEve

This is awesome!
I will take a look at the pull request.

@philipmat philipmat added this to the v2.1 milestone Sep 4, 2020
@ijabz
Collaborator

ijabz commented Sep 6, 2020

But it's not really cross-platform, is it? Wouldn't it be difficult to run on a non-Windows platform?

@philipmat
Owner Author

@ijabz I encourage you to try the latest Linux build.

It's one binary requiring positively nothing to install.

I'm not sure how to make it much easier than that.

@ijabz
Collaborator

ijabz commented Sep 7, 2020

I thought .NET was Windows-only, my mistake.
