Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

List files processed but excluded from multiuser stats and leader boards #71

Closed
jteresco opened this issue Apr 15, 2018 · 24 comments
Closed

Comments

@jteresco
Copy link
Contributor

We have a user who has submitted a second list for travels in an app in addition to his real travels. He asked to remove it from the system to keep the site's stats and leader boards legitimate. A task to consider is how to allow such a list file to be processed for a user and to be able to see maps and stats for such a user but exclude them from multiuser stats and leader boards. These kinds of list files might be useful for some other purposes. One way I might use them is to see stats and maps for a single trip for which I'd submit a list file.

@jteresco
Copy link
Contributor Author

This issue came back to mind as I traveled a route this week that previously had 0 TM travelers. But since it will show up in both terescoj.list (my overall travels) and jt2021.list (my 2021 travels), it will look like 2 users once I add it to both (I'm just adding travels on this trip to terescoj.list for now). I am pretty sure there are several "users" like this - in fact I have three extra users for subsets of my own travels.

I'm not sure how best to go forward on this. An extra file somewhere listing .list file names that should be excluded from rankings and traveler counts? Can it all be done in the site update and DB or will there need to be some web front end fixes as well?

@yakra
Copy link
Contributor

yakra commented Aug 21, 2021

Or some kind of delimiter pattern like _jt2021.list, __jt2021__.list, +jt2021.list, etc.?

@jteresco
Copy link
Contributor Author

Or some kind of delimiter pattern like _jt2021.list, jt2021.list, +jt2021.list, etc.?

Thinking back on this issue a bit, trying to decide if using such a pattern for these kinds of lists might mean this could be entirely a web front end issue. Maybe it's a matter of modifying the SQL queries to ignore traveler names that have the selected pattern.

@jteresco
Copy link
Contributor Author

TravelMapping/Web#741

@jteresco
Copy link
Contributor Author

The DataProcessing contribution here might be to introduce a new DB table for the travelers, with a column indicating if this is a "real" user or one of these users that should be excluded from stats and rankings. This could also be an opportunity to store some other user info in the DB if there are things we wanted to store during site update.

@jteresco
Copy link
Contributor Author

I might try to tackle this one soon. Summarizing options for how to indicate that a .list file is not a "real user":

  1. Put non-real user list in a subdirectory of list_files in the repository, or possibly all lists into one of two subdirectories, one for real users and one for non-real.
  2. Add a separate file in the repository listing the names of non-real users. If this were to be a .csv file, it could have other augmentations like a "real name" for the user, a home location, etc.
  3. Require something special in the filenames for non-real users like starting with an underscore.

When a non-real user is identified during the site update process, a new table would need to be introduced into the DB to make this information available.

@michihdeu
Copy link
Contributor

michihdeu commented Jun 11, 2024

I like option 2. I dislike Option 1 since it is a potential source of error.

@jteresco
Copy link
Contributor Author

Advantages of the extra file listing these kinds of things (option 2): that's much easier to undo or ignore, and nothing else in the repository needs to change to implement it, and new "normal" files can be created without needing to do anything with the file.

@jteresco
Copy link
Contributor Author

Here's how I'm thinking of it: the site update program will read in a new file in the list_files (or rlist_files) directory which I'm calling travelerinfo.csv. The initial version looks like this so far, putting in several of the known non-real traveler names:

traveler;real
defaults;1
6lane.list;0
6lanetest.list;0
futureinterstates.list;0
jt2015a3;0
jt2020;0
jt2020q5;0
jt2021;0
jt2022;0
jt2023;0
jt2024;0

The site update program would read this file, then use its contents to add lines to the generated .sql file to create and populate a new table called travelers. It would have a table row for every .list file that was imported, and a DB table column for each field in the CSV file. The first line of the travelerinfo.csv file gives the column names, and the second line gives the default values to use for each traveler if they are not listed in the remainder of the CSV. For those listed in the CSV, the values there would be used instead.

This would mean additional fields could be added later without further modifications to the site update program.

Once this information is in the DB, it could be used by the web front end in the implementation for TravelMapping/Web#741 to exclude the non-real users from rank counting.

@yakra (or others), any thoughts?

@michihdeu
Copy link
Contributor

I liked the augmentations idea of option 2. I‘d minimum add a description column to name what the user refers to eg 6lanetest

@jteresco
Copy link
Contributor Author

Definitely on the description column.

@yakra
Copy link
Contributor

yakra commented Jun 14, 2024

This would mean additional fields could be added later without further modifications to the site update program.

We'd still need to get all the CREATE TABLE args right.
This suggests to me an external .sql file as done in df09ebd.
siteupdate could read it in and spit it right back out into the DB file, then append the data from the .csv

For turning the .csv data into an SQL line, maybe use strtok, strcspn or some std::string-based method to get our single-quotes, commas, parentheses, whatever else; whatever makes sense & gets the job done.

Sanity checks:

  • Double up ' single quotes as needed
  • Disallow " double quotes
  • Make sure each line has the right number of fields.
  • As needed, make sure numeric fields are in fact numeric.
    Probably use the same valid_num_str function used to validate Waypoint lat/lng.
    When to do this? When the field is indicated in the previous external .sql file as being INTEGER, FLOAT, or... are there other numeric types?
  • Anything else?

@jteresco
Copy link
Contributor Author

The sanity checks are definitely important. Especially doubling up the single quotes. I would have forgotten that.

For determining types of each field, maybe another "special" line? So the format could be:

Line of field names
Line of field SQL types (can be used both to pick correct sanity checks and directly in CREATE TABLE)
Line of default values
Data lines

jteresco added a commit that referenced this issue Jun 25, 2024
@jteresco
Copy link
Contributor Author

jteresco commented Jun 25, 2024

In the unrankedusers branch, there is now code to read in a .csv and create a new database table. It is not yet being used in any meaningful way. The TravelMappingTest database on noreaster currently includes this table.

@jteresco
Copy link
Contributor Author

To do: omit unranked users from traveled-format graphs.

@yakra
Copy link
Contributor

yakra commented Jun 26, 2024

To do: omit unranked users from traveled-format graphs.

Want to find a way to do this that retains the performance benefits of using TMBitsets.
The first thing that comes to mind is to use a sorted TMArray of TravelerList objects.

Best way to do that is to sort the ids/usernames before constructing the objects, which means doing this before this. This looks safe to do per my first quick glance at the new code, as we're just setting up listinfo.
This allows us to change the sort criterion here to use a lambda/function that checks whether both strings are contained in listinfo and sort appropriately.

Hmm. Exact mechanics TBD.
I assume travelers not in listinfo will all default to being ranked.
For those that are, the data will specify whether they're ranked or not, correct?
We can assume the ranked column will always be at a certain column number, or we can find that column in the header row (fieldnames), and use that to index into the listinfo vectors.
Then there's the matter of how ranked & unranked are specified in the data field on each row -- 1 or 0, a text value, possible other values beyond simple ranked/unranked, etc.

The big advantage of doing this:
This keeps all of the ranked users together at the front of the array, and all the unranked users together at the end.
Just change the 2nd arg in this ctor call to num_ranked_travelers or whatever we end up calling it, and the TMBitset for travelers on edges can then point to a contiguous subset of the TMBitset for travelers on segments, and graph generation can proceed just as it does now -- even a wee bit quicker. The existing TMBitset::fast_union function can be used without modification.

Disadvantage:
Travelers will no longer be sorted alphabetically overall.
Just the 2 sub-arrays of ranked & unranked travelers will be.
I can't think of anywhere this comes into play off the top of my head other than stats CSVs. Is this something we even care about?

Should we also omit unranked travelers from stats CSVs?

@jteresco
Copy link
Contributor Author

To do: omit unranked users from traveled-format graphs.

Want to find a way to do this that retains the performance benefits of using TMBitsets.

Thanks for the thoughts on this. Please feel free to make this a very low priority. Maybe it deserves to be spun off into its own issue.

@yakra
Copy link
Contributor

yakra commented Jun 26, 2024

Does the .csv file exist anywhere?
I suppose I could read the forum thread ;)

@michihdeu
Copy link
Contributor

Do we need any check for https://github.com/TravelMapping/UserData/blob/master/list_files/listfileinfo.csv to avoid crashing the site update?

@yakra
Copy link
Contributor

yakra commented Jun 26, 2024

^ Jim's new code has a few errorchecks.

{ el.add_error("Error opening listinfo.csv file.");

{ el.add_error("Number of defaults does not match number of fieldnames in listinfo.csv.");

{ el.add_error("Number of fields does not match number of fieldnames in listinfo.csv for list " + listname);

I haven't looked closely enough too see whether anything else is necessary.
Maybe a sanity check for a minimum of 2 lines, header + defaults?

@michihdeu
Copy link
Contributor

Invalid characters?

@yakra
Copy link
Contributor

yakra commented Jun 30, 2024

  • Whether numeric fields can be fully & correctly converted to ints/floats?

@jteresco wrote:

This would mean additional fields could be added later without further modifications to the site update program

@jteresco wrote:

Line of field SQL types (can be used both to pick correct sanity checks and directly in CREATE TABLE)

Has this idea been abandoned?
I really don't have an issue with that being either a Yes or No, other than to note that actually doing this would be more complicated and harder to get right.
If abandoning, easier to add some add'l errorchecks.

sqlfile << "CREATE TABLE listEntries (traveler VARCHAR(" << DBFieldLength::traveler
<< "), description VARCHAR(" << DBFieldLength::listDescription
<< "), includeInRanks BOOLEAN);\n";

The number of fields is hard-coded at 3.

// check that the number of defaults matches the number of fieldnames
if (TravelerList::fieldnames.size() != TravelerList::defaults.size())
{ el.add_error("Number of defaults does not match number of fieldnames in listinfo.csv.");
return;
}

For this errorcheck, we may be better off making sure the number of fields in each line is the same as a constant, e.g. DBFieldLength::listEntriesNumFields.
TravelerList::read_listinfo could compare each field against DBFieldLength::listDescription.

// report any remaining entries in the listinfo map as errors (TODO)

Difficulty here is that by the time we're writing the DB, we've already checked for ErrorList entries, found none, and failed to abort.
To do the TMBitset stuff, something has to happen earlier in the program.
Makes sense to combine this stuff.
Maybe something like...

  • Gather traveler IDs
  • Read listinfo, store necessary info with the ID, and delete entries from the map
    • One option is to expand the traveler IDs, to a pair, tuple or custom struct
    • Or maybe construct barebones traveler objects with the listinfo, separating out .list reading into its own function, separate from the ctor, to happen later.
    • ...depending on how everything is done.
  • Sort travelers.
  • Later on, read .list files.

@jteresco
Copy link
Contributor Author

jteresco commented Jul 1, 2024

I don't think it's worth pursuing the more generic .csv idea. If we come up with more fields to add some time, we can update the site update code to handle.

Also, I have no objection to combining/moving around how and when the .csv is read in o help with error checking and efficiency.

Thanks!

jteresco added a commit that referenced this issue Jul 2, 2024
@jteresco
Copy link
Contributor Author

jteresco commented Jul 4, 2024

This has been merged into master and was used for a morning site update on 7/4/2024. Closing this issue. Please open new issues for feature requests and bug reports related to this new functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants