
SQL sanitization #4

Open
gzt5142 opened this issue Dec 12, 2022 · 9 comments

Comments

@gzt5142
Owner

gzt5142 commented Dec 12, 2022

Means by which SQL statements are sanitized to remove injection attempts.

@gzt5142
Owner Author

gzt5142 commented Dec 30, 2022

The user inputs a source ID for the display, validate, and ingest commands. A user could attempt SQL injection by supplying a source ID such as `13; drop table features`.

Need to sanitize provided crawler_source_id input
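
A minimal guard along these lines (a sketch; the helper name is hypothetical, not from the repo) is to refuse anything that does not parse as a plain integer before it gets anywhere near a query:

```python
def parse_source_id(raw: str) -> int:
    """Accept only a plain integer source ID; reject anything else."""
    raw = raw.strip()
    if not raw.lstrip("-").isdigit():
        # "13; drop table features" fails here and never reaches the database.
        raise ValueError(f"invalid crawler_source_id: {raw!r}")
    return int(raw)
```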

@gzt5142
Owner Author

gzt5142 commented Dec 30, 2022

A side benefit of using a library such as sqlalchemy for interacting with the database is that the python constructs take care of a lot of injection worries for us. We don't ever send real SQL to the database; sqlalchemy does that. So the opportunities for injection are more limited.

Example:

from sqlalchemy import select

statement = (
    select(CrawlerSource)
    .order_by(CrawlerSource.crawler_source_id)
    .where(CrawlerSource.crawler_source_id == 13)
)

The Python statement is meant to mimic what the SQL would look like, but we don't write the SQL ourselves... sqlalchemy generates the SQL behind the scenes and submits it. It does rudimentary sanity checks for us as part of that.
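
To make that concrete, here is a minimal runnable sketch (the model is a stand-in, assuming a simple CrawlerSource mapping): the compiled statement contains a bound-parameter placeholder rather than the literal value, so the value can never be parsed as SQL syntax.

```python
from sqlalchemy import Column, Integer, String, select
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class CrawlerSource(Base):
    """Stand-in model; the real table has more columns."""
    __tablename__ = "crawler_source"
    crawler_source_id = Column(Integer, primary_key=True)
    source_suffix = Column(String)

statement = (
    select(CrawlerSource)
    .order_by(CrawlerSource.crawler_source_id)
    .where(CrawlerSource.crawler_source_id == 13)
)

# The rendered SQL uses a placeholder (e.g. :crawler_source_id_1);
# the value 13 travels separately as a bound parameter.
print(statement)
```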

@gzt5142
Owner Author

gzt5142 commented Jan 3, 2023

> Need to sanitize provided crawler_source_id input

Following up on this idea... all crawler source IDs are integers, so we could use click to validate types from the command line, which would also remove this worry (and would probably be more robust than what I'm doing). However, I want the option to run `nldi-cli validate all`... so I need the arg to the validate command to be a string in order to accommodate that all possibility.

@gzt5142
Owner Author

gzt5142 commented Jan 5, 2023

Related issue...

Some of the data we get back as GeoJSON has unprintable characters in strings. In and of itself, this is not a problem. But sometimes, those characters are backspace and delete characters, which can mess with the strings that get inserted into SQL statements.

A brute-force method for dealing with this is to encode strings as ASCII, replacing non-ASCII with ?. The resulting string is re-encoded as UTF-8 for passing to the database. This is not an elegant solution, but it does quickly eliminate suspicious characters from strings.
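
A sketch of that brute-force scrub (the helper name is illustrative):

```python
def scrub(text: str) -> str:
    # Encode to ASCII, replacing any non-ASCII character with '?',
    # then decode back to a str; the database driver re-encodes as UTF-8.
    return text.encode("ascii", errors="replace").decode("ascii")
```

One caveat: backspace (0x08) and delete (0x7F) are themselves ASCII, so a stricter variant would additionally drop non-printable characters (e.g. keep only characters where `ch.isprintable()` is true).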

@dblodgett-usgs

Really? That should probably just be an error that the provider should have to fix?

@gzt5142
Owner Author

gzt5142 commented Jan 17, 2023

> Really? That should probably just be an error that the provider should have to fix?

I discovered it in some of the GeoJSON... but none that we actually use. That is, such characters were included in properties that we don't rely on, so we could probably get away with not manipulating the strings as I describe.

But it is also a super-fast operation that guarantees clean strings. In terms of having the provider fix them... I could generate a list of records that contain non-printables so we can notify them. I'll see how many and from which providers.

@gzt5142
Owner Author

gzt5142 commented Jan 17, 2023

sqlalchemy does not offer API calls within its core module for dropping and renaming tables -- something we have to do when a complete set of features is ingested. That is still done with raw SQL -- a security vulnerability I'd like to move away from if we can.

The only piece of that raw SQL that varies is the table name, which is adapted from the crawler_source.source_suffix value. We have been assuming that this value is clean. We need a way to ensure it doesn't contain SQL. Low-ish priority, given how hard it would be to manipulate that value in the database.

EDIT:
This is higher priority than I thought... it turns out that the crawler source table contains an illegal source_suffix value for source 13: `geoconnex-demo`. The hyphen is taken to be an infix minus.

I now sanitize the source_suffix value to discard any 'non-word' characters. Word characters are specified by the regex pattern [a-zA-Z0-9_] (shorthand: \w). This will take care of error characters as well as any malicious strings.
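
That sanitization can be as small as a single regex substitution (a sketch; the function name is illustrative):

```python
import re

def clean_suffix(suffix: str) -> str:
    # Keep only word characters [a-zA-Z0-9_]; discard everything else.
    return re.sub(r"\W", "", suffix)
```

`clean_suffix("geoconnex-demo")` yields `geoconnexdemo`, and a payload like `"; DROP TABLE features"` collapses to harmless word characters.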

@gzt5142
Owner Author

gzt5142 commented Jan 19, 2023

> all crawler source IDs are integers, so we could use click to validate types from the command line,

I have changed my perspective on this. I have refactored to have click validate source_id values as integers. On the validate subcommand, I default to all sources if none is specified:

nldi-cli validate 13

Validates source 13

nldi-cli validate

Validates all sources found in the crawler_sources table.
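
A sketch of that CLI shape with click (the command body here is illustrative):

```python
import click

@click.command()
@click.argument("source_id", type=int, required=False)
def validate(source_id):
    """Validate one crawler source, or every source if no ID is given."""
    if source_id is None:
        click.echo("Validating all sources")
    else:
        click.echo(f"Validating source {source_id}")
```

Because the argument is typed as `int`, click rejects input such as `13; drop table features` with a usage error before the command body ever runs.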

@gzt5142 gzt5142 self-assigned this Feb 1, 2023
2 participants