version 3.0.0 - IP Geolocation integration
see changelog for more information
WillTheFarmer committed Jan 28, 2025
1 parent c3232ee commit 374aa8a
Showing 6 changed files with 48 additions and 33 deletions.
4 changes: 2 additions & 2 deletions .github/CHANGELOG.md
@@ -9,7 +9,7 @@
- version 2.1.4 - 01/02/2025 - add import_device TABLE to separate import_client TABLE
- version 2.1.5 - 01/03/2025 - move platformNode column from import_client to import_device
- version 2.1.6 - 01/09/2025 - repository name change - ApacheLogs2MySQL to apache-logs-to-mysql
- version 3.0.0 - 01/28/2025 - IP Geolocation integration, several table & column renames, many process refinements - see changelog
- version 3.0.0 - 01/28/2025 - IP Geolocation integration, several table & column renames, many process refinements
- [1.0.1] apache_logs.error_systemCodeID corrected line - INTO logsystemCode to INTO logsystemCodeID
- [1.0.1] remove debugging - SELECT statement from apache_logs.process_access_import, process_error_import & normalize_useragent.
- [1.0.1] remove whitespace and commented out old code on all stored programs
@@ -98,7 +98,7 @@
- [3.0.0] rename TABLES `log_clientname` to `log_client`, `log_servername` to `log_server`
- [3.0.0] rename COLUMNS `clientnameid` to `clientid`, `servernameid` to `serverid` throughout application tables and processes.
- [3.0.0] modify `process_access_parse` and `process_error_parse` WHERE CLAUSES for server_name UPDATE commands.
- [3.0.0] add 16 stored functions for log attribute tables to return names for Slice and dice is a data analysis in drill-down Web interface.
- [3.0.0] add 16 stored functions for primary attribute tables to return names for Slice and dice is a data analysis in drill-down Web interface.
- [3.0.0] modify and reworded all console log messages in `logs2mysql.py` to standardize messages for each process. Added COLORS to coordinate message types for better readability.
- [3.0.0] modify all database INDEX NAMES for standardization and consolidation.
- [3.0.0] tested simultaneously uploading logs from 10 VPS with multiple VirtualHosts on each Server processing thousands of files in different formats and millions of log records.
18 changes: 2 additions & 16 deletions .github/CONTRIBUTING.md
@@ -1,23 +1,9 @@
To contribute any Issues or Errors found using application please create a `New issue` under repository `Issues` tab.
To contribute Issues or Errors found using application please create a `New issue` under repository `Issues` tab.

To contribute Ideas or Comments please create a `New discussion` under repository `Discussions` tab.

To contribute Apache Access or Error Log Formats commonly used that application should process please start `New discussion` about that.
To contribute Apache Access or Error Log Formats commonly used that application should process please start `New discussion`.

Any organizations, people or person with multiple Apache servers that find application a godsend in log collection monetary contributions are appreciated. Repository has my ***Buy Me a Coffee*** & ***Venmo*** links.

I volunteer for a nonprofit organization that wanted to import their Apache website logs into MySQL tables to query data. The Executive Director loves MySQL so I decided to research existing solutions that used MySQL. I thought it would be two or three days of my time.

First I installed the Apache log_sql_mysql modules which did create a single MySQL mostly empty table of the access log with no control or customization and many other issues. Next I experimented with several simple log file parsers but none normalized the parsed log data into a MySQL database. Finally I reviewed other available Apache logging solutions that didn't use MySQL including GoAccess, Logstash, Apache Viewer, DataDog and others as well as CrowdStrike and Solarwinds Loggly.

Mid-September 2024 after all my research I decided to write a simple solution which snowballed into a complete application. All October I worked long hours around the clock. November I spent incorporating the application into VPS websites and applications I oversee while making improvements along the way. Version 2.0.0 fixed the major issues encountered and is the application baseline. December I spent refining the major changes made in Version 2.0.0. Version 2.1.5 was last code change to fix client identification issue when OS version changes by adding `import_device` TABLE.

First 2 weeks of January 2025 I spent processing millions of records from 10 VPS simultaneously to single MySQL Server. Version 3.0.0 is last major change with IP Address geoLocation and a final pass through to fine tune processes and rename some tables and columns. This version of application is production ready.

The final version is less Python and more SQL and much faster processing millions of records. At this point, I have over 1050 hours of research, design, iteration & development into application. It is much more time then I intended to invest into this project but it did produce my first open-source software.

That's how volunteering, lack of a viable MySQL solution and a flexible schedule came together just right to allow me to dive deep into this project.

### “Timing, degree and conviction are the three wise men in this life.” — Robert I. Fitzhenry

Monetary contributions made will be reflected in development of [Web Interface](https://github.com/WillTheFarmer/mysql-to-apache-echarts) for this MySQL `apache_logs` schema.
25 changes: 16 additions & 9 deletions .github/README.md
@@ -6,18 +6,24 @@ and normalizing data into database designed for reports & data analysis.
Imports Access Logs in LogFormats - ***common***, ***combined*** and ***vhost_combined*** & additional ***csv2mysql***
LogFormat defined :point_down:

Imports Error Logs in ***default*** ErrorLogFormat & ***additional*** ErrorLogFormat defined below performing data harmonization on Apache Codes & Messages, System Codes & Messages, and Log Messages to create a unified, standardized dataset. Error Log view images :point_down:
Imports Error Logs in ***default*** ErrorLogFormat & ***additional*** ErrorLogFormat defined below performing data harmonization
on Apache Codes & Messages, System Codes & Messages, and Log Messages to create a unified, standardized dataset. Error Log view images :point_down:

All processing stages are encapsulated within one "Import Load" that captures process metrics, notifications and errors into MySQL import tables. Every log data record is traceable back to the computer, folder, file, load process, parse process and import process it came from.
All processing stages are encapsulated within one "Import Load" that captures process metrics, notifications and errors into MySQL import tables.
Every log data record is traceable back to the computer, folder, file, load process, parse process and import process it came from.

Multiple Access and Error logs and formats can be loaded, parsed and imported along with User Agent parsing and IP Address geoLocation retrieval in a single execution. A single execution can also be configured to only load logs to Server.
### Console Process Messages - 4 LogFormats, 2 ErrorLogFormats & 6 MySQL Stored Procedures
Multiple Access and Error logs and formats can be loaded, parsed and imported along with User Agent parsing and IP Address geolocation retrieval in a single execution.
A single execution can also be configured to only load logs to Server.
### Process Messages in Console - 4 LogFormats, 2 ErrorLogFormats & 6 MySQL Stored Procedures
![Processing Messages Console](./assets/processing_messages_console.png)
New version has [MaxMind GeoIP2](https://github.com/maxmind/GeoIP2-python) Python API integration with 5 additional MySQL tables for IP geoLocation data. Two DB-IP Lite databases are required - `IP to City` and `IP to ASN`. Free DB-IP Lite databases can be found at [DB-IP](https://db-ip.com/db/lite.php)

A visualization tool for the MySQL Schema ***apache_logs*** is [MySQL2ApacheECharts](https://github.com/willthefarmer/mysql-to-apache-echarts) and currently under development. The Web interface consists of Express.js web application frameworks with Drill Down Capability & [Apache ECharts](https://github.com/apache/echarts) frameworks for Data Visualization.
ApacheLogs2MySQL has [MaxMind GeoIP2](https://github.com/maxmind/GeoIP2-python) Python API integration with 6 MySQL tables for IP geolocation data normalization.
Two DB-IP Lite databases are required - `IP to City` and `IP to ASN`. Free DB-IP Lite databases can be found at [DB-IP](https://db-ip.com/db/lite.php)

Database Schema ***apache_logs*** designed to accommodate unlimited servers & domains. Step-by-step guide for easy installation :point_down:

A visualization tool for the MySQL Schema ***apache_logs*** is [MySQL2ApacheECharts](https://github.com/willthefarmer/mysql-to-apache-echarts) and currently under development.
The Web interface consists of [Express](https://github.com/expressjs/express) web application frameworks with Drill Down Capability
& [Apache ECharts](https://github.com/apache/echarts) frameworks for Data Visualization.
## Entity Relationship Diagram of apache_logs schema tables
![Entity Relationship Diagram](./assets/entity_relationship_diagram.png)
Diagram created with open-source database diagrams editor [chartdb/chartdb](https://github.com/chartdb/chartdb)
@@ -105,7 +111,8 @@ LogFormat "%v,%p,%h,%l,%u,%t,%I,%O,%S,%B,%{ms}T,%D,%^FB,%>s,\"%H\",\"%m\",\"%U\"
|%{VARNAME}C|ADDED - The contents of cookie VARNAME in request sent to server. Only version 0 cookies are fully supported. Format String is optional.|
|%L|ADDED - The request log ID from the error log (or '-' if nothing has been logged to the error log for this request). Look for the matching error log line to see what request| caused what error.
## Two supported Error Log Formats
Application processes Error Logs with ***default format*** for threaded MPMs (Multi-Processing Modules). If running Apache 2.4 on any platform and ErrorLogFormat is not defined in config files this is the Error Log format.
Application processes Error Logs with ***default format*** for threaded MPMs (Multi-Processing Modules). If running Apache 2.4 on any platform
and ErrorLogFormat is not defined in config files this is the Error Log format.
Information from: https://httpd.apache.org/docs/2.4/mod/core.html#errorlogformat
```
ErrorLogFormat "[%{u}t] [%-m:%l] [pid %P:tid %T] %7F: %E: [client\ %a] %M% ,\ referer\ %{Referer}i"
@@ -289,7 +296,7 @@ Normalization ensures that data is organized in a way that makes sense for the d
MySQL `apache_logs` schema currently has 55 Tables, 908 Columns, 188 Indexes, 72 Views, 8 Stored Procedures and 90 Functions to process Apache Access log in 4 formats
& Apache Error log in 2 formats. Database normalization at work!
## MySQL Access Log View by Browser - 1 of 66 schema views
Current schema views are Access and Error Attribute Primary tables created in normalization process with simple aggregate values.
Current schema views are Access and Error primary attribute tables created in normalization process with simple aggregate values.
These are primitive data presentations of the log data warehouse. ApacheLogs2MySQL is the 'EL' of the 'ELK' Stack. The Web interface
[MySQL2ApacheECharts](https://github.com/willthefarmer/mysql-to-apache-echarts) in development is the 'K' of the 'ELK' Stack.

22 changes: 22 additions & 0 deletions apache_logs_schema.sql
@@ -1,3 +1,25 @@
-- # Licensed under the Apache License, Version 2.0 (the "License");
-- # you may not use this file except in compliance with the License.
-- # You may obtain a copy of the License at
-- #
-- # http://www.apache.org/licenses/LICENSE-2.0
-- #
-- # Unless required by applicable law or agreed to in writing, software
-- # distributed under the License is distributed on an "AS IS" BASIS,
-- # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-- # See the License for the specific language governing permissions and
-- # limitations under the License.
-- #
-- # version 3.0.0 - 01/28/2025 - IP Geolocation integration, table & column renames, refinements - see changelog
-- #
-- # Copyright 2024 Will Raymond <farmfreshsoftware@gmail.com>
-- #
-- # CHANGELOG.md in repository - https://github.com/WillTheFarmer/apache-logs-to-mysql
-- #
-- file: apache_logs_schema.sql
-- synopsis: Data definition language (DDL) for creating MySQL schema apache_logs for ApacheLogs2MySQL application
-- author: Will Raymond <farmfreshsoftware@gmail.com>

CREATE DATABASE IF NOT EXISTS `apache_logs` /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci */ /*!80016 DEFAULT ENCRYPTION='N' */;
USE `apache_logs`;
-- MySQL dump 10.13 Distrib 8.0.40, for Win64 (x86_64)
8 changes: 4 additions & 4 deletions logs2mysql.py
@@ -38,7 +38,7 @@
from time import time
from time import ctime
from datetime import datetime
load_dotenv() # Loads variables from .env into the environment
load_dotenv() # Loads variables from .env into the environment
mysql_host = getenv('MYSQL_HOST')
mysql_port = int(getenv('MYSQL_PORT'))
mysql_user = getenv('MYSQL_USER')
@@ -76,7 +76,7 @@
geoip2_city = getenv('GEOIP2_CITY')
geoip2_asn = getenv('GEOIP2_ASN')
geoip2_process = int(getenv('GEOIP2_PROCESS'))
# makes process start, complete, info and error messages noticeable in console - all error messages start with 'ERROR - ' for keyword log search
# Readability of process start, complete, info and error messages in console - all error messages start with 'ERROR - ' for keyword log search
class bcolors:
GREEN = '\33[32m'
GREENER = '\033[92m'
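
A note on the color codes in this hunk: `'\33'` and `'\033'` are both octal escapes for the same ESC byte (`0x1b`), so the two spellings behave identically. The sketch below shows one way such codes could wrap a console message while keeping the searchable `'ERROR - '` prefix. Only `GREEN` and `GREENER` appear in the diff; `RED`, `ENDC`, `colorize` and `error_message` are hypothetical names introduced for illustration, not taken from `logs2mysql.py`.

```python
# Illustrative sketch only. GREEN and GREENER are copied from the diff;
# RED, ENDC and both helper functions are assumptions for this example.
class bcolors:
    GREEN = '\33[32m'     # '\33' is the octal escape for ESC (0x1b)
    GREENER = '\033[92m'  # '\033' is the same ESC byte; 92 is bright green
    RED = '\033[91m'      # assumed name: bright red for error messages
    ENDC = '\033[0m'      # assumed name: resets all terminal attributes


def colorize(message: str, color: str) -> str:
    """Wrap a message in a color code and reset the terminal afterwards."""
    return f"{color}{message}{bcolors.ENDC}"


def error_message(detail: str) -> str:
    """Build an error line that keeps the 'ERROR - ' keyword for log searches."""
    return colorize(f"ERROR - {detail}", bcolors.RED)


if __name__ == "__main__":
    print(colorize("Import Load started", bcolors.GREEN))
    print(error_message("file not found"))
```

Resetting with the `0` attribute code after every message keeps a colored line from bleeding into whatever the console prints next.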
@@ -98,7 +98,7 @@ class bcolors:
'database': mysql_schema,
'local_infile': True
}
# information to identify & register import upload client
# Information to identify & register import load clients
def get_device_id():
sys_os = system()
if sys_os == "Windows":
@@ -591,7 +591,7 @@ def processLogs():
# SECONDARY PROCESSES BELOW: Client Module UPLOAD is done with load, parse and import processes of access and error logs. The below processes enhance User Agent and Client IP log data.
# Initially UserAgent and GeoIP2 processes were each in separate files. After much design consideration and application experience and Code Redundancy being problematic
# the decision was made to encapsulate all processes within the same "Import Load" which captures and logs all execution metrics, notifications and errors
# into MySQL tables for each execution. Every log datarecord can be tracked back to the file, folder, computer, load process, parse process and import process it came from.
# into MySQL tables for each execution. Every log data record can be tracked back to the file, folder, computer, load process, parse process and import process it came from.
# Processes may require individual execution even when NONE of above processes are executed. If this Module is run automatically on a client server to upload Apache Logs to centralized
# MySQL Server the processes below will never be executed. In some cases, only the processes below are needed for execution on MySQL Server or another centralized computer.
# In some cases, ALL processes above and below will be executed in a single "Import Load" execution. Therefore, the encapsulation of all processes in a single module.
4 changes: 2 additions & 2 deletions watch4logs.py
@@ -28,12 +28,12 @@
from watchdog.events import FileSystemEventHandler
from dotenv import load_dotenv
from logs2mysql import processLogs
load_dotenv() # Loads variables from .env into the environment
load_dotenv() # Loads variables from .env into the environment
watch_path = os.getenv('WATCH_PATH')
watch_recursive = bool(int(os.getenv('WATCH_RECURSIVE')))
watch_interval = int(os.getenv('WATCH_INTERVAL'))
watch_log = int(os.getenv('WATCH_LOG'))
# make error messages noticeable in console - all error messages start with 'ERROR - ' for keyword log search
# Readability of event messages in console
class bcolors:
GREEN = '\33[32m'
GREENER = '\033[92m'
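
A note on the settings read in this hunk: `os.getenv` returns `None` when a variable is missing from `.env`, so `int(os.getenv('WATCH_INTERVAL'))` would raise a `TypeError` at import time. A minimal sketch of a more forgiving reader follows, assuming that fallback defaults are acceptable. The helper names `env_int` and `env_flag` and every default value are hypothetical and do not come from `watch4logs.py`.

```python
import os


def env_int(name: str, default: int) -> int:
    """Return an integer environment variable, or default if unset or empty."""
    raw = os.getenv(name)
    if raw is None or raw.strip() == "":
        return default
    try:
        return int(raw)
    except ValueError as exc:
        # Name the offending variable so a bad .env entry is easy to find.
        raise ValueError(f"{name} must be an integer, got {raw!r}") from exc


def env_flag(name: str, default: bool = False) -> bool:
    """Interpret 0/1 style variables the same way bool(int(...)) does."""
    return bool(env_int(name, int(default)))


if __name__ == "__main__":
    # The defaults below are placeholders chosen for this example only.
    watch_recursive = env_flag('WATCH_RECURSIVE', default=False)
    watch_interval = env_int('WATCH_INTERVAL', default=15)
    print(watch_recursive, watch_interval)
```

Whether a silent default or a hard failure is preferable depends on the deployment; failing fast with a clear message may suit settings such as `WATCH_PATH` that have no sensible fallback.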