Skip to content

ima-hima/InsightData_Edgar

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

EDGAR parser for Insight Data Engineer fellowship

BSD3 license

Python code to parse EDGAR output and create new log files containing summary.

Project status: 1.0 release

Directory Description of Contents
src main code base
run.sh shell script to run source
insight_testsuite test suites and test data
input where EDGAR files will reside
output log files will be written

####Approximate pseudocode

ip_lookup -> dictionary of last query time -> set(ip)
log_table -> dictionary of ip -> [original_query_time, num_documents]
prev_time -> previous value of query time in log file; not value of last query

interval  <- interval from file
prev_time <- minimum time available in datetime

for each line in log, check current_time:
    if current_time's changed:
        for log_item in ip_lookup for each where query_time in
            [prev_time + 1 - interval, current_time - interval]:
                for log_item in log_table, sorted by original_query_time:
                    print log_item
                delete log_table[log_item]
            delete ip_lookup[current_time - interval]
        change prev_time to current_time
    if line[ip] exists in log_table:    ### this means we're still inside interval
        update log_table[ip][num_documents] += 1
        delete ip_lookup[current_time - interval + 1][ip]
        add ip_lookup[current_time][ip]
    else:  ### line[ip] doesn't exist
        add ip_lookup[current_time] = ip
        add log_table[ip][original_query_time] = current_time
        add log_table[ip][num_documents] = 1

for log_item in log_table, sorted by original_query_time:
    print log_item, final_time = current_time

Test cases in my-tests

  1. Time has expired but IP doesn't appear again.
  2. Change in time > 1.
  3. Queries are still active at end of log.
  4. Other inactivity periods.

TODOs

  1. Could be more modular.
  2. The data structure is a little complicated. Making it an object would be nice.
  3. Not currently using Pythonic EAFP coding style vs. LBYL.
  4. Dictionary comprehension and row printing could be a little prettier.

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published