Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help: looping through a table of CSV files #18

Closed
Gavin-Holt opened this issue Jun 28, 2021 · 7 comments
Closed

Help: looping through a table of CSV files #18

Gavin-Holt opened this issue Jun 28, 2021 · 7 comments

Comments

@Gavin-Holt
Copy link

Gavin-Holt commented Jun 28, 2021

Hi,

I am using SQLean in a script to process UK NHS waiting times (I'm a surgeon - not an IT professional). This uses 50 large (90+MB) CSV files and I have been using vsv.dll to load each file, extract data, then drop the file, finally dumping the extract.

The main attraction is using SQLite3.exe to run scripts or as a REPL. This neatly avoids using any other programming language i.e. a single script to maintain/share. (I would be reluctant to resort to batch files or another programming language if I can possibly achieve my goals with just SQLite3.exe).

The whole process of reading the CSV and dumping into Excel is brilliant. I guess much work has gone into the CSV reading library - many thanks.

With cut/paste I have made it work, but there is a great deal of repetition making further development difficult.

Googling I suspect there is a way to loop through a table of file names? This would seem to be a likely usage pattern for vsv.dll and I was hoping you might show me an example to follow?

I don't expect you to write the script for me but I have appended below my work so far.

Any guidance welcome.

Kind Regards Gavin Holt

Raw data from:
https://www.england.nhs.uk/statistics/statistical-work-areas/rtt-waiting-times/
I hand edited these to remove top lines and create consistent filenames

Load vsv

.load vsv

My table of files:

-- Create a table of files to process
CREATE TABLE filelist(
filename,                      -- UNIX path from USB root
census,                         -- Date as 2021-10-31
UNIQUE(filename,census)
);

INSERT INTO filelist (filename,census)
VALUES
("/MyProjects/#NHSWaitingTimes/201704.csv","2017-04-30"),
("/MyProjects/#NHSWaitingTimes/201705.csv","2017-05-31"),
("/MyProjects/#NHSWaitingTimes/201706.csv","2017-06-30"),
("/MyProjects/#NHSWaitingTimes/201707.csv","2017-07-31"),
("/MyProjects/#NHSWaitingTimes/201708.csv","2017-08-31"),
("/MyProjects/#NHSWaitingTimes/201709.csv","2017-09-30"),
("/MyProjects/#NHSWaitingTimes/201710.csv","2017-10-31"),
("/MyProjects/#NHSWaitingTimes/201711.csv","2017-11-30"),
("/MyProjects/#NHSWaitingTimes/201712.csv","2017-12-31"),
("/MyProjects/#NHSWaitingTimes/201801.csv","2018-01-31"),
("/MyProjects/#NHSWaitingTimes/201802.csv","2018-02-27"),
("/MyProjects/#NHSWaitingTimes/201803.csv","2018-03-31"),
("/MyProjects/#NHSWaitingTimes/201804.csv","2018-04-30"),
("/MyProjects/#NHSWaitingTimes/201805.csv","2018-05-31"),
("/MyProjects/#NHSWaitingTimes/201806.csv","2018-06-30"),
("/MyProjects/#NHSWaitingTimes/201807.csv","2018-07-31"),
("/MyProjects/#NHSWaitingTimes/201808.csv","2018-08-31"),
("/MyProjects/#NHSWaitingTimes/201809.csv","2018-09-30"),
("/MyProjects/#NHSWaitingTimes/201810.csv","2018-10-31"),
("/MyProjects/#NHSWaitingTimes/201811.csv","2018-11-30"),
("/MyProjects/#NHSWaitingTimes/201812.csv","2018-12-31"),
("/MyProjects/#NHSWaitingTimes/201901.csv","2019-01-31"),
("/MyProjects/#NHSWaitingTimes/201902.csv","2019-02-27"),
("/MyProjects/#NHSWaitingTimes/201903.csv","2019-03-31"),
("/MyProjects/#NHSWaitingTimes/201904.csv","2019-04-30"),
("/MyProjects/#NHSWaitingTimes/201905.csv","2019-05-31"),
("/MyProjects/#NHSWaitingTimes/201906.csv","2019-06-30"),
("/MyProjects/#NHSWaitingTimes/201907.csv","2019-07-31"),
("/MyProjects/#NHSWaitingTimes/201908.csv","2019-08-31"),
("/MyProjects/#NHSWaitingTimes/201909.csv","2019-09-30"),
("/MyProjects/#NHSWaitingTimes/201910.csv","2019-10-31"),
("/MyProjects/#NHSWaitingTimes/201911.csv","2019-11-30"),
("/MyProjects/#NHSWaitingTimes/201912.csv","2019-12-31"),
("/MyProjects/#NHSWaitingTimes/202001.csv","2020-01-31"),
("/MyProjects/#NHSWaitingTimes/202002.csv","2020-02-27"),
("/MyProjects/#NHSWaitingTimes/202003.csv","2020-03-31"),
("/MyProjects/#NHSWaitingTimes/202004.csv","2020-04-30"),
("/MyProjects/#NHSWaitingTimes/202005.csv","2020-05-31"),
("/MyProjects/#NHSWaitingTimes/202006.csv","2020-06-30"),
("/MyProjects/#NHSWaitingTimes/202007.csv","2020-07-31"),
("/MyProjects/#NHSWaitingTimes/202008.csv","2020-08-31"),
("/MyProjects/#NHSWaitingTimes/202009.csv","2020-09-30"),
("/MyProjects/#NHSWaitingTimes/202010.csv","2020-10-31"),
("/MyProjects/#NHSWaitingTimes/202011.csv","2020-11-30"),
("/MyProjects/#NHSWaitingTimes/202012.csv","2020-12-31"),
("/MyProjects/#NHSWaitingTimes/202101.csv","2021-01-31"),
("/MyProjects/#NHSWaitingTimes/202102.csv","2021-02-27"),
("/MyProjects/#NHSWaitingTimes/202103.csv","2021-03-31")
;

Create a table to hold summary

CREATE TABLE waiting(
census,                         -- Year, format varies from file to file!
provider,                       -- [Provider Org Name]
department,                     -- [Treatment Function Code]
dataset,                        -- [RTT Part Description]
months_sum int,                 -- Area under the curve
upto52_count int,               -- [Total] to get average months
unknown_count int,              -- [Patients with unknown clock start date
over52_count int,               -- [Gt 52 Weeks SUM 1]
case_count int,                 -- [Total All]
mean_wait,                       -- Excel formula
UNIQUE(census,provider,department,dataset)
);

My analysis for each file:

CREATE virtual TABLE temp.vsv
USING vsv(filename="/MyProjects/#NHSWaitingTimes/201704.csv", header=yes );

INSERT or IGNORE INTO waiting
SELECT
'2017-04-30' AS census,         -- [census]
[Provider Org Name],            -- [provider]
[Treatment Function Name],      -- [department]
[RTT Part Description],         -- [dataset]
sum(
([Gt 00 to 01 Weeks SUM 1]*1)+
([Gt 01 to 02 Weeks SUM 1]*2)+
([Gt 02 to 03 Weeks SUM 1]*3)+
([Gt 03 to 04 Weeks SUM 1]*4)+
([Gt 04 to 05 Weeks SUM 1]*5)+
([Gt 05 to 06 Weeks SUM 1]*6)+
([Gt 06 to 07 Weeks SUM 1]*7)+
([Gt 07 to 08 Weeks SUM 1]*8)+
([Gt 08 to 09 Weeks SUM 1]*9)+
([Gt 09 to 10 Weeks SUM 1]*10)+
([Gt 10 to 11 Weeks SUM 1]*11)+
([Gt 11 to 12 Weeks SUM 1]*12)+
([Gt 12 to 13 Weeks SUM 1]*13)+
([Gt 13 to 14 Weeks SUM 1]*14)+
([Gt 14 to 15 Weeks SUM 1]*15)+
([Gt 15 to 16 Weeks SUM 1]*16)+
([Gt 16 to 17 Weeks SUM 1]*17)+
([Gt 17 to 18 Weeks SUM 1]*18)+
([Gt 18 to 19 Weeks SUM 1]*19)+
([Gt 19 to 20 Weeks SUM 1]*20)+
([Gt 20 to 21 Weeks SUM 1]*21)+
([Gt 21 to 22 Weeks SUM 1]*22)+
([Gt 22 to 23 Weeks SUM 1]*23)+
([Gt 23 to 24 Weeks SUM 1]*24)+
([Gt 24 to 25 Weeks SUM 1]*25)+
([Gt 25 to 26 Weeks SUM 1]*26)+
([Gt 26 to 27 Weeks SUM 1]*27)+
([Gt 27 to 28 Weeks SUM 1]*28)+
([Gt 28 to 29 Weeks SUM 1]*29)+
([Gt 29 to 30 Weeks SUM 1]*30)+
([Gt 30 to 31 Weeks SUM 1]*31)+
([Gt 31 to 32 Weeks SUM 1]*32)+
([Gt 32 to 33 Weeks SUM 1]*33)+
([Gt 33 to 34 Weeks SUM 1]*34)+
([Gt 34 to 35 Weeks SUM 1]*35)+
([Gt 35 to 36 Weeks SUM 1]*36)+
([Gt 36 to 37 Weeks SUM 1]*37)+
([Gt 37 to 38 Weeks SUM 1]*38)+
([Gt 38 to 39 Weeks SUM 1]*39)+
([Gt 39 to 40 Weeks SUM 1]*40)+
([Gt 40 to 41 Weeks SUM 1]*41)+
([Gt 41 to 42 Weeks SUM 1]*42)+
([Gt 42 to 43 Weeks SUM 1]*43)+
([Gt 43 to 44 Weeks SUM 1]*44)+
([Gt 44 to 45 Weeks SUM 1]*45)+
([Gt 45 to 46 Weeks SUM 1]*46)+
([Gt 46 to 47 Weeks SUM 1]*47)+
([Gt 47 to 48 Weeks SUM 1]*48)+
([Gt 48 to 49 Weeks SUM 1]*49)+
([Gt 49 to 50 Weeks SUM 1]*50)+
([Gt 50 to 51 Weeks SUM 1]*51)+
([Gt 51 to 52 Weeks SUM 1]*52)+
([Gt 52 Weeks SUM 1]*52)    ),  -- [months_sum]
sum([Total]),                   -- [upto52_coubt]
sum([Patients with unknown clock start date]), -- [unknown_count]
sum([Gt 52 Weeks SUM 1]),       -- [over52_count]
sum([Total All]),               -- [all_count]
'=INT(IF(INDIRECT("F" & ROW())>0,INDIRECT("E" & ROW())/INDIRECT("F" & ROW()),INDIRECT("E" & ROW())/INDIRECT("I" & ROW())))' AS mean_wait
FROM vsv
WHERE [Treatment Function Name]<>'Total'
GROUP BY [census],
[Provider Org Name],
[Treatment Function Name],
[RTT Part Description];

DROP TABLE vsv;
-- Loop ends

Export to excel:

.headers on
.excel
SELECT * FROM waiting ORDER BY [provider];

End:

@nalgeon
Copy link
Owner

nalgeon commented Jun 29, 2021

SQLite does not include a scripting language like Oracle or PostgreSQL. I believe you have to use Python or like to automate the task.

@Gavin-Holt
Copy link
Author

Hi,

I am running my sqlite "script" using sqlite3.exe and it's "-init" option and no db name. This loads the script file with dot commands and SQL, runs and then leaves the REPL open for interaction. It is also gives great error messages to help me learn. See Command Line Shell https://sqlite.org/cli.html

I am simply looking for a way to use SQLite to repeat a process using vsv on a list (table) of filenames. I know SQLite does not have stored procedures, but I was hoping there is a native SQL/SQLite syntax to repeat the same analysis using parameters from a table?

The WITH RECURSIVE statement seems to be a likely candidate, but making it do what I want is beyond my ability.

I known this may be stretching the Appropriate Uses, but it's so tempting to try maximising what can be done with just SQLite3.exe with your add-on libraries😊

Kind Regards

@Gavin-Holt
Copy link
Author

Sorry clicked close by mistake!

@nalgeon
Copy link
Owner

nalgeon commented Jun 30, 2021

I know SQLite does not have stored procedures, but I was hoping there is a native SQL/SQLite syntax to repeat the same analysis using parameters from a table?

Not that I know of.

@Gavin-Holt
Copy link
Author

Gavin-Holt commented Jun 30, 2021

Many thanks for your time. I will report back if I can find a solution.

Kind Regards Gavin Holt

@nalgeon
Copy link
Owner

nalgeon commented Jun 30, 2021

Sorry I couldn't help 🤷

@Dialga
Copy link

Dialga commented Jan 4, 2022

There is the csv extension, which allows for the following syntax:

**    CREATE VIRTUAL TABLE temp.csv USING csv(filename=FILENAME);
**    SELECT * FROM csv;
**
** The columns are named "c1", "c2", "c3", ... by default.  Or the
** application can define its own CREATE TABLE statement using the
** schema= parameter, like this:
**
**    CREATE VIRTUAL TABLE temp.csv2 USING csv(
**       filename = "../http.log",
**       schema = "CREATE TABLE x(date,ipaddr,url,referrer,userAgent)"
**    );
**
** Instead of specifying a file, the text of the CSV can be loaded using
** the data= parameter.

https://sqlite.org/src/file?name=ext/misc/csv.c&ci=tip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants