Help: looping through a table of CSV files #18

Gavin-Holt · 2021-06-28T11:54:44Z

Hi,

I am using SQLean in a script to process UK NHS waiting times (I'm a surgeon - not an IT professional). This uses 50 large (90+MB) CSV files and I have been using vsv.dll to load each file, extract data, then drop the file, finally dumping the extract.

The main attraction is using SQLite3.exe to run scripts or as a REPL. This neatly avoids using any other programming language i.e. a single script to maintain/share. (I would be reluctant to resort to batch files or another programming language if I can possibly achieve my goals with just SQLite3.exe).

The whole process of reading the CSV and dumping into Excel is brilliant. I guess much work has gone into the CSV reading library - many thanks.

With cut/paste I have made it work, but there is a great deal of repetition making further development difficult.

Googling I suspect there is a way to loop through a table of file names? This would seem to be a likely usage pattern for vsv.dll and I was hoping you might show me an example to follow?

I don't expect you to write the script for me but I have appended below my work so far.

Any guidance welcome.

Kind Regards Gavin Holt

Raw data from:
https://www.england.nhs.uk/statistics/statistical-work-areas/rtt-waiting-times/
I hand edited these to remove top lines and create consistent filenames

Load vsv

.load vsv

My table of files:

-- Create a table of files to process
CREATE TABLE filelist(
filename,                      -- UNIX path from USB root
census,                         -- Date as 2021-10-31
UNIQUE(filename,census)
);

INSERT INTO filelist (filename,census)
VALUES
("/MyProjects/#NHSWaitingTimes/201704.csv","2017-04-30"),
("/MyProjects/#NHSWaitingTimes/201705.csv","2017-05-31"),
("/MyProjects/#NHSWaitingTimes/201706.csv","2017-06-30"),
("/MyProjects/#NHSWaitingTimes/201707.csv","2017-07-31"),
("/MyProjects/#NHSWaitingTimes/201708.csv","2017-08-31"),
("/MyProjects/#NHSWaitingTimes/201709.csv","2017-09-30"),
("/MyProjects/#NHSWaitingTimes/201710.csv","2017-10-31"),
("/MyProjects/#NHSWaitingTimes/201711.csv","2017-11-30"),
("/MyProjects/#NHSWaitingTimes/201712.csv","2017-12-31"),
("/MyProjects/#NHSWaitingTimes/201801.csv","2018-01-31"),
("/MyProjects/#NHSWaitingTimes/201802.csv","2018-02-27"),
("/MyProjects/#NHSWaitingTimes/201803.csv","2018-03-31"),
("/MyProjects/#NHSWaitingTimes/201804.csv","2018-04-30"),
("/MyProjects/#NHSWaitingTimes/201805.csv","2018-05-31"),
("/MyProjects/#NHSWaitingTimes/201806.csv","2018-06-30"),
("/MyProjects/#NHSWaitingTimes/201807.csv","2018-07-31"),
("/MyProjects/#NHSWaitingTimes/201808.csv","2018-08-31"),
("/MyProjects/#NHSWaitingTimes/201809.csv","2018-09-30"),
("/MyProjects/#NHSWaitingTimes/201810.csv","2018-10-31"),
("/MyProjects/#NHSWaitingTimes/201811.csv","2018-11-30"),
("/MyProjects/#NHSWaitingTimes/201812.csv","2018-12-31"),
("/MyProjects/#NHSWaitingTimes/201901.csv","2019-01-31"),
("/MyProjects/#NHSWaitingTimes/201902.csv","2019-02-27"),
("/MyProjects/#NHSWaitingTimes/201903.csv","2019-03-31"),
("/MyProjects/#NHSWaitingTimes/201904.csv","2019-04-30"),
("/MyProjects/#NHSWaitingTimes/201905.csv","2019-05-31"),
("/MyProjects/#NHSWaitingTimes/201906.csv","2019-06-30"),
("/MyProjects/#NHSWaitingTimes/201907.csv","2019-07-31"),
("/MyProjects/#NHSWaitingTimes/201908.csv","2019-08-31"),
("/MyProjects/#NHSWaitingTimes/201909.csv","2019-09-30"),
("/MyProjects/#NHSWaitingTimes/201910.csv","2019-10-31"),
("/MyProjects/#NHSWaitingTimes/201911.csv","2019-11-30"),
("/MyProjects/#NHSWaitingTimes/201912.csv","2019-12-31"),
("/MyProjects/#NHSWaitingTimes/202001.csv","2020-01-31"),
("/MyProjects/#NHSWaitingTimes/202002.csv","2020-02-27"),
("/MyProjects/#NHSWaitingTimes/202003.csv","2020-03-31"),
("/MyProjects/#NHSWaitingTimes/202004.csv","2020-04-30"),
("/MyProjects/#NHSWaitingTimes/202005.csv","2020-05-31"),
("/MyProjects/#NHSWaitingTimes/202006.csv","2020-06-30"),
("/MyProjects/#NHSWaitingTimes/202007.csv","2020-07-31"),
("/MyProjects/#NHSWaitingTimes/202008.csv","2020-08-31"),
("/MyProjects/#NHSWaitingTimes/202009.csv","2020-09-30"),
("/MyProjects/#NHSWaitingTimes/202010.csv","2020-10-31"),
("/MyProjects/#NHSWaitingTimes/202011.csv","2020-11-30"),
("/MyProjects/#NHSWaitingTimes/202012.csv","2020-12-31"),
("/MyProjects/#NHSWaitingTimes/202101.csv","2021-01-31"),
("/MyProjects/#NHSWaitingTimes/202102.csv","2021-02-27"),
("/MyProjects/#NHSWaitingTimes/202103.csv","2021-03-31")
;

Create a table to hold summary

CREATE TABLE waiting(
census,                         -- Year, format varies from file to file!
provider,                       -- [Provider Org Name]
department,                     -- [Treatment Function Code]
dataset,                        -- [RTT Part Description]
months_sum int,                 -- Area under the curve
upto52_count int,               -- [Total] to get average months
unknown_count int,              -- [Patients with unknown clock start date
over52_count int,               -- [Gt 52 Weeks SUM 1]
case_count int,                 -- [Total All]
mean_wait,                       -- Excel formula
UNIQUE(census,provider,department,dataset)
);

My analysis for each file:

CREATE virtual TABLE temp.vsv
USING vsv(filename="/MyProjects/#NHSWaitingTimes/201704.csv", header=yes );

INSERT or IGNORE INTO waiting
SELECT
'2017-04-30' AS census,         -- [census]
[Provider Org Name],            -- [provider]
[Treatment Function Name],      -- [department]
[RTT Part Description],         -- [dataset]
sum(
([Gt 00 to 01 Weeks SUM 1]*1)+
([Gt 01 to 02 Weeks SUM 1]*2)+
([Gt 02 to 03 Weeks SUM 1]*3)+
([Gt 03 to 04 Weeks SUM 1]*4)+
([Gt 04 to 05 Weeks SUM 1]*5)+
([Gt 05 to 06 Weeks SUM 1]*6)+
([Gt 06 to 07 Weeks SUM 1]*7)+
([Gt 07 to 08 Weeks SUM 1]*8)+
([Gt 08 to 09 Weeks SUM 1]*9)+
([Gt 09 to 10 Weeks SUM 1]*10)+
([Gt 10 to 11 Weeks SUM 1]*11)+
([Gt 11 to 12 Weeks SUM 1]*12)+
([Gt 12 to 13 Weeks SUM 1]*13)+
([Gt 13 to 14 Weeks SUM 1]*14)+
([Gt 14 to 15 Weeks SUM 1]*15)+
([Gt 15 to 16 Weeks SUM 1]*16)+
([Gt 16 to 17 Weeks SUM 1]*17)+
([Gt 17 to 18 Weeks SUM 1]*18)+
([Gt 18 to 19 Weeks SUM 1]*19)+
([Gt 19 to 20 Weeks SUM 1]*20)+
([Gt 20 to 21 Weeks SUM 1]*21)+
([Gt 21 to 22 Weeks SUM 1]*22)+
([Gt 22 to 23 Weeks SUM 1]*23)+
([Gt 23 to 24 Weeks SUM 1]*24)+
([Gt 24 to 25 Weeks SUM 1]*25)+
([Gt 25 to 26 Weeks SUM 1]*26)+
([Gt 26 to 27 Weeks SUM 1]*27)+
([Gt 27 to 28 Weeks SUM 1]*28)+
([Gt 28 to 29 Weeks SUM 1]*29)+
([Gt 29 to 30 Weeks SUM 1]*30)+
([Gt 30 to 31 Weeks SUM 1]*31)+
([Gt 31 to 32 Weeks SUM 1]*32)+
([Gt 32 to 33 Weeks SUM 1]*33)+
([Gt 33 to 34 Weeks SUM 1]*34)+
([Gt 34 to 35 Weeks SUM 1]*35)+
([Gt 35 to 36 Weeks SUM 1]*36)+
([Gt 36 to 37 Weeks SUM 1]*37)+
([Gt 37 to 38 Weeks SUM 1]*38)+
([Gt 38 to 39 Weeks SUM 1]*39)+
([Gt 39 to 40 Weeks SUM 1]*40)+
([Gt 40 to 41 Weeks SUM 1]*41)+
([Gt 41 to 42 Weeks SUM 1]*42)+
([Gt 42 to 43 Weeks SUM 1]*43)+
([Gt 43 to 44 Weeks SUM 1]*44)+
([Gt 44 to 45 Weeks SUM 1]*45)+
([Gt 45 to 46 Weeks SUM 1]*46)+
([Gt 46 to 47 Weeks SUM 1]*47)+
([Gt 47 to 48 Weeks SUM 1]*48)+
([Gt 48 to 49 Weeks SUM 1]*49)+
([Gt 49 to 50 Weeks SUM 1]*50)+
([Gt 50 to 51 Weeks SUM 1]*51)+
([Gt 51 to 52 Weeks SUM 1]*52)+
([Gt 52 Weeks SUM 1]*52)    ),  -- [months_sum]
sum([Total]),                   -- [upto52_coubt]
sum([Patients with unknown clock start date]), -- [unknown_count]
sum([Gt 52 Weeks SUM 1]),       -- [over52_count]
sum([Total All]),               -- [all_count]
'=INT(IF(INDIRECT("F" & ROW())>0,INDIRECT("E" & ROW())/INDIRECT("F" & ROW()),INDIRECT("E" & ROW())/INDIRECT("I" & ROW())))' AS mean_wait
FROM vsv
WHERE [Treatment Function Name]<>'Total'
GROUP BY [census],
[Provider Org Name],
[Treatment Function Name],
[RTT Part Description];

DROP TABLE vsv;
-- Loop ends

Export to excel:

.headers on
.excel
SELECT * FROM waiting ORDER BY [provider];

End:

The text was updated successfully, but these errors were encountered:

nalgeon · 2021-06-29T17:00:47Z

SQLite does not include a scripting language like Oracle or PostgreSQL. I believe you have to use Python or like to automate the task.

Gavin-Holt · 2021-06-29T18:55:38Z

Hi,

I am running my sqlite "script" using sqlite3.exe and it's "-init" option and no db name. This loads the script file with dot commands and SQL, runs and then leaves the REPL open for interaction. It is also gives great error messages to help me learn. See Command Line Shell https://sqlite.org/cli.html

I am simply looking for a way to use SQLite to repeat a process using vsv on a list (table) of filenames. I know SQLite does not have stored procedures, but I was hoping there is a native SQL/SQLite syntax to repeat the same analysis using parameters from a table?

The WITH RECURSIVE statement seems to be a likely candidate, but making it do what I want is beyond my ability.

I known this may be stretching the Appropriate Uses, but it's so tempting to try maximising what can be done with just SQLite3.exe with your add-on libraries😊

Kind Regards

Gavin-Holt · 2021-06-29T20:13:16Z

Sorry clicked close by mistake!

nalgeon · 2021-06-30T10:03:19Z

I know SQLite does not have stored procedures, but I was hoping there is a native SQL/SQLite syntax to repeat the same analysis using parameters from a table?

Not that I know of.

Gavin-Holt · 2021-06-30T17:41:31Z

Many thanks for your time. I will report back if I can find a solution.

Kind Regards Gavin Holt

nalgeon · 2021-06-30T20:09:23Z

Sorry I couldn't help 🤷

Dialga · 2022-01-04T10:59:20Z

There is the csv extension, which allows for the following syntax:

**    CREATE VIRTUAL TABLE temp.csv USING csv(filename=FILENAME);
**    SELECT * FROM csv;
**
** The columns are named "c1", "c2", "c3", ... by default.  Or the
** application can define its own CREATE TABLE statement using the
** schema= parameter, like this:
**
**    CREATE VIRTUAL TABLE temp.csv2 USING csv(
**       filename = "../http.log",
**       schema = "CREATE TABLE x(date,ipaddr,url,referrer,userAgent)"
**    );
**
** Instead of specifying a file, the text of the CSV can be loaded using
** the data= parameter.

https://sqlite.org/src/file?name=ext/misc/csv.c&ci=tip

Gavin-Holt closed this as completed Jun 29, 2021

Gavin-Holt reopened this Jun 29, 2021

Gavin-Holt closed this as completed Jun 30, 2021

LifeCANvs mentioned this issue Jun 10, 2024

upstream issues LifeCANvs/LCvs-Lib-SQLite-ext#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help: looping through a table of CSV files #18

Help: looping through a table of CSV files #18

Gavin-Holt commented Jun 28, 2021 •

edited

Loading

nalgeon commented Jun 29, 2021

Gavin-Holt commented Jun 29, 2021

Gavin-Holt commented Jun 29, 2021

nalgeon commented Jun 30, 2021

Gavin-Holt commented Jun 30, 2021 •

edited

Loading

nalgeon commented Jun 30, 2021

Dialga commented Jan 4, 2022

Help: looping through a table of CSV files #18

Help: looping through a table of CSV files #18

Comments

Gavin-Holt commented Jun 28, 2021 • edited Loading

nalgeon commented Jun 29, 2021

Gavin-Holt commented Jun 29, 2021

Gavin-Holt commented Jun 29, 2021

nalgeon commented Jun 30, 2021

Gavin-Holt commented Jun 30, 2021 • edited Loading

nalgeon commented Jun 30, 2021

Dialga commented Jan 4, 2022

Gavin-Holt commented Jun 28, 2021 •

edited

Loading

Gavin-Holt commented Jun 30, 2021 •

edited

Loading