-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathlci.txt
375 lines (237 loc) · 22.3 KB
/
lci.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
==========
Mon Sep 03 16:39:24 -0700 2018
4 months on...
I don't think we ever solved the sign reversal issue-- that is something to revisit as / after we complete the context refactor.
For now- we need to get some LCI results so that we can investigate variability in ecoinvent.
Problem is, the process for doing that is a little convoluted-- so let's recap it here.
1. The components
* EcospoldV2Archive - source data for inventory interface
* BackgroundEngine - performs the ordering with a captive tstack
* FlatBackground - stores the results of the ordering as a set of numpy objects
* TarjanBackground - wraps the flat background inside of an LcArchive object
* TarjanBackgroundImplementation - implements background interface
(simply uses archive as a host to hold _flat; all the computations use _flat)
So the key question here is whether we really need a TarjanBackground, or whether instead we can get by with a more strict use of the TarjanBackgroundImplementation. Right now it looks as though the only reason we need to subclass LcArchive is so that we can install TarjanBackgroundImplementation as the output of make_interface('background')
If we abandon that and make the interface ourselves manually, what would that look like?
Wed 2018-09-05 23:04:55 -0700
Interruption: performance failure
Cause: using an RxRef to query value...
leads to an interface exchange_values() call
which leads to _gen_exchanges, which is a terrible way to repeatedly find a reference exchange
the exchange data could maybe be a lot better as a single, massive sql table
with select queries.
hm.
In this case we can deal with it by changing how _reference_ exchange values are retrieved-- so reimplementing process_ref.reference_value() --> which means changing the interface--
-- really just adding the reference=False kwarg to the signature
-- and leaving _gen_exchanges and the sql backend to the scaling implementation.
Thu 2018-09-06 01:13:14 -0700
Anyway, that's sorted and we're back to <30s, profiled, for 3.4 APOS (which has just short of 15,000 nodes).
==========
Wed May 02 11:53:08 -0700 2018
On point now: the tarjan algorithm does NOT need to run every time. Instead I should serialize the Tarjan Stack and make that the only thing the bg needs.
So-- what does the Background manager really need?
Implementation methods:
(general)
foreground_flows
background_flows
exterior_flows
(product-flow specific)
_ensure_ref_flow: needs a local list of flow entities
foreground(process, flow)
dependencies(process, flow)
emissions(process, flow)
is_in_background(process, flow)
lci(process, flow)
ad_tilde(process, flow)
bf_tilde(process, flow)
BackgroundManager methods:
tstack: background is not None
Wed 2018-05-02 12:24:14 -0700
Well-- a lot of things can be moved, but a lot of things still live in the background engine.
There are two issues here: one is that the tarjan stack + background engine were designed to be built incrementally- which is nice and sweet and cool- even if it is a bit false, because it doesn't really support changes to the data, only incremental additions of the data, so it's not really required- but more importantly it introduces complexity.
Second: there is POOR separation of concerns between tstack and be. They are tightly symbiotic, which is nice in a biological setting but really suggests the programmer was snoozing through this (which is not actually true, the programmer was implementing a highly complex algorithm, and successfully, but couldn't solve this many problems at once).
So the solution in both cases is to leave the machinery as it is, and change the background manager to have a clear INTERFACIAL dependence on tstack. Then re-implement tstack to a serializeable form, and then serialize it inside an LcArchive subclass.
Wed 2018-05-02 14:02:27 -0700
Ultimately what has to happen here is the creation of an Scc object that encapsulates the matrix construction and inversion. So, most of be._update_component_graph moves into that; make_foreground needs to get generalized to deal with multiple Sccs, and moved into tstack; lci computation needs to be generalized (to multiple Sccs) and moved to tstack, and background engine runs and terminates on __init__, adding all reference products and returning a [static, fully-populated] t-stack object.
==========
Thu May 10 14:27:51 -0700 2018
Now we are testing the background computation from a natural inventory source via the catalog for apparently the first time, and it doesn't work. why? because it uses the inventory archive instead of the static archive, for complicated reasons.
The relationship between the background engine and the catalog is complex. The background implementation instantiates the background manager with a query interface in setup_bm(), which in the archive case is called when the _bg property is first invoked, and in the catalog case is called when the implementation is first created, before it is returned to the user.
This already feels like a big hack.
Anyhow, the catalog query seeds the background engine with itself, thus providing a mechanism for the background engine to answer any query its heart desires. This is great in theory.
In practice, there are two ways the background engine is expected to work:
1- from a static, comprehensive archive that has been loaded into memory
2- from a dynamic query
We definitely want to support (2) though ensure that it operates efficiently. However, our performance benchmark is based on (1), which is also the way we expect the system to work in this interim period.
What does the background engine require of the query interface?
- count('process') -- to set recursion limit
- get(termination.external_ref) -- to provide a working reference to the terminating process
- get(exch.termination) -- ditto
- terminate(flow, direction) -- to find terminations when one is not specified explicitly
- get(term.external_ref) -- to finish the job
- processes() -- to add all ref products
In the current failure, it is the get() call that is breaking things, and that is because the ecospold2 implementation calls list_datasets() on the zip archive on every invocation because (as of ca393ba) we no longer want ecospold to permit partial loading of any process.
That is fair, really, but the list_datasets() approach is terrible.
on 3.2 apos, list_datasets() takes 100 ms to run, every time, even when a search term is provided.
Solution: list_datasets() once on __init__, and make a mapping of process uuid to flow uuids.
Thu 2018-05-10 15:51:12 -0700
OK. Having fixed that, the ecoinvent background now runs in 1m56s, which is 14s longer than it took to read in all spold files in generating the static background.
OK! so we are back to benchmark performance.
Except now there is the second matter, which is: in the presence of a static background archive, should we (shouldn't we?) use it for the background computation?
Indeed, the whole point of the 'background' interface designation was to specify that the static data are available (and allocated)
also, there is complexity because: we are planning on phasing out the current 'background' engine in favor of a binary numpy structure that will preserve the ordering and presumably load in much faster.
so really, the static archive should NOT have the background label- instead the background label should come from the outcome of the background engine computation, and be saved as a distinct resource.
the cached static archive is really just an index+inventory interface that happens to be static.
it should also be datestamped, just like the cached index already is.
In fact, there is no reason to have two different methods index_resource() and create_static_archive()
the only question is whether both should be saved after load_all(), or only one. As we know, an unrestricted inventory and index interface are all that's required to operate the background engine. so it should be run on demand when a background resource is requested.
In that case, the question is back to how best to resolve a query. In LcCatalog.get_interface(), we sort by whether a resource is loaded, but we should sort by whether it is both loaded AND static. That should take care of the problem.
An interesting entailment of this is that 'BackgroundRequired' goes away-- in the event that a background is required, we should just attempt to create one and the error will upgrade to 'InventoryRequired' instead.
Except that could result in large charges. So the catalog should allow the user to explicitly create a background-- create_static_archive() should be folded into index_resource() and replaced with create_background().
But then create_background() needs to properly select the data source to use-- that's where the sort by loaded and static comes in. If we have a static loaded resource, use it.
Thu 2018-05-10 16:13:43 -0700
OK. So, simply adding that into get_resource() solved the problem but now the runtime is 28 s!
time to profile.
51.791 s
51.410 add_ref_product
44.462 _add_ref_product
6.948 _update_component_graph
564,000 calls to CatalogQuery._iface() taking 13.4 s -- that probably does not help.
261,458 calls to CatalogQuery.get() taking 21.8 s--
" calls to AbstractQuery.make_ref() 4.568 s
" calls to BasicImpl.get() 3.331 s
" calls to BasicImpl._fetch() 3.134 s
" calls to CatalogQuery._grounded_qu 2.571 s
" calls to LcCatalog.query() 2.381 s
" calls to CatalogQuery.__init__ 0.155 s
273,925/
" calls to LcEntity.make_ref() 1.380 s
Fri 2018-05-11 11:02:27 -0700
Big change in having CatalogQuery cache the refs it creates. No need to redundantly `get` the same processes 2 million times.
To really make this sweet, the catalog itself should cache queries and return them instead of making new ones. then if the query objects retain the references, and the references themselves refer circularly to the queries, we will be on our way to creating a session-persistent dynamic mirror of content.
Smaller but nontrivial change in using `exch.is_reference` instead of `exch in process.reference_entity`
Not many other changes to make...
26.9 sec
26.5 s add_ref_product
20.5 s _add_ref_product
5.976 s _update_component_graph
of the 20.5:
2.872 s _traverse_term_exchanges foreground
5.256 s exchanges.__getitem__
3.989 s background.terminate
1.721 s background.add_cutoff
1.729 s inventory.inventory
0.845 s background.add_interior
0.681 s exchanges.__eq__
-----
17.093 s
Still seems like there's a chunk missing
There's at least 1s remaining of catalog.gen_interfaces because those are still called for every inventory() call.
It seems as though there is some refactoring to do on the CatalogQuery + LcResolver workflow.
==========
Fri May 11 17:43:26 -0700 2018
Here's where we're at with flat background:
* Complications regarding background as resource
The current approach is a complete kludge (obviously). Here's how we fix it:
1. The only background implementation available to the native LcArchive IS the proxy background. If a catalog maintainer creates a resource with a conventional archive source and 'background' interface specified, then the maintainer is ASSERTING that the database is a proxy background db, as in a GaBi bundle.
1a. The BackgroundImplementation class then becomes a proxy bg implementation. It should masquerade processes as background flows, provide no foreground flows, and masquerade emissions and all encountered cutoffs as exterior flows.
2. The scipy-based background is invoked automatically any time a background interface is requested and _iface fails with BackgroundRequired. The query asks the catalog to create the background.
2a. The catalog creates a new archive, supplying a bg store file as source and the index interface as a positional argument, and the archive goes ahead and creates the new FlatBackground on __init__. This will fail with the import error if scipy is not available, and with the index or inventory required errors if appropriate.
2b Assuming the __init__ doesn't fail, then by the time the LcBackground init returns, the background will be saved to the source. so then the catalog passes it to self.add_existing_archive()
3. The LcBackground class itself will need to load the FlatBackground from the source if the file exists, or create it from the index interface if it doesn't. and save it to source.
3a. the LcBackground class will need to override make_interface() to supply a rich background implementation, which relies on the FlatBackground being available.
3b. The LcBackground class will also need to populate the list of processes and flows AS catalog refs obtained from the index interface, which is still supplied as a positional parameter. There are definitely some details to be worked out here.
3c. so the LcBackground will present as an LcArchive because it will BE an LcArchive.
3d. The current BackgroundManager and BackgroundProxy will go away.
==========
Sat Jun 09 17:16:09 -0700 2018
Debugging the flat background implementation with ecoinvent and we're running into all kinds of sloppy problems:
1- goddamn no reference found for nitrogen venting, softwood, etc-- leading to premature cutoffs which may or may not be problems
a. why is this happening? need to validate behavior using a direct ecospold2 archive
b. does it even matter? this is at the bottom of the stack, only to be hit when/if the lci testing fails
2- on the first time through, when the flatbackground is created, encountered a key error in the lci_test method when we lookup z = lci_check[i.key]
3- on subsequent time through, we fail because there is no true index and self[x.flow] is coming up empty
- solution: properly index the archive before doing background
For all of these issues we need an interactive session. and more patience than I have right now.
Wed 2018-06-20 13:22:25 -0700
First time looking at this since 6/09...
"properly index the archive"...
_background_for_origin takes only an origin as input-- creates a 'background' resource that is self-instantiating
then we check() the resource--> which calls _instantiate() which calls create_archive() which calls archive_factory() which ultimately imports and calls antelope_background.background.init_fcn(source, ref=ref, save_after=True, filetype='.mat')
Need to do this interactively..
Wed 2018-06-20 23:16:30 -0700
need to come up with a way to remove resources-- and purge sources if they are internal
recurrent bug today was because of a stale resource entry (after I had manually purged sources)
specify by source-- then it finds the resource with source, and deletes it-- which needs to be written
- then, if the resource that was deleted was internal, AND if resources with source is now empty, then delete the source
that is easy enough
but not easy enough to do right now
Thu 2018-06-21 10:50:31 -0700
Wrong. The argument of delete resource should be a resource- it shouldn't be a query.
Thu 2018-06-21 16:22:57 -0700
ugh. Inversion errors. I cannot even tell which of the two senses is correct.
Summary of the problem:
* Comparing computed LCI result (challenge) against ecoinvent-distributed LCI result (lci_result)
= lci_result is delivered via the inventory interface
= challenge is delivered via the background interface (computed via the flat background)
- test case is APOS 18085d22-72d0-4588-9c69-7dbeb24f8e2f, "treatment of poultry manure, drying, pelleting"
- rx is 4aeb622f-8675-4449-889b-bca8bbb3d44a "Poultry manure, fresh" [Output]
- rx value is -1.0
= lci_result.inventory(rx) delivers a list of positive-valued exchanges
= challenge.lci(ref_flow=rx.flow.external_ref) delivers a list of negative-valued exchanges
? but the problem is that the lci doesn't seem to know the normative sign of the reference node-- the tarjan algorithm knows but the flat bg doesn't
= it does have an index interface, so it can ask and normalize- but is that even correct?
tbd.
Fri 2018-06-22 09:59:09 -0700
Correction- it's an inventory question, the magnitude of the reference flow- but the bg obviously needs complete index AND inventory information in order to build the matrix. I think the correct thing to do is to normalize- possibly just LCI, possibly every query.
But hold on. The A matrix is formed by processing incoming MatrixEntries, which already involves normalizing by the parent's inbound_ev.
What this means is, each column is properly signed with respect to other processes' invocations of it. If it's a waste column, then ... well, let's look and see.
What processes use the poultry manure process? how do I even detect?
- challenge.is_in_background(): False
- so I need to identify entries in Af whose row is equal to challenge's column
= and then look at the exchange values in the inventory and in the Af matrix and see if they are the same (1) magnitude and (2) sign
- obtain access to the FlatBackground, which is owned by the TarjanBackground, which is the archive attached to the background resource
-> cat.get_archive(challenge.origin, 'background') fails because the resource is internal and FOR SOME REASON we filter those out.
--> so we probably need a way around that; but for now, cat._resolver.resolve(challenge.origin, 'background') works.
* OK, so (1), nothing depends on the poultry manure process.
(2), the ad and bf methods return the SAME (i.e. non-normalized) sign, which I suppose is surprising
-- INTERESTING. The inbound EV is not queried! it's only assumed from the DIRECTION of the rx. This seems BAD. Do we query it anywhere? this is going to bollocks up the USLCI.
-- moreover, if we FIX this it's not clear that ecoinvent will still be computed properly-- given that it's working correctly NOW.
-- we need to find a transformation process that invokes a treatment process as a dependency
Fri 2018-06-22 10:58:33 -0700
OK, we've traced one back.
= ei32a process ref 64e3063e-66bf-444d-ad3f-760f5c4bdbac, coal-fired electricity [TZ], generates coal ash as a waste.
= That waste exchange is a NEGATIVELY-VALUED Input, meaning it will invert the signs of the dependency.
= The dependency is market for coal ash treatment (bg index: 497) [which is TRANSPARENTLY incorrect as TZ is not exporting its coal ash]
= The A-matrix column 497 is negatively-sensed, meaning it is CORRECTLY encoded to be INVERTED at computation time
- 2 transport processes have negative exchange values
- 26 supply processes have positive exchange values
==> So that's the first positive conclusive finding: the inventory IS getting normalized by the reference WITHIN the A matrix
-- explanation is that we're using exch[rx] to get the exchange value- normalization happens there
= Anyway, continuing: the negatively-valued waste input INVERTS the market process, so we end up with positively valued transport and negatively-valued supplier processes
= The supplier process (column 3337) is also negatively-sensed, meaning both its A-matrix exchange values are negative (landfill; process-specific burdens)
- both of those processes are non-treatment (i.e. positively sensed) so all is normal after that
= The emissions are also uniformly negative, so they also become positive in the multiplication.
What does all this mean? It means:
(a) the matrix multiplication routine is CORRECT and NEEDS NO CHANGES -- this is confirmed by the LCI validation being NUMERICALLY correct.
(b) HOWEVER, when the flat matrix data are exposed to the outside world via the background interface, the negatively-sensed columns need to be adjusted.
What do we EXPECT to happen? Let's think of this from the perspective of the user.
We want to know about one process; let's use the poultry manure treatment process since it's already on hand.
This describes the inventory for TREATING 1 kg of poultry manure. The nominal direction of this exchange is INPUT. (That's NOT what ecoinvent reports)
-- indeed, when we look at the reference exchange for poultry manure it shows 'Output: (*) [--- [kg]] ...'
-- this is ALREADY wrong.
-- however, the issue is that catalog access is required in order to ensure that the exchange value field can be determined (because processes from non-value archives may be used to originate RxRefs)
And RxRefs wrongly store native processes and not process refs. The reason for this is that the RxRef is an input argument to the ProcessRef definition.
This being python, we can cheat- and create a method to post-populate the RxRef with a pointer to the ProcessRef as it's being instantiated. THEN on string output we can use the reference_value() function to correctly populate the exchange value (if inventory interface is available)
OK. Let's do that real quick.
Anyway, now (after this fix) when we take an RxRef for an ecoinvent treatment process and print it, we'll get: 'Output: (*) [-1.0 [kg]] ...'
When, however, we run an inventory(rx) we'll get normalized outputs, which would be the "avoided" inputs of an actual Output of 1.0 kg.
Arguably, what we want would be the non-inverted (though still normalized) inventory for an Input of 1.0 kg.
Except that doesn't help us solve our current problem, which is the behavior of the background interface.
The real question is whether we keep background as it is and make a manual correction of the lci_result data, or whether we make a correction of the background data
The answer depends on how we plan to use the background data.
Nominally, we want to use them for foreground building. That means, we have an output of poultry manure, and we want to terminate it-- and immediately we fail because no ecoinvent process will terminate an outflow of poultry manure, whereas three processes terminate an input of poultry manure.
The termination is handled by the index interface- but the naive ecospold2 index simply assumes 'output' since that is the normative mode.
Anyway, ignoring that and saying we terminate with an undoctored ecoinvent treatment process, the term and fragment will have noncomplementary directions- which negates the exchange- but the inbound ev will be -1 which negates it again. So the termination activity level will be positive.
Then , when we build child flows, this shit hits the fan because we use term.inventory(term.term_flow), which will normalize all the flows to the reference exchange-- that is, it will generate exchange values consistent with an OUTPUT of 1kg (normalized to the rx).
Both inventory(rx) and ad(rx.flow.external_ref) do this-- except ad flips the signs and directions of the exchanges to be positively-valued.