Service hangs when returning full year of data for many points #732

Open
dhblum opened this issue Jan 30, 2025 · 9 comments

dhblum (Collaborator) commented Jan 30, 2025

This issue is observed when running a simulation for a full year and then requesting a full year's worth of data for many points at once using the /results API. The service hangs and never returns a response to the client. To reproduce, select a test case, run a full-year simulation, then use:

import requests

# url, testid, starting_time, and step are assumed to be defined from the
# earlier test-case selection and simulation loop, together spanning the full year.
measurements = requests.get('{0}/measurements/{1}'.format(url, testid)).json()['payload']
inputs = requests.get('{0}/inputs/{1}'.format(url, testid)).json()['payload']
points_list = list(inputs.keys()) + list(measurements.keys())
res = requests.put('{0}/results/{1}'.format(url, testid), json={'point_names': points_list,
                                                                'start_time': starting_time,
                                                                'final_time': step}).json()['payload']

For the test case bestest_air, the following Service log is observed; the last line below is where the Service seems to hang.

worker-1     | 01/30/2025 03:19:27 PM UTC       worker              INFO        Request ID:   '95633108-c432-4fef-84bf-cb9889cd4b03', with method 'get_results', was received
worker-1     | 01/30/2025 03:19:32 PM UTC       root                INFO        Queried results data successfully for point names ['con_oveTSetCoo_activate', 'con_oveTSetCoo_u', 'con_oveTSetHea_activate', 'con_oveTSetHea_u', 'fcu_oveFan_activate', 'fcu_oveFan_u', 'fcu_oveTSup_activate', 'fcu_oveTSup_u', 'fcu_reaFloSup_y', 'fcu_reaPCoo_y', 'fcu_reaPFan_y', 'fcu_reaPHea_y', 'zon_reaCO2RooAir_y', 'zon_reaPLig_y', 'zon_reaPPlu_y', 'zon_reaTRooAir_y', 'zon_weaSta_reaWeaCeiHei_y', 'zon_weaSta_reaWeaCloTim_y', 'zon_weaSta_reaWeaHDifHor_y', 'zon_weaSta_reaWeaHDirNor_y', 'zon_weaSta_reaWeaHGloHor_y', 'zon_weaSta_reaWeaHHorIR_y', 'zon_weaSta_reaWeaLat_y', 'zon_weaSta_reaWeaLon_y', 'zon_weaSta_reaWeaNOpa_y', 'zon_weaSta_reaWeaNTot_y', 'zon_weaSta_reaWeaPAtm_y', 'zon_weaSta_reaWeaRelHum_y', 'zon_weaSta_reaWeaSolAlt_y', 'zon_weaSta_reaWeaSolDec_y', 'zon_weaSta_reaWeaSolHouAng_y', 'zon_weaSta_reaWeaSolTim_y', 'zon_weaSta_reaWeaSolZen_y', 'zon_weaSta_reaWeaTBlaSky_y', 'zon_weaSta_reaWeaTDewPoi_y', 'zon_weaSta_reaWeaTDryBul_y', 'zon_weaSta_reaWeaTWetBul_y', 'zon_weaSta_reaWeaWinDir_y', 'zon_weaSta_reaWeaWinSpe_y'].
worker-1     | 01/30/2025 03:19:35 PM UTC       worker              INFO        Response for, '95633108-c432-4fef-84bf-cb9889cd4b03', was sent

It seems to hang on the reply from the worker back to the web service.

Note that requesting only one or two data point names at a time works fine for me. Thus, I wonder if this is a throughput issue on the response from the worker back to the web service. This needs further investigation.
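
One rough way to test the throughput hypothesis is to request a single point and extrapolate the payload size. This is only a sketch, reusing the url, testid, starting_time, step, and points_list variables assumed in the snippet above:

import json
import requests

# Request a single point over the same time window and measure the serialized
# payload size; multiplying by the number of points gives a rough estimate of
# the full response size.
single = requests.put('{0}/results/{1}'.format(url, testid),
                      json={'point_names': [points_list[0]],
                            'start_time': starting_time,
                            'final_time': step}).json()['payload']
size_mb = len(json.dumps(single)) / 1e6
print('~{0:.2f} MB for one point, ~{1:.0f} MB estimated for all {2} points'.format(
    size_mb, size_mb * len(points_list), len(points_list)))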

I would appreciate insights from @kbenne on this.

Also FYI @icupeiro and @EttoreZ.

@dhblum dhblum added the service label Jan 30, 2025
@kbenne kbenne self-assigned this Jan 31, 2025
kbenne (Contributor) commented Jan 31, 2025

I was able to reproduce this issue with the script below. I should be able to get to the bottom of it soon.

import requests
import time

def run_boptest_simulation():
    # Define the base URL for the BOPTEST server
    base_url = "http://localhost"

    # Initialize the test case
    testcase = "bestest_air"

    testid = requests.post("{0}/testcases/{1}/select".format(base_url, testcase)).json()["testid"]

    init_url = f"{base_url}/initialize/{testid}"
    init_params = {
        "start_time": 0,
        "warmup_period": 0,
        "end_time": 365 * 24 * 3600  # One year in seconds
    }
    response = requests.put(init_url, json=init_params)
    if response.status_code != 200:
        raise Exception(f"Failed to initialize test case: {response.text}")

    # Run the simulation
    advance_url = f"{base_url}/advance/{testid}"
    step_size = 3600  # One hour in seconds
    current_time = 0
    end_time = 365 * 24 * 3600

    while current_time < end_time:
        response = requests.post(advance_url)
        if response.status_code != 200:
            raise Exception(f"Failed to advance simulation: {response.text}")
        current_time += step_size
        print(f"Advanced to {current_time / 3600} hours")

    # Retrieve results

    measurements = requests.get('{0}/measurements/{1}'.format(base_url, testid)).json()['payload']
    inputs = requests.get('{0}/inputs/{1}'.format(base_url, testid)).json()['payload']
    points_list = list(inputs.keys()) + list(measurements.keys())
    res = requests.put('{0}/results/{1}'.format(base_url,testid),json={'point_names':points_list,
                                                                  'start_time':0.0,
                                                                  'final_time':current_time}).json()['payload']

    res = requests.put("{0}/stop/{1}".format(base_url,testid))
    if res.status_code == 200:
        print('Done shutting down test case.')
    else:
        print('Error shutting down test case.')

if __name__ == "__main__":
    run_boptest_simulation()

kbenne (Contributor) commented Jan 31, 2025

Hey guys,

So here's the deal. The results payload for the example I pasted above is 360 MB, over a third of a gigabyte. Redis has a soft limit of 8 MB for messages, which could perhaps be increased to 32 MB, but that is not advisable.
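
For context, that limit comes from Redis's client output buffer setting for pub/sub clients (by default a 32 MB hard limit and an 8 MB soft limit over 60 seconds). A quick redis-py check of what the deployed instance is actually configured with, assuming Redis is reachable on localhost:6379, might look like:

import redis  # assumes the redis-py client and a reachable Redis instance

r = redis.Redis(host='localhost', port=6379)

# CONFIG GET returns the full client-output-buffer-limit string, e.g.
# '... pubsub 33554432 8388608 60' (32 MB hard, 8 MB soft over 60 s).
# If a pub/sub subscriber's pending buffer exceeds these limits, Redis
# disconnects the client rather than queueing more data.
limits = r.config_get('client-output-buffer-limit')
print(limits['client-output-buffer-limit'])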

I'm not actually sure what Redis is doing when it receives this massive payload, as I don't see it passing through when I watch the Redis stream. I see the request for results come in, and then no record of the response. The worker thinks it fired the response down the pipe, but I believe Redis just outright rejects it, although I'm not sure where that would be logged. (I can elaborate on how to monitor the traffic through Redis; a rough sketch is below.) I did finally get a timeout from the client after 20 minutes, which is the configured timeout when running BOPTEST locally.
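
For anyone who wants to watch the traffic now, a minimal sketch using redis-py's MONITOR support (Redis again assumed on localhost:6379); note that MONITOR echoes every command the server processes and adds real overhead, so it is for debugging only:

import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis(host='localhost', port=6379)

# MONITOR streams every command the server handles, including the PUBLISH
# calls carrying worker responses; truncate the output since payloads are huge.
with r.monitor() as m:
    for command in m.listen():
        print(command['command'][:200])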

I think, though, that the issue is about more than how much we can cram through Redis. Sending this much data in an HTTP response is also ill-advised without some care about which headers we send back and how the client deals with them. If the client just does a requests.get(...) and the server naively sends back Content-Type: application/json with no other headers, then both client and server will try to hold the entire message in memory. It would be better to use headers to signal to the client that the response is large, and perhaps force it to stream the response to a file instead of holding it in memory.

On the server side, it might be something like this....

HTTP/1.1 200 OK
Content-Type: application/json
Content-Disposition: attachment; filename="results.tar"
Content-Length: <really big number>
Connection: keep-alive
...

Then the client would need to stream the results to a file, by doing something like this...

import os
import requests

url = "https://boptest.net/<testid>/results"
file_path = "results.tar"

# Check how many bytes we have already downloaded
resume_header = {}
if os.path.exists(file_path):
    file_size = os.path.getsize(file_path)
    resume_header = {"Range": f"bytes={file_size}-"}

# Make the request with a Range header
with requests.get(url, headers=resume_header, stream=True) as response:
    response.raise_for_status()
    with open(file_path, "ab") as f:  # "ab" to append if resuming
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)

I don't know if this is a burden we want to put on clients though.

Setting aside, for the moment, how we serve large responses back to clients over HTTP, and returning to the Redis issue: I think this is easily enough solved, since we have object storage readily available in the form of MinIO / S3. The worker can simply put the results payload in object storage, send a message to the web server (over Redis) saying the results are ready, and the web server can serve them up however we decide to handle that.
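
A minimal sketch of that flow, assuming boto3 pointed at the MinIO endpoint and redis-py; the bucket name, channel name, and helper function are hypothetical, not existing BOPTEST code:

import json
import uuid

import boto3   # assumes MinIO exposed via its S3-compatible API
import redis   # assumes redis-py

def publish_results(results_payload, s3_client, redis_client, testid):
    # Hypothetical bucket/key naming; the real deployment would pick its own.
    bucket = 'boptest-results'
    key = '{0}/{1}.json'.format(testid, uuid.uuid4())

    # Put the large payload in object storage instead of pushing it through Redis.
    s3_client.put_object(Bucket=bucket, Key=key,
                         Body=json.dumps(results_payload).encode())

    # Send only a small pointer message over Redis; the web tier can then
    # fetch the object (or presign a URL for the client) however we decide.
    redis_client.publish('{0}:results'.format(testid),
                         json.dumps({'bucket': bucket, 'key': key}))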

icupeiro (Contributor) commented Feb 2, 2025

Hey guys,

My 2 cents: after a yearly simulation of 'multizone_office_simple_hydronic', I counted how many points I could extract. I was able to extract 56 of 159 points; the request hung at point 57.

Best,
Iago

dhblum (Collaborator, Author) commented Feb 3, 2025

Thanks @kbenne for the insight and thoughts! Given what @icupeiro just mentioned, is there anything that clears memory in Redis after each message, after some number of messages, or over time?

kbenne (Contributor) commented Feb 3, 2025

I don't think I'm 100% following what @icupeiro is saying. The response to the /results API is sent as one very large message (and the response size depends heavily on what you ask for in the parameters). Can one of you help me understand a bit more what the test setup looks like when chunking "n of 159" points?

To your question @dhblum, the messages we send over Redis don't stick around. In the pub/sub scenario we use in BOPTEST, they just flow through from publisher to subscriber and then they are gone.
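
A minimal illustration of that flow-through behavior with redis-py (nothing is stored, so there is nothing to clear); the channel name is made up and Redis is assumed on localhost:6379:

import redis  # assumes redis-py and a reachable Redis instance

r = redis.Redis(host='localhost', port=6379)

# A pub/sub message is only delivered to subscribers listening at publish time;
# Redis does not persist it, so memory does not accumulate between requests.
sub = r.pubsub()
sub.subscribe('demo-channel')
sub.get_message(timeout=1.0)  # consume the subscribe confirmation

r.publish('demo-channel', 'delivered only to live subscribers')

message = sub.get_message(timeout=1.0)
if message and message['type'] == 'message':
    print(message['data'])  # b'delivered only to live subscribers'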

dhblum (Collaborator, Author) commented Feb 3, 2025

I think @icupeiro is saying that he tried requesting data from /results one point at a time in a for loop (something like the sketch below), rather than all at once with a list of points. But @icupeiro, please clarify as needed.
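
For reference, a sketch of that pattern; url, testid, and points_list are assumed to be defined as in the snippet at the top of this issue, and the time bounds here are just the full year:

import requests

# Request the year of data one point at a time instead of all points at once.
results = {}
for i, point in enumerate(points_list):
    payload = requests.put('{0}/results/{1}'.format(url, testid),
                           json={'point_names': [point],
                                 'start_time': 0.0,
                                 'final_time': 365 * 24 * 3600}).json()['payload']
    results[point] = payload
    print('Retrieved point {0} of {1}: {2}'.format(i + 1, len(points_list), point))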

icupeiro (Contributor) commented Feb 3, 2025

What @dhblum said is correct. Sorry for not clarifying!

kbenne (Contributor) commented Feb 3, 2025

OK, I see. That puts @dhblum's comment about how the messages are stored into context. In general, I think my previous comments remain valid in that these responses are too large to send as a single message. I will create a test that logs message sizes to pin down exactly what is going on.
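
A rough sketch of the kind of logging I have in mind on the worker side, right before a response is published over Redis; the helper name and channel argument are hypothetical, not existing BOPTEST code:

import json
import logging

logger = logging.getLogger(__name__)

def publish_response(redis_client, channel, payload):
    # Hypothetical wrapper: serialize the response once, log its size, and
    # publish it, so oversized messages show up clearly in the worker log.
    body = json.dumps(payload)
    logger.info('Publishing %.2f MB to %s', len(body) / 1e6, channel)
    redis_client.publish(channel, body)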

dhblum (Collaborator, Author) commented Feb 3, 2025

Thanks @kbenne. I agree we'll have to do something about this, and your idea of using MinIO for large data requests (or maybe for all data requests?) seems like it could work. But it would still be good to debug further what's going on even in the one-point-at-a-time case, to understand how Redis is behaving right now.
