
THREDDS 5.4 WMS is much slower than 4.6 #406

Open
billyz313 opened this issue Jul 26, 2023 · 32 comments

@billyz313

billyz313 commented Jul 26, 2023

We are still running V4.6 for production because 5.4 is many times slower. We want to move production to 5.4 but simply can't.

We are using Ubuntu 22.04.2 LTS, OpenJDK 64-Bit Server VM Temurin-11.0.18+10 (build 11.0.18+10, mixed mode), and Python 3.10.

To reproduce, I add a WMS layer for v4.6 to one Leaflet map: https://csthredds.servirglobal.net/thredds/climateserv_aggregated.html?dataset=emodis-ndvi_eastafrica_250m_10dy
and, on a different Leaflet map, I add the v5.4 layer: https://threddsx.servirglobal.net/thredds/catalog/climateserv_aggregated.html?dataset=emodis-ndvi_eastafrica_250m_10dy

I also add them as L.timeDimension.layer.wms layers because we animate them, and this is where the difference really shows. Animation with THREDDS 5.4 currently seems impossible because of the response lag.

Any help and suggestions are welcome.

@mnlerman
Contributor

Hello, a user encountered a similar issue a few months ago, and after some investigation we found that they were able to improve performance by removing "-ea" from JAVA_OPTS in their Tomcat setenv.sh. Can you test whether that helps with the performance issues you're seeing?

@billyz313
Author

@mnlerman Hi Megan, thank you for the suggestion. I just tried that and it didn't seem to make any difference. Here are the options I have set; maybe there is something else that needs adjusting and I'm just missing it?

NORMAL="-Xmx16384m -Xms512m -server -XX:+UseParallelGC -Djava.awt.headless=true"
HEAP_DUMP="-XX:+HeapDumpOnOutOfMemoryError"
HEADLESS="-Djava.awt.headless=true"

JAVA_OPTS="$CONTENT_ROOT $NORMAL $HEAP_DUMP $HEADLESS $JAVA_PREFS_ROOTS"

I do realize I have headless set twice. THREDDS was crashing when I zoomed the layer on the map, and I read that adding headless might help. I added it to the NORMAL variable and the crashes stopped. Not long after, I realized it was already set in the HEADLESS variable, so I'm not sure how adding it stopped the crashes, or whether that was just a coincidence.
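Both points in this exchange, the `-ea` flag Megan mentions and the duplicated headless option, can be checked mechanically. Below is a minimal sketch of a hypothetical helper (not part of Tomcat or THREDDS) that takes an already-expanded JAVA_OPTS string, drops assertion flags, and removes exact repeats. Repeating an identical `-D` option is harmless to the JVM, so the de-duplication is only tidiness.

```python
import shlex

# Flags that enable Java assertions: "-ea" as discussed above, plus its long form.
ASSERTION_FLAGS = {"-ea", "-enableassertions"}


def clean_java_opts(java_opts: str) -> str:
    """Drop assertion flags and exact duplicate options, keeping first-seen order.

    Hypothetical audit helper; expects JAVA_OPTS with shell variables already expanded.
    """
    seen = set()
    kept = []
    for opt in shlex.split(java_opts):
        # "-ea:some.package..." is the package-scoped form of the same flag.
        if opt.split(":", 1)[0] in ASSERTION_FLAGS:
            continue
        # Identical repeat, e.g. -Djava.awt.headless=true given twice.
        if opt in seen:
            continue
        seen.add(opt)
        kept.append(opt)
    return " ".join(shlex.quote(o) for o in kept)


opts = " ".join([
    "-Xmx16384m -Xms512m -server -XX:+UseParallelGC -Djava.awt.headless=true",  # NORMAL
    "-XX:+HeapDumpOnOutOfMemoryError",  # HEAP_DUMP
    "-Djava.awt.headless=true",  # HEADLESS
])
print(clean_java_opts(opts))
# -Xmx16384m -Xms512m -server -XX:+UseParallelGC -Djava.awt.headless=true -XX:+HeapDumpOnOutOfMemoryError
```

The example call uses the NORMAL, HEAP_DUMP and HEADLESS values posted above; CONTENT_ROOT and JAVA_PREFS_ROOTS are left out because their contents aren't shown.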

@billyz313
Author

Still looking for suggestions. We're concerned that support for V4.6 has ended and we can't get V5 to perform anywhere near as fast. At this point we're being forced to look into other options, which I would prefer not to do. Any help would be greatly appreciated.

@haileyajohnson

@billyz313 have you tested performance with the latest snapshot release of the TDS?

@billyz313
Author

@haileyajohnson No, we are using 5.4; maybe I can convince them to give it a shot. Should I be able to just replace the .war file and have all the configs and custom palettes read in with no issue?

@billyz313
Author

@haileyajohnson I upgraded to V5.5 this morning to see if it helped. Unfortunately there seems to be no improvement in performance. It still lags far behind V4.6 when using the WMS endpoint, which is what we use to animate the data on the map.

@tdrwenski
Contributor

Thanks for testing that out.

Can you send a bit more info to help us try to reproduce your issue? If possible can you give us:

  • The catalog or catalog entry for the slow dataset
  • The data file(s) for that dataset or a link where we can download them
  • The exact WMS URL that is slow

@billyz313
Author

billyz313 commented Aug 8, 2023

@tdrwenski Thank you for taking the time to try to assist. I'll start with a live example of the issue. I have one production layer pointing at THREDDS v5.5; all the rest point to the 4.6 version. The application is located at https://climateserv.servirglobal.net/map
To see the issue you will need to:

  1. Click the white layer stack icon at the top left in the green and white panel.
  2. Click in the "filter layers..."
  3. Type ndvi
  4. Check the USGS eMODIS NDVI East Africa box to turn it on. (East Africa is the only one pointing to V5.5)
  5. The initial load is slow compared to v4.6, but we could probably live with that if that was all it had to do.
  6. Click the play button on the time dimension control at the bottom of the map to animate the layer. Notice that it takes forever to load the animation steps.
  7. Click the pause button (or the loading percentage if it's still loading).
  8. Uncheck East Africa. Check Central Asia (it's about the same size)
  9. Click the play button. Notice there is a much shorter lag for the initial load of the animation, and it continues to animate with no issue just as it should.

So, now that you can see the issue, I will try to gather information on how to reproduce the backend for testing. Let me start by saying that the performance issue affects every dataset.

I am attaching the THREDDS config files (usr_local_thredds.zip). If you need more than this, please let me know.

More information about the system:

When we use WMS, we use the virtual aggregation, which is located at https://csthredds.servirglobal.net/thredds/wms/Agg/emodis-ndvi_eastafrica_250m_10dy.nc4?service=WMS&version=1.3.0&request=GetCapabilities (V4.6)
and https://threddsx.servirglobal.net/thredds/wms/Agg/emodis-ndvi_eastafrica_250m_10dy.nc4?service=WMS&version=1.3.0&request=GetCapabilities (V5.5)
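To separate server time from browser and Leaflet overhead, one option is to time the same WMS 1.3.0 GetMap request against both aggregations, one TIME step at a time, which is roughly what the animation does. A minimal standard-library sketch: the layer name `ndvi` is taken from the variable name in the ncdump output later in this thread, while the bounding box and TIME values are hypothetical placeholders to be checked against each server's GetCapabilities response. `CRS:84` is used so the bounding box stays in longitude/latitude order.

```python
import time
import urllib.request
from urllib.parse import urlencode

# Aggregated WMS endpoints quoted above, without their GetCapabilities query.
ENDPOINTS = {
    "4.6": "https://csthredds.servirglobal.net/thredds/wms/Agg/emodis-ndvi_eastafrica_250m_10dy.nc4",
    "5.5": "https://threddsx.servirglobal.net/thredds/wms/Agg/emodis-ndvi_eastafrica_250m_10dy.nc4",
}

# Hypothetical request parameters: confirm the layer name, extent and valid
# TIME values against each server's GetCapabilities response.
LAYER = "ndvi"
BBOX = (21.0, -12.0, 52.0, 24.0)  # min_lon, min_lat, max_lon, max_lat
TIMES = ("2023-06-01T00:00:00Z", "2023-06-11T00:00:00Z", "2023-06-21T00:00:00Z")


def getmap_url(base, layer, bbox, time_value, width=256, height=256):
    """Build a WMS 1.3.0 GetMap URL for one layer and one TIME value.

    With CRS:84 the BBOX is (min_lon, min_lat, max_lon, max_lat), avoiding
    the lat/lon axis order that EPSG:4326 implies in WMS 1.3.0.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": "1.3.0",
        "REQUEST": "GetMap",
        "LAYERS": layer,
        "STYLES": "",
        "CRS": "CRS:84",
        "BBOX": ",".join(str(v) for v in bbox),
        "WIDTH": width,
        "HEIGHT": height,
        "FORMAT": "image/png",
        "TRANSPARENT": "true",
        "TIME": time_value,
    }
    return f"{base}?{urlencode(params)}"


def time_request(url, timeout=120.0):
    """Return wall-clock seconds needed to download the full response body."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()
    return time.perf_counter() - start


def benchmark(layer=LAYER, bbox=BBOX, times=TIMES):
    """Time one GetMap per TIME step on each server, like an animation would."""
    return {
        version: [time_request(getmap_url(base, layer, bbox, t)) for t in times]
        for version, base in ENDPOINTS.items()
    }
```

Call `benchmark()` from a machine that can reach both servers. Running it twice in a row also shows whether the second pass is faster, which helps tell first-request costs apart from steady-state slowness.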

About getting the data: I'm not sure of the best way to get it to you, since we have several TB. The THREDDS endpoint for the dataset we were just testing is https://csthredds.servirglobal.net/thredds/catalog/climateserv/emodis-ndvi/eastafrica/250m/10dy/catalog.html (V4.6) and https://threddsx.servirglobal.net/thredds/catalog/climateserv/all/emodis-ndvi/eastafrica/250m/10dy/catalog.html (V5.5). I did try downloading from both, and the two downloads differ slightly in file size, which means something is not exactly the same (not sure if that makes a difference).
We have a generalized ETL that we could help you set up to download the data to your test system, which would give you the data exactly as we have it in THREDDS.
Another option is that I could drop a handful of nc files in a Google Drive that you could grab.

I also just exposed the data directly from our server for you if that's easier. https://eandvi.servirglobal.net

Let me know what you would prefer.

@tdrwenski
Contributor

tdrwenski commented Aug 8, 2023

Thank you for sending the extra info! I see how slow it is on your server and I am seeing similar performance issues locally using your datasets.

It seems slow with other services like NCSS as well, so I don't think WMS is the only slow service. It looks like this is related to a performance issue we are working on fixing for version 5.5: enhancements (such as scale, offset, and fill value) are not handled well in the current version and can cause performance issues for large datasets.
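For readers unfamiliar with the term, the enhancements are the CF packing and masking attributes: on every read, stored integers are compared against the fill/missing values and converted with `value = raw * scale_factor + add_offset`. A minimal NumPy sketch of that per-value work, using the `ndvi` attributes shown later in the thread (`_FillValue = missing_value = 127`, `scale_factor = 0.01`, `add_offset = 0`); it illustrates the arithmetic only and says nothing about how netCDF-Java implements it.

```python
import numpy as np


def unpack(raw, scale_factor, add_offset, fill_value=None, missing_value=None):
    """Apply CF-style masking and unpacking to raw stored values.

    Sentinels are compared against the raw (packed) values first, then
    physical = raw * scale_factor + add_offset, with masked cells set to NaN.
    """
    mask = np.zeros(raw.shape, dtype=bool)
    for sentinel in (fill_value, missing_value):
        if sentinel is not None:
            mask |= raw == sentinel
    values = raw.astype(np.float64) * scale_factor + add_offset
    values[mask] = np.nan
    return values


raw = np.array([50, 127, -10, 100], dtype=np.int8)  # as stored in the file
print(unpack(raw, 0.01, 0.0, fill_value=127, missing_value=127))
```

For the four raw bytes above, this yields 0.5, NaN, -0.1 and 1.0.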

We will keep you updated on our progress on this!

@billyz313
Author

Thank you. Just a quick question, if you know: will disabling everything except WMS help performance at all? I'm guessing it wouldn't, but we only actively use WMS for ClimateSERV, so I think turning the other services off is probably a good idea in general.

@tdrwenski
Contributor

Will disabling everything except WMS help performance at all? I'm guessing it wouldn't, but we only actively use WMS for ClimateSERV, so I think turning the other services off is probably a good idea in general.

Unfortunately, I don't think that will help at all. The only thing I think would help is to not have enhancements (scale, offset, fill value) in your data, which probably isn't a feasible workaround. I hope we will have a fix for you soon!

@billyz313
Author

billyz313 commented Aug 8, 2023

Ahhh, I see what you mean: the attributes in the actual data. I'll have to look into that. We use them for our calculations, but I wonder whether they could be removed for the THREDDS copy of the data.

@tdrwenski So when we're creating the NetCDF, is it this encoding that needs to be changed?

ds[self.etl_parent_pipeline_instance.dataset.dataset_nc4_variable_name].encoding = {
    '_FillValue': np.int8(127),
    'missing_value': np.int8(127),
    'dtype': np.dtype('int8'),
    'scale_factor': 0.01,
    'add_offset': 0.0,
    'chunksizes': (1, 256, 256)
}

Should we just remove _FillValue, scale_factor, and add_offset?
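If the aim is only a test copy without packing, one way to express it in xarray is to write the already-decoded float values and ask xarray not to write a fill value, which it does when `_FillValue` is set to `None` in the encoding. A hedged sketch of that encoding dictionary: `unpacked_test_encoding` is a hypothetical helper, and it assumes the data array holds physical (unpacked) values and is written with the netCDF4 engine so that `chunksizes` still applies. Storing float32 instead of int8 makes the variable roughly four times larger on disk, so this is for benchmarking only.

```python
import numpy as np

# Encoding keys that make readers (including the TDS) mask or unpack values.
ENHANCEMENT_KEYS = ("_FillValue", "missing_value", "scale_factor", "add_offset")


def unpacked_test_encoding(packed_encoding):
    """Return an xarray encoding dict that writes plain float32 data.

    Packing keys and the int8 dtype are dropped, dtype becomes float32, and
    '_FillValue': None asks xarray not to write a fill value at all.
    Other settings (e.g. chunksizes) are copied unchanged.
    """
    encoding = {k: v for k, v in packed_encoding.items()
                if k not in ENHANCEMENT_KEYS and k != "dtype"}
    encoding["dtype"] = np.dtype("float32")
    encoding["_FillValue"] = None
    return encoding


# The packed encoding from the ETL snippet above.
packed = {
    "_FillValue": np.int8(127),
    "missing_value": np.int8(127),
    "dtype": np.dtype("int8"),
    "scale_factor": 0.01,
    "add_offset": 0.0,
    "chunksizes": (1, 256, 256),
}
print(unpacked_test_encoding(packed))
```

In the ETL this would stand in for the encoding assignment above for a test run only; missing pixels would then be stored as NaN instead of 127.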

@tdrwenski
Contributor

I don't think there is any easy way to turn them off unless you change the data itself. For instance, with ncdump on one of your data files I see:

variables:
	double latitude(latitude) ;
		latitude:_FillValue = NaN ;
		latitude:long_name = "latitude" ;
		latitude:units = "degrees_north" ;
		latitude:axis = "Y" ;
	int time(time) ;
		time:long_name = "time" ;
		time:axis = "T" ;
		time:bounds = "time_bnds" ;
		time:units = "seconds since 1970-01-01T00:00:00+00:00" ;
		time:calendar = "proleptic_gregorian" ;
	double longitude(longitude) ;
		longitude:_FillValue = NaN ;
		longitude:long_name = "longitude" ;
		longitude:units = "degrees_east" ;
		longitude:axis = "X" ;
	byte ndvi(time, latitude, longitude) ;
		ndvi:_FillValue = 127b ;
		ndvi:long_name = "ndvi" ;
		ndvi:units = "unitless" ;
		ndvi:comment = "Maximum value composite over dekad defined by time_bnds" ;
		ndvi:add_offset = 0. ;
		ndvi:scale_factor = 0.01 ;
		ndvi:missing_value = 127b ;
	int time_bnds(time, nbnds) ;
		time_bnds:long_name = "time_bounds" ;

Attributes like add_offset, scale_factor, missing_value, and _FillValue are the enhancements I am referring to. If you are creating the NetCDF files yourself, you could test whether things are faster without those attributes. Otherwise, you can use tools like ncdump/ncgen to test removing them, though that would only be useful for testing, not an actual workaround.
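For the ncdump/ncgen experiment, the edit amounts to deleting those attribute lines from the CDL before regenerating the file (for example `ncdump -s file.nc > file.cdl`, edit, then `ncgen -k nc4 -o test.nc file.cdl`, where `-s` keeps the chunking attributes so the rebuilt file stays comparable). A minimal sketch with hypothetical helper names, assuming CDL laid out as in the dump above (one `var:attr = value ;` per line, simple variable names). Since ncdump prints the stored packed values in the `data:` section, a file rebuilt this way reads back as raw byte values rather than NDVI, which is fine for timing but nothing else.

```python
import re

ENHANCEMENT_ATTRS = ("_FillValue", "missing_value", "scale_factor", "add_offset")

# A CDL variable attribute line such as "\t\tndvi:scale_factor = 0.01 ;".
# Global attributes (":title = ...") have no variable name and do not match.
ATTR_LINE = re.compile(r"^\s*(\w+):(\w+)\s*=")


def find_enhancements(cdl):
    """Map each variable name to the enhancement attributes it declares."""
    found = {}
    for line in cdl.splitlines():
        m = ATTR_LINE.match(line)
        if m and m.group(2) in ENHANCEMENT_ATTRS:
            found.setdefault(m.group(1), []).append(m.group(2))
    return found


def strip_enhancements(cdl, variables=None):
    """Remove enhancement attribute lines, optionally only for some variables."""
    kept = []
    for line in cdl.splitlines(keepends=True):
        m = ATTR_LINE.match(line)
        if (m and m.group(2) in ENHANCEMENT_ATTRS
                and (variables is None or m.group(1) in variables)):
            continue
        kept.append(line)
    return "".join(kept)
```

Calling `strip_enhancements(cdl, {"ndvi"})` would leave the coordinate variables' `_FillValue = NaN` in place and only unpack-proof the data variable.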

@billyz313
Author

Hi @tdrwenski, I was just checking back to see if y'all were able to publish a fix for this yet. Also, I did look into removing the enhancements, but the rest of the system will not work without them.

@tdrwenski
Contributor

Hi, we are still working on this performance issue. We will try to let you know if we have an update!

@haileyajohnson

Hi @billyz313 , the latest snapshot of the TDS includes some performance improvements for datasets with enhancements that may help with your issues, though there may still be other problems causing slowdowns. If you get a chance to check it out, we'd be interested to hear if it helps at all.

@billyz313
Author

billyz313 commented Dec 11, 2023

@haileyajohnson Thank you! I will see if we can get it deployed asap! Is it the 5.5 from https://downloads.unidata.ucar.edu/tds/ ?

@billyz313
Author

@haileyajohnson We upgraded to 5.5 and unfortunately there is no change in the performance.

@ashokgj

ashokgj commented Dec 20, 2023

Yes, we are facing the same issue: 4.6 was a lot faster than 5.5.

@tdrwenski
Contributor

We have a few more performance fixes that I think may help you. You can download the latest snapshot here: https://downloads.unidata.ucar.edu/tds/5.5/thredds-5.5-SNAPSHOT.war

Let us know if that helps or not!

@tdrwenski
Contributor

Hi @billyz313, have you had a chance to test performance with the latest snapshot yet?

@billyz313
Author

@tdrwenski Yes, we installed the new 5.5 snapshot you mentioned, and it didn't seem to have any effect on performance. Some folks thought it got slower, but I think it's about the same. So we're still running 4.6 in production...

@haileyajohnson

@billyz313 we used your data as a benchmark for our performance improvements, so it's surprising that they haven't fixed your issues (not that we're doubting you; we can see that it's slow). In our own tests, serving your data via WMS was at least twice as fast with the recent changes...
Is it possible that your network is blocking, scanning, or otherwise slowing something down?
There are definitely things we could have overlooked here, but it would be good to verify that it's not a network configuration issue.

@billyz313
Author

@haileyajohnson Do you think the virtual aggregation could be causing the delay?

@tdrwenski
Contributor

Very large joinExisting aggregations can be slow on the first request, because every file in the aggregation is opened to get the coordinate info. We have an aggregation cache that persists this info, so the second request should be much faster. However, I don't believe that is what I saw on your server; it seems consistently slow.

You may want to compare the performance of a single unaggregated file against the aggregation to be sure the aggregation isn't causing the slowdown. If you do think your performance issues are related to the aggregation, there are a couple of things you could try. The aggregation cache is scoured daily, but you can turn this off by setting the scour period to -1 sec (see here). If each file in your joinExisting aggregation has one time value, you can also try using a dateFormatMark to extract the value from the file name, so that the file won't need to be opened to get this info.
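As a way to check whether the file names can carry the time value, here is a rough Python emulation of `dateFormatMark`. It assumes the NcML semantics as we understand them: the text before `#` is found in the file name, and the characters right after it are parsed with the Java `SimpleDateFormat` pattern that follows `#`. Only the fixed-width tokens `yyyy`, `DDD`, `MM`, `dd`, `HH`, `mm` and `ss` are handled (no quoted literals), the file name `emodis-ndvi.20230601.nc4` is hypothetical, and this is not the TDS implementation.

```python
from datetime import datetime

# SimpleDateFormat tokens handled here, mapped to strptime directives.
# Checked in this order at each position; matching is case-sensitive.
TOKENS = [("yyyy", "%Y"), ("DDD", "%j"), ("MM", "%m"), ("dd", "%d"),
          ("HH", "%H"), ("mm", "%M"), ("ss", "%S")]


def java_to_strptime(pattern):
    """Translate a fixed-width SimpleDateFormat pattern to a strptime format."""
    out, i = [], 0
    while i < len(pattern):
        for token, directive in TOKENS:
            if pattern.startswith(token, i):
                out.append(directive)
                i += len(token)
                break
        else:
            out.append(pattern[i])  # literal character such as "_"
            i += 1
    return "".join(out)


def date_from_filename(filename, date_format_mark):
    """Emulate dateFormatMark="<match>#<SimpleDateFormat>" on one file name."""
    match, _, java_fmt = date_format_mark.partition("#")
    pos = filename.find(match)
    if pos < 0:
        raise ValueError(f"{match!r} not found in {filename!r}")
    start = pos + len(match)
    # With only fixed-width tokens, the date text is as long as the pattern.
    date_text = filename[start:start + len(java_fmt)]
    return datetime.strptime(date_text, java_to_strptime(java_fmt))


print(date_from_filename("emodis-ndvi.20230601.nc4", "ndvi.#yyyyMMdd"))
```

If every file name yields the same instant as that file's single time coordinate, the aggregation is a candidate for `dateFormatMark` on the NcML `scan` element.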

@billyz313
Copy link
Author

@tdrwenski We're considering switching our files from NetCDF to Zarr. The 5.x version supports Zarr, right? And is WMS tiling through THREDDS a lot faster with a Zarr file, or is the tiling speed similar to NetCDF?

@haileyajohnson

@billyz313 unfortunately Zarr support is still very much in beta in the TDS library, but you're welcome to try it out

@billyz313
Author

@haileyajohnson thank you, I think we should throw a few files in, cross our fingers, and see what it does :)

@tlvu

tlvu commented Nov 19, 2024

Just curious, has the 5.5 release fixed this issue?

@billyz313
Author

@tlvu Unfortunately no.

@tdrwenski We tried to deploy the 5.6 snapshot and it failed. Here are the messages from catalina.out; not sure if they will be helpful. After it failed, he rolled back to 5.5. Do you have any suggestions?

20-Nov-2024 15:23:43.819 INFO [localhost-startStop-2] org.apache.catalina.startup.HostConfig.deployWAR Deploying web application archive [/usr/local/apache-tomcat-8.5.85/webapps/thredds##5.6SS.war]
20-Nov-2024 15:23:44.504 WARNING [localhost-startStop-2] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:23:44.511 WARNING [localhost-startStop-2] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:23:47.074 INFO [localhost-startStop-2] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
20-Nov-2024 15:23:47.082 SEVERE [localhost-startStop-2] org.apache.catalina.core.StandardContext.startInternal One or more listeners failed to start. Full details will be found in the appropriate container log file
20-Nov-2024 15:23:47.083 SEVERE [localhost-startStop-2] org.apache.catalina.core.StandardContext.startInternal Context [/thredds##5.6SS] startup failed due to previous errors
20-Nov-2024 15:23:47.096 INFO [localhost-startStop-2] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [/usr/local/apache-tomcat-8.5.85/webapps/thredds##5.6SS.war] has finished in [3,276] ms
20-Nov-2024 15:25:53.545 WARNING [http-nio-8080-exec-9] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:25:53.552 WARNING [http-nio-8080-exec-9] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:25:55.705 INFO [http-nio-8080-exec-9] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
20-Nov-2024 15:25:55.711 SEVERE [http-nio-8080-exec-9] org.apache.catalina.core.StandardContext.startInternal One or more listeners failed to start. Full details will be found in the appropriate container log file
20-Nov-2024 15:25:55.712 SEVERE [http-nio-8080-exec-9] org.apache.catalina.core.StandardContext.startInternal Context [/thredds##5.6SS] startup failed due to previous errors
20-Nov-2024 15:26:08.897 INFO [http-nio-8080-exec-10] org.apache.catalina.util.LifecycleBase.stop The stop() method was called on component [StandardEngine[Catalina].StandardHost[localhost].StandardContext[/thredds##5.6SS]] after stop() had already been called. The second call will be ignored.
20-Nov-2024 15:26:09.435 INFO [http-nio-8080-exec-10] org.apache.catalina.startup.HostConfig.undeploy Undeploying context [/thredds##5.6SS]
20-Nov-2024 15:30:47.162 INFO [localhost-startStop-3] org.apache.catalina.startup.HostConfig.deployWAR Deploying web application archive [/usr/local/apache-tomcat-8.5.85/webapps/thredds##56SS.war]
20-Nov-2024 15:30:47.808 WARNING [localhost-startStop-3] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:30:47.813 WARNING [localhost-startStop-3] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:30:49.952 INFO [localhost-startStop-3] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
20-Nov-2024 15:30:49.958 SEVERE [localhost-startStop-3] org.apache.catalina.core.StandardContext.startInternal One or more listeners failed to start. Full details will be found in the appropriate container log file
20-Nov-2024 15:30:49.959 SEVERE [localhost-startStop-3] org.apache.catalina.core.StandardContext.startInternal Context [/thredds##56SS] startup failed due to previous errors
20-Nov-2024 15:30:49.966 INFO [localhost-startStop-3] org.apache.catalina.startup.HostConfig.deployWAR Deployment of web application archive [/usr/local/apache-tomcat-8.5.85/webapps/thredds##56SS.war] has finished in [2,804] ms
20-Nov-2024 15:31:04.599 WARNING [http-nio-8080-exec-13] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:31:04.604 WARNING [http-nio-8080-exec-13] org.apache.tomcat.util.descriptor.web.WebXml.setVersion Unknown version string [5.0]. Default version will be used.
20-Nov-2024 15:31:06.708 INFO [http-nio-8080-exec-13] org.apache.jasper.servlet.TldScanner.scanJars At least one JAR was scanned for TLDs yet contained no TLDs. Enable debug logging for this logger for a complete list of JARs that were scanned but no TLDs were found in them. Skipping unneeded JARs during scanning can improve startup time and JSP compilation time.
20-Nov-2024 15:31:06.713 SEVERE [http-nio-8080-exec-13] org.apache.catalina.core.StandardContext.startInternal One or more listeners failed to start. Full details will be found in the appropriate container log file
20-Nov-2024 15:31:06.714 SEVERE [http-nio-8080-exec-13] org.apache.catalina.core.StandardContext.startInternal Context [/thredds##56SS] startup failed due to previous errors

@lesserwhirls
Collaborator

Greetings @billyz313! It looks like we upgraded from Spring 5 to Spring 6 in August, which required a jump to Java 17 and Tomcat 10. Based on the Tomcat path in your log files, one or both of those might be the issue.
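A small pre-flight check for the new machine, using the minimums stated here (Java 17, Tomcat 10). `preflight` is a hypothetical helper that pulls the first dotted version number out of a string such as the Tomcat path in the log (`apache-tomcat-8.5.85`) or the JVM build reported at the top of the thread (`11.0.18`). The repeated `Unknown version string [5.0]` warnings in the log fit this diagnosis: as far as we know, web.xml version 5.0 is the Jakarta Servlet 5.0 schema, which Tomcat 8.5 predates.

```python
import re

# Minimum versions stated for TDS builds on Spring 6.
MIN_JAVA = (17,)
MIN_TOMCAT = (10,)


def parse_version(text):
    """Extract the first dotted version, e.g. 'apache-tomcat-8.5.85' -> (8, 5, 85)."""
    m = re.search(r"\d+(?:\.\d+)*", text)
    if m is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in m.group(0).split("."))


def preflight(java_version, tomcat_home):
    """Return a list of problems; an empty list means both minimums are met."""
    problems = []
    java = parse_version(java_version)
    tomcat = parse_version(tomcat_home)
    # Tuple comparison: (11, 0, 18) < (17,) but (17, 0, 9) is not.
    if java < MIN_JAVA:
        problems.append(f"Java {'.'.join(map(str, java))} < 17")
    if tomcat < MIN_TOMCAT:
        problems.append(f"Tomcat {'.'.join(map(str, tomcat))} < 10")
    return problems


print(preflight("Temurin-11.0.18+10", "/usr/local/apache-tomcat-8.5.85"))
```

With the versions from this thread, both checks fail, which matches the suggestion that either the JVM or the Tomcat version could be blocking the 5.6 snapshot.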

@billyz313
Author

@lesserwhirls Thank you, we are setting up a new machine that meets the requirements and will give it a shot on there. We are hoping to see an increase in the WMS speed.
