SitemapLoader URLs are not properly stripped #30829

Closed
5 tasks done
givemelove opened this issue Apr 14, 2025 · 0 comments · Fixed by #30830
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@givemelove
Contributor

Checked other resources

  • I added a very descriptive title to this issue.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
  • I posted a self-contained, minimal, reproducible example. A maintainer can copy it and run it AS IS.

Example Code

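The following code returns URLs that are not properly stripped.

from langchain_community.document_loaders import SitemapLoader

# Load only the page matching the filter from the Snowflake docs sitemap.
loader = SitemapLoader(
    "https://docs.snowflake.com/sitemap.xml",
    filter_urls=["https://docs.snowflake.com/en/sql-reference-data-types"],
)
data = loader.load()
# The metadata shows the unstripped 'source' and 'loc' values.
print(data[0].metadata)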

Error Message and Stack Trace (if applicable)

No response

Description

The metadata returned by the loader contains URLs that still carry leading and trailing whitespace (newlines and spaces):
{'source': '\n https://docs.snowflake.com/en/sql-reference-data-types\n ', 'loc': '\n https://docs.snowflake.com/en/sql-reference-data-types\n '}

They should instead be as follows:
{'source': 'https://docs.snowflake.com/en/sql-reference-data-types', 'loc': 'https://docs.snowflake.com/en/sql-reference-data-types'}
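The whitespace in the reported output is consistent with a sitemap whose <loc> elements place the URL on its own line, so the element text includes the surrounding newlines and indentation. The sketch below reproduces that effect with the standard library only and shows how trimming each value gives the expected metadata. The sitemap fragment and the helpers read_loc_metadata and strip_metadata are hypothetical illustrations, not LangChain code and not necessarily how #30830 fixes the problem.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sitemap fragment (an assumption, not the real Snowflake file):
# the URL sits on its own line inside <loc>, so the element text carries
# the newlines and indentation around it.
SITEMAP_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>
      https://docs.snowflake.com/en/sql-reference-data-types
    </loc>
  </url>
</urlset>"""


def read_loc_metadata(xml_text: str) -> list[dict[str, str]]:
    """Build {'source', 'loc'} metadata from each <loc>, keeping the raw text."""
    root = ET.fromstring(xml_text)
    entries = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        raw = loc.text or ""
        entries.append({"source": raw, "loc": raw})
    return entries


def strip_metadata(metadata: dict) -> dict:
    """Return a copy with whitespace trimmed from every string value."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in metadata.items()
    }


raw = read_loc_metadata(SITEMAP_XML)[0]
clean = strip_metadata(raw)
print(repr(raw["loc"]))  # raw text, still padded with whitespace
print(clean)             # both values trimmed to the bare URL
```

On an affected version, applying a helper like strip_metadata to each returned document's metadata dict after loader.load() may serve as a stopgap until the fix is installed.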

System Info

System Information

OS: Darwin
OS Version: Darwin Kernel Version 24.3.0: Thu Jan 2 20:22:58 PST 2025; root:xnu-11215.81.4~3/RELEASE_ARM64_T8132
Python Version: 3.11.11 (main, Feb 13 2025, 11:29:41) [Clang 16.0.0 (clang-1600.0.26.6)]

Package Information

langchain_core: 0.3.51
langchain: 0.3.23
langchain_community: 0.3.21
langsmith: 0.3.28
langchain_text_splitters: 0.3.8

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.27.2
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.51: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-perplexity;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.8: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.23: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
opentelemetry-api: 1.27.0
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: 1.27.0
orjson: 3.10.7
packaging: 24.1
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.3
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 8.3.3
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: 14.0.0
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

@dosubot (bot) added the 🤖:bug label (Related to a bug, vulnerability, unexpected error with an existing feature) Apr 14, 2025
givemelove added a commit to givemelove/langchain that referenced this issue Apr 14, 2025
@ccurme ccurme closed this as completed in 7ecdac5 Apr 23, 2025