Merge pull request #54 from ernado-x/content-loaders
Version 2.0.2
ernado-x authored Jul 9, 2024
2 parents b641d4c + dcc0a14 commit feb751f
Showing 29 changed files with 716 additions and 360 deletions.
27 changes: 0 additions & 27 deletions .vscode/launch.json

This file was deleted.

42 changes: 0 additions & 42 deletions .vscode/tasks.json

This file was deleted.

92 changes: 91 additions & 1 deletion README.md
@@ -2,7 +2,97 @@
[![NuGet version](https://badge.fury.io/nu/X.Web.MetaExtractor.svg)](https://badge.fury.io/nu/X.Web.MetaExtractor)
[![Twitter URL](https://img.shields.io/twitter/url/https/twitter.com/andrew_gubskiy.svg?style=social&label=Follow%20me!)](https://twitter.com/intent/user?screen_name=andrew_gubskiy)

The library allows extracting meta information from any webpage.
**X.Web.MetaExtractor** is a powerful library that allows you to extract meta information from any web page URL. It provides a variety of content loaders to handle HTTP requests using different libraries.

## Breaking Changes

- **Metadata class changed**: The `Content` field has been removed from the `Metadata` class. Update any code that read the `Content` field.
- **Description Extraction Logic**: The `Extractor` class now only extracts the description from meta tags, without attempting to parse the content of the page. Adjust your implementation if it relied on content parsing for the description.

## Features

- Extract meta information from any web page URL.
- Support for multiple HTTP libraries:
  - Flurl
  - FsHttp
  - RestSharp
- Detect the language of the page content.

## Installation

To install the library, use the following command:

```bash
dotnet add package X.Web.MetaExtractor
```

## Usage

Here is a basic example of how to use the `X.Web.MetaExtractor` library:

```csharp
using System;
using X.Web.MetaExtractor;
using X.Web.MetaExtractor.ContentLoaders.Flurl;
using X.Web.MetaExtractor.LanguageDetectors;

// Create instances of the necessary components
IPageContentLoader contentLoader = new FlurlPageContentLoader();
ILanguageDetector languageDetector = new LanguageDetector();
string defaultImage = "https://example.com/example.jpg";

// Create an instance of the Extractor
IExtractor extractor = new Extractor(defaultImage, contentLoader, languageDetector);

// Extract meta information from a URL
var metaInfo = await extractor.ExtractAsync(new Uri("https://example.com"));

// Display the extracted meta information
Console.WriteLine($"Title: {metaInfo.Title}");
Console.WriteLine($"Description: {metaInfo.Description}");
// Keywords is a collection, so join it for display
Console.WriteLine($"Keywords: {string.Join(", ", metaInfo.Keywords)}");
Console.WriteLine($"Language: {metaInfo.Language}");
```

## Interfaces and Classes

### IExtractor

`IExtractor` defines the interface for extracting meta information.

### ILanguageDetector

`ILanguageDetector` defines the interface for detecting the language of the page content.

### IPageContentLoader

`IPageContentLoader` defines the interface for loading the content of a web page.
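
As a rough illustration of what implementing this interface could look like, here is a minimal sketch of a custom loader built on `System.Net.Http.HttpClient`. The `LoadPageContentAsync(Uri)` signature mirrors the one implemented by `FlurlPageContentLoader` in this repository; the interface is redeclared locally only so the snippet compiles on its own, and the names `SimpleHttpClientLoader` and `CannedHandler` are hypothetical, not part of the library.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Local stand-in (assumption) that mirrors the signature implemented by
// FlurlPageContentLoader. In a real project, use the library's own
// IPageContentLoader instead of redeclaring it.
public interface IPageContentLoader
{
    Task<string> LoadPageContentAsync(Uri uri);
}

// Hypothetical loader that fetches the page with a caller-supplied HttpClient.
public sealed class SimpleHttpClientLoader : IPageContentLoader
{
    private readonly HttpClient _client;

    public SimpleHttpClientLoader(HttpClient client)
    {
        _client = client ?? throw new ArgumentNullException(nameof(client));
    }

    public async Task<string> LoadPageContentAsync(Uri uri)
    {
        using var response = await _client.GetAsync(uri);

        // Throw on non-success status codes instead of returning an error page.
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }
}

// Hypothetical stub handler that always answers 200 OK with fixed HTML,
// so the loader can be exercised without network access.
public sealed class CannedHandler : HttpMessageHandler
{
    private readonly string _html;

    public CannedHandler(string html) => _html = html;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        var response = new HttpResponseMessage(HttpStatusCode.OK)
        {
            Content = new StringContent(_html)
        };

        return Task.FromResult(response);
    }
}
```

A loader like this would be passed to `Extractor` in place of `FlurlPageContentLoader`. Injecting the `HttpClient` keeps connection reuse under the caller's control, and constructing it with `new HttpClient(new CannedHandler("..."))` is one way to test extraction logic offline.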

### Metadata

`Metadata` is a class that holds the meta information of a web page, including the URL, title, description, keywords, images, and language.

## Content Loaders

### Flurl

`X.Web.MetaExtractor.ContentLoaders.Flurl` provides a content loader using the Flurl HTTP library, enabling efficient and fluent HTTP request handling for meta information extraction from any page URL.

### FsHttp

`X.Web.MetaExtractor.ContentLoaders.FsHttp` leverages the FsHttp library to load content, facilitating robust and type-safe HTTP request execution for extracting meta information from any page URL.

### HttpClient

`X.Web.MetaExtractor.ContentLoaders.HttpClient` utilizes the HttpClient class to load content, offering a flexible and reliable approach to perform HTTP requests for meta information extraction from any page URL.

### RestSharp

`X.Web.MetaExtractor.ContentLoaders.RestSharp` uses the RestSharp library for content loading, providing an intuitive and powerful way to handle HTTP requests for extracting meta information from any page URL.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License. See the [LICENSE](https://github.com/ernado-x/X.Web.MetaExtractor/blob/master/LICENSE) file for more details.
35 changes: 28 additions & 7 deletions X.Web.MetaExtractor.sln
@@ -11,10 +11,16 @@ Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "X.Web.MetaExtractor", "src\
EndProject
Project("{9A19103F-16F7-4668-BE54-9A1E7A4F7556}") = "X.Web.MetaExtractor.Tests", "tests\X.Web.MetaExtractor.Tests\X.Web.MetaExtractor.Tests.csproj", "{84F45AE0-AD03-4442-9C89-636FAFDC3B56}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "X.Web.MetaExtractor.Example", "example\X.Web.MetaExtractor.Example\X.Web.MetaExtractor.Example.csproj", "{99BA0721-5C1B-45CF-8600-A29D9AC86BD1}"
EndProject
Project("{2150E333-8FDC-42A3-9474-1A3956D46DE8}") = "example", "example", "{EC055B9D-336D-4A3E-A41E-2220AB52DA7A}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "X.Web.MetaExtractor.ContentLoaders.RestSharp", "src\X.Web.MetaExtractor.ContentLoaders.RestSharp\X.Web.MetaExtractor.ContentLoaders.RestSharp.csproj", "{3FCCD2AF-A72A-48F1-8C8F-15219EBA975E}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "X.Web.MetaExtractor.ContentLoaders.Flurl", "src\X.Web.MetaExtractor.ContentLoaders.Flurl\X.Web.MetaExtractor.ContentLoaders.Flurl.csproj", "{0B015156-1341-4358-BA12-AF10A355810B}"
EndProject
Project("{F2A71F9B-5D33-465A-A702-920D77279786}") = "X.Web.MetaExtractor.ContentLoaders.FsHttp", "src\X.Web.MetaExtractor.ContentLoaders.FsHttp\X.Web.MetaExtractor.ContentLoaders.FsHttp.fsproj", "{47440FEB-7ED2-4109-A47D-F0B5FEF50A1D}"
EndProject
Project("{FAE04EC0-301F-11D3-BF4B-00C04F79EFBC}") = "X.Web.MetaExtractor.Example", "example\X.Web.MetaExtractor.Example\X.Web.MetaExtractor.Example.csproj", "{AA4EC204-2D54-49BA-BA7F-660FE0B382C8}"
EndProject
Global
GlobalSection(SolutionConfigurationPlatforms) = preSolution
Debug|Any CPU = Debug|Any CPU
@@ -29,17 +35,32 @@ Global
{84F45AE0-AD03-4442-9C89-636FAFDC3B56}.Debug|Any CPU.Build.0 = Debug|Any CPU
{84F45AE0-AD03-4442-9C89-636FAFDC3B56}.Release|Any CPU.ActiveCfg = Release|Any CPU
{84F45AE0-AD03-4442-9C89-636FAFDC3B56}.Release|Any CPU.Build.0 = Release|Any CPU
{99BA0721-5C1B-45CF-8600-A29D9AC86BD1}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{99BA0721-5C1B-45CF-8600-A29D9AC86BD1}.Debug|Any CPU.Build.0 = Debug|Any CPU
{99BA0721-5C1B-45CF-8600-A29D9AC86BD1}.Release|Any CPU.ActiveCfg = Release|Any CPU
{99BA0721-5C1B-45CF-8600-A29D9AC86BD1}.Release|Any CPU.Build.0 = Release|Any CPU
{3FCCD2AF-A72A-48F1-8C8F-15219EBA975E}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{3FCCD2AF-A72A-48F1-8C8F-15219EBA975E}.Debug|Any CPU.Build.0 = Debug|Any CPU
{3FCCD2AF-A72A-48F1-8C8F-15219EBA975E}.Release|Any CPU.ActiveCfg = Release|Any CPU
{3FCCD2AF-A72A-48F1-8C8F-15219EBA975E}.Release|Any CPU.Build.0 = Release|Any CPU
{0B015156-1341-4358-BA12-AF10A355810B}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{0B015156-1341-4358-BA12-AF10A355810B}.Debug|Any CPU.Build.0 = Debug|Any CPU
{0B015156-1341-4358-BA12-AF10A355810B}.Release|Any CPU.ActiveCfg = Release|Any CPU
{0B015156-1341-4358-BA12-AF10A355810B}.Release|Any CPU.Build.0 = Release|Any CPU
{47440FEB-7ED2-4109-A47D-F0B5FEF50A1D}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{47440FEB-7ED2-4109-A47D-F0B5FEF50A1D}.Debug|Any CPU.Build.0 = Debug|Any CPU
{47440FEB-7ED2-4109-A47D-F0B5FEF50A1D}.Release|Any CPU.ActiveCfg = Release|Any CPU
{47440FEB-7ED2-4109-A47D-F0B5FEF50A1D}.Release|Any CPU.Build.0 = Release|Any CPU
{AA4EC204-2D54-49BA-BA7F-660FE0B382C8}.Debug|Any CPU.ActiveCfg = Debug|Any CPU
{AA4EC204-2D54-49BA-BA7F-660FE0B382C8}.Debug|Any CPU.Build.0 = Debug|Any CPU
{AA4EC204-2D54-49BA-BA7F-660FE0B382C8}.Release|Any CPU.ActiveCfg = Release|Any CPU
{AA4EC204-2D54-49BA-BA7F-660FE0B382C8}.Release|Any CPU.Build.0 = Release|Any CPU
EndGlobalSection
GlobalSection(SolutionProperties) = preSolution
HideSolutionNode = FALSE
EndGlobalSection
GlobalSection(NestedProjects) = preSolution
{9947A2DD-CDE3-4C53-877D-C8B0381376BB} = {483AC3EF-1A72-4790-961C-7E2B5B80DA83}
{84F45AE0-AD03-4442-9C89-636FAFDC3B56} = {3DF02813-3F46-4CFA-879E-2B9639F0BD29}
{99BA0721-5C1B-45CF-8600-A29D9AC86BD1} = {EC055B9D-336D-4A3E-A41E-2220AB52DA7A}
{3FCCD2AF-A72A-48F1-8C8F-15219EBA975E} = {483AC3EF-1A72-4790-961C-7E2B5B80DA83}
{0B015156-1341-4358-BA12-AF10A355810B} = {483AC3EF-1A72-4790-961C-7E2B5B80DA83}
{47440FEB-7ED2-4109-A47D-F0B5FEF50A1D} = {483AC3EF-1A72-4790-961C-7E2B5B80DA83}
{AA4EC204-2D54-49BA-BA7F-660FE0B382C8} = {EC055B9D-336D-4A3E-A41E-2220AB52DA7A}
EndGlobalSection
EndGlobal
109 changes: 24 additions & 85 deletions example/X.Web.MetaExtractor.Example/Program.cs
@@ -1,85 +1,24 @@
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

namespace X.Web.MetaExtractor.Example;

class Program
{
    static async Task Main(string[] args)
    {
        Console.Clear();

        IReadOnlyCollection<Uri> linksTemplate = new List<Uri>
        {
            new Uri("https://diepresse.com/home/wirtschaft/unternehmen/5399476/TeslaChef-Elon-Musk_Das-AutoGeschaeft-ist-die-Hoelle"),
            new Uri("https://andrew.gubskiy.com/"),
            new Uri("https://devdigest.today/post/458"),
            new Uri("https://blogs.msdn.microsoft.com/dotnet/2018/04/11/announcing-net-core-2-1-preview-2/"),
            new Uri("https://github.com/dotnet/corefx/milestone/12"),
            new Uri("https://stackoverflow.com/questions/49790807/can-net-core-1-1-4-run-net-standard-2"),
            new Uri("https://dotnetcoretutorials.com"),
            new Uri("https://softwareengineering.stackexchange.com/questions/305933/json-api-specification-when-do-i-need-to-return-a-404-not-found"),
            new Uri("https://devdigest.today/post/469"),
            new Uri("https://diepresse.com/home/panorama/wien/5386805/Polizist-attackiert_Parlament-verstaerkt-Bewachung"),
            new Uri("https://www.diepresse.com/5748483/thiem-unterliegt-bei-atp-cup-gegen-den-polen-hurkacz")
        };

        var links = new List<Uri>();

        for (var i = 0; i < 1; i++)
        {
            links.AddRange(Generate(linksTemplate));
        }

        var extractor = new Extractor();

        var collection = new BlockingCollection<Metadata>();


        await ForEach(links, async uri =>
        {
            Console.WriteLine($"Start extracting {uri}");

            try
            {
                var metadata = await extractor.ExtractAsync(uri);
                collection.Add(metadata);
            }
            catch (Exception ex)
            {
                Console.WriteLine($"Url: {uri}. Message: {ex.Message}");
            }

            return true;
        });

        foreach (var m in collection)
        {
            Console.WriteLine($"{m.Title}, {m.Description}");
        }

        Console.Write("OK");
        Console.ReadKey();
    }

    private static IReadOnlyCollection<Uri> Generate(IReadOnlyCollection<Uri> links)
    {
        var result = new List<Uri>();

        foreach (var link in links)
        {
            result.Add(new Uri($"{link}?cb={Guid.NewGuid()}"));
        }

        return result;
    }

    private static Task ForEach<T>(IEnumerable<T> items, Func<T, Task<bool>> action)
    {
        var tasks = items.Select(action).ToList();
        return Task.WhenAll(tasks);
    }
}
using X.Web.MetaExtractor;
using X.Web.MetaExtractor.ContentLoaders.Flurl;
using X.Web.MetaExtractor.LanguageDetectors;

IPageContentLoader contentLoader = new FlurlPageContentLoader();
ILanguageDetector languageDetector = new LanguageDetector();
string defaultImage = "https://example.com/example.jpg";

// Create an instance of the Extractor
IExtractor extractor = new Extractor(defaultImage, contentLoader, languageDetector);

// Extract meta information from a URL
var uri = new Uri("https://andrew.gubskiy.com/content/item/about");
var metaInfo = await extractor.ExtractAsync(uri);

Console.Clear();

// Display the extracted meta information
Console.WriteLine($"Url: {metaInfo.Url}");
Console.WriteLine($"Title: {metaInfo.Title}");
Console.WriteLine($"Description: {metaInfo.Description}");
Console.WriteLine($"Keywords: {string.Join(", ", metaInfo.Keywords)}");
Console.WriteLine($"Image: {metaInfo.Images.FirstOrDefault()}");
Console.WriteLine($"Language: {metaInfo.Language}");
@@ -1,10 +1,16 @@
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net6.0</TargetFramework>
    <LangVersion>default</LangVersion>
  </PropertyGroup>
  <ItemGroup>
    <ProjectReference Include="..\..\src\X.Web.MetaExtractor\X.Web.MetaExtractor.csproj" />
  </ItemGroup>
</Project>

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFramework>net8.0</TargetFramework>
    <ImplicitUsings>enable</ImplicitUsings>
    <Nullable>enable</Nullable>
    <LangVersion>default</LangVersion>
  </PropertyGroup>

  <ItemGroup>
    <ProjectReference Include="..\..\src\X.Web.MetaExtractor.ContentLoaders.Flurl\X.Web.MetaExtractor.ContentLoaders.Flurl.csproj" />
    <ProjectReference Include="..\..\src\X.Web.MetaExtractor\X.Web.MetaExtractor.csproj" />
  </ItemGroup>

</Project>
@@ -0,0 +1,17 @@
using System;
using System.Threading.Tasks;
using Flurl.Http;
using JetBrains.Annotations;

namespace X.Web.MetaExtractor.ContentLoaders.Flurl;

[PublicAPI]
public class FlurlPageContentLoader : IPageContentLoader
{
    public async Task<string> LoadPageContentAsync(Uri uri)
    {
        var html = await uri.ToString().GetStringAsync();

        return html;
    }
}