GitHub - alturkovic/domain-utils: Domain matching

Inspired by: https://github.com/whois-server-list/public-suffix-list

How would you extract the registered domain name from the following domains? How would you determine which part is the subdomain?

net.hr
user.net.hr
news.bbc.co.uk
commoncrawl.s3.amazonaws.com

domain-utils aims to solve these issues. It uses a rule list to determine how domains are constructed. The recommended rule list to use is Mozilla's Public Suffix List.

A "public suffix" is one under which Internet users can directly register names. Some examples of public suffixes are com, co.uk and pvt.k12.ma.us.

Using the library

Create a DomainRegistry using the DomainRegistryBuilder. There are several ways to do this; they are explained in the sections below.

API methods all return Optional results:

DomainRegistry.getPublicSuffix: extract the public suffix
DomainRegistry.getRegistrableName: extract the registrable domain name (one level under the public suffix)
DomainRegistry.getSubDomain: extract the subdomain
DomainRegistry.stripSubDomain: remove the subdomain if the domain is under a public suffix

Examples

Assuming you are using the suggested rule list from Mozilla:

alturkovic.blogspot.com

DomainRegistry.getPublicSuffix: blogspot.com
DomainRegistry.getRegistrableName: alturkovic
DomainRegistry.getSubDomain: Optional.empty
DomainRegistry.stripSubDomain: alturkovic.blogspot.com

en.wikipedia.org

DomainRegistry.getPublicSuffix: org
DomainRegistry.getRegistrableName: wikipedia
DomainRegistry.getSubDomain: en
DomainRegistry.stripSubDomain: wikipedia.org

alturkovic.invalid

DomainRegistry.getPublicSuffix: Optional.empty
DomainRegistry.getRegistrableName: Optional.empty
DomainRegistry.getSubDomain: Optional.empty
DomainRegistry.stripSubDomain: Optional.empty
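
The results listed above can be reproduced by hand with a longest-match lookup over the rule list. What follows is a minimal sketch, not the library's implementation: the class SuffixSketch, its method names, and the hard-coded RULES set are all hypothetical, and it ignores the wildcard and exception rules that Mozilla's list also contains.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Illustrative only: a hand-rolled stand-in, NOT the domain-utils code.
class SuffixSketch {
    // Hypothetical, hand-picked rules; a real list is far larger.
    static final Set<String> RULES =
            Set.of("com", "org", "hr", "net.hr", "uk", "co.uk", "blogspot.com");

    static String[] labels(String domain) {
        return domain.toLowerCase(Locale.ROOT).split("\\.");
    }

    // Joins labels[from..end] back into a dotted name.
    static String join(String[] labels, int from) {
        return String.join(".", Arrays.copyOfRange(labels, from, labels.length));
    }

    // Index of the first label of the longest matching rule, or -1 if none matches.
    // Candidates are tried from longest to shortest, so the longest rule wins.
    static int suffixStart(String[] labels) {
        for (int i = 0; i < labels.length; i++) {
            if (RULES.contains(join(labels, i))) {
                return i;
            }
        }
        return -1;
    }

    static Optional<String> publicSuffix(String domain) {
        String[] l = labels(domain);
        int s = suffixStart(l);
        return s < 0 ? Optional.empty() : Optional.of(join(l, s));
    }

    // The single label directly to the left of the public suffix.
    static Optional<String> registrableName(String domain) {
        String[] l = labels(domain);
        int s = suffixStart(l);
        return s < 1 ? Optional.empty() : Optional.of(l[s - 1]);
    }

    // Everything to the left of the registrable name.
    static Optional<String> subDomain(String domain) {
        String[] l = labels(domain);
        int s = suffixStart(l);
        return s < 2
                ? Optional.empty()
                : Optional.of(String.join(".", Arrays.copyOfRange(l, 0, s - 1)));
    }

    // Registrable name plus public suffix.
    static Optional<String> stripSubDomain(String domain) {
        String[] l = labels(domain);
        int s = suffixStart(l);
        return s < 1 ? Optional.empty() : Optional.of(join(l, s - 1));
    }

    public static void main(String[] args) {
        for (String d : new String[] {"user.net.hr", "news.bbc.co.uk", "alturkovic.invalid"}) {
            System.out.println(d + ": suffix=" + publicSuffix(d)
                    + ", name=" + registrableName(d)
                    + ", sub=" + subDomain(d)
                    + ", stripped=" + stripSubDomain(d));
        }
    }
}
```

Trying the longest candidate first is what makes blogspot.com win over com for alturkovic.blogspot.com. How the sketch treats a bare public suffix such as net.hr (no registrable name, nothing to strip) is a choice made here and may differ from the library.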

What about www?

The domain name must not include the "www." prefix; otherwise this library treats it as a subdomain.

Check out my URL handling project here to help you with that.
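
If you only need to drop the prefix before calling the registry, a few lines of plain JDK code are enough. This is a minimal sketch under that assumption: WwwSketch and its methods are hypothetical helpers, not part of domain-utils or of the URL handling project mentioned above.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Optional;

// Illustrative helper, not part of domain-utils.
class WwwSketch {
    private static final String WWW = "www.";

    // Lower-cases the host and removes one leading "www." label if present.
    static String stripWww(String host) {
        String lower = host.toLowerCase(Locale.ROOT);
        return lower.startsWith(WWW) ? lower.substring(WWW.length()) : lower;
    }

    // Extracts the host from a URL string and strips the prefix.
    // URI.getHost() returns null when there is no parsable host (e.g. mailto: URIs).
    static Optional<String> hostWithoutWww(String url) {
        return Optional.ofNullable(URI.create(url).getHost()).map(WwwSketch::stripWww);
    }

    public static void main(String[] args) {
        System.out.println(stripWww("www.wikipedia.org"));
        System.out.println(hostWithoutWww("https://www.wikipedia.org/wiki/Java"));
    }
}
```

The returned host can then be passed to the DomainRegistry methods described earlier.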

Importing into your project using Maven

Add the JitPack repository into your pom.xml.

<repositories>
  <repository>
    <id>jitpack.io</id>
    <url>https://jitpack.io</url>
  </repository>
</repositories>

Add the following under your <dependencies>:

<dependencies>
  <dependency>
    <groupId>com.github.alturkovic</groupId>
    <artifactId>domain-utils</artifactId>
    <version>1.1.0</version>
  </dependency>
</dependencies>

IDN

You can call the API methods with either UTF-8 domain names or Punycode-encoded ASCII domain names. Results are returned in the same format as the input: a UTF-8 input yields a UTF-8 result, and a Punycode input yields a Punycode result.
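
To see what the two formats look like, the JDK's java.net.IDN class converts between them. The sketch below is an assumption about how "same format as the input" could be achieved, not a description of the library internals; IdnSketch and inInputFormat are hypothetical names.

```java
import java.net.IDN;

// Illustrative only: shows the two IDN representations using the JDK.
class IdnSketch {
    // Unicode form -> Punycode (ASCII) form.
    static String toAscii(String domain) {
        return IDN.toASCII(domain);
    }

    // Punycode (ASCII) form -> Unicode form.
    static String toUnicode(String domain) {
        return IDN.toUnicode(domain);
    }

    // Returns an ASCII result in whatever format the caller's input used:
    // unchanged for pure-ASCII input, converted to Unicode otherwise.
    static String inInputFormat(String input, String asciiResult) {
        boolean asciiInput = input.chars().allMatch(c -> c < 128);
        return asciiInput ? asciiResult : IDN.toUnicode(asciiResult);
    }

    public static void main(String[] args) {
        String unicode = "b\u00fccher.de"; // "bücher.de", escaped to avoid source-encoding issues
        String ascii = toAscii(unicode);
        System.out.println(ascii); // xn--bcher-kva.de
        System.out.println(inInputFormat(ascii, ascii));
    }
}
```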

Keep Public Suffix List up to date

Using Maven build process

You can download the latest list as part of your Maven build:

    <plugin>
        <groupId>com.googlecode.maven-download-plugin</groupId>
        <artifactId>download-maven-plugin</artifactId>
        <version>1.6.3</version>
        <executions>
            <execution>
                <phase>generate-resources</phase>
                <goals>
                    <goal>wget</goal>
                </goals>
                <configuration>
                    <url>https://publicsuffix.org/list/effective_tld_names.dat</url>
                    <outputDirectory>${project.build.outputDirectory}</outputDirectory>
                </configuration>
            </execution>
        </executions>
    </plugin>

Then instantiate the DomainRegistryBuilder:

DomainRegistry registry = new DomainRegistryBuilder()
    .fromFile("/effective_tld_names.dat")
    .build();
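
For orientation, the downloaded effective_tld_names.dat is a plain-text file: lines starting with // are comments, blank lines are ignored, and every other line holds one rule, read up to the first whitespace. The sketch below shows how such lines could be reduced to a rule list. It is an illustration of the file format only; RuleFileSketch and parseRules are hypothetical and say nothing about how DomainRegistryBuilder reads the file.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: reduces Public Suffix List lines to their rules.
class RuleFileSketch {
    static List<String> parseRules(List<String> lines) {
        List<String> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            // Skip blank lines and "//" comments.
            if (trimmed.isEmpty() || trimmed.startsWith("//")) {
                continue;
            }
            // A rule ends at the first whitespace character.
            rules.add(trimmed.split("\\s+")[0]);
        }
        return rules;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "// ===BEGIN ICANN DOMAINS===",
                "com",
                "",
                "co.uk",
                "*.ck",      // wildcard rule
                "!www.ck");  // exception rule
        System.out.println(parseRules(sample));
    }
}
```

In a real project the lines could come from java.nio.file.Files.readAllLines on the downloaded file.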

By downloading it manually

DomainRegistry registry = new DomainRegistryBuilder()
    .from(new URL("https://publicsuffix.org/list/effective_tld_names.dat").openStream())
    .build();

By using the API call

This is the same as downloading it manually over HTTP.

DomainRegistry registry = new DomainRegistryBuilder()
    .withDefaultRules()
    .build();

Build your own rules

You can build the DomainRegistry using any rules you might need.

DomainRegistry registry = new DomainRegistryBuilder()
    .withRule("tld")
    .build();

Use case example

You have one set of rules for all org domains except for wikipedia.org.

You can create the following DomainRegistry:

DomainRegistry registry = new DomainRegistryBuilder()
    .withRule("org")
    .withRule("wikipedia.org")
    .build();

registry.getPublicSuffix("any.org"); // org
registry.getPublicSuffix("sub.any.org"); // org
registry.getPublicSuffix("wikipedia.org"); // wikipedia.org
registry.getPublicSuffix("en.wikipedia.org"); // wikipedia.org

The configured DomainRegistry can be used to see which ruleset should be applied depending on the domain name.
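
One way to act on the matched rule is to key a map of rulesets by it. The sketch below is self-contained, so it replaces the registry with a tiny stand-in: matchedRule mimics the getPublicSuffix results shown above for the two rules org and wikipedia.org, and the ruleset names are made up for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Illustrative only: dispatching on the matched rule.
class RulesetSketch {
    // Stand-in for the registry above; longer rules are listed first so they win.
    static final List<String> RULES = List.of("wikipedia.org", "org");

    // Hypothetical ruleset names keyed by the matched rule.
    static final Map<String, String> RULESETS = Map.of(
            "org", "default-org-rules",
            "wikipedia.org", "wikipedia-rules");

    // Mimics registry.getPublicSuffix(domain) for the two rules above.
    static Optional<String> matchedRule(String domain) {
        return RULES.stream()
                .filter(rule -> domain.equals(rule) || domain.endsWith("." + rule))
                .findFirst();
    }

    // Picks the ruleset for a domain, falling back when no rule matches.
    static String rulesetFor(String domain) {
        return matchedRule(domain).map(RULESETS::get).orElse("no-rules");
    }

    public static void main(String[] args) {
        for (String d : List.of("any.org", "en.wikipedia.org", "example.com")) {
            System.out.println(d + " -> " + rulesetFor(d));
        }
    }
}
```

With the real library, the call to matchedRule would be replaced by registry.getPublicSuffix(domain), which also returns an Optional.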
