Skip to content

lexborisov/myhtml

Repository files navigation

MyHTML — a pure C HTML parser

Build Status

MyHTML is a fast HTML Parser using Threads implemented as a pure C99 library with no outside dependencies.

Now

Important announcement!

Please use the HTML parser from the Lexbor project. It is stable, has more features, and — yes — it's very fast.

Features

  • Asynchronous Parsing, Build Tree and Indexation
  • Fully conformant with the HTML5 specification
  • Two API - high and low-level
  • Manipulation of elements: add, change, delete and other
  • Manipulation of elements attributes: add, change, delete and other
  • Support 39 character encoding by specification encoding.spec.whatwg.org
  • Support detecting character encodings
  • Support Single Mode parsing
  • Support Build without POSIX Threads
  • Support for fragment parsing
  • Support for parsing by chunks
  • No outside dependencies
  • C99 support
  • Passes all tree construction tests from html5lib-tests
  • Tested by 1 billion HTML pages (by commoncrawl.org)

Changes

Please, see CHANGELOG.md file

Further developments

  • Modest — Modest is a fast HTML Render implemented as a pure C99 library with no outside dependencies
  • MyCSS — Fast C/C++ CSS Parser (Cascading Style Sheets Parser)

Support encodings for InputStream

X_USER_DEFINED, UTF_8, UTF_16LE, UTF_16BE, BIG5, EUC_KR, GB18030,
IBM866, ISO_8859_10, ISO_8859_13, ISO_8859_14, ISO_8859_15, ISO_8859_16, ISO_8859_2, ISO_8859_3,
ISO_8859_4, ISO_8859_5, ISO_8859_6, ISO_8859_7, ISO_8859_8, KOI8_R, KOI8_U, MACINTOSH,
WINDOWS_1250, WINDOWS_1251, WINDOWS_1252, WINDOWS_1253, WINDOWS_1254, WINDOWS_1255, WINDOWS_1256,
WINDOWS_1257, WINDOWS_1258, WINDOWS_874, X_MAC_CYRILLIC, ISO_2022_JP, GBK, SHIFT_JIS, EUC_JP, ISO_8859_8_I

Support encodings for output

Program working in UTF-8 and returns all in UTF-8

Detecting character encodings

Now it UTF-8, UTF-16LE, UTF16BE and russian windows-1251, koi8-r, iso-8859-5, x-mac-cyrillic, ibm866

Installation

See INSTALL.md

Introduction

Introduction

Benchmark

Dependencies

None

External Bindings and Wrappers

Examples

See examples directory

Simple example

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#include <myhtml/api.h>

int main(int argc, const char * argv[])
{
    char html[] = "<div><span>HTML</span></div>";
    
    // basic init
    myhtml_t* myhtml = myhtml_create();
    myhtml_init(myhtml, MyHTML_OPTIONS_DEFAULT, 1, 0);
    
    // first tree init
    myhtml_tree_t* tree = myhtml_tree_create();
    myhtml_tree_init(tree, myhtml);
    
    // parse html
    myhtml_parse(tree, MyENCODING_UTF_8, html, strlen(html));
    
    // print result
    // or see serialization function with callback: myhtml_serialization_tree_callback
    mycore_string_raw_t str = {0};
    myhtml_serialization_tree_buffer(myhtml_tree_get_document(tree), &str);
    printf("%s\n", str.data);
    
    // release resources
    mycore_string_raw_destroy(&str, false);
    myhtml_tree_destroy(tree);
    myhtml_destroy(myhtml);
    
    return 0;
}

AUTHOR

Alexander Borisov lex.borisov@gmail.com

COPYRIGHT AND LICENSE

Copyright (C) 2015-2018 Alexander Borisov

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

See the LICENSE file.