error: redefinition of group name 'm' as group 5; was group 2 at position 116 #54

Closed
kinoute opened this issue Jun 6, 2022 · 4 comments
Labels
bug Something isn't working


kinoute commented Jun 6, 2022

Hello there,

Thanks for this great project! I encountered a problem while crawling different websites and trying to extract dates with this package, specifically on this URL: https://osmh.dev

Here is the error using IPython and Python 3.8.12:

# works
In [3]: from htmldate import find_date

In [4]: find_date("https://osmh.dev")
Out[4]: '2020-11-29'

# doesn't work
In [6]: find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')

The last example throws an error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-6-9988648ad55b> in <module>
----> 1 find_date("https://osmh.dev", extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in find_date(htmlobject, extensive_search, original_date, outputformat, url, verbose, min_date, max_date)
    653
    654     # try time elements
--> 655     time_result = examine_time_elements(
    656         search_tree, outputformat, extensive_search, original_date, min_date, max_date
    657     )

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in examine_time_elements(tree, outputformat, extensive_search, original_date, min_date, max_date)
    389                         return attempt
    390                 else:
--> 391                     reference = compare_reference(reference, elem.get('datetime'), outputformat, extensive_search, original_date, min_date, max_date)
    392                     if reference > 0:
    393                         break

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/core.py in compare_reference(reference, expression, outputformat, extensive_search, original_date, min_date, max_date)
    300     attempt = try_expression(expression, outputformat, extensive_search, min_date, max_date)
    301     if attempt is not None:
--> 302         return compare_values(reference, attempt, outputformat, original_date)
    303     return reference
    304

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/site-packages/htmldate/validators.py in compare_values(reference, attempt, outputformat, original_date)
    110 def compare_values(reference, attempt, outputformat, original_date):
    111     """Compare the date expression to a reference"""
--> 112     timestamp = time.mktime(datetime.datetime.strptime(attempt, outputformat).timetuple())
    113     if original_date is True and (reference == 0 or timestamp < reference):
    114         reference = timestamp

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime_datetime(cls, data_string, format)
    566     """Return a class cls instance based on the input string and the
    567     format string."""
--> 568     tt, fraction, gmtoff_fraction = _strptime(data_string, format)
    569     tzname, gmtoff = tt[-2:]
    570     args = tt[:6] + (fraction,)

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in _strptime(data_string, format)
    331         if not format_regex:
    332             try:
--> 333                 format_regex = _TimeRE_cache.compile(format)
    334             # KeyError raised when a bad format is found; can be specified as
    335             # \\, in which case it was a stray % but with a space after it

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/_strptime.py in compile(self, format)
    261     def compile(self, format):
    262         """Return a compiled re object for the format string."""
--> 263         return re_compile(self.pattern(format), IGNORECASE)
    264
    265 _cache_lock = _thread_allocate_lock()

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in compile(pattern, flags)
    250 def compile(pattern, flags=0):
    251     "Compile a regular expression pattern, returning a Pattern object."
--> 252     return _compile(pattern, flags)
    253
    254 def purge():

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/re.py in _compile(pattern, flags)
    302     if not sre_compile.isstring(pattern):
    303         raise TypeError("first argument must be string or compiled pattern")
--> 304     p = sre_compile.compile(pattern, flags)
    305     if not (flags & DEBUG):
    306         if len(_cache) >= _MAXCACHE:

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--> 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in parse(str, flags, state)
    946
    947     try:
--> 948         p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
    949     except Verbose:
    950         # the VERBOSE flag was switched on inside the pattern.  to be

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse_sub(source, state, verbose, nested)
    441     start = source.tell()
    442     while True:
--> 443         itemsappend(_parse(source, state, verbose, nested + 1,
    444                            not nested and not items))
    445         if not sourcematch("|"):

/opt/homebrew/Caskroom/miniconda/base/envs/osint-crawler/lib/python3.8/sre_parse.py in _parse(source, state, verbose, nested, first)
    829                     group = state.opengroup(name)
    830                 except error as err:
--> 831                     raise source.error(err.msg, len(name) + 1) from None
    832             sub_verbose = ((verbose or (add_flags & SRE_FLAG_VERBOSE)) and
    833                            not (del_flags & SRE_FLAG_VERBOSE))

error: redefinition of group name 'm' as group 5; was group 2 at position 116
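
One possible reading of the traceback, offered as a sketch rather than a confirmed diagnosis: the format string contains `%m` (month) twice, and `strptime` turns each directive into a named regex group, so the second `%m` would redefine group `m`. The snippet below uses only the standard library; the sample value and the `try_parse` helper are illustrative assumptions, and `%M` (minute) is assumed to be what was intended in the time part.

```python
import re
from datetime import datetime

# Format from the report: "%m" (month) appears twice.
BAD_FORMAT = "%Y-%m-%d %H:%m:%S"
# Assumed intent: "%M" (minute) in the time part.
GOOD_FORMAT = "%Y-%m-%d %H:%M:%S"


def try_parse(value, fmt):
    """Return the parsed datetime, or the exception raised while parsing."""
    try:
        return datetime.strptime(value, fmt)
    except (re.error, ValueError) as exc:
        # The traceback above shows re.error on Python 3.8; ValueError is
        # caught as well in case another version reports the problem differently.
        return exc


sample = "2020-11-29 00:00:00"  # illustrative value, not taken from the page

print(repr(try_parse(sample, BAD_FORMAT)))
print(repr(try_parse(sample, GOOD_FORMAT)))
```

If this reading is right, the first call reports an error before any page content is involved, while the second parses normally.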
Owner

adbar commented Jun 7, 2022

Hi @kinoute, I cannot reproduce the bug; I think it has to do with your setup. The error log hints at another function, also named strptime, which interferes with datetime's strptime function.

Author

kinoute commented Jun 10, 2022

Here is a one-liner to reproduce the error using the vanilla official Python Docker image:

docker run --rm python:3.8.12 /bin/bash -c "pip3 install htmldate; python3 -c \"from htmldate import find_date; find_date('https://osmh.dev', extensive_search=False, outputformat='%Y-%m-%d %H:%m:%S')\""

Owner

adbar commented Jun 13, 2022

Thanks, I can see the problem now.

adbar added the bug label Jun 13, 2022
adbar added a commit that referenced this issue Jun 13, 2022
Owner

adbar commented Jun 13, 2022

@kinoute it's fixed, I will ship a new release very soon.

Please note that changing extraction granularity affects the result for the case you mention:

  • outputformat='%Y-%m-%d %H:%m:%S' does not return any date because of the datetime issue above (which I have no time to investigate further)
  • outputformat='%Y-%m-%d' returns the correct date
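
As a hedged illustration of how such a format string could be checked before it reaches `strptime`, the sketch below scans for directives that occur more than once. `find_duplicate_directives` is a hypothetical helper, not part of htmldate or the standard library, and it does not handle platform-specific flags such as `%-d`.

```python
import re
from collections import Counter


def find_duplicate_directives(fmt):
    """Return the directive letters that occur more than once in fmt.

    Hypothetical helper for illustration only.
    """
    # re.findall scans left to right without overlapping, so a literal "%%"
    # is consumed as a single match and is never mistaken for a directive.
    letters = [c for c in re.findall(r"%(.)", fmt) if c != "%"]
    counts = Counter(letters)
    return sorted(c for c, n in counts.items() if n > 1)


print(find_duplicate_directives("%Y-%m-%d %H:%m:%S"))  # ['m']
print(find_duplicate_directives("%Y-%m-%d %H:%M:%S"))  # []
```

The comparison is case-sensitive on purpose, since `%m` (month) and `%M` (minute) are different directives.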

adbar closed this as completed Jun 13, 2022