feat(semanticTextSim): Semantic Text sim algorithm using doc2vec
Showing 439 changed files with 65,448 additions and 3 deletions.
Empty file.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,73 @@ | ||
#!/usr/bin/env python3 | ||
""" | ||
Copyright 2019 Ayush Bhardwaj (classicayush@gmail.com) | ||
SPDX-License-Identifier: GPL-2.0 | ||
This program is free software; you can redistribute it and/or | ||
modify it under the terms of the GNU General Public License | ||
version 2 as published by the Free Software Foundation. | ||
This program is distributed in the hope that it will be useful, | ||
but WITHOUT ANY WARRANTY; without even the implied warranty of | ||
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | ||
GNU General Public License for more details. | ||
You should have received a copy of the GNU General Public License along | ||
with this program; if not, write to the Free Software Foundation, Inc., | ||
51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA. | ||
""" | ||
import gensim | ||
import os | ||
import argparse | ||
import code_comment | ||
from gensim.models.doc2vec import Doc2Vec | ||
from atarashi.libs.commentPreprocessor import CommentPreprocessor | ||
|
||
__author__ = "Ayush Bhardwaj" | ||
__email__ = "classicayush@gmail.com" | ||
|
||
temp = os.path.dirname(os.path.abspath(__file__)) | ||
path = os.path.join(temp, 'spdxDoc2Vec.model') | ||
|
||
def semanticTextSim(filePath): | ||
    ''' | ||
    The function loads the trained model and returns the most similar doc to the input doc. | ||
    It preprocesses the file and extracts the comments from it, i.e. the license statements. | ||
    The doc is converted to a vector and the most similar doc (highest cosine similarity) is returned. | ||
    :param filePath: Input file path to scan | ||
    :return: result with license name, sim score, sim type and description | ||
    :rtype: list of dicts (JSON format) | ||
    ''' | ||
    commentFile = CommentPreprocessor.extract(filePath) | ||
    with open(commentFile) as file: | ||
        doc = file.read() | ||
    matches = [] | ||
|
||
    # Load the trained model | ||
    model = Doc2Vec.load(path) | ||
|
||
    # Infer a vector for the lowercased, whitespace-tokenized document | ||
    data = doc.lower().split() | ||
    vector = model.infer_vector(data) | ||
|
||
    # Find the most similar training documents, best match first | ||
    similar_doc = model.docvecs.most_similar([vector]) | ||
|
||
    matches.append({ | ||
        'shortname': similar_doc[0][0], | ||
        'sim_score': similar_doc[0][1], | ||
        'sim_type': "semanticTextSim", | ||
        'description': "" | ||
    }) | ||
|
||
    return matches | ||
|
||
if __name__ == '__main__': | ||
    parser = argparse.ArgumentParser() | ||
    parser.add_argument("inputFile", help="Specify the input file which needs to be scanned") | ||
|
||
    args = parser.parse_args() | ||
    filename = args.inputFile | ||
|
||
    print(semanticTextSim(filename)) |
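Note on the lookup in `semanticTextSim`: the function infers a vector for the scanned text and reports the training document with the highest cosine similarity. The sketch below illustrates that ranking step in plain Python, without gensim. It is not the library's implementation; the three-dimensional vectors, the license shortnames used as tags, and the helper names `cosine_similarity` and `rank_by_similarity` are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat it as dissimilar
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, tagged_vectors):
    """Return (tag, score) pairs sorted from most to least similar."""
    scored = [(tag, cosine_similarity(query, vec))
              for tag, vec in tagged_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical document vectors keyed by license shortname (toy data).
toy_vectors = {
    "0BSD": [0.9, 0.1, 0.0],
    "AFL-1.1": [0.1, 0.8, 0.3],
    "AAL": [0.2, 0.3, 0.9],
}

# Stand-in for the vector that infer_vector() would produce for a scanned file.
query_vector = [0.8, 0.2, 0.1]

ranked = rank_by_similarity(query_vector, toy_vectors)
best_shortname, best_score = ranked[0]
print(best_shortname, round(best_score, 3))
```

The first entry of `ranked` plays the role of `similar_doc[0]` in the function above: a (tag, score) pair for the closest document. Because the top match is always returned, a caller may want to compare `best_score` against a threshold before trusting the result.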
Binary file not shown.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,12 @@ | ||
Copyright (C) 2006 by Rob Landley <rob@landley.net> | ||
|
||
Permission to use, copy, modify, and/or distribute this software for any purpose | ||
with or without fee is hereby granted. | ||
|
||
THE SOFTWARE IS PROVIDED "AS IS" AND THE AUTHOR DISCLAIMS ALL WARRANTIES WITH | ||
REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY | ||
AND FITNESS. IN NO EVENT SHALL THE AUTHOR BE LIABLE FOR ANY SPECIAL, DIRECT, | ||
INDIRECT, OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM | ||
LOSS OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE | ||
OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR | ||
PERFORMANCE OF THIS SOFTWARE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
This Program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 2 of the License. | ||
|
||
This Program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. | ||
|
||
You should have received a copy of the GNU General Public License along with this Program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA. | ||
|
||
In addition, as a special exception, Red Hat, Inc. gives You the additional right to link the code of this Program with code not covered under the GNU General Public License ("Non-GPL Code") and to distribute linked combinations including the two, subject to the limitations in this paragraph. Non-GPL Code permitted under this exception must only link to the code of this Program through those well defined interfaces identified in the file named EXCEPTION found in the source code files (the "Approved Interfaces"). The files of Non-GPL Code may instantiate templates or use macros or inline functions from the Approved Interfaces without causing the resulting work to be covered by the GNU General Public License. Only Red Hat, Inc. may make changes or additions to the list of Approved Interfaces. You must obey the GNU General Public License in all respects for all of the Program code and other code used in conjunction with the Program except the Non-GPL Code covered by this exception. If you modify this file, you may extend this exception to your version of the file, but you are not obligated to do so. If you do not wish to provide this exception without modification, you must delete this exception statement from your version and license this file solely under the GPL without exception. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
Attribution Assurance License Copyright (c) 2002 by AUTHOR PROFESSIONAL IDENTIFICATION | ||
* URL "PROMOTIONAL SLOGAN FOR AUTHOR'S PROFESSIONAL PRACTICE" | ||
|
||
All Rights Reserved ATTRIBUTION ASSURANCE LICENSE (adapted from the original | ||
BSD license) | ||
|
||
Redistribution and use in source and binary forms, with or without modification, | ||
are permitted provided that the conditions below are met. These conditions | ||
require a modest attribution to <AUTHOR> (the "Author"), who hopes that its | ||
promotional value may help justify the thousands of dollars in otherwise billable | ||
time invested in writing this and other freely available, open-source software. | ||
|
||
1. Redistributions of source code, in whole or part and with or without modification | ||
(the "Code"), must prominently display this GPG-signed text in verifiable | ||
form. | ||
|
||
2. Redistributions of the Code in binary form must be accompanied by this | ||
GPG-signed text in any documentation and, each time the resulting executable | ||
program or a program dependent thereon is launched, a prominent display (e.g., | ||
splash screen or banner text) of the Author's attribution information, which | ||
includes: | ||
|
||
(a) Name ("AUTHOR"), | ||
|
||
(b) Professional identification ("PROFESSIONAL IDENTIFICATION"), and | ||
|
||
(c) URL ("URL"). | ||
|
||
3. Neither the name nor any trademark of the Author may be used to endorse | ||
or promote products derived from this software without specific prior written | ||
permission. | ||
|
||
4. Users are entirely responsible, to the exclusion of the Author and any | ||
other persons, for compliance with (1) regulations set by owners or administrators | ||
of employed equipment, (2) licensing terms of any other software, and (3) | ||
local regulations regarding use, including those regarding import, export, | ||
and use of encryption software. | ||
|
||
THIS FREE SOFTWARE IS PROVIDED BY THE AUTHOR "AS IS" AND ANY EXPRESS OR IMPLIED | ||
WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY | ||
AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE | ||
AUTHOR OR ANY CONTRIBUTOR BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, | ||
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, | ||
EFFECTS OF UNAUTHORIZED OR MALICIOUS NETWORK ACCESS; PROCUREMENT OF SUBSTITUTE | ||
GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) | ||
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT | ||
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY | ||
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH | ||
DAMAGE. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,7 @@ | ||
This software code is made available "AS IS" without warranties of any kind. | ||
You may copy, display, modify and redistribute the software code either by | ||
itself or as incorporated into your code; provided that > you do not remove | ||
any proprietary notices. Your use of this software code is at your own risk | ||
and you waive any claim against Amazon Digital Services, Inc. or its affiliates | ||
with respect to your use of this software code. (c) 2006 Amazon Digital Services, | ||
Inc. or its affiliates. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,79 @@ | ||
Academic Free License | ||
|
||
Version 1.1 The Academic Free License applies to any original work of authorship | ||
(the "Original Work") whose owner (the "Licensor") has placed the following | ||
notice immediately following the copyright notice for the Original Work: | ||
|
||
"Licensed under the Academic Free License version 1.1." | ||
|
||
Grant of License. Licensor hereby grants to any person obtaining a copy of | ||
the Original Work ("You") a world-wide, royalty-free, non-exclusive, perpetual, | ||
non-sublicenseable license | ||
|
||
(1) to use, copy, modify, merge, publish, perform, distribute and/or sell | ||
copies of the Original Work and derivative works thereof, and | ||
|
||
(2) under patent claims owned or controlled by the Licensor that are embodied | ||
in the Original Work as furnished by the Licensor, to make, use, sell and | ||
offer for sale the Original Work and derivative works thereof, subject to | ||
the following conditions. | ||
|
||
Right of Attribution. Redistributions of the Original Work must reproduce | ||
all copyright notices in the Original Work as furnished by the Licensor, both | ||
in the Original Work itself and in any documentation and/or other materials | ||
provided with the distribution of the Original Work in executable form. | ||
|
||
Exclusions from License Grant. Neither the names of Licensor, nor the names | ||
of any contributors to the Original Work, nor any of their trademarks or service | ||
marks, may be used to endorse or promote products derived from this Original | ||
Work without express prior written permission of the Licensor. | ||
|
||
WARRANTY AND DISCLAIMERS. LICENSOR WARRANTS THAT THE COPYRIGHT IN AND TO THE | ||
ORIGINAL WORK IS OWNED BY THE LICENSOR OR THAT THE ORIGINAL WORK IS DISTRIBUTED | ||
BY LICENSOR UNDER A VALID CURRENT LICENSE FROM THE COPYRIGHT OWNER. EXCEPT | ||
AS EXPRESSLY STATED IN THE IMMEDIATELY PRECEEDING SENTENCE, THE ORIGINAL WORK | ||
IS PROVIDED UNDER THIS LICENSE ON AN "AS IS" BASIS, WITHOUT WARRANTY, EITHER | ||
EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, THE WARRANTY OF NON-INFRINGEMENT | ||
AND WARRANTIES THAT THE ORIGINAL WORK IS MERCHANTABLE OR FIT FOR A PARTICULAR | ||
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY OF THE ORIGINAL WORK IS WITH YOU. | ||
THIS DISCLAIMER OF WARRANTY CONSTITUTES AN ESSENTIAL PART OF THIS LICENSE. | ||
NO LICENSE TO ORIGINAL WORK IS GRANTED HEREUNDER EXCEPT UNDER THIS DISCLAIMER. | ||
|
||
LIMITATION OF LIABILITY. UNDER NO CIRCUMSTANCES AND UNDER NO LEGAL THEORY, | ||
WHETHER TORT (INCLUDING NEGLIGENCE), CONTRACT, OR OTHERWISE, SHALL THE LICENSOR | ||
BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR | ||
CONSEQUENTIAL DAMAGES OF ANY CHARACTER ARISING AS A RESULT OF THIS LICENSE | ||
OR THE USE OF THE ORIGINAL WORK INCLUDING, WITHOUT LIMITATION, DAMAGES FOR | ||
LOSS OF GOODWILL, WORK STOPPAGE, COMPUTER FAILURE OR MALFUNCTION, OR ANY AND | ||
ALL OTHER COMMERCIAL DAMAGES OR LOSSES, EVEN IF SUCH PERSON SHALL HAVE BEEN | ||
INFORMED OF THE POSSIBILITY OF SUCH DAMAGES. THIS LIMITATION OF LIABILITY | ||
SHALL NOT APPLY TO LIABILITY FOR DEATH OR PERSONAL INJURY RESULTING FROM SUCH | ||
PARTY'S NEGLIGENCE TO THE EXTENT APPLICABLE LAW PROHIBITS SUCH LIMITATION. | ||
SOME JURISDICTIONS DO NOT ALLOW THE EXCLUSION OR LIMITATION OF INCIDENTAL | ||
OR CONSEQUENTIAL DAMAGES, SO THIS EXCLUSION AND LIMITATION MAY NOT APPLY TO | ||
YOU. | ||
|
||
License to Source Code. The term "Source Code" means the preferred form of | ||
the Original Work for making modifications to it and all available documentation | ||
describing how to access and modify the Original Work. Licensor hereby agrees | ||
to provide a machine-readable copy of the Source Code of the Original Work | ||
along with each copy of the Original Work that Licensor distributes. Licensor | ||
reserves the right to satisfy this obligation by placing a machine-readable | ||
copy of the Source Code in an information repository reasonably calculated | ||
to permit inexpensive and convenient access by You for as long as Licensor | ||
continues to distribute the Original Work, and by publishing the address of | ||
that information repository in a notice immediately following the copyright | ||
notice that applies to the Original Work. | ||
|
||
Mutual Termination for Patent Action. This License shall terminate automatically | ||
and You may no longer exercise any of the rights granted to You by this License | ||
if You file a lawsuit in any court alleging that any OSI Certified open source | ||
software that is licensed under any license containing this "Mutual Termination | ||
for Patent Action" clause infringes any patent claims that are essential to | ||
use that software. | ||
|
||
This license is Copyright (C) 2002 Lawrence E. Rosen. All rights reserved. | ||
|
||
Permission is hereby granted to copy and distribute this license without modification. | ||
This license may not be modified without the express written permission of | ||
its copyright owner. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
Academic Free License | ||
|
||
Version 1.2 This Academic Free License applies to any original work of authorship | ||
(the "Original Work") whose owner (the "Licensor") has placed the following | ||
notice immediately following the copyright notice for the Original Work: | ||
|
||
Licensed under the Academic Free License version 1.2 | ||
|
||
Grant of License. Licensor hereby grants to any person obtaining a copy of | ||
the Original Work ("You") a world-wide, royalty-free, non-exclusive, perpetual, | ||
non-sublicenseable license (1) to use, copy, modify, merge, publish, perform, | ||
distribute and/or sell copies of the Original Work and derivative works thereof, | ||
and (2) under patent claims owned or controlled by the Licensor that are embodied | ||
in the Original Work as furnished by the Licensor, to make, use, sell and | ||
offer for sale the Original Work and derivative works thereof, subject to | ||
the following conditions. | ||
|
||
Attribution Rights. You must retain, in the Source Code of any Derivative | ||
Works that You create, all copyright, patent or trademark notices from the | ||
Source Code of the Original Work, as well as any notices of licensing and | ||
any descriptive text identified therein as an "Attribution Notice." You must | ||
cause the Source Code for any Derivative Works that You create to carry a | ||
prominent Attribution Notice reasonably calculated to inform recipients that | ||
You have modified the Original Work. | ||
|
||
Exclusions from License Grant. Neither the names of Licensor, nor the names | ||
of any contributors to the Original Work, nor any of their trademarks or service | ||
marks, may be used to endorse or promote products derived from this Original | ||
Work without express prior written permission of the Licensor. | ||
|
||
Warranty and Disclaimer of Warranty. Licensor warrants that the copyright | ||
in and to the Original Work is owned by the Licensor or that the Original | ||
Work is distributed by Licensor under a valid current license from the copyright | ||
owner. Except as expressly stated in the immediately proceeding sentence, | ||
the Original Work is provided under this License on an "AS IS" BASIS and WITHOUT | ||
WARRANTY, either express or implied, including, without limitation, the warranties | ||
of NON-INFRINGEMENT, MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. | ||
THE ENTIRE RISK AS TO THE QUALITY OF THE ORIGINAL WORK IS WITH YOU. This DISCLAIMER | ||
OF WARRANTY constitutes an essential part of this License. No license to Original | ||
Work is granted hereunder except under this disclaimer. | ||
|
||
Limitation of Liability. Under no circumstances and under no legal theory, | ||
whether in tort (including negligence), contract, or otherwise, shall the | ||
Licensor be liable to any person for any direct, indirect, special, incidental, | ||
or consequential damages of any character arising as a result of this License | ||
or the use of the Original Work including, without limitation, damages for | ||
loss of goodwill, work stoppage, computer failure or malfunction, or any and | ||
all other commercial damages or losses. This limitation of liability shall | ||
not apply to liability for death or personal injury resulting from Licensor's | ||
negligence to the extent applicable law prohibits such limitation. Some jurisdictions | ||
do not allow the exclusion or limitation of incidental or consequential damages, | ||
so this exclusion and limitation may not apply to You. | ||
|
||
License to Source Code. The term "Source Code" means the preferred form of | ||
the Original Work for making modifications to it and all available documentation | ||
describing how to modify the Original Work. Licensor hereby agrees to provide | ||
a machine-readable copy of the Source Code of the Original Work along with | ||
each copy of the Original Work that Licensor distributes. Licensor reserves | ||
the right to satisfy this obligation by placing a machine-readable copy of | ||
the Source Code in an information repository reasonably calculated to permit | ||
inexpensive and convenient access by You for as long as Licensor continues | ||
to distribute the Original Work, and by publishing the address of that information | ||
repository in a notice immediately following the copyright notice that applies | ||
to the Original Work. | ||
|
||
Mutual Termination for Patent Action. This License shall terminate automatically | ||
and You may no longer exercise any of the rights granted to You by this License | ||
if You file a lawsuit in any court alleging that any OSI Certified open source | ||
software that is licensed under any license containing this "Mutual Termination | ||
for Patent Action" clause infringes any patent claims that are essential to | ||
use that software. | ||
|
||
Right to Use. You may use the Original Work in all ways not otherwise restricted | ||
or conditioned by this License or by law, and Licensor promises not to interfere | ||
with or be responsible for such uses by You. | ||
|
||
This license is Copyright (C) 2002 Lawrence E. Rosen. All rights reserved. | ||
|
||
Permission is hereby granted to copy and distribute this license without modification. | ||
This license may not be modified without the express written permission of | ||
its copyright owner. |