Commit

Merge branch origin/java-support into master
dlidstrom committed Jul 6, 2024
2 parents b93e917 + d158d03 commit 441f467
Showing 4 changed files with 136 additions and 35 deletions.
122 changes: 87 additions & 35 deletions README.md
@@ -1,7 +1,15 @@
# Duplo (C/C++/Java Duplicate Source Code Block Finder) <!-- omit in toc -->
# Duplo - Duplicate Source Code Block Finder <!-- omit in toc -->

![C/C++ CI](https://github.com/dlidstrom/Duplo/workflows/C/C++%20CI/badge.svg)

**Updates:**

🔥 v0.8 adds improved Java support

🙌 Help needed! See [8.3](#83-additional-language-support) on how to support more languages.

**Table of Contents:**

- [1. General Information](#1-general-information)
- [2. Maintainer](#2-maintainer)
- [3. File Format Support](#3-file-format-support)
@@ -31,15 +39,15 @@
## 1. General Information

Duplicated source code blocks can harm maintainability of software systems.
Duplo is a tool to find duplicated code blocks in large code bases. Duplo has special support for some
programming languages, meaning it can filter out (multi-line) comments and compiler directives.
For example: C, C++, Java, C#, and VB.NET. Any other text format is also supported.
Duplo is a tool to find duplicated code blocks in large code bases. Duplo has
special support for some programming languages, meaning it can filter out
(multi-line) comments and compiler directives. For example: C, C++, Java, C#,
and VB.NET. Any other text format is also supported.

## 2. Maintainer

Duplo was originally developed by Christian
M. Ammann and is now maintained and developed by Daniel
Lidström.
Duplo was originally developed by Christian M. Ammann and is now maintained and
developed by Daniel Lidström.

## 3. File Format Support

@@ -53,14 +61,10 @@ file formats:
- GCC assembly
- Ada

This means that Duplo will remove
preprocessor directives, block comments, using
statements, etc, to only consider duplicates
in actual code.
In addition, Duplo can be used as a general
(without special support) duplicates detector
in arbitrary text files and will even detect
duplicates found in the same file.
This means that Duplo will remove preprocessor directives, block comments, using
statements, etc, to only consider duplicates in actual code. In addition, Duplo
can be used as a general (without special support) duplicates detector in
arbitrary text files and will even detect duplicates found in the same file.

Sample output snippet:

@@ -92,23 +96,29 @@ If you have Docker, the way to run Duplo is to use this command:
> docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo
```

This pulls the latest image and runs duplo. Note that you'll have to pipe the filenames into this command. A complete commandline sample will be shown below.
This pulls the latest image and runs duplo. Note that you'll have to pipe the
filenames into this command. A complete commandline sample will be shown below.

### 4.2. Pre-built binaries

Duplo is also available as a pre-built binary for (alpine) linux and macos. Grab the executable from the [releases](https://github.com/dlidstrom/Duplo/releases) page.
Duplo is also available as a pre-built binary for (alpine) linux and macos. Grab
the executable from the [releases](https://github.com/dlidstrom/Duplo/releases)
page.

You can of course build from source as well, and you'll have to do so to get a binary for Windows.
You can of course build from source as well, and you'll have to do so to get a
binary for Windows.

## 5. Usage

Duplo works with a list of files. You can either specify a file that contains the list of files, or you can pass them using `stdin`.
Duplo works with a list of files. You can either specify a file that contains
the list of files, or you can pass them using `stdin`.

Run `duplo --help` on the command line to see the detailed options.

### 5.1. Passing files using `stdin`

In each of the following commands, `duplo` will write the duplicated blocks into `out.txt` in addition to the information written to stdout.
In each of the following commands, `duplo` will write the duplicated blocks into
`out.txt` in addition to the information written to stdout.

#### 5.1.1. Bash

@@ -117,7 +127,13 @@ In each of the following commands, `duplo` will write the duplicated blocks into
> find . -type f \( -iname "*.cpp" -o -iname "*.h" \) | duplo - out.txt
```

Let's break this down. `find . -type f \( -iname "*.cpp" -o -iname "*.h" \)` is a syntax to look recursively in the current directory (the `.` part) for files (the `-type f` part) matching `*.cpp` or `*.h` (case insensitive). The output from `find` is piped into `duplo` which then reads the filenames from `stdin` (the `-` tells `duplo` to get the filenames from `stdin`, a common unix convention in many commandline applications). The result of the analysis is then written to `out.txt`.
Let's break this down. `find . -type f \( -iname "*.cpp" -o -iname "*.h" \)` is
a syntax to look recursively in the current directory (the `.` part) for files
(the `-type f` part) matching `*.cpp` or `*.h` (case insensitive). The output
from `find` is piped into `duplo` which then reads the filenames from `stdin`
(the `-` tells `duplo` to get the filenames from `stdin`, a common unix
convention in many commandline applications). The result of the analysis is then
written to `out.txt`.

#### 5.1.2. Windows

@@ -126,7 +142,8 @@ Let's break this down. `find . -type f \( -iname "*.cpp" -o -iname "*.h" \)` is
> Get-ChildItem -Include "*.cpp", "*.h" -Recurse | % { $_.FullName } | Duplo.exe - out.txt
```
This works similarly to the Bash command, but uses PowerShell commands to achieve the same effect.
This works similarly to the Bash command, but uses PowerShell commands to
achieve the same effect.
#### 5.1.3. Docker
@@ -135,9 +152,22 @@ This works similarly to the Bash command, but uses PowerShell commands to achiev
> find . -type f \( -iname "*.cpp" -or -iname "*.h" \) | docker run --rm -i -w /src -v $(pwd):/src dlidstrom/duplo - out.txt
```
This command also works in a similar fashion to the Bash command, but instead of piping into a local `duplo` executable, it will pipe into `duplo` running inside Docker. This is very convenient as you do not have to install `duplo` separately. You will have to install Docker though, if you haven't already. That is a good thing to do anyway, since it opens up a lot of possibilities apart from running `duplo`.
Again, similarly to the Bash command, this uses `find` to find files in the current directory, then passes the file list to Docker which will pass it further into an instance of the latest version of `duplo`. The working directory in the `duplo` container should be `/src` (that's where the `duplo` executable is located) and the current path of your host machine will be mapped to `/src` when the container is running. The `-i` allows `stdin` of your host machine to be passed into Docker to allow `duplo` to read the filenames. Any parameters to `duplo` can be placed at the end of the command as you can see `- out.txt` has been.
This command also works in a similar fashion to the Bash command, but instead of
piping into a local `duplo` executable, it will pipe into `duplo` running inside
Docker. This is very convenient as you do not have to install `duplo`
separately. You will have to install Docker though, if you haven't already. That
is a good thing to do anyway, since it opens up a lot of possibilities apart
from running `duplo`.
Again, similarly to the Bash command, this uses `find` to find files in the
current directory, then passes the file list to Docker which will pass it
further into an instance of the latest version of `duplo`. The working directory
in the `duplo` container should be `/src` (that's where the `duplo` executable
is located) and the current path of your host machine will be mapped to `/src`
when the container is running. The `-i` allows `stdin` of your host machine to
be passed into Docker to allow `duplo` to read the filenames. Any parameters to
`duplo` can be placed at the end of the command as you can see `- out.txt` has
been.
### 5.2. Passing files using file
@@ -161,18 +191,19 @@ Again, the duplicated blocks are written to `out.txt`.
### 5.3. Xml output
Duplo can also output xml and there is a stylesheet that will format the result for viewing in a browser. This can be used as a report tab in your continuous integration tool (TeamCity, etc).
Duplo can also output xml and there is a stylesheet that will format the result
for viewing in a browser. This can be used as a report tab in your continuous
integration tool (GitHub Actions, TeamCity, etc).
## 6. Feedback and Bug Reporting
Please open an issue to discuss feedback,
feature requests and bug reports.
Please open an issue to discuss feedback, feature requests and bug reports.
## 7. Algorithm Background
Duplo uses the same techniques as Duploc to detect duplicated code blocks. See
[Duca99bCodeDuplication](http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf) for
further information.
[Duca99bCodeDuplication](http://scg.unibe.ch/archive/papers/Duca99bCodeDuplication.pdf)
for further information.
### 7.1. Performance Measurements
@@ -213,12 +244,26 @@ Use Visual Studio 2019 to open the included solution file (or try `CMake`).
### 8.3. Additional Language Support
Duplo can analyze all text files regardless of format, but it has special support for some programming languages (C++, C#, Java, for example). This allows Duplo to improve the duplication detection as it can ignore preprocessor directives and/or comments.
Duplo can analyze all text files regardless of format, but it has special
support for some programming languages (C++, C#, Java, for example). This allows
Duplo to improve the duplication detection as it can ignore preprocessor
directives and/or comments.
To implement support for a new language, there are a couple of options:
To implement support for a new language, there are a couple of options (in order of complexity):
1. Implement `FileTypeBase` which has support for handling comments and
preprocessor directives. You just need to decide what is a comment. With this
option you need to implement a couple of methods, one which is
`CreateLineFilter`. This is to remove multiline comments. Look at
`CstyleCommentsFilter` for an example.
2. Implement `IFileType` interface directly. This gives you the most freedom but
also is the hardest option.
1. Implement `FileTypeBase` which has support for handling comments and preprocessor directives. You just need to decide what is a comment. With this option you need to implement a couple of methods, one which is `CreateLineFilter`. This is to remove multiline comments. Look at `CstyleCommentsFilter` for an example.
2. Implement `IFileType` interface directly. This gives you the most freedom but also is the hardest option of course.
You can see an example of how Java support was added effortlessly. It involves
copying an existing file type implementation and adjusting the lines that should
be filtered and how comments should be removed. Finally, add a few lines in
`FileTypeFactory.cpp` to choose the correct implementation based on the file
extension. Refer to [this commit](https://github.com/dlidstrom/Duplo/commit/320f9474354d41c3b35c178bb4b7f6c667025976) for all the details.
### 8.4. Language Suggestions
@@ -238,6 +283,8 @@ Send me a pull request!
## 9. Changes
- 0.8
- Add support for Java which was lost or never there in the first place
- 0.7
- Add support for Ada (thanks [@Knaldgas](https://github.com/Knaldgas)!)
- 0.6
@@ -264,7 +311,12 @@ For a pretty ui you should check out [duploq](https://github.com/duploq/duploq)
From duploq's Readme file:
> duploq's approach is a pretty straighforward. First, duploq allows you to choose where to look for the duplicates (files or folders). Then it builds list of input files and passes it to the Duplo engine together with necessary parameters. After the files have been processed, duploq parses Duplo's output and visualises the results in easy and intuitive way. Also it provides additional statistics information which is not a part of Duplo output.
> duploq's approach is a pretty straighforward. First, duploq allows you to
> choose where to look for the duplicates (files or folders). Then it builds
> list of input files and passes it to the Duplo engine together with necessary
> parameters. After the files have been processed, duploq parses Duplo's output
> and visualises the results in easy and intuitive way. Also it provides
> additional statistics information which is not a part of Duplo output.
## 11. License
3 changes: 3 additions & 0 deletions src/FileTypeFactory.cpp
@@ -5,6 +5,7 @@
#include "FileType_Unknown.h"
#include "FileType_VB.h"
#include "FileType_Ada.h"
#include "FileType_Java.h"
#include "StringUtil.h"

#include <algorithm>
@@ -26,6 +27,8 @@ IFileTypePtr FileTypeFactory::CreateFileType(
        fileType.reset(new FileType_VB(ignorePrepStuff, minChars));
    else if (ext == "ads" || ext == "adb")
        fileType.reset(new FileType_Ada(ignorePrepStuff, minChars));
    else if (ext == "java")
        fileType.reset(new FileType_Java(ignorePrepStuff, minChars));
    else
        fileType.reset(new FileType_Unknown(minChars));
    return fileType;
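
How `ext` is computed lies outside the hunk shown here, so whether a file named `Foo.JAVA` reaches the new `"java"` branch depends on code not visible in this commit view. The following is a minimal sketch, not taken from Duplo, of one way to derive a lower-cased extension before such a comparison chain. `LowerCaseExtension` and `CheckExtensionExamples` are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of Duplo): returns the lower-cased extension
// without the dot, or an empty string when the file name has no extension.
std::string LowerCaseExtension(const std::string& filename) {
    const std::string::size_type dot = filename.find_last_of('.');
    const std::string::size_type slash = filename.find_last_of("/\\");
    // A dot that appears before the last path separator belongs to a
    // directory name, not to the file name.
    if (dot == std::string::npos || (slash != std::string::npos && dot < slash))
        return "";
    std::string ext = filename.substr(dot + 1);
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return ext;
}

// Small self-check showing the intended behaviour of the helper.
void CheckExtensionExamples() {
    assert(LowerCaseExtension("src/Foo.JAVA") == "java");
    assert(LowerCaseExtension("pkg.ads") == "ads");
    assert(LowerCaseExtension("a.b/Makefile").empty());
}
```

With a helper along these lines, the `ext == "java"` comparison would match regardless of the case used in the file name.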
30 changes: 30 additions & 0 deletions src/FileType_Java.cpp
@@ -0,0 +1,30 @@
#include "FileType_Java.h"
#include "CstyleCommentsFilter.h"
#include "CstyleUtils.h"
#include "SourceLine.h"

#include <cstring>

FileType_Java::FileType_Java(bool ignorePrepStuff, unsigned minChars)
    : FileTypeBase(ignorePrepStuff, minChars) {
}

ILineFilterPtr FileType_Java::CreateLineFilter() const {
    return std::make_shared<CstyleCommentsLineFilter>();
}

std::string FileType_Java::GetCleanLine(const std::string& line) const {
    return CstyleUtils::RemoveSingleLineComments(line);
}

bool FileType_Java::IsPreprocessorDirective(const std::string& line) const {
    // look for other markers to avoid
    const char* markers[] = { "package", "import", "private", "protected", "public" };

    for (auto v : markers) {
        if (line.find(v, 0, std::strlen(v)) != std::string::npos)
            return true;
    }

    return false;
}
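
A note on `IsPreprocessorDirective` above: `std::string::find` called with a start position of 0 searches the whole line, so a marker matches wherever it occurs, not only at the beginning. A line such as `int importantValue = 1;` contains `import` and would therefore be skipped. Whether that is intended is not stated in this commit. The sketch below is not part of the commit; it contrasts the substring test with a hypothetical prefix-only variant. `ContainsMarker`, `StartsWithMarker`, `kMarkers`, and `CheckMarkerExamples` are illustrative names, and whether Duplo trims leading whitespace before calling this method is an assumption.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <cstring>
#include <string>

// Markers copied from FileType_Java::IsPreprocessorDirective in this commit.
static const char* const kMarkers[] = {"package", "import", "private", "protected", "public"};

// Same test as the committed code: true if a marker occurs anywhere in the line.
bool ContainsMarker(const std::string& line) {
    for (const char* marker : kMarkers) {
        if (line.find(marker, 0, std::strlen(marker)) != std::string::npos)
            return true;
    }
    return false;
}

// Hypothetical alternative: true only if the line, after leading whitespace,
// begins with a marker that is followed by a non-identifier character
// (or by the end of the line).
bool StartsWithMarker(const std::string& line) {
    const std::size_t start = line.find_first_not_of(" \t");
    if (start == std::string::npos)
        return false;
    for (const char* marker : kMarkers) {
        const std::size_t len = std::strlen(marker);
        if (line.compare(start, len, marker) != 0)
            continue;
        const std::size_t end = start + len;
        if (end == line.size())
            return true;
        const unsigned char next = static_cast<unsigned char>(line[end]);
        if (!std::isalnum(next) && next != '_')
            return true;
    }
    return false;
}

// Small self-check illustrating where the two predicates differ.
void CheckMarkerExamples() {
    assert(ContainsMarker("int importantValue = 1;"));     // "import" occurs mid-line
    assert(!StartsWithMarker("int importantValue = 1;"));  // line does not start with a marker
    assert(StartsWithMarker("    public class Foo {"));    // leading whitespace is skipped
}
```

Note that both variants skip every declaration that begins with an access modifier, such as `public void run() {`, so those lines never take part in duplicate detection.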
16 changes: 16 additions & 0 deletions src/FileType_Java.h
@@ -0,0 +1,16 @@
#ifndef _FILETYPE_JAVA_H_
#define _FILETYPE_JAVA_H_

#include "FileTypeBase.h"

struct FileType_Java : public FileTypeBase {
    FileType_Java(bool ignorePrepStuff, unsigned minChars);

    ILineFilterPtr CreateLineFilter() const override;

    std::string GetCleanLine(const std::string& line) const override;

    bool IsPreprocessorDirective(const std::string& line) const override;
};

#endif
