Commit

post: week8 blog
afrid18 committed Jul 30, 2023
1 parent 57f722e commit 404261c
Showing 2 changed files with 134 additions and 0 deletions.
134 changes: 134 additions & 0 deletions content/post/week8-at-GSoC/index.md
@@ -0,0 +1,134 @@
---
title: "GSoC: Week8"
date: 2023-07-30T12:54:09+05:30
Description: "In this blog, I would like to share my progress of Google Summer of Code 2023, for week 8"
thumbnail: "images/post11/week8.png"
Tags: ["OpenSUSE", "Open Source", "Google", "Summer of Code"]
Categories: ["Open source", "Programming"]
Series:
- "gsoc-weekly-report"
---

Hello, welcome back again. In this blog post, I would like to share my week 8 progress on Google Summer of Code with OpenSUSE for the RPMLint project.

As mentioned in the previous blog post, I started to work on `DuplicatesCheck.py`. Here is a detailed overview of my progress.

## Progress [########........]

For this week, I decided to focus on `DuplicatesCheck.py`. However, I soon realized that the tests required some additional capabilities of the `FakePkg` class.

Here is the current helper, built on `FakePkg`, that we use to create a `mockPkg`:

```python3
def get_tested_mock_package(files=None, real_files=None, header=None):
    mockPkg = FakePkg('mockPkg')
    if files is not None:
        mockPkg.create_files(files, real_files)
    if header is not None:
        mockPkg.add_header(header)
    return mockPkg
```

The `header` argument alone doesn't provide all the information that some tests require. Certain tests need much more data, which cannot be passed directly through the `header` parameter.

For the current discussion of `DuplicatesCheck`, the test function I am considering is `test_unexpanded_macros` in the file `test_files.py`. This test needs `md5` hash values, that is, the hashes of the files contained in a binary RPM file.

To learn more about **MD5**, follow this wiki [link](https://en.wikipedia.org/wiki/MD5).

For example, consider the test function:

```python3
@pytest.mark.parametrize('package', ['binary/duplicates'])
def test_duplicates(tmp_path, package, duplicatescheck):
    output, test = duplicatescheck
    test.check(get_tested_package(package, tmp_path))
    out = output.print_results(output.results)

    assert 'E: hardlink-across-partition /var/foo /etc/foo' in out
    assert 'E: hardlink-across-config-files /var/foo2 /etc/foo2' in out
    assert 'W: files-duplicate /etc/bar3 /etc/bar:/etc/bar2' in out
    assert 'W: files-duplicate /etc/strace2.txt /etc/strace1.txt' in out
    assert 'W: files-duplicate /etc/small2 /etc/small' not in out
    assert 'E: files-duplicated-waste 270544' in out
```

Here, the binary package we are checking is `test/binary/duplicates-0-0.x86_64.rpm`. The md5 hashes of all its files are used to find duplicates, since duplicate files share the same hash. The `rpm` query below lists all the files in this binary RPM:

```bash
$ rpm -qlp test/binary/duplicates-0-0.x86_64.rpm

/etc/bar
/etc/bar2
/etc/bar3
/etc/foo
/etc/foo2
/etc/small
/etc/small2
/etc/strace1.txt
/etc/strace2.txt
/var/foo
/var/foo2
```

After hashing these files with md5 during the pytest run, I obtained their hash values. Because the package contains duplicate files, identical hash values appear. They are stored in a key-value data structure (a dictionary), where each hash is a key and its value is the set of files with that hash. See the sample output below:

```python3
==> (Pdb) p md5s

{
    'b3ab937fbdc55ae7bf96749074e816056f0605491d419f9f5b97dc00c8c04aae':
    {
        <rpmlint.pkgfile.PkgFile object at 0x7f69c1189430>,
        <rpmlint.pkgfile.PkgFile object at 0x7f69c1189640>,
        <rpmlint.pkgfile.PkgFile object at 0x7f69c11894e0>
    },
    'bc1a4e47244cdf6b4c4735453cf55503a995334f1735458ab2e3c01455e159e3':
    {
        <rpmlint.pkgfile.PkgFile object at 0x7f69c118a090>,
        <rpmlint.pkgfile.PkgFile object at 0x7f69c11897a0>
    },
    'e18b816e748b2366af6cb7281bcf8fca7f65be8a41e456e562f6acd8b267fc32':
    {
        <rpmlint.pkgfile.PkgFile object at 0x7f69c1189900>,
        <rpmlint.pkgfile.PkgFile object at 0x7f69c118a1f0>
    },
    '1618780f802ed0571225ec155527f82a0eaa540d16983c387baede6208ced745':
    {
        <rpmlint.pkgfile.PkgFile object at 0x7f69c1189f30>,
        <rpmlint.pkgfile.PkgFile object at 0x7f69c1189dd0>
    }
}
```

As shown, 9 files are hashed, and some of them share the same hash value. The `rpm` query, however, lists 11 files in total. This difference is due to very small files, which rpmlint ignores. The minimum file size is defined in the configuration file, and every file smaller than that limit is ignored.
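To make the grouping idea concrete, here is a minimal sketch of how files can be grouped by digest while skipping small ones. It is not rpmlint's actual code: `group_duplicates`, the `min_size` parameter, the file contents, and the `/etc/unique` path are all made up for illustration. I also assume SHA-256 as the digest, only because the values printed above are 64 hexadecimal characters long, which matches its output length.

```python3
import hashlib
from collections import defaultdict


def group_duplicates(files, min_size=0):
    """Group file paths by content digest, skipping files below min_size.

    `files` maps a path to its bytes content. It is a hypothetical
    stand-in for the files of a package, not an rpmlint structure.
    """
    groups = defaultdict(set)
    for path, content in files.items():
        if len(content) < min_size:
            # Too small to be worth reporting as a duplicate.
            continue
        # Assumption: SHA-256 is used here only for illustration.
        digest = hashlib.sha256(content).hexdigest()
        groups[digest].add(path)
    # Keep only the digests shared by more than one file.
    return {d: paths for d, paths in groups.items() if len(paths) > 1}


# Made-up contents: two large identical files, two tiny identical files,
# and one file that has no duplicate.
files = {
    '/etc/bar': b'x' * 100,
    '/etc/bar2': b'x' * 100,
    '/etc/small': b'tiny',
    '/etc/small2': b'tiny',
    '/etc/unique': b'y' * 100,
}
duplicates = group_duplicates(files, min_size=10)
for digest, paths in duplicates.items():
    print(digest[:8], sorted(paths))
```

With `min_size=10`, the two tiny files never reach the dictionary, so only the `/etc/bar` pair is reported. This is the same effect as the 11 versus 9 difference described above.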

And yes, I found these hash values using the Python Debugger (Pdb). They are stored in a variable named `md5s` at runtime in the `DuplicatesCheck.py` file. See the relevant line [Here].

I believe this would work, provided that we implement the `md5` variable within the `PkgFile` class and pass the header information when creating a mock package using `FakePkg`. I am not sure whether I should hard-code these hash values into the header object that is passed to the test function. I also tried creating real files using the `real_files=True` argument, but it didn't work.
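As a rough sketch of the hard-coding idea, a mock file could simply carry a digest supplied by the test, so nothing has to be hashed at all. Everything here is hypothetical: `MockFile` and `group_by_md5` are illustrative names, not part of rpmlint or `FakePkg`, and the sizes and the short digest strings are placeholders.

```python3
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MockFile:
    """Hypothetical mock file; this is not rpmlint's PkgFile."""
    name: str
    size: int
    md5: str  # digest supplied by the test, never computed


def group_by_md5(mock_files):
    """Group file names by their pre-supplied digest."""
    groups = defaultdict(set)
    for mock_file in mock_files:
        groups[mock_file.md5].add(mock_file.name)
    return dict(groups)


# Placeholder sizes and digests; a real test would use full hash values.
mock_files = [
    MockFile('/etc/strace1.txt', 1000, 'aaaa'),
    MockFile('/etc/strace2.txt', 1000, 'aaaa'),
    MockFile('/etc/foo', 500, 'bbbb'),
]
groups = group_by_md5(mock_files)
print(sorted(groups['aaaa']))
```

The trade-off is that the test then trusts the hard-coded digests instead of real file contents, which is exactly the part I am unsure about.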

## Misc.

In addition to working on `DuplicatesCheck.py`, I have also made some progress with `FilesCheck.py`. This also requires some more capabilities of `FakePkg`. However, I haven't explored all the possible ways to create files and test them yet. I plan to do that in the coming week.

As mentioned in my last [post], I will be visiting the SUSE office in Bangalore. I am planning to visit around the 3rd week of August, and I will share the exact date and other details on my LinkedIn page. Do follow me on <i class="fa-brands fa-linkedin"></i> [LinkedIn].

---

Links:
- [post]
- [md5 ref]
- [LinkedIn]
- [MD5](https://en.wikipedia.org/wiki/MD5)


[post]: /post/week7-at-gsoc/
[Here]: https://github.com/afrid18/rpmlint/blob/2494367319ad2603023aaa4ffd6a6c6330dca28d/rpmlint/checks/DuplicatesCheck.py#L31
[md5 ref]: https://github.com/afrid18/rpmlint/blob/2494367319ad2603023aaa4ffd6a6c6330dca28d/rpmlint/checks/DuplicatesCheck.py#L31
[LinkedIn]: https://www.linkedin.com/in/afridhussain/


<h1 style="text-align: center"> Thank You </h1>


___

Binary file added static/images/post11/week8.png
