New legacy id generator #4875

fmarco76 · 2024-10-10T10:28:37Z

Current implementation of legacy id generator using serial number has the problem to leave gaps when new ranges are generated because some conversion problem from decimal to hexadecimal range value. The gap could create problem to third party tool/service if they expected all no gaps are present in the serials.

A new id generator is introduced to solve the problem. To use the new id generator the following value should be used in CS.cfg:

dbs.cert.id.generator=newLegacy
dbs.request.id.generator=newLegacy

This can be be specified in pkispawn with the parameters:

pki_request_id_generator=newLegacy
pki_cert_id_generator=newLegacy

The remaining configuration are similar to the legacy generator. The main difference is that hexadecimal value must have the prefix 0x, if it is not present then the value is read as decimal.

The number serial written in CS.cfg during range update can be forced to be only decimal or hexadecimal setting the value dbs.numberRangeRadix to 10 or 16 respectively. If not set or negative then the default for the specific generator will be used.

fmarco76 · 2024-10-11T09:56:15Z

To upgrade the id generator from legacy to newLegacy, in case there are no clones, the CS.cfg values for certificate serial numbers should be updated to hex. In partitular:

the beginSerialNumber can be directly converted adding the 0x prefix while;
the endSerialNumber is calcalted adding the increment read as hex.

Finally, the value of nextRange in DS node ou=certificateRepository,ou=ca,<DN suffix> has to be converted from hex to decimal (legacy use hes without prefix). If the previous values have been updated and the id modified then it could be enough the command: pki-server ca-range-update

fmarco76 · 2024-10-11T13:46:08Z

@parrjd testing the fix for next range gaps in PR #4840 create problem with update. If fixed all the radix issues (there are few other operation errors) then the ranges do not match properly generating errors in deployed instances (it is OK for new CA). To avoid problems we are implementing this fix which provides a new generator so in cases instances have no problem with gaps there is no need to modify it.
Is this approach working in your case?

edewata

@fmarco76 I'm still reviewing this, but could you add a test here (before switching to RSNv3) to show how to fix an existing CA that has a sequential serial number gap?

So basically the test might look like this:

make the necessary changes to the CA
allocate new ranges
enroll certs to exhaust the current range
repeat step 2 & 3 if necessary
verify that no new gap is created

For simplicity it's not necessary to verify the config params or LDAP entries. The verification in step 5 should be sufficient. Thanks!

.github/workflows/ca-sequential-test.yml

parrjd · 2024-10-12T23:35:35Z

@fmarco76 sorry I did not see this earlier

I think forcing the values in the CS.cfg to be read as base 10 is a bad idea and will only cause more problems for existing installs. What we have currently in our prod environment is a hex number including characters a-f that are allowed in a hex string, and if it is forced to be read as a base 10 that will likely break on startup, or when ever that vaule is required to be read.

Using 100 as an example for both end and increment
100 = 0x64
256 = 0x100

if you force it to be read and used as a decimal and the CA has already issued 0x80 then it will jump to the next range starting at 200 which will overlap the previous range. The only safe way to switch to Dec would be to convert the Incremement and the end values to Decimal so increment and end would need to be changed in the CS.cfg

dbs.beginSerialNumber=1
dbs.endSerialNumber=10000000
dbs.serialCloneTransferNumber=10000

dbs.serialIncrement=10000000
dbs.serialLowWaterMark=2000000
dbs.serialCloneTransferNumber=10000
dbs.serialDN=ou=certificateRepository, ou=ca
dbs.serialRangeDN=ou=certificateRepository, ou=ranges
dbs.beginReplicaNumber=[pki_replica_number_range_start]
dbs.endReplicaNumber=[pki_replica_number_range_end]

As a side note:
dbs.beginRequestNumber=[pki_request_number_range_start]
dbs.endRequestNumber=[pki_request_number_range_end]
dbs.requestIncrement=10000000
dbs.requestLowWaterMark=2000000
dbs.requestCloneTransferNumber=10000
dbs.requestDN=ou=ca, ou=requests
dbs.requestRangeDN=ou=requests, ou=ranges

pkispawn values
pki_serial_number_range_start=
pki_serial_number_range_end=
pki_request_number_range_start=
pki_request_number_range_end=
pki_replica_number_range_start=
pki_replica_number_range_end=

For an install that did not override the begin and end for serial numbers CS.cfg with the pkispawn vars the first range has 268,435,456 possible serial numbers, and the increment has the same amount while the request only has 10,000,000 since the request is done in base 10. as long as range management is enabled it should not be an issue since the requests will role to the next range separate of the serials.

The nextrange DS Values do not get populated for some time. We do not have them on most of our prod systems as they get populated at a point well after the CA is stood up when serial issuances reaches a point and populates the values to the DS instance.

For a stand alone CA the default end serialNumer Value will result in most installs of a dogtag CA never rolling over to a new range. Active clones are more likely to role faster since they only get a very small fraction of the top end of the existing CA's serial range for a sequential CA. For a Random serial CA I believe it pulls the new nextRange when the CA is built, but may be wrong since I have not deployed a random serial CA yet.

Ldap output from a serial CA that has a clone that has not reached the point where a range has been added to the DS instance.

dn: ou=certificateRepository,ou=ranges,dc=*****
objectClass: top
objectClass: organizationalUnit
ou: certificateRepository

I think the best solution would be to set a couple of extra values in the CS.cfg so that different subsystems can be contolled differently for like the TPS, TKS, or OSCP where everything I believe is in base 10, and treat all of the values for a type the same, so start, end, and increment for Serial would be base 16, and for request and replica they would be base 10.

dbs.serialRadix=16
dbs.requestRadix=10
dbs.replicaRadix=10

From DOGTAG_10_5_BRANCH
base/server/cms/src/com/netscape/cms/servlet/csadmin/UpdateNumberRange.java


            if (type.equals("request")) {
                radix = 10;
                endNumConfig = "dbs.endRequestNumber";
                cloneNumConfig = "dbs.requestCloneTransferNumber";
                nextEndConfig = "dbs.nextEndRequestNumber";
            } else if (type.equals("serialNo")) {
                radix = 16;
                endNumConfig = "dbs.endSerialNumber";
                cloneNumConfig = "dbs.serialCloneTransferNumber";
                nextEndConfig = "dbs.nextEndSerialNumber";
            } else if (type.equals("replicaId")) {
                radix = 10;
                endNumConfig = "dbs.endReplicaNumber";
                cloneNumConfig = "dbs.replicaCloneTransferNumber";
                nextEndConfig = "dbs.nextEndReplicaNumber";
            }

For a RHEL 7 amd RHEL 8 instance it can be assumed that for Serial should be interprtied as base 16, and for Requests and replica should be done as base 10.

Since the math error in the code results in the value being recroded in the ds being higher than the value stored in the CS.cfg when taking base into account I am not seeing the risk to fixing it by fixing how the math is done.

For dealing with existing systems I see a couple of paths.

Handle at subsystem startup

If nextRange values are not in DS, no issue since the problem code has not been executed.

If nextRange exists in the DS, read the existing range from the cn=<start_of_current_range> from the DS and compare to what is in CS.cfg, if they match nothing to do, if they do not match the write back the End of the Range to CS.cfg so that they do match. This will prevent a gap from being created. Fix the code so that Reads from DS Range Values and the CS.cfg always bigInt with the Radix and toString(radix) so that the math is all done using the proper base for the values, and also stored in the same manner in both the DS and the CS.cfg so they can easily be compared.

Switch to all Dec, but manually convert/ update the increment and end to be hex to dec values so that 100 would be 256 written back to the CS.cfg. The challenge here is I think this is a bigger lift, and anywhere in the code that currently uses the value as hex would have to be found, and anywhere it needs to be hex would need to do an integer.toHexString() to convert it.

Basicly I am looking at the best way to prevent this from happening when a CA Rolls to the next range. Not looking to go back and try to fill in where we have ended up with gaps. We are already having to update Other systems that pull stuff from the CAs to handle, the gap, and we have had to do that before when we did not use the serial management provides and we did manual clones with completely different ranges. and had huge gaps so most of the other systems have a method to deal with gaps.

If we read from DS and CS.cfg with a Radix then it is coming in to Bigint as a bigint from hex or Dec, and when writing back to either toString(radix), then it is put back in the expected value.

I think adding 0x to designate it as hex is also likely a bad Idea as most of the code in the repository class is used by different subsystems that have different bases and then every value would need to be tested to see if the 0x needs to be stripped, off then bigint with a 16 radix if it had a 0x. where as if every subsystem provides the radix that it uses, it is just used vs having to add logic to determine the value type of dec/hex.

I dont think a new id generator is really required since currently the existing has no potential of Serial overlap which would be really bad, it just can result in gaps since the math is not done the same way for the value written to the DS, and the value written to the CS.cfg

.github/workflows/ca-sequential-test.yml

.github/workflows/ca-clone-sequential-test.yml

base/ca/shared/conf/CS.cfg

edewata · 2024-10-18T17:56:16Z

.github/workflows/ca-clone-sequential-test.yml

+          endRange: 48
+          host: primary.example.com
+
+          EOF


Could you update the comment to indicate that the allocated cert range for the primary CA changed from 13-30 to 19-48.

Could you add a similar check for the request range too since we're migrating it to SSNv2 too?

To my understanding the beginRange and endRange attributes are always in decimal and we're not changing it in this PR:
https://github.com/dogtagpki/pki/blob/master/base/server/src/main/java/com/netscape/cmscore/dbs/Repository.java#L574-L578

The values are derived from nextRange which was parsed as decimal:
https://github.com/dogtagpki/pki/blob/master/base/server/src/main/java/com/netscape/cmscore/dbs/Repository.java#L543-L544

Yes, they are written in decimal and derived from nextRange. However, nextRange is read as decimal when assigned to the range but it is read as hex when assigned in nextMinSerialNumber:

pki/base/server/src/main/java/com/netscape/cmscore/dbs/Repository.java

Line 701 in a76eb4b

mNextMinSerialNo = new BigInteger(nextRange, mRadix);

Therefore the value written in range match the value written in CS.cfg but the real value is different. With this patch the value will really match after the conversion.

Yeah, that makes it really complicated since we cannot really rely on the values written in the CS.cfg or DS so there are more possibilities of problems. I need to look at the code to see how mNextMinSerialNo is used.

So in this step the range size grows from 18 (which is the size specified during installation) to 30, and it temporarily overlaps the other server's range. Where does the 30 come from?

There is no real overlap because the one updated is in decimal while the other is in hex. If we took the range of the primary as example here. Initially the range was from 13 to 30 (30 = 13 + increment in dec - 1). The following range start from 31 and the begin is the real one. The gap is generated because in the CS.cfg the end is not 30 but 13 + incerement in hex - 1. = 24 and the next start from 31.

BTW, I think I found a possible conflict and I am updating the code.

.github/workflows/ca-clone-sequential-test.yml

edewata · 2024-10-18T19:13:38Z

.github/workflows/ca-clone-sequential-test.yml

+      - name: Check cert range config in primary CA
+        run: |
+          tests/ca/bin/ca-cert-range-config.sh primary | tee output
+
+          cat > expected << EOF
+          dbs.beginSerialNumber=0x13
+          dbs.endSerialNumber=0x30
+          dbs.serialCloneTransferNumber=0x9
+          dbs.serialIncrement=0x12
+          dbs.serialLowWaterMark=0x9
+          EOF
+
+          diff expected output
+
+      - name: Check cert range config in secondary CA
+        run: |
+          tests/ca/bin/ca-cert-range-config.sh secondary | tee output
+
+          cat > expected << EOF
+          dbs.beginSerialNumber=0x31
+          dbs.endSerialNumber=0x48
+          dbs.serialCloneTransferNumber=0x9
+          dbs.serialIncrement=0x12
+          dbs.serialLowWaterMark=0x9
+          EOF
+
+          diff expected output


So the ranges changed from 0x13-0x24 to 0x13-0x30 in the primary CA and from 0x31-0x42 to 0x31 to 0x48 in the secondary CA. Is there any possibility that the new ranges would conflict with other existing ranges?

If I get the math correct the range in CS.cfg are smaller then in DS and not full used but the start range are always correct in case of gap. The ID between 0x42 and 0x48 are lost because when a new range is computed it start from the value recorded in nextRange and do not take in account the range objects and/or configuration range so overlap should not be possible. Using the full recorded range we are requesting to utilise the full range.

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemRangeGeneratorUpdateCLI.java

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemRangeUpdateCLI.java

base/server/python/pki/server/cli/range.py

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemRangeGeneratorUpdateCLI.java

base/server/src/main/java/com/netscape/cmscore/dbs/Repository.java

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemIdGeneratorUpdateCLI.java

edewata · 2024-10-22T20:42:48Z

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemIdGeneratorUpdateCLI.java

+        LDAPSearchResults ranges = conn.search(rangeDN, LDAPv3.SCOPE_SUB, "(objectClass=pkiRange)", null, false);
+
+        BigInteger lastUsedSerial = BigInteger.ZERO;
+        boolean nextRangeToUpdate = true;
+        // Search for the last range entry. If it is associated to the CA to update or ranges are not defined
+        // then the nextRange is         
+        while (ranges.hasMoreElements()) {
+            LDAPEntry entry = ranges.next();
+            String endRange = entry.getAttribute("endRange").getStringValues().nextElement();
+            String host = entry.getAttribute("host").getStringValues().nextElement();
+            String port = entry.getAttribute("securePort").getStringValues().nextElement();
+            BigInteger next = new BigInteger(endRange, 16);
+            if (lastUsedSerial.compareTo(next) < 0) {
+                lastUsedSerial = next;
+                nextRangeToUpdate = host.equals(hostName) && port.equals(securePort);
+
+            }
+        }


IIUC in this loop we go through all ranges and parse the endRange as hexadecimal and use that to find the last range entry. However, in the following loop we're replacing the hexadecimal endRange with its decimal value. So if we run this command again for another replica, wouldn't the now decimal endRange be incorrectly parsed as hexadecimal?

The nextRange is updated only if last range is associated with the CA doing the generator update so only one CA will do the update and since the other are not creating new range there should be no conflict. In the following loop only the ranges associated with this CA are updated so if last range is not owned by this CA there is no update therefore there should not be any conflict.

I was actually concerned about the endRange, not nextRange. In the first loop above it's parsed as hexadecimal and used for comparison:

String endRange = entry.getAttribute("endRange").getStringValues().nextElement(); BigInteger next = new BigInteger(endRange, 16); if (lastUsedSerial.compareTo(next) < 0) {

In the second loop below, it's parsed again as hexadecimal, but written as decimal, basically replacing the old value:

String endRange = entry.getAttribute("endRange").getStringValues().nextElement(); BigInteger endRangeNo = new BigInteger(endRange, 16); attrs.add(new LDAPAttribute("endRange", endRangeNo.toString()));

So suppose we run the tool again for another clone, the first loop will run again, but this time it will be reading the endRange that has already been converted to decimal by the second loop, and it will parse the value as hexadecimal again. Is this correct?

Here are the ranges in the DS just before the migration:

beginRange: 13 endRange: 30 host: primary.example.com beginRange: 31 endRange: 48 host: secondary.example.com nextRange: 49

So in the DS at least there seems to be no gap. Is it possible to take these values as decimals, then update CS.cfg based on these values, but avoid updating the DS completely? Some serial numbers that used to belong to one server might "move" to another server, but since we're stopping all servers, fixing all ranges at once, and restarting the servers again, that might be OK. What do you think?

I have inverted the order of the operation. Update the range to decimal of the updated CA and then look for the next range. In case of not selecting the real last range is not a problem. Considering the case in CI but we suppose the primary range is from 13 to 31 and the secondary range is from 31 to 48. The first updated is the primary so it become 19 49 and it is considered as last. The nextRange will be 50. Then the other CA is update and the range is 50 -72 and the next range is updated again to 73. There are no other operation involved so it is not a problem.

The thing is we're still mix up radixes in the second loop:

LDAPSearchResults ranges = conn.search(rangeDN, LDAPv3.SCOPE_SUB, "(objectClass=pkiRange)", null, false); String endRange = entry.getAttribute("endRange").getStringValues().nextElement(); BigInteger next = new BigInteger(endRange); // could be decimal or hexadecimal that has not been converted to decimal

While the code might work with the above example, there isn't really a guarantee that it will work fine in all cases.

As discussed in Slack, we should explore some alternatives:

avoid modifying DS at all

fix the radix for all servers in one step, then fix the nextRange in a separate step

store the new SSNv2 ranges in a separate subtree (Relocate SSNv2 range objects for new CA instances #4885)

edewata · 2024-10-22T20:47:13Z

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemIdGeneratorUpdateCLI.java

+            String dn = "cn=" + beginRangeNo.toString() + "," + rangeDN;
+            LDAPEntry rangeEntry = new LDAPEntry(dn, attrs);
+            logger.info("SubsystemRangeGeneratorUpdateCLI.updateRanges: Remove entry " + entry.getDN());
+            conn.delete(entry.getDN());
+            logger.info("SubsystemRangeGeneratorUpdateCLI.updateRanges: Adding entry " + dn);
+            conn.add(rangeEntry);


Here we're deleting the old cn=<hex> and adding a new cn=<decimal>. Is there a possibility that the cn=<decimal> will conflict with an existing range?

I was trying to compute this to exclude this possibility but I cannot. However, since the conversion between hex and dec is not linear, the results are not multiple and the value increase very quickly I cannot find combination of value where overlap could happen so I consider very unlikely. However, we could try to find if a combination is possible orwe could add a check and eventually do a correction (e.g. adding or reducing the range by one entry)

base/server/src/main/java/org/dogtagpki/server/cli/SubsystemIdGeneratorUpdateCLI.java

The current generator has a problem with converting from hex to decimal the range boundaries creating gaps between ranges. This a problem when third parties tools are used to with certificates because contiguous range are expected. This commit introduce the generator legacy2. This uses same configuration parameter but hex value are specified by the prefix '0x'. When value are written to the configuration value it is possible to set the radix with the options: - dbs.cert.id.radix (default to 16) - dbs.key.id.radix (default to 16) - dbs.request.id.radix (default to 10) Additionally, the new command `pki-server <subsystem>-id-generator-*` has been added to migrate from the legacy generator to the legacy2 or to random.

sonarcloud · 2024-10-23T18:07:06Z

Quality Gate passed

Issues
9 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
8.8% Duplication on New Code

See analysis details on SonarCloud

edewata

There are still some concerns about the migration process, but the SSNv2 itself looks fine. I think lets merge this PR so we can add tests for SSNv2 (without the migration) while we keep improving the migration process. Or if you prefer, feel free to split the PR and move the migration tools & tests into a separate PR. Thanks!

fmarco76 · 2024-10-24T17:23:15Z

@edewata Thanks! I am merging this with the current migration implementation. It will be easier to test and improve to support the corner cases.

fmarco76 requested a review from edewata October 10, 2024 10:28

fmarco76 force-pushed the nextRange branch from 72b8aeb to 67fdebd Compare October 10, 2024 11:16

edewata reviewed Oct 11, 2024

View reviewed changes

fmarco76 force-pushed the nextRange branch from 67fdebd to 19ca750 Compare October 11, 2024 16:05

edewata reviewed Oct 11, 2024

View reviewed changes