Faster way of performing AUC Evaluations on larger datasets. #126

FaizalJnu · 2024-08-25T18:01:10Z

Description:

While working with the Beeline dataset as part of GSoC, I had difficulty running the evaluation pipeline to generate AUC scores. The file in question was computeDGAUC.py in the BLEval folder. I've therefore implemented an optimized version of the computeScores function that improves performance and efficiency, especially for large genetic networks. Here's a comparison of the old and new implementations:

Previous Implementation:

Used nested loops and DataFrame operations for edge lookups
Initialized dictionaries with all possible edges before filling them
Relied on DataFrame filtering for each edge check
Separate logic for directed and undirected cases

# Directed case: one DataFrame filter per candidate edge
for key in TrueEdgeDict.keys():
    gene1, gene2 = key.split('|')
    if len(trueEdgesDF.loc[(trueEdgesDF['Gene1'] == gene1) &
                           (trueEdgesDF['Gene2'] == gene2)]) > 0:
        TrueEdgeDict[key] = 1

# Undirected case: the same filter, also matching the reversed pair
for key in TrueEdgeDict.keys():
    gene1, gene2 = key.split('|')
    if len(trueEdgesDF.loc[((trueEdgesDF['Gene1'] == gene1) &
                            (trueEdgesDF['Gene2'] == gene2)) |
                           ((trueEdgesDF['Gene2'] == gene1) &
                            (trueEdgesDF['Gene1'] == gene2))]) > 0:
        TrueEdgeDict[key] = 1

New Implementation:

Converts DataFrames to sets and dictionaries for faster lookups
Creates dictionaries on-the-fly while iterating through possible edges
Uses set membership and dictionary lookups instead of DataFrame filtering
Unifies logic for directed and undirected cases

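# Build the set of true (Gene1, Gene2) pairs once, outside the loop
true_edges = set(map(tuple, trueEdgesDF[['Gene1', 'Gene2']].values))
# edge_generator yields one (Gene1, Gene2) tuple per candidate edge
for edge in edge_generator:
    key = '|'.join(edge)
    # For undirected networks, the reversed pair also counts as a match
    TrueEdgeDict[key] = int(edge in true_edges or (not directed and edge[::-1] in true_edges))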
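For anyone who wants to try the idea outside the pipeline, here is a minimal self-contained sketch of the set and dictionary approach. It is not the actual computeScores code: the function name `build_edge_dicts`, the `EdgeWeight` column, the use of `permutations`/`combinations` to generate candidate edges (no self-edges), and taking the larger of the two weights in the undirected case are all assumptions made for illustration.

```python
from itertools import combinations, permutations

import pandas as pd


def build_edge_dicts(true_df, pred_df, directed=True):
    """Label each candidate edge as true (1) or not (0) and attach its predicted weight.

    Hypothetical helper for illustration only, not the BLEval implementation.
    """
    # Set of (Gene1, Gene2) tuples: membership tests are O(1) on average.
    true_edges = set(zip(true_df["Gene1"], true_df["Gene2"]))
    # Predicted weights keyed by edge tuple ("EdgeWeight" is an assumed column name).
    pred_weights = dict(
        zip(zip(pred_df["Gene1"], pred_df["Gene2"]), pred_df["EdgeWeight"])
    )

    genes = sorted(set(true_df["Gene1"]) | set(true_df["Gene2"]))
    # Assumption: ordered pairs when directed, unordered pairs otherwise.
    edge_generator = permutations(genes, 2) if directed else combinations(genes, 2)

    true_dict, pred_dict = {}, {}
    for edge in edge_generator:
        key = "|".join(edge)
        rev = edge[::-1]
        true_dict[key] = int(edge in true_edges or (not directed and rev in true_edges))
        weight = pred_weights.get(edge, 0.0)
        if not directed:
            # Assumption: an undirected edge takes the larger of the two weights.
            weight = max(weight, pred_weights.get(rev, 0.0))
        pred_dict[key] = weight
    return true_dict, pred_dict


# Tiny made-up network: A -> B and B -> C are the true edges.
true_df = pd.DataFrame({"Gene1": ["A", "B"], "Gene2": ["B", "C"]})
pred_df = pd.DataFrame(
    {"Gene1": ["A", "C"], "Gene2": ["B", "A"], "EdgeWeight": [0.9, 0.4]}
)

true_dict, pred_dict = build_edge_dicts(true_df, pred_df, directed=True)
undirected_true, undirected_pred = build_edge_dicts(true_df, pred_df, directed=False)
```

The DataFrames are read once up front, and the loop body only touches plain Python containers, which is where the speedup comes from.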

Key Improvements:

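Performance: The old version filtered the whole trueEdgesDF once per candidate edge, so its cost grew with (number of candidate edges) × (number of true edges). The new version builds a set once and then does an average constant-time membership test per edge.
Scalability: The number of candidate edges grows roughly quadratically with the number of genes, so the gain becomes more pronounced as the input grows, making the new version better suited for large-scale genetic network analyses.
Code Readability: The new version is more concise, with one code path for directed and undirected networks instead of two near-identical loops, which improves maintainability.
Memory Usage: Holding the edge set and lookup dictionaries might use slightly more memory upfront, but this trade-off buys a substantial runtime benefit.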
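One way to check both the speed and the "same results" claims is to run the two lookup styles on the same input and compare. The sketch below does this on a synthetic chain network; the helper names and the data are made up for illustration, it covers only the directed case, and the timings it prints will vary by machine.

```python
import time
from itertools import permutations

import pandas as pd


def label_with_filter(true_df, edges):
    """Old style: one boolean-mask scan of the DataFrame per candidate edge."""
    out = {}
    for g1, g2 in edges:
        hit = true_df.loc[(true_df["Gene1"] == g1) & (true_df["Gene2"] == g2)]
        out[f"{g1}|{g2}"] = int(len(hit) > 0)
    return out


def label_with_set(true_df, edges):
    """New style: build a set once, then test membership per candidate edge."""
    true_edges = set(zip(true_df["Gene1"], true_df["Gene2"]))
    return {"|".join(edge): int(edge in true_edges) for edge in edges}


# Synthetic directed network (hypothetical data): gene i regulates gene i+1.
genes = [f"G{i}" for i in range(40)]
true_df = pd.DataFrame({"Gene1": genes[:-1], "Gene2": genes[1:]})
edges = list(permutations(genes, 2))  # every ordered pair, no self-edges

start = time.perf_counter()
old = label_with_filter(true_df, edges)
t_filter = time.perf_counter() - start

start = time.perf_counter()
new = label_with_set(true_df, edges)
t_set = time.perf_counter() - start

# Both approaches must produce identical labels.
assert old == new
print(f"edges: {len(edges)}  filter: {t_filter:.4f}s  set: {t_set:.4f}s")
```

Increasing the number of genes in `genes` shows how the gap between the two timings changes with input size.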

Why It's Better:

Faster execution times, especially crucial for large genetic networks
More efficient handling of edge lookups and checks
Better scalability for growing datasets
Improved code structure for easier maintenance and future enhancements

These optimizations maintain the same functionality while providing substantial performance enhancements, making our genetic network analysis more efficient and capable of handling larger datasets.
