rte_hash lookup issues in multi-process environment #185

dennisafa · 2020-02-13T02:26:03Z

Bug Report

An rte_hash structure created in a primary process using rte_hash_create, with , then looked up using rte_hash_find_existing in a secondary process and subsequently used for lookups/additions may cause seg faults.

Current Behavior
I initialize a hash structure from the primary process, the manager. Then, I create a separate NF that looks up the created structure by name. The pointer to the hash table is correct, and simple functions like rte_hash_count work, but rte_hash_lookup will fail with a rte_pipeline segfault.

Expected behavior/code
rte_hash structures mapped to hugepages should be accessible from any process that attaches to the EAL environment

Steps to reproduce
Initiate hash structure in primary process, then look it up using rte_hash_find_existing from secondary process. Then, perform a lookup call.

Environment

OS: Ubuntu 18.04
onvm version: latest
dpdk version: latest
other info:

Possible Solution

For the onvm_flow table api, we map the structure that contains the hash structure via rte_memzone (found here) This could work if we do it for hash structs as well.

Additional context/Screenshots

The text was updated successfully, but these errors were encountered:

@nks5295

This code introduces **EXPERIMENTAL** support to allow NFs to efficiently run on **shared** CPU cores. NFs wait on semaphores when idle and are signaled by the manager when new packets arrive. Once the NF is in wake state, no additional notifications will be sent until it goes back to sleep. Shared cpu variables for mgr are in the `nf_wakeup_info` structs, the NF shared cpu vars were moved to the `onvm_nf` struct. - The scaling/pthread model required modifications to the signal handling process. To accommodate proper signal handling and fix other issues, children will block all signals and only the parent will handle signals. When terminating the parent NF will tell all children to stop running, which in turn will turn their children to stop running, which in turn will tell... this ensures that the parent thread blocks until absolutely all descendants have exited (this is required because even though a pthread child is detached if the main process exits it will be abruptly killed) - Creating an `onvm_nf_context` to be malloced before nflib starts and has info about that specific NF (shutdown information/nf and nf_info structs). This struct replaces `onvm_nf_info` in most API calls. _In the next pr `onvm_nf_info` will be reworked into a nf initialization structure, more on that later_ - Running the `onvm_nflib_start_signal_handler(nf_context, NULL);` call before `onvm_nflib_init` to ensure proper signal handling even on NF startup (pretty sure that resolves sdnfv#93) - `onvm_nflib_stop` is now executed from the main(), instead of inside `onvm_nflib_run`. Also, the `onvm_nflib_stop` will be automatically called for the children spawned by the scaling API. Other related changes: - Stats `-v` mode gets an update with the shared cpu information if such is available - onvm gets a config structure which contains onvm configuration flags for now but might be expanded later. This will be stored in a memzone and shared with the NFs. **API additions:** - `onvm_nflib_start_signal_handler(struct onvm_nf_context *nf_context, handle_signal_func signal_hanlder);` **API changes:** - `onvm_nflib_init(int argc, char *argv[], const char *nf_tag, struct onvm_nf_info **nf_info_p);` -> `onvm_nflib_init(int argc, char *argv[], const char *nf_tag, struct onvm_nf_context *nf_context);` - `onvm_nflib_run_callback(struct onvm_nf_info *info, pkt_handler_func pkt_handler, callback_handler_func callback_handler);` -> `onvm_nflib_run_callback(struct onvm_nf_context *nf_context, pkt_handler_func pkt_handler, callback_handler_func callback_handler);` - `onvm_nflib_run(struct onvm_nf_info *info, pkt_handler_func pkt_handler);` -> `onvm_nflib_run(struct onvm_nf_context *nf_context, pkt_handler_func pkt_handler);` - `onvm_nflib_stop(struct onvm_nf_info *nf_info);` -> `onvm_nflib_stop(struct onvm_nf_context *nf_context);` All code changes are featurized using INTERRUPT_SEM macro. See the ONVM README.md file for more information and warnings. Commit log: * Shared CPU NFs (sdnfv#150) * merges rebased to master with sdnfv#147 * reverting num_clients update. * replace tabs with spaces * Readme describing experimental shared CPU support * Add usage and known limitations for shared CPU code * Bug Fix: Fixes instance ID assignment race condition (sdnfv#183) (sdnfv#185) Fixes a race condition in NF instance ID assignment. Before, if two NFs were started at the same time, they would be assigned the same instance ID. This change modifies logic in `onvm_nf_next_instance_id` to avoid assigning the same ID to two starting NFs. * Bug Fix: Fixes instance ID assignment race condition - Fixes sdnfv#178 - Changed incrementation of next_instance_id in onvm_mgr/onvm_nf.c such that the ID can't be reused before an NF becomes active. * partial refactoring of api w accordance of nf_info * onvm mgr compiles * flow table compiles -- all examples compile * changed onvm_get_rx and onvm_get_tx to get rx_q and tx_q associated with onvm_nf struct * Did requested changes. Also updated newer NFs (such as arp_response and flow_tracker). Also did some small style changes in some other documentation in onvm_nflib.h * Corrected some incorrect references to nflib functions in NF_Dev.md * removed typing error * Minor cleanup * Stats cleanup - Some stats were unused and broken(?) nuked them - added stats that are possible usefull to stdout (slightly ugly) * Add functionality to scale same NF instances * Fixed a few global vars for multithreading to work * Code clean up, comments and style fixes * Added scale return value * Fixed nf_mode for running advanced rings * Refactored the onvm_nf/onvm_nf_info passing - Keeping onvm_nf_info small, moved vars to onvm_nf - Added a callback function for the scaling child thread - Cleaned up code * Adv rings NF scaling, setup func for spawned NFS - Adds ability to scale NFs with advanced rings - Adds a setup function for NFs - More docs and cleanup * Fixed build errors, changed NF typedef func names * Launching NFs with different state/pkt processing - Moving nf_info as passed arg to packet_handler -> allows for spawned NF to access their own nf_info - Launching new NFs now accepts arguments for packet_handler function, setup function, callback function, advanced rings function - Add a data pointer to onvm_nflib_info to maintain state - Working on a scaling example (currently testing code, not in usable state) will showcase different features of spawning NFs * Major changes, init breakdown, argv for spawned nf - This is a major change, a lot of nflib initlization is refactored to allow spawned NFs to launch with their own service id - There is a bug when the spawn a lot of new NFs we can no longer retrieve the nf_info from nf_info_mp, this needs to be fixed as it breaks stuff * Fixing bugs, cleaning up code, updating example NF - Fixed the nf_info bug where the mempool would get messed up as children weren't cleared properly. - Updated the example scaling NF - Removed most of the debugging code * Clean up of shared cpu branch Cleaned up messy bits, replacing macro branches with if statements, encapsulating code in functions, and getting rid of unnecessary code. * The code cleaning commit has arrived - Style fixes - Removing debug code - Making code prettier - Adding better docs - Implementing requested changes * Minor cleanup/whitespace fixes * Fix Large ServieNum port crash bug * Implementing requested changes * Update scaling docs * Doc typos fixed * Doc fixes * Performace fixes - Changed MAX_NFS back to 128, improves performance - Changed the order inside of the onvm_nf struct, due to struct packing this reduced the overall size of onvm_nf down to 168 from 176 * Fixed memzone init size issue * Update Releases.md * Fix space * initial pthread testing * Small fixes * small fixes * More minor fixes * More tiny fixes * Scaling nf fixes * Small fixes * Logging, README fixes * Added Manual/MGR core assignment mode, cleanup * Added dedicated core option, error codes * Combining shared cpu & pthreads - Dedicated pthread to watch for signals, this ensures that when user sends a SIGINT the threads are properly woken up and cleaned up. Not possible with the previous signal handler as it didn't have NF infromation * Implemented shared cpu mutex logic for adv rings * Add hacky shared cpu fix * Fix some unused/old code * Minor fixes, cleanup * Remove testing print statement * Remove unused func * Fix shutdown thread join and cleanup * Shared CPU mode as an argument - Add shared cpu '-c' flag passed to onvm_mgr - Add custom flags memzone - Surround all shared cpu code with if(shared_cpu) - Overall cleanup * Update README.md * Update NF_Dev.md * Fix onvm_manager README * Scaling nf share cpu flag * Style fixes * Cleanup * Code cleanup * Cleanup * Onvm config struct for flags * Refactoring and things * Remove unused vars, rename var * Fix stats & remove unrelated code * Style fixes * Fix imports in speed tester * Fix race by properly killing grandchildren * Rename vars, cleanup, comments * Final cleanup * Minore style fixes * Minore style fixes * Only cleanup shm when in shared cpu mode * Fix unused macro * Fix flag_p name to sleep_state * Fix comment about signal thread * Fix some style issues * Fix scaling adv ring children termination * Fix sleep issue with shared cpu scaling * Fix cleanup, logical redesign * The bug is strong with this one * Fix type * Pre-init signaling thread * Better pre-init termination * Working on combining signal handlers * Remove unused dpdk_init_finished var * Rewrote some termination logic * Add defines in header file * Fix signal handling, add nf context (api changes) - For proper termination of both advanced rings and normal NFs create nf context which is malloced as opposed to rte_malloc and is created prior to nflib init. - To properly handle signals and simplify the process instead of using a dedicated pthread, use signal() + global termination context - Involves a bunch of API changes to handle the new context variable, some API changes are still missing * Cleanup comments, remove unused vars * Further cleanup * Cleanup * More cleanup/updates * Resolve needed signaling issues * Update NFs to new API * Fix fake getopts hack * Add custom nf specific sig handling * Update docs with paper description * Update NF_Dev.md * Update NF_Dev.md * update references to research papers * Remove debug pring * Make children_cnt atomic * Fix docs and comments * Fix style issues * Redo termination cleanup * Removed ugly flag, proposed by the one and only @nks5295 * Fixed API calls for all example NFs * Remove extra onvm_nflib_stop call in SpeedTester * Add time based stop * Fix Basic Monitor NF * Fixed iter cnt for shutdown * Fix comment * Make keep_running atomic

dennisafa added the bug 🐛 label Feb 13, 2020

dennisafa self-assigned this Feb 28, 2020

dennisafa closed this as completed Apr 24, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rte_hash lookup issues in multi-process environment #185

rte_hash lookup issues in multi-process environment #185

dennisafa commented Feb 13, 2020

rte_hash lookup issues in multi-process environment #185

rte_hash lookup issues in multi-process environment #185

Comments

dennisafa commented Feb 13, 2020

Bug Report