WIP: Routine Unresponsive due to Deadlock #51
No reviewers
Labels
No labels
Automated
Backlog
Post_Prototype_1.0
Bot_Code
Core
Bot_Code
Custom
CI/CD
Complexity
Advanced
Complexity
Basic
Complexity
Expert
Complexity
Intermediate
Kind/Breaking
Kind/Bug
Kind/Bug Fix
Kind/Documentation
Kind/Enhancement
Kind/Feature
Kind/Security
Kind/Testing
Nix
Ownership
Collab
Ownership
Collab with Leads
Ownership
Individual Lead
Ownership
In-Review
Ownership
Needs Owner > May Delegate
Ownership
Workshop with Leads
Phase 1.0
Requirements > Drafting
Phase 1.0
Requirements > Researching
Phase 1.0
Requirements > Review & Planning
Phase 2.0
Design > Research & Analysis
Phase 3.0
Coding > Implementation
Phase 4.0
QA > Unit Testing & Design
Phase 5.0
Resolution > Completed
Phase 5.0
Resolution > Review for Completion
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Blocks
#40 WIP: Basic Routine Functionality
modulatingforce/forcebot_rs
Reference
modulatingforce/forcebot_rs!51
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "issue-routine-lock"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Initial Description
Stemming from feature set from Routines Functionality - #40 (comment)
ISSUE . The defined Routine's Custom Execution Body is being reached (so the custom
fnis being triggered as expected); However, I believe at L413 wherec.read().await.channel, it never goes beyond that point within the same CustomfnbodyEven if the bot continues to be responsive (i.e., the
mainbot thread continues to run), the rest of the routine is not. I believe this is suggesting there's a lock with one of the objects we're introducing with this feature . In the above line, I believe the lock involvesExecBodyParam'scurr_actvalueI believe the issue is a deadlock between
customandcorelocks : However, ideally this is something that can be identified or some best practices recommended, even if we do identify that something in the custom module is the issueHigh Priority . This is a blocker of Routines functionality
Additional Resources
tokio::RwLock- https://docs.rs/tokio/latest/tokio/sync/struct.RwLock.htmlPlan of Action
readandwritelocks are located, and ensure they're dropped (e.g., within a block or through explicitdrop)'custommodule, consider creating SOP recommendations for Custom developersHmm. I'm wondering if I can delegate this to someone like tori or if this is a lot of manual work / investigation
If tori is willing to pick it up after reading the initial description, I may need to adjust the description with some additional details like Functional Use Case (that might not be as clear, since we're just introducing routines on parent PR)
The Routine that you want to read lock in the execution body is the routine that is used to call that execution body, making it so the execution body waits for itself to end. No idea how you would do this, but I suspect that you cant do that. If you want to access the channel you probably have to pass it in a different way instead of just passing the routine itself.
Second commit isn't necessary btw you can roll that back if you want, but it just has a bunch of debug implementations that might help debugging stuff like this in the future.
Wow @mzntori you are beautiful peepoShy
std::fmt::{Debug, Formatter};anddbg!are a learning experience for me! Good share!uuh feel free to resolve this comment afterwards? lol testing this review process as well
@ -535,0 +538,4 @@fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {write!(f, "{:?} {:?} {:?} {:?} {:?}", self.module, self.command, self.alias, self.help, self.required_roles)}}Holy I love this idea! 🤤 with the others too!
@ -1228,2 +1263,3 @@).await;dbg!("after");wow!
dbg!is so useful! Good share!@mzntori Definitely agree that you're in the right area , and I think it's related . However, what's strange is that this
exec_bodyis called and executed - if the lock occurs at this spot, I don't believe I would seeexec_bodyinternal code start to execute . This may be related thoughFew additional Comments :
exec_bodybody that the lock occurs 😢exec_body, if you remove code related to locks, theRoutineWorks without issues and triggers the rest of it's loop without issuescustommodule developer defined the customfnassociated withexec_bodyto justprintln!, the Routine goes across without issuesexec_body, if you continue to adddbg!around there, you'll see it get around locks😓I know this is a bit complex, so I'll writeup soon-ish the observed workflow that might clarify this
@ -1226,3 +1260,4 @@// So execbody waits for itself to finish.(self_ar.read().await.exec_body)(self_ar.read().await.parent_params.clone()).await;Just for added clarity, at L1262
self_ar.read().await.parent_params.clone():BotModulesimpl Routinemethodasync fn loopbody(&mut self)Useful Tool -
tokio-consolecrate for DiagnosingtokioLearned about a useful tool
tokio-consolethat could be used to troubleshoot async issues, including possibly Deadlock issuesI figured as code is getting more complex, it will be more difficult to track
tokioasync deadlocks . This will ease identifying these issues earlier I thinkRelatively easy to setup but very powerful to use
I'm sure @notohh will love this 😆it screams devops to me for some reason
Below is an example where I ran it and identified which
RwLockwere observed while the Bot is current running, and information like number ofcurrent_readersfor thatRwLockI've added it in a commit I'll be adding shortly, but it did adjust the Cargo.lock by a lot - I may need someone to just see if there's concerns or not
For More Information
tokio-rscrateconscole-subscribercrateUpdate - Related to this PR / Issue
To hopefully simplify and isolate the issue/crucial code, I've made the Custom Child
fnthat the Routine Calls callExecBody::get_channel()get_channel()call, an unexpected deadlock is occurringAt the moment,
tokio-consoledoesn't add to what we know already : in particular, the issue has to do with theArc<RwLock<Routine>>&Arc<RwLock<BotAction>>with
tokio-consoletool andCodeLLBextension, I was able to add more debugging on relatedRwLocks and theircurrent_readerin critical parts of codeI did the following to address the issue , but didn't seem to help - but will be keeping these in place just in case (I think these are better in code)
dropsandletforRwLockGuardsI suspect it has something to do with the relationship of
BotActionandRoutineand how they store their own self referencestokio-consoleandconsole-subscribercrates need to be adjusted or no issues@ -16,2 +16,4 @@casual_logger = "0.6.5"chrono = "0.4.35"tokio-console = "0.1.10"console-subscriber = "0.2.0"@notohh & @mzntori - these two crates I added
tokio-consoleandconsole-subscribercreated very large changes inCargo.lock. Do you see this as an issue?I'm going to shrug it off cuz LULE I don't know - but this is really helpful
If you're not sure or you may need time to look, I'll just shrug this off for now and say let's add
More information on these crates:
Notes on the Identified Critical Areas
@ -78,0 +80,4 @@let out = Some(r.read().await.channel.clone());// => 03.30 - This isn't reached because of the Deadlockdbg!(">> Just after Potential Deadlock Lock");L74-L83 Observations at this Critical Deadlock
@ -1230,0 +1293,4 @@{let guard2 = self_ar.read().await;(guard2.exec_body)(parent_params).await;drop(guard2);L1288-L1296 - Observations during this line - prior to Critical Deadlock
To be clear though, I do agree with this : this does merit exploration if we can't self pass routine . If push comes to shove, we can prioritize exploring a different way to retrieve the desired values without self reference (
channelwas chosen here for a simplified case)::Giggles:: but my gut feeling says this is possible, like how
BotInstanceis now able to pass itself into commandsBotInsancedoesn't use self referencing atm so might make senseasync&RwLockusage #54Confirmed Fix
Code Fix Summary
Routine, we had code fixes involving using blocking methods forRoutinefunctionalityCode Fix Involved :
Arc<RwLock<BotAction>>, useblocking_read()&blocking_write()ExecBodyParams::get_channel()be defined as non-asyncRoutinedefinition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)parent_params : Arc<RwLock<ExecBodyParams>>Routine::start()involves atokio::task::spawn_blockingexecute theloopbody()@ -52,2 +52,3 @@pub async fn get_channel(&self) -> Option<Channel> {// pub async fn get_channel(&self) -> Option<Channel> {pub fn get_channel(&self) -> Option<Channel> {Code Fix Involved :
Arc<RwLock<BotAction>>, useblocking_read()&blocking_write()ExecBodyParams::get_channel()be defined as non-asyncRoutinedefinition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)parent_params : Arc<RwLock<ExecBodyParams>>Routine::start()involves atokio::task::spawn_blockingexecute theloopbody()@ -63,0 +70,4 @@thread 'tokio-runtime-worker' panicked at src\core\bot_actions.rs:66:38:Cannot block the current thread from within a runtime. This happens because a function attempted to block the current thread while the thread is being used to drive asynchronous tasks.*/let curr_act_lock = curr_act.blocking_read();Code Fix Involved :
Arc<RwLock<BotAction>>, useblocking_read()&blocking_write()ExecBodyParams::get_channel()be defined as non-asyncRoutinedefinition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)parent_params : Arc<RwLock<ExecBodyParams>>Routine::start()involves atokio::task::spawn_blockingexecute theloopbody()@ -634,0 +648,4 @@// exec_body : fn(ExecBodyParams),exec_body : fn(Arc<RwLock<ExecBodyParams>>),// pub parent_params : ExecBodyParams ,pub parent_params : Arc<RwLock<ExecBodyParams>> ,Code Fix Involved :
Arc<RwLock<BotAction>>, useblocking_read()&blocking_write()ExecBodyParams::get_channel()be defined as non-asyncRoutinedefinition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)parent_params : Arc<RwLock<ExecBodyParams>>Routine::start()involves atokio::task::spawn_blockingexecute theloopbody()@ -1109,0 +1157,4 @@drop(guard1);});loopbodyspawn.await.unwrap();Code Fix Involved :
Arc<RwLock<BotAction>>, useblocking_read()&blocking_write()ExecBodyParams::get_channel()be defined as non-asyncRoutinedefinition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)parent_params : Arc<RwLock<ExecBodyParams>>Routine::start()involves atokio::task::spawn_blockingexecute theloopbody()@ -1152,6 +1209,8 @@ impl Routine {}dbg!("SUCCESS! Routinecompleted");Resolution Confirmed
In
stdout,fnof the custom child body is reachedWe should use
asynccustomfndefinitions so we have the ability to await onChat.say()callsDon't believe this is easy to just revert back without possibility re-introducing the locks - but will need to check
@ -253,0 +282,4 @@// [ ] !! ISSUE : With this fix though, the custom fn is no longer// async. This is an issue because chat module is async// - There should be no need to change chat , as async allows// us to multitask with messagesIssue with the Fix
If child Routine
fnare defined as regularfnrather thanasync, Custom Routinefnwould not be able to callChat.say()or similar functionalityHere for example, in a Custom Routine
fn innertester(params : Arc<RwLock<ExecBodyParams>>, we can't callchat.say()in this areaHonestly I'm favoring keeping the fox and maybe enhancing
Chatso it has non-async calls forsay().Keep in mind in implementing, we cannot call
asynccalls in a blocking context. This means if I have a nonasynccode block, code calls (e.g.fn) cannot be awaited on. In other words in non Async fn, I cannot useclient.send().await. There are ways around this though (later notes)I don't foresee issues with multitasking here if we enhance
Chatso users essentially enqueue "messages" (consisting ofBotMsgTypeandExecParams) then the bot has a separatetokio::taskrunning that processes those messages (i.e., triggering the message using giventwitch-irccall likeclient.say())In addition with in place, when custom module developers define a routine, they don't have to worry about locks as within the context of a custom built
Routine(in theory) should be blockingcorelevelI'm thinking of using
async-channelcratehttps://docs.rs/async-channel/latest/async_channel/fn.bounded.html
Hmm 🤔 I still need to review Tori's branch but wanted to mention the latest solution I started going for might not be the best path. I'm noticing some problems with what I've put in place so far (below), so I may separate my fixes out in separate branch
__
With above said, my solution's enhancements to
chatare welcomed as this could be a workaround for locking issues if we can't identify how to address a lock, but willing to go down a non-async workflowProbably will use this branch as the main issue branch, but continue working on my solution in a separate branch
Just to note, I think making Routines define custom functions differently than other BotActions may get confusing for Custom Developers - so better to make them
asyncacross the board. In addition, we're giving Custom Developersasyncfeatures ; if required underasync, Custom Developers may be able to replicate blocking behavior if requiredView command line instructions
Checkout
From your project repository, check out a new branch and test the changes.Merge
Merge the changes and update on Forgejo.Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.