WIP: Routine Unresponsive due to Deadlock #51
No reviewers
Labels
No labels
Automated
Backlog
Post_Prototype_1.0
Bot_Code
Core
Bot_Code
Custom
CI/CD
Complexity
Advanced
Complexity
Basic
Complexity
Expert
Complexity
Intermediate
Kind/Breaking
Kind/Bug
Kind/Bug Fix
Kind/Documentation
Kind/Enhancement
Kind/Feature
Kind/Security
Kind/Testing
Nix
Ownership
Collab
Ownership
Collab with Leads
Ownership
Individual Lead
Ownership
In-Review
Ownership
Needs Owner > May Delegate
Ownership
Workshop with Leads
Phase 1.0
Requirements > Drafting
Phase 1.0
Requirements > Researching
Phase 1.0
Requirements > Review & Planning
Phase 2.0
Design > Research & Analysis
Phase 3.0
Coding > Implementation
Phase 4.0
QA > Unit Testing & Design
Phase 5.0
Resolution > Completed
Phase 5.0
Resolution > Review for Completion
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
No milestone
No project
No assignees
2 participants
Notifications
Due date
No due date set.
Blocks
#40 WIP: Basic Routine Functionality}
modulatingforce/forcebot_rs
Reference: modulatingforce/forcebot_rs#51
Loading…
Reference in a new issue
No description provided.
Delete branch "issue-routine-lock"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Initial Description
Stemming from feature set from Routines Functionality - #40 (comment)
ISSUE . The defined Routine's Custom Execution Body is being reached (so the custom
fn
is being triggered as expected); However, I believe at L413 wherec.read().await.channel
, it never goes beyond that point within the same Customfn
bodyEven if the bot continues to be responsive (i.e., the
main
bot thread continues to run), the rest of the routine is not. I believe this is suggesting there's a lock with one of the objects we're introducing with this feature . In the above line, I believe the lock involvesExecBodyParam
'scurr_act
valueI believe the issue is a deadlock between
custom
andcore
locks : However, ideally this is something that can be identified or some best practices recommended, even if we do identify that something in the custom module is the issueHigh Priority . This is a blocker of Routines functionality
Additional Resources
tokio::RwLock
- https://docs.rs/tokio/latest/tokio/sync/struct.RwLock.htmlPlan of Action
read
andwrite
locks are located, and ensure they're dropped (e.g., within a block or through explicitdrop
)'custom
module, consider creating SOP recommendations for Custom developersHmm. I'm wondering if I can delegate this to someone like tori or if this is a lot of manual work / investigation
If tori is willing to pick it up after reading the initial description, I may need to adjust the description with some additional details like Functional Use Case (that might not be as clear, since we're just introducing routines on parent PR)
The Routine that you want to read lock in the execution body is the routine that is used to call that execution body, making it so the execution body waits for itself to end. No idea how you would do this, but I suspect that you cant do that. If you want to access the channel you probably have to pass it in a different way instead of just passing the routine itself.
Second commit isn't necessary btw you can roll that back if you want, but it just has a bunch of debug implementations that might help debugging stuff like this in the future.
Wow @mzntori you are beautiful peepoShy
std::fmt::{Debug, Formatter};
anddbg!
are a learning experience for me! Good share!uuh feel free to resolve this comment afterwards? lol testing this review process as well
@ -535,0 +538,4 @@
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
write!(f, "{:?} {:?} {:?} {:?} {:?}", self.module, self.command, self.alias, self.help, self.required_roles)
}
}
Holy I love this idea! 🤤 with the others too!
@ -1228,2 +1263,3 @@
).await;
dbg!("after");
wow!
dbg!
is so useful! Good share!@mzntori Definitely agree that you're in the right area , and I think it's related . However, what's strange is that this
exec_body
is called and executed - if the lock occurs at this spot, I don't believe I would seeexec_body
internal code start to execute . This may be related thoughFew additional Comments :
exec_body
body that the lock occurs 😢exec_body
, if you remove code related to locks, theRoutine
Works without issues and triggers the rest of it's loop without issuescustom
module developer defined the customfn
associated withexec_body
to justprintln!
, the Routine goes across without issuesexec_body
, if you continue to adddbg!
around there, you'll see it get around locks😓I know this is a bit complex, so I'll writeup soon-ish the observed workflow that might clarify this
@ -1226,3 +1260,4 @@
// So execbody waits for itself to finish.
(self_ar.read().await.exec_body)(
self_ar.read().await.parent_params.clone()
).await;
Just for added clarity, at L1262
self_ar.read().await.parent_params.clone()
:BotModules
impl Routine
methodasync fn loopbody(&mut self)
Useful Tool -
tokio-console
crate for Diagnosingtokio
Learned about a useful tool
tokio-console
that could be used to troubleshoot async issues, including possibly Deadlock issuesI figured as code is getting more complex, it will be more difficult to track
tokio
async deadlocks . This will ease identifying these issues earlier I thinkRelatively easy to setup but very powerful to use
I'm sure @notohh will love this 😆it screams devops to me for some reason
Below is an example where I ran it and identified which
RwLock
were observed while the Bot is current running, and information like number ofcurrent_readers
for thatRwLock
I've added it in a commit I'll be adding shortly, but it did adjust the Cargo.lock by a lot - I may need someone to just see if there's concerns or not
For More Information
tokio-rs
crateconscole-subscriber
crateUpdate - Related to this PR / Issue
To hopefully simplify and isolate the issue/crucial code, I've made the Custom Child
fn
that the Routine Calls callExecBody::get_channel()
get_channel()
call, an unexpected deadlock is occurringAt the moment,
tokio-console
doesn't add to what we know already : in particular, the issue has to do with theArc<RwLock<Routine>>
&Arc<RwLock<BotAction>>
with
tokio-console
tool andCodeLLB
extension, I was able to add more debugging on relatedRwLock
s and theircurrent_reader
in critical parts of codeI did the following to address the issue , but didn't seem to help - but will be keeping these in place just in case (I think these are better in code)
drops
andlet
forRwLock
GuardsI suspect it has something to do with the relationship of
BotAction
andRoutine
and how they store their own self referencestokio-console
andconsole-subscriber
crates need to be adjusted or no issues@ -16,2 +16,4 @@
casual_logger = "0.6.5"
chrono = "0.4.35"
tokio-console = "0.1.10"
console-subscriber = "0.2.0"
@notohh & @mzntori - these two crates I added
tokio-console
andconsole-subscriber
created very large changes inCargo.lock
. Do you see this as an issue?I'm going to shrug it off cuz LULE I don't know - but this is really helpful
If you're not sure or you may need time to look, I'll just shrug this off for now and say let's add
More information on these crates:
Notes on the Identified Critical Areas
@ -78,0 +80,4 @@
let out = Some(r.read().await.channel.clone());
// => 03.30 - This isn't reached because of the Deadlock
dbg!(">> Just after Potential Deadlock Lock");
L74-L83 Observations at this Critical Deadlock
@ -1230,0 +1293,4 @@
{
let guard2 = self_ar.read().await;
(guard2.exec_body)(parent_params).await;
drop(guard2);
L1288-L1296 - Observations during this line - prior to Critical Deadlock
To be clear though, I do agree with this : this does merit exploration if we can't self pass routine . If push comes to shove, we can prioritize exploring a different way to retrieve the desired values without self reference (
channel
was chosen here for a simplified case)::Giggles:: but my gut feeling says this is possible, like how
BotInstance
is now able to pass itself into commandsBotInsance
doesn't use self referencing atm so might make senseasync
&RwLock
usage #54Confirmed Fix
Code Fix Summary
Routine
, we had code fixes involving using blocking methods forRoutine
functionalityCode Fix Involved :
Arc<RwLock<BotAction>>
, useblocking_read()
&blocking_write()
ExecBodyParams::get_channel()
be defined as non-asyncRoutine
definition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)
parent_params : Arc<RwLock<ExecBodyParams>>
Routine::start()
involves atokio::task::spawn_blocking
execute theloopbody()
@ -52,2 +52,3 @@
pub async fn get_channel(&self) -> Option<Channel> {
// pub async fn get_channel(&self) -> Option<Channel> {
pub fn get_channel(&self) -> Option<Channel> {
Code Fix Involved :
Arc<RwLock<BotAction>>
, useblocking_read()
&blocking_write()
ExecBodyParams::get_channel()
be defined as non-asyncRoutine
definition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)
parent_params : Arc<RwLock<ExecBodyParams>>
Routine::start()
involves atokio::task::spawn_blocking
execute theloopbody()
@ -63,0 +70,4 @@
thread 'tokio-runtime-worker' panicked at src\core\bot_actions.rs:66:38:
Cannot block the current thread from within a runtime. This happens because a function attempted to block the current thread while the thread is being used to drive asynchronous tasks.
*/
let curr_act_lock = curr_act.blocking_read();
Code Fix Involved :
Arc<RwLock<BotAction>>
, useblocking_read()
&blocking_write()
ExecBodyParams::get_channel()
be defined as non-asyncRoutine
definition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)
parent_params : Arc<RwLock<ExecBodyParams>>
Routine::start()
involves atokio::task::spawn_blocking
execute theloopbody()
@ -634,0 +648,4 @@
// exec_body : fn(ExecBodyParams),
exec_body : fn(Arc<RwLock<ExecBodyParams>>),
// pub parent_params : ExecBodyParams ,
pub parent_params : Arc<RwLock<ExecBodyParams>> ,
Code Fix Involved :
Arc<RwLock<BotAction>>
, useblocking_read()
&blocking_write()
ExecBodyParams::get_channel()
be defined as non-asyncRoutine
definition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)
parent_params : Arc<RwLock<ExecBodyParams>>
Routine::start()
involves atokio::task::spawn_blocking
execute theloopbody()
@ -1109,0 +1157,4 @@
drop(guard1);
});
loopbodyspawn.await.unwrap();
Code Fix Involved :
Arc<RwLock<BotAction>>
, useblocking_read()
&blocking_write()
ExecBodyParams::get_channel()
be defined as non-asyncRoutine
definition involve :exec_body : fn(Arc<RwLock<ExecBodyParams>>)
parent_params : Arc<RwLock<ExecBodyParams>>
Routine::start()
involves atokio::task::spawn_blocking
execute theloopbody()
@ -1152,6 +1209,8 @@ impl Routine {
}
dbg!("SUCCESS! Routinecompleted");
Resolution Confirmed
In
stdout
,fn
of the custom child body is reachedWe should use
async
customfn
definitions so we have the ability to await onChat.say()
callsDon't believe this is easy to just revert back without possibility re-introducing the locks - but will need to check
@ -253,0 +282,4 @@
// [ ] !! ISSUE : With this fix though, the custom fn is no longer
// async. This is an issue because chat module is async
// - There should be no need to change chat , as async allows
// us to multitask with messages
Issue with the Fix
If child Routine
fn
are defined as regularfn
rather thanasync
, Custom Routinefn
would not be able to callChat.say()
or similar functionalityHere for example, in a Custom Routine
fn innertester(params : Arc<RwLock<ExecBodyParams>>
, we can't callchat.say()
in this areaHonestly I'm favoring keeping the fox and maybe enhancing
Chat
so it has non-async calls forsay()
.Keep in mind in implementing, we cannot call
async
calls in a blocking context. This means if I have a nonasync
code block, code calls (e.g.fn
) cannot be awaited on. In other words in non Async fn, I cannot useclient.send().await
. There are ways around this though (later notes)I don't foresee issues with multitasking here if we enhance
Chat
so users essentially enqueue "messages" (consisting ofBotMsgType
andExecParams
) then the bot has a separatetokio::task
running that processes those messages (i.e., triggering the message using giventwitch-irc
call likeclient.say()
)In addition with in place, when custom module developers define a routine, they don't have to worry about locks as within the context of a custom built
Routine
(in theory) should be blockingcore
levelI'm thinking of using
async-channel
cratehttps://docs.rs/async-channel/latest/async_channel/fn.bounded.html
Hmm 🤔 I still need to review Tori's branch but wanted to mention the latest solution I started going for might not be the best path. I'm noticing some problems with what I've put in place so far (below), so I may separate my fixes out in separate branch
__
With above said, my solution's enhancements to
chat
are welcomed as this could be a workaround for locking issues if we can't identify how to address a lock, but willing to go down a non-async workflowProbably will use this branch as the main issue branch, but continue working on my solution in a separate branch
Just to note, I think making Routines define custom functions differently than other BotActions may get confusing for Custom Developers - so better to make them
async
across the board. In addition, we're giving Custom Developersasync
features ; if required underasync
, Custom Developers may be able to replicate blocking behavior if requiredView command line instructions
Checkout
From your project repository, check out a new branch and test the changes.Merge
Merge the changes and update on Forgejo.Warning: The "Autodetect manual merge" setting is not enabled for this repository, you will have to mark this pull request as manually merged afterwards.