c45y Posted September 28, 2015 Report Share Posted September 28, 2015 (edited) So, as you all know PvE is faced with a bit of a lag problem, currently we have tried a lot of things but I am personally running out of ideas. Now you are all smart people, and full of great ideas, and we are very much open to any suggestion you may have. Below is a braindump of the current state, and will be updated as we progress. Thank you very much for your understanding as we work through this, and any assistance you can provide. Problem: Shits laggy yo, currently unknown trigger or steps to reproduce, appears to be related to people joining the server in some instances but not always. State: Ram: 16GB Build: 1.8.7 spigot Plugins: (44): AdminHunt, AsyncWorldEdit, AsyncWorldEditInjector, BookExploitFix, bPermissions, ClientPermissions, ClockBlocked, CobraCorral, CommandBlock, CommandHelper, Doppelganger, dynmap, KeepBabyMobs, KitchenSink, LogBlock, LolNeigh, LolNo, LWC, MaskOfFutures, MobLimiter, ModMode, ModReq, Multiverse-Core, NerdFlags, NerdList, NerdMessage, NerdSpawn, NoCheatPlus, OpenInv, PacketWrapper, PlumpXP, ProperTime, ProtocolLib, RedstoneClockDetector, SafeBuckets, SBC, StandMaster9000, SureLog, TPControl, VanishNoPacket, WGCustomFlags, WorldBorder, WorldEdit, WorldGuard Versions: Changed between this rev and last are underlined Spigot 1.8 git-Spigot-550ebac-7019900 -> 1.8.7 git-Spigot-f928e7a-e91aed8 bPermissions v2.10.9m -> v2.12 zombieKill v1.0.4 -> # SBC v0.2.1 -> v1.2.2 AsyncWorldEditInjector v2.1.3 -> v2.1.3 SureLog v0.1 -> v0.1 WorldEdit v6.0.2-SNAPSHOT;no_git_id -> v6.1.1-SNAPSHOT;3394-9a65bdb7 MobLimiter v1.1.3 -> v1.1.3 KeepBabyMobs v0.1.2 -> v0.1.2 StandMaster9000 v0.0.3 -> v0.0.3 Multiverse-Core v2.5-b691 -> v2.5-b697 LogBlock v1.94-dev-SNAPSHOT (build #256) -> v1.94-dev-SNAPSHOT (build #260) LWC v4.5.0-SNAPSHOT -> v4.5.0-SNAPSHOT VillageMarker v0.33 -> # WorldGuard v6.0.0-SNAPSHOT.1583- -> v6.1.1-SNAPSHOT.1628-e28b236 CommandBlock v2.0 -> v2.0 CobraCorral v1.2 -> v1.2.2 dynmap v2.1-SNAPSHOT-1091 -> v2.1-SNAPSHOT-1091 NerdList v0.1 -> v0.2 ModReq v1.6 -> v1.6 NerdSpawn v2.0 -> v2.0 AdminHunt v1.0-1.6.4 -> v1.0-1.6.4 LolNeigh v1.0.0 -> v1.0.0 LolNo v0.4 -> v0.4 Doppelganger v0.9.2 -> v0.9.2 BookExploitFix v0.5 -> v0.6 ClientPermissions v1.1.0 -> v1.1.0 NerdMessage v0.5 -> v0.5 SignKitten v0.1 -> # OpenInv v2.2.8 -> v2.3.4 SafeBuckets v0.7.1 -> v0.7.2 TPControl v0.3 -> v0.4.2 ProtocolLib v3.6.3-SNAPSHOT -> v3.6.4 WGCustomFlags v1.8.1-beta -> v1.8.1-beta WorldBorder v1.7.9 -> v1.7.9 NerdFlags v0.1.1 -> v0.1.4 AsyncWorldEdit v2.1.3 -> v2.1.5 NoCheatPlus v3.12.0-SNAPSHOT-sMD5NET-b793 -> v3.12.1-SNAPSHOT-sMD5NET-b849 CommandHelper v3.3.1-SNAPSHOT.2833- -> v3.3.1-SNAPSHOT.2910- VanishNoPacket v3.19.1 -> v3.19.1 ModMode v3.5.0 -> v3.5.0 KitchenSink v0.9.2 -> v0.9.4 NerdGreeting v2.0.0 -> # MaskOfFutures v0.10.3.1 -> v0.10.3.1 PlumpXP # -> v0.4 ProperTime # -> v2.2 RedstoneClockDetector # -> v0.2.7-b12 ClockBlocked # -> v1.0.0 Things tested: Running with NoCheatPlus disabled Running with CommandHelper disabled Running with PlumpXP disabled Running with MobLimiter disabled Migrated LWC from sqlite to mysql Reverted bPermissions version to last stable, lag got worse Reviewed redstone activity with RedstoneClockDetector, appears normal Reviewed spigot timings reports @todo generate and attach during peak times Bypassed BungeeCord connecting to PvE directly, still lag spikes Observations: Lag seems related to login events ( by extension chunk loads ) reverting bPermissions from v2.12 to v2.10 increased lag noticeably, needs a review Teleporting to unloaded chunks sometimes triggers a lag spike Issue could not be triggered using two accounts on an alternate server, seems to be tied to in game actions/number of people online Opening chests appears to trigger lag occasionally, though not consistently Edited September 29, 2015 by c45y include spigot build info 2 Quote Link to comment Share on other sites More sharing options...
totemo Posted September 29, 2015 Report Share Posted September 29, 2015 The lag spikes appear to be a stop-the-world type phenomenon. By bumping up the RAM usage we eliminated a big garbage collection pause on P as a possible cause. As I said at the time, I don't think the individual bypassing of BungeeCord was a complete test, since BungeeCord is a shared resource for all players. If BungeeCord can in some way hold up or disrupt the main game thread then other players on Bungee can still cause issues. Today, I see an admincraft post about this: * https://www.reddit.com/r/admincraft/comments/3miu6u/bungeecord_memory_usage_skyrocketing/ * https://www.spigotmc.org/threads/running-bungeecord-with-2gb-of-memory-or-less-with-g1gc.91807/#post-1017207 Indicating that BungeeCord needs a lot of RAM in recent versions and also needs particular GC flags to the JVM. I found out when I sat in on that testing that we're on default args for the BungeeCord JVM. I'd say that's an avenue of investigation. It's possible that there are multiple issues and that BungeeCord is an amplifier of any lag that does come up, as indicated in the post. It would be consistent with the occasional lag around chunk loading when teleporting too. One counter to theories along this line was that neither C nor S (when it existed) was seeing these issues. I think the main lag situations appear to be on logins and saves. As I said last night (my timezone) I do wonder about the name API but I haven't been able to specifically trigger a noticeable latency issue with it. It wouldn't explain the chunk loading anyway. Quote Link to comment Share on other sites More sharing options...
totemo Posted September 29, 2015 Report Share Posted September 29, 2015 Get rid of SureLog. It was just experimental stuff and absolultely doesn't need to be there. I doubt (hope) it's not having any effect. :) Quote Link to comment Share on other sites More sharing options...
totemo Posted September 29, 2015 Report Share Posted September 29, 2015 Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that. Curious about Multiverse. And further questions: * NerdList is in the changed list. Can we get a source code link? Quote Link to comment Share on other sites More sharing options...
tompreuss Posted September 29, 2015 Report Share Posted September 29, 2015 (edited) Versions: For folks who might know what to look for and/or just want to browse the actual changes to the listed plugins, I've added relevant links to code diffs where possible: Spigot 1.8 -> 1.8.7Do you have build numbers? "git-Spigot-07a5b1b-04afd79 (MC: 1.8.7)" or somesuch would allow for pin pointing commits. Spigot 1.8 git-Spigot-550ebac-7019900 -> 1.8.7 git-Spigot-f928e7a-e91aed8Atlassian doesn't allow for commit diff links as far as I can tell, so the best single link I can supply is to the 550ebac to current HEAD changes, so just only look at the diffs from the bottom of the table up to the f928e7aa73b 21-Jun-2015 commit. bPermissions v2.10.9m -> v2.12v2.12 is still beta, github doesn't have latest beta commits SBC v0.2.1 -> v1.2.2github diff WorldEdit v6.0.2-SNAPSHOT;no_git_id -> v6.1.1-SNAPSHOT;3394-9a65bdb7github diff Multiverse-Core v2.5-b691 -> v2.5-b697github diff LogBlock v1.94-dev-SNAPSHOT (build #256) -> v1.94-dev-SNAPSHOT (build #260)github diff WorldGuard v6.0.0-SNAPSHOT.1583- -> v6.1.1-SNAPSHOT.1628-e28b236github diff CobraCorral v1.2 -> v1.2.2github diff NerdList v0.1 -> v0.2Not open source? BookExploitFix v0.5 -> v0.6Not open source, has only changelog. OpenInv v2.2.8 -> v2.3.4github diff SafeBuckets v0.7.1 -> v0.7.2github diff TPControl v0.3 -> v0.4.2github diff ProtocolLib v3.6.3-SNAPSHOT -> v3.6.4 github diff (best guess based on dates and version numbers supplied) NerdFlags v0.1.1 -> v0.1.4github diff AsyncWorldEdit v2.1.3 -> v2.1.5github diff NoCheatPlus v3.12.0-SNAPSHOT-sMD5NET-b793 -> v3.12.1-SNAPSHOT-sMD5NET-b849github diff CommandHelper v3.3.1-SNAPSHOT.2833- -> v3.3.1-SNAPSHOT.2910-github diff KitchenSink v0.9.2 -> v0.9.4github diff PlumpXP # -> v0.4New, but following changes made since use on S: github diff ProperTime # -> v2.2New to P, no changes since use on S, github RedstoneClockDetector # -> v0.2.7-b12New, github ClockBlocked # -> v1.0.0New, github ---- (edits: fixing links) (edit to add link to spigot changes due to edit in parent) Edited September 29, 2015 by tompreuss Quote Link to comment Share on other sites More sharing options...
c45y Posted September 29, 2015 Author Report Share Posted September 29, 2015 (edited) Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that. Curious about Multiverse. And further questions: * NerdList is in the changed list. Can we get a source code link? Up to the original author, I'll let them decide ( have pinged on irc ) -- Have update to include spigot build info -- Surelog is removed, no idea what it was anyway -- While I am surprised it could be bungee after experiencing the same issues while not connected through it it does look like there is some intense heap clearing going on, I'll have a read of those threads and make some changes once im finished here at work -- Edit BungeeCord is now running with the specified args Edited September 29, 2015 by c45y Quote Link to comment Share on other sites More sharing options...
c45y Posted September 29, 2015 Author Report Share Posted September 29, 2015 Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that. Curious about Multiverse. And further questions: * NerdList is in the changed list. Can we get a source code link? From the original topic, Running with NoCheatPlus disabled Running with CommandHelper disabled Running with PlumpXP disabled Running with MobLimiter disabled multiverse has not been disabled in running, though in theory we could do that live and just drop the admin test world, will test later tonight Quote Link to comment Share on other sites More sharing options...
totemo Posted September 29, 2015 Report Share Posted September 29, 2015 It seemed like the changed JVM flags for BungeeCord had a pronounced negative effect on latency. There was a considerable increase in in the amount of NCP spam about "too many packets". Although the effect was detrimental, it's interesting that there was an effect. It will be interesting to see what the heap is doing. One interesting fact that's apparent from the graph you posted: we're churning through about 80MB of young generation heap every minute. And then after 45 minutes or so, we run out of headroom and the GC clears up the old generation. I'm not sure if those are the old GC settings or the new. But I do wonder, how long is that young generation GC taking, in particular: in comparison to a tick (50ms). According to this article G1 will give us lower latency collections than CMS which is what we would want, I guess. Looking through Oracle's docs, I notice that the default goal for a GC pause with G1 is 200ms. That's 4 ticks. Could we reduce that somewhat drastically: -XX:MaxGCPauseMillis=25? The third heading at that page, "The G1 Garbage Collector", goes into detail on how G1 operates and differs from CMS. According to this, we should be able to achieve 10ms collections for BungeeCord. We might need to run with a fair bit more headroom though. The G1 command line options are summarised rather well here, bottom of the page. Quote Link to comment Share on other sites More sharing options...
c45y Posted September 29, 2015 Author Report Share Posted September 29, 2015 BungeeCord has been modified to have a 1G heap, as well as using maxgcpausemillis set to 50, visualvm heap graphs will be taken during higher usage periods tomorrow to compare. Currently lag appears to be triggered by Join/Quit and sometimes chest access. Quote Link to comment Share on other sites More sharing options...
roastnewt Posted September 29, 2015 Report Share Posted September 29, 2015 Are you running Java 7 or 8? They made some changes re: GC as discussed above. Quote Link to comment Share on other sites More sharing options...
c45y Posted September 29, 2015 Author Report Share Posted September 29, 2015 (edited) It appears java 7 is our environment for the servers. PvE and BungeeCord have been updated to use java8, pve seems to have a nicer heap graph, while bungee still has quite a spikey map Edited September 30, 2015 by c45y Quote Link to comment Share on other sites More sharing options...
redwall_hp Posted September 30, 2015 Report Share Posted September 30, 2015 I just lowered -XX:G1HeapRegionSize to 4M. C45Y's still running the graph, but initial report sounds like it's smoother. Quote Link to comment Share on other sites More sharing options...
redwall_hp Posted September 30, 2015 Report Share Posted September 30, 2015 Here are Spigot timings with around 50 players online: http://timings.aikar.co/?url=12622106 Note that the lag spikes occur even when there are more like 10 players on, when most of the flagged values are reasonably low. (Will post a timing from off-peek hours later.) Edit: And here's c45y's latest graph of the BungeeCord GC. Quote Link to comment Share on other sites More sharing options...
totemo Posted September 30, 2015 Report Share Posted September 30, 2015 You are using DynMap, and its rendering is causing you a decent amoung of lag due to it loading chunks. I'm not sure what triggers that particular message or where they learned to spell "amount" but can we run for a while without DynMap to see what effect that has. That latest GC graph for BungeeCord shows memory usage spikes of about 175 MB, e.g. at about 12:08, 12:18, 12:23, 12:28. Do they correlate with saves? Where do the lag spikes happen on that time scale? What would be causing bungee to suddenly buffer so much stuff? It seems to be instantaneous, rather than following the preceding slope of the graph. Makes me wonder about TCP receive buffer tuning: http://www.cyberciti.biz/faq/linux-tcp-tuning/. I'm not sure that the graph really points to that. Just throwing it in there in case it triggers any thoughts. Quote Link to comment Share on other sites More sharing options...
tompreuss Posted September 30, 2015 Report Share Posted September 30, 2015 I'm not sure what triggers that particular message or where they learned to spell "amount" From aikar/timings/legacy/index.php: if ($event == "Task: Unknown(Single)" && substr($plugin, 0, 6) == "dynmap" && $pct_tot_raw > 0.005) { $recommendations[] = "<b>You are using DynMap, and its rendering is causing you a decent amoung of lag due to it loading chunks.</b>"; } Quote Link to comment Share on other sites More sharing options...
redwall_hp Posted September 30, 2015 Report Share Posted September 30, 2015 I'm not sure what triggers that particular message or where they learned to spell "amount" but can we run for a while without DynMap to see what effect that has. We ran it with dynmap disabled for a few days before the livemap was made public. It didn't make a difference to the lag spikes. Quote Link to comment Share on other sites More sharing options...
redwall_hp Posted September 30, 2015 Report Share Posted September 30, 2015 This timing is from a few nights ago, during off-peak hours: http://timings.aikar.co/?url=12601271 12 players online, still lag spikes. Quote Link to comment Share on other sites More sharing options...
totemo Posted September 30, 2015 Report Share Posted September 30, 2015 Next thing to try, if relevant: -XX:+PerfDisableSharedMem From TL;DR: http://www.evanjones.ca/jvm-mmap-pause.html Quote Link to comment Share on other sites More sharing options...
redwall_hp Posted September 30, 2015 Report Share Posted September 30, 2015 Since there have been observations that spikes may correlate with players joining, more specifically joining on unloaded chunks, here's a list of every plugin that registers an on-join listener. (As reported by a plugin with an on-join handler with MONITOR priority.) ProtocolLib, bPermissions, SBC, CommandHelper, OpenInv, NoCheatPlus, NoCheatPlus, NoCheatPlus, NoCheatPlus, VanishNoPacket, Multiverse-Core, WorldGuard, WorldGuard, CobraCorral, ModReq, NerdSpawn, CommandHelper, AsyncWorldEdit, ModMode, VanishNoPacket, LolNo, LWC, ProtocolLib, LogBlock, NerdList, dynmap, dynmap, notifyshit, KitchenSink And the quit listeners: CommandHelper, ModMode, KitchenSink, Multiverse-Core, CobraCorral, CommandHelper, CommandHelper, TPControl, AsyncWorldEdit, VanishNoPacket, LolNo, ClientPermissions, LWC, ProtocolLib, StandMaster9000, LogBlock, dynmap, notifyshit, OpenInv, NoCheatPlus, NoCheatPlus Quote Link to comment Share on other sites More sharing options...
c45y Posted September 30, 2015 Author Report Share Posted September 30, 2015 Bungee has got some new flags that actually make sense for our environment PvE also got some new flags, which results in a much nicer heap These were taken at relatively low usage periods, with only 13 people on P and 1 on C, will be taken again at high usage times to compare. We are now confident the issue is not with our GC, but with a plugin as originally investigated. Work will continue on tracking it down and figuring out why it does not show up in timing reports Quote Link to comment Share on other sites More sharing options...
c45y Posted October 1, 2015 Author Report Share Posted October 1, 2015 (edited) Issue confirmed as CobraCorral handling of ChunkUnloadEvent, specific cause not found, we are reviewing the source now and reaching out to the plugin author With more debugging we have narrowed the issue to what appears to be async database updates. However for whatever reason each locked horse per chunk unload takes 1000ms to complete the following function if(utils.isHorseLocked((Horse)entity)) { LockedHorse lhorse = config.Database.getHorse(((Horse)entity).getUniqueId()); utils.updateHorse(lhorse, (Horse)entity); } https://gist.github.com/stevommmm/0fa7948c2fabb60fb904 At this stage my theory is that https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/Database.java#L97 is a blocking operation, however I can't seem to make sense of the javadocs at 1am Debug build has confirmed that the call to getHorses is the issue https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/CorralListener.java#L586 https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/Database.java#L196 For some reason horseCache is not being populated, so every time we ask Database for a horse it is querying the database on the main thread ( thus lag ) Edited October 2, 2015 by c45y Quote Link to comment Share on other sites More sharing options...
tompreuss Posted October 1, 2015 Report Share Posted October 1, 2015 Issue confirmed as CobraCorral handling of ChunkUnloadEvent, specific cause not found, we are reviewing the source now and reaching out to the plugin author With more debugging we have narrowed the issue to what appears to be async database updates. However for whatever reason each locked horse per chunk unload takes 1000ms to complete the following function if(utils.isHorseLocked((Horse)entity)) { LockedHorse lhorse = config.Database.getHorse(((Horse)entity).getUniqueId()); utils.updateHorse(lhorse, (Horse)entity); } https://gist.github.com/stevommmm/0fa7948c2fabb60fb904 Note for casual readers that what is seen in that log paste is not "each locked horse per chunk unload" on the server taking more than 1000ms, it's only the locked horses that are taking more than 1000ms that are shown: Long duration = System.currentTimeMillis() - begin; if (duration > 1) { this.plugin.getLogger().log(Level.INFO, "ChunkUnloadEvent; duration={0}", duration); } All the horses that aren't taking longer than 1000ms wouldn't be shown. CobraCorral v1.2 -> v1.2.2github diff I don't see any changes to onChunkUnload() here; additionally this function has been there since March of 2014, why hasn't it manifested this problem until now at the new rev start? Quote Link to comment Share on other sites More sharing options...
totemo Posted October 1, 2015 Report Share Posted October 1, 2015 Just to deepen the mystery even further, C45Y: CobraCorral v1.2.2 Total: 0.006 s Pct: 0.01% Pct Total Pct Tick Total Avg PerTick Count Event 0.01% 0.01% 0.00 s 0.01 ms 0.5 0.6k CorralListener::onChunkUnload(ChunkUnloadEvent) 0.00% 2.52% 0.00 s 1.26 ms 0.0 0.0k CorralListener::onPlayerJoin(PlayerJoinEvent) 0.00% 0.00% 0.00 s 0.00 ms 0.2 0.3k CorralListener::onEntityDamage(EntityDamageEvent) 0.00% 0.02% 0.00 s 0.01 ms 0.0 0.0k CorralListener::onInventoryClose(InventoryCloseEvent) 0.00% 0.42% 0.00 s 0.21 ms 0.0 0.0k CorralListener::onPlayerInteractEntity(PlayerInteractEntityEvent) 0.00% 0.01% 0.00 s 0.00 ms 0.0 0.1k CorralListener::onInventoryClick(InventoryClickEvent) 0.00% 0.00% 0.00 s 0.00 ms 0.0 0.0k CorralListener::onEntityDamageByEntity(EntityDamageByEntityEvent) 0.00% 0.00% 0.00 s 0.00 ms 0.0 0.0k CorralListener::onInventoryOpen(InventoryOpenEvent) 0.00% 0.00% 0.00 s 0.00 ms 0.0 0.0k CorralListener::onEntityDeath(EntityDeathEvent) 0.00% 0.01% 0.00 s 0.00 ms 0.0 0.0k CorralListener::onInventoryDrag(InventoryDragEvent) 0.00% 0.00% 0.00 s 0.00 ms 0.0 0.0k CorralListener::onVehicleExit(VehicleExitEvent) onChunkUnload() was called about 600 times, averaging 0.01ms per call in one of the more recent timings that you guys linked: http://timings.aikar.co/?url=12629871. You were saying that you saw an 11,000ms chunk unload event. It's really strange that this doesn't show up on the timings. 11000ms/600 = 18.3 ms. A single bad event should have a noticeable effect on the average. Quote Link to comment Share on other sites More sharing options...
redwall_hp Posted October 1, 2015 Report Share Posted October 1, 2015 onChunkUnload() was called about 600 times, averaging 0.01ms per call in one of the more recent timings that you guys linked: http://timings.aikar.co/?url=12629871. You were saying that you saw an 11,000ms chunk unload event. It's really strange that this doesn't show up on the timings. 11000ms/600 = 18.3 ms. A single bad event should have a noticeable effect on the average. One would assume. Then there's this timing: http://timings.aikar.co/?url=12629871 High load, lots of red events for entities and mob AI. But the average tps during the sample is still over 19. Usually when that happens, TPS drops significantly. Quote Link to comment Share on other sites More sharing options...
totemo Posted October 2, 2015 Report Share Posted October 2, 2015 That's the same timing Redwall. I assume you meant to link a different one. Can we see the one that shows the problem? I'm confused about the TPS drops. Last I heard, TPS was weirdly rock solid: Quote Link to comment Share on other sites More sharing options...
Recommended Posts
Join the conversation
You can post now and register later. If you have an account, sign in now to post with your account.