Jump to content
c45y

PvE lag topic

Recommended Posts

So, as you all know PvE is faced with a bit of a lag problem, currently we have tried a lot of things but I am personally running out of ideas. Now you are all smart people, and full of great ideas, and we are very much open to any suggestion you may have.

 

Below is a braindump of the current state, and will be updated as we progress. Thank you very much for your understanding as we work through this, and any assistance you can provide.

 

 

Problem: Shits laggy yo, currently unknown trigger or steps to reproduce, appears to be related to people joining the server in some instances but not always.

 

State:

Ram: 16GB

Build: 1.8.7 spigot

Plugins: (44): AdminHunt, AsyncWorldEdit, AsyncWorldEditInjector, BookExploitFix, bPermissions, ClientPermissions, ClockBlocked, CobraCorral, CommandBlock, CommandHelper, Doppelganger, dynmap, KeepBabyMobs, KitchenSink, LogBlock, LolNeigh, LolNo, LWC, MaskOfFutures, MobLimiter, ModMode, ModReq, Multiverse-Core, NerdFlags, NerdList, NerdMessage, NerdSpawn, NoCheatPlus, OpenInv, PacketWrapper, PlumpXP, ProperTime, ProtocolLib, RedstoneClockDetector, SafeBuckets, SBC, StandMaster9000, SureLog, TPControl, VanishNoPacket, WGCustomFlags, WorldBorder, WorldEdit, WorldGuard

 

 

Versions: Changed between this rev and last are underlined

  • Spigot 1.8 git-Spigot-550ebac-7019900 -> 1.8.7 git-Spigot-f928e7a-e91aed8
  • bPermissions v2.10.9m -> v2.12
  • zombieKill v1.0.4 -> #
  • SBC v0.2.1 -> v1.2.2
  • AsyncWorldEditInjector v2.1.3 -> v2.1.3
  • SureLog v0.1 -> v0.1
  • WorldEdit v6.0.2-SNAPSHOT;no_git_id -> v6.1.1-SNAPSHOT;3394-9a65bdb7
  • MobLimiter v1.1.3 -> v1.1.3
  • KeepBabyMobs v0.1.2 -> v0.1.2
  • StandMaster9000 v0.0.3 -> v0.0.3
  • Multiverse-Core v2.5-b691 -> v2.5-b697
  • LogBlock v1.94-dev-SNAPSHOT (build #256) -> v1.94-dev-SNAPSHOT (build #260)
  • LWC v4.5.0-SNAPSHOT -> v4.5.0-SNAPSHOT
  • VillageMarker v0.33 -> #
  • WorldGuard v6.0.0-SNAPSHOT.1583- -> v6.1.1-SNAPSHOT.1628-e28b236
  • CommandBlock v2.0 -> v2.0
  • CobraCorral v1.2 -> v1.2.2
  • dynmap v2.1-SNAPSHOT-1091 -> v2.1-SNAPSHOT-1091
  • NerdList v0.1 -> v0.2
  • ModReq v1.6 -> v1.6
  • NerdSpawn v2.0 -> v2.0
  • AdminHunt v1.0-1.6.4 -> v1.0-1.6.4
  • LolNeigh v1.0.0 -> v1.0.0
  • LolNo v0.4 -> v0.4
  • Doppelganger v0.9.2 -> v0.9.2
  • BookExploitFix v0.5 -> v0.6
  • ClientPermissions v1.1.0 -> v1.1.0
  • NerdMessage v0.5 -> v0.5
  • SignKitten v0.1 -> #
  • OpenInv v2.2.8 -> v2.3.4
  • SafeBuckets v0.7.1 -> v0.7.2
  • TPControl v0.3 -> v0.4.2
  • ProtocolLib v3.6.3-SNAPSHOT -> v3.6.4
  • WGCustomFlags v1.8.1-beta -> v1.8.1-beta
  • WorldBorder v1.7.9 -> v1.7.9
  • NerdFlags v0.1.1 -> v0.1.4
  • AsyncWorldEdit v2.1.3 -> v2.1.5
  • NoCheatPlus v3.12.0-SNAPSHOT-sMD5NET-b793 -> v3.12.1-SNAPSHOT-sMD5NET-b849
  • CommandHelper v3.3.1-SNAPSHOT.2833- -> v3.3.1-SNAPSHOT.2910-
  • VanishNoPacket v3.19.1 -> v3.19.1
  • ModMode v3.5.0 -> v3.5.0
  • KitchenSink v0.9.2 -> v0.9.4
  • NerdGreeting v2.0.0 -> #
  • MaskOfFutures v0.10.3.1 -> v0.10.3.1
  • PlumpXP # ->  v0.4
  • ProperTime # -> v2.2
  • RedstoneClockDetector # -> v0.2.7-b12
  • ClockBlocked # -> v1.0.0
 
Things tested:
  1. Running with NoCheatPlus disabled
  2. Running with CommandHelper disabled
  3. Running with PlumpXP disabled
  4. Running with MobLimiter disabled
  5. Migrated LWC from sqlite to mysql
  6. Reverted bPermissions version to last stable, lag got worse
  7. Reviewed redstone activity with RedstoneClockDetector, appears normal
  8. Reviewed spigot timings reports @todo generate and attach during peak times
  9. Bypassed BungeeCord connecting to PvE directly, still lag spikes

Observations:

  • Lag seems related to login events ( by extension chunk loads )
  • reverting bPermissions from v2.12 to v2.10 increased lag noticeably, needs a review
  • Teleporting to unloaded chunks sometimes triggers a lag spike
  • Issue could not be triggered using two accounts on an alternate server, seems to be tied to in game actions/number of people online
  • Opening chests appears to trigger lag occasionally, though not consistently 
Edited by c45y
include spigot build info
  • Upvote 2

Share this post


Link to post
Share on other sites

The lag spikes appear to be a stop-the-world type phenomenon.  By bumping up the RAM usage we eliminated a big garbage collection pause on P as a possible cause.

 

As I said at the time, I don't think the individual bypassing of BungeeCord was a complete test, since BungeeCord is a shared resource for all players.  If BungeeCord can in some way hold up or disrupt the main game thread then other players on Bungee can still cause issues.

 

Today, I see an admincraft post about this:

 

 * https://www.reddit.com/r/admincraft/comments/3miu6u/bungeecord_memory_usage_skyrocketing/

 * https://www.spigotmc.org/threads/running-bungeecord-with-2gb-of-memory-or-less-with-g1gc.91807/#post-1017207

 

Indicating that BungeeCord needs a lot of RAM in recent versions and also needs particular GC flags to the JVM.  I found out when I sat in on that testing that we're on default args for the BungeeCord JVM.  I'd say that's an avenue of investigation.  It's possible that there are multiple issues and that BungeeCord is an amplifier of any lag that does come up, as indicated in the post.  It would be consistent with the occasional lag around chunk loading when teleporting too.

 

One counter to theories along this line was that neither C nor S (when it existed) was seeing these issues.  

 

I think the main lag situations appear to be on logins and saves.  As I said last night (my timezone) I do wonder about the name API but I haven't been able to specifically trigger a noticeable latency issue with it.  It wouldn't explain the chunk loading anyway.

Share this post


Link to post
Share on other sites

Get rid of SureLog. It was just experimental stuff and absolultely doesn't need to be there. I doubt (hope) it's not having any effect. :)

Share this post


Link to post
Share on other sites

Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that.  Curious about Multiverse.

 

And further questions:

 

 * NerdList is in the changed list.  Can we get a source code link?

Share this post


Link to post
Share on other sites

Versions:

For folks who might know what to look for and/or just want to browse the actual changes to the listed plugins, I've added relevant links to code diffs where possible:
  • Spigot 1.8 -> 1.8.7
    • Do you have build numbers?  "git-Spigot-07a5b1b-04afd79 (MC: 1.8.7)" or somesuch would allow for pin pointing commits.
      • Spigot 1.8 git-Spigot-550ebac-7019900 -> 1.8.7 git-Spigot-f928e7a-e91aed8
        • ​Atlassian doesn't allow for commit diff links as far as I can tell, so the best single link I can supply is to the 550ebac to current HEAD changes, so just only look at the diffs from the bottom of the table up to the f928e7aa73b 21-Jun-2015 commit.
  • bPermissions v2.10.9m -> v2.12
    • v2.12 is still beta, github doesn't have latest beta commits
  • SBC v0.2.1 -> v1.2.2
  • WorldEdit v6.0.2-SNAPSHOT;no_git_id -> v6.1.1-SNAPSHOT;3394-9a65bdb7
  • Multiverse-Core v2.5-b691 -> v2.5-b697
  • LogBlock v1.94-dev-SNAPSHOT (build #256) -> v1.94-dev-SNAPSHOT (build #260)
  • WorldGuard v6.0.0-SNAPSHOT.1583- -> v6.1.1-SNAPSHOT.1628-e28b236
  • CobraCorral v1.2 -> v1.2.2
  • NerdList v0.1 -> v0.2
    • Not open source?
  • BookExploitFix v0.5 -> v0.6
  • OpenInv v2.2.8 -> v2.3.4
  • SafeBuckets v0.7.1 -> v0.7.2
  • TPControl v0.3 -> v0.4.2
  • ProtocolLib v3.6.3-SNAPSHOT -> v3.6.4
    • github diff (best guess based on dates and version numbers supplied)
  • NerdFlags v0.1.1 -> v0.1.4
  • AsyncWorldEdit v2.1.3 -> v2.1.5
  • NoCheatPlus v3.12.0-SNAPSHOT-sMD5NET-b793 -> v3.12.1-SNAPSHOT-sMD5NET-b849
  • CommandHelper v3.3.1-SNAPSHOT.2833- -> v3.3.1-SNAPSHOT.2910-
  • KitchenSink v0.9.2 -> v0.9.4
  • PlumpXP # ->  v0.4
    • New, but following changes made since use on S: github diff
  • ProperTime # -> v2.2
    • New to P, no changes since use on S, github
  • RedstoneClockDetector # -> v0.2.7-b12
  • ClockBlocked # -> v1.0.0
----

(edits: fixing links)

 

(edit to add link to spigot changes due to edit in parent)

Edited by tompreuss

Share this post


Link to post
Share on other sites

Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that.  Curious about Multiverse.

 

And further questions:

 

 * NerdList is in the changed list.  Can we get a source code link?

 

Up to the original author, I'll let them decide ( have pinged on irc )

 

--

 

Have update to include spigot build info

 

--

 

Surelog is removed, no idea what it was anyway

 

--

 

While I am surprised it could be bungee after experiencing the same issues while not connected through it it does look like there is some intense heap clearing going on, I'll have a read of those threads and make some changes once im finished here at work

 

WOQGujI.png

 

-- Edit

 

BungeeCord is now running with the specified args

 

ingDZgD.png

Edited by c45y

Share this post


Link to post
Share on other sites

Also, C45Y, can you indicate which plugins specifically you have previously disabled in testing. Mark with a starr or something like that.  Curious about Multiverse.

 

And further questions:

 

 * NerdList is in the changed list.  Can we get a source code link?

 

From the original topic,

  1. Running with NoCheatPlus disabled
  2. Running with CommandHelper disabled
  3. Running with PlumpXP disabled
  4. Running with MobLimiter disabled

multiverse has not been disabled in running, though in theory we could do that live and just drop the admin test world, will test later tonight

Share this post


Link to post
Share on other sites

It seemed like the changed JVM flags for BungeeCord had a pronounced negative effect on latency.  There was a considerable increase in in the amount of NCP spam about "too many packets".  Although the effect was detrimental, it's interesting that there was an effect.  It will be interesting to see what the heap is doing.
 
One interesting fact that's apparent from the graph you posted: we're churning through about 80MB of young generation heap every minute.  And then after 45 minutes or so, we run out of headroom and the GC clears up the old generation.
 
I'm not sure if those are the old GC settings or the new.  But I do wonder, how long is that young generation GC taking, in particular: in comparison to a tick (50ms).  According to this article G1 will give us lower latency collections than CMS which is what we would want, I guess. 
 
Looking through Oracle's docs, I notice that the default goal for a GC pause with G1 is 200ms.  That's 4 ticks.  Could we reduce that somewhat drastically: -XX:MaxGCPauseMillis=25?   The third heading at that page, "The G1 Garbage Collector", goes into detail on how G1 operates and differs from CMS.  According to this, we should be able to achieve 10ms collections for BungeeCord.  We might need to run with a fair bit more headroom though.

 

The G1 command line options are summarised rather well here, bottom of the page.
 
 

Share this post


Link to post
Share on other sites

BungeeCord has been modified to have a 1G heap, as well as using maxgcpausemillis set to 50, visualvm heap graphs will be taken during higher usage periods tomorrow to compare.

 

Currently lag appears to be triggered by Join/Quit and sometimes chest access.

Share this post


Link to post
Share on other sites

It appears java 7 is our environment for the servers.

 

PvE and BungeeCord have been updated to use java8, pve seems to have a nicer heap graph, while bungee still has quite a spikey map

Edited by c45y

Share this post


Link to post
Share on other sites

Here are Spigot timings with around 50 players online: http://timings.aikar.co/?url=12622106

 

Note that the lag spikes occur even when there are more like 10 players on, when most of the flagged values are reasonably low. (Will post a timing from off-peek hours later.)

 

Edit: And here's c45y's latest graph of the BungeeCord GC.

 

KpEiWyi.png

Share this post


Link to post
Share on other sites

You are using DynMap, and its rendering is causing you a decent amoung of lag due to it loading chunks.

I'm not sure what triggers that particular message or where they learned to spell "amount" but can we run for a while without DynMap to see what effect that has.

 

That latest GC graph for BungeeCord shows memory usage spikes of about 175 MB, e.g. at about 12:08, 12:18, 12:23, 12:28.  Do they correlate with saves?  Where do the lag spikes happen on that time scale?  What would be causing bungee to suddenly buffer so much stuff?  It seems to be instantaneous, rather than following the preceding slope of the graph.

 

Makes me wonder about TCP receive buffer tuning: http://www.cyberciti.biz/faq/linux-tcp-tuning/.  I'm not sure that the graph really points to that.  Just throwing it in there in case it triggers any thoughts.

Share this post


Link to post
Share on other sites

I'm not sure what triggers that particular message or where they learned to spell "amount"

From aikar/timings/legacy/index.php:

if ($event == "Task: Unknown(Single)" && substr($plugin, 0, 6) == "dynmap" && $pct_tot_raw > 0.005) {
    $recommendations[] = "<b>You are using DynMap, and its rendering is causing you a decent amoung of lag due to it loading chunks.</b>";
}

Share this post


Link to post
Share on other sites

I'm not sure what triggers that particular message or where they learned to spell "amount" but can we run for a while without DynMap to see what effect that has.

 

We ran it with dynmap disabled for a few days before the livemap was made public. It didn't make a difference to the lag spikes.

Share this post


Link to post
Share on other sites

Since there have been observations that spikes may correlate with players joining, more specifically joining on unloaded chunks, here's a list of every plugin that registers an on-join listener. (As reported by a plugin with an on-join handler with MONITOR priority.)

ProtocolLib, bPermissions, SBC, CommandHelper, OpenInv, NoCheatPlus, NoCheatPlus, NoCheatPlus, NoCheatPlus, VanishNoPacket, Multiverse-Core, WorldGuard, WorldGuard, CobraCorral, ModReq, NerdSpawn, CommandHelper, AsyncWorldEdit, ModMode, VanishNoPacket, LolNo, LWC, ProtocolLib, LogBlock, NerdList, dynmap, dynmap, notifyshit, KitchenSink

And the quit listeners:

CommandHelper, ModMode, KitchenSink, Multiverse-Core, CobraCorral, CommandHelper, CommandHelper, TPControl, AsyncWorldEdit, VanishNoPacket, LolNo, ClientPermissions, LWC, ProtocolLib, StandMaster9000, LogBlock, dynmap, notifyshit, OpenInv, NoCheatPlus, NoCheatPlus

Share this post


Link to post
Share on other sites

Bungee has got some new flags that actually make sense for our environment

 

ClYg4B6.png

 

PvE also got some new flags, which results in a much nicer heap

S23M9mD.png

 

These were taken at relatively low usage periods, with only 13 people on P and 1 on C, will be taken again at high usage times to compare.

 

We are now confident the issue is not with our GC, but with a plugin as originally investigated. Work will continue on tracking it down and figuring out why it does not show up in timing reports

Share this post


Link to post
Share on other sites

Issue confirmed as CobraCorral handling of ChunkUnloadEvent, specific cause not found, we are reviewing the source now and reaching out to the plugin author

 

With more debugging we have narrowed the issue to what appears to be async database updates. However for whatever reason each locked horse per chunk unload takes 1000ms to complete the following function

if(utils.isHorseLocked((Horse)entity)) {
    LockedHorse lhorse = config.Database.getHorse(((Horse)entity).getUniqueId());
    utils.updateHorse(lhorse, (Horse)entity);
}

https://gist.github.com/stevommmm/0fa7948c2fabb60fb904

 

At this stage my theory is that https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/Database.java#L97 is a blocking operation, however I can't seem to make sense of the javadocs at 1am

 

Debug build has confirmed that the call to getHorses is the issue

https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/CorralListener.java#L586

https://github.com/TheAcademician/CobraCorral/blob/master/src/us/drome/cobracorral/Database.java#L196

 

For some reason horseCache is not being populated, so every time we ask Database for a horse it is querying the database on the main thread ( thus lag )

Edited by c45y

Share this post


Link to post
Share on other sites

Issue confirmed as CobraCorral handling of ChunkUnloadEvent, specific cause not found, we are reviewing the source now and reaching out to the plugin author

 

With more debugging we have narrowed the issue to what appears to be async database updates. However for whatever reason each locked horse per chunk unload takes 1000ms to complete the following function

if(utils.isHorseLocked((Horse)entity)) {
    LockedHorse lhorse = config.Database.getHorse(((Horse)entity).getUniqueId());
    utils.updateHorse(lhorse, (Horse)entity);
}
https://gist.github.com/stevommmm/0fa7948c2fabb60fb904
 

Note for casual readers that what is seen in that log paste is not  "each locked horse per chunk unload" on the server taking more than 1000ms, it's only the locked horses that are taking more than 1000ms that are shown:

 

        Long duration = System.currentTimeMillis() - begin;
        if (duration > 1) {
            this.plugin.getLogger().log(Level.INFO, "ChunkUnloadEvent; duration={0}", duration);
        }
All the horses that aren't taking longer than 1000ms wouldn't be shown.

 

I don't see any changes to onChunkUnload() here; additionally this function has been there since March of 2014, why hasn't it manifested this problem until now at the new rev start?

Share this post


Link to post
Share on other sites

Just to deepen the mystery even further, C45Y:

CobraCorral v1.2.2 Total: 0.006 s	Pct: 0.01%
   Pct Total	Pct Tick	   Total	      Avg	 PerTick	     Count		Event
       0.01%	   0.01%	  0.00 s	     0.01 ms	 0.5	       0.6k	    CorralListener::onChunkUnload(ChunkUnloadEvent)
       0.00%	   2.52%	  0.00 s	     1.26 ms	 0.0	       0.0k	    CorralListener::onPlayerJoin(PlayerJoinEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.2	       0.3k	    CorralListener::onEntityDamage(EntityDamageEvent)
       0.00%	   0.02%	  0.00 s	     0.01 ms	 0.0	       0.0k	    CorralListener::onInventoryClose(InventoryCloseEvent)
       0.00%	   0.42%	  0.00 s	     0.21 ms	 0.0	       0.0k	    CorralListener::onPlayerInteractEntity(PlayerInteractEntityEvent)
       0.00%	   0.01%	  0.00 s	     0.00 ms	 0.0	       0.1k	    CorralListener::onInventoryClick(InventoryClickEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onEntityDamageByEntity(EntityDamageByEntityEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onInventoryOpen(InventoryOpenEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onEntityDeath(EntityDeathEvent)
       0.00%	   0.01%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onInventoryDrag(InventoryDragEvent)
       0.00%	   0.00%	  0.00 s	     0.00 ms	 0.0	       0.0k	    CorralListener::onVehicleExit(VehicleExitEvent)

onChunkUnload() was called about 600 times, averaging 0.01ms per call in one of the more recent timings that you guys linked: http://timings.aikar.co/?url=12629871.  You were saying that you saw an 11,000ms chunk unload event.  It's really strange that this doesn't show up on the timings.  11000ms/600 = 18.3 ms.  A single bad event should have a noticeable effect on the average.

Share this post


Link to post
Share on other sites

onChunkUnload() was called about 600 times, averaging 0.01ms per call in one of the more recent timings that you guys linked: http://timings.aikar.co/?url=12629871.  You were saying that you saw an 11,000ms chunk unload event.  It's really strange that this doesn't show up on the timings.  11000ms/600 = 18.3 ms.  A single bad event should have a noticeable effect on the average.

 

One would assume. Then there's this timing: http://timings.aikar.co/?url=12629871

 

High load, lots of red events for entities and mob AI. But the average tps during the sample is still over 19. Usually when that happens, TPS drops significantly.

Share this post


Link to post
Share on other sites

That's the same timing Redwall.  I assume you meant to link a different one.  Can we see the one that shows the problem?

 

I'm confused about the TPS drops.  Last I heard, TPS was weirdly rock solid:

 

tps.png

Share this post


Link to post
Share on other sites

Create an account or sign in to comment

You need to be a member in order to leave a comment

Create an account

Sign up for a new account in our community. It's easy!

Register a new account

Sign in

Already have an account? Sign in here.

Sign In Now

  • Recently Browsing   0 members

    No registered users viewing this page.

×