
More checks before assuming FIM tokens #7644


Merged: 4 commits merged into ggml-org:master from non-llama-fim-fix on Jun 14, 2024

Conversation

CISC (Collaborator) commented May 30, 2024

Added a bare minimum of extra checks before assuming CodeLlama/CodeGemma FIM tokens.

This fixes Codestral and any other non-CodeLlama/CodeGemma model with the word "code" in general.name.

BTW, Codestral was initially released with the wrong vocab (missing the FIM tokens); an updated vocab was released today. However, Codestral does not use the middle token, so I'm generating GGUFs without it. This might require a separate fix in the server.cpp /infill route!

The GGUFs are up in case anyone wants to test (and perhaps implement an spm_infill argument to switch to the Suffix/Prefix/Middle sequence).

Fixes #7592

mofosyne added labels on May 30, 2024:
  • Review Complexity : Low (trivial changes to code that most beginner devs, or those who want a break, can tackle, e.g. UI fix)
  • bugfix (fixes an issue or bug)
  • medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features but still usable)
mofosyne added the merge ready label (indicates that this may be ready to merge soon and is just holding out in case of objections) on May 30, 2024
github-actions bot (Contributor) commented May 30, 2024

📈 llama.cpp server for bench-server-baseline on Standard_NC4as_T4_v3 for phi-2-q4_0: 527 iterations 🚀

Details:
  • Concurrent users: 8, duration: 10m
  • HTTP request : avg=8901.42ms p(95)=21984.39ms fails=, finish reason: stop=475 truncated=52
  • Prompt processing (pp): avg=106.63tk/s p(95)=484.55tk/s
  • Token generation (tg): avg=59.75tk/s p(95)=49.26tk/s
  • ggml-org/models/phi-2/ggml-model-q4_0.gguf parallel=8 ctx-size=16384 ngl=33 batch-size=2048 ubatch-size=256 pp=1024 pp+tg=2048 branch=non-llama-fim-fix commit=f4df3ffdf718eb8c5637991a59ca2f8ec30117cc

[Chart: llamacpp:prompt_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations)]
[Chart: llamacpp:predicted_tokens_seconds (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations)]

[Chart: llamacpp:kv_cache_usage_ratio (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations)]
[Chart: llamacpp:requests_processing (llama.cpp bench-server-baseline on Standard_NC4as_T4_v3, duration=10m, 527 iterations)]

@CISC CISC requested a review from ggerganov June 3, 2024 14:47
CISC changed the title from "More checks before assuming FIM tokens for Llama arch" to "More checks before assuming FIM tokens" on Jun 6, 2024
mofosyne (Collaborator) commented Jun 9, 2024

@CISC what's going on with the CI? Can you double-check? If it's not related to your code, you may want to rebase against the last known good CI build on master.

CISC (Collaborator, Author) commented Jun 9, 2024

@mofosyne I have no idea, but the Server checks seem to randomly fail for everyone, not just here.

CISC (Collaborator, Author) commented Jun 9, 2024

@mofosyne So, syncing seems to have "fixed" it, just the normal failures now, no clue what changed. :)

CISC (Collaborator, Author) commented Jun 13, 2024

@ggerganov Do the checks look ok now?

SPM infill just got merged in llama-cpp-python, so this is the only remaining issue blocking Codestral Fill-in-Middle.

ggerganov (Member) commented
Let's merge it, but I think we are doing it wrong when it comes to FIM tokens. I haven't looked into the details deeply, but this code definitely looks like a big hack. We have to try to figure out something more robust.

@ggerganov ggerganov merged commit 6fcd133 into ggml-org:master Jun 14, 2024
59 of 72 checks passed
@CISC CISC deleted the non-llama-fim-fix branch June 14, 2024 10:22
CISC (Collaborator, Author) commented Jun 14, 2024

The main problem is that there's no way to know if

  1. there are FIM tokens in the vocab
  2. the FIM tokens are used by the model

IMO it is unsafe to make assumptions here; you really need to know that the model in question has the correct tokens and supports using them.

@CISC CISC mentioned this pull request Jun 26, 2024
Labels
  • bugfix: fixes an issue or bug
  • medium severity: used to report medium severity bugs in llama.cpp (e.g. malfunctioning features but still usable)
  • merge ready: indicates that this may be ready to merge soon and is just holding out in case of objections
  • Review Complexity : Low: trivial changes to code that most beginner devs (or those who want a break) can tackle, e.g. UI fix
Development

Successfully merging this pull request may close these issues.

Bug: Some "code" models invoke undefined behavior at load time after #6745
3 participants