Expose http_cache parameter, with its docs and recipe #407


Merged
merged 1 commit into dev from http-cache-parameter on Oct 15, 2021

Conversation

rayluo
Collaborator

@rayluo rayluo commented Sep 15, 2021

This PR implements the internal proposal of http cache persistence.

Because MSAL 1.14's in-memory http cache behavior (i.e. #379) was specifically designed around a mechanism of "using a generic key-value pair storage as http cache", all the heavy lifting was already done in MSAL 1.14. This PR adds only one line to expose the new http_cache parameter, and another line to wire it up. The rest of the change is documentation, available for proofreading (please search for the http_cache keyword in this doc_staging environment).

When an app developer utilizes this PR's new parameter to plug in a persisted http cache, MSAL Python will cache that internal "instance metadata" across different PublicClientApplication and/or ConfidentialClientApplication instances, which resolves #80. Consequently, it also resolves #334. UPDATE: Thanks @jiasli for testing and confirming that this PR also resolves #334.
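
For reference, here is a minimal sketch of how an app could opt in, along the lines of the shelve-based recipe this PR documents (the file name and client_id below are placeholders, not values prescribed by the PR):

import atexit
import shelve

import msal

persisted_http_cache = shelve.open("msal_http_cache.bin")  # a dict-like store
atexit.register(persisted_http_cache.close)  # flush it to disk on exit

app = msal.PublicClientApplication(
    "client_id",  # placeholder
    http_cache=persisted_http_cache,  # the new parameter this PR exposes
    )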

@rayluo rayluo force-pushed the http-cache-parameter branch from 46814f2 to f88a573 Compare September 16, 2021 01:50
@jiasli
Contributor

jiasli commented Sep 20, 2021

I encountered a failure while adopting http_cache:

Traceback (most recent call last):
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 111, in __getitem__
    value = self.cache[key]
KeyError: '_index_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\cli-msal\knack\knack\cli.py", line 231, in invoke
    cmd_result = self.invocation.execute(args)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 657, in execute
    raise ex
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 720, in _run_jobs_serially
    results.append(self._run_job(expanded_arg, cmd_copy))
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 691, in _run_job
    result = cmd_copy(params)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 328, in __call__
    return self.handler(*args, **kwargs)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\command_operation.py", line 121, in handler
    return op(**command_args)
  File "d:\cli-msal\azure-cli\src\azure-cli\azure\cli\command_modules\profile\custom.py", line 147, in login
    subscriptions = profile.login(
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\_profile.py", line 184, in login
    subscriptions = subscription_finder.find_using_common_tenant(username, credential)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\_profile.py", line 735, in find_using_common_tenant
    specific_tenant_credential = identity.get_user_credential(username)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\auth\identity.py", line 193, in get_user_credential
    return UserCredential(self.client_id, username, **self._msal_app_kwargs)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\auth\msal_authentication.py", line 26, in __init__
    super().__init__(client_id, **kwargs)
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\application.py", line 1447, in __init__
    super(PublicClientApplication, self).__init__(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\application.py", line 420, in __init__
    self.authority = Authority(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\authority.py", line 83, in __init__
    openid_config = tenant_discovery(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\authority.py", line 142, in tenant_discovery
    resp = http_client.get(tenant_discovery_endpoint, **kwargs)
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\individual_cache.py", line 263, in wrapper
    return self._mapping[key]
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\individual_cache.py", line 139, in __getitem__
    sequence, timestamps = self._mapping.get(self._INDEX, ([], {}))
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 106, in get
    return self[key]
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 114, in __getitem__
    value = Unpickler(f).load()
_pickle.UnpicklingError: pickle data was truncated

This is because CLI creates multiple MSAL PublicClientApplication instances for different tenants, each with its own shelve instance. According to

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L215-L221

            # Note that _index may be out of synch with the directory
            # file now:  _setval() and _addval() don't update the directory
            # file.  This also means that the on-disk directory and data
            # files are in a mutually inconsistent state, and they'll
            # remain that way until _commit() is called.  Note that this
            # is a disaster (for the database) if the program crashes
            # (so that _commit() never gets called).

once MSAL changes the content of _index_, dbm.dumb's .dat file is updated but its .dir file is not. If another dbm.dumb instance then reads from the out-of-sync .dat and .dir pair, the error above occurs.
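
To make that concrete, here is a tiny illustration (the file name is hypothetical; the behavior is as described by the comment quoted above) of which on-disk file gets updated when:

import dbm.dumb

db = dbm.dumb.open("httpcache")  # creates httpcache.dat and httpcache.dir
db["key"] = b"old"               # brand-new key: .dat and .dir both updated
db["key"] = b"longer new value"  # replacing a key: .dat is updated, but the
                                 # on-disk .dir still records the old entry
db.sync()                        # sync()/close() call _commit(), which
                                 # rewrites .dir so the two files agree again
db.close()

A second process reading the files between the replacement and the sync sees exactly the mutually inconsistent state described above.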

@jiasli
Contributor

jiasli commented Sep 20, 2021

On Windows, there is no dbm.gnu or dbm.ndbm support, so dbm falls back to dbm.dumb, which is an unreliable solution (by the admission of its very developer).

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L1-L22 (from python/cpython@9f824a7 created in 1995)

"""A dumb and slow but simple dbm clone.

For database spam, spam.dir contains the index (a text file),
spam.bak *may* contain a backup of the index (also a text file),
while spam.dat contains the data (a binary file).

XXX TO DO:

- seems to contain a bug when updating...

- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)

- support concurrent access (currently, if two processes take turns making
updates, they can mess up the index)

- support efficient access to large databases (currently, the whole index
is read when the database is opened, and some updates rewrite the whole index)

- support opening for read-only (flag = 'm')

"""

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L232-L235

        # XXX It's unclear why we do a _commit() here (the code always
        # XXX has, so I'm not changing it).  __setitem__ doesn't try to
        # XXX keep the directory file in synch.  Why should we?  Or
        # XXX why shouldn't __setitem__?

Problems:

  1. Using such an unreliable solution can break MSAL or Azure CLI under unexpected/unknown circumstances.
  2. As dbm.dumb has no concurrency support along the lines of msal_extensions.cache_lock.CrossPlatLock, MSAL can break when invoked by multiple processes.

@jiasli
Contributor

jiasli commented Sep 20, 2021

Alternative solutions I can think of are:

  1. Use a pure JSON cache like token_cache for the result of /organizations/v2.0/.well-known/openid-configuration so that we can handle the cache with existing FilePersistence.
  2. ADAL has no tenant discovery logic and Azure CLI works well, so there could be a way to use the hard-coded /authorize and /token endpoints (openid-configuration HTTP request slows down MSAL #334).

@rayluo
Collaborator Author

rayluo commented Sep 20, 2021

Thank you, Jiashuo, for your in-depth report, as usual! Let me answer your questions in reverse order.

Alternative solutions I can think of are:

(2). ADAL has no tenant discovery logic and Azure CLI works well, so there could be a way to use the hard-coded /authorize and /token endpoints (openid-configuration HTTP request slows down MSAL #334).

This PR introduces a generic http cache concept. When an http_cache instance is provided by the app developer, MSAL will utilize that newly obtained capability to cache some http responses in certain situations. Tenant discovery is just one of the situations that benefits from http_cache. There are other, less frequent situations in which MSAL would still need an http_cache (which we can discuss in detail offline). In that sense, an alternative for tenant discovery alone is not a decisive reason for us to withdraw the effort on http_cache.

(1). Use a pure JSON cache like token_cache for the result of /organizations/v2.0/.well-known/openid-configuration so that we can handle the cache with existing FilePersistence.

The current PR, at its core, defines/exposes a generic http cache as a dict-like interface. It is not tied to any specific implementation of such a dict-like object. shelve was chosen as a "recipe" only, because it is already in the standard library and it (is supposed to) already behave like a dict, so our recipe looks concise. But we can always switch to any equivalent implementation if we need to. Searching "python shelve alternative" turns up many options.

An earlier prototype of the recipe was indeed based on dict + pickle.dump(...), therefore bypassing dbm. You could also choose dict + pickle.dumps(...) + MSAL Extensions' FilePersistence, which was designed for a (token cache) file shared among multiple concurrent and potentially long-lived processes (but then you would need to somehow utilize its time_last_modified() and/or even use a lock).
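
For illustration, here is a rough sketch of what that dict + pickle.dump(...) shape could look like (the file name and the error handling are my assumptions, not the actual prototype):

import atexit
import pickle

HTTP_CACHE_PATH = "http_cache.bin"  # hypothetical location

try:
    with open(HTTP_CACHE_PATH, "rb") as f:  # pickle requires binary mode
        http_cache = pickle.load(f)  # a plain dict, so dbm is bypassed
except FileNotFoundError:
    http_cache = {}  # no cache file yet on the first run

def _persist():
    with open(HTTP_CACHE_PATH, "wb") as f:
        pickle.dump(http_cache, f)

atexit.register(_persist)  # write the dict back on normal interpreter exit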

On Windows, there is no dbm.gnu, dbm.ndbm support, so dbm will fall back to dbm.dumb which is an unreliable solution (said by its very developer).

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L1-L22 (from python/cpython@9f824a7 created in 1995)

"""A dumb and slow but simple dbm clone.

...

XXX TO DO:

- seems to contain a bug when updating...

...

"""

Thanks for bringing me to the amusing history of Python dbm. I wasn't aware that your term "its very developer" meant GvR himself. :-)

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L232-L235

        # XXX It's unclear why we do a _commit() here (the code always
        # XXX has, so I'm not changing it).  __setitem__ doesn't try to
        # XXX keep the directory file in synch.  Why should we?  Or
        # XXX why shouldn't __setitem__?

Problems:

  1. Using such an unreliable solution can break MSAL or Azure CLI under unexpected/unknown circumstances.

But it seems those opening lines were only added in the first commit and then never changed, while all of the implementation has changed since then. In particular, Jiashuo, your quote above was added in this commit, which came with this commit message: "databases are associated with corruption problems, so I studied this code carefully and ran some brutal stress tests. I didn't find any bugs...".

  2. As dbm.dumb has no concurrency support along the lines of msal_extensions.cache_lock.CrossPlatLock, MSAL can break when invoked by multiple processes.

True. If you/we foresee multiple Azure CLI instances being run concurrently, Azure CLI can consider using msal_extensions.cache_lock.CrossPlatLock. That lock was also designed as a generic, self-contained module in itself, so you could also use a shelve + msal_extensions.cache_lock.CrossPlatLock combination (see the sketch below). If we really decide to go in this direction, it would probably also mean MSAL Python needs to surface an http_cache_lock parameter. :-/ Let me know.
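
For what it's worth, a minimal sketch of that shelve + CrossPlatLock combination (the file names are hypothetical). This only demonstrates the locking primitive; making MSAL itself take the lock around its internal cache accesses is precisely what the hypothetical http_cache_lock parameter above would be for:

import shelve

from msal_extensions.cache_lock import CrossPlatLock

with CrossPlatLock("msal_http_cache.lockfile"):  # a cross-process mutex
    with shelve.open("msal_http_cache.bin") as http_cache:
        http_cache["some_key"] = "some_value"  # reads/writes under the lock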

because CLI creates multiple MSAL PublicClientApplication instances for different tenants, each with its own shelve instance.

This might be the very specific reason you are running into the issue. Perhaps you can try creating only one shelve instance, and sharing it among your multiple PublicClientApplication instances (which do not overlap with each other and all run inside one process?), along the lines of the sketch below.
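
In code, that suggestion could look something like this (client_id and the authorities are placeholders):

import shelve

import msal

http_cache = shelve.open("msal_http_cache.bin")  # the one and only instance

app_a = msal.PublicClientApplication(
    "client_id",
    authority="https://login.microsoftonline.com/tenant-a",
    http_cache=http_cache)
app_b = msal.PublicClientApplication(
    "client_id",
    authority="https://login.microsoftonline.com/tenant-b",
    http_cache=http_cache)  # same object, so no second shelve ever opens
                            # the same underlying files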

@jiasli
Contributor

jiasli commented Sep 21, 2021

In particular, Jiashuo your quote above, was added in this commit, which came with this commit message: "databases are associated with corruption problems, so I studied this code carefully and ran some brutal stress tests. I didn't find any bugs...".

Having tested and inspected the source code, I can see that this task is still not implemented:

- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)

Consider a very simple snippet:

import dbm.dumb

db = dbm.dumb.open(r'D:\testdbm')

db['mykey'] = 'a' * 512   # fits in one 512-byte block
db['mykey'] = 'b' * 1024  # too big for the old block, so it is appended;
                          # the space held by the old value is leaked

Running this script multiple times makes D:\testdbm.dat keep growing. When dbm.dumb is used as an HTTP cache, space occupied by expired entries is never released, so the cache grows indefinitely.

If we really decide to go in this direction, it would probably also mean MSAL Python needs to surface an http_cache_lock parameter. :-/ Let me know.

This would certainly be a nice-to-have feature if MSAL wants to support concurrency.

Perhaps you can try creating only one shelve instance, and sharing it among your multiple PublicClientApplication instances (which do not overlap with each other and all run inside one process?).

This is exactly what I am going to do next.

@jiasli
Contributor

jiasli commented Sep 24, 2021

On Linux, dbm.gnu has no concurrency support either.

Consider a simple script:

# testdbm.py
import dbm.gnu
import time

db = dbm.gnu.open('testdbm', 'c')
time.sleep(60)
# Put it in the background
$ python3 testdbm.py &
[1] 1007

# This will fail
$ python3 testdbm.py
Traceback (most recent call last):
  File "testdbm.py", line 4, in <module>
    db = dbm.gnu.open('testdbm', 'c')
_gdbm.error: [Errno 11] Resource temporarily unavailable: 'testdbm'

I also see this:

https://docs.python.org/3/library/shelve.html#restrictions

The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing.

@rayluo
Collaborator Author

rayluo commented Sep 30, 2021

On Linux, dbm.gnu has no concurrency support either.

OK, good to know. Now, given that you mentioned there could potentially be multiple Azure CLI instances running on the same machine, the goal becomes "to find a dict-like component that supports concurrent access". We may be able to sift through existing packages on PyPI to see if any of them would fit, or we may eventually develop a new package for that. But for now, perhaps this recipe would be a good-enough workaround.

@jiasli
Contributor

jiasli commented Oct 8, 2021

Just want to mention that the file should be opened in binary mode b:

https://docs.python.org/3/library/pickle.html#pickle.Pickler

class pickle.Pickler(file, protocol=None, *, fix_imports=True, buffer_callback=None)
This takes a binary file for writing a pickle data stream.

@jiasli
Contributor

jiasli commented Oct 8, 2021

Also, writing to the same file from 2 processes can result in corrupted file content (such as Azure/azure-cli#9427). Unpickling from a corrupted binary file will result in UnpicklingError.
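
One common mitigation for such torn writes (my own suggestion here, not something this PR or the recipe prescribes) is to write to a sibling temp file and atomically swap it into place, so readers only ever see a complete old file or a complete new file:

import os
import pickle
import tempfile

def atomic_pickle_dump(obj, path):
    # Create the temp file in the same directory, so os.replace() stays on
    # one filesystem and remains atomic on both POSIX and Windows. Note this
    # does not serialize concurrent writers (the last writer wins), and on
    # Windows the replace can still raise PermissionError if another process
    # holds the destination file open.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise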

@rayluo
Collaborator Author

rayluo commented Oct 10, 2021

Just want to mention that the file should be opened in binary mode b:

Unpickling from a corrupted binary file will result in UnpicklingError.

Thanks for the catch! Both of your comments were correct, and I have now updated the previous recipe accordingly. Please give me a confirmation, and then I'll fold it into this PR and merge it in.

@jiasli
Contributor

jiasli commented Oct 11, 2021

I further noticed that the documentation mentions:

https://docs.python.org/3/library/pickle.html#pickle.UnpicklingError

Note that other exceptions may also be raised during unpickling, including (but not necessarily limited to) AttributeError, EOFError, ImportError, and IndexError.

I will try to ignore all errors during unpickling.

@rayluo
Collaborator Author

rayluo commented Oct 11, 2021

I further noticed that the documentation mentions:

https://docs.python.org/3/library/pickle.html#pickle.UnpicklingError

Note that other exceptions may also be raised during unpickling, including (but not necessarily limited to) AttributeError, EOFError, ImportError, and IndexError.

I will try to ignore all errors during unpickling.

Good finding, and an intriguing topic.

  • There is no way for us to list ALL exceptions that could potentially be raised by pickle.load(...). The official doc clearly says "including but not necessarily limited to".
  • But it perhaps does not matter. My interpretation is that all those "including but not limited to AttributeError, EOFError, ImportError, and IndexError" are lower-level exceptions that would possibly only happen when a bug exists inside the pickling-and-unpickling round trip. We don't really care about those; we care about the expected UnpicklingError caused by, say, data corruption. That being said, it doesn't hurt to include them in your implementation (a sketch follows below).
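
Concretely, the two positions differ only in how wide the except tuple is. Here is a sketch of the tolerant variant under discussion (the path is hypothetical and the tuple is illustrative):

import pickle

http_cache_path = "http_cache.bin"  # hypothetical location
try:
    with open(http_cache_path, "rb") as f:
        http_cache = pickle.load(f)
except (
        IOError,                 # no http cache file yet (first run)
        pickle.UnpicklingError,  # the "expected" data-corruption case
        EOFError,                # e.g. an empty or truncated cache file
        AttributeError, ImportError, IndexError,  # other corruption modes
        ):
    http_cache = {}  # silently start over with an empty cache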

@rayluo rayluo force-pushed the http-cache-parameter branch 2 times, most recently from 0f9b4b8 to 6c92857 Compare October 11, 2021 19:27
@jiasli
Contributor

jiasli commented Oct 12, 2021

I admit the Python document is pretty vague about the error.

only happen when a bug exists inside the pickling-and-unpickling round trip.

We don't know what exactly the bug is. In other words, we are not sure which kinds of corrupted files result in which kinds of exceptions. It may be that a file corrupted in one way triggers UnpicklingError, while a file corrupted in another way triggers EOFError.

For example, unpickling from an empty file results in:

Traceback (most recent call last):
  File "D:\cli\testproj\main.py", line 4, in <module>
    dic2 = pickle.load(f)
EOFError: Ran out of input

@rayluo rayluo force-pushed the http-cache-parameter branch from 6c92857 to d9d0e38 Compare October 12, 2021 03:06
IOError, # A non-exist http cache file
pickle.UnpicklingError, # A corrupted http cache file
EOFError, # An empty http cache file
AttributeError, ImportError, IndexError. # Other corruption
Contributor

The doc says "but not necessarily limited to". I think other errors could break the execution. 🤔

Collaborator Author

@rayluo rayluo Oct 12, 2021

Technically, yes. That was exactly why I mentioned earlier that "There is no way for us to list ALL exceptions that could potentially be raised by pickle.load(...)".

But I still think those underlying exceptions would probably only be raised when an issue/bug occurs OUTSIDE OF the pickle module, so that they even "surprise" the pickle module. For example, the EOFError that you managed to produce in a lab environment would probably NOT happen with normal concurrent pickle.dump() calls, because even pickle.dumps(None) (or pickle.dumps("") etc.) produces a non-empty outcome. In other words, when/if EOFError happens in real life, it is more likely caused by an issue in OUR application, such as our app incorrectly truncating the http cache file. In that case, we would want such an EOFError to bubble up, so that we become aware of it and go fix it in our application code base.

In short, I still think those "EOFError, AttributeError, ImportError, IndexError" do not strictly need to be caught.

Contributor

It seems EOFError, AttributeError, ImportError, IndexError still made their way into the final commit fcf34a2.

@rayluo rayluo force-pushed the http-cache-parameter branch from d9d0e38 to 62752ad Compare October 12, 2021 17:03
@rayluo rayluo merged commit fcf34a2 into dev Oct 15, 2021
@rayluo rayluo deleted the http-cache-parameter branch October 15, 2021 17:39
@rayluo rayluo mentioned this pull request Oct 29, 2021

Successfully merging this pull request may close these issues.

openid-configuration HTTP request slows down MSAL #334
Instance metadata caching #80