Expose http_cache parameter, with its docs and recipe #407


Merged
merged 1 commit into dev from http-cache-parameter on Oct 15, 2021

Conversation

rayluo
Collaborator

@rayluo rayluo commented Sep 15, 2021

This PR implements the internal proposal of http cache persistence.

Because MSAL 1.14's in-memory http cache behavior (i.e. #379) was specifically designed around a mechanism of "using a generic key-value pair storage as http cache", all the heavy lifting was already done in MSAL 1.14. This PR adds only one line to expose the new http_cache parameter, and another line to wire it up. The rest of the change is documentation, available for proofreading (please search for the http_cache keyword in this doc_staging environment).

When an app developer utilizes this PR's new parameter to plug in a persisted http cache, MSAL Python will cache that internal "instance metadata" across different PublicClientApplication and/or ConfidentialClientApplication instances, which resolves #80. Consequently, it also resolves #334. UPDATE: Thanks @jiasli for testing and confirming that this PR also resolves #334.
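
For reference, here is a minimal sketch of how an app could opt in, along the lines of the shelve-based recipe this PR documents (the file name and client_id below are placeholders, not values prescribed by the PR):

import atexit
import shelve

import msal

persisted_http_cache = shelve.open("msal_http_cache.bin")  # a dict-like store
atexit.register(persisted_http_cache.close)  # flush it to disk on exit

app = msal.PublicClientApplication(
    "client_id",  # placeholder
    http_cache=persisted_http_cache,  # the new parameter this PR exposes
    )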

@rayluo rayluo force-pushed the http-cache-parameter branch from 46814f2 to f88a573 Compare September 16, 2021 01:50
@jiasli
Contributor

jiasli commented Sep 20, 2021

I encountered a failure while adopting http_cache:

Traceback (most recent call last):
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 111, in __getitem__
    value = self.cache[key]
KeyError: '_index_'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "d:\cli-msal\knack\knack\cli.py", line 231, in invoke
    cmd_result = self.invocation.execute(args)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 657, in execute
    raise ex
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 720, in _run_jobs_serially
    results.append(self._run_job(expanded_arg, cmd_copy))
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 691, in _run_job
    result = cmd_copy(params)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\__init__.py", line 328, in __call__
    return self.handler(*args, **kwargs)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\commands\command_operation.py", line 121, in handler
    return op(**command_args)
  File "d:\cli-msal\azure-cli\src\azure-cli\azure\cli\command_modules\profile\custom.py", line 147, in login
    subscriptions = profile.login(
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\_profile.py", line 184, in login
    subscriptions = subscription_finder.find_using_common_tenant(username, credential)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\_profile.py", line 735, in find_using_common_tenant
    specific_tenant_credential = identity.get_user_credential(username)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\auth\identity.py", line 193, in get_user_credential
    return UserCredential(self.client_id, username, **self._msal_app_kwargs)
  File "d:\cli-msal\azure-cli\src\azure-cli-core\azure\cli\core\auth\msal_authentication.py", line 26, in __init__
    super().__init__(client_id, **kwargs)
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\application.py", line 1447, in __init__
    super(PublicClientApplication, self).__init__(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\application.py", line 420, in __init__
    self.authority = Authority(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\authority.py", line 83, in __init__
    openid_config = tenant_discovery(
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\authority.py", line 142, in tenant_discovery
    resp = http_client.get(tenant_discovery_endpoint, **kwargs)
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\individual_cache.py", line 263, in wrapper
    return self._mapping[key]
  File "d:\cli-msal\microsoft-authentication-library-for-python\msal\individual_cache.py", line 139, in __getitem__
    sequence, timestamps = self._mapping.get(self._INDEX, ([], {}))
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 106, in get
    return self[key]
  File "C:\Users\user1\AppData\Local\Programs\Python\Python39\lib\shelve.py", line 114, in __getitem__
    value = Unpickler(f).load()
_pickle.UnpicklingError: pickle data was truncated

This is because CLI creates multiple MSAL PublicClientApplication instances for different tenants, each with its own shelve instance. According to

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L215-L221

            # Note that _index may be out of synch with the directory
            # file now:  _setval() and _addval() don't update the directory
            # file.  This also means that the on-disk directory and data
            # files are in a mutually inconsistent state, and they'll
            # remain that way until _commit() is called.  Note that this
            # is a disaster (for the database) if the program crashes
            # (so that _commit() never gets called).

once MSAL changes the content of _index_, dbm.dumb's .dat file is updated but its .dir file is not. If another dbm.dumb instance then reads from the out-of-sync .dat and .dir pair, the error above occurs.
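
To make that concrete, here is a tiny illustration (the file name is hypothetical; the behavior is as described by the comment quoted above) of which on-disk file gets updated when:

import dbm.dumb

db = dbm.dumb.open("httpcache")  # creates httpcache.dat and httpcache.dir
db["key"] = b"old"               # brand-new key: .dat and .dir both updated
db["key"] = b"longer new value"  # replacing a key: .dat is updated, but the
                                 # on-disk .dir still records the old entry
db.sync()                        # sync()/close() call _commit(), which
                                 # rewrites .dir so the two files agree again
db.close()

A second process reading the files between the replacement and the sync sees exactly the mutually inconsistent state described above.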

@jiasli
Contributor

jiasli commented Sep 20, 2021

On Windows, there is no dbm.gnu or dbm.ndbm support, so dbm falls back to dbm.dumb, which is an unreliable solution (by the admission of its very developer).

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L1-L22 (from python/cpython@9f824a7 created in 1995)

"""A dumb and slow but simple dbm clone.

For database spam, spam.dir contains the index (a text file),
spam.bak *may* contain a backup of the index (also a text file),
while spam.dat contains the data (a binary file).

XXX TO DO:

- seems to contain a bug when updating...

- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)

- support concurrent access (currently, if two processes take turns making
updates, they can mess up the index)

- support efficient access to large databases (currently, the whole index
is read when the database is opened, and some updates rewrite the whole index)

- support opening for read-only (flag = 'm')

"""

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L232-L235

        # XXX It's unclear why we do a _commit() here (the code always
        # XXX has, so I'm not changing it).  __setitem__ doesn't try to
        # XXX keep the directory file in synch.  Why should we?  Or
        # XXX why shouldn't __setitem__?

Problems:

  1. Using such an unreliable solution can break MSAL or Azure CLI under unexpected/unknown circumstances.
  2. As dbm.dumb has no concurrency support along the lines of msal_extensions.cache_lock.CrossPlatLock, MSAL can break when invoked by multiple processes.

@jiasli
Contributor

jiasli commented Sep 20, 2021

Alternative solutions I can think of are:

  1. Use a pure JSON cache like token_cache for the result of /organizations/v2.0/.well-known/openid-configuration so that we can handle the cache with existing FilePersistence.
  2. ADAL has no tenant discovery logic and Azure CLI works well, so there could be a way to use the hard-coded /authorize and /token endpoints (openid-configuration HTTP request slows down MSAL #334).

@rayluo
Collaborator Author

rayluo commented Sep 20, 2021

Thank you, Jiashuo, for your in-depth report, as usual! Let me answer your questions in reverse order.

Alternative solutions I can think of are:

(2). ADAL has no tenant discovery logic and Azure CLI works well, so there could be a way to use the hard-coded /authorize and /token endpoints (openid-configuration HTTP request slows down MSAL #334).

This PR introduces a generic http cache concept. When an http_cache instance is provided by the app developer, MSAL will utilize that newly obtained capability to cache some http responses in certain situations. Tenant discovery is just one of the situations that benefits from http_cache. There are other, less frequent situations in which MSAL would still need an http_cache (which we can discuss in detail offline). In that sense, an alternative for tenant discovery alone is not a decisive reason for us to withdraw the effort on http_cache.

(1). Use a pure JSON cache like token_cache for the result of /organizations/v2.0/.well-known/openid-configuration so that we can handle the cache with existing FilePersistence.

The current PR, at its core, defines/exposes a generic http cache as a dict-like interface. It is not tied to any specific implementation of such a dict-like object. shelve was chosen as a "recipe" only, because it is already in the standard library and it (is supposed to) already behave like a dict, so our recipe looks concise. But we can always switch to any equivalent implementation if we need to. Searching "python shelve alternative" turns up many options.

An earlier prototype of the recipe was indeed based on dict + pickle.dump(...), therefore bypassing dbm. You could also choose dict + pickle.dumps(...) + MSAL Extensions' FilePersistence, which was designed for a (token cache) file shared among multiple concurrent and potentially long-lived processes (but then you would need to somehow utilize its time_last_modified() and/or even use a lock).
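
For illustration, here is a rough sketch of what that dict + pickle.dump(...) shape could look like (the file name and the error handling are my assumptions, not the actual prototype):

import atexit
import pickle

HTTP_CACHE_PATH = "http_cache.bin"  # hypothetical location

try:
    with open(HTTP_CACHE_PATH, "rb") as f:  # pickle requires binary mode
        http_cache = pickle.load(f)  # a plain dict, so dbm is bypassed
except FileNotFoundError:
    http_cache = {}  # no cache file yet on the first run

def _persist():
    with open(HTTP_CACHE_PATH, "wb") as f:
        pickle.dump(http_cache, f)

atexit.register(_persist)  # write the dict back on normal interpreter exit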

On Windows, there is no dbm.gnu, dbm.ndbm support, so dbm will fall back to dbm.dumb which is an unreliable solution (said by its very developer).

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L1-L22 (from python/cpython@9f824a7 created in 1995)

"""A dumb and slow but simple dbm clone.

...

XXX TO DO:

- seems to contain a bug when updating...

...

"""

Thanks for bringing me to the amusing history of Python dbm. I wasn't aware that your term "its very developer" meant GvR himself. :-)

https://github.com/python/cpython/blob/v3.9.7/Lib/dbm/dumb.py#L232-L235

        # XXX It's unclear why we do a _commit() here (the code always
        # XXX has, so I'm not changing it).  __setitem__ doesn't try to
        # XXX keep the directory file in synch.  Why should we?  Or
        # XXX why shouldn't __setitem__?

Problems:

  1. Using such an unreliable solution can break MSAL or Azure CLI under unexpected/unknown circumstances.

But it seems those opening lines were only added in the first commit and then never changed, while all of the implementation has changed since then. In particular, Jiashuo, your quote above was added in this commit, which came with this commit message: "databases are associated with corruption problems, so I studied this code carefully and ran some brutal stress tests. I didn't find any bugs...".

  2. As dbm.dumb has no concurrency support along the lines of msal_extensions.cache_lock.CrossPlatLock, MSAL can break when invoked by multiple processes.

True. If you/we foresee multiple Azure CLI instances being run concurrently, Azure CLI can consider using msal_extensions.cache_lock.CrossPlatLock. That lock was also designed as a generic, self-contained module in itself, so you could also use a shelve + msal_extensions.cache_lock.CrossPlatLock combination (see the sketch below). If we really decide to go in this direction, it would probably also mean MSAL Python needs to surface an http_cache_lock parameter. :-/ Let me know.
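
For what it's worth, a minimal sketch of that shelve + CrossPlatLock combination (the file names are hypothetical). This only demonstrates the locking primitive; making MSAL itself take the lock around its internal cache accesses is precisely what the hypothetical http_cache_lock parameter above would be for:

import shelve

from msal_extensions.cache_lock import CrossPlatLock

with CrossPlatLock("msal_http_cache.lockfile"):  # a cross-process mutex
    with shelve.open("msal_http_cache.bin") as http_cache:
        http_cache["some_key"] = "some_value"  # reads/writes under the lock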

because CLI creates multiple MSAL PublicClientApplication instances for different tenants, each with its own shelve instance.

This might be the very specific reason you are running into the issue. Perhaps you can try creating only one shelve instance, and sharing it among your multiple PublicClientApplication instances (which do not overlap with each other and all run inside one process?), along the lines of the sketch below.
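
In code, that suggestion could look something like this (client_id and the authorities are placeholders):

import shelve

import msal

http_cache = shelve.open("msal_http_cache.bin")  # the one and only instance

app_a = msal.PublicClientApplication(
    "client_id",
    authority="https://login.microsoftonline.com/tenant-a",
    http_cache=http_cache)
app_b = msal.PublicClientApplication(
    "client_id",
    authority="https://login.microsoftonline.com/tenant-b",
    http_cache=http_cache)  # same object, so no second shelve ever opens
                            # the same underlying files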

@jiasli
Contributor

jiasli commented Sep 21, 2021

In particular, Jiashuo your quote above, was added in this commit, which came with this commit message: "databases are associated with corruption problems, so I studied this code carefully and ran some brutal stress tests. I didn't find any bugs...".

Having tested and inspected the source code, I can see that this task is still not implemented:

- reclaim free space (currently, space once occupied by deleted or expanded
items is never reused)

Consider a very simple snippet:

import dbm.dumb

db = dbm.dumb.open(r'D:\testdbm')

db['mykey'] = 'a' * 512   # fits in one 512-byte block
db['mykey'] = 'b' * 1024  # too big for the old block, so it is appended;
                          # the space held by the old value is leaked

Running this script multiple times makes D:\testdbm.dat keep growing. When dbm.dumb is used as an HTTP cache, space occupied by expired entries is never released, so the cache grows indefinitely.

If we really decide to go in this direction, it would probably also mean MSAL Python needs to surface an http_cache_lock parameter. :-/ Let me know.

This would certainly be a nice-to-have feature if MSAL wants to support concurrency.

Perhaps you can try creating only one shelve instance, and sharing it among your multiple PublicClientApplication instances (which do not overlap with each other and all run inside one process?).

This is exactly what I am going to do next.

@jiasli
Contributor

jiasli commented Sep 24, 2021

On Linux, dbm.gnu has no concurrency support either.

Consider a simple script:

# testdbm.py
import dbm.gnu
import time

db = dbm.gnu.open('testdbm', 'c')
time.sleep(60)
# Put it in the background
$ python3 testdbm.py &
[1] 1007

# This will fail
$ python3 testdbm.py
Traceback (most recent call last):
  File "testdbm.py", line 4, in <module>
    db = dbm.gnu.open('testdbm', 'c')
_gdbm.error: [Errno 11] Resource temporarily unavailable: 'testdbm'

I also see this:

https://docs.python.org/3/library/shelve.html#restrictions

The shelve module does not support concurrent read/write access to shelved objects. (Multiple simultaneous read accesses are safe.) When a program has a shelf open for writing, no other program should have it open for reading or writing.

@rayluo
Collaborator Author

rayluo commented Sep 30, 2021

On Linux, dbm.gnu has no concurrency support either.

OK, good to know. Now, given that you mentioned there could potentially be multiple Azure CLI instances running on the same machine, the goal becomes "to find a dict-like component that supports concurrent access". We may be able to sift through existing packages on PyPI to see if any of them would fit, or we may eventually develop a new package for that. But for now, perhaps this recipe would be a good-enough workaround.

@jiasli
Contributor

jiasli commented Oct 8, 2021

Just want to mention that the file should be opened in binary mode b:

https://docs.python.org/3/library/pickle.html#pickle.Pickler

class pickle.Pickler(file, protocol=None, *, fix_imports=True, buffer_callback=None)
This takes a binary file for writing a pickle data stream.

@jiasli
Contributor

jiasli commented Oct 8, 2021

Also, writing to the same file from 2 processes can result in corrupted file content (such as Azure/azure-cli#9427). Unpickling from a corrupted binary file will result in UnpicklingError.
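
One common mitigation for such torn writes (my own suggestion here, not something this PR or the recipe prescribes) is to write to a sibling temp file and atomically swap it into place, so readers only ever see a complete old file or a complete new file:

import os
import pickle
import tempfile

def atomic_pickle_dump(obj, path):
    # Create the temp file in the same directory, so os.replace() stays on
    # one filesystem and remains atomic on both POSIX and Windows. Note this
    # does not serialize concurrent writers (the last writer wins), and on
    # Windows the replace can still raise PermissionError if another process
    # holds the destination file open.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp_path, path)
    except BaseException:
        os.remove(tmp_path)
        raise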

@rayluo
Collaborator Author

rayluo commented Oct 10, 2021

Just want to mention that the file should be opened in binary mode b:

Unpickling from a corrupted binary file will result in UnpicklingError.

Thanks for the catch! Both of your comments were correct, and I have now updated the previous recipe accordingly. Please give me a confirmation, and then I'll fold it into this PR and merge it in.

@jiasli
Contributor

jiasli commented Oct 11, 2021

I further noticed that the documentation mentions:

https://docs.python.org/3/library/pickle.html#pickle.UnpicklingError

Note that other exceptions may also be raised during unpickling, including (but not necessarily limited to) AttributeError, EOFError, ImportError, and IndexError.

I will try to ignore all errors during unpickling.

@rayluo
Collaborator Author

rayluo commented Oct 11, 2021

I further noticed that the documentation mentions:

https://docs.python.org/3/library/pickle.html#pickle.UnpicklingError

Note that other exceptions may also be raised during unpickling, including (but not necessarily limited to) AttributeError, EOFError, ImportError, and IndexError.

I will try to ignore all errors during unpickling.

Good finding, and an intriguing topic.

  • There is no way for us to list ALL exceptions that could potentially be raised by pickle.load(...). The official doc clearly says "including but not necessarily limited to".
  • But it perhaps does not matter. My interpretation is that all those "including but not limited to AttributeError, EOFError, ImportError, and IndexError" are lower-level exceptions that would possibly only happen when a bug exists inside the pickling-and-unpickling round trip. We don't really care about those; we care about the expected UnpicklingError caused by, say, data corruption. That being said, it doesn't hurt to include them in your implementation (a sketch follows below).
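
Concretely, the two positions differ only in how wide the except tuple is. Here is a sketch of the tolerant variant under discussion (the path is hypothetical and the tuple is illustrative):

import pickle

http_cache_path = "http_cache.bin"  # hypothetical location
try:
    with open(http_cache_path, "rb") as f:
        http_cache = pickle.load(f)
except (
        IOError,                 # no http cache file yet (first run)
        pickle.UnpicklingError,  # the "expected" data-corruption case
        EOFError,                # e.g. an empty or truncated cache file
        AttributeError, ImportError, IndexError,  # other corruption modes
        ):
    http_cache = {}  # silently start over with an empty cache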

@rayluo rayluo force-pushed the http-cache-parameter branch 2 times, most recently from 0f9b4b8 to 6c92857 Compare October 11, 2021 19:27
@jiasli
Contributor

jiasli commented Oct 12, 2021

I admit the Python document is pretty vague about the error.

only happen when a bug exists inside the pickling-and-unpickling round trip.

We don't know what exactly the bug is. In other words, we are not sure which kinds of corrupted files result in which kinds of exceptions. It may be that a file corrupted in one way triggers UnpicklingError, while a file corrupted in another way triggers EOFError.

For example, unpickling from an empty file results in:

Traceback (most recent call last):
  File "D:\cli\testproj\main.py", line 4, in <module>
    dic2 = pickle.load(f)
EOFError: Ran out of input

@rayluo rayluo force-pushed the http-cache-parameter branch from 6c92857 to d9d0e38 Compare October 12, 2021 03:06
IOError, # A non-exist http cache file
pickle.UnpicklingError, # A corrupted http cache file
EOFError, # An empty http cache file
AttributeError, ImportError, IndexError. # Other corruption
Contributor

The doc says "but not necessarily limited to". I think other errors could break the execution. 🤔

Collaborator Author

@rayluo rayluo Oct 12, 2021

Technically, yes. That was exactly why I mentioned earlier that "There is no way for us to list ALL exceptions that could potentially be raised by pickle.load(...)".

But I still think those underlying exceptions would probably only be raised when an issue/bug occurs OUTSIDE OF the pickle module, so that they even "surprise" the pickle module. For example, the EOFError that you managed to produce in a lab environment would probably NOT happen with normal concurrent pickle.dump() calls, because even pickle.dumps(None) (or pickle.dumps("") etc.) produces a non-empty outcome. In other words, when/if EOFError happens in real life, it is more likely caused by an issue in OUR application, such as our app incorrectly truncating the http cache file. In that case, we would want such an EOFError to bubble up, so that we become aware of it and go fix it in our application code base.

In short, I still think those "EOFError, AttributeError, ImportError, IndexError" do not strictly need to be caught.

Contributor

It seems EOFError, AttributeError, ImportError, IndexError still made their way into the final commit fcf34a2.

@rayluo rayluo force-pushed the http-cache-parameter branch from d9d0e38 to 62752ad Compare October 12, 2021 17:03
@rayluo rayluo merged commit fcf34a2 into dev Oct 15, 2021
@rayluo rayluo deleted the http-cache-parameter branch October 15, 2021 17:39
@rayluo rayluo mentioned this pull request Oct 29, 2021

Successfully merging this pull request may close these issues.

openid-configuration HTTP request slows down MSAL #334
Instance metadata caching #80